OCR with Tesseract

OCR stands for optical character recognition. We want to use Tesseract to extract readable text from a scanned paper letter that is stored as a PDF file.

cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

The Erick Peirson tutorial does most of the work.

yum install ImageMagick

whereis convert
convert: /usr/bin/convert /usr/share/man/man1/convert.1.gz

Installing Tesseract on CentOS is not trivial, but fortunately EisenVault has a working procedure. The procedure is executed in the /opt directory as the root user.

We are interested in Dutch-language OCR, so we download the Dutch trained data:

wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/nld.traineddata
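
Before running OCR it can help to confirm that Tesseract actually sees the Dutch data. The sketch below assumes the installed build supports the --list-langs option and that the language list may be printed on stderr, which is why both streams are merged. The function name has_tess_lang is a hypothetical helper, not part of Tesseract.

```shell
# Return success if the given language code appears in Tesseract's list.
# has_tess_lang is a hypothetical helper name used for illustration only.
has_tess_lang() {
    # Merge stderr into stdout because some versions print the list on stderr.
    # grep -x matches the whole line, so "nld" does not match "nld_extra".
    tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

Usage: has_tess_lang nld && echo "Dutch data found"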

With that in place, a test on the file test.pdf goes as follows:

convert -density 300 test.pdf -depth 8 -strip \
  -background white -alpha off test.tiff

tesseract test.tiff test -l nld

Tesseract appends .txt to the output base name, so this writes test.txt.
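
The two steps can be wrapped in one small shell function. This is only a sketch: ocr_pdf is a hypothetical helper name, the language defaults to nld, and the input is assumed to end in .pdf. The convert options are the ones used above.

```shell
# Convert a PDF to TIFF and run Tesseract on it.
# ocr_pdf is a hypothetical helper, not part of ImageMagick or Tesseract.
ocr_pdf() {
    pdf="$1"                # input file, e.g. test.pdf
    lang="${2:-nld}"        # language code, defaults to Dutch
    base="${pdf%.pdf}"      # strip the .pdf extension, e.g. test

    # Step 1: render the PDF as a 300 dpi, 8-bit TIFF without alpha channel.
    convert -density 300 "$pdf" -depth 8 -strip \
        -background white -alpha off "$base.tiff" || return 1

    # Step 2: OCR the TIFF; tesseract adds the .txt extension itself.
    tesseract "$base.tiff" "$base" -l "$lang" || return 1

    # Print the name of the text file that tesseract wrote.
    echo "$base.txt"
}
```

Usage: ocr_pdf test.pdf prints test.txt on success. A second argument selects another language, for example ocr_pdf test.pdf eng.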

1612  test.txt
307996  test.pdf
26138446  test.tiff

First impression, based on a 20-line text: mostly flawless, except for the diacritics.