Enabling Tesseract For Ghostscript 9.53 and later

Ghostscript 9.53 contains preliminary support for OCR devices.

This support relies on the open-source Tesseract and Leptonica libraries. We do not currently ship Tesseract or Leptonica in the standard release build, because this is alpha code and we are still deciding on a distribution model. To enable OCR support, you will need to build your own version of Ghostscript with it included. This page gives step-by-step instructions.

Updated source post-release

The code as shipped in 9.53.3 has been found to have minor problems on some systems. As we identify and fix such problems, we will keep an updated branch in git, called ghostpdl-9.53.3-ocr-fixes. If you are experimenting with this code, we encourage you to work from this branch rather than from the initial release.

A snapshot of this can be found here.

Corresponding Tesseract and Leptonica archives can be found here and here respectively.

Windows Installer Binaries

For Windows users, both 32-bit and 64-bit pre-built binaries that include the OCR devices can be downloaded from:

Note: the above installers will overwrite an existing Ghostscript 9.53.3 installation.

Note further: these binaries are built from the original release, not from the ghostpdl-9.53.3-ocr-fixes branch.

Building on any platform

Step 1 – Fetch the Tesseract Source

By default, Ghostscript uses a slightly modified version of the Tesseract source, kept on the 'artifex' branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-tesseract.git;a=shortlog;h=refs/heads/artifex

The artifex branch is updated over time to track improvements. For the 9.53.3 release you want to use the artifex-9.53.3 tag.

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

If our server is overloaded, downloads from that location may fail. In that case, use the versions linked to above instead.

Download the snapshot and unpack it into a directory called 'tesseract' within the ghostpdl sources.

Step 2 – Fetch the Leptonica Source

By default, Ghostscript uses a slightly modified version of the Leptonica source, kept on the 'artifex' branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-leptonica.git;a=shortlog;h=refs/heads/artifex

The artifex branch is updated over time to track improvements. For the 9.53.3 release you want to use the artifex-9.53.3 tag.

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

If our server is overloaded, downloads from that location may fail. In that case, use the versions linked to above instead.

Download the snapshot and unpack it into a directory called 'leptonica' within the ghostpdl sources.
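The unpacking in Steps 1 and 2 can be sketched as a pair of shell helpers. This is only an illustration under stated assumptions: it assumes each snapshot is a tar archive with a single top-level directory, and the function names and archive paths are hypothetical, not part of Ghostscript.

```shell
#!/bin/sh
# Sketch only. Assumptions (not from the release notes): the snapshot is a
# tar archive containing one top-level directory; archive names are yours.

# Unpack a snapshot so its contents land directly in <ghostpdl>/<name>,
# where <name> is 'tesseract' or 'leptonica'.
unpack_snapshot() {
    archive=$1      # path to the downloaded snapshot (hypothetical name)
    ghostpdl=$2     # root of the ghostpdl source tree
    name=$3         # 'tesseract' or 'leptonica'
    mkdir -p "$ghostpdl/$name" &&
        tar -xf "$archive" -C "$ghostpdl/$name" --strip-components=1
}

# Report whether both third-party directories are in place.
check_ocr_sources() {
    ghostpdl=$1
    for name in tesseract leptonica; do
        if [ ! -d "$ghostpdl/$name" ]; then
            echo "missing: $ghostpdl/$name" >&2
            return 1
        fi
    done
    echo "ok"
}
```

For example, `unpack_snapshot tesseract-snapshot.tar.gz ghostpdl tesseract` followed by `check_ocr_sources ghostpdl` would confirm the layout before moving on, where the archive name is a placeholder for whatever file you downloaded.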

Step 3 – Fetch 'traineddata'

Tesseract relies on encapsulated knowledge so it can recognise particular languages and/or scripts. This knowledge comes in the form of 'traineddata' files. In order for Tesseract to work, it must have access to the appropriate 'traineddata' file for the selected language(s).

To complicate matters further, Tesseract can be built with different engines. These engines work in different ways, and hence need different information in the 'traineddata' file. It is therefore important to match your traineddata file to the build of Tesseract that you are using. Currently, by default, Ghostscript uses the "LSTM" engine (aka the 'modern' engine). The alternative is the 'legacy' engine. You can select the engine with the -dOCREngine= flag when you call Ghostscript. Details can be found in the Ghostscript documentation, and we will not cover them further here.

Traineddata files are created by training Tesseract on a range of inputs. This is an involved and painstaking process that we will not cover here.

Fortunately, ready-made traineddata files are available from various sources online.

By default, the Ghostscript OCR devices have OCRLanguage set to 'eng', thus the system will need 'eng.traineddata' in order to be able to run.

Now, you have a choice. You can either build your traineddata file(s) into the Ghostscript executable, or you can make them available on disc.

To build them into the executable, create a 'Tesseract' directory within the 'Resource' directory on disc (note the capitalisation!) and store your traineddata file(s) there before you rebuild.
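As a rough sketch of the built-in option, the helper below copies traineddata files into place. It assumes that 'Resource' is the directory of that name at the top of the ghostpdl source tree; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: 'Resource' is the directory at the top of the
# ghostpdl source tree. stage_traineddata is a hypothetical helper name.
stage_traineddata() {
    ghostpdl=$1; shift
    dest="$ghostpdl/Resource/Tesseract"   # capitalisation matters
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        case $f in
            *.traineddata) cp "$f" "$dest/" || return 1 ;;
            *) echo "skipping (not a traineddata file): $f" >&2 ;;
        esac
    done
}
```

A call such as `stage_traineddata ghostpdl eng.traineddata` would be run before the rebuild in Step 4.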

If you would rather make them available on disc, then either you can put them into the current directory when Ghostscript is run, or you can set the environment variable 'TESSDATA_PREFIX' to point to the directory in which they live.
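Before running Ghostscript, it can help to check that the traineddata file is where you expect it. The sketch below looks in the two on-disc locations mentioned above. The function name is hypothetical, and the order of the checks is a choice made for this sketch, not a statement of how Tesseract itself searches.

```shell
#!/bin/sh
# Sketch only: report where <lang>.traineddata can be found on disc.
# The check order (current directory, then TESSDATA_PREFIX) is this
# sketch's choice and is not taken from Tesseract's own lookup code.
locate_traineddata() {
    lang=${1:-eng}                 # 'eng' is the default OCRLanguage
    file="$lang.traineddata"
    if [ -f "./$file" ]; then
        echo "./$file"
    elif [ -n "${TESSDATA_PREFIX:-}" ] && [ -f "${TESSDATA_PREFIX%/}/$file" ]; then
        echo "${TESSDATA_PREFIX%/}/$file"
    else
        echo "no $file in the current directory or TESSDATA_PREFIX" >&2
        return 1
    fi
}
```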

With the 9.53.3 release source, in order to allow Tesseract language data to be read from TESSDATA_PREFIX, you need to also tell Ghostscript to permit file reading from this location. For example:

export TESSDATA_PREFIX=/my/tesseract/data/
gs --permit-file-read=/my/tesseract/data/ -sDEVICE=...

Note the trailing '/' on the paths. With the code from ghostpdl-9.53.3-ocr-fixes this requirement has been lifted.
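Because the environment variable and the --permit-file-read path need to agree, including the trailing '/', a small wrapper can keep them in step. This is a sketch under assumptions: the function names are hypothetical, and GS is a made-up override for the path to your Ghostscript executable.

```shell
#!/bin/sh
# Sketch only. with_trailing_slash and gs_with_tessdata are hypothetical
# helper names; GS is an assumed override for the gs executable path.

# Print the directory with exactly the trailing '/' the example relies on.
with_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# Run Ghostscript with TESSDATA_PREFIX and --permit-file-read pointing at
# the same directory; remaining arguments are passed through unchanged.
gs_with_tessdata() {
    dir=$(with_trailing_slash "$1"); shift
    TESSDATA_PREFIX="$dir" "${GS:-gs}" --permit-file-read="$dir" "$@"
}
```

For example, `gs_with_tessdata /my/tesseract/data -sDEVICE=pdfocr8 -o out.pdf in.pdf` would run the same command as above with both paths set consistently, where in.pdf is a placeholder input file.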

Step 4 – Rebuild Ghostscript

Do a full rebuild of Ghostscript.

On Windows, use the 'Rebuild' option from the MSVC solution.

On Unix, rerun the configure step if working from a release (or rerun autogen.sh if working from git). Then run make as usual.
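The Unix rebuild can be sketched as follows. The helper names are hypothetical, and using the presence of a .git entry to tell a git checkout from a release tree is an assumption made for this sketch.

```shell
#!/bin/sh
# Sketch only. Assumption: a '.git' entry in the source root means the
# tree is a git checkout; otherwise it is treated as a release tree.
configure_step() {
    src=${1:-.}
    if [ -e "$src/.git" ]; then
        echo "./autogen.sh"     # working from git
    else
        echo "./configure"      # working from a release
    fi
}

# Rerun the chosen step, then make, inside the source tree.
rebuild_ghostscript() {
    src=${1:-.}
    step=$(configure_step "$src") || return 1
    ( cd "$src" && "$step" && make )
}
```

Running `rebuild_ghostscript ghostpdl` would then pick the appropriate step for that tree and build it.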

This should leave you with a working copy of Ghostscript that supports Tesseract.

Step 5 – Run a Test

On Windows, run:

bin\gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r600 -dDownScaleFactor=3 zlib/zlib.3.pdf

On Unix, run:

bin/gs -sDEVICE=pdfocr8 -o out.pdf -r600 -dDownScaleFactor=3 zlib/zlib.3.pdf

If all is well, this creates out.pdf, containing the pages of zlib/zlib.3.pdf rendered and OCR'd.
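A quick sanity check on the result can be scripted. The sketch below only confirms that the output file exists, is not empty, and starts with a PDF header. It does not confirm that any text was recognised, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: a minimal check that the test run produced a PDF file.
# It does not verify the OCR result itself.
check_pdf_output() {
    out=${1:-out.pdf}
    [ -s "$out" ] || { echo "missing or empty: $out" >&2; return 1; }
    [ "$(head -c 4 "$out")" = "%PDF" ] || { echo "not a PDF: $out" >&2; return 1; }
    echo "ok: $out"
}
```

To check the OCR itself, open out.pdf in a viewer and try searching for or selecting text on the page.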

Give us your feedback!

Please let us know how this works for you. The future of these devices will depend on the feedback we get, so tell us what they do well, what they do badly, and what they don't do but really should. Feedback can be sent to support@artifex.com.