Performance with NEW Ghostscript PDF Interpreter

Implementing a new PDF interpreter was far from an easy decision for us, given that we had an apparently functional implementation already. It is, therefore, understandable that many people asked us about our reasons for it, and we endeavoured to answer those questions here:
https://ghostscript.com/blog/pdfi.html

We were also aware that there was at least a theoretical performance benefit available, by removing the need for the core interpreter and memory management to cope with the far more demanding needs of PostScript. Equally, eliminating the need for much of the file structure code to be run in an interpreted language (PostScript) would yield benefits.

Initially, our efforts were largely focused on producing a robust, specification-compliant interpreter and on its integration with the graphics back end, rather than directly on the speed of the implementation.

In general, though, interpretation time gets swamped by rendering time at any print-relevant resolution so, given all those factors, we were wary about making claims about performance.

There were, however, some foreseen and some unforeseen performance benefits that came with the new implementation.

Compiled vs Interpreted

The most obvious source of performance gain is that PostScript is an interpreted language whilst C is a compiled language. Compiled code is generally faster, and has the advantage of the optimisations implemented within the compiler.

PostScript and Ghostscript have ways to improve the efficiency of the interpretation (in PostScript, see the "bind" operator), but those will never be sufficient to overcome the overhead imposed by interpretation.
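As a rough illustration of what "bind" buys, here is a minimal Python sketch of a toy stack interpreter. It is not PostScript or Ghostscript code: systemdict, execute and bind are hypothetical stand-ins. The unbound procedure looks each operator name up every time it runs, while the bound one has had its names replaced by the operators themselves. Either way, the interpreter loop still has to dispatch every token, which is the overhead that remains.

```python
import operator

# Hypothetical operator dictionary for a toy stack language.
systemdict = {"add": operator.add, "mul": operator.mul}

def execute(proc, stack):
    """Run a procedure (a list of tokens) against an operand stack."""
    for token in proc:
        if isinstance(token, str):
            token = systemdict[token]      # name lookup on every execution
        if callable(token):
            b = stack.pop()
            a = stack.pop()
            stack.append(token(a, b))      # binary operator
        else:
            stack.append(token)            # literal operand
    return stack

def bind(proc):
    """Replace names that resolve to operators with the operators themselves.
    Names that do not resolve are left unchanged."""
    return [systemdict.get(t, t) if isinstance(t, str) else t for t in proc]

proc = [2, 3, "add", 4, "mul"]   # computes (2 + 3) * 4
bound = bind(proc)

unbound_result = execute(proc, [])
bound_result = execute(bound, [])
```

Both runs leave the same value on the stack; the bound version simply skips the per-name dictionary lookups.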

Fonts

Possibly the most significant benefit for "real world" performance relates to fonts and font management.

The PDF interpreter implemented in PostScript had to rely on the PostScript name-based font loading and management machinery, which has become less and less able to cope with how PDF creators now expect font management to work.

It is relatively common to see, for example, pages in a PDF referencing different subsets of the same font, which share the same name. Thus we cannot rely on the font name to ascertain if the correct font has already been loaded. In PostScript, our only somewhat reliable solution was to surround each page description with a PostScript save/restore operation, discarding any fonts (and many other objects) defined during the page and any cached glyphs associated with those fonts, starting the next page with a "clean slate".

That approach has two problems. The first is that it is wasteful for "well-formed" files that correctly reuse font objects across more than one page. The second problem is that it is an incomplete solution: it works between pages but does not work, for example, where multiple PDF forms on the same page use incompatible font objects with the same font name.

Our C implementation allows the interpreter to directly associate internal objects with the PDF objects from which they derive. In the case of fonts, it means that rather than relying on the font name to know whether a font has already been loaded, we can use the PDF object number (and generation).

This means we no longer need to discard the fonts and cached glyphs and reload them as we interpret subsequent page contents. Obviously, removing the need to reload fonts is a useful benefit, but even more so, having the ability to reuse cached glyphs reaps significant performance gains in text-heavy PDF files.
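The idea of keying loaded fonts on the PDF object rather than on the font name can be sketched as follows. This is a minimal Python illustration under assumed names, not the interpreter's actual C data structures: FontRef, FontCache, the object numbers and the font name are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FontRef:
    """Hypothetical handle for an indirect PDF font object."""
    obj_num: int
    generation: int
    base_name: str   # may be shared by several different font objects

class FontCache:
    """Cache keyed on (object number, generation), never on the name."""
    def __init__(self):
        self._fonts = {}
        self.loads = 0

    def get(self, ref):
        key = (ref.obj_num, ref.generation)
        if key not in self._fonts:
            self.loads += 1   # stands in for the expensive font load
            self._fonts[key] = f"loaded:{ref.base_name}#{ref.obj_num}"
        return self._fonts[key]

# Two different subsets that share one name, used across three pages.
subset_a = FontRef(12, 0, "ABCDEF+Helvetica")
subset_b = FontRef(57, 0, "ABCDEF+Helvetica")
pages = [[subset_a], [subset_a, subset_b], [subset_b]]

font_cache = FontCache()
resolved = [[font_cache.get(ref) for ref in page] for page in pages]
```

In this sketch only two loads happen for the three pages, and the two subsets are never confused even though their names are identical. A name-keyed cache would either mix them up or have to be emptied between pages.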

ICC Profiles

A further, but smaller, benefit we've gained is in how we cache ICC profiles. The gain is smaller largely because ICC profiles are cached internally by the graphics library. But, once again, the closer integration between the new PDF implementation and the graphics library means we can more tightly tie the ICC cache to the objects from the input file which, in turn, means the caching is more efficient.

Patterns

A slightly more esoteric benefit, both in how it arises and in the types of PDF file content where it is seen, comes from Type 1 patterns - that is, patterns whose content is defined by sequences of PDF marking operations.

Internally, we represent the content of a Type 1 pattern as a “tile” which can be either a bitmap or a display list, depending on the dimensions of the individual tile. This means that we only have to interpret the PDF content stream once and, as the pattern is repeated across the drawing area, we reuse that internal tile, making the process much more efficient.

But from the PostScript world, we have no direct access to that tile, so once a given drawing operation is completed with a pattern, its tile is discarded and any subsequent use of the pattern requires the tile to be created again.

With an interpreter implemented in C, we integrate more closely with the graphics library back end, meaning we can more closely tie the PDF interpreter pattern object to the pattern tile created by the graphics library. This means we can cache the pattern and its tile, and avoid repeated recreation of the tile.
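A minimal Python sketch of that caching idea follows. It is an illustration only, not the graphics library's API: PatternTileCache, fake_render and the pattern key are hypothetical, and the "tile" is just a placeholder value.

```python
class PatternTileCache:
    """Keep the tile built for a pattern so later uses can share it."""
    def __init__(self, render_tile):
        self._render_tile = render_tile   # expensive: interprets the content stream
        self._tiles = {}

    def tile_for(self, pattern_key, content_stream):
        if pattern_key not in self._tiles:
            self._tiles[pattern_key] = self._render_tile(content_stream)
        return self._tiles[pattern_key]

render_calls = []

def fake_render(stream):
    """Stand-in for building a bitmap or display-list tile."""
    render_calls.append(stream)
    return ("tile", stream)

tile_cache = PatternTileCache(fake_render)

# A thousand separate drawing operations that all use the same pattern object.
for _ in range(1000):
    tile = tile_cache.tile_for((31, 0), "0 0 10 10 re f")
```

Here the content stream is interpreted once for the thousand uses. Without the cache, each drawing operation would rebuild the tile, which is the behaviour described above for the PostScript implementation.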

PDF files that see significant benefits from this are relatively rare (compared to those with text!), but for those that do, the benefits are very significant.

Structurally broken but recoverable PDF files

One of the major reasons that the PostScript implementation was becoming unmaintainable is the sheer volume and range of out-of-spec PDF files that users expect interpreters to cope with.

Scanning a PDF file to rebuild an xref table is a process that can be intensive in terms of file accesses and of string manipulation and comparisons, neither of which is easy or particularly quick in PostScript.

The new code allows us to be both faster and more flexible in our file repairs.
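To give a feel for the kind of scanning involved, here is a minimal Python sketch that looks for "N G obj" headers in the raw bytes and records their offsets. It is a simplification under stated assumptions, not the repair code in Ghostscript: rebuild_xref and OBJ_HEADER are hypothetical names, and a real repair pass also has to deal with stream data that merely looks like an object header, object streams, trailers and more.

```python
import re

# An object header is "<number> <generation> obj", e.g. "12 0 obj".
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")

def rebuild_xref(data: bytes) -> dict:
    """Map (object number, generation) to the byte offset of its header."""
    xref = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        xref[key] = match.start()   # a later definition replaces an earlier one
    return xref

sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n"
)
table = rebuild_xref(sample)
```

Even this toy version touches every byte of the file and performs many pattern comparisons, which hints at why the same job is slow when written in an interpreted language.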

Ironically, the extra flexibility in the repair code results in a few files in our internal test suite appearing to be considerably slower, because we now find and render more content - not surprisingly, a dozen or so empty pages take considerably less time than the same number of pages filled with content!

Broken PDF Content Streams

PostScript is not a tolerant language. In the event of an error, PostScript's normal behaviour is to abort interpretation, signal an error, and flush the remainder of the job. And that behaviour makes a great deal of sense in a language targeted at the professional printing arena - generally, you wouldn’t want to pay for 5000 copies of a document that was almost right.

The real world of PDF is rather different, however. Whilst PDF does play a significant role in professional print, the vast majority of PDF use is for information exchange, and viewing on screen. In this application, being able to view something… anything, even from a broken file, is preferable to aborting the job.

PostScript has an operator called "stopped", which allows the job to catch errors before the default PostScript error handling takes over.

We've used "stopped" fairly extensively in the PostScript implementation of PDF but (as with the save/restore for fonts) it's a solution that comes with problems.

Firstly, it's an incomplete solution; interpreting a content stream in a "stopped context" allows us to trap errors and opt to carry on, but it also makes almost no guarantees about the interpreter state (especially the operand and dictionary stacks) when an error is trapped. Thus, depending on the nature of the erroneous content, it can trigger later errors or incorrect content. In order to ameliorate this, it is often necessary to store various pieces of interpreter state information (primarily the stack depths) so we stand a better chance of recovery from error.

Secondly, "stopped" involves putting the interpreter into a different state for interpreting the relevant content, trapping and recording any error that occurs, and then changing the interpreter state back to "normal". This all adds up to a relatively time-consuming set of operations.

Thirdly, trying to strike a balance between coping with breakages and remaining acceptably performant limited how fine-grained our error tolerance could be. Putting every PDF operator into a stopped context would have been unacceptably slow so, in general, we worked at the content stream level: an error in a content stream aborts that stream, but not the rest of the content.

Lastly, it is important to remember that, although this is to cope with broken content streams in PDF files, this affects all PDF files, well-formed as well as out-of-spec.

Nevertheless, having to cope with more and more widely seen out-of-spec PDF files has led to increasing use of "stopped" in the PostScript code. These uses have added up to a significant performance penalty by this point.
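The pattern of recording the stack depth before entering a guarded context can be sketched in Python as below. This is only loosely modelled on "stopped" and is not PostScript: stopped, run_with_recovery and broken_content are hypothetical names, and the operand stack is a plain list.

```python
def stopped(proc, stack):
    """Run proc; return True if it raised an error, False otherwise."""
    try:
        proc(stack)
        return False
    except Exception:
        return True

def run_with_recovery(proc, stack):
    saved_depth = len(stack)       # record state before the guarded run
    if stopped(proc, stack):
        del stack[saved_depth:]    # drop operands the failed content left behind
        return True
    return False

def broken_content(stack):
    stack.append(1)
    stack.append(2)
    raise RuntimeError("simulated error part-way through a content stream")

operands = ["existing"]
failed = run_with_recovery(broken_content, operands)
```

Saving the depth only helps when the failed content left extra operands behind. If it had consumed operands that were already on the stack, they cannot be recovered this way, which is one reason the approach is an incomplete solution.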

Implementing in C frees us from coping with the PostScript error handling and allows us greater control of the interpreter state. This means we can recover more gracefully from errors, and work at a much finer grain - we rarely end up aborting an entire content stream due to an error.

Additionally, the entire interpreter works more efficiently, because we don't have the overheads imposed upon us by PostScript's draconian error handling.
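The difference in granularity can be illustrated with a small Python sketch. It is a toy model rather than either interpreter: operators are plain strings, "BAD" stands for a malformed operator, and all the function names are hypothetical.

```python
def run_operator(op, drawn):
    """Pretend to execute one operator; 'BAD' simulates broken content."""
    if op == "BAD":
        raise ValueError("malformed operator")
    drawn.append(op)

def recover_per_stream(streams):
    """Coarse recovery: an error abandons the rest of that content stream."""
    drawn = []
    for ops in streams:
        try:
            for op in ops:
                run_operator(op, drawn)
        except ValueError:
            pass   # remaining operators in this stream are lost
    return drawn

def recover_per_operator(streams):
    """Fine recovery: an error skips only the offending operator."""
    drawn = []
    for ops in streams:
        for op in ops:
            try:
                run_operator(op, drawn)
            except ValueError:
                pass   # carry on with the next operator
    return drawn

streams = [["m", "l", "BAD", "S"], ["re", "f"]]
coarse = recover_per_stream(streams)
fine = recover_per_operator(streams)
```

With stream-level recovery the "S" after the bad operator is never executed; with operator-level recovery it is, so more of the page content survives.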

Some Examples

As mentioned above, our focus for the initial implementation was function, rather than speed. Since the first release of the new PDF interpreter, we have revisited some aspects of performance. What follows includes the fruits of that further work. The testing configuration used for these is fairly low resolution, the idea being to highlight the improvement in the interpretation. Thus the following configuration was used, with -dNEWPDF set to false or true for each run as shown in the comparisons below:

-dNEWPDF= -q -dMaxBitmap=3g -sDEVICE=ppmraw -o /dev/null -r150

The testing environment isn't really relevant, since the following numbers are just comparisons to give an idea about the benefit of the new implementation, and not intended as any kind of benchmarks.

Text heavy "office" type documents are likely to be representative of most PDF files "in the wild", thus the PDF and PostScript reference manuals are good examples for that use.

File                                             -dNEWPDF=false    -dNEWPDF=true
"PDF Reference sixth edition"                    22.188 seconds    5.583 seconds
"PostScript Language Reference third edition"    8.862 seconds     2.663 seconds

From the group of widely used test cases, as used in our internal performance testing:

File                 -dNEWPDF=false    -dNEWPDF=true
"j12_acrobat.pdf"    3.365 seconds     1.085 seconds

Finally, a special case which gave rise to the improvements in Type 1 pattern handling. A PDF that could have been designed to cause problems in an implementation such as the PostScript one:

Example file from https://bugs.ghostscript.com/show_bug.cgi?id=704236:

-dNEWPDF=false          -dNEWPDF=true
2 minutes 14 seconds    8 seconds

Even more significantly, the same test file, dropping the "-dMaxBitmap=3g" command line option:

Example file from https://bugs.ghostscript.com/show_bug.cgi?id=704236:

-dNEWPDF=false    -dNEWPDF=true
>35 minutes       1 minute 45 seconds

Given those comparisons, clearly there are significant gains with the PDF interpreter implemented in C.
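To put the timings above in terms of speed-up factors, the following short Python snippet simply divides the old time by the new time for each comparison. The numbers are copied from the comparisons above, converted to seconds; the labels are ours, and the last entry uses 35 minutes as a lower bound for the old time.

```python
# (label, seconds with -dNEWPDF=false, seconds with -dNEWPDF=true)
timings = [
    ("PDF Reference sixth edition", 22.188, 5.583),
    ("PostScript Language Reference third edition", 8.862, 2.663),
    ("j12_acrobat.pdf", 3.365, 1.085),
    ("Bug 704236 file, with -dMaxBitmap=3g", 2 * 60 + 14, 8),
    ("Bug 704236 file, without -dMaxBitmap=3g", 35 * 60, 1 * 60 + 45),
]

speedups = {label: old / new for label, old, new in timings}

for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x")
```

That works out at roughly 4x, 3.3x and 3.1x for the first three files, about 17x for the pattern-heavy file, and at least 20x for the same file without "-dMaxBitmap=3g".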