The NEW Ghostscript PDF Interpreter
Update to the original post – March 4, 2022
The new PDF Interpreter is now the default!
We are happy to announce that the new PDF Interpreter code is feature complete and is now enabled by default in Ghostscript 9.56.1. The old PDF interpreter can still be accessed as a fallback by specifying -dNEWPDF=false. We’ve provided this so that users who encounter issues with the new interpreter can keep working while we iron out those issues; the option will not be available in the long term.
The new interpreter also allows us to offer a new executable (gpdf, or gpdfwin??.exe on Windows) which is purely for PDF input. For this release, those new binaries are not included in the “install” make targets, nor in the Windows installers.
Rewritten entirely in C, the new implementation delivers a standalone PDF interpreter that is faster and more secure than its PostScript-based predecessor.
Read on for answers to frequently asked questions, and highlights of the changes.
Why the Change?
The original PDF interpreter, as previously supplied with Ghostscript, is written in PostScript. When the original implementation was done this made good sense; the graphics models of PostScript and PDF were compatible and the PDF syntax is (or at least was) broadly similar to PostScript. Indeed, that original PDF interpreter has served us well for decades.
However, there are problems, mainly invisible to our users but nevertheless still present. PostScript has been described, with some justification, as a ‘write-only’ language. It is also now an elderly language and a rare skill among developers, which makes it quite hard to recruit new engineers with PostScript programming skills. Not all of the Artifex development team are experienced PostScript programmers and, even for those of us skilled in the language, the PDF interpreter code is now so large and arcane that it is difficult to fully understand some aspects of the PostScript program which performs the PDF interpretation.
In addition, the PDF specification has continued to evolve, whereas the PostScript language has remained static. PDF has added features like transparency, which have no equivalent in PostScript, and the only way for us to support these has been to add special, often undocumented, PostScript extensions. These extensions have proven to be a security problem in the past and we would like to remove our PDF interpreter’s dependence on them.
It has also become increasingly evident that many PDF producers do not create PDF files that conform to the specification. Since there is no means to ‘verify’ that a PDF file conforms, creators fall back on using Adobe Acrobat, the de facto standard. If Acrobat will open the file then it must be OK! Sadly, it turns out that Acrobat is really very tolerant of badly formed PDF files and will always attempt to open them. Often it silently repairs the file in the background; the first time an alert user would be aware of this is when Acrobat offers to ‘save changes’ to a file the user has not modified. Frequently Acrobat doesn’t even do that.
Because Acrobat will open these files, there is considerable pressure for Ghostscript to do so as well, though we do try to at least flag warnings to the user when something is found to be incorrect, giving the user a chance to intervene.
But Ghostscript’s PDF interpreter was, as noted, written in PostScript, and PostScript is not a great language for handling error conditions and recovering. In general, when something goes wrong in a PostScript program the expectation is that the PostScript interpreter will generate an error message and stop. It is possible to do better, but it is not trivial. As time has gone on, and we have encountered more and more PDF files with ever more unexpected deviations from the specification, it has become harder and harder to come up with new strategies to work around these faults without re-introducing previously fixed problems or failing to process compliant files. It is also true that many of these workarounds have led to decreased performance when processing all PDF files, not just the malformed ones.
Finally, because the PDF interpreter was written in PostScript, there was no way to divorce it from Ghostscript and its PostScript interpreter. This had performance implications (starting up a PostScript interpreter is quite a complex process) and imposed a resource overhead in that we needed both the PostScript interpreter and a complex PostScript program before we even started to interpret the PDF file. Using the PostScript interpreter also exposed us to potential security issues due to the use of non-standard PostScript extensions. There was also the possibility of being forced to run PostScript XObjects (long since deprecated) in a PDF file, which potentially opened up some security problems because such a program was run in the PDF environment, which is less protected than regular PostScript.
The new PDF interpreter is written entirely in C, but interfaces to the same underlying graphics library as the existing PostScript interpreter. So operations in PDF should render the same as they always have, subject to slight differences in numerical accuracy. All the same devices that are currently supported by the Ghostscript family, and any new ones in the future, should work seamlessly.
Because the interpreter no longer relies on PostScript, however, it can be divorced from it. It is now possible to create a stand-alone PDF interpreter, GhostPDF, and it is integrated as a separate module in the language-switching product GhostPDL.
This offers us some advantages in that the memory footprint is smaller, and the startup time of the stand-alone PDF interpreter is less than that of the PostScript interpreter.
That said, we do recognise that people are used to being able to process PDF files through Ghostscript, and indeed over the years we have offered customers and free users a wide range of solutions which were based on the fact that the PDF interpreter was written in PostScript, and its behaviour could be controlled or influenced from the PostScript environment.
So one of the goals of this project was to enable the C PDF interpreter to be integrated into the PostScript environment in such a way that PostScript can be used to influence the graphics state of the PDF interpreter, and PostScript functionality like BeginPage and EndPage continues to function with it. And, not forgetting that initial point: Ghostscript today can process PDF files, and our users will expect that ability to continue. We’ll set out some of the means for that below.
Using the New Code
If you are using Ghostscript, the new PDF interpreter is enabled by default. As a fallback, use -dNEWPDF=false to return to the old interpreter. Explicitly setting NEWPDF to true or false makes it clearer what is required.
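As an illustration of setting NEWPDF explicitly, here is a minimal Python sketch that builds a Ghostscript command line. The helper name build_gs_command, the executable name gs, the pdfwrite device and the file names are assumptions made for the example; only -dNEWPDF comes from the text above, while -sDEVICE and -o are standard Ghostscript switches.

```python
import subprocess


def build_gs_command(input_pdf, output_path, device="pdfwrite",
                     new_pdf=True, executable="gs"):
    """Return an argument list with the PDF interpreter chosen explicitly.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    flag = "true" if new_pdf else "false"
    return [
        executable,
        f"-dNEWPDF={flag}",    # explicit, so the intended interpreter is visible
        f"-sDEVICE={device}",  # output device
        "-o", output_path,     # output file; -o also implies -dBATCH -dNOPAUSE
        input_pdf,
    ]


def run_gs(input_pdf, output_path, new_pdf=True):
    """Run Ghostscript and return the completed process (requires gs on PATH)."""
    cmd = build_gs_command(input_pdf, output_path, new_pdf=new_pdf)
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    # Show the fallback command line without executing it.
    print(" ".join(build_gs_command("in.pdf", "out.pdf", new_pdf=False)))
```

Keeping the switch in one place like this should make it easy to drop the fallback once -dNEWPDF=false is no longer available.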
Command line switches should work in both cases the same as they do in Ghostscript right now. Please note that the gpdf executable does not permit you to use the pdfmark operator (or otherwise send arbitrary PostScript to the interpreter using the -c switch). The pdfmark operator is a PostScript operator and therefore requires you to use the PostScript interpreter.
Obviously, the gpdf interpreter will not execute PostScript XObjects embedded in PDF files, for the same reason.
Using the PDF Interpreter From PostScript
The new code has been integrated in the same way as the old PDF interpreter; if all you want to do is process a PDF file then simply putting the file on the Ghostscript command line is sufficient. Also, the definition of the PostScript ‘run’ operator works with the new PDF interpreter, so you can still use code such as ‘(/home/myfile.pdf) run’.
This is covered in https://ghostscript.com/doc/9.56.1/Language.htm#PDF_scripting
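The ‘run’ approach can also be driven from a script. The following Python sketch builds a command line that passes the PostScript snippet through the -c switch. The helper names, the png16m device and the output file name are assumptions for illustration. The --permit-file-read switch is assumed to be needed because file access is restricted under the default SAFER mode; check the documentation for your version.

```python
def ps_string(text):
    """Quote text as a PostScript string literal.

    Backslashes and parentheses are escaped so that any path is safe
    inside the (...) delimiters.
    """
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return f"({escaped})"


def build_run_command(pdf_path, output_path, device="png16m", executable="gs"):
    """Return an argument list that processes pdf_path via the 'run' operator.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    snippet = f"{ps_string(pdf_path)} run"
    return [
        executable,
        f"--permit-file-read={pdf_path}",  # assumption: required under SAFER
        f"-sDEVICE={device}",
        "-o", output_path,
        "-c", snippet,                     # PostScript passed on the command line
    ]


if __name__ == "__main__":
    for arg in build_run_command("/home/myfile.pdf", "page-%03d.png"):
        print(arg)
```

This only applies to the Ghostscript executable; as noted above, the gpdf executable does not accept PostScript through -c.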
Last revised: 04 March 2022