[fitz-dev] Ideas for name contexts (atom repositories)

Raph Levien raph.levien at artifex.com
Wed Feb 19 23:19:33 PST 2003


All,

   Tor and I have been discussing the Fitz runtime design on IRC
recently. I wanted to move at least some of the discussion to the
mailing list so more people would get a chance to participate.

   It seems like first real chunk of code in the Fitz universe will be
a filter library, for processing PostScript and PDF streams. It will
include all the standard compression filters and so on. Over time,
we'll migrate Ghostscript over to it.

   One of the problems with a general filter library is how to select
the filter name, and also how to pass in the filter parameters. A big
enum for the names, and lots of C structures for the parameter blocks,
would be one way to do it. But this has the problem of being somewhat
painful to extend. A better approach would be to have runtime support
for dynamically typed objects similar to PDF objects. The major types
include numbers, strings, lists, dicts (hash tables), and names. Most
of these are straightforward, but names are posing a challenge.

   For review, Ghostscript currently uses a global name table (it's
defined as extern name_table *the_gs_name_table in igc.c). In the
default configuration, it's limited to 64k names, and interned names
are 16 bit indices into the name table (things get a little more
complex when EXTEND_NAMES is configured to be non-zero, but I don't
know of anyone who does this). The name table is global, but it is
subject to garbage collection. Thus, the same name might have two
different indices at two different times, if there's an intervening
GC when it's not in use.

   This is actually a reasonable design, but exporting the GC logic
across an API seems too painful. So we're looking at a number of
alternate designs.

1. Intern-based

1a. Global intern with no reuse. If you intern all the names in PDF
files, it has the problem that it will tend to leak memory if you
have a long-running instance that processes a sequence of files. Not
acceptable in printer environments.

1b. Global intern with refcounting. At least it doesn't leak, but
keeping to the refcount discipline seems a bit painful. Also, you'll
have a fair amount of memory traffic to keep the refcounts updated.

1c. Name context with explicitly-managed lifetime. In particular,
one can imagine creating a new name context for each input file. It
does make life difficult if you need to have names persist across
file lifetimes.

2. String-based

2a. Names are alloc'ed strings. On freeing a dynamically-typed object
containing names, free all contained strings, especially dict keys.
This potentially chews up a significant amount of memory to store all
the copies, and also requires the equivalent of strdup every time you
want to pass a name around.

2b. Names are refcnt'd strings. Still potentially a lot of memory
use and copying bandwidth, in addition to the same refcnt overhead as
1b.

2c. Names are caller-allocated C strings. Upon return, the caller has
responsibility of freeing the memory. An advantage is that constant
strings can just be plain C string constants. Not particularly helpful
for token-parsing code, where the ownership has to be passed from the
parser to the caller. If the caller wants to hold a reference to a
dynamic object, it must deep-copy all contained names therein.

2d. Names used as dict keys can be caller-allocated or alloc'ed, and
a flag in the dict distinguishes.

3. Dicts store keys directly

Consider the following design for a dict. A dict contains a refcnt
header, n, n_max, and a pointer to an allocation of n_max dict
entries. Each dict entry is 16 bytes, and is laid out as follows.
The first byte is the length of the key (or 255 if big). The
following 11 bytes are the first 11 bytes of the key, padded with
zeros. Finally, 4 bytes store the pointer to the value. (on 64
bit architectures, an entry will probably be 24 bytes).

The keys are sorted in lexicographical order with the length prefix;
i.e. all 2-byte names come before the 3-byte names. The main dict
lookup routine first determines the length of the key, binary
searches, then compares the rest of the key. Probably the most
efficient implementation will do the initial binary search based
on the initital 4 bytes of the dict entry.

For keys longer than 11 bytes, the excess is stored somewhere in
the dict, details to be determined. Since the first step is to
compute the length of the key, it's easy to switch on whether you
need to look at the excess or not.

Names other than dict keys are treated the same as other dynamic
objects, which is to say probably refcnt'ed. This design basically
assumes that the names you worry the most about allocating are
dict keys.

I've sketched the implementation, but the user-visible interface
is essentially that the dictionary owns the strings for the keys.
Lookup and insert methods take a C string as an argument, probably
with a variant that takes a (length, data) pair, as that's both
more consistent with the GS runtime and also more convenient when
you're parsing.

What do people think? Is the extra implementation complexity worth
not having to worry about the lifetime of the atom repository? The
fact that none of the complexity is exported to the interface gives
me a warm and fuzzy feeling. Is there some other alternative I've
missed?

Raph




More information about the fitz-dev mailing list