[Gs-devel] Need a help with CJK encodings

mpsuzuki@hiroshima-u.ac.jp mpsuzuki@hiroshima-u.ac.jp
Tue, 04 Dec 2001 15:54:53 +0900


Dear Mr. Melichev,

>Adobe defines the following names of CID glyph collections
>(the Ordering key of CIDSystemInfo) :
>
>Japan1
>Japan2
>GB1
>Korea1
>CNS1
>(Do you know more ones ?)

If you have the disk space to download the GS_6_5 branch,
please check CJKTTCID.htm, written by Taiji Yamada.

In addition to the official, commonly used CMaps (for CNS1, GB1,
Japan1, Japan2, Korea1), Adobe distributes some unclassifiable
"RKSJ" CMaps:

ftp://ftp.oreilly.com/pub/examples/nutshell/ujip/adobe/rksj-cmaps.tar.Z

From the CIDSystemInfo in these CMaps, it appears that Adobe
has additional character collections (registry-ordering pairs):

        Adobe-CNS2
        Adobe-HongKong1
        Adobe-Korea2
        Adobe-Vietnam1
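
Each of these names is a Registry and an Ordering joined by a hyphen,
matching the Registry and Ordering keys of CIDSystemInfo. As a minimal
sketch of taking such a name apart (split_collection is a hypothetical
helper of mine, not a Ghostscript or Adobe function):

```c
#include <stddef.h>
#include <string.h>

/* Split a "Registry-Ordering" name such as "Adobe-Vietnam1" at its
 * first hyphen. Returns 0 on success, -1 if there is no hyphen, if
 * either part is empty, or if an output buffer is too small.
 * Hypothetical helper for illustration only. */
int split_collection(const char *name,
                     char *registry, size_t rsize,
                     char *ordering, size_t osize)
{
    const char *dash = strchr(name, '-');
    size_t rlen, olen;

    if (dash == NULL)
        return -1;
    rlen = (size_t)(dash - name);
    olen = strlen(dash + 1);
    if (rlen == 0 || olen == 0 || rlen >= rsize || olen >= osize)
        return -1;
    memcpy(registry, name, rlen);
    registry[rlen] = '\0';
    memcpy(ordering, dash + 1, olen + 1);   /* copies the terminator too */
    return 0;
}
```

For "Adobe-Vietnam1" this yields the registry "Adobe" and the ordering
"Vietnam1"; a bare "Japan1" is rejected because it has no registry part.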

However, no Technical Note has been published to identify
which CID is assigned to which glyph. We have no PS/PDF
documents that use these character collections, nor any
information about CID fonts for them. Except for
Adobe-Vietnam1, I don't know whether they are under
development or were abandoned before publication.

Adobe was deeply involved in standardizing the Hanzi
encoding used in Vietnam (Chu-Han), and Adobe-Vietnam1 appears
to use the same mapping as the Vietnamese standard for Chu-Han.
Ken Lunde's "CJKV" (p. 293) describes Adobe-Vietnam1 as
"under development".
It might also be possible to emulate Adobe-Vietnam1 by
re-ordering a Taiwanese Hanzi font. In fact, the Chu-Han glyph
table in "CJKV" is printed with a Taiwanese font.


>#define JIS                   0xFFFF   /* JIS character encoding */
>#define SHIFT_JIS             0xFFFE   /* Shift-JIS character encoding */
>#define EUC                   0xFFFD   /* EUC character encoding */
>#define UNICODE               0xFFFC   /* UNICODE character encoding */
>#define BIG5                  0xFFFB   /* BIG5 character encoding */
>#define TCA                   0xFFFA   /* TCA character encoding */
>#define GB                    0xFFF9   /* GB character encoding */
>#define KSC                   0xFFF8   /* K character encoding */
>#define WANSUNG               0xFFF4   /* WANSUNG character encoding - KOREAN */
>#define JOHAB                 0xFFF3   /* JOHAB character encoding - KOREAN */

>I guess, Agfa took this from True Type specification.
>I need to map Adobe to Agfa.

Hmm, what are these macros used for? Does the Agfa library return
these values to report the internal encoding of a TTF? Or its
internal charset? Or does the caller pass one to the Agfa library
to specify the encoding / charset that the caller uses?

I've never heard of TrueType fonts encoded in JIS, EUC, TCA,
or KSC. I glanced through the OpenType specification, but I could
not find these values. If you know where they come from, please
let me know.

>I've coded :
>
>        switch (ff->client_charset) {
>            case FAPI_CHARSET_Japan1 : fc->ssnum = JIS;       break;
>            case FAPI_CHARSET_Japan2 : fc->ssnum = SHIFT_JIS; break;
>            case FAPI_CHARSET_GB1    : fc->ssnum = GB;        break;
>            case FAPI_CHARSET_Korea1 : fc->ssnum = WANSUNG;   break;
>            case FAPI_CHARSET_CNS1   : fc->ssnum = BIG5;      break;
>            /* fixme : need to verify this with CJK people. */
>        }
>
>Please check and comment it.

Again, please let me know what fc->ssnum is used for.
In any case, the charset classification (Adobe-Japan1 vs. Japan2)
has no relationship to the encoding classification (JIS vs.
SHIFT_JIS).

(Sorry, I could not yet find this code in your posts to
 gs-code-review; please let me know the archive URL or Message-ID.)

>Also, what does Agfa's "EUC", "TCA" and "K" may mean ?
>I recall that sometimes I've met "EUC", but not sure.
>Now I see "TCA" and "K" at first time without an explanation.

I really cannot imagine what Agfa had in mind when writing
these macros...

"EUC" is presumably EUC-JP (CMap: EUC-H).

"K" would seem to be KS C 5601 (later renamed KS X 1001),
but since "WanSung" means EUC-KR, I suppose "K" means
ISO-2022-KR (CMap: KSC-H).

"TCA" stands for "Taipei Computer Association".
In this context, I suppose "TCA" means an
ISO-2022-compliant encoding of the charset CNS-11643:1986
(which includes every unique glyph of Big5, but in a
different order from Big5) or of its successor
CNS-11643:1992. But I don't know whether it means the 7bit
encoding (CMap: CNS1-H) or the 8bit encoding (CMap: CNS-EUC-H).
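
For the primary two-byte set, the 7bit and 8bit forms differ only in
the high bit of each byte, so either guess can be turned into the
other. A minimal sketch, assuming a 94x94 set with both bytes in
0x21..0x7E (the function names are mine, and additional planes, which
need a prefix in EUC, are not handled):

```c
/* Convert a 7-bit two-byte code (each byte 0x21..0x7E), as used in an
 * ISO-2022 style encoding, to its 8-bit EUC form by setting the high
 * bit of both bytes. Returns 0 for input outside the 94x94 grid.
 * Assumption: primary two-byte set only (EUC code set 1). */
unsigned iso2022_to_euc(unsigned code)
{
    unsigned b1 = (code >> 8) & 0xFF;
    unsigned b2 = code & 0xFF;

    if (code > 0xFFFF || b1 < 0x21 || b1 > 0x7E || b2 < 0x21 || b2 > 0x7E)
        return 0;
    return code | 0x8080u;
}

/* Inverse: clear the high bit of both bytes (each byte 0xA1..0xFE).
 * Returns 0 for input that is not a valid code set 1 pair. */
unsigned euc_to_iso2022(unsigned code)
{
    unsigned b1 = (code >> 8) & 0xFF;
    unsigned b2 = code & 0xFF;

    if (code > 0xFFFF || b1 < 0xA1 || b1 > 0xFE || b2 < 0xA1 || b2 > 0xFE)
        return 0;
    return code & 0x7F7Fu;
}
```

For example, the 7bit code 0x4421 becomes 0xC4A1 in the 8bit form.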

>Also what is the difference between JIS and SHIFT_JIS ?
>I've coded it above from scratch.

Primarily, "JIS" means the Japanese Industrial Standards.
The following two are the most important.

JIS X 0201:1969
	defines ASCII (7bit area) and Katakana (8bit area).
	No Hiragana, no Kanji. The encoding unit is 8bit.

JIS X 0208:1978
	defines Katakana, Hiragana, Kanji, and full-width Latin,
	Cyrillic, and Greek alphabets. The encoding unit is 16bit.

"JIS" & "SHIFT_JIS" are often used as names of encoding methods:
JIS is for ISO-2022-JP (7bit encoding of JIS X 0208:1978),
SHIFT_JIS is for the Microsoft encoding (8bit encoding of JIS X
0201:1969 and JIS X 0208:1978).

JIS X 0208:1978 is not upward compatible, because it reuses
the 8bit code area that JIS X 0201:1969 had already used.
To keep JIS X 0201:1969 and add JIS X 0208:1978 within a single
encoding system, the SHIFT_JIS encoding violates the ISO-2022 8bit
scheme. However, the basic glyph ordering of the JIS X 0208:1978
area is the same as in "JIS", so the two are algorithmically
convertible.
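
A minimal sketch of that conversion, using the commonly known
arithmetic for the standard JIS X 0208 area only (lead bytes
0x81..0x9F and 0xE0..0xEF). The function names are mine, and vendor
extension areas are deliberately rejected:

```c
/* Convert a JIS X 0208 code (two bytes, each 0x21..0x7E) to Shift_JIS.
 * Two JIS rows are packed into one Shift_JIS lead byte; odd rows land
 * in trail bytes 0x40..0x9E (skipping 0x7F), even rows in 0x9F..0xFC.
 * Returns 0 for invalid input. */
unsigned jis_to_sjis(unsigned jis)
{
    unsigned j1 = (jis >> 8) & 0xFF, j2 = jis & 0xFF;
    unsigned s1, s2;

    if (jis > 0xFFFF || j1 < 0x21 || j1 > 0x7E || j2 < 0x21 || j2 > 0x7E)
        return 0;
    s1 = ((j1 - 0x21) >> 1) + 0x81;
    if (s1 > 0x9F)
        s1 += 0x40;              /* jump over the 0xA0..0xDF area */
    if (j1 & 1) {
        s2 = j2 + 0x1F;
        if (s2 >= 0x7F)
            s2++;                /* 0x7F is never a trail byte */
    } else {
        s2 = j2 + 0x7E;
    }
    return (s1 << 8) | s2;
}

/* Inverse conversion. Returns 0 for bytes outside the standard area. */
unsigned sjis_to_jis(unsigned sjis)
{
    unsigned s1 = (sjis >> 8) & 0xFF, s2 = sjis & 0xFF;
    unsigned j1, j2;

    if (sjis > 0xFFFF)
        return 0;
    if (!((s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF)))
        return 0;
    if (s2 < 0x40 || s2 == 0x7F || s2 > 0xFC)
        return 0;
    if (s1 >= 0xE0)
        s1 -= 0x40;
    j1 = ((s1 - 0x81) << 1) + 0x21;
    if (s2 >= 0x9F) {            /* second (even) row of the pair */
        j1++;
        j2 = s2 - 0x7E;
    } else {
        j2 = s2 - 0x1F;
        if (s2 >= 0x80)
            j2--;                /* undo the skip over 0x7F */
    }
    return (j1 << 8) | j2;
}
```

For example, JIS 0x3021 corresponds to Shift_JIS 0x889F, and JIS 0x2121
to Shift_JIS 0x8140.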

The following are well-known problems with ShiftJIS.

"backslash and yen mark"
------------------------
	JIS X 0201:1969 assigns the "yen mark" to 0x5C.
	In original ASCII, this code was assigned to "backslash".
	In the ShiftJIS encoding system, no character code is
	assigned to "backslash".

	JIS X 0208:1978 does not mention the 7bit code area,
	so "JIS" can leave it as original ASCII, with backslash.
	
	
"hankaku kana"
--------------
	JIS X 0201:1969 itself does not specify glyph forms.
	But the Katakana in JIS X 0201 were usually designed to be
	half-width (so-called "Han-Kaku"), to match the
	fixed half-width ASCII glyphs.

	In ShiftJIS, "hankaku kana" are different characters from
	the 16bit full-width Katakana. But "JIS" cannot provide such
	characters.
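
Both single-byte issues show up when the JIS X 0201 bytes are mapped
to Unicode. A minimal sketch, assuming the usual mapping of 0x5C to
YEN SIGN, 0x7E to OVERLINE, and 0xA1..0xDF to the halfwidth Katakana
block starting at U+FF61 (the function name is mine):

```c
/* Map one JIS X 0201 byte to a Unicode code point.
 * Returns 0 for bytes that have no assignment in JIS X 0201
 * (0x80..0xA0 and 0xE0..0xFF; in Shift_JIS some of these act as
 * lead bytes of two-byte characters instead). */
unsigned long jisx0201_to_ucs(unsigned char b)
{
    if (b == 0x5C)
        return 0x00A5UL;                 /* YEN SIGN, not backslash */
    if (b == 0x7E)
        return 0x203EUL;                 /* OVERLINE, not tilde */
    if (b < 0x80)
        return b;                        /* same as ASCII otherwise */
    if (b >= 0xA1 && b <= 0xDF)
        return 0xFF61UL + (b - 0xA1);    /* halfwidth Katakana */
    return 0;
}
```

A converter that treats the 7bit area as plain ASCII would return
U+005C for 0x5C instead, which is exactly the backslash / yen mark
ambiguity.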


"system specific characters"
----------------------------
	Microsoft and various vendors following Microsoft added
	their own characters (mainly symbols, punctuation, enclosed
	letters, and sometimes kanji) to unassigned code areas.
	These characters are really system-specific.
	For example, "enclosed one" in MS Windows is
	displayed as "enclosed Monday" on Macintosh.

Regards,

mpsuzuki