Subject: Re: UCS-2 vs. UCS-4
From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Fri Jun 22 2001 - 11:18:32 CDT


Mike Nordell wrote:
>
> Please see this post as more-or-less brainstorming.
>
> It seems that currently all (?) of us don't use anything larger than UCS-2,
> but in a not too distant future perhaps we will have to use 2^32 for
> character representations (makes me wish for plain ASCII and console-mode
> again - I sure as hell don't want to keep track of 4 _billion_ chars).
>
> I don't know if this is a problem already, but if it is, what about creating
> a factory for encodings? Like:
>
> ASCII_Factory
> UTF8_Factory
> UCS2_Factory
> UCS4_Factory
>
> and let them return objects that can handle (what to the outside looks like
> a linked list of "void*") the chars from a document (or piece table or
> whatever, I'm not sure at what level this should be implemented)?
>
> My idea was something like:
> Start at ASCII. If someone enters an outside-ASCII-range char, the
> document is "upgraded" to the next level that can handle that type of char.
>
> When saving, check what max "level" is used, and save using that one.
> Example: If someone used 16-bit chars but entered a UCS-4 char, the engine
> would "upgrade" the full document [1] to UCS-4. When saving, if those
> specific characters were removed, it would "back down" to UCS-2.

I think this is a good idea. The key is to have a "string" class
about which we make no assumptions regarding character
representation. UT_UCSChar currently bakes in exactly such an
assumption.
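
Just to make that concrete, here's a rough sketch of the kind of
interface I mean (UT_AbstractString and UT_StringFactory are made-up
names for illustration, not anything in the tree today):

    // Hypothetical sketch only - not actual AbiWord code.
    #include <cstddef>

    typedef unsigned int UT_UCS4Char;   // one full codepoint

    // A string about which callers make no assumptions regarding
    // character representation.
    class UT_AbstractString
    {
    public:
        virtual ~UT_AbstractString() {}
        virtual size_t byteLength() const = 0;
        // Appending may "upgrade" the backing store to a wider
        // representation if cp doesn't fit the current one.
        virtual void append(UT_UCS4Char cp) = 0;
        // Forward iteration only - no random access promised.
        // Returns the codepoint at pos and advances pos past it.
        virtual UT_UCS4Char next(size_t & pos) const = 0;
    };

    // Factory picks the cheapest representation that fits.
    class UT_StringFactory
    {
    public:
        // widestChar is the largest codepoint the caller expects;
        // returns an ASCII-, UCS-2- or UTF-32-backed string.
        static UT_AbstractString * create(UT_UCS4Char widestChar);
    };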

However, UTF-8, UTF-16, and UTF-32 can all handle codepoints beyond
sixteen bits. UCS-2 and UTF-16 differ in exactly this regard:
UTF-16 handles surrogates, while UCS-2 is fixed-width sixteen bits
and does not; otherwise the two are the same thing. UTF-8 is always
8-bit based, so it is a superset of ASCII - no upgrading needed. It
is multibyte, meaning that for a 31 (not 32) bit range of characters
it takes from 1 to 6 bytes per character: usually 1 byte for
English, 2 bytes for accented characters, 3 bytes for
Chinese/Japanese/Korean, and more for really exotic stuff hardly
used yet. UCS-2 can handle a range of 2^16 characters in a single
sixteen-bit value. Above that, UTF-16 uses "surrogates", which
means we now have to handle possible pairs of sixteen-bit values.
This is where our UT_UCSChar is not compatible. UTF-32 always uses
a single 32-bit value to hold any character whatsoever.
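
For anyone who wants those sizes spelled out, here's a small
illustrative sketch (my own helper names, not AbiWord code):

    #include <cassert>

    typedef unsigned int   UT_UCS4Char;
    typedef unsigned short UT_UCS2Char;

    // Bytes one codepoint takes in UTF-8 (original 31-bit form).
    int utf8_length(UT_UCS4Char cp)
    {
        if (cp < 0x80)      return 1;   // ASCII / English
        if (cp < 0x800)     return 2;   // most accented Latin
        if (cp < 0x10000)   return 3;   // CJK, rest of the BMP
        if (cp < 0x200000)  return 4;
        if (cp < 0x4000000) return 5;
        return 6;                       // up to 0x7FFFFFFF
    }

    // Split a codepoint above 0xFFFF into a UTF-16 surrogate pair -
    // the case a single sixteen-bit value can't represent.
    void to_surrogates(UT_UCS4Char cp, UT_UCS2Char & hi, UT_UCS2Char & lo)
    {
        assert(cp >= 0x10000 && cp <= 0x10FFFF);
        cp -= 0x10000;                              // now a 20-bit value
        hi = (UT_UCS2Char)(0xD800 | (cp >> 10));    // high surrogate
        lo = (UT_UCS2Char)(0xDC00 | (cp & 0x3FF));  // low surrogate
    }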

So if we used UTF-8 internally we would never need to upgrade, and
sizes are always pretty good. But we would need functions which
iterate through the string and never use true random access. We can
also do this with UTF-16 (UCS-2 plus surrogates) if we handle the
surrogates properly; again, that means no true random access. If we
really do need random access we might be able to have a
UTF-8 -> UCS-2 (no surrogates) -> UTF-32 system of upgrades.
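
Iteration with surrogate handling is only a few lines - something
like this (again just a sketch, my own function name):

    #include <cstddef>

    typedef unsigned int   UT_UCS4Char;
    typedef unsigned short UT_UCS2Char;

    // Return the codepoint starting at pos in a UTF-16 buffer and
    // advance pos past it, merging a surrogate pair when present.
    UT_UCS4Char next_codepoint(const UT_UCS2Char * buf, size_t len,
                               size_t & pos)
    {
        UT_UCS2Char u = buf[pos++];
        if (u >= 0xD800 && u <= 0xDBFF && pos < len)   // high surrogate?
        {
            UT_UCS2Char v = buf[pos];
            if (v >= 0xDC00 && v <= 0xDFFF)            // low surrogate?
            {
                ++pos;
                return 0x10000 + (((UT_UCS4Char)(u - 0xD800) << 10)
                                  | (UT_UCS4Char)(v - 0xDC00));
            }
        }
        return u;   // plain BMP value (or a lone surrogate, passed up)
    }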

There are also issues involving the concept of a character versus a
codepoint versus a glyph, which boil down to the reality that we
should always treat a single "character" as a string.
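
For example, e-acute can be the single codepoint U+00E9 or the pair
U+0065 U+0301 ('e' plus combining acute) - one "character" either way:

    typedef unsigned int UT_UCS4Char;

    UT_UCS4Char precomposed[] = { 0x00E9 };          // e-acute, one codepoint
    UT_UCS4Char decomposed[]  = { 0x0065, 0x0301 };  // 'e' + combining acute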

Properly designed and coded, I don't think this is difficult, and it
should mean we can still have an ASCII-only build and a fully
multilingual build and keep everyone happy.

Andrew Dunbar.

-- 
http://linguaphile.sourceforge.net
