Arc Forumnew | comments | leaders | submitlogin
4 points by sacado 6131 days ago | link | parent

If I'm right, there are many concepts behind Unicode :

A Unicode string is a sequence of codepoints (or characters).

A codepoint (or character) is a numeric id that can be represented by many ways (UTF-32 : always 4 bytes ; UTF-8 : only one byte for codepoints < 128 ; from 2 to 4 bytes for codepoints >= 128 ; ...).

A glyph is what you display on the screen : a Unicode string is displayed as a sequence of glyphs. A glyph can be a single character, or the combination of 2 or more characters.

e.g., 10 glyphs on the screen can be represented as 11 characters (the two last ones being composed in a single glyph). Depending on the underlying "physical" encoding, these 11 characters can occupy 44 bytes (UTF-32) (with a O(1) access to substrings) or, say, 25 bytes (UTF-8) (with a O(n) access to substrings), or ...

In a few words : characters have a meaning in Unicode, but they don't match well with bytes (and even with physical representation) and, sometimes, with the way things are representing on the string.

Correct me if I'm wrong.