For this week, here's the plan, Fran…
C
In computer concepts, a character is a unit of information in textual data, which has a
name associated with it. Some easy examples, of course, are the alphabet we are used to using, in which
each of the symbols, or letters, is usually called a character
. However, there are LOTS of other
characters, which go by names such as:
A grapheme is a minimally distinctive unit of writing in some writing system. It is what a person usually thinks of as a character. However, it may take more than one character to make up a grapheme. For example, the grapheme:
R̊
is made up of two characters (1) LATIN CAPITAL LETTER R and (2) COMBINING RING ABOVE. The grapheme:
நி
is made up of two characters (1) TAMIL LETTER NA and (2) TAMIL VOWEL SIGN I. This grapheme:
🚴🏾
is made up of two characters (1) BICYCLIST and (2) EMOJI MODIFIER FITZPATRICK TYPE-5. This grapheme:
🏄🏻♀
is made up of four characters (1) SURFER, (2) EMOJI MODIFIER FITZPATRICK TYPE-1-2, (3) ZERO-WIDTH JOINER, (4) FEMALE SIGN. And this grapheme:
🇨🇻
requires two characters: (1) REGIONAL INDICATOR SYMBOL LETTER C and (2) REGIONAL INDICATOR SYMBOL LETTER V. It's the flag for Cape Verde (CV).
Finally, a glyph is a picture of a character [or grapheme]. Two or more characters can share the same glyph [e.g. LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA], and one character can have many glyphs [think fonts, e.g., A A A and so on…]
Investigation: Go find out from the Internet what you need to know to answer
the question,
What glyph can represent the three characters LATIN SMALL LETTER O WITH STROKE,
DIAMETER SIGN, andEMPTY SET.?
Then draw your own copy of it.
Investigation: Go find out from the Internet what you need to know to answer
the question, What two characters would you use to make the Spanish flag?
A character set, then, is a set of these characters that are linked by some important unifying characteristic[s]. It has two parts: (1) a repertoire, which is the characters which make up the set, and (2) a code position mapping, which is a function mapping a non-negative integer to EACH INDIVIDUAL character in the repertoire.
As usual, the term mapping means a one-to-one correspondence between these
two values. For example, if the value 65
is set up to correspond to the character
, that is the kind of mapping referred to here. Thus, another term
for you to know is, when an integer Uppercase Latin A
i
maps to a character c
we say i
is the code point of c
.
A character set, then, can also be considered a set of code points.
Three older examples of character sets are BCD [which stands for Binary-Coded Decimal], EBCDIC [which stands for Extended Binary Coded Decimal Interchange Code], and ASCII [which stands for American Standard Code for Information Interchange].
BCD is the oldest of the three. It is a way of representing decimal numbers in which each digit is encoded into a specific fixed number of bits. Usually, either four or eight bits are used, with four being the default since that is all that are required. Note that BCD is specifically used for numeric digit values, NOT for letter characters, and as implied by the name, it encodes decimal digits.
The main value of BCD is that it is more easily human readable than normal binary, because each four bits
[or tetrade, another word for nybble
] only handles one digit. Since the values
only go from zero to nine, the bit patterns correspond to the binary tables we saw last week. BCD was a
part of most operating systems and processor instruction sets up until modern times, but with the advent
of modern processors such as the ARM and later versions of x86 it is no longer used. It is still useful
to know, however, because there are hardware chips that use the incoding.
This link will give you more information, and this table shows the encoding scheme.
|
Because in most computers the smallest unit of data storage is one
byte, there is a possibility for a lot of wasted space with BCD. For example, the number
Packed BCD also allows for signed values, but the sign is put at the END instead of at the
beginning, and consists of four bits, as shown in the |
EBCDIC is probably one of the oldest ways of encoding characters so that digital computers can haneld
them. It started in 1963 – 1964, and was invented by IBM for use on its System/360 mainframe computers
[which should give you a heads-up on something else all good computer scientists should know about, the
book called The Mythical Man Month
by Fred Brooks, who was the chief system architect of that
system].
EBCDIC is an 8-bit encoding scheme, which was created to extend BCD and its follow-on, BCDIC. These two encodings were used to help insure the physical integrity of punched cards, which at the time were used to imput both code and data into the mainframe computers. All IBM mainframe computers used EBCDIC as the basic encoding scheme for many years. In fact, there is actually an EBCDIC-oriented Unicode Transformation Format called UTF-EBCDIC proposed by the Unicode Consortium, which is designed to allow EBCDIC software to handle Unicode! |
ASCII is, apart from the modern Unicode sets, the most well-known character encoding scheme. It was also developed primarily by IBM, and actually its development was in process during the same period that the EBCDIC set was in work.
ASCII was originally a 7-bit coding, but over time, since the standardization of 8-bits for
the byte and the need for more than 128 characters, it turned into an 8-bit code. The first
32 characters of ASCII are the so-called Another useful artifact of ASCII is that several programming languages [such as Java] treat individual characters as synonymous with their ASCII code points. In such cases, you can actually add and subtract numeric values from the characters, so they can be used as array indices and for other such operations. |
Another example of a character set is Unicode. Here is part of its code point mapping [note that code points are traditionally written in hex]:
25 PERCENT SIGN 2C COMMA 54 LATIN CAPITAL LETTER T 5D RIGHT SQUARE BRACKET B0 DEGREE SIGN C9 LATIN CAPITAL LETTER E WITH ACUTE 2AD LATIN LETTER BIDENTAL PERCUSSIVE 39B GREEK CAPITAL LETTER LAMDA 446 CYRILLIC SMALL LETTER TSE 543 ARMENIAN CAPITAL LETTER CHEH 5E6 HEBREW LETTER TSADI 635 ARABIC LETTER SAD 71D SYRIAC LETTER YUDH 784 THAANA LETTER BAA 94A DEVANAGARI VOWEL SIGN SHORT O 9D7 BENGALI AU LENGTH MARK BEF TAMIL DIGIT NINE D93 SINHALA LETTER AIYANNA F0A TIBETAN MARK BKA- SHOG YIG MGO 11C7 HANGUL JONGSEONG NIEUN-SIOS 1293 ETHIOPIC SYLLABLE NAA 13CB CHEROKEE LETTER QUV 2023 TRIANGULAR BULLET 20A4 LIRA SIGN 20B4 HRYVNIA SIGN 2105 CARE OF 213A ROTATED CAPITAL Q 21B7 CLOCKWISE TOP SEMICIRCLE ARROW 2226 NOT PARALLEL TO 2234 THEREFORE 2248 ALMOST EQUAL TO 265E BLACK CHESS KNIGHT 30FE KATAKANA VOICED ITERATION MARK 4A9D HAN CHARACTER LEATHER THONG WOUND AROUND THE HANDLE OF A SWORD 7734 HAN CHARACTER DAZZLED 99ED HAN CHARACTER TERRIFY, FRIGHTEN, SCARE, SHOCK AAB9 TAI VIET VOWEL UEA 1201F CUNEIFORM SIGN AK TIMES SHITA PLUS GISH 1D111 MUSICAL SYMBOL FERMATA BELOW 1D122 MUSICAL SYMBOL F CLEF 1F08E DOMINO TILE VERTICAL-06-01 1F001 SQUID 1F0CE PLAYING CARD KING OF DIAMONDS 1F382 BIRTHDAY CAKE 1F353 STRAWBERRY 1F4A9 PILE OF POO
Because characters can have multiple glyphs, Unicode lets you represent characters unambiguously with
U+
followed by four to six hex digits [e.g. U+00C9
, U+1D122
].
How many characters are possible with Unicode? The powers that be have declared that the highest code point they will ever map is 10FFFF, so it seems like 1,114,112 code points are possible. However, they also declared they would never map characters to code points D800..DFFF, FDD0..FDEF, {0..F}FFFE, {0..F}FFFF, 10FFFE, 10FFFF [2114 code points]. So the maximum number of possible characters is 1,111,998.
How many characters have been assigned [so far]? See this cool table at Wikipedia to see how many characters have actually been mapped in each version of Unicode. [BTW, if you don't peek, Unicode Version 10.0 has 136,755 characters mapped, so there is a lot of room to grow.]
Please see the complete and up-to-date code charts, with example glyphs of every character. If you would like a much easier way to browse the characters [and why wouldn't you?], check out the beautiful charbase.com and codepoints.net. [Charbase and Codepoints are run by volunteers, so they may not be up to date.]
The code points are not assigned haphazardly. They are allocated into blocks. There are around 300 blocks. Make sure to see the official blocks file at Unicode.org. If you need a little preview, here is a little slice of that file:
0000..007F Basic Latin 0080..00FF Latin-1 Supplement 0250..02AF IPA Extensions 0400..04FF Cyrillic 0F00..0FFF Tibetan 2200..22FF Mathematical Operators 2700..27BF Dingbats 3100..312F Bopomofo 10860..1087F Palmyrene 12000..123FF Cuneiform 1D100..1D1FF Musical Symbols 1F000..1F02F Mahjong Tiles F0000..FFFFF Supplementary Private Use Area-A 100000..10FFFF Supplementary Private Use Area-B
Every Unicode code point is assigned various properties, such as name and category. Here are the categories:
Code | Description | Examples | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Lu | Letter, Uppercase | U+0059 LATIN CAPITAL LETTER Y U+048C CYRILLIC CAPITAL LETTER SEMISOFT SIGN U+10C1 GEORGIAN CAPITAL LETTER HE U+1041C DESERET CAPITAL LETTER THEE | |||||||||
Ll | Letter, Lowercase | U+00EE LATIN SMALL LETTER I WITH CIRCUMFLEX U+03CE GREEK SMALL LETTER OMEGA WITH TONOS U+A755 LATIN SMALL LETTER P WITH SQUIRREL TAIL U+ABBC CHEROKEE SMALL LETTER WO | |||||||||
Lt | Letter, Titlecase | U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z U+1FFC GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI | |||||||||
Lm | Letter, Modifier | U+02C9 MODIFIER LETTER MACRON U+0971 DEVANAGARI SIGN HIGH SPACING DOT U+3031 VERTICAL KANA REPEAT MARK U+16F98 MIAO LETTER TONE-7 | |||||||||
Lo | Letter, Other | U+01C2 LATIN LETTER ALVEOLAR CLICK U+06BD ARABIC LETTER NOON WITH THREE DOTS ABOVE U+2D85 ETHIOPIC SYLLABLE BOA U+A42E YI SYLLABLE JJYX | |||||||||
Nd | Number, Decimal Digit | U+0038 DIGIT EIGHT U+0B68 ORIYA DIGIT TWO U+1815 MONGOLIAN DIGIT FIVE U+1D7E0 MATHEMATICAL DOUBLE-STRUCK DIGIT EIGHT | |||||||||
Nl | Number, Letter | U+16EF RUNIC TVIMADUR SYMBOL U+2166 ROMAN NUMERAL SEVEN U+3028 HANGZHOU NUMERAL EIGHT U+10173 GREEK ACROPHONIC DELPHIC FIVE MNAS | |||||||||
No | Number, Other | U+00BD VULGAR FRACTION ONE HALF U+137A ETHIOPIC NUMBER NINETY U+2472 CIRCLED NUMBER NINETEEN U+10120 AEGEAN NUMBER EIGHT HUNDRED | |||||||||
Sm | Symbol, Math | U+00D7 MULTIPLICATION SIGN U+0608 ARABIC RAY U+221E INFINITY U+2978 GREATER-THAN ABOVE RIGHTWARDS ARROW | |||||||||
Sc | Symbol, Currency | U+00A5 YEN SIGN U+20A9 WON SIGN U+20AC EURO SIGN | |||||||||
Sk | Symbol, Modifier | U+00B4 ACUTE ACCENT U+02F9 MODIFIER LETTER BEGIN HIGH TONE U+FBC0 ARABIC SYMBOL SMALL TAH ABOVE | |||||||||
So | Symbol, Other | U+00A9 COPYRIGHT SIGN U+1391 ETHIOPIC TONAL MARK DERET U+230D BOTTOM LEFT CROP U+23E6 AC CURRENT U+2677 RECYCLING SYMBOL FOR TYPE-5 PLASTICS U+2827 BRAILLE PATTERN DOTS-1236 | |||||||||
Pc | Punctuation, Connector | U+005F LOW LINE U+2040 CHARACTER TIE | |||||||||
Pd | Punctuation, Dash | U+002D HYPHEN-MINUS U+2013 EN DASH U+2014 EM DASH U+301C WAVE DASH | |||||||||
Ps | Punctuation, Open | U+0028 LEFT PARENTHESIS U+005B LEFT SQUARE BRACKET U+007B LEFT CURLY BRACKET U+29D8 LEFT WIGGLY FENCE | |||||||||
Pe | Punctuation, Close | U+0029 RIGHT PARENTHESIS U+005D RIGHT SQUARE BRACKET U+007D RIGHT CURLY BRACKET U+0F3D TIBETAN MARK ANG KHANG GYAS U+301B RIGHT WHITE SQUARE BRACKET | |||||||||
Pf | Punctuation, Final quote | U+2019 RIGHT SINGLE QUOTATION MARK U+201D RIGHT DOUBLE QUOTATION MARK | |||||||||
Pi | Punctuation, Initial quote | U+2018 LEFT SINGLE QUOTATION MARK U+201C LEFT DOUBLE QUOTATION MARK | |||||||||
Po | Punctuation, Other | U+0021 EXCLAMATION MARK U+002A ASTERISK U+104F MYANMAR SYMBOL GENITIVE U+203D INTERROBANG | |||||||||
Zl | Separator, Line | U+2028 LINE SEPARATOR | |||||||||
Zp | Separator, Paragraph | U+2029 PARAGRAPH SEPARATOR | |||||||||
Zs | Separator, Space | U+0020 SPACE U+00A0 NO-BREAK SPACE U+2009 THIN SPACE | |||||||||
Cc | Other, Control | U+0000 NULL U+0008 BACKSPACE U+000A LINE FEED [LF] U+0093 SET TRANSMIT STATE U+009C STRING TERMINATOR | |||||||||
Cf | Other, Format | U+00AD SOFT HYPHEN U+0604 ARABIC SIGN SAMVAT U+200B ZERO WIDTH SPACE U+200D ZERO WIDTH JOINER U+200F RIGHT-TO-LEFT MARK U+E007F CANCEL TAG | |||||||||
Co | Other, Private Use | This category is given to code points in the following [inclusive] ranges: E000..F8FF, F0000..FFFD, and 100000..10FFFD. | |||||||||
Cs | Other, Surrogate | The 2048 code points in the range D800..DFFF [inclusive] are called surrogate code points and do not, and will not map to any character | |||||||||
Cn | Other, Not Assigned | Many code points are not yet assigned to a character, but 66 of the code points in this group are permanently and forever unassigned. These 66 are FDD0..FDEF [inclusive], FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE, 2FFFF, 3FFFE, 3FFFF, 4FFFE, 4FFFF, 5FFFE, 5FFFF, 6FFFE, 6FFFF, 7FFFE, 7FFFF, 8FFFE, 8FFFF, 9FFFE, 9FFFF, AFFFE, AFFFF, BFFFE, BFFFF, CFFFE, CFFFF, DFFFE, DFFFF, EFFFE, EFFFF, FFFFE, FFFFF, 10FFFE, 10FFFF.
Mc | Mark, Spacing Combining | U+094C DEVANAGARI VOWEL SIGN AU | U+0DF3 SINHALA VOWEL SIGN DIGA GAYANUKITTA U+16F7E MIAO VOWEL SIGN NG Mn | Mark, Nonspacing | U+0301 COMBINING ACUTE ACCENT | U+064B ARABIC FATHATAN U+0EB9 LAO VOWEL SIGN UU Me | Mark, Enclosing | U+20DD COMBINING ENCLOSING CIRCLE | U+A670 COMBINING CYRILLIC TEN MILLIONS SIGN |
It's common to use multiple characters to make up graphemes. This is often done to put marks on letters, modify skin tones, modify genders, and create more complex graphemes. Examples:
ş̌́ | U+0073 LATIN SMALL LETTER S U+0327 COMBINING CEDILLA U+030C COMBINING CARON U+0301 COMBINING ACUTE ACCENT | [Note that this looks different on different browsers…] |
👶 | U+1F476 BABY | baby |
👶🏻 | U+1F476 BABY U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2 | baby: light skin tone |
👶🏼 | U+1F476 BABY U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 | baby: medium-light skin tone |
👶🏽 | U+1F476 BABY U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 | baby: medium skin tone |
👶🏾 | U+1F476 BABY U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 | baby: medium-dark skin tone |
👶🏿 | U+1F476 BABY U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-6 | baby: dark skin tone |
👩⚖️ | U+1F469 WOMAN U+200D ZERO WIDTH JOINER U+2696 SCALES U+FE0F VARIATION SELECTOR-16 | woman judge |
👩🏽⚕️ | U+1F469 WOMAN U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 U+200D ZERO WIDTH JOINER U+2695 STAFF OF AESCULAPIUS U+FE0F VARIATION SELECTOR-16 | woman health worker: medium skin-tone |
👨🌾 | U+1F468 MAN U+200D ZERO WIDTH JOINER U+1F33E EAR OF RICE | man farmer |
👩🏿🔬 | U+1F469 WOMAN U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 U+200D ZERO WIDTH JOINER U+1F52C MICROSCOPE | woman scientist: dark skin-tone |
🧜🏼♀ | U+1F9DC MERPERSON U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 U+200D ZERO WIDTH JOINER U+2640 FEMALE SIGN | mermaid: medium-light skin tone |
🧘♂ | U+1F9DC PERSON IN LOTUS POSITION U+200D ZERO WIDTH JOINER U+2642 MALE SIGN | man in lotus position |
🤽🏾♀️ | U+1F93D WATER POLO U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 U+200D ZERO WIDTH JOINER U+2640 FEMALE SIGN U+FE0F VARIATION SELECTOR-16 | woman playing water polo: medium-dark skin tone |
See all the emojis at Unicode.org.
Investigation: Research and describe the function of VARIATION SELECTOR-16. How is it different from VARIATION SELECTOR-15?
Where can I get a list of all the fun stuff?
Awesome Codepoints is super cool.
Here are two character sequences:
In some sense they should represent the same text, right? In fact, they only differ in that the second sequence has a pre-composed character. Since the character sequences are different, we can make get them to compare equal only if we normalize them. Unicode defines four Normalization Forms.
Name | Stands for | Algorithm |
---|---|---|
NFD | Canonical Decomposition | Decompose by canonical equivalence, arranging combiners in specific order [gives fully expanded strings, so more space but fastest algorithm] |
NFC | Canonical Composition | Decompose then recompose by canonical equivalence [takes longer but gives shorter strings] |
NFKD | Compatibility Decomposition | Decompose by compatibility, arranging combiners in specific order |
NFKC | Compatibility Composition | Decompose by compatibility then recompose by canonical equivalence |
WHAT? Okay, take it slow. Canonical decomposition and composition does the right thing. Compatibility is lossy [one-way] but useful for search: It turns things like ligatures, roman numerals, subscripts, and superscripts into simpler forms. See why it's great for search?
For example:
Input | NFD | NFC | NFKD | NFKC |
---|---|---|---|---|
U+00F1 LATIN SMALL LETTER N WITH TILDE U+0063 LATIN SMALL LETTER C U+0307 COMBINING DOT ABOVE U+0327 COMBINING CEDILLA U+2077 SUPERSCRIPT SEVEN U+FB01 LATIN SMALL LIGATURE FI U+2168 ROMAN NUMERAL NINE U+2468 CIRCLED DIGIT NINE | n ̃ c ̧ ̇ ⁷ fi Ⅸ ⑨ | ñ ç ̇ ⁷ fi Ⅸ ⑨ | n ̃ c ̧ ̇ 7 f i I X 9 | ñ ç ̇ 7 f i I X 9 |
Unicode is really the only character set you should be working with. However, other character sets exist, and you should probably know something about them.
ISO8859-1 is a character set that is exactly equivalent to the first 256 mappings of Unicode. Obviously it doesn't have enough characters.
These 15 charsets also have 256-character repertoires. They all share the same characters in the first 128 positions, but differ in the next 128.
Details at http://www.unicode.org/Public/MAPPINGS/ISO8859/.This character set, with a repertoire of 256 characters, also known as CP1252, can be found at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT. It is very close to ISO8859-1. Be careful with this set! Users of Windows systems often unknowingly produce documents with this character set, then forget to specify it when making these documents available on the web or transporting them via other protocols with tend to default to Unicode. Then the end result is annoying. It's best to avoid this.
The last thing about ASCII is that is exactly equivalent to the first 128 mappings of Unicode. Obviously it doesn't have enough characters for EVERYTHING. However it is commonly used, because many Internet protocols require it! It is a common subset of many character sets and something most people can agree on. Here is a link to an ASCII table.
A character encoding specifies how a character [or character string] is encoded in a bit string. Character sets with 256 characters or less have only a single encoding: they encode each character in a single byte [so in some sense the character set and the character encode are about the same]. But there are, naturally, many encodings of Unicode. The most important are UTF-32, UTF-16 and UTF-8.
This is the simplest. Just encode each character in 32 bits. The encoding of a character is simply its code point! Couldn't be more straightforward. Of course, you try to convince people to actually USE four bytes per character.
There are actually two kinds: UTF-32BE [Big Endian] and UTF-32LE [Little Endian]. Examples:
Unicode Character | UTF-32BE | UTF-32LE |
---|---|---|
RIGHT SQUARE BRACKET [U+005D] | 00 00 00 5D | 5D 00 00 00 |
CHEROKEE LETTER QUV [U+13CB] | 00 00 13 CB | CB 13 00 00 |
MUSICAL SYMBOL F CLEF [U+1D122] | 00 01 D1 22 | 22 D1 01 00 |
Note that every character sequence can be encoded into a byte sequence, but not every byte sequence can
be decoded into a character sequence. For example, the byte sequence CC CC CC CC
does not
decode to any character because there is not code point CCCCCCCC
in Unicode.
UTF-32
The Good: It's fixed width! Constant-time to find the nth character in a string [provided you care].
The Bad: It's pretty bloated.
In UTF-16 some characters are encoded in 16 bits and some in 32 bits.
Character Range | Bit Encoding |
---|---|
U+0000 ... U+FFFF | xxxxxxxx xxxxxxxx |
U+10000 ... U+10FFFF | let y = X-1000016 in 110110yy yyyyyyyy 110111yy yyyyyyyy |
Note that all characters requiring 32-bits have their first 16 bits in the range D800..D8FF. Pretty slick, right? Those code points are never assigned to any character in Unicode.... How perfect, dontcha think?
There are actually two kinds: UTF-16BE and UTF-16LE. Examples:
Unicode Character | UTF-16BE | UTF-16LE |
---|---|---|
RIGHT SQUARE BRACKET [U+005D] | 00 5D | 5D 00 |
CHEROKEE LETTER QUV [U+13CB] | 13 CB | CB 13 |
MUSICAL SYMBOL F CLEF [U+1D122] | D8 34 DD 22 | 22 DD 34 D8 |
UTF-16
The Good: Nothing is good about this encoding.
The Bad: Variable width, almost always uses more space than UTF-8, even for East Asian scripts, people.
The Ugly: It's what JavaScript thinks characters are. Facepalm.
Here's another variable length encoding.
Character Range | Bit Encoding | Number of Bits |
---|---|---|
U+0000 ... U+007F | 0xxxxxxx | 7 |
U+0080 ... U+07FF | 110xxxxx 10xxxxxx | 11 |
U+0800 ... U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 16 |
U+10000 ... U+1FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 |
Examples:
Unicode Character | UTF-8 Encoding |
---|---|
RIGHT SQUARE BRACKET [U+005D] | 5D |
LATIN CAPITAL LETTER E WITH ACUTE [U+00C9] | C3 89 |
CHEROKEE LETTER QUV [U+13CB] | E1 8F 8B |
MUSICAL SYMBOL F CLEF [U+1D122] | F0 9D 84 A2 |
And yeah, you never have to care about big-endian or little-endian with UTF-8.
UTF-8 absolutely rocks. The number of advantages it has is stunning. For example:
- ASCII text is unchanged in UTF-8.
- Non-ASCII characters are never coded with ASCII characters.
- C programmers that have assumed that the octet 00 terminates a string won't have to change their code [UTF-16 is full of these things!].
- Western languages can represent text in about 1.1 bytes per character—big savings over the other UTFs.
- When decoding you can always determine if you've ended up in the middle of a multibyte encoding [no encoding starts with 10].
- The number of leading ones in the first byte tells you how many bytes the character needs.
- Encoding and decoding are done with shifts and logical bitmask operations, no division, so processing is very fast.
- You can [lexicographically] sort by simply pretending each byte is a character.
- The octets FE and FF never appear, so mass confusion about byte-order marks and end-of-file-markers improperly processed by lazy C programmers never occurs.
All character sequences can be encoded as UTF-8 byte sequences, but many byte sequences cannot be decoded as UTF-8. Given a 32-bit unsigned integer, you can use this pseudocode to get its [1–4] bytes:
void to_utf_8( uint32 c ) { if( c > 0x10FFFF ) { throw new Error( 'Code point too large'); } else if( c <= 0x7F ) { emit_byte( c ); } else if( c <= 0x7FF ) { emit_byte( 0xC0 | c >> 6 ); emit_byte( 0x80 | c & 0x3F ); } else if( c <= 0xFFFF ) { emit_byte( 0xE0 | c >> 12 ); emit_byte( 0x80 | c >> 6 & 0x3F ); emit_byte( 0x80 | c & 0x3F ); } else { emit_byte( 0xF0 | c >> 18 ); emit_byte( 0x80 | c >> 12 & 0x3F ); emit_byte( 0x80 | c >> 6 & 0x3F ); emit_byte( 0x80 | c & 0x3F ); } }
Investigation: The code above admits too many characters. Modify this code to throw an exception for the 66 permanently unassigned code points as well.
Here's Tom Scott's overview of UTF-8:
You should also read the UTF-8 Everywhere Manifesto.
Investigation: For each of the following characters, give their encodings (where possible) in the UTF-32BE, the UTF-16BE, the UTF-8, the ISO8859-1 and the Windows-1252 character sets:
Do these encodings by hand. The purpose of this practice is to develop understanding.
A huge and annoying mistake people make is giving people text and not telling others what encoding it is using. For example, if you type in the text:
"Olé"™
…into a text editor that encodes your text in UTF-8, then you have stored the data:
E2 80 9C 4F 6C C3 A9 E2 80 9D E2 84 A2
Now if you didn't know the encoding and you just assumed the data was encoded in Windows-1252, you would decode the bytes as follows:
“Olé�™
Actually that was cheating a little bit: that ?
isn't really correct; it's a placeholder because
there is no character in Windows-1252 for the code point 9D. Oh, well. Anyway, conversely, suppose you
started with "Olé"™
and encoded this in Windows-1252. This gives the byte sequence:
93 4F 6C E9 94 99
If you tried to decode this as a UTF-8 character sequence you would get an error. Decoding as a UTF-16BE sequence you would get:
???
How hard can this be? Do you know the answers to these two simple questions?
Well, it's hard. But if you've read this far, you at least are now better able to debug problems.
But wait, think about this a bit. Maybe the fact that these questions are hard means these are the wrong questions. Really. Consider:
Aren't these problems with strings widely known? Well, Edaqa Motoray has written about this. He says programming languages dont't need a string type at all. And in fact, if your language does have one, it is probably badly broken. It helps, if you want to get a handle on all this, that strings and text are not the same.
Unicode is a huge topic. The standard is over 1,000 pages long. There is a ridiculous amount to learn. These notes have barely scratched the surface. Maybe you would like to go deeper. Maybe you would like to know everything there is to know.
Start with these articles, and follow links within them as well:
Got weeks of free time? You can read the entire one-thousand page Unicode standard.
C
We have seen the basics of a C
program from the previous exercises. This week we'll go a step
further, and get into some details that you will need so that you can understand the code you'll be
seeing and writing during the semester when you do your homework sets.
Because Java was originally based on C
, the syntax is very similar. The if
statement,
the for
loop, the while
loop, the switch
statement, as well as
things like break
, continue
, and other keywords are all [pretty much] the
same in C
as you see them in Java. Further, the languages are both statically typed, which is
another similarity. Also, variable declarations are as you would expect, such as int i = 23;
or double d;
or long l = 32547619845;
. All statements end with a semi-colon,
as you would expect, and curly braces, square brackets, and parentheses all work the way you think. So
do all the operators.
The differences arise when you begin to look at things like Strings and Arrays. What we usually treat
in Java as objects
can't be handled the same in C
. Further, Java handles the objects
internally by treating them as entities that the JVM will manage for you, but in C
the
programmer has to take on the task of that management herself. As an example, here are a couple of
code snippets:
// in Java... public static final int MY_SYMBOLIC_CONSTANT = 23; String s = new String( "This is a test string" ); int[] myRA1 = new int[10]; int[] myRA2 = { 1, 2, 3, 4, 5 }; System.out.println( "String s contains: " + s ); for( int i = 0; i < myRA1.length; i++ ) { myRA1[i] = i * 2; } |
// inC#define MY_SYMBOLIC_CONSTANT 23 char s[] = "This is a test string"; int myRA1[10]; int myRA2[5]; int i = 0; for( i = 0; i < 5; i++ ) { myRA2[i] = i + 1; } printf( "String s contains: %s", s ); for( i = 0; i < 10; i++ ) { myRA1[i] = i * 2; } |
Note several things here. Pick this code apart.
There are quite a few differences shown here, which also shows you why programmers that have cut their
teeth on Java and other nice
languages are loath to return to those thrilling days of yesteryear
when we had to handle a lot of things on our own. Nevertheless, THIS IS WHERE
MUCH OF THE POWER OF THE C
LANGUAGE COMES FROM, because the programmer has the flexibility
to do a LOT of things that Java won't let her do.
Here is some code from the K&R book that shows something we did in CMSI 281. We implemented a program to copy a file. Here is something similar, that copies one character at a time from its input to its output:
#include <stdio.h> int main( int argc, char * argv[] ) { int c; c = getchar(); while( c != EOF ) { putchar( c ); c = getchar(); } } // BecauseChas expressions like Java, we can write this as: #include <stdio.h> int main( int argc, char * argv[] ) { int c; ; while( (c = getchar()) != EOF ) { putchar( c ); } }
Note several things here. Pick this code apart.
In Java, sub-program units
are called methods
. In C
, as in JavaScript, they are
called functions. A function provides a convenient way to perform some
computation, and once the function is tested and shown to operate properly, it can be used over and
over again without worrying about its internals. All that is needed at that point is for the person
using the function to know its calling signature. As we've seen, there are
a large number of functions that are defined for us, which are part of the C standard library
.
Here is a function that provides one of the methods of Java's Math.pow()
method, along
with the test code for it:
#include <stdio.h> int power( int m, int n ); int main( int argc, char * argv[] ) { int i; for( i = 0; i < 10; i++ ) { printf( "%d %d %d\n", i, power( 2, i ), power( -3, i ) ); } return 0; } int power( int base, int n ) { int i, p; p = 1; for( i = 1; i <= n; ++i ) { p = p * base; } return p; }
Note several things here. Pick this code apart.
The string is arguably one of the most used arrays in the c
language. Yes, strings are treated
as arrays, in this language, specifically an an array of characters. Also,
they are a one-dimensional array, linear, as you'd expect. However, what you DON'T expect is
that you have to do the management of the string yourself. The special character \0
is used to
mark the end of the string, so you need to have the string dimensioned one
character longer than the maximum length of the string to account for the space required. Here is
one last program for this week, which does a string manipulation. We'll write several functions for
this effort, and put them all into the same source file. We'll pick it apart to mine the gold. See if
you can figure out just what this code does.
#include <stdio.h> #define MAXLINE 100 int getline( char line[], int maxline ); void copy( char to[], char from[] ); /* main program - what does this actually *DO*?? */ int main( int argc, char * argv[] ) { int len; /* current line length */ int max; /* maximum length seen so far */ char line[MAXLINE]; char longest[MAXLINE]; max = 0; while( (len = getline( line, MAXLINE )) > 0 ) { if( len > max ) { max = len; copy( longest, line ); } } if( max > 0 ) { printf( "%s", longest ); } return 0; } /* read a line into string s, and return the length */ int getline( char s[], int lim ) { int c, i; for( i = 0; ((i < lim - 1) && ((c = getchar()) !- EOF) && (c != '\n')); ++i ) { s[i] = c; } if( c == '\n' ) { s[i] = c; ++i; } s[i] = '\0'; return i; } /* copy values in "from[]" into "to[]"; assume that "to[]" is big enough! */ void copy( char to[], car from[] ) { int i = 0; while( (to[i] = from[i]) != '\0' ) { ++i; } }
Note several things here. Pick this code apart.
pointerand associated pointer math can be used to access memory in a type-dependent fashion.
One of the strongest parts of the C
language is the ability to access the address
of a variable, not just its value. In Java we don't have this ability, because we only either have
primitive types which are allocated on the stack, or we have REFERENCE variables which are
allocated on the heap. However, in C
we can at any time specifically gain access
to the address in memory of a variable. There are two operators which
facilitate this, the star or asterisk, and the and or ampersand
operator. Notice that these two things are overloaded. Star is also used for multiplication, and
ampersand is also used for logical ANDing operations. The compiler has to use the context of the
code to determine which you mean, but the compiler is very good at this, and will also tell you if
it doesn't understand or if it thinks you've screwed up.
This facility involves what is commonly known as pointers. If V
is
a variable, let's say an int
, then to get the address of where V
is in
memory we use the unary operator &
and write &V
. This is read the
address of V
; since the unary operator is right to left associative, the compiler will
determine that you mean the address of V. Addresses can be declared in programs, as well.
The corresponding thing, the pointer, is declared in the program and can then be used to obtain an
address as a value. The declaration int *p;
declared p
to be a variable
of the type pointer to
. In a certain sense, the star operator is
the inverse of the ampersand operator in this context.int
Because of the way arrays are stored in memory, the name of an array is treated as the pointer to
the first location in memory of that array. Further, since strings in C
are treated as an
array of characters, the name of a string is treated as the pointer to the first location of the
string. This can be a bit conf using when it is first encountered. The in-class exercise for this
week will give you practice with pointers, addresses, strings, and arrays.
If p
is a pointer, then the direct value of p
is a memory location and
the vlaue of *p
is the value at the memory location which is
stored in p
. Suppose we declare:
double x, y, *p;
…then the two statements:
p = &x; y = *p;
…are equivalent to the statement:
y = *&x;
…which in turn is equivalent to the statement:
y = x;
The following short program should help to illustrate the distinction between a pointer value and its dereferenced value:
int main( int argc, char * argv[] ) { int i = 17; int *p; p = &i; printf( "\n The value of i is %d\n", *p ); printf( "\n The location [address] of i is %d\n\n", p ); }
For the exercise, write a program called
goFish.c
that sums and averages a list of numbers, then creates a string and produces a count. You must use EITHER an array, or a set of pointers as the data storage. The program's specifications are as follows:
-9999is used to indicate that the entries are complete
int
elements7[
seven] characters; this is like in the game
Go Fishwhen you ask another player,
GIMME ALL YOUR SEVENS.
Homework number three is due Friday this week. Links to all the homework assignments are available on the syllabus page, but this reminder is just to make sure you know…
That's probably enough for the this week. Be sure to check out the class links page to see what's in the pipeline for next week.