Character encodings

What are character encodings

Modern computers are binary machines and store all information using bits. For numbers, this would be easy as we can simply store the binary value of a number: a digital ‘10’ can be stored as a binary ‘1010’ in memory (RAM or disk). Graphics can be quantified as numerical color maps (such as RGB map), and thus do not require new ‘inventions’ to be stored. However, letters, characters and other repeating symbols such as emojis do need to be first ‘converted’ to a number in an arbitrary way before being able to be stored. This is where character encodings come from. Since this process is arbitrary, many encodings exist, but a few of them became popular and formed a standard.

Terminology

First, let’s get more concrete with the terminology of character encodings. We have the following definitions:

grapheme: a single unit of a human writing system, which can be a letter such as ‘d’, a character such as ‘空’, an emoji such as ‘’, etc.

code point: code point is a unique position in a quantized character mapping scheme. For example, in ‘ASCII’, the code point associated with letter ‘d’ is the decimal value ‘100’. In more complex encoding schemes such as ‘UTF-8’, certain graphemes such as ‘é’ can be mapped in two ways: it has its own code point ‘233’, or it can be represented as the code point of ‘e’ (101) combined with the code point of acute accent modifier (769).

encoding: encodings are the binary representations of the graphemes. Through an encoding map, each grapheme is linked to a corresponding code point, which can then be translated into an encoding. However, the size of the encoding is an arbitrary choice. For example, in ‘UTF-32’, every grapheme will be mapped into 4 bytes, even the simpler ones such as the letter ‘d’ (d -> 100 -> 00000000 00000000 00000000 01100100), yet in ‘UTF-32’, the simpler letters such as ‘d’ only take up 1 byte (d -> 100 -> 01100100). Different choices of encoding will thus affect memory consumption.

All the above form the basic blocks of all encoding standards.

ASCII

ASCII (American Standard Code for Information Exchange) is an early and influential character encoding standard. It has only 128 code points, and only 95 out of the 128 are printable characters. ASCII is therefore limited in scope, however due to its simplicity, ASCII has the benefit that all graphemes has a unique 1 byte encoding. Thus, the lens of the encoding is the same as the lens of graphemes. Here is the full ASCII table:

Code point (Dec)	Hex	Oct	Char
0	00	000	NUL (null)
1	01	001	SOH (start of heading)
2	02	002	STX (start of text)
3	03	003	ETX (end of text)
4	04	004	EOT (end of transmission)
5	05	005	ENQ (enquiry)
6	06	006	ACK (acknowledge)
7	07	007	BEL (bell)
8	08	010	BS (backspace)
9	09	011	TAB (horizontal tab)
10	0A	012	LF (line feed)
11	0B	013	VT (vertical tab)
12	0C	014	FF (form feed)
13	0D	015	CR (carriage return)
14	0E	016	SO (shift out)
15	0F	017	SI (shift in)
16	10	020	DLE (data link escape)
17	11	021	DC1 (device control 1)
18	12	022	DC2 (device control 2)
19	13	023	DC3 (device control 3)
20	14	024	DC4 (device control 4)
21	15	025	NAK (negative acknowledgment)
22	16	026	SYN (synchronous idle)
23	17	027	ETB (end of transmission block)
24	18	030	CAN (cancel)
25	19	031	EM (end of medium)
26	1A	032	SUB (substitute)
27	1B	033	ESC (escape)
28	1C	034	FS (file separator)
29	1D	035	GS (group separator)
30	1E	036	RS (record separator)
31	1F	037	US (unit separator)
32	20	040	(space)
33	21	041	!
34	22	042	”
35	23	043	#
36	24	044	$
37	25	045	%
38	26	046	&
39	27	047	’
40	28	050	(
41	29	051	)
42	2A	052	*
43	2B	053	+
44	2C	054	,
45	2D	055	-
46	2E	056	.
47	2F	057	/
48	30	060	0
49	31	061	1
50	32	062	2
51	33	063	3
52	34	064	4
53	35	065	5
54	36	066	6
55	37	067	7
56	38	070	8
57	39	071	9
58	3A	072	:
59	3B	073	;
60	3C	074	<
61	3D	075	=
62	3E	076	>
63	3F	077	?
64	40	100	@
65	41	101	A
66	42	102	B
67	43	103	C
68	44	104	D
69	45	105	E
70	46	106	F
71	47	107	G
72	48	110	H
73	49	111	I
74	4A	112	J
75	4B	113	K
76	4C	114	L
77	4D	115	M
78	4E	116	N
79	4F	117	O
80	50	120	P
81	51	121	Q
82	52	122	R
83	53	123	S
84	54	124	T
85	55	125	U
86	56	126	V
87	57	127	W
88	58	130	X
89	59	131	Y
90	5A	132	Z
91	5B	133	[
92	5C	134	\
93	5D	135	]
94	5E	136	^
95	5F	137	_
96	60	140	`
97	61	141	a
98	62	142	b
99	63	143	c
100	64	144	d
101	65	145	e
102	66	146	f
103	67	147	g
104	68	150	h
105	69	151	i
106	6A	152	j
107	6B	153	k
108	6C	154	l
109	6D	155	m
110	6E	156	n
111	6F	157	o
112	70	160	p
113	71	161	q
114	72	162	r
115	73	163	s
116	74	164	t
117	75	165	u
118	76	166	v
119	77	167	w
120	78	170	x
121	79	171	y
122	7A	172	z
123	7B	173	{
124	7C	174	\|
125	7D	175	}
126	7E	176	~
127	7F	177	DEL (delete)

Unicode

As ASCII is too limited in scope and does not include languages such as Chinese, Arabic, emojis, etc. A standard known as Unicode is formed to consistently handle text across most of the world’s writing systems. Unicode is synchronized with international standard setting bodies such as ISO and is labeled ISO/IEC 10646. Due to ASCII’s influence on early computing, Unicode is backward compatible with ASCII, maintaining the same code points and characters in its mapping scheme. However, different encoding schemes for Unicode exist, the most popular one being ‘UTF-8’. Others include ‘UTF-16’, ‘GB 18030’ (Chinese), etc.

Since unicode comprises of text all across the world, it is organized in blocks, where each block is a continuous range of similar character codes. For example, the block Basic Latin comprises the first 128 character codes of unicode, and is nothing but the original ASCII encodings. The block Cyrillic comprises of the 1024th to 1280th character codes, and incorporate the basics of the Cyrillic languages. There are 327 blocks in total.

UTF-8

UTF-8 is the most popular encoding scheme of Unicode, unlike UTF-32 for example, UTF-8 has varying encoding sizes for different lexemes. The more common letters such as those in ASCII only require 1 byte of memory, while the more complex graphemes such as ‘’ may take up to 4 bytes of memory.

In order to determine the number of bytes needed during decoding, UTF-8 encoding’s first byte has special meaning, below are the 4 forms of the first byte:

0xxxxxx -> 1 byte encoding 
110xxxxx -> 2 bytes encoding
1110xxxx -> 3 bytes encoding
11110xxx -> 4 bytes encoding

bytes 2 to 4 only start with 10xxxxxx, thus it is possible to determine when a new grapheme has started. For ASCII (or Basic Latin) code points, UTF-8 and ASCII result in the same encodings.

Mojibake

Mojibake is the garbled and unintelligible text that is the result of text being decoded using an unintended character decoding. This display may include the generic replacement character (“�”) in places where the binary representation is considered invalid. Opening a binary file with a text editor for example will result in mojibake, as the bytes in memory are not intended to be decoded as text at all.

Mojibake can also happen when decoding text using a different character encoding scheme than what was used in encoding the text. This is especially likely to happen when the encoding is done using multi-bytes schemes such as UTF-8, but decoded using 1 byte schemes such as ASCII. For example, when we try to decode , which has a four bytes hexadecimal encoding of 0xF0 0x9F 0x99 0x88, but instead uses a 1 byte decoding scheme, we will decode 0xF0 into the mojibake ð, 0x9F, 0x99, 0x88 into the mojibakes ��. Modern browsers do a great job at guessing the encodings used by the websites and thus decode accordingly. However, manually determining a webpage’s decoding such as when web scraping can be a troublesome process.

Other encoding schemes

Some other encoding schemes occasionally pop up. One is called ISO-8859-1 which is a 1 byte encoding scheme for extended Latin alphabets. It precedes Unicode and heavily influences it, thus is compatible with Unicode for much of its code points. Windows-1252 is a similar encoding for Latin alphabets proposed by Microsoft and used by default in Microsoft Windows. It is similar to ISO-8859-1 for most of its code points, and at the time of this writing is the most-used single byte encoding scheme in the world. Browsers treat ISO-8859-1 and ASCII as Windows-1252.