Character encoding
Version 6.1.0
This article describes aspects of Unicode and character encoding in general. This information may be helpful when using PDFsharp.
Unicode
Unicode is the industry-standard character encoding in wide use today. It assigns each letter, symbol, or emoji a number called a code point. A code point is a value in the Unicode codespace, which ranges from 0 to 1114111, or in hexadecimal from 0 to 10FFFF.
For example, the Latin capital letter A has the code point U+0041 in the Unicode-specific hexadecimal notation.
The rose emoji 🌹 has the code point U+1F339.
PDFsharp uses Unicode exclusively for text in its programming interface. Other encodings are used internally in a PDF document, but PDFsharp performs all conversions.
UTF-32
To work with a code point in a program, for example in C#, several Unicode Transformation Formats (UTF) are available. The most obvious one is UTF-32, which stores a code point in a 32-bit unsigned integer value.
PDFsharp uses UTF-32 encoding to represent a code point as an Int32 value.
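As a minimal sketch of the UTF-32 view, the following C# snippet converts between an Int32 code point and a .NET string. It uses only the standard .NET methods char.ConvertFromUtf32 and char.ConvertToUtf32. The helper IsValidCodePoint and the variable names are illustrative assumptions and are not taken from PDFsharp.

```csharp
using System;

// Illustrative helper (not part of PDFsharp): checks that a value lies
// inside the Unicode codespace 0..0x10FFFF described above.
bool IsValidCodePoint(int cp) => cp >= 0 && cp <= 0x10FFFF;

// The code point of the rose emoji as a plain Int32 (UTF-32) value.
int rose = 0x1F339;

// Convert the code point to a .NET string and back again.
string text = char.ConvertFromUtf32(rose);
int roundTrip = char.ConvertToUtf32(text, 0);

Console.WriteLine($"U+{roundTrip:X4}");          // U+1F339
Console.WriteLine(IsValidCodePoint(0x41));       // True
Console.WriteLine(IsValidCodePoint(0x110000));   // False
```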
UTF-16
In .NET, a character of a string is a 16-bit unsigned value from 0 to 65535, or FFFF in hexadecimal. This encoding is called UTF-16, and it is the primary character encoding scheme of .NET. Because PDFsharp maps UTF-16 characters to Unicode code points and then further to the glyphs of a particular font face, it helps to understand these mappings in detail.
The Unicode codespace is divided into 17 planes of 65536 entries each, which tile the codespace as 17 consecutive slices. The first one is called the Basic Multilingual Plane (BMP), and it covers the code points from U+0000 to U+FFFF. The majority of the characters used in the world are in this range. Wikipedia shows them all in a very enlightening visual BMP map.
The remaining code points in the range U+10000 to U+10FFFF belong to the 16 supplementary planes. Supplementary code points in this range need two characters for representation in UTF-16.
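The plane layout and its effect on string length can be sketched in a few lines of C#. The helper PlaneOf is a hypothetical name used only for this illustration. It relies on each plane holding exactly 65536 entries, so the plane index is the code point shifted right by 16 bits.

```csharp
using System;

// Hypothetical helper: plane index of a code point.
// Each plane holds 0x10000 (65536) entries.
int PlaneOf(int codePoint) => codePoint >> 16;

string latinA = "A";          // U+0041, in the BMP
string rose = "\U0001F339";   // U+1F339, outside the BMP

Console.WriteLine(PlaneOf(0x0041));    // 0 (BMP)
Console.WriteLine(PlaneOf(0x1F339));   // 1
Console.WriteLine(PlaneOf(0x10FFFF));  // 16 (last plane)

// string.Length counts 16-bit characters, not code points.
Console.WriteLine(latinA.Length);      // 1
Console.WriteLine(rose.Length);        // 2
```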
Surrogate pairs
Supplementary code points map to two 16-bit characters as explained in the following. Every supplementary code point is at least U+10000, so you can subtract 10000 (hexadecimal) from it and obtain a value from 0 to FFFFF (hexadecimal), which needs only 20 bits. This is one bit less than the 21 bits needed to encode all possible code points. The 20 bits are divided into two groups of 10 bits, which become part of the so-called high and low surrogate code points.
The BMP reserves a range of 2048 entries from U+D800 to U+DFFF for these surrogates. Take a look at the visual BMP map mentioned above and you will see eight empty rows starting at row D8 (hexadecimal).
Now, with a little math you can map each supplementary code point (≥ U+10000) to a high and a low surrogate code point. The high surrogate has a range from U+D800 to U+DBFF and the low surrogate has a range from U+DC00 to U+DFFF. Both ranges are 1024 = 2^10 entries long, so each stores 10 bits of information.
For example, the rose emoji 🌹 with code point U+1F339 maps to "\ud83c\udf39" in a C# string.
The high surrogate is U+D83C and the low surrogate is U+DF39.
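The "little math" can be written out as follows. This is a sketch of the standard UTF-16 surrogate arithmetic, not PDFsharp's own code. The helper names ToSurrogatePair and FromSurrogatePair are assumptions made for this example. In practice the built-in char.ConvertFromUtf32 does the same job.

```csharp
using System;

// Hypothetical helper: split a code point in the range U+10000..U+10FFFF
// into a two-character string of high and low surrogate.
string ToSurrogatePair(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    int offset = codePoint - 0x10000;               // 20-bit value
    char high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
    char low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits
    return new string(new[] { high, low });
}

// Hypothetical helper: combine a surrogate pair back into the code point.
int FromSurrogatePair(char high, char low) =>
    0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

string pair = ToSurrogatePair(0x1F339);
int highValue = pair[0];
int lowValue = pair[1];
int codePointAgain = FromSurrogatePair(pair[0], pair[1]);

Console.WriteLine($"{highValue:X4} {lowValue:X4}");   // D83C DF39
Console.WriteLine($"{codePointAgain:X}");             // 1F339

// The framework method produces the same two characters.
bool sameAsFramework = pair == char.ConvertFromUtf32(0x1F339);
Console.WriteLine(sameAsFramework);                   // True
```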
Note that PDF documents and PDFsharp use only big-endian UTF-16. You never need to think about byte order when using PDFsharp.
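Purely to illustrate what big-endian means here, the following sketch prints the bytes of the rose emoji in both byte orders using the standard .NET encodings Encoding.BigEndianUnicode and Encoding.Unicode. You do not need code like this when using PDFsharp.

```csharp
using System;
using System.Text;

string rose = "\ud83c\udf39";

// Encoding.BigEndianUnicode is UTF-16 with the high byte first,
// Encoding.Unicode is UTF-16 with the low byte first.
byte[] bigEndian = Encoding.BigEndianUnicode.GetBytes(rose);
byte[] littleEndian = Encoding.Unicode.GetBytes(rose);

Console.WriteLine(BitConverter.ToString(bigEndian));     // D8-3C-DF-39
Console.WriteLine(BitConverter.ToString(littleEndian));  // 3C-D8-39-DF
```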
Some advantages of UTF-16 over UTF-32 and UTF-8 are:
- A good balance of memory usage. For most commonly used characters 16 bits are enough.
- Unambiguity. Each 16 bit character is either a BMP Unicode code point or a high or a low surrogate. This is important when you randomly access a character in a .NET string.
- Easy transformation between UTF-16 and Unicode code points.
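The second and third points in the list above can be illustrated with standard .NET methods. In this sketch, Classify and CodePoints are hypothetical helper names. CodePoints assumes a well-formed string, because char.ConvertToUtf32 throws an exception on an unpaired surrogate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: classify the 16-bit character at any index.
string Classify(string s, int index)
{
    char c = s[index];
    if (char.IsHighSurrogate(c)) return "high surrogate";
    if (char.IsLowSurrogate(c)) return "low surrogate";
    return "BMP code point";
}

// Hypothetical helper: enumerate the code points of a well-formed string.
List<int> CodePoints(string s)
{
    var result = new List<int>();
    for (int i = 0; i < s.Length; i++)
    {
        result.Add(char.ConvertToUtf32(s, i));
        if (char.IsHighSurrogate(s[i])) i++;   // skip the low surrogate
    }
    return result;
}

string text = "A\ud83c\udf39B";
Console.WriteLine(Classify(text, 0));   // BMP code point
Console.WriteLine(Classify(text, 1));   // high surrogate
Console.WriteLine(Classify(text, 2));   // low surrogate

List<int> codePoints = CodePoints(text);
Console.WriteLine(string.Join(" ", codePoints.ConvertAll(cp => $"U+{cp:X4}")));
// U+0041 U+1F339 U+0042
```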
Symbol fonts
Fonts like Symbol, Wingdings, and others are an exception.
These fonts are called symbol fonts and have their own font specific encoding.
For example, the code for club ♣ in the Windows Symbol font is A7 (hexadecimal), and this value is specific to this font.
The preceding sentence uses the Unicode code point U+2663 for club to make the symbol appear in this text.
When you draw text using the Symbol font, PDFsharp has no way to translate U+2663 into A7.
Therefore you must provide the code directly in the C# character like this: \u00A7.
PDFsharp knows whether a particular font is a symbol font. For a symbol font it accepts code points in the range from U+0000 to U+00FF and from U+FF00 to U+FFFF without modification.
Both ranges map to the same glyph, i.e. club can be encoded as \u00A7 or \uFFA7 for the Windows Symbol font.
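The relationship between the two ranges can be sketched as follows. SymbolCode is a hypothetical helper written for this illustration. It is not a PDFsharp API and does not claim to show how PDFsharp implements the mapping. It only demonstrates that \u00A7 and \uFFA7 share the same low byte, A7.

```csharp
using System;

// Hypothetical helper, not part of PDFsharp: reduces a character from one
// of the two symbol-font ranges to its 8-bit font-specific code.
int SymbolCode(char c)
{
    bool inLowRange = c <= '\u00FF';    // U+0000..U+00FF
    bool inHighRange = c >= '\uFF00';   // U+FF00..U+FFFF
    if (!inLowRange && !inHighRange)
        throw new ArgumentOutOfRangeException(nameof(c));
    return c & 0xFF;
}

int fromLow = SymbolCode('\u00A7');
int fromHigh = SymbolCode('\uFFA7');

Console.WriteLine($"{fromLow:X2}");    // A7
Console.WriteLine($"{fromHigh:X2}");   // A7
```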
UTF-8
UTF-8 encodes code points unambiguously to one, two, three, or four bytes. This saves even more space than UTF-16 for Latin character-based languages like English or German, but makes encoding/decoding a little bit more complicated.
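To make the one-to-four-byte rule concrete, this sketch compares the UTF-8 and UTF-16 sizes of four sample characters using the standard Encoding.UTF8 class. The sample characters are chosen for illustration only.

```csharp
using System;
using System.Text;

// One sample per UTF-8 length, written with escapes to avoid
// depending on the encoding of the source file.
string[] samples =
{
    "A",             // U+0041, Latin capital letter A
    "\u00E4",        // U+00E4, a with diaeresis
    "\u20AC",        // U+20AC, euro sign
    "\ud83c\udf39"   // U+1F339, rose
};

foreach (string s in samples)
{
    int utf8Bytes = Encoding.UTF8.GetByteCount(s);
    int utf16Bytes = 2 * s.Length;
    Console.WriteLine($"{utf8Bytes} byte(s) in UTF-8, {utf16Bytes} in UTF-16");
}
// 1 byte(s) in UTF-8, 2 in UTF-16
// 2 byte(s) in UTF-8, 2 in UTF-16
// 3 byte(s) in UTF-8, 2 in UTF-16
// 4 byte(s) in UTF-8, 4 in UTF-16
```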
UTF-8 is only used internally in PDF documents. When you use PDFsharp to write text to a PDF file or retrieve text, it is, as stated above, always UTF-16 encoded.
ANSI encoding
ANSI has a limited set of Latin characters and is only used internally in PDF documents. When you use PDFsharp to write text to a PDF file or retrieve text, it is always UTF-16 encoded. Drawing text on a PDF page uses either ANSI encoding or glyph encoding, depending on the characters of the text. More about this in Character to glyph mapping.
See also Unicode (Wikipedia) · UTF-8 (Wikipedia) · UTF-16 (Wikipedia) · UTF-32 (Wikipedia) · Character encoding in .NET