Character to glyph mapping
Version 6.2.0
This article describes how PDFsharp maps the characters of a text to the glyphs of a particular font.
Introduction
When you draw the text "Hello, World!" into a PDF document, its appearance depends on the font face you use.
Different fonts provide different glyphs for each character or code point.
If you change the style, e.g. from regular to italic, another font face of the same font family is used and the text appears italic.
Technically, you can think of a glyph as a little program that draws the specific outline of a character when executed. Over the decades, several font technologies were invented. PDFsharp uses only OpenType fonts. OpenType provides a unified file format for PostScript fonts, originally developed by Adobe, and TrueType fonts, originally developed by Apple and later adopted by Microsoft.
The two kinds of fonts represent glyphs slightly differently. PostScript uses cubic Bézier curves to describe glyphs, while TrueType uses quadratic Bézier curves. Because glyphs are not just curves, but rather programs that also compute the so-called hinting, there is no lossless conversion from one type to the other.
Which type of font you use with PDFsharp makes no difference when you create a PDF document. With the GDI build, however, you cannot use OpenType fonts with PostScript outlines to draw text using GDI+, because GDI+ does not implement them. The WPF build fully supports such PostScript-based fonts.
ANSI and glyph encoding
Drawing the short text above writes the following graphics command to the PDF file.
(Hello, World!) Tj
Acrobat understands the text in parentheses as ANSI encoded text and draws it at the current position using the currently selected font.
So far, so simple. Each character maps directly to a glyph in the font face. But this easy mapping works only with ANSI characters. Here are some points that must be taken into consideration when characters are mapped to glyphs.
- Ligatures: A font may define combinations of characters that have their own glyph. Such a ligature does not always have a code point, because it may be an invention of the font designer. A viewer app also cannot know whether you want to use ligatures or not.
- Direction: A language may flow from left to right, right to left, or top to bottom.
- Context of character: In Arabic, for example, each character has five different glyphs: one general glyph and one for each position, that is, alone, at the beginning, in the middle, or at the end of a word. In this case there are five different code points for each character.
There are many more options. It is almost impossible for a PDF viewer like Acrobat to map a sequence of characters to the glyphs someone expects to see. And therefore it doesn’t even try.
Instead, the PDF file contains not the character but the index of the glyph in the font. Each font can contain up to 65536 glyphs, and glyph 0 is always the unknown glyph. The font file contains so-called character maps (cmaps) that map characters to glyph indices. PDFsharp uses cmap format 4 for BMP code points and cmap format 12 for code points outside the BMP, which UTF-16 encodes as surrogate pairs.
Here is the "Hello, World!" text glyph encoded:
<002B0048004F004F0052000F0003003A00520055004F00470004> Tj
The angle brackets contain the glyph indices, each written as a 16-bit big-endian value in hexadecimal.
For example, 002B hexadecimal (43 decimal) is the glyph index of the uppercase letter H in the font Arial.
Note that whether text is interpreted as ANSI or glyph encoded is determined by the type of the currently selected font object, which is not explained here.
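To make the layout of a glyph-encoded string concrete, the following minimal sketch splits the hex content shown above into 16-bit glyph indices. It uses only the .NET base class library and C# top-level statements; the helper name DecodeGlyphIndices is made up for this illustration and is not part of PDFsharp.

```csharp
using System;

// Hypothetical helper, not PDFsharp API: splits the hex content of a
// glyph-encoded PDF string (without the angle brackets) into glyph indices.
// Each index is written as four hex digits, most significant byte first.
static ushort[] DecodeGlyphIndices(string hex)
{
    if (hex.Length % 4 != 0)
        throw new ArgumentException("Expected a multiple of four hex digits.", nameof(hex));

    var result = new ushort[hex.Length / 4];
    for (var i = 0; i < result.Length; i++)
        result[i] = Convert.ToUInt16(hex.Substring(i * 4, 4), 16);
    return result;
}

var indices = DecodeGlyphIndices("002B0048004F004F0052000F0003003A00520055004F00470004");
Console.WriteLine(indices.Length); // 13, one glyph per character of "Hello, World!"
Console.WriteLine(indices[0]);     // 43, the glyph index of 'H' in Arial
```

Without the font's cmap, the indices cannot be turned back into characters, which is why glyph-encoded text is tied to one specific font.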
Drawing text
This section describes in detail what happens when PDFsharp executes the following lines.
var font = new XFont("Arial", 10, XFontStyleEx.Regular);
…
gfx.DrawString("Hello, World!", font, XBrushes.Black, x, y);
PDFsharp converts the UTF-16 .NET string into an array of pairs of Unicode code points and their corresponding glyph indices for the font Arial. The result is an array of CodePointGlyphIndexPair objects. If the user registers a RenderTextEvent at the PDF document object, the event is called with the code point glyph index array. In the event handler the user can inspect and modify the array. For example, if a code point has a glyph index with the value 0, that indicates the font has no glyph for this code point. The user can modify the code point or the glyph index to fix this.
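The kind of fix-up such an event handler might perform can be sketched as follows. This is an illustration only: the tuple array stands in for PDFsharp's CodePointGlyphIndexPair array, the helper ReplaceMissingGlyphs is hypothetical, and the glyph index 34 for the question mark is an illustrative value, not one read from a real font. How the handler is actually registered is not shown here.

```csharp
using System;

// Hypothetical helper, not PDFsharp API: replaces every entry whose glyph
// index is 0 (no glyph in the font) by a fallback code point and glyph index.
// Returns the number of replaced entries.
static int ReplaceMissingGlyphs((int CodePoint, ushort GlyphIndex)[] items,
    int fallbackCodePoint, ushort fallbackGlyphIndex)
{
    var replaced = 0;
    for (var i = 0; i < items.Length; i++)
    {
        if (items[i].GlyphIndex != 0)
            continue; // the font has a glyph, nothing to do
        items[i] = (fallbackCodePoint, fallbackGlyphIndex);
        replaced++;
    }
    return replaced;
}

// Stand-in for the array passed to the event handler (assumed shape).
var pairs = new (int CodePoint, ushort GlyphIndex)[]
{
    (0x48, 43),  // 'H' with its glyph index in Arial
    (0x2603, 0), // a code point for which the font has no glyph
};

// Fall back to '?' (U+003F); 34 is an illustrative glyph index.
var replacedCount = ReplaceMissingGlyphs(pairs, 0x3F, 34);
Console.WriteLine(replacedCount); // 1
```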
In the next step, PDFsharp analyses the (possibly modified) code points and checks whether each of them has a valid ANSI counterpart. If so, the ANSI encoding is used. Otherwise, glyph encoding is used.
To be very clear here, the code points are always Unicode values. Assume the text contains a Euro character € with the Unicode code point U+20AC.
Because the Euro code point has a corresponding ANSI character with the value 80 hexadecimal (128 decimal), it is treated as an ANSI value.
If all Unicode values of the text have ANSI counterparts, the text is treated as ANSI.
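The decision can be sketched as follows, assuming that "ANSI" means the Windows-1252 code page (as the name PdfFontEncoding.WinAnsi suggests). The helpers ToAnsi and IsAnsiText are made up for this illustration and are not PDFsharp's implementation. The sketch also assumes that ASCII control characters map to themselves and that the C1 control code points U+0080 to U+009F have no ANSI counterpart.

```csharp
using System;
using System.Collections.Generic;

// Windows-1252 bytes 0x80..0x9F and the Unicode code points they represent.
// 0 marks the five bytes that Windows-1252 leaves undefined.
int[] cp1252High =
{
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Hypothetical helper: returns the ANSI byte value for a Unicode code point,
// or -1 if the code point has no ANSI counterpart.
int ToAnsi(int codePoint)
{
    // ASCII and the range 0xA0..0xFF have the same value in Windows-1252.
    if ((codePoint >= 0 && codePoint < 0x80) || (codePoint >= 0xA0 && codePoint <= 0xFF))
        return codePoint;

    var index = Array.IndexOf(cp1252High, codePoint);
    return index >= 0 ? 0x80 + index : -1;
}

// Hypothetical helper: true if every code point has an ANSI counterpart.
bool IsAnsiText(IEnumerable<int> codePoints)
{
    foreach (var codePoint in codePoints)
        if (ToAnsi(codePoint) < 0)
            return false;
    return true;
}

Console.WriteLine(ToAnsi(0x20AC));                      // 128 (0x80), the Euro sign
Console.WriteLine(IsAnsiText(new[] { 0x48, 0x20AC }));  // True
Console.WriteLine(IsAnsiText(new[] { 0x48, 0x1F339 })); // False
```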
In most applications this default behavior is appropriate. However, you can turn off the automatic differentiation by specifying the encoding option explicitly when you create the font.
var fontWithAnsiEncoding = new XFont("Arial", 10, style,
new XPdfFontOptions(PdfFontEncoding.WinAnsi));
var fontWithGlyphEncoding = new XFont("Arial", 10, style,
new XPdfFontOptions(PdfFontEncoding.Unicode));
If you force ANSI encoding and your text contains Unicode characters with no ANSI counterparts, they are replaced with blanks. Note that PDFsharp versions prior to 6.1 always used Unicode encoding (glyph index encoding).
Also note that the two XFont objects in the sample above are two sides of the same coin. Only one Arial font face is created internally and embedded. The embedded font subset is computed from the union of all characters / glyph indices used with either ANSI or glyph encoding. PDFsharp prior to version 6.1 embedded two different subsets, one for each encoding.
Drawing surrogate pairs
When you draw UTF-16 text containing surrogate pairs, PDFsharp decodes them to code points. The code points are mapped to the correct glyphs if the used font face contains glyphs for them.
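The decoding step can be illustrated with the .NET base class library alone. The following sketch is not PDFsharp's implementation; the helper name ToCodePoints is made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: walks a UTF-16 string and returns one Unicode code
// point per character, combining each surrogate pair into a single value.
static List<int> ToCodePoints(string text)
{
    var result = new List<int>();
    for (var i = 0; i < text.Length; i++)
    {
        // ConvertToUtf32 reads the pair at i and i + 1 if text[i] is a high
        // surrogate, and throws an ArgumentException on a lone surrogate.
        result.Add(char.ConvertToUtf32(text, i));
        if (char.IsHighSurrogate(text[i]))
            i++; // skip the low surrogate that was just consumed
    }
    return result;
}

foreach (var cp in ToCodePoints("\ud83c\udf39 \ud83d\ude0d"))
    Console.WriteLine($"U+{cp:X4}"); // U+1F339, U+0020, U+1F60D
```

The five UTF-16 code units of the sample string yield three code points, and each code point is then looked up in the font separately.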
Emojis
PDFsharp can draw emojis like the red rose 🌹 (U+1F339) or the smiling face with heart-eyes 😍 (U+1F60D).
Use an appropriate font like Segoe UI Emoji and draw them either directly in a string literal or as surrogate pairs in \uxxxx notation.
gfx.DrawString("🌹 😍", fontEmoji, XBrushes.Black, 50, 100);
gfx.DrawString("\ud83c\udf39 \ud83d\ude0d", fontEmoji, XBrushes.Black, 50, 100);
It works as designed, but the results may look disappointing. You will not get what you see here in the browser, but just two boring monochrome emoji characters. It’s not a matter of font subsetting: embedding the whole Segoe UI Emoji font makes no difference. The reason is simply that PDF, even version 2.0, lacks a specification for displaying colored characters. Therefore, emojis are rendered monochrome like any other character.
Applications like Microsoft Word nevertheless create PDF files with colorful emojis. Word extracts the required information from the font’s color tables and converts them to vector drawings.
As of PDFsharp 6.2.0 Preview 1, there is a new option and colored emojis are possible. Please read on.
Colored glyphs
As written above, there is no standard solution for colored glyphs in the PDF specification. However, thanks to GitHub user packdat, PDFsharp 6.2 adds a way to draw colored glyphs like emojis.
Create a font with the new PdfFontOption PdfFontColoredGlyphs.Version0.
var options = new XPdfFontOptions(PdfFontColoredGlyphs.Version0);
var font = new XFont("Segoe UI Emoji", 12, XFontStyleEx.Regular, options);
gfx.DrawString("Colored 😍🎈🍕🚲🤑💪💕", font, XBrushes.Black, new XRect(0, 0, width, height), XStringFormats.Center);
gfx.DrawString("glyphs \ud83d\udca9\ud83d\udc1b\ud83e\udd84\u2615\ud83d\ude82\ud83d\udef8\u2714", font, XBrushes.Black, new XRect(0, 20, width, height), XStringFormats.Center);
This new option uses color information from the OpenType tables COLR and CPAL if available.
If no color information is available for a particular glyph, the option has no effect.
The trick is to replace the original monochrome glyph by a sequence of glyphs, one for each color layer.
The implementation uses version 0 of the color table (COLR), i.e. only solid colors are used.
A font may contain a version 1 color table that adds gradients to solid colors, but PDFsharp does not implement it.
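The layer replacement can be sketched with a small model of already-parsed COLR version 0 and CPAL data. Everything in this sketch is hypothetical: the glyph indices, the palette colors, and the helper ExpandGlyph are invented for illustration, and it does not show how PDFsharp reads the tables or writes the PDF content. It relies on two properties of COLR version 0: layers are listed from bottom to top, and the palette index 0xFFFF stands for the current text color.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical CPAL palette with two solid colors.
string[] palette = { "#D0021B", "#2E8B57" };

// Hypothetical COLR version 0 data: base glyph index -> layers, bottom to top.
// Each layer names a glyph outline and an index into the palette.
var layersByBaseGlyph = new Dictionary<ushort, (ushort GlyphIndex, ushort PaletteIndex)[]>
{
    [100] = new (ushort GlyphIndex, ushort PaletteIndex)[]
    {
        (501, 0),      // first layer, palette color 0
        (502, 1),      // second layer, palette color 1
        (503, 0xFFFF), // 0xFFFF means: use the current text color
    },
};

// Hypothetical helper: replaces one glyph by its colored layers.
// A glyph without color information is returned unchanged in the text color.
List<(ushort GlyphIndex, string Color)> ExpandGlyph(ushort glyphIndex, string textColor)
{
    var result = new List<(ushort GlyphIndex, string Color)>();
    if (!layersByBaseGlyph.TryGetValue(glyphIndex, out var layers))
    {
        result.Add((glyphIndex, textColor));
        return result;
    }

    foreach (var layer in layers)
    {
        var color = layer.PaletteIndex == 0xFFFF ? textColor : palette[layer.PaletteIndex];
        result.Add((layer.GlyphIndex, color));
    }
    return result;
}

foreach (var (glyph, color) in ExpandGlyph(100, "#000000"))
    Console.WriteLine($"{glyph} {color}"); // 501 #D0021B, 502 #2E8B57, 503 #000000
```

A glyph without an entry in the layer data, such as `ExpandGlyph(43, "#000000")`, comes back as a single monochrome glyph, which matches the statement above that the option has no effect in that case.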
One drawback of this solution is that it is currently not possible to copy and paste the glyphs, e.g. into Microsoft Word. Furthermore, this technique may not be compatible with PDF/A and PDF/UA. But if you don’t care about these details, emojis look great in PDF documents.