Unicode X-Ray


About

  1. General
  2. Normalization
  3. Naming

General

When you type text into this tool, it breaks down what you see into two important layers that make up modern digital text:

café
c+a+f+é

The word café is made up of four graphemes: c, a, f, and é.

Graphemes are what we typically think of as individual characters — the visual units that make sense to our eyes. For example, in the word café, we see four graphemes. But some graphemes are more complex than they appear, such as the é which is actually built from multiple pieces.

e+◌́
or
é

The grapheme é can be represented either as two separate code points or as a single code point.

Code points are the underlying building blocks of text that computers use to represent each piece as individual Unicode numbers. While a grapheme is what you see, code points are what the computer sees. That é we mentioned? It could be stored as either a single code point or as two separate ones (an e followed by an accent mark) that combine to create what you see.
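This distinction is easy to see from code. Here is a minimal Python sketch (variable names are my own) showing that the two storage forms of é render identically but contain different code points:

```python
# "é" stored as one precomposed code point (U+00E9)...
composed = "caf\u00e9"
# ...or as "e" (U+0065) followed by a combining acute accent (U+0301).
decomposed = "cafe\u0301"

# Both display as "café", but len() counts code points, not graphemes.
print(len(composed))     # 4 code points
print(len(decomposed))   # 5 code points

# The computer sees two different sequences:
print([hex(ord(c)) for c in composed])    # ['0x63', '0x61', '0x66', '0xe9']
print([hex(ord(c)) for c in decomposed])  # ['0x63', '0x61', '0x66', '0x65', '0x301']
print(composed == decomposed)             # False
```

Note that Python's `len()` reports code points; counting graphemes requires a grapheme-cluster segmentation step that the standard library does not provide.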

Try typing or pasting some text to see how what appears simple on screen can be more complex under the hood. You might be surprised to find that emojis, accented characters, and text from different writing systems often break down in unexpected ways!

Normalization

This is an advanced topic that you can safely ignore! Set the dropdown to "None" to apply no normalization.

Above, I mentioned that the é grapheme could be expressed as either a single code point or as two separate ones, even though both forms are semantically identical. These are examples of normalizations: processes that compose or decompose text into code points in a regular way. This is useful when comparing two texts for semantic equivalence.
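Python's standard `unicodedata` module implements these normalization forms, so the idea can be sketched directly: NFC composes é into a single code point, NFD decomposes it, and normalizing both texts the same way makes a semantic comparison possible.

```python
import unicodedata

composed = "caf\u00e9"    # é as one code point
decomposed = "cafe\u0301" # e + combining acute accent

# NFC (composition) and NFD (decomposition) convert between the forms.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# A naive comparison fails, but comparing normalized forms succeeds.
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```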

You can read more about this in Section 3.11 Normalization Forms of The Unicode Standard.

I provide a normalization control in this tool because you may be interested in the different forms that text can take. Try comparing the code points produced for é under the NFC and NFD forms.

Naming

This tool compiles code point names in a novel way that I thought was most helpful. For each code point, the following data (produced by the Unicode Consortium) are applied, from lowest precedence to highest:

  1. The name field from UnicodeData.txt. If the code point category is Cc (indicating a control character), then the Unicode 1.0 name field is used instead.

    This file is, by far, the largest source for code point names in this process.

  2. The name field from DerivedName.txt.

The names from this file provide more human-readable names for some CJK ideographs and also append Jamo (the alphabetic components of Korean Hangul) to their respective Hangul characters.

  3. kDefinition entries from Unihan_Readings.txt in Unihan.zip. Instead of replacing the name, the value here is appended to the value from the previous steps.

    kDefinition entries are provided for many Chinese, Japanese, and Korean ideographic characters. This can help non-speakers access the meaning of these characters.
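The first two steps above can be illustrated with Python's `unicodedata.name()`, which draws on the same Consortium data (it does not expose Unihan `kDefinition` entries, so step 3 has no standard-library equivalent). Note how a control character has no name field, which is why this tool falls back to the Unicode 1.0 name:

```python
import unicodedata

# Names from UnicodeData.txt:
print(unicodedata.name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE

# CJK ideograph names follow the derived-name pattern:
print(unicodedata.name("\u4e2d"))  # CJK UNIFIED IDEOGRAPH-4E2D

# Control characters like U+0007 have "<control>" in the name field,
# so Python raises ValueError; the Unicode 1.0 name field says BELL.
try:
    unicodedata.name("\x07")
except ValueError:
    print("U+0007 has no modern name (Unicode 1.0 name: BELL)")
```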