Some important things to know about text encodings


Unicode

In order to represent textual characters in a file, some sort of mapping is used to assign numeric values to the characters. The mapping varies depending on the character set, which depends on factors like the language being used. Larger character sets, such as the Japanese Kanji set, use more bytes to represent each of their members. Interpretive problems may occur if a computer attempts to read data encoded with a mapping different from what it expects. So to handle text correctly, some method of identifying the various mappings and converting between them is necessary.

Most character sets and character encoding schemes developed in the past are limited in their coverage, usually supporting just one language or a small set of languages. Multilingual software has traditionally had to implement methods for supporting and identifying multiple character encodings. A simpler solution is to combine the characters for all commonly used languages and symbols into a single universal coded character set. Unicode is such a universal coded character set, and offers the simplest solution to the problem of text representation in multilingual systems. Because Unicode includes the character repertoires of most common character encodings, data can be encoded in a single coded character set.


UTF-8, UTF-16, UTF-32 and BOM

There are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. UTF-8 is the most common on the web. UTF-16 is used by Java and Windows. UTF-8 and UTF-32 are used by Linux and various Unix systems like Mac OS X. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats. Whether Mergemill Pro runs on Windows or Mac OS X, it always uses UTF-8 in its documents for data storage and in the cache files it creates during processing.
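
As a rough illustration that these conversions are lossless, here is a minimal Python sketch (not part of Mergemill Pro, shown only for explanation) that round-trips the same text through the three encoding forms:

    # Round-trip a string through UTF-8, UTF-16 and UTF-32.
    # Any Unicode text survives each conversion unchanged.
    text = "Grüße, 世界!"

    for encoding in ("utf-8", "utf-16", "utf-32"):
        encoded = text.encode(encoding)      # str -> bytes in that UTF
        decoded = encoded.decode(encoding)   # bytes -> str again
        print(encoding, len(encoded), "bytes, lossless:", decoded == text)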

UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first), and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark (BOM) at the beginning of the file to indicate the actual byte serialization used. The BOM is invisible to users.
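
To make the byte-order flavors concrete, this Python sketch (an illustration only) prints the raw bytes of the letter "A" (U+0041) under the three UTF-16 sub-flavors. Note that Python's unmarked "utf-16" codec writes a BOM and uses the platform's native byte order, which is little-endian on most desktop machines:

    char = "A"   # U+0041

    print(char.encode("utf-16-be").hex(" "))   # 00 41  (big-endian, no BOM)
    print(char.encode("utf-16-le").hex(" "))   # 41 00  (little-endian, no BOM)
    print(char.encode("utf-16").hex(" "))      # ff fe 41 00 on little-endian platforms:
                                               # the BOM comes first, then the code unit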

UTF-8 has some advantages. An important one is that it preserves ASCII, though not Latin-1, because Latin-1 characters above code point 127 are encoded differently in UTF-8. UTF-8 uses byte values in the ASCII range only for ASCII characters, so it works well in any environment where ASCII characters are significant as syntax characters, such as markup languages. Another advantage of UTF-8 is that a BOM is unnecessary in most common applications. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem of the kind that affects encodings with 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it has nothing to do with byte order; it serves only as an encoding signature to distinguish UTF-8 from other encodings.
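
The following Python sketch illustrates both points: ASCII characters keep their single-byte values in UTF-8, and the optional UTF-8 signature is always the same three bytes (EF BB BF), carrying no byte-order information. The tag text is taken from the example later on this page:

    tag = "<?[ConvertEncoding]?>"             # markup built from ASCII syntax characters

    # In UTF-8, every ASCII character is stored as its own single ASCII byte.
    print(tag.encode("utf-8") == tag.encode("ascii"))   # True

    # The UTF-8 "BOM" is only an encoding signature, not a byte-order mark.
    print("".encode("utf-8-sig").hex(" "))    # ef bb bf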

To learn more, please visit the relevant FAQ page at unicode.org.


Text Encoding and Mergemill Pro

Mergemill Pro can work with templates in any supported text encoding that is correctly specified in their associated job settings, because templates are first converted internally to UTF-8. When they are then parsed, the markup tags are all in ASCII, which Mergemill Pro interprets correctly. There are still two things you need to ensure, however. One is to create and edit your templates with a text editor that does not insert invisible characters that break the Mergemill tag syntax. The other is to use plain English (ASCII) symbols in Mergemill tags: "<?" (ASCII) and "＜？" (full-width Chinese) are not the same. The first one is understood by Mergemill Pro, but not the second.
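
If you are unsure whether a tag was typed with ASCII or full-width symbols, comparing code points makes the difference obvious. Here is a minimal Python check (the full-width characters below are only an example of what an East Asian input method may insert):

    ascii_tag = "<?"         # U+003C U+003F
    fullwidth_tag = "＜？"    # full-width forms: U+FF1C U+FF1F

    for label, s in (("ASCII", ascii_tag), ("full-width", fullwidth_tag)):
        print(label, [f"U+{ord(c):04X}" for c in s])

    print(ascii_tag == fullwidth_tag)   # False: only the ASCII form is valid tag syntax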

Source documents that provide the data need more caution, and the BOM is a common cause of problems. If the first value in your CSV document is enclosed in double-quotes, a BOM is sure to break the CSV format. Because the opening double-quote is now treated as sitting inside, and being part of, the data value, the CSV format requires it to be escaped by doubling it. The now-unmatched double-quote at the end of the first data value has the same issue. Even if you fixed both, you would still need to enclose the entire first data value in double-quotes, which you cannot do in this case.

Problems occur even if the first data value in a CSV document contains no commas, line breaks or double-quotes. You do not need to enclose the data value in double-quotes in that case, but the BOM is included as part of the first data value and is carried to wherever that value is used. If the CSV document is specified as containing a header row of column labels, the first column name will not be matched because of the presence of the BOM. The same problems occur with tab-delimited documents.
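
Here is a small Python sketch of the header-row problem (the column names are made up for illustration). Reading a BOM-prefixed CSV as plain UTF-8 leaves the BOM glued to the first column label, while Python's "utf-8-sig" codec strips it:

    import csv
    import io

    # A UTF-8 CSV that starts with a BOM (EF BB BF), as some editors produce.
    raw = "\ufeff".encode("utf-8") + b"name,price\nwidget,9.99\n"

    # Decoded as plain UTF-8, the BOM becomes part of the first column label.
    rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
    print(list(rows[0].keys()))     # ['\ufeffname', 'price'] -- 'name' no longer matches

    # Decoded with the BOM-aware codec, the first column label is clean.
    rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
    print(list(rows[0].keys()))     # ['name', 'price']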

If you need to work with data documents in an encoding with BOM, you have three options:

  1. Convert the encoding before reading the data to generate the output (a script sketch of an equivalent conversion, done outside Mergemill Pro, follows this list).
    • Create a template with just one field tag, say <?[ConvertEncoding]?>.
    • Set up a simple job associated with the template to read in the source documents as plain text in their original encoding.
    • Output the documents in an appropriate encoding without BOM.
    • You should use the same source filename and extension and save the documents to a different folder.
    • Use the converted documents as your data source in generating the output.
  2. Use the XML format.
  3. Use Mergemill Data Markers.
    • Specify the source documents in Mergemill Pro as plain text.
    • Enclose the data values in Mergemill Data Markers, like [data_field_name]...[/].
    • Extract the data values using Mergemill Pro's fetch filter.
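
For option 1, if you would rather do the conversion outside Mergemill Pro, the same result (re-encoding the source documents as UTF-8 without a BOM, keeping the filenames and saving to a different folder) can be sketched in a few lines of Python. The folder names, file extension and source encoding below are assumptions for illustration only:

    from pathlib import Path

    SOURCE_DIR = Path("source_docs")       # assumed input folder
    OUTPUT_DIR = Path("converted_docs")    # assumed output folder, different from the source
    SOURCE_ENCODING = "utf-16"             # assumed original encoding of the documents

    OUTPUT_DIR.mkdir(exist_ok=True)

    for src in SOURCE_DIR.glob("*.csv"):   # keep the same filename and extension
        text = src.read_text(encoding=SOURCE_ENCODING)
        text = text.lstrip("\ufeff")       # drop any BOM that survived decoding
        (OUTPUT_DIR / src.name).write_text(text, encoding="utf-8")   # UTF-8, no BOM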

Mergemill Pro accepts a plain list of URLs in a text file as a data input source, and treats it like a folder of files. Each web page in the list provides a stream of data in the feed. You specify a text encoding for the URL list, which Mergemill Pro assumes to be the same for all the included web pages. If the web pages are in different text encodings, you need to put them on separate URL lists.




