Page "Character encodings in HTML" ¶ 15
from Wikipedia

Some Related Sentences

UTF-16 and UTF-32
For UTF-8, the BOM is optional, while it is mandatory for the UTF-16 and UTF-32 encodings.
(Note: UTF-16 and UTF-32 without the BOM are formally known under different names; they are different encodings and thus need some form of encoding declaration. See UTF-16BE, UTF-16LE, UTF-32LE and UTF-32BE.)
Processing applications need only look for an initial 0x0000FEFF, 0xFEFF or 0xEFBBBF in the byte stream to identify the document as UTF-32, UTF-16 or UTF-8 encoded respectively.
If the document lacks a byte-order mark, the fact that the first non-blank printable character in an HTML document is supposed to be "<" (U+003C) can be used to determine a UTF-8 / UTF-16 / UTF-32 encoding.
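The two detection steps described above (look for a BOM first, then fall back on a leading "<") can be sketched roughly as follows. This is a minimal illustration, not any browser's actual algorithm: the function name `sniff_encoding` is hypothetical, the little-endian signatures go beyond the big-endian byte values quoted above, and the fallback assumes "<" is the very first character, with no leading whitespace.

```python
import codecs

# BOM signatures. UTF-32 is tested before UTF-16 because the UTF-32LE BOM
# (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Byte patterns of a leading "<" (U+003C) in each encoding form,
# again with the 4-byte patterns tested first.
_LT_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<", "utf-16-be"),
    (b"<\x00", "utf-16-le"),
    (b"<", "utf-8"),  # strictly: some ASCII-compatible encoding
]


def sniff_encoding(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if nothing matches."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    for pattern, name in _LT_PATTERNS:
        if data.startswith(pattern):
            return name, 0
    return None, 0


if __name__ == "__main__":
    print(sniff_encoding(codecs.BOM_UTF8 + b"<html>"))
    print(sniff_encoding("<html>".encode("utf-16-le")))
```

A bare "<" byte only shows that the document is in some ASCII-compatible encoding, so treating it as UTF-8 is a guess that a real detector would refine with further checks.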
It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
Non-BMP characters are so rare in most texts that they may as well be considered non-existent for sizing purposes, making UTF-32 twice the size of UTF-16 and up to four times the size of UTF-8.
HTML5 states that "authors should not use UTF-32, as the encoding detection algorithms described in this specification intentionally do not distinguish it from UTF-16."
The Unicode Standard 5.0 only lists UTF-8, UTF-16 and UTF-32.
Therefore they must either process the data in a variable width format (such as UTF-8 or UTF-16), which are the most common choices, or move to a larger fixed width format (such as UCS-4 or UTF-32).
The Microsoft Windows application programming interface Win32, as well as the Java and .NET Framework platforms, require that wide character variables be defined as 16-bit values, and that characters be encoded using UTF-16 (due to former use of UCS-2), while modern Unix-like systems generally require 32-bit values encoded using UTF-32.
In detail: ISO-8859-1, ISO-8859-2, ISO-8859-5, ISO-8859-6, ISO-8859-7, KOI8-R, Shift JIS, GB2312, UTF-8, UTF-16 and UTF-32.
UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print and manipulate them, even if the file is known to contain only characters in the ASCII subset.
Since characters outside the Basic Multilingual Plane (BMP) are typically rare, a document encoded in UTF-32 will often be nearly twice as large as its UTF-16/UCS-2-encoded equivalent, because UTF-16 uses two bytes for characters inside the BMP and four bytes otherwise.
For example, UTF-16 uses less space than UTF-32 only for characters from the BMP, which are nevertheless by far the most common characters in Unicode text.
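The size relationship can be checked directly by encoding the same text in each form and counting bytes. The helper name `encoded_sizes` and the sample strings are illustrative assumptions; the BOM-less codec variants are used so that the counts reflect character data only.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding form, without a BOM."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    bmp_text = "Gr\u00fc\u00dfe, \u4e16\u754c"   # 9 characters, all inside the BMP
    non_bmp_text = "\U0001D11E"                  # one character outside the BMP
    print(encoded_sizes(bmp_text))
    print(encoded_sizes(non_bmp_text))
```

For the BMP-only sample, UTF-32 takes exactly twice as many bytes as UTF-16; for the single non-BMP character all three forms take four bytes.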
UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but UTF-7 and GB 18030 do not.
If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as an API.
UTF-16 and UTF-32 are not byte oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file.
UTF-16 and UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good code point, but an odd number of lost or spurious bytes (octets) will garble all following text.
For example, Unicode is a code page that has several encoding forms, like UTF-8, UTF-16 and UTF-32.

UTF-16 and which
Windows file names and other identifiers, which are encoded as UTF-16, are automatically converted to and from the selected character encoding.
It is also possible to use UTF-16 where most characters are stored as two bytes with varying endianness, which is supported by modern browsers but less commonly used.
Programs should use the NewString, GetStringLength, GetStringChars, ReleaseStringChars, GetStringRegion, GetStringCritical, and ReleaseStringCritical functions, which use UTF-16LE encoding on little-endian architectures and UTF-16BE on big-endian architectures, and then use a UTF-16 to standard UTF-8 conversion routine.
Standard SMS messaging uses 140 bytes per message, which translates to 160 characters of the English alphabet using 7-bit encoding or as few as 70 characters for languages using non-Latin alphabets using UTF-16 encoding.
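The 160 and 70 character figures follow from dividing the 140-byte payload by the width of one character. A small worked calculation, where the function name `sms_capacity` is a hypothetical helper:

```python
SMS_PAYLOAD_BYTES = 140  # payload size stated above


def sms_capacity(bits_per_char: int) -> int:
    """Whole characters of the given width that fit in one SMS payload."""
    return (SMS_PAYLOAD_BYTES * 8) // bits_per_char


if __name__ == "__main__":
    print(sms_capacity(7))    # 7-bit alphabet
    print(sms_capacity(16))   # one 16-bit UTF-16 code unit per character
```

The 16-bit figure assumes one code unit per character; a character that needs a surrogate pair would use two.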
The Unicode code point for the pound sign is U+00A3 (which is 00A3 hexadecimal in UTF-16), which converts into modified Base64 as in the table below.
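As a rough illustration of that conversion, the sketch below encodes the text as big-endian UTF-16 and applies standard Base64 with the "=" padding removed, which is how modified Base64 is assumed to differ here. The helper name `modified_base64` is hypothetical; Python's built-in `utf-7` codec is used only as a cross-check.

```python
import base64


def modified_base64(text: str) -> str:
    """Base64 of the UTF-16BE bytes of `text`, without '=' padding."""
    utf16 = text.encode("utf-16-be")          # U+00A3 -> bytes 00 A3
    return base64.b64encode(utf16).decode("ascii").rstrip("=")


if __name__ == "__main__":
    print(modified_base64("\u00a3"))
    # Cross-check: the UTF-7 form wraps the same characters in "+" and "-".
    print(b"+AKM-".decode("utf-7") == "\u00a3")
```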
Some of the more dated features of the language include the dependency on EBCDIC character sets, which means that use of UTF-8 requires considerable effort and is quite risky (the programmer has to take responsibility for not mixing up character data with a single-byte CCSID), whereas 16-bit characters (e.g. UTF-16) are supported directly via a separate data type (Graphic).
HFS Plus is an improved version of HFS, supporting much larger files (block addresses are 32 bits long instead of 16) and using Unicode (instead of Mac OS Roman or any of several other character sets) for naming items (files, folders). Names are encoded in UTF-16 and normalized to a form very nearly the same as Unicode Normalization Form D (NFD). This means that precomposed characters like å are decomposed in the HFS+ filename and therefore count as two characters; UTF-16 also implies that characters from outside the Basic Multilingual Plane (often seldom used, or from ancient writing systems) count as two characters in an HFS+ filename.
However, using UTF-16 makes characters outside the Basic Multilingual Plane a special case which increases the risk of oversights related to their handling.
Since the late 1990s, software and systems have increasingly adopted more direct encodings of Unicode, in particular UTF-8 and UTF-16; this trend has been reinforced by the widespread adoption of XML, which provides a more adequate mechanism for labelling the encoding used.

UTF-16 and can
For the serialisation then, as long as the page is encoded in an extension of ASCII (such as UTF-8, and thus not if the page is using UTF-16), a meta element, like <meta http-equiv="Content-Type"> or (starting with HTML5) <meta charset>, can be used.
In the Windows NT-based versions of Windows, Notepad can edit traditional 8-bit text files as well as Unicode text files (both UTF-8 and UTF-16, and in the case of UTF-16, both little-endian and big-endian).
Editors that limit themselves to left-to-right languages and precomposed characters can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with 16-bit UTF-16 encoding.
Python can be compiled to use them instead of UTF-16.
Microsoft SQL Server (including SQL Server 2008) cannot, as it can use UCS-2 but not UTF-16 (except through the use of varbinary 'blobs').
SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.
Unicode's Thai range is U+0E01 through U+0E7F, and TIS-620 Thai characters can be converted to UTF-16 simply by prefixing each byte with 0E and subtracting hex A0 from the value.
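The byte arithmetic can be sketched as below. The function name `tis620_to_str` is hypothetical, and the sketch makes two simplifying assumptions: bytes below 0x80 are passed through as ASCII, and unassigned TIS-620 byte values are not rejected. Python's built-in `tis-620` codec is used as a cross-check.

```python
def tis620_to_str(data: bytes) -> str:
    """Map TIS-620 bytes to Unicode: Thai byte b becomes U+0E00 + (b - 0xA0)."""
    chars = []
    for b in data:
        if b < 0x80:
            chars.append(chr(b))                      # ASCII half, unchanged
        else:
            chars.append(chr(0x0E00 | (b - 0xA0)))    # "prefix 0E, subtract A0"
    return "".join(chars)


if __name__ == "__main__":
    sample = bytes([0xA1, 0xE0])                      # two Thai letters in TIS-620
    print(tis620_to_str(sample) == sample.decode("tis-620"))
```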
Each CESU-8 character code (1, 2, or 3 bytes) can be converted to exactly one UTF-16 code unit (2 bytes).
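Python has no built-in CESU-8 codec, so the sketch below shows the one-to-one relationship in the encoding direction only, by UTF-8-encoding each UTF-16 code unit separately. The function name `cesu8_encode` is hypothetical, and the `surrogatepass` error handler is used to let surrogate code units through.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit of `text` as its own 1-3 byte sequence."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for (unit,) in struct.iter_unpack(">H", utf16):
        # Surrogate code units are encoded individually, which is what
        # distinguishes this output from standard UTF-8.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    print(cesu8_encode("\U0001D11E").hex(" "))   # two 3-byte sequences
    print(cesu8_encode("abc") == b"abc")         # BMP text matches UTF-8
```

A non-BMP character therefore takes six bytes in this form instead of the four bytes of standard UTF-8.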
The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment.

UTF-16 and be
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
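A minimal sketch of writing and reading such a BOM, using the constants from Python's `codecs` module. The helper name `encode_utf16_with_bom` is hypothetical; Python's generic `utf-16` decoder is relied on to read the BOM and pick the byte order.

```python
import codecs


def encode_utf16_with_bom(text: str, big_endian: bool = True) -> bytes:
    """Prefix the UTF-16 code units with a BOM in the chosen byte order."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


if __name__ == "__main__":
    for order in (True, False):
        data = encode_utf16_with_bom("hi", big_endian=order)
        # The generic "utf-16" decoder consumes the BOM to pick the byte order.
        print(data.hex(" "), "->", data.decode("utf-16"))
```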
In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64.
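The surrogate step for U+10000 and higher can be sketched as follows. The function name `to_surrogates` is a hypothetical helper; the arithmetic is the standard UTF-16 surrogate pair calculation.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(0x10000)
    print(f"{high:04X} {low:04X}")
```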
On the other hand, UTF-16 must be transfer encoded to fit SMTP data format.
