UTF-16 (16-bit Unicode Transformation Format) is


Page "UTF-16" ¶ 0

from Wikipedia


Some Related Sentences

UTF-16 and 16-bit

UTF-16 extends UCS-2, using two 16-bit units (4 × 8 bit) to handle each of the additional characters.
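As an illustration of how two 16-bit units can carry one additional character, the following is a minimal sketch of the surrogate-pair arithmetic. The function name `to_surrogate_pair` is hypothetical and chosen only for this example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


# Cross-check against Python's own UTF-16 encoder for U+1F600.
high, low = to_surrogate_pair(0x1F600)
expected = "\U0001F600".encode("utf-16-be")
assert high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected
```

Characters below U+10000 need no such split: their code point is used directly as a single 16-bit unit.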

It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 97 %

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
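A reader of such a stream might check for the BOM roughly as follows. This is a sketch only; the function name `utf16_byte_order` is an assumption made for the example, while the two BOM constants come from Python's standard `codecs` module.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order signalled by a leading UTF-16 BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"                           # no BOM present


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(utf16_byte_order(sample))
```

One caveat: a UTF-32 little-endian BOM also begins with the bytes FF FE, so a caller that must tell the two apart would need to look at the following two bytes as well.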

Editors that limit themselves to left-to-right languages and precomposed characters can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with 16-bit UTF-16 encoding.

HFS Plus is an improved version of HFS, supporting much larger files (block addresses are 32 bits long instead of 16) and using Unicode (instead of Mac OS Roman or any of several other character sets) for naming items (files, folders). Names are encoded in UTF-16 and normalized to a form very nearly the same as Unicode Normalization Form D (NFD). As a result, precomposed characters like å are decomposed in the HFS+ filename and therefore count as two characters, and, because of UTF-16, characters from outside the Basic Multilingual Plane (often seldom-used characters and characters from ancient writing systems) also count as two characters in an HFS+ filename.

The Microsoft Windows application programming interface Win32, as well as the Java and .NET Framework platforms, require that wide character variables be defined as 16-bit values, and that characters be encoded using UTF-16 (due to former use of UCS-2), while modern Unix-like systems generally require 32-bit values encoded using UTF-32.

UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width.

UTF-16 and Unicode

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996.

Formatted text, justification, and even the inclusion of graphics and other multimedia elements are supported by TextEdit, as well as the ability to read and write to different character encodings, including Unicode (UTF-8 and UTF-16).

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM.

In the Windows NT-based versions of Windows, Notepad can edit traditional 8-bit text files as well as Unicode text files (both UTF-8 and UTF-16, and in the case of UTF-16, both little-endian and big-endian).

Around 2004, after most computer software became compatible with Unicode UTF-8 or UTF-16 systems, use of Greeklish was strongly discouraged in many Greek online discussion boards (forums) where Greeklish had been in use before.

The Unicode Standard 5.0 only lists UTF-8, UTF-16 and UTF-32.

The Unicode code point for the pound sign is U+00A3 (which is 00A3 hexadecimal in UTF-16), which converts into modified Base64 as in the table below.
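The conversion can be sketched in a few lines. The helper name `modified_base64_utf16` is hypothetical, and the sketch assumes that "modified Base64" here means standard Base64 over the big-endian UTF-16 bytes with the trailing "=" padding dropped.

```python
import base64


def modified_base64_utf16(text: str) -> str:
    """Base64-encode the big-endian UTF-16 bytes of text, without '=' padding."""
    raw = text.encode("utf-16-be")                 # U+00A3 becomes bytes 00 A3
    encoded = base64.b64encode(raw).decode("ascii")
    return encoded.rstrip("=")                     # assumed modification: no padding


print(modified_base64_utf16("\u00a3"))  # AKM
```

The 16 bits 0000 0000 1010 0011 are padded to 18 bits and read as three 6-bit groups (0, 10, 12), which map to the Base64 letters A, K and M.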

* Unicode text files encoded in UTF-16 often start with the byte order mark to detect endianness (FE FF for big endian and FF FE for little endian).

This is one reason why applications are recommended to use a Unicode encoding such as UTF-8 or UTF-16 instead.

* RapidCRC Unicode: RapidCRC with Unicode support (v0.3.4 as of 05/27/2012 supports UTF-8 with or without BOM and UTF-16 LE)

WordPad for Windows XP added full Unicode support, enabling WordPad to support multiple languages, but UTF-16/UCS-2 Big Endian is not supported.

It was largely replaced with, firstly, Windows-1252 (often mislabeled as ISO-8859-1), and later with UCS-2, and finally with UTF-16 (the NT line was natively Unicode from the start, but issues of development tool support and compatibility with Windows 9x kept most applications on the 8-bit code pages).

From Outlook 2003 onward, the standard format for PST and .ost files is Unicode (UTF-16 little-endian).

IBM EBCDIC-based mainframe operating systems, like z/OS, usually use UTF-16 for complete Unicode support.

Modern applications prefer Unicode UTF-8 or UTF-16 to Windows-1255.

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26.

A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8.
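The two-step procedure described for CESU-8 might be sketched like this. The function name `cesu8_encode` is an assumption for the example, and Python's `surrogatepass` error handler is used only as a convenient way to emit the three-byte form of a lone surrogate.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: surrogate pairs first, then UTF-8 per surrogate."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded the same way as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Step 1: form the surrogate pair, as UTF-16 would.
            offset = cp - 0x10000
            pair = (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))
            # Step 2: encode each surrogate code point as three bytes.
            for unit in pair:
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


print(cesu8_encode("\U0001F600").hex(" "))  # ed a0 bd ed b8 80
```

A supplementary character therefore takes six bytes in this scheme, where standard UTF-8 would use four.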

UTF-16 and is

UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.

It is also possible to use UTF-16 where most characters are stored as two bytes with varying endianness, which is supported by modern browsers but less commonly used.

For the serialisation then, as long as the page is encoded in an extension of ASCII (such as UTF-8, and thus not if the page is using UTF-16), a meta element, like <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or (starting with HTML5) <meta charset="UTF-8">, can be used.

For UTF-8, the BOM is optional, while it is a must for the UTF-16 and the UTF-32 encodings.

If the document lacks a byte-order mark, the fact that the first non-blank printable character in an HTML document is supposed to be "<" (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.
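One way to apply this observation is to compare the first few bytes against the possible encodings of "<". The sketch below uses a hypothetical function name, `sniff_by_less_than`, and assumes the document starts directly with "<" (no BOM and no leading whitespace).

```python
def sniff_by_less_than(data: bytes) -> str:
    """Guess the encoding family from how a leading '<' (U+003C) is laid out."""
    # The 4-byte patterns are tested first, because the UTF-32LE form
    # also begins with the 2-byte UTF-16LE pattern.
    if data.startswith(b"\x00\x00\x00<"):
        return "utf-32-be"
    if data.startswith(b"<\x00\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00<"):
        return "utf-16-be"
    if data.startswith(b"<\x00"):
        return "utf-16-le"
    if data.startswith(b"<"):
        return "utf-8"      # or another ASCII-compatible encoding
    return "unknown"


for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, "->", sniff_by_less_than("<html>".encode(codec)))
```

The result is only a guess about the code unit size and byte order; it cannot distinguish UTF-8 from other ASCII supersets.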

However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Searching for ASCII characters or just the space character (U+0020) is a method of determining the UTF-16 byte order.
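A rough version of this heuristic counts how often the space character appears in each byte order. The function name `guess_utf16_byte_order` is hypothetical, and the sketch assumes the data starts on a code unit boundary.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess UTF-16 byte order by counting U+0020 in both possible layouts."""
    big = little = 0
    for i in range(0, len(data) - 1, 2):
        unit = data[i:i + 2]
        if unit == b"\x00\x20":      # U+0020 stored big-endian
            big += 1
        elif unit == b"\x20\x00":    # U+0020 stored little-endian
            little += 1
    if big > little:
        return "big"
    if little > big:
        return "little"
    return "unknown"                 # no spaces found, or a tie


print(guess_utf16_byte_order("a b c".encode("utf-16-le")))
```

Because U+2000 is also a valid (if rare) character, the result is a statistical guess rather than a proof, and text without spaces yields no answer.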

However, this function is imperfect, incorrectly identifying some all-lowercase ASCII text as UTF-16.

The start of these blocks of modified Base64 encoded UTF-16 is indicated by a sign.

Some of the more dated features of the language include the dependency on the EBCDIC character sets, which means that use of UTF-8 requires considerable effort and is quite risky (the programmer has to take responsibility for not mixing up character data with a single-byte CCSID), whereas 16-bit characters (e.g. UTF-16) are supported directly via a separate datatype (Graphic).

One of several problems with SCSU is then that the savings of SCSU versus UTF-16 or UTF-8 drop after external compression, often dramatically so.

UTF-16 and character

Windows file names and other identifiers, which are encoded as UTF-16, are automatically converted to and from the selected character encoding.

In November 2003 UTF-8 was restricted by RFC 3629 to four bytes in order to match the constraints of the UTF-16 character encoding.

Some people use DBCS to mean the UTF-16 and UTF-8 encodings, while other people use the term DBCS to mean older (pre-Unicode) code pages that use more than one byte per character.

Each CESU-8 character code (1, 2, or 3 bytes) can be converted to exactly one UTF-16 code unit (2 bytes).

UTF-16 and encoding

(Note: UTF-16 and UTF-32 without the BOM are formally known under different names; they are different encodings, and thus need some form of encoding declaration – see UTF-16BE, UTF-16LE, UTF-32LE and UTF-32BE.

HTML5 states that "authors should not use UTF-32, as the encoding detection algorithms described in this specification intentionally do not distinguish it from UTF-16.

Programs should use the NewString, GetStringLength, GetStringChars, ReleaseStringChars, GetStringRegion, GetStringCritical, and ReleaseStringCritical functions, which use UTF-16LE encoding on little-endian architectures and UTF-16BE on big-endian architectures, and then use a UTF-16 to standard UTF-8 conversion routine.

Standard SMS messaging uses 140 bytes per message, which translates to 160 characters of the English alphabet using 7-bit encoding or as few as 70 characters for languages using non-Latin alphabets using UTF-16 encoding.
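The two limits follow directly from the 140-byte payload, as this short worked calculation shows. The variable names are chosen for illustration only.

```python
PAYLOAD_BYTES = 140                      # size of one SMS payload

payload_bits = PAYLOAD_BYTES * 8         # 1120 bits in total
chars_7bit = payload_bits // 7           # 7 bits per character
chars_utf16 = PAYLOAD_BYTES // 2         # 2 bytes per UTF-16 code unit

print(chars_7bit, chars_utf16)           # 160 70
```

The UTF-16 figure counts 16-bit code units, so a character that needs a surrogate pair would use up two of the 70 slots.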

In modern applications UTF-8 or UTF-16 is a preferred encoding.

(An unfortunate but far more common "solution" used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as cp1252 and ignore the mojibake for any non-ASCII data.)

Since the late 1990s, software and systems have increasingly adopted more direct encodings of Unicode, in particular UTF-8 and UTF-16; this trend has been reinforced by the widespread adoption of XML, which provides a more adequate mechanism for labelling the encoding used.

For example, Unicode is a code page that has several encoding forms, like UTF-8, UTF-16 and UTF-32.
