Character encodings

xiaoxiao, 2021-03-06

ASCII
-----
ASCII is a 7-bit code (0x00-0x7F). Codes 32 through 127 represent printable characters: 32 is the space, and everything below 32 is an (invisible) control character. The 8th bit went unused, and different groups around the world invented different uses for that extra position, such as the OEM character set on the IBM PC. In the end, a consensus was reached only on the lower 128 codes, and that became the ASCII standard. Codes of 128 and above could still be interpreted in different ways, and these different interpretations are called code pages. It was even possible to juggle several code pages on the same computer for multilingual text.
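The code-page behavior described above is easy to demonstrate in a modern language. Here is a small Python sketch (Python and its codec names are my illustration, not part of the original article) showing a single byte above 127 meaning different things under two code pages, while bytes 32-127 agree everywhere:

```python
# One byte above 127 means different things in different code pages.
# 0x82 is 'é' in the IBM PC OEM code page (cp437), but a low quotation
# mark in Windows-1252.
b = bytes([0x82])
print(b.decode("cp437"))   # OEM (IBM PC) interpretation: é
print(b.decode("cp1252"))  # Windows-1252 interpretation: ‚

# Bytes 32..127, by contrast, mean the same thing almost everywhere.
print(bytes(range(65, 70)).decode("ascii"))  # ABCDE
```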

Meanwhile, something even crazier was happening in Asia. Asian writing systems typically have thousands of characters, far more than 8 bits can express, so the problem was usually solved with a system called DBCS (double-byte character set). In this system some characters occupy 1 byte and some occupy 2. Moving forward through such a string is easy, but moving backward is very troublesome. Programmers were advised not to step through strings with s++ or s--, but to call functions such as Windows' AnsiNext and AnsiPrev, because those functions know what is going on.
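The variable width that makes DBCS strings awkward is easy to see. This Python sketch uses the GBK code page as an example of a DBCS (GBK is my choice for illustration, not an encoding named in the article):

```python
# In a double-byte character set such as GBK, ASCII letters occupy one
# byte while Chinese characters occupy two, so character counts and byte
# counts no longer match up.
s = "a中b"
encoded = s.encode("gbk")
print(len(s))          # 3 characters
print(len(encoded))    # 4 bytes: 'a' -> 1, '中' -> 2, 'b' -> 1
print(encoded.hex(" "))
```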

These differing assumptions caused no problems as long as a string stayed on a single machine. But with the growth of the Internet, strings routinely moved from one machine to another, and the problems surfaced. So, Unicode appeared.

Unicode
-------
Unicode was a brave effort. It merges every reasonable writing system on this planet into a single character set. Many people still hold the misconception that Unicode is simply 16 bits, with each character taking 16 bits, giving 65,536 possible characters. That is wrong, but don't feel bad, because it is the most common mistake people make.

In fact, Unicode thinks about characters in a very different way, and that is what we must understand. Until now, we have assumed that a character maps directly to some bits stored on disk or in memory, such as A -> 0100 0001. In Unicode, a character corresponds to something called a code point, which is an abstract (the original word is "Platonic", i.e. idealized) concept. Whether it is rendered in Times New Roman or Helvetica or any other font, it is still the same character A; yet it is different from the lowercase letter a. In other languages such as Hebrew, German, or Arabic, however, whether different glyphs of the same letter really are the same letter was controversial. After long debates, these questions were finally settled.

Every abstract letter in every alphabet is assigned a number, such as U+0645. This is called a code point. The "U+" means Unicode, and the number is hexadecimal. You can browse all of these codes with the charmap command (Windows 2000/XP), or on the Unicode website (http://www.unicode.org). The number of Unicode characters is not limited, and it long ago passed 65,535, so not every character can be stored in two bytes.

So the string "Hello" is, in Unicode, these 5 code points: U+0048 U+0065 U+006C U+006C U+006F. These are just numbers. We have not yet said how to represent this information on disk or in an email message; that is what encodings are for.

Encodings
---------
The earliest Unicode encoding used two bytes per character, so "Hello" is represented as: 00 48 00 65 00 6C 00 6C 00 6F. In fact, there is another possibility: 48 00 65 00 6C 00 6C 00 6F 00. High byte first and low byte first are two different modes, and which one is used depends on which the particular CPU handles fastest. So there are two different Unicode representations. To tell them apart, people adopted a strange convention: at the front of every Unicode string, store FE FF (this is called the Unicode byte order mark, BOM). If the high and low bytes are swapped, it appears as FF FE, so whoever reads the string knows that every pair of adjacent bytes must be swapped. In the early days, however, not every Unicode string carried this mark.
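Both byte orders, and the byte order mark, can be reproduced with Python's built-in codecs (a sketch for illustration; the article itself predates this API):

```python
s = "Hello"
# Big-endian: high byte first -> 00 48 00 65 00 6c 00 6c 00 6f
print(s.encode("utf-16-be").hex(" "))
# Little-endian: low byte first -> 48 00 65 00 6c 00 6c 00 6f 00
print(s.encode("utf-16-le").hex(" "))
# The generic "utf-16" codec prepends the byte order mark:
# FF FE on little-endian machines, FE FF on big-endian ones.
print(s.encode("utf-16").hex(" "))
```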

This looked quite good, but programmers began to complain: "look at all those zeros!" Many of them were Americans working with English text, which rarely needs code points above U+00FF, and some of them could not bear doubling the storage needed for every character. For these reasons, many people decided to ignore Unicode, and in the meantime things got worse.

Then UTF-8 was invented. UTF-8 is another system for storing Unicode code points, using sequences of 8-bit bytes in memory. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points of 128 and above are stored using 2, 3, and up to 6 bytes, as shown below:

Hex min    Hex max    Byte sequence in memory
--------   --------   ----------------------------------------------------
00000000   0000007F   0vvvvvvv
00000080   000007FF   110vvvvv 10vvvvvv
00000800   0000FFFF   1110vvvv 10vvvvvv 10vvvvvv
00010000   001FFFFF   11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
00200000   03FFFFFF   111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
04000000   7FFFFFFF   1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

This looks good: English text in UTF-8 is byte-for-byte identical to ASCII, so Americans never noticed anything wrong; only the rest of the world has to use the multi-byte sequences. In particular, the string "Hello", whose Unicode code points are U+0048 U+0065 U+006C U+006C U+006F, is stored as 48 65 6C 6C 6F, which is the same as its representation in ASCII, ANSI, and every OEM character set on the planet. If you need accented letters, or Greek, you have to use several bytes per code point, but Americans won't mind. (UTF-8 also has the nice property that old string-handling code, which uses a zero byte as the null terminator, will not truncate the strings.)
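The table above can be checked empirically. This Python sketch (illustration only) picks one code point from each of the first four rows, and confirms that "Hello" comes out identical to ASCII:

```python
# Code points 0-127 take one UTF-8 byte; higher code points take more.
for ch in ["A", "é", "中", "\U0001D11E"]:  # U+0041, U+00E9, U+4E2D, U+1D11E
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(b)} byte(s): {b.hex(' ')}")

# "Hello" in UTF-8 is byte-for-byte the same as in ASCII.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```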

So far, three ways of representing Unicode have been introduced:

The traditional two-byte form, called UCS-2 (because it uses 2 bytes) or UTF-16 (because it uses 16 bits); with UCS-2 you still have to know whether the high byte comes first or last.

And the new UTF-8, which has the nice property that programs handling only English text keep working unchanged.

There is actually a bunch of other ways to encode Unicode. There is UTF-7, which is like UTF-8 except that the high bit of every byte is guaranteed to be 0, so if you must pass Unicode through some kind of email system that believes 7 bits are quite enough, it will still arrive intact. There is also UCS-4, which stores each code point in 4 bytes; its advantage is that every character is stored in the same number of bytes, but its obvious disadvantage is that it wastes far too much storage space.
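Both properties claimed above are easy to verify in Python (a sketch; the codec names "utf-7" and "utf-32-be" are Python's, with UTF-32 playing the role of UCS-4):

```python
s = "héllo"
# UTF-7 keeps the high bit of every byte clear, so a 7-bit mail system
# cannot mangle it: even 'é' becomes a pure-ASCII escape sequence.
u7 = s.encode("utf-7")
print(u7)
print(all(byte < 128 for byte in u7))  # True

# UCS-4 / UTF-32 stores every code point in exactly 4 bytes:
# uniform length, but wasteful.
print(len("Hello".encode("utf-32-be")))  # 5 characters x 4 bytes = 20
```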

So now, train yourself to think of every character as an abstract Unicode code point, and remember that those code points can also be encoded (Encode) in any old scheme. For example, you can encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) as ASCII, or the ancient OEM Greek encoding, or the Hebrew ANSI encoding, and so on. And some strings cannot be displayed! That is, if a Unicode code point has no equivalent in the encoding you are using, it usually shows up as a question mark or a little white box.
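Here is what "cannot be displayed" looks like in practice (a Python sketch for illustration; the '?' produced by errors="replace" stands in for the question marks and white boxes the article mentions):

```python
s = "Héllo"
print(s.encode("latin-1"))  # é exists in Latin-1: b'H\xe9llo'
try:
    s.encode("ascii")       # é has no ASCII equivalent
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
# With errors="replace", unmappable characters become '?', much like the
# question marks / white boxes you see with a mismatched encoding.
print(s.encode("ascii", errors="replace"))  # b'H?llo'
```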

Some encodings commonly used for English text: Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, a.k.a. Latin-1 (also valid for any Western European language). If you try to store Russian characters in these encodings, you will get a pile of question marks. UTF-7, -8, -16, and -32, on the other hand, all share the advantage of being able to store any code point correctly.

The simplest, most important concept
====================================
A string is meaningless if you do not know what encoding it uses. Never assume that "plain" text is ASCII. There is no such thing as plain text.

If you have a string, in memory, in a file, or in an email message, you must know what encoding it is in; otherwise you cannot interpret it or display it to the user correctly. Almost every stupid "my web page cannot be displayed properly" or "my email message does not show up correctly" problem comes down to this: the sender failed to say whether the text was UTF-8 or ASCII or ISO 8859-1 or Windows 1252, so the receiver naturally cannot interpret and display it properly, or even know where the string ends.
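A quick way to convince yourself that bytes without an encoding are meaningless (a Python sketch, illustration only): decode the very same bytes under three different assumptions and get three different strings.

```python
# The same nine bytes, interpreted three ways.
data = "日本語".encode("utf-8")
print(data.decode("utf-8"))                     # correct: 日本語
print(data.decode("latin-1"))                   # mojibake: 9 garbage chars
print(data.decode("cp1252", errors="replace"))  # different garbage again
```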

So how do we preserve this information about what encoding a string uses? There are standard ways to do it. For email messages, you add a header:

Content-Type: text/plain; charset="UTF-8"

For web pages, the original plan was for the web server to send a similar Content-Type header along with the web page itself: not inside the HTML, but as one of the response headers sent before the HTML page.

This has a problem. Suppose your web server hosts several sites at once, with pages contributed by many different people writing in many different languages. The web server cannot know what encoding each file was written in, so it cannot send the correct Content-Type header. It would be convenient to record the Content-Type information inside each HTML file itself. That idea sounds crazy: you cannot read the file until you know its encoding, so how can you read the encoding information out of it? Fortunately, almost every encoding interprets the characters between 32 and 127 the same way, so you can always write this in every HTML file:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Note that this meta tag must be placed very near the top of the page to guarantee there are no problems, because as soon as the web browser reads it, it stops parsing and starts over, re-interpreting the whole page using the declared encoding. Then, what does a browser do if it finds no Content-Type in either the meta tag or the HTTP headers? IE does this: it guesses, based on the frequencies with which particular bytes appear in typical text of various languages. If the guessed encoding turns out wrong, the user can try different encodings via the View | Encoding menu (of course, not everyone knows to do this).

In VB, COM, and Windows NT/2000/XP, the default string type is UCS-2 (2 bytes per character). In C code, we can declare strings as wchar_t (wide char) and use the wcs family of functions instead of the str family, e.g. wcscat and wcslen instead of strcat and strlen. To create a UCS-2 string literal in C, just put an "L" in front of it, as in L"Hello".

For web pages, it is best to standardize on UTF-8. This encoding has been well supported by web browsers for many years.

(End) Original: http://www.donews.Net/xzwenlan/archive/2005/01/18/246416.aspx

