Chinese character coding note (reproduced)

xiaoxiao2021-03-06  25

Java Chinese Processing Learning Notes - Hello Unicode

Author: Che Dong Email: chedongATbigfoot.com/chedongATchedong.com

Copyright notice: You may reprint this article freely, but please be sure to indicate the original source, the author information, and this statement with a hyperlink. http://www.chedong.com/tech/hello_unicode.html

Keywords: Linux Java Multibyte Encoding Locale I18N L10N Chinese ISO-8859-1 GB2312 BIG5 GBK Unicode

Abstract:

I don't know whether you have had this feeling: why does PHP rarely run into garbled-text problems, while building web applications in Java is so troublesome? Why can a Simplified Chinese query on Google find Traditional Chinese, or even Japanese, results? And why does Google automatically bring up the Chinese interface according to the language setting of my browser?

Many internationalized applications taught me this: Unicode was designed to make applications more convenient, and because Java's core character handling is based on Unicode, it gives applications control over Chinese "characters" rather than bytes. But if you do not study the specifications carefully, this freedom becomes a burden and causes even more garbled-text problems:

Some basic concepts of character sets; Test 1: displaying the system environment settings and the supported encodings; Test 2: how the system default encoding affects the input and output of Java applications; Test 3: character set problems in the input and output of web applications.

Character set basics: ISO-8859-1, GB2312, BIG5, GBK, GB18030, Unicode. Why are there so many character set encodings?

Note: The following explanations are not strict definitions, and some metaphors are used only to make things easier to understand.

Suppose each character is a piece on a chessboard with its own fixed coordinates. To distinguish all characters, the board needs enough squares to hold every different "character".

Single-byte charsets for English and European languages: First, think of each character set in the ISO-8859 series as a chessboard of 2^8 = 16 × 16 = 256 squares. All Western characters are basically covered by such a 16 × 16 coordinate system. English itself uses only the part below 128 (\x80). Different rules for the squares above 128 give the extensions for other European languages: ISO-8859-2, ISO-8859-4, and so on.

    Charset       Below 128   Above 128
    ISO-8859-1    English     other Western European characters
    ISO-8859-7    English     Greek characters (e.g. μ)
    others        English     other single-byte character sets
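The chessboard picture above can be checked with a few lines of Java. This is only a minimal sketch: the class name `SingleByteDemo` and the helper `decode` are made up for illustration, and it assumes the JVM ships the ISO-8859-7 charset (most full JDKs do). It decodes the same byte value with two ISO-8859 charsets and prints the resulting code points.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class SingleByteDemo {
    /** Decodes one byte with the given single-byte charset and returns the character. */
    static char decode(int byteValue, Charset charset) {
        return new String(new byte[] {(byte) byteValue}, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset greek = Charset.forName("ISO-8859-7"); // assumed to be available

        // Below 0x80 both charsets agree with ASCII.
        System.out.println(decode(0x41, latin1) + " " + decode(0x41, greek));

        // Above 0x80 the same byte lands on a different character.
        System.out.printf("0xEC -> U+%04X in ISO-8859-1, U+%04X in ISO-8859-7%n",
                (int) decode(0xEC, latin1), (int) decode(0xEC, greek));
    }
}
```

The byte 0xEC is the Greek letter μ in ISO-8859-7 but an accented Latin letter in ISO-8859-1, which is exactly the "same square, different piece" situation described above.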

MultiByte Charsets, such as GB2312 BIG5 SJIS:

For Asian languages: there are far too many Chinese characters for such a small 256-square board. To distinguish thousands of Chinese characters, a "character" is located on the board with 2 bytes (coordinates), and the rules above are extended as follows:

If the first byte is less than 128 (\x80), it is still compatible with the English character set encoding; if the first byte is greater than 128 (\x80), it is the first byte of a Chinese character and, together with the byte that follows it, forms one Chinese character.
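The lead-byte rule can be sketched as a small scanner. The class `LeadByteScan` and its method `countChars` are hypothetical names used only for this illustration, and the sketch assumes a well-formed GB2312 byte stream and a JVM that provides the GB2312 charset.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original article.
class LeadByteScan {
    /** Counts characters in a GB2312-style byte stream using the lead-byte rule. */
    static int countChars(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;   // Java bytes are signed; mask to 0..255
            i += (b < 0x80) ? 1 : 2;  // ASCII takes 1 byte, otherwise lead byte + trail byte
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Hello world " followed by four Chinese characters, written as escapes
        String text = "Hello world \u4e16\u754c\u4f60\u597d";
        byte[] gb = text.getBytes(Charset.forName("GB2312")); // assumed to be available
        System.out.println(gb.length + " bytes, " + countChars(gb) + " characters");
    }
}
```

For this sample the scanner sees 20 bytes but only 16 characters: 12 single-byte ASCII characters plus 4 two-byte Chinese characters.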

In effect, each of the upper 128 squares of the small board is subdivided into another 16 × 16 board, so the number of squares on the board becomes 128 + 128 × 256. The GB2312 standard for Simplified Chinese, the BIG5 character set for Traditional Chinese, the SJIS character set for Japanese, and others were all defined in a similar way. The GB2312 character set contains a little over six thousand commonly used Simplified Chinese characters.

    Charset                       Below 128   Extended part (2 bytes)
    GB2312 (Simplified Chinese)   English     Simplified Chinese
    SJIS (Japanese)               English     Japanese
    BIG5 (Traditional Chinese)    English     Traditional Chinese

As you can see, all of these character sets are extensions of ASCII: the English part is compatible, but the encodings of the extended parts are incompatible with each other. Although many characters are shared by the three systems (such as the two characters of "中文"), their coordinates differ in each character set, so a page written in GB2312 becomes unrecognizable when it is read as BIG5. The strange Chinese characters that often appear when browsing pages from other non-English-speaking countries (for example, pages containing German names) are in fact caused by the same kind of encoding conflict in the extended part.
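To make the "same character, different coordinates" point concrete, the following sketch encodes the two characters of "中文" with GB2312 and BIG5 and then decodes the GB2312 bytes with the wrong charset. The class name `CjkMismatchDemo` and the helper `hex` are invented for this example, and it assumes the JVM provides both charsets.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original article.
class CjkMismatchDemo {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        Charset big5 = Charset.forName("Big5");     // assumed to be available
        String text = "\u4e2d\u6587"; // the two characters of "Chinese", written as escapes

        byte[] gbBytes = text.getBytes(gb2312);
        System.out.println("GB2312: " + hex(gbBytes));
        System.out.println("Big5:   " + hex(text.getBytes(big5)));

        // Decoding GB2312 bytes as Big5 does not give the original text back.
        String wrong = new String(gbBytes, big5);
        System.out.println("decoded correctly? " + wrong.equals(text));
    }
}
```

The two byte sequences differ completely even though the characters are the same, so the final comparison prints `false`.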

I think of GBK and GB18030 as a small Unicode: the GBK character set is an extension (K) of GB2312 with more than 20,000 characters. Besides staying compatible with GB2312, it can also represent Traditional Chinese characters and even Japanese kana. GB18030-2000 is a more complex character set that uses a variable-length byte encoding to support even more characters. For a detailed specification of Chinese character encodings, see: http://www.unihan.com.cn/cjk/ana17.htm

ASCII (English) ==> Western European text ==> Eastern European character sets (Russian, Greek, etc.) ==> East Asian character sets (GB2312, BIG5, SJIS, etc.) ==> extended character sets (GBK, GB18030): this sequence roughly reflects how character set standards developed. But as time went on, and especially as the Internet made cross-language information exchange more and more common, having only encoding standards for individual local languages made internationalizing an application very costly, particularly if you want to write a document that contains both French and Simplified Chinese. People came to feel that a universal character set able to represent the text of all languages would make application development far more convenient, and that achieving this goal is well worth sacrificing some space and program efficiency. Unicode is such a universal solution.

Unicode is a double-byte character set. You can picture it like this: every character (including English) is represented with two bytes (two 8-bit units), which gives a big chessboard of 2^(8 × 2) = 256 × 256 = 65536 squares. On this board, Chinese (Simplified and Traditional), Japanese, and Korean characters (Vietnamese is included as well) are placed together in one region as the CJK character set, and to reduce duplication the same character in different languages shares a single "square". For the detailed layout, see Appendix A.

    Unicode (double-byte charset): English | Western European | C: Chinese | J: Japanese | K: Korean

Why is UTF-8 still needed? After all, more than 70% of information is still in English. If everything were stored with 2 bytes (UCS-2), wouldn't too much space be wasted? UTF-8 is a transformation format designed to make storing English more efficient: Unicode Transformation Format, 8-bit form. With UTF-8, Unicode's 2-byte characters are stored with a variable length (1-3 bytes): English still uses 1 byte, as in ASCII, and that byte is less than 128 (\x80); for other languages, such as Chinese, the first byte has a value between 128 and 255 and is followed by two more bytes, 3 bytes in total.

So inside the application all characters are 16 bits (double bytes), and they are converted with UTF-8 when they are read from or written to a byte stream. For English characters the size stays the same as with ASCII, while for Chinese the size is 3 bytes / 2 bytes = 1.5 times that of the original GB2312 encoding.
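The size comparison above can be reproduced with a short program. This is a sketch only: `EncodedSizeDemo` and `size` are illustrative names, UTF-16BE is used here as a stand-in for the "2 bytes per character" picture, and the GB2312 charset is assumed to be available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class EncodedSizeDemo {
    /** Returns the number of bytes the text occupies in the given encoding. */
    static int size(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String english = "Hello world";
        String chinese = "\u4e16\u754c\u4f60\u597d"; // four Chinese characters as escapes
        Charset[] charsets = {
            StandardCharsets.UTF_16BE,   // 2 bytes for each of these characters
            StandardCharsets.UTF_8,      // 1 byte for ASCII, 3 bytes for these Chinese characters
            Charset.forName("GB2312")    // 1 byte for ASCII, 2 bytes for Chinese (assumed available)
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": english=" + size(english, cs)
                    + " bytes, chinese=" + size(chinese, cs) + " bytes");
        }
    }
}
```

For the four Chinese characters, UTF-8 needs 12 bytes against 8 bytes in GB2312, which is the 1.5 ratio mentioned above, while the English text takes 11 bytes in both.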

Summary:

Suppose the English character set is a 16 × 16 chessboard. The character sets of other languages each re-divide the upper region (> 128) of this board in their own way, so these character sets are incompatible with one another. Unicode itself is equivalent to a big 256 × 256 chessboard that holds English and the characters of all other languages according to fixed rules.

Test 1: The impact of the operating system's language environment settings on the default encoding of Java applications

To understand how encoding works in Java applications, you must first understand how the operating system's default encoding affects the JVM. So I wrote env.java, which prints the locales supported by the JVM and its default properties under different systems. The program is simple:

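    /*
     * Copyright (c) 2002 Email: chedongATbigfoot.com/chedongATchedong.com
     * $Id: hello_unicode.html,v 1.6 2003/11/09 07:57:11 chedong Exp $
     */
    import java.text.DateFormat;
    import java.util.Date;
    import java.util.Locale;

    /**
     * Purpose:
     *   Display environment variables and JVM default properties
     * Input: none
     * Output:
     *   1. supported locales
     *   2. JVM default properties
     */
    public class env {
        /** main entrance */
        public static void main(String[] args) {
            System.out.println("Hello, It's: " + new Date());

            // print available locales
            Locale[] list = DateFormat.getAvailableLocales();
            System.out.println("====== System available locales: ========");
            // loop body reconstructed from the sample output: locale code, then display name
            for (int i = 0; i < list.length; i++) {
                System.out.println(list[i].toString() + "\t" + list[i].getDisplayName());
            }

            // print JVM default properties
            System.out.println("====== System property ========");
            System.getProperties().list(System.out);
        }
    }

The default encoding determines the decoding method for byte stream ==> character stream and the encoding method for character stream ==> byte stream.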

Under Linux, the locale can be set with LANG=zh_CN; LC_ALL=zh_CN.GBK; export LANG LC_ALL. The locale command displays the current environment settings. Under Windows, the same is done through Control Panel ==> Regional Settings.

Test environments:

    GNU/Linux 2.4.x (J2SE 1.3.1), LANG=en_US LC_ALL=en_US
    GNU/Linux 2.4.x (J2SE 1.3.1), LANG=zh_CN LC_ALL=zh_CN.GBK
    Windows 2000 (J2SE 1.3.0), regional setting: Chinese (China)
    Windows 2000 (J2SE 1.3.0), regional setting: English

Output under GNU/Linux 2.4.x (J2SE 1.3.1), LANG=en_US LC_ALL=en_US:

    Hello, It's: Tue Jul 30 11:05:44 CST 2002
    ====== System available locales: ========
    en English
    en_US English (United States)
    ar Arabic
    ar_AE Arabic (United Arab Emirates)
    ar_BH Arabic (Bahrain)
    ar_DZ Arabic (Algeria)
    ar_EG Arabic (Egypt)
    ar_IQ Arabic (Iraq)
    ar_JO Arabic (Jordan)
    ar_KW Arabic (Kuwait)
    ar_LB Arabic (Lebanon)
    ar_LY Arabic (Libya)
    ar_MA Arabic (Morocco)
    ar_OM Arabic (Oman)
    ar_QA Arabic (Qatar)
    ar_SA Arabic (Saudi Arabia)
    ar_SD Arabic (Sudan)
    ar_SY Arabic (Syria)
    ar_TN Arabic (Tunisia)
    ar_YE Arabic (Yemen)
    be Byelorussian
    be_BY Byelorussian (Belarus)
    bg Bulgarian
    bg_BG Bulgarian (Bulgaria)
    ca Catalan
    ca_ES Catalan (Spain)
    ca_ES_EURO Catalan (Spain, Euro)
    cs Czech
    cs_CZ Czech (Czech Republic)
    da Danish
    da_DK Danish (Denmark)
    de German
    de_AT German (Austria)
    de_AT_EURO German (Austria, Euro)
    de_CH German (Switzerland)
    de_DE German (Germany)
    de_DE_EURO German (Germany, Euro)
    de_LU German (Luxembourg)
    de_LU_EURO German (Luxembourg, Euro)
    el Greek
    el_GR Greek (Greece)
    en_AU English (Australia)
    en_CA English (Canada)
    en_GB English (United Kingdom)
    en_IE English (Ireland)
    en_IE_EURO English (Ireland, Euro)
    en_NZ English (New Zealand)
    en_ZA English (South Africa)
    es Spanish
    es_BO Spanish (Bolivia)
    es_AR Spanish (Argentina)
    es_CL Spanish (Chile)
    es_CO Spanish (Colombia)
    es_CR Spanish (Costa Rica)
    es_DO Spanish (Dominican Republic)
    es_EC Spanish (Ecuador)
    es_ES Spanish (Spain)
    es_ES_EURO Spanish (Spain, Euro)
    es_GT Spanish (Guatemala)
    es_HN Spanish (Honduras)
    es_MX Spanish (Mexico)
    es_NI Spanish (Nicaragua)
    et Estonian
    es_PA Spanish (Panama)
    es_PE Spanish (Peru)
    es_PR Spanish (Puerto Rico)
    es_PY Spanish (Paraguay)
    es_SV Spanish (El Salvador)
    es_UY Spanish (Uruguay)
    es_VE Spanish (Venezuela)
    et_EE Estonian (Estonia)
    fi Finnish
    fi_FI Finnish (Finland)
    fi_FI_EURO Finnish (Finland, Euro)
    fr French
    fr_BE French (Belgium)
    fr_BE_EURO French (Belgium, Euro)
    fr_CA French (Canada)
    fr_CH French (Switzerland)
    fr_FR French (France)
    fr_FR_EURO French (France, Euro)
    fr_LU French (Luxembourg)
    fr_LU_EURO French (Luxembourg, Euro)
    hr Croatian
    hr_HR Croatian (Croatia)
    hu Hungarian
    hu_HU Hungarian (Hungary)
    is Icelandic
    is_IS Icelandic (Iceland)
    it Italian
    it_CH Italian (Switzerland)
    it_IT Italian (Italy)
    it_IT_EURO Italian (Italy, Euro)
    iw Hebrew
    iw_IL Hebrew (Israel)
    ja Japanese
    ja_JP Japanese (Japan)
    ko Korean
    ko_KR Korean (South Korea)
    lt Lithuanian
    lt_LT Lithuanian (Lithuania)
    lv Latvian (Lettish)
    lv_LV Latvian (Lettish) (Latvia)
    mk Macedonian
    mk_MK Macedonian (Macedonia)
    nl Dutch
    nl_BE Dutch (Belgium)
    nl_BE_EURO Dutch (Belgium, Euro)
    nl_NL Dutch (Netherlands)
    nl_NL_EURO Dutch (Netherlands, Euro)
    no Norwegian
    no_NO Norwegian (Norway)
    no_NO_NY Norwegian (Norway, Nynorsk)
    pl Polish
    pl_PL Polish (Poland)
    pt Portuguese
    pt_BR Portuguese (Brazil)
    pt_PT Portuguese (Portugal)
    pt_PT_EURO Portuguese (Portugal, Euro)
    ro Romanian
    ro_RO Romanian (Romania)
    ru Russian
    ru_RU Russian (Russia)
    sh Serbo-Croatian
    sh_YU Serbo-Croatian (Yugoslavia)
    sk Slovak
    sk_SK Slovak (Slovakia)
    sl Slovenian
    sl_SI Slovenian (Slovenia)
    sq Albanian
    sq_AL Albanian (Albania)
    sr Serbian
    sr_YU Serbian (Yugoslavia)
    sv Swedish
    sv_SE Swedish (Sweden)
    th Thai
    th_TH Thai (Thailand)
    tr Turkish
    tr_TR Turkish (Turkey)
    uk Ukrainian
    uk_UA Ukrainian (Ukraine)
    zh Chinese
    zh_CN Chinese (China)
    zh_HK Chinese (Hong Kong)
    zh_TW Chinese (Taiwan)
    ====== System property ========
    -- listing properties --
    java.runtime.name=Java(TM) 2 Runtime Environment, Stand...
    sun.boot.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386
    java.vm.version=1.3.1_04-b02
    java.vm.vendor=Sun Microsystems Inc.
    java.vendor.url=http://java.sun.com/
    path.separator=:
    java.vm.name=Java HotSpot(TM) Client VM
    file.encoding.pkg=sun.io
    java.vm.specification.name=Java Virtual Machine Specification
    user.dir=/home/chedong/src/char_test
    java.runtime.version=1.3.1_04-b02
    java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment
    os.arch=i386
    java.io.tmpdir=/tmp
    line.separator=
    java.vm.specification.vendor=Sun Microsystems Inc.
    java.awt.fonts=
    os.name=Linux
    java.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386:/u...
    java.specification.name=Java Platform API Specification
    java.class.version=47.0
    os.version=2.4.7-10
    user.home=/home/chedong
    user.timezone=Asia/Shanghai
    java.awt.printerjob=sun.awt.motif.PSPrinterJob
    file.encoding=ISO-8859-1
    java.specification.version=1.3
    user.name=chedong
    java.class.path=/home/chedong/classes
    java.vm.specification.version=1.0
    java.home=/usr/java/jdk1.3.1_04/jre
    user.language=en
    java.specification.vendor=Sun Microsystems Inc.
    java.vm.info=mixed mode
    java.version=1.3.1_04
    java.ext.dirs=/usr/java/jdk1.3.1_04/jre/lib/ext
    sun.boot.class.path=/usr/java/jdk1.3.1_04/jre/lib/rt.jar:...
    java.vendor=Sun Microsystems Inc.
    file.separator=/
    java.vendor.url.bug=http://java.sun.com/cgi-bin/bugReport...
    sun.cpu.endian=little
    sun.io.unicode.encoding=UnicodeLittle
    user.region=US
    sun.cpu.isalist=

Output under GNU/Linux 2.4.x (J2SE 1.3.1), LANG=zh_CN LC_ALL=zh_CN.GBK:

    Hello, It's: Tue Jul 30 11:07:34 CST 2002
    ====== System available locales: ========
    [locale list omitted: the same locales as in the first run, with the display names shown in Chinese]
    ====== System property ========
    -- listing properties --
    java.runtime.name=Java(TM) 2 Runtime Environment, Stand...
    sun.boot.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386
    java.vm.version=1.3.1_04-b02
    java.vm.vendor=Sun Microsystems Inc.
    java.vendor.url=http://java.sun.com/
    path.separator=:
    java.vm.name=Java HotSpot(TM) Client VM
    file.encoding.pkg=sun.io
    java.vm.specification.name=Java Virtual Machine Specification
    user.dir=/home/chedong/src/char_test
    java.runtime.version=1.3.1_04-b02
    java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment
    os.arch=i386
    java.io.tmpdir=/tmp
    line.separator=
    java.vm.specification.vendor=Sun Microsystems Inc.
    java.awt.fonts=
    os.name=Linux
    java.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386:/u...
    java.specification.name=Java Platform API Specification
    java.class.version=47.0
    os.version=2.4.7-10
    user.home=/home/chedong
    user.timezone=Asia/Shanghai
    java.awt.printerjob=sun.awt.motif.PSPrinterJob
    file.encoding=GBK
    java.specification.version=1.3
    user.name=chedong
    java.class.path=/home/chedong/classes
    java.vm.specification.version=1.0
    java.home=/usr/java/jdk1.3.1_04/jre
    user.language=zh
    java.specification.vendor=Sun Microsystems Inc.
    java.vm.info=mixed mode
    java.version=1.3.1_04
    java.ext.dirs=/usr/java/jdk1.3.1_04/jre/lib/ext
    sun.boot.class.path=/usr/java/jdk1.3.1_04/jre/lib/rt.jar:...
    java.vendor=Sun Microsystems Inc.
    file.separator=/
    java.vendor.url.bug=http://java.sun.com/cgi-bin/bugReport...
    sun.cpu.endian=little
    sun.io.unicode.encoding=UnicodeLittle
    user.region=CN
    sun.cpu.isalist=

Output under Windows 2000 (J2SE 1.3.0), regional setting: Chinese (China):

    Hello, It's: Tue Jul 30 11:49:36 CST 2002
    ====== System available locales: ========
    [locale list omitted: it repeats the list from the first run]
    ====== System property ========
    -- listing properties --
    java.runtime.name=Java(TM) 2 Runtime Environment, Stand...
    sun.boot.library.path=C:\Program Files\JavaSoft\JRE\1.3.0_0...
    java.vm.version=1.3.0_02
    java.vm.vendor=Sun Microsystems Inc.
    java.vendor.url=http://java.sun.com/
    path.separator=;
    java.vm.name=Java HotSpot(TM) Client VM
    file.encoding.pkg=sun.io
    java.vm.specification.name=Java Virtual Machine Specification
    user.dir=d:\java\src\char_test
    java.runtime.version=1.3.0_02
    java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment
    os.arch=x86
    java.io.tmpdir=d:\temp\
    line.separator=
    java.vm.specification.vendor=Sun Microsystems Inc.
    java.awt.fonts=
    os.name=Windows 98
    java.library.path=c:\windows;.;c:\windows\system;c:\win...
    java.specification.name=Java Platform API Specification
    java.class.version=47.0
    os.version=4.90
    user.home=C:\WINDOWS
    user.timezone=Asia/Shanghai
    java.awt.printerjob=sun.awt.windows.WPrinterJob
    file.encoding=GBK
    java.specification.version=1.3
    user.name=sicci
    java.class.path=d:\java\classes
    java.vm.specification.version=1.0
    java.home=C:\Program Files\JavaSoft\JRE\1.3.0_02
    user.language=zh
    java.specification.vendor=Sun Microsystems Inc.
    awt.toolkit=sun.awt.windows.WToolkit
    java.vm.info=mixed mode
    java.version=1.3.0_02
    java.ext.dirs=C:\Program Files\JavaSoft\JRE\1.3.0_0...
    sun.boot.class.path=C:\Program Files\JavaSoft\JRE\1.3.0_0...
    java.vendor=Sun Microsystems Inc.
    file.separator=\
    java.vendor.url.bug=http://java.sun.com/cgi-bin/bugReport...
    sun.cpu.endian=little
    sun.io.unicode.encoding=UnicodeLittle
    user.region=CN
    sun.cpu.isalist=pentium i486 i386

Output under Windows 2000 (J2SE 1.3.0), regional setting: English:

    Hello, It's: Tue Jul 30 11:53:27 CST 2002
    ====== System available locales: ========

    [locale list omitted: it repeats the list from the first run]
    ====== System property ========
    -- listing properties --
    java.runtime.name=Java(TM) 2 Runtime Environment, Stand...
    sun.boot.library.path=C:\Program Files\JavaSoft\JRE\1.3.0_0...
    java.vm.version=1.3.0_02
    java.vm.vendor=Sun Microsystems Inc.
    java.vendor.url=http://java.sun.com/
    path.separator=;
    java.vm.name=Java HotSpot(TM) Client VM
    file.encoding.pkg=sun.io
    java.vm.specification.name=Java Virtual Machine Specification
    user.dir=d:\java\src\char_test
    java.runtime.version=1.3.0_02
    java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment
    os.arch=x86
    java.io.tmpdir=d:\temp\
    line.separator=
    java.vm.specification.vendor=Sun Microsystems Inc.
    java.awt.fonts=
    os.name=Windows 98
    java.library.path=c:\windows;.;c:\windows\system;c:\win...
    java.specification.name=Java Platform API Specification
    java.class.version=47.0
    os.version=4.90
    user.home=C:\WINDOWS
    user.timezone=Asia/Shanghai
    java.awt.printerjob=sun.awt.windows.WPrinterJob
    file.encoding=Cp1252
    java.specification.version=1.3
    user.name=sicci
    java.class.path=d:\java\classes
    java.vm.specification.version=1.0
    java.home=C:\Program Files\JavaSoft\JRE\1.3.0_02
    user.language=en
    java.specification.vendor=Sun Microsystems Inc.
    awt.toolkit=sun.awt.windows.WToolkit
    java.vm.info=mixed mode
    java.version=1.3.0_02
    java.ext.dirs=C:\Program Files\JavaSoft\JRE\1.3.0_0...
    sun.boot.class.path=C:\Program Files\JavaSoft\JRE\1.3.0_0...
    java.vendor=Sun Microsystems Inc.
    file.separator=\
    java.vendor.url.bug=http://java.sun.com/cgi-bin/bugReport...
    sun.cpu.endian=little
    sun.io.unicode.encoding=UnicodeLittle
    user.region=GB
    sun.cpu.isalist=pentium i486 i386

Conclusion 1:

The JVM's default encoding is determined by the operating system's "local language environment" setting and is independent of the type of operating system. When the same locale is set, the default encoding under Linux and Windows is effectively no different (ISO-8859-1 and Cp1252 can be regarded as the same Western European encoding, with only 255 Latin characters). So from Test 2 onward I only list the results under Linux with the locale set to en_US and to zh_CN.GBK; the output under the corresponding regional settings on Windows is the same.

Test 2: The conversion between byte streams and character streams during Java input and output

The HelloUnicode.java program demonstrates how the string "Hello world 世界你好" (16 characters) is handled under different system default encodings. After each encoding/decoding step it prints, for every character, its byte value, its short value, and the Unicode block it belongs to.

LANG=en_US LC_ALL=en_US | LANG=zh_CN LC_ALL=zh_CN.GBK

    ======== Testing1: write hello world to files ========
    [test 1-1]: with system default encoding=ISO-8859-1
    string=Hello world 世界你好 length=20
    char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
    char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
    char[2]='l' byte=108 \u6c short=108 \u6c BASIC_LATIN
    char[3]='l' byte=108 \u6c short=108 \u6c BASIC_LATIN
    char[4]='o' byte=111 \u6f short=111 \u6f BASIC_LATIN
    char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
    char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
    char[7]='o' byte=111 \u6f short=111 \u6f BASIC_LATIN
    char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
    char[9]='l' byte=108 \u6c short=108 \u6c BASIC_LATIN
    char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
    char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
    char[12]='?' byte=-54 \uffffffca short=202 \uca LATIN_1_SUPPLEMENT
    char[13]='?' byte=-64 \uffffffc0 short=192 \uc0 LATIN_1_SUPPLEMENT
    char[14]='?' byte=-67 \uffffffbd short=189 \ubd LATIN_1_SUPPLEMENT
    char[15]='?' byte=-25 \uffffffe7 short=231 \ue7 LATIN_1_SUPPLEMENT
    char[16]='?' byte=-60 \uffffffc4 short=196 \uc4 LATIN_1_SUPPLEMENT
    char[17]='?' byte=-29 \uffffffe3 short=227 \ue3 LATIN_1_SUPPLEMENT
    char[18]='?' byte=-70 \uffffffba short=186 \uba LATIN_1_SUPPLEMENT
    char[19]='?' byte=-61 \uffffffc3 short=195 \uc3 LATIN_1_SUPPLEMENT

Step 1: In the English encoding environment, although the Chinese text is displayed correctly on the screen, what is actually printed is a series of "half" Chinese characters. The result is written to the first file, hello.orig.html.

    [test 1-2]: getBytes with platform default encoding and decoding as GB2312:

string=Hello world ???? length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

The string is turned back into a byte stream using the system default encoding and then decoded as GB2312. The characters print as question marks because the English environment does not know how to display characters above 255, so it shows each of them as '?'. The Unicode block and the short values nevertheless show that these are the correct Chinese characters. However, the next step writes the second file, hello.gb2312.html, without specifying an encoding (so the system default, ISO-8859-1, is used). As a result, what test 2-2 below reads back really are '?' characters.
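The re-encode and decode step of test 1-2 can be sketched as below. This is an illustration rather than the original program: the class and method names are hypothetical, and it assumes the running JDK provides the GB2312 charset (standard full JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HalfCharRecovery {
    /** Turns "one char per byte" text back into bytes, then decodes those bytes as GB2312. */
    static String redecodeAsGb2312(String oneCharPerByte) {
        // Chars 0..255 map one-to-one onto bytes in ISO-8859-1.
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        // The 8 Latin-1 "half characters" from test 1-1: CA C0 BD E7 C4 E3 BA C3
        String halves = "\u00CA\u00C0\u00BD\u00E7\u00C4\u00E3\u00BA\u00C3";
        String fixed = redecodeAsGb2312(halves);
        System.out.println(halves.length() + " chars -> " + fixed.length() + " chars");
        for (char c : fixed.toCharArray()) {
            System.out.println(Integer.toHexString(c) + " " + Character.UnicodeBlock.of(c));
        }
    }
}
```

The 8 Latin-1 chars collapse into 4 CJK chars, which matches the drop in string length from 20 to 16 between test 1-1 and test 1-2.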

[test 1-3]: convert string to UTF8
string=Hello world length=24
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT
char[13]='?' byte=-72 \uFFFFFFB8 short=184 \uB8 LATIN_1_SUPPLEMENT
char[14]='?' byte=-106 \uFFFFFF96 short=150 \u96 LATIN_1_SUPPLEMENT
char[15]='?' byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='?' byte=-107 \uFFFFFF95 short=149 \u95 LATIN_1_SUPPLEMENT
char[17]='?' byte=-116 \uFFFFFF8C short=140 \u8C LATIN_1_SUPPLEMENT
char[18]='?' byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT
char[19]='?' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[20]='?' byte=-96 \uFFFFFFA0 short=160 \uA0 LATIN_1_SUPPLEMENT
char[21]='?' byte=-27 \uFFFFFFE5 short=229 \uE5 LATIN_1_SUPPLEMENT
char[22]='?' byte=-91 \uFFFFFFA5 short=165 \uA5 LATIN_1_SUPPLEMENT
char[23]='?' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT

In the third test the string is encoded as UTF-8 and written to the third test file, hello.utf8.html. UTF-8 has no effect on the English text, but each of the Chinese characters is encoded in 3 bytes, so the Chinese part takes 50% more storage than in GB2312 (12 bytes instead of 8).
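The storage comparison can be checked with a short sketch like the one below. The class name EncodedSize is hypothetical, and the code assumes the JDK ships the GB2312 charset.

```java
import java.nio.charset.Charset;

final class EncodedSize {
    /** Number of bytes the string occupies when encoded with the named charset. */
    static int size(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String english = "Hello world ";
        String chinese = "\u4E16\u754C\u4F60\u597D";
        for (String name : new String[] {"GB2312", "UTF-8"}) {
            System.out.println(name + ": english=" + size(english, name)
                    + " bytes, chinese=" + size(chinese, name) + " bytes");
        }
    }
}
```

The ASCII part costs the same in both encodings; only the four Chinese characters grow, from 2 bytes each to 3.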

======== Testing2: reading and decoding from files ========

[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好 length=20
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[13]='?' byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[14]='?' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[15]='?' byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='?' byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[17]='?' byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[18]='?' byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[19]='?' byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT

The intermediate file hello.orig.html is read back using the system default encoding. Although it is read byte by byte (half a "character" at a time), the bytes are restored completely, so the displayed output shows no errors. This is in fact why PHP and similar applications rarely show garbled text: the whole process works on the byte stream, which reproduces the input faithfully, but at the cost of losing control over characters.
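Why the byte-oriented path "just works" can be seen in a small sketch: decoding as ISO-8859-1 and encoding again returns exactly the bytes that went in, whatever they meant originally. The class name BytePassThrough is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BytePassThrough {
    /** Decodes bytes as ISO-8859-1 and encodes the resulting string straight back. */
    static byte[] roundTrip(byte[] input) {
        String asChars = new String(input, StandardCharsets.ISO_8859_1); // one char per byte
        return asChars.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // GBK bytes of the four Chinese characters used throughout the tests.
        byte[] gbk = {(byte) 0xCA, (byte) 0xC0, (byte) 0xBD, (byte) 0xE7,
                      (byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        System.out.println("bytes preserved: " + Arrays.equals(gbk, roundTrip(gbk)));
        System.out.println("chars seen by the program: "
                + new String(gbk, StandardCharsets.ISO_8859_1).length());
    }
}
```

The bytes survive, but the program sees 8 chars instead of 4, so operations such as length, substring, or search work on half characters.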

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

This is the worst kind of output: these '?' are genuine question mark characters, char(63). Once the data ends up like this, the original characters cannot be recovered.
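How real question marks get into the data can be reproduced with a minimal sketch: encoding characters that ISO-8859-1 cannot represent silently substitutes the byte 63. The class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class LossyEncode {
    /** Encodes with ISO-8859-1; unmappable characters become the replacement byte '?'. */
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** True when every character of s can be represented in ISO-8859-1. */
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String chinese = "\u4E16\u754C\u4F60\u597D";
        for (byte b : encodeLatin1(chinese)) {
            System.out.println("byte=" + b); // each of the four characters is stored as 63
        }
        System.out.println("fits ISO-8859-1: " + fitsLatin1(chinese));
    }
}
```

Checking with canEncode before writing is one way to detect the loss while the original characters are still available.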

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

Great! The characters are displayed as '?', but the decoding is actually correct, as the Unicode block and the short values show.

LANG=zh_CN LC_ALL=zh_CN.GBK

======== Testing1: write hello world to files ========

[test 1-1]: with system default encoding=GBK
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

Note: running the tests above under the new locale requires recompiling the source program, because the earliest decoding of a byte stream into a character stream happens when javac compiles the source file. The biggest difference from the English-locale run is that "世界你好" in the source file is no longer compiled byte by byte into 8 characters (which really correspond to 8 bytes). It is decoded into 4 characters, which is why the string length is now 16.

[test 1-2]: getBytes with platform default encoding and decoding as GB2312:
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

In the Chinese environment the decoding charset is consistent with the default encoding, so the output is the same as in test 1-1.

[test 1-3]: convert string to UTF8
string=Hello world 涓栫晫浣犲ソ length=18
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='涓' byte=-109 \uFFFFFF93 short=28051 \u6D93 CJK_UNIFIED_IDEOGRAPHS
char[13]='栫' byte=43 \u2B short=26667 \u682B CJK_UNIFIED_IDEOGRAPHS
char[14]='晫' byte=107 \u6B short=26219 \u666B CJK_UNIFIED_IDEOGRAPHS
char[15]='浣' byte=99 \u63 short=28003 \u6D63 CJK_UNIFIED_IDEOGRAPHS
char[16]='犲' byte=-78 \uFFFFFFB2 short=29362 \u72B2 CJK_UNIFIED_IDEOGRAPHS
char[17]='ソ' byte=-67 \uFFFFFFBD short=12477 \u30BD KATAKANA

The terminal window used for this test is itself a GBK application, so this output is what UTF-8 encoded text looks like when it is decoded as GBK.
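The garbling in test 1-3 can be reproduced in a few lines: encode the four Chinese characters as UTF-8, then decode the 12 resulting bytes as GBK. This is a sketch with a hypothetical class name, and it assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Mojibake {
    /** Encodes s as UTF-8 and then (wrongly) decodes the bytes as GBK. */
    static String utf8ReadAsGbk(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String garbled = utf8ReadAsGbk("\u4E16\u754C\u4F60\u597D");
        System.out.println("length=" + garbled.length());
        for (char c : garbled.toCharArray()) {
            System.out.println(Integer.toHexString(c) + " " + Character.UnicodeBlock.of(c));
        }
    }
}
```

Each pair of UTF-8 bytes happens to form a valid GBK double-byte character, so 12 bytes turn into 6 unrelated characters, which together with the 12 ASCII characters gives the length of 18 reported above.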

======== Testing2: reading and decoding from files ========

[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

Conclusion: if back-end data is stored as Unicode, and the character set used for encoding and decoding is specified explicitly where needed, the application is almost unaffected by the character set settings of the environment in which the front end runs.

Some conclusions from test 2:

All applications process data as byte stream => character stream => byte stream: byte_stream ==[input decoding]==> unicode_char_stream ==[output encoding]==> byte_stream;

In Java, converting a byte stream to a character stream (or the reverse) always involves an implicit decoding or encoding step, which uses the system default encoding unless another one is specified;

The earliest byte stream decoding happens when javac compiles the Java source code;

The storage unit for a character in Java is a double-byte Unicode value;
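The byte stream => character stream => byte stream pipeline can be written with both charsets named explicitly, so that nothing depends on the system default. The following is a minimal sketch using in-memory streams; the class name Transcoder is hypothetical and the GBK charset is assumed to be available in the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    /** byte stream ==[decode as 'from']==> chars ==[encode as 'to']==> byte stream */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n); // inside the loop the data is plain Unicode chars
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
        return out.toByteArray(); // the writer has been closed (and flushed) at this point
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xCA, (byte) 0xC0, (byte) 0xBD, (byte) 0xE7,
                      (byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        byte[] utf8 = transcode(gbk, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(gbk.length + " GBK bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

Passing the charset to InputStreamReader and OutputStreamWriter is what replaces the implicit default mentioned in the conclusions above; the same constructors accept file or socket streams.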

Test 3: Encoding of input and output in web applications

Java was designed for international applications, so a servlet should switch character sets automatically according to the browser's language. Even in a Java-based web application, what travels between the server and the client is still a byte stream. Suppose I submit "世界你好" from a form in a Chinese browser. The browser first encodes the text as a GBK byte stream, CA C0 BD E7 C4 E3 BA C3, and then URL-encodes these 8 bytes as %CA%C0%BD%E7%C4%E3%BA%C3. How should the server decode the request when it arrives, and how should it encode the output?

Under the current Servlet specification, if nothing is specified, both the ServletRequest input and the ServletResponse output are decoded and encoded as ISO-8859-1 (note that this encoding/decoding is independent of the language environment of the operating system). So even if the server's operating system locale is Chinese, the request above is decoded into 8 Unicode characters as if it were English, and on output those characters are encoded back into 8 bytes. The browser displays the result correctly if it is set to Chinese, but the application is really handling bytes. The correct approach is to set the decoding of ServletRequest and the encoding of ServletResponse according to the language of the client browser. HelloUnicodeServlet.java is an example that detects the client browser's language settings.

When the servlet finds Simplified Chinese in the browser's "Accept-Language" header, it sets both the request decoding and the response character set encoding to GBK:

// auto-detect the browser's language
String clientLanguage = req.getHeader("Accept-Language");

// for Simplified Chinese ("zh-cn".equals(...) is also safe when the header is missing)
if ("zh-cn".equals(clientLanguage)) {
    req.setCharacterEncoding("GBK");
    res.setContentType("text/html; charset=GBK");
}

The output is:

'世界你好' length=4
ServletRequest's Charset Encoding = GBK
ServletResponse's Charset Encoding = GBK
char[0]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[1]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[2]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[3]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

A counter-test: with the browser language detection removed, the input and output are decoded and encoded as ISO-8859-1, and the servlet becomes a "byte application":

'世界你好' length=8
ServletRequest's Charset Encoding = null
ServletResponse's Charset Encoding = ISO-8859-1
char[0]='?' byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[1]='?' byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[2]='?' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[3]='?' byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[4]='?' byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[5]='?' byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[6]='?' byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[7]='?' byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT

Although this output displays correctly when the browser's character set is set to Chinese, what is actually being processed is "bytes" rather than Chinese "characters". For the definition of the ISO-8859-1 default used by ServletRequest and ServletResponse, see:

http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)

http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ServletResponse.html#setContentType()
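The difference between the two servlet runs comes down to which charset is used to decode the URL-encoded form bytes. Outside a servlet container this can be sketched with java.net.URLDecoder. The class name FormDecode is hypothetical, the Charset overload of URLDecoder.decode requires Java 10 or later, and the GBK charset is assumed to be available.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FormDecode {
    /** Decodes a URL-encoded form value, interpreting the percent-escaped bytes with the given charset. */
    static String decode(String urlEncoded, Charset charset) {
        return URLDecoder.decode(urlEncoded, charset);
    }

    public static void main(String[] args) {
        String submitted = "%CA%C0%BD%E7%C4%E3%BA%C3"; // form value sent by a GBK browser
        String asGbk = decode(submitted, Charset.forName("GBK"));
        String asLatin1 = decode(submitted, StandardCharsets.ISO_8859_1);
        System.out.println("decoded as GBK:        length=" + asGbk.length());
        System.out.println("decoded as ISO-8859-1: length=" + asLatin1.length());
    }
}
```

Decoding as GBK yields 4 characters and decoding as ISO-8859-1 yields 8, the same lengths as in the two servlet outputs above.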

For a long time I was puzzled that a web application configured on GNU/Linux could display Chinese both when it was encoded as GBK and when it was encoded as ISO-8859-1. After thinking it over carefully, I finally understood. In an internationalized application, the encoding and decoding used by ServletRequest and ServletResponse should not be fixed to one character set chosen on the server, but should adapt to the language environment of the client. In a web application designed according to internationalization standards:

Try to avoid Chinese in the source code of the servlet. The servlet mainly plays the role of the controller (C), so it should hand off to the corresponding view (JSP or XSLT) through the ResourceBundle mechanism. Everything tied to the local interface language should be stripped out of the servlet and the back-end modules and placed in the corresponding ResourceBundle files or XSLT files. The source code is then entirely in English, and the character set problem does not need to be considered at all.

If the servlet really does need to contain Chinese, set the javac compile options of the application server and add the -encoding option with the system default character set. If characters written in Chinese are decoded as English at compile time and then encoded as English on output, the result looks correct on the surface, but it is really "byte" programming.

The servlet layer should be designed, as the Google search engine is, to adapt its output to the language environment of the client browser. The servlet code for determining the browser language:

public void doGet(HttpServletRequest req, HttpServletResponse res)
        throws ServletException, IOException {
    // Get the client's language setting from the HTTP request header
    String clientLanguage = req.getHeader("Accept-Language");

    // Simplified Chinese browser
    if ("zh-cn".equals(clientLanguage)) {
        req.setCharacterEncoding("GBK");
        res.setContentType("text/html; charset=GBK");
    }
    // Traditional Chinese browser
    else if ("zh-tw".equals(clientLanguage)) {
        req.setCharacterEncoding("BIG5");
        res.setContentType("text/html; charset=BIG5");
    }
    // Japanese browser (the Accept-Language tag for Japanese is "ja", not "JP")
    else if ("ja".equals(clientLanguage)) {
        req.setCharacterEncoding("sjis");
        res.setContentType("text/html; charset=sjis");
    }
    // Default: English browser
    else {
        req.setCharacterEncoding("ISO-8859-1");
        res.setContentType("text/html; charset=ISO-8859-1");
    }
    ...
    // With the request decoding and the response encoding set, carry on with
    // the subsequent operations, such as forwarding to HelloWorld.gbk.jsp,
    // HelloWorld.big5.jsp, HelloWorld.jis.jsp, etc.
}

That the servlet defaults the character set to ISO-8859-1 is perhaps because the authors of the standard assumed that English browsers make up the majority, and in practice applications often do follow the ISO-8859-1 mode.
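Real browsers usually send a list such as "zh-CN,zh;q=0.9,en;q=0.8" rather than a bare "zh-cn", and the header may be missing altogether, so an exact string comparison is fragile. One possible way to make the choice more tolerant is sketched below. The class CharsetChooser is hypothetical, it looks only at the first language in the list, and treating any other "zh" tag as Simplified Chinese is an assumption of this sketch, not something the servlet specification prescribes.

```java
import java.util.Locale;

final class CharsetChooser {
    /** Maps an Accept-Language header value to a charset name; falls back to ISO-8859-1. */
    static String charsetFor(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.isEmpty()) {
            return "ISO-8859-1";
        }
        // Look only at the first (most preferred) entry, e.g. "zh-CN" in "zh-CN,zh;q=0.9".
        int comma = acceptLanguage.indexOf(',');
        String first = (comma < 0 ? acceptLanguage : acceptLanguage.substring(0, comma))
                .trim().toLowerCase(Locale.ROOT);
        if (first.startsWith("zh-tw")) return "Big5";      // Traditional Chinese
        if (first.startsWith("zh")) return "GBK";          // assumption: other zh tags are Simplified
        if (first.startsWith("ja")) return "Shift_JIS";    // Japanese
        return "ISO-8859-1";                               // default, as in the servlet above
    }

    public static void main(String[] args) {
        String[] samples = {"zh-cn", "zh-CN,zh;q=0.9,en;q=0.8", "zh-tw", "ja", "en-US"};
        for (String header : samples) {
            System.out.println(header + " -> " + charsetFor(header));
        }
        System.out.println("(no header) -> " + charsetFor(null));
    }
}
```

In a servlet the returned name would be passed to setCharacterEncoding and placed in the charset parameter of setContentType, as in the doGet method shown earlier.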

Conclusion:

Some conclusions from the Java test programs above:

The Java environment is a virtual machine application running on top of the operating system. So if the operating system follows internationalization standards, the default encoding of the JVM can be changed by modifying the locale settings of the operating system. For a Java application, as long as the default encoding on Linux is set to GBK, its text encoding behavior should be consistent with that on the Chinese Windows platform.

RedHat 6.x is based on glibc 2.1.x, which does not support Chinese locales, so the default encoding of the JVM cannot be changed there through the locale settings. Systems from Linux kernel 2.4 onward are based on glibc 2.2.x, which has fairly good support for Chinese locales.

Different JVMs support different character sets. For example, IBM's JVM supports GB18030 starting with 1.3.0, and Sun's JVM has also begun to support it.

Correct encoding does not necessarily mean correct display; correct display also needs support from the front-end display system (fonts). For server applications on Linux, however, it is usually enough to confirm that characters are encoded correctly in the specified way.

If an application does all of its processing in Unicode and stores data centrally in UTF-8, producing output according to the client's language environment is most convenient;
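Which default encoding a particular JVM picked up, and which of the character sets discussed here it supports, can be checked at run time with java.nio.charset.Charset. The sketch below uses a hypothetical class name. Note that on recent JDKs (18 and later) the default charset is UTF-8 regardless of the operating system locale, so the locale dependence described above applies to older releases.

```java
import java.nio.charset.Charset;

final class CharsetSupport {
    /** Reports the JVM default charset and whether each named charset is supported. */
    static String report(String... names) {
        StringBuilder sb = new StringBuilder("default=" + Charset.defaultCharset().name());
        for (String name : names) {
            sb.append(' ').append(name).append('=').append(Charset.isSupported(name));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The result depends on the JDK build and, on older JDKs, on the locale settings.
        System.out.println(report("GBK", "GB2312", "GB18030", "Big5", "UTF-8"));
    }
}
```

Running it once with LANG=en_US and once with LANG=zh_CN.GBK on an older JDK is a quick way to confirm whether the locale change actually reached the JVM.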

Based on the conclusions above, there are two processing modes to consider when designing an application that adapts to a multi-language environment:

Client applications or localized applications: let the Java application follow the system locale and decode and encode with the default character set of the system, which reduces the complexity of encoding handling in the application.
Input byte stream ==> decode the byte stream using the character set of the system locale ==> Unicode processing ==> encode the Unicode into a byte stream using the character set of the system locale ==> output byte stream

Server-side or cross-language platform applications: determine the user's locale at the outermost layer of the application, where data is input and output, and store data in Unicode at the core. The regional character sets (GB2312, BIG5) can be regarded as subsets of Unicode, so data stored as Unicode can easily be converted to any of them. Storing data as UTF-8 takes more space, but it greatly simplifies localization (L10N) of the front-end application.

[Diagram: Simplified Chinese input and output; decoding and encoding are determined by the user's language environment; intermediate processing in Unicode with UTF-8 encoded storage.]

As more and more systems and platforms support Unicode (Python, Perl, glibc, and others), we should value the fact that Java itself was designed according to internationalization standards. Combined with the newly developed XML specifications, I believe that application designs that follow internationalization standards will show more advantages in the long run.

TODO: test character set problems in database applications: MySQL, Oracle, JDBC

References:

Java internationalization design http://java.sun.com/docs/books/tutorial/i18n/index.html

Linux internationalization, localization, and Chinese support http://www.linuxforum.net/doc/i18n-new.html

A must-read for Linux programmers: Chinese localization and the GB18030 standard http://www.ccidnet.com/tech/OS/2001/07/31/58_2811.html

Unicode FAQ http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.linuxforum.net/books/utf-8-unicode.html (Chinese)

Analysis and solutions for Chinese character problems in Java programming http://www-900.ibm.com/developerWorks/cn/java/java_chinese/index.shtml

Chinese character codes: http://www.unihan.com.cn/cjk/ana17.htm

Encodings supported by different JVM versions http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html

Appendix:

A. The Unicode 2.0 Character Set

Characters          Description
\u0000 - \u1FFF     Alphabets
  \u0020 - \u007F   Basic Latin
  \u0080 - \u00FF   Latin-1 supplement
  \u0100 - \u017F   Latin extended-A
  \u0180 - \u024F   Latin extended-B
  \u0250 - \u02AF   IPA extensions
  \u02B0 - \u02FF   Spacing modifier letters
  \u0300 - \u036F   Combining diacritical marks
  \u0370 - \u03FF   Greek
  \u0400 - \u04FF   Cyrillic
  \u0530 - \u058F   Armenian
  \u0590 - \u05FF   Hebrew
  \u0600 - \u06FF   Arabic
  \u0900 - \u097F   Devanagari
  \u0980 - \u09FF   Bengali
  \u0A00 - \u0A7F   Gurmukhi
  \u0A80 - \u0AFF   Gujarati
  \u0B00 - \u0B7F   Oriya
  \u0B80 - \u0BFF   Tamil
  \u0C00 - \u0C7F   Telugu
  \u0C80 - \u0CFF   Kannada
  \u0D00 - \u0D7F   Malayalam
  \u0E00 - \u0E7F   Thai
  \u0E80 - \u0EFF   Lao
  \u0F00 - \u0FBF   Tibetan
  \u10A0 - \u10FF   Georgian
  \u1100 - \u11FF   Hangul Jamo
  \u1E00 - \u1EFF   Latin extended additional
  \u1F00 - \u1FFF   Greek extended
\u2000 - \u2FFF     Symbols and punctuation
  \u2000 - \u206F   General punctuation
  \u2070 - \u209F   Superscripts and subscripts
  \u20A0 - \u20CF   Currency symbols
  \u20D0 - \u20FF   Combining diacritical marks for symbols
  \u2100 - \u214F   Letterlike symbols
  \u2150 - \u218F   Number forms
  \u2190 - \u21FF   Arrows
  \u2200 - \u22FF   Mathematical operators
  \u2300 - \u23FF   Miscellaneous technical
  \u2400 - \u243F   Control pictures
  \u2440 - \u245F   Optical character recognition
  \u2460 - \u24FF   Enclosed alphanumerics
  \u2500 - \u257F   Box drawing
  \u2580 - \u259F   Block elements
  \u25A0 - \u25FF   Geometric shapes
  \u2600 - \u26FF   Miscellaneous symbols
  \u2700 - \u27BF   Dingbats
\u3000 - \u33FF     CJK auxiliary
  \u3000 - \u303F   CJK symbols and punctuation
  \u3040 - \u309F   Hiragana
  \u30A0 - \u30FF   Katakana
  \u3100 - \u312F   Bopomofo
  \u3130 - \u318F   Hangul compatibility Jamo
  \u3190 - \u319F   Kanbun
  \u3200 - \u32FF   Enclosed CJK letters and months
  \u3300 - \u33FF   CJK compatibility
\u4E00 - \u9FFF     CJK unified ideographs: Han characters used in China, Japan, Korea, Taiwan, and Vietnam
\uAC00 - \uD7A3     Hangul syllables
\uD800 - \uDFFF     Surrogates
  \uD800 - \uDB7F   High surrogates
  \uDB80 - \uDBFF   High private use surrogates
  \uDC00 - \uDFFF   Low surrogates
\uE000 - \uF8FF     Private use
\uF900 - \uFFFF     Miscellaneous
  \uF900 - \uFAFF   CJK compatibility ideographs
  \uFB00 - \uFB4F   Alphabetic presentation forms
  \uFB50 - \uFDFF   Arabic presentation forms-A
  \uFE20 - \uFE2F   Combining half marks
  \uFE30 - \uFE4F   CJK compatibility forms
  \uFE50 - \uFE6F   Small form variants
  \uFE70 - \uFEFE   Arabic presentation forms-B
  \uFEFF            Specials
  \uFF00 - \uFFEF   Halfwidth and fullwidth forms
  \uFFF0 - \uFFFF   Specials

Original source: http://www.chedong.com/tech/hello_unicode.html

When reprinting, please cite the original address: https://www.9cbs.com/read-65337.html
