What is Unicode in the Java programming language

Mouad Oumous
Jun 4, 2021


Unicode is Java’s native character set. Each Java char is a two-byte, unsigned number with a value between 0 and 65,535. This provides enough space for characters from all the world’s alphabetic scripts and the most common characters from the ideographic scripts of Chinese and Japanese. Unicode 2.1 defines 38,887 different characters from many languages, including English, Russian, Arabic, Hebrew, Greek, Thai, Korean, and Sanskrit. The most common ideographic characters from Japanese and Chinese are also included. However, Chinese alone contains over 80,000 different ideograms, so it’s impossible to include them all in a two-byte set. Later versions of Unicode extend past the two-byte range to cover the remaining Chinese and Japanese ideographs, in step with the four-byte Universal Character Set (UCS). Since Java 5, a character outside the two-byte range is stored as a pair of char values (a surrogate pair).
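The two-byte, unsigned range is easy to check from Java itself. The snippet below is a minimal sketch that only uses constants from java.lang.Character; the class name CharRangeDemo and the helper valueAfterMax are made up for illustration.

```java
// Hypothetical demo class; the name CharRangeDemo is illustrative only.
final class CharRangeDemo {

    // Increment the largest char value and report what it becomes.
    static int valueAfterMax() {
        char c = Character.MAX_VALUE; // 65,535
        c++;                          // wraps around: char is an unsigned 16-bit type
        return c;
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);            // 16 (bits per char)
        System.out.println(Character.BYTES);           // 2
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535
        System.out.println(valueAfterMax());           // 0
    }
}
```

The wrap-around from 65,535 back to 0 is what "unsigned" means in practice: a char never holds a negative value.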

The first 128 Unicode characters (characters 0 through 127) are identical to the ASCII character set. 32 is the ASCII space; therefore, 32 is the Unicode space. 33 is the ASCII exclamation point, so 33 is the Unicode exclamation point, and so on. Table B.1 shows this character set. The next 128 Unicode characters (characters 128 through 255) have the same values as the equivalent characters in the Latin-1 character set defined by ISO standard 8859-1. Latin-1, a slight variation of which is used by Windows, adds the various accented characters, umlauts, cedillas, upside-down question marks, and other characters needed to write text in most Western European languages. Table B.2 shows these characters. The first 128 characters in Latin-1 are identical to the ASCII character set.
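One way to see this overlap is to encode each char with the JDK's US-ASCII and ISO-8859-1 charsets and compare the resulting byte with the char's own value. This is only a sketch: Latin1Demo, sameValue, and rangeMatches are hypothetical names, while StandardCharsets and String.getBytes(Charset) are standard library calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
final class Latin1Demo {

    // True if ch encodes to exactly one byte whose unsigned value
    // equals the char's Unicode value.
    static boolean sameValue(char ch, Charset charset) {
        byte[] encoded = String.valueOf(ch).getBytes(charset);
        return encoded.length == 1 && (encoded[0] & 0xFF) == ch;
    }

    // True if every char value in [from, to] keeps its value in the charset.
    static boolean rangeMatches(int from, int to, Charset charset) {
        for (int v = from; v <= to; v++) {
            if (!sameValue((char) v, charset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println((int) ' '); // 32
        System.out.println((int) '!'); // 33
        System.out.println(rangeMatches(0, 127, StandardCharsets.US_ASCII));   // true
        System.out.println(rangeMatches(0, 255, StandardCharsets.ISO_8859_1)); // true
        // An accented e (value 233) is in Latin-1 but has no ASCII encoding.
        System.out.println(sameValue('\u00E9', StandardCharsets.ISO_8859_1));  // true
        System.out.println(sameValue('\u00E9', StandardCharsets.US_ASCII));    // false
    }
}
```

The last line prints false because the ASCII encoder substitutes a question mark for any character it cannot represent.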

Values beyond 255 encode characters from various other character sets. Where possible, character blocks describing a particular group of characters map onto established encodings for that set of characters by simple transposition. For instance, Unicode characters 880 through 1023 encode the Greek alphabet and associated characters like the Greek question mark (;).[1] The letters in this block are a direct transposition by 720 of their positions in the upper half (characters 128 through 255) of the ISO 8859-7 character set, which is in turn based on the Greek national standard ELOT 928. For example, the small letter delta, δ, ISO 8859-7 character 228, is Unicode character 948. A small epsilon, ε, ISO 8859-7 character 229, is Unicode character 949. In general, the Unicode value for a Greek letter equals the ISO 8859-7 value for the letter plus 720. Other character sets are included in Unicode in a similar fashion whenever possible.
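The offset between the two encodings can be measured rather than taken on trust. The sketch below decodes single ISO 8859-7 bytes and subtracts the byte value from the resulting Unicode value. It assumes the running JDK provides a charset named "ISO-8859-7", which common JDK builds do but which is not among the charsets every Java platform is required to support; GreekOffsetDemo, toUnicode, and offset are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names here are illustrative only.
final class GreekOffsetDemo {

    // Assumption: this JDK ships the ISO-8859-7 charset. If it does not,
    // Charset.forName throws UnsupportedCharsetException.
    static final Charset GREEK = Charset.forName("ISO-8859-7");

    // Decode one ISO 8859-7 byte value (0..255) to its Unicode value.
    static int toUnicode(int isoValue) {
        String decoded = new String(new byte[] {(byte) isoValue}, GREEK);
        return decoded.charAt(0);
    }

    // Distance between a character's Unicode and ISO 8859-7 values.
    static int offset(int isoValue) {
        return toUnicode(isoValue) - isoValue;
    }

    public static void main(String[] args) {
        System.out.println(toUnicode(228)); // 948, small delta
        System.out.println(toUnicode(229)); // 949, small epsilon
        System.out.println(offset(228));    // 720

        // Small letters alpha through omega occupy 225..249 in ISO 8859-7.
        boolean constant = true;
        for (int v = 225; v <= 249; v++) {
            constant &= offset(v) == 720;
        }
        System.out.println(constant);       // true
    }
}
```

The loop covers only the small letters. Punctuation and symbols in the upper half of ISO 8859-7 do not all follow the same offset, so the check is deliberately limited to a range where the transposition is known to be uniform.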

NextStep, BeOS, MacOS X Server, Bell Labs' Plan 9, and Windows NT 4.0 all support Unicode to some extent. Unicode support in MacOS and Windows 98 is more nascent, but it’s coming. Application software is a little slower to appear, but Microsoft Word 97 and 98, Netscape Navigator 4.0, and Internet Explorer 4.0 all support Unicode. The big hold-up on most systems is fonts and input methods. Windows NT 5.0 will include fonts covering most of the defined Unicode characters as well as input methods for most major languages.
