This section gives a general description of the obstacles you may run into when using Chinese on Linux, so that you can pinpoint the cause more easily when you meet them. In fact, these shortcomings are not unique to Linux; they appear on other systems as well, and arguably concern the computing environment as a whole. If this section is not to your taste, or you are eager to get started, you can jump straight to the section Display and Input Chinese!
As we all know, a Chinese character is represented by two bytes in a computer. The most popular encodings are BIG5, used in Taiwan, and GB, used in mainland China. The first byte of each character almost always has a value of 128 or greater, which is what we call a non-ASCII code. (ASCII codes are those smaller than 128.)
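To make the two-byte layout concrete, here is a minimal sketch in Python. It assumes Python's built-in "big5" and "gb2312" codecs are adequate stand-ins for the BIG5 and GB encodings mentioned above; the sample character and the helper name describe are only illustrative.

```python
# Encode one Chinese character with the two encodings named in the text.
# Assumption: Python's "big5" and "gb2312" codecs stand in for BIG5 and GB.
ch = "中"
big5_bytes = ch.encode("big5")
gb_bytes = ch.encode("gb2312")


def describe(data: bytes) -> str:
    """Return the bytes as space-separated hexadecimal values."""
    return " ".join(f"0x{b:02X}" for b in data)


print("BIG5:", describe(big5_bytes))
print("GB:  ", describe(gb_bytes))

# The first byte of each encoded form is 128 or greater, i.e. non-ASCII.
print("first byte non-ASCII:", big5_bytes[0] >= 128, gb_bytes[0] >= 128)
```

The same character gets different byte values in each encoding, which is why a program has to know which encoding it is dealing with.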
So what? Here is the point: for various reasons, many early programs did not consider the possibility that non-ASCII codes could be part of their input.
Such programs assume that the data they handle is confined to the ASCII range. Worst of all, when they do meet non-ASCII codes, they most often behave as though such codes could not exist and simply truncate the 8th bit. This is the so-called 8-bit clean problem.
For example, a program may take it for granted that all input is 7-bit ASCII. When you enter Chinese characters, it clears the 8th bit of each byte, so the Chinese input turns into garbage.
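The effect of a program that is not 8-bit clean can be imitated in a few lines of Python. This is only a sketch: strip_high_bit is a hypothetical helper that mimics what such a program does to its input, and the "big5" codec is assumed as the encoding.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the 8th bit of every byte, as a non-8-bit-clean program would."""
    return bytes(b & 0x7F for b in data)


original = "中文".encode("big5")   # two characters, four bytes
damaged = strip_high_bit(original)

print("original:", original.hex(" "))
print("damaged: ", damaged.hex(" "))

# Every damaged byte now falls in the ASCII range, so the text is shown
# as unrelated ASCII characters instead of Chinese.
print("as ASCII:", damaged.decode("ascii"))
```

The damage cannot be undone afterwards, because the cleared bit carried information that is no longer present in the data.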
Communication programs on the Internet could often transmit only 7-bit data. A notorious example is the earlier sendmail program. sendmail could only send and receive 7-bit mail, which led to many odd encoding schemes (uuencode, base64, QP and so on) being adopted for sending Chinese mail, and these encodings were a considerable nuisance to recipients.
(I have often thought that if the founders of email had shown more foresight, we might have far fewer problems today.)
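As a rough illustration of how base64 and QP (quoted-printable) squeeze 8-bit data through a 7-bit channel, the following sketch uses Python's standard base64 and quopri modules. The sample text and the "big5" codec are assumptions chosen for the example; this shows only the encodings themselves, not how any particular mail program applies them.

```python
import base64
import quopri

raw = "中文".encode("big5")          # 8-bit data that a 7-bit channel would damage

b64 = base64.b64encode(raw)          # base64 form, ASCII only
qp = quopri.encodestring(raw)        # quoted-printable form, ASCII only

print("raw:   ", raw.hex(" "))
print("base64:", b64.decode("ascii"))
print("QP:    ", qp.decode("ascii"))

# Both encoded forms use only 7-bit values and can be decoded back exactly.
assert base64.b64decode(b64) == raw
assert quopri.decodestring(qp) == raw
```

The price is that a recipient whose mail program does not decode the message sees the encoded text rather than the Chinese, which is the nuisance described above.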
The problem is even more complicated on the Internet. Even if both you and your recipient run a sendmail program that can handle Chinese mail, the recipient may still get garbled mail. This is because a message may pass through several hosts on the Internet before reaching its destination; if the sendmail on any one of those hosts cuts off the 8th bit, the message is ruined.
For programs with a client/server architecture, the problem may lie on the client side, on the server side, or on both.
Apart from being unable to handle non-ASCII data, applications that cannot recognize the Chinese encoding are another major problem. That is, most programs (even those that handle 8-bit data correctly) treat a Chinese character as two independent bytes. This is harmless in some situations but disastrous in others.
The most obvious case: even if you can input Chinese characters properly, pressing the backspace key once to delete a whole character splits the character into two parts. Only one byte (one column) is erased on the screen, and the remaining half becomes garbage. Moreover, some text editors may break a line at the second byte of a Chinese character, which also produces garbage. Besides, these editors may treat a long Chinese sentence as a long English sentence with no place to break the line, making the screen ugly and chaotic.
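The backspace problem can be sketched as follows. The example assumes the "big5" codec; is_valid is a hypothetical helper, and the two "backspace" operations are simplified models of what an encoding-unaware and an encoding-aware program would do.

```python
def is_valid(data: bytes, encoding: str = "big5") -> bool:
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


line = "中文".encode("big5")      # two characters, four bytes

# Byte-based backspace: removes one byte and leaves half a character behind.
naive = line[:-1]

# Character-based backspace: decode, drop one whole character, re-encode.
aware = line.decode("big5")[:-1].encode("big5")

print("naive result valid:", is_valid(naive))
print("aware result valid:", is_valid(aware))
```

Breaking a line at a byte position has the same flaw: the cut must fall between characters, not between the two bytes of one character.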
There are worse cases, too! Some Chinese characters contain bytes that have a special meaning to certain applications, which can cause those programs to malfunction severely, or simply crash, when they encounter them.
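The text does not name a specific case, but one commonly cited instance is BIG5 characters whose second byte is 0x5C, the ASCII backslash, which many programs treat as an escape character. The sketch below checks a few such characters with Python's "big5" codec; the character choice and the helper name contains_backslash_byte are illustrative assumptions.

```python
BACKSLASH = 0x5C  # ASCII code of the backslash character


def contains_backslash_byte(ch: str, encoding: str = "big5") -> bool:
    """Return True if the encoded character contains the byte 0x5C."""
    return BACKSLASH in ch.encode(encoding)


for ch in "許功蓋中":
    encoded = ch.encode("big5")
    print(ch, encoded.hex(" "), "backslash byte:", contains_backslash_byte(ch))
```

A program that scans its input byte by byte for backslashes will misread the second half of such a character as the start of an escape sequence.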
The sections below propose some solutions, but they are piecemeal, incomplete and unsatisfactory. Perhaps the problems will only truly be solved when all software supports Chinese.
However, more and more programs recognize the importance of internationalization. For example, the sendmail programs on most hosts can now handle 8-bit mail correctly; not only Chinese mail but also many multimedia mails need 8 bits.
A lot of software already needs no modification at all, or only needs a few special options turned on, to be used with Chinese.
At the same time, more and more people are devoting themselves to creating Chinese software. Let us wait and look forward to it.