Next Previous Contents

10. How does my computer store things in memory?

You probably know that everything on a computer is stored as strings of bits (binary digits; you can think of them as lots of little on-off switches). Here we'll explain how those bits are used to represent the letters and numbers that your computer is crunching.

Before we can go into this, you need to understand about the the word size of your computer. The word size is the computer's preferred size for moving units of information around; technically it's the width of your processor's registers, which are the holding areas your processor uses to do arithmetic and logical calculations. When people write about computers having bit sizes (calling them, say, ``32-bit'' or ``64-bit'') computers, this is what they mean.

Most computers (including 386, 486, Pentium and Pentium II PCs) have a word size of 32 bits. The old 286 machines had a word size of 16. Old-style mainframes often had 36-bit words. A few processors (like the Alpha from what used to be DEC and is now Compaq) have 64-bit words. The 64-bit word will become more common over the next five years; Intel is planning to replace the Pentium II with a 64-bit chip code-named `Merced', and now officially called the `Itanium'.

The computer views your memory as a sequence of words numbered from zero up to some large value dependent on your memory size. That value is limited by your word size, which is why older machines like 286s had to go through painful contortions to address large amounts of memory. I won't describe them here; they still give older programmers nightmares.

10.1 Numbers

Numbers are represented as either words or pairs of words, depending on your processor's word size. One 32-bit machine word is the most common size.

Integer arithmetic is close to but not actually mathematical base-two. The low-order bit is 1, next 2, then 4 and so forth as in pure binary. But signed numbers are represented in twos-complement notation. The highest-order bit is a sign bit which makes the quantity negative, and every negative number can be obtained from the corresponding positive value by inverting all the bits. This is why integers on a 32-bit machine have the range -2^31 + 1 to 2^31 - 1 (where ^ is the `power' operation, 2^3 = 8). That 32nd bit is being used for sign.

Some computer languages give you access to unsigned arithmetic which is straight base 2 with zero and positive numbers only.

Most processors and some languages can do in floating-point numbers (this capability is built into all recent processor chips). Floating-point numbers give you a much wider range of values than integers and let you express fractions. The ways this is done vary and are rather too complicated to discuss in detail here, but the general idea is much like so-called `scientific notation', where one might write (say) 1.234 * 10^23; the encoding of the number is split into a mantissa (1.234) and the exponent part (23) for the power-of-ten multiplier.

10.2 Characters

Characters are normally represented as strings of seven bits each in an encoding called ASCII (American Standard Code for Information Interchange). On modern machines, each of the 128 ASCII characters is the low seven bits of an 8-bit octet; octets are packed into memory words so that (for example) a six-character string only takes up two memory words. For an ASCII code chart, type `man 7 ascii' at your Unix prompt.

The preceding paragraph was misleading in two ways. The minor one is that the term `octet' is formally correct but seldom actually used; most people refer to an octet as byte and expect bytes to be eight bits long. Strictly speaking, the term `byte' is more general; there used to be, for example, 36-bit machines with 9-bit bytes (though there probably never will be again).

The major one is that not all the world uses ASCII. In fact, much of the world can't -- ASCII, while fine for American English, lacks many accented and other special characters needed by users of other languages. Even British English has trouble with the lack of a pound-currency sign.

There have been several attempts to fix this problem. All use the extra high bit that ASCII doesn't, making it the low half of a 256-character set. The most widely-used of these is the so-called `Latin-1' character set (more formally called ISO 8859-1). This is the default character set for Linux, HTML, and X. Microsoft Windows uses a mutant version of Latin-1 that adds a bunch of characters such as right and left double quotes in places proper Latin-1 leaves unassigned for historical reasons (for a scathing account of the trouble this causes, see the demoroniser page).

Latin-1 handles the major European languages, including English, French, German, Spanish, Italian, Dutch, Norwegian, Swedish, Danish. However, this isn't good enough either, and as a result there is a whole series of Latin-2 through -9 character sets to handle things like Greek, Arabic, Hebrew, and Serbo-Croatian. For details, see the ISO alphabet soup page.

The ultimate solution is a huge standard called Unicode (and its identical twin ISO/IEC 10646-1:1993). Unicode is identical to Latin-1 in its lowest 256 slots. Above these in 16-bit space it includes Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs. For details, see the Unicode Home Page


Next Previous Contents