Pointers, linked lists and hashing

Copyright Dr Alan Solomon, 1986-1995

What is it that simple programming languages like Basic, Cobol and Fortran
don't have?  What is it that makes even Turbo Pascal programs
unreadable?  And what is it that even hardened C programmers loathe
and fear?  Pointers.  The very word makes seasoned programmers quake
in their sneakers;  it is synonymous with incomprehensibility,
complexity and bugs.  But fear not;  just keep hold of my hand and
don't look down, as we explore the vertiginous heights of the world of
pointers;  I'll also be covering linked-lists and hashing.

Before I explain what a pointer is, let me explain the problem that it
solves.  I've always found that a solution makes no sense at all,
unless you know the problem that it solves.

Suppose you want to find your way around London;  suppose you are
looking for Holloway Lane.  The obvious solution is to dig up all the
roads of London, sort them into alphabetical order, then replace them
in a row running from Aardvark Avenue (which would now be near
Watford) to Zygmosis Yard (which would be moved to somewhere around
Redhill).  So if you wanted to find Holloway Lane, you'd walk South
from Aardvark Avenue, via Aaron Road, Abacus Street and so on, until
eventually you came to Holloway Lane, just after Holefood Corner.  Or
you could trek North from Zygmosis Yard, and you'd get to your
destination just after Hologram High Road.  This system of finding
things is called a sequential search, and it has an advantage and
several disadvantages.  The advantage is that it works - you're
guaranteed to find the street you're looking for.  The first
disadvantage is the very long walk that you have every time you want
to go somewhere;  up to Watford, then South.  The second disadvantage
is the major civil engineering work that is required to sort the
streets into this handy order.  The third disadvantage is apparent
when you want to add a new road called Nogrampty Street - you'll have
to dig up all the streets from Noisome Boulevard up to Zygmosis Yard,
and replace each one about a hundred yards to the South.  There must
be a better way to find your way around.

Indeed there is - it's called the A-Z street map.  Somebody had the
ingenious idea that all you need to have in alphabetical order, is the
names of the streets.  You then have, next to each name, a map
reference to where the street actually is;  in the A-Z these
references look like "76b4", meaning page 76, square b4.  OK, now you
sort the names of the streets, which is much easier, as they can be
written on bits of paper, and paper is much easier to shuffle around
than actual streets.  Even better, you put the names and map
references (pointers) in a computer, and sort them there, then you
print out the alphabetic list, with the pointers, and sell the book of
maps plus pointers for under a pound.  And notice how easily I slipped
the word pointers in?  I bet you didn't even spot it!

But reading down the list from Aardvark Avenue till I get to Holloway
Lane used to take me ages, so I phoned up A-Z, and it seems they have
solved that one too.  You open the book at M or thereabouts, and then
move a bunch of pages towards A.  If that takes you too far, you move
forward again, until pretty soon, you find the page with the Holloway
Lane pointer on.  Then you go through the same process within the
page, until you find the exact pointer you need.  This searching
method is called the Binary Chop, after the restaurant where it was
invented.

They add to the list by putting new pointers in at the end, after
Zygmosis Yard, and every now and then they re-sort the pointers, print
it out again, and sell you the upgraded version.

The next use of pointers I'd like you to meet is a kiddies game called
"Treasure Hunt".  You give the child a clue to where the first prize
is;  this clue is of course a pointer.  But when she finds the prize,
she also finds another pointer to the next prize.  And with that
prize, there's another pointer, and so on to the end;  there's no
pointer with the last prize;  that's how you know you've reached the
end.  To put it another way, the last pointer doesn't exist, or to put
it yet another way, it's the null pointer.

This is a linked list.  People pretend that linked lists are pretty
complex, but Jennifer has no problem with them, and she's only five.
The advantage of a linked list is this.  If I want to add a prize in
the middle of the list, it is easy.  Suppose my linked list is:

Apple, plus Care-Bear-pointer
Care-Bear, plus Pony-pointer
Pony, plus Smarties-pointer
Smarties, plus null-pointer.

If I want to put a Book between the Apple and the Care-Bear, I hide
the Book, put a pointer to it with the Apple, and take the pointer
that was with the Apple and put it with the Book.

My list is now:

Apple, plus Book-pointer
Book, plus Care-Bear-pointer
Care-Bear, plus Pony-pointer
Pony, plus Smarties-pointer
Smarties, plus null-pointer.

Here's the clever thing I just sneaked up on you - my linked list is
still in alphabetical order, and I haven't had to sort the pointers
again!  That's a big improvement on the A-Z method, as sorting a
million pointers can take a long time.

So, to find something in our linked list, we just traverse it like
Jennifer does, and stop when we find what we want.  But we can only go
in one direction, because that's the way the pointers point.  No
problem - all we have to do is put a second pointer with each object
that points to the previous object.  Then you can traverse it in
either direction.

But I don't much fancy traversing a million pointers every time I want
to find something.  I want a faster way to search, like I had with my
A-Z.

Consider a database of information on London roads;  each record might
hold a list of house numbers and inhabitants, where the gas main is
laid, and so on (including, of course, the A-Z map reference, which is
actually a pointer into a different file of information - the A-Z map
itself).

One obvious way to deal with such a large database is to have
pointers to significant records, like a list of 26 pointers, each
pointing to the start of the As, the Bs, the Cs and so on.  To find
Holloway Lane, I look at the eighth pointer, which sends me straight
to the beginning of the Hs.  Then I traverse the Hs until I find
Holloway Lane.  You can extend this idea, so that at the beginning of
the Hs, you find another list of pointers for the second letter.  Look
up your second letter in this list, and you'll get a pointer to things
that begin with HO.  Traverse the linked list, and you'll find
Holloway Lane.

This will work, and will even be fairly fast.  But it has one major
disadvantage, the same disadvantage that all linked-list systems have.
It is a bit fragile.

There is a very easy way to measure the fragility of a data retrieval
system.  If one byte is changed to a random number, how much data is
lost?  This obviously depends on which byte is changed, so we should
really talk about some average.  With a linked-list system, if any of
the pointers get corrupted, you lose an average of 50% of the data in
that list, because you can no longer traverse the list past the
corrupted pointer.  So, if we are going to use a linked-list, we
really want to divide the database up into several lists.

The system described above has another problem;  it is unbalanced.
There is only one street beginning with Zy, but there are an awful lot
that start with Hi.  This means that for common names, finding the
record will take a long time, whereas for rare names, it will be much
faster.  This is obviously the wrong way round, from the user's point
of view.

The problem of all those High Streets and of fragility is solved by a
process called hashing, which was invented in the same restaurant.
Suppose we want to set up our database of London Roads, and we know
that there are about 30,000 of them that we need to store.  Suppose
each record has about 2000 bytes (but some of them could be much
longer, as there are more houses to describe).  But we've got all the
data in a file, in some very random order, and we know we're looking
at 80 megabytes of data.  For now, I'm going to ignore the
complication that DOS can't handle such a large file, or even such a
large disk;  that can be fixed up later, as you will see.  The first
step is to think how much the database is going to grow;  in the case
of London Streets, not a lot.  So we add 30% to the data size, and buy
110Mb of disk space (perhaps a Compaq 386 with the 130Mb drive).  We
then create a file with 512-byte records (I chose 512 bytes because
that is the size that DOS will use most efficiently) which we can
number from 1 to 220,000 (which works out at about 110 Mb).  And then we
start feeding our raw data into this file.

The first record is Mulberry Terrace, NW3.  Convert the letters of the
name to upper case, convert those letters to numbers using the usual
ASCII table, multiply the numbers all together, divide by 220,000
and throw away the quotient, keeping only the remainder.  This
remainder can now be used as a pointer, and we write the data into the
database, starting from the record that is pointed to.  We write it as
a linked-list of 512-byte records, even though we could just write it
as one long chunk of data - you'll see why in a minute.

You can see how we are going to retrieve our data.  You tell me the
name of the street, and I apply the same arithmetic process, and I end
up with the same pointer.  So I can go direct to the data, without any
look-up tables, any binary chops, or any traversing.  Wow!

I've simplified this a bit.  Actually, you don't just convert
characters to numbers and multiply and take the remainder.  You use a
more complicated algorithm (any algorithm used for this purpose is
called a Hashing Algorithm), but you can get a suitable algorithm out
of various reference books.  A good algorithm will scatter your data
fairly evenly throughout the file, to minimise the number of
collisions - which brings us to the second (and main) complication.


From now on, I'm going to call a collection of information about a
street a "record", and a 512-byte chunk of disk a "block".
Inevitably, as you hash your keys (the keys in this case are the
street names), you're bound to find two keys that hash to the same
block number, and when you try to write your data to that block,
you'll find that there is already something there;  this is called a
collision.  So, you write the data into the next block, and make the
first block point to it.  But perhaps that already has something in it
- never mind, use the next one along!  In fact, you create a
linked-list starting from the block that the hash points to.  So, the
rule is, to write a new record into the database, first hash the key,
and look at the block that is pointed to.  If this is empty, write the
first 512 bytes of the record, then find the next free block by
reading sequentially down the file.  When you find the next free
block, write the next 512 bytes of the record, and write a pointer to
it into the first block.  Keep on searching for free space and
building a linked-list for this record, until it is all written.  The
last block will have a null pointer, to show that it is the last one.

But if, when you've hashed the key, you find something already there,
you look in that first block for another pointer - a pointer to
another record.  If that is the null pointer, then you search
sequentially down the file until you find an empty block, and that
will be the first block of your record.  You put a pointer to it into
the first block of the first record, then write the rest of your
record into a linked-list as before.

So, let's suppose you've put all the data into the database, and you
have more pointers than a barbed-wire fence.  Let's look at what the
file looks like, as that will help to understand how it works.

Take Holloway Lane, and apply our hashing algorithm to it.  We get the
number 178,423, so we look at block number 178,423 in the file.  Look
at the first 13 characters of the record, and we find that it says
"Holloway Lane" - great, we've found our data with one disk access.
We look at the last four bytes, and they are not zero - they contain
the number 178,424.  So we read block 178,424, and that is the second
block of the Holloway Lane data.  Again, we look at the last four
bytes of that block, and we find the number 178,426, so we skip over
block 178,425 and read block 178,426.  What's happening here, is that
when the data for Holloway Lane was written, 178,425 was already being
used, so the linked-list skipped over it.  Now we look at the last
four bytes of 178,426, and find that they are zero.  Aha, that means
that it is the last block of the Holloway Lane record, so we can stop
reading blocks.

Now consider Sycamore Road.  When we hash it, we find that we get the
number 29487, and when we retrieve block 29487, the record there isn't
for Sycamore Road!  So we look at bytes 505 to 508 of that block, and
sure enough, we find a pointer there.  What we have here, is a
linked-list of records, each of which really wanted to start at 29487.
We traverse this linked-list till we find Sycamore Road, then traverse
the Sycamore Road linked-list.

You see how it works?  If we use a good hashing algorithm, and allow
for about 30% of slack space in the file, we can usually go straight
to the first block of the record with only one disk access.  And the
rest of the record follows on from it contiguously, so reading the
rest of the record will be very fast;  no need to bash the disk read
head back and forth.  Sometimes, when we traverse the linked list of
blocks, we'll have to skip a chunk of disk, but then the data will be
contiguous again.  So retrieving the data will be fast, as it is disk
accesses that take the time in data retrieval.

Even if the data isn't in the first block we go to, the linked-list
that we have to traverse to find it will usually be quite short.

Usually, you'd use a doubly-linked-list to chain things together.  The
advantage is that you can traverse in either direction - if you want
to read the previous block, you have a pointer to it.  This also can
act as a marker;  if a block's forward and backward pointers are both
null, it obviously isn't being used for anything.  So deleting a
record is easy and fast, too.  All you have to do is change the
linked-list so that it skips that record, and change each pointer in
that record's linked-list to null.

This system isn't fragile.  If the linked-list for one record gets
trashed, you've only lost that record.  And if the linked-list for one
hashing key gets corrupted, you'll only lose the few records in that
linked-list.  If an entire chunk of disk goes bad, you will only lose
the records that were on the bad patch, or that were in a linked-list
that points from that patch.

What about sorting the database?  If you want the database in
alphabetic order, you can have a couple more pointers in each record,
and the linked-list that this creates can be in whatever order you
want.  This linked-list could be sorted by the main keys used for
hashing, or could be sorted according to some other part of the
record.  You never have to actually do a sort - you can simply make
sure that you change the pointers in the same way as we did for
"Treasure Hunt".  But it is more usual to build an index.

To build an index, let's assume that at the same time as you put
records into the database, you build a linked-list, in the order of
record insertion.  Then, if you decide to sort the database into the
order of the lengths of the streets, you do the following.

First, traverse the entire database using the linked-list mentioned
above.  This ensures that every record is visited.  As you visit the
record, you pull off the street length (the sort key) and the pointer
to the first block of the record.  So what you've pulled off is fairly
small in size.  Next, you sort this using one of several standard
sorting methods (NOT Bubble sort, which is very slow).  Then you write
out the sorted list to disk, and there's your index.  To find the
first street over 600 yards long, you read the index, do a binary chop
on it, and that gives you a pointer to the first block of the record.

You can create other indexes, too.  All you'll need to store is the
key that you're sorting on, and a pointer.

I wouldn't recommend that you rush out and write yourself a data base
management system - you can buy them off-the-shelf for peanuts.  Or you
can buy toolboxes for Turbo Pascal, Basic and C that implement the ideas
described here.  It is useful, though, to understand what is going on in
a database underneath the surface.  And it means that you can ask more
penetrating questions when you buy your database.

There are a lot more interesting subjects to cover in this field;  the
business of sorting is fascinating, for example, and very few people
actually understand what a relational database is and why it is a good
idea, and why you can only have so few files open at a time.  Searching
is also quite interesting - the binary chop isn't the only method.  Then
there is the B-tree method of organising data;  a special kind of
linked-list with special advantages for micros.

If you want to investigate this subject further, read some books.
Donald Knuth's "The Art of Computer Programming" is the definitive work,
but rather heavy going.  The Turbo Tutor is quite good in this area,
and there is an excellent PD demo program illustrating six of the
different methods of sorting.

Like all good databases, the last part of this record contains a
pointer.  In this case, it is a pointer to the next issue of this
periodical.  And just like any other linked-list, you won't know what is
in it until you read it.  See you then.