Normalization 101 or The Perils of Building a House from the Roof Down
When you build a house, you start from the foundation.
If you started with the roof, you would certainly
entertain the neighbors, but there is little likelihood
that
you would end up with a satisfactory place to live. The
foundation must be solid, of quality materials,
appropriately planned, and built with a high level of
craftsmanship. If your foundation is flawed, the house
might be OK for a few years, but eventually it would
start developing problems that could cost you lots of
money. In the worst case, a faulty foundation could
develop serious cracks that could lead to catastrophic
problems.
Databases, like houses, must also have a solid
foundation. Databases that are well thought out from
the beginning will take up less space, perform more
quickly, and be easier to modify if the need arises. In
addition, a well designed database will preserve the
integrity of the data stored within it. This month
begins a series that will help you to learn the proper
ways to build a solid foundation for your database
projects. As we move forward, we'll discuss specific
methods and details of solid construction, and the
different types of table relationships. In this issue, we'll
focus on the most common and serious problems that
should trigger alarm bells suggesting that your
database
foundation may not be as strong as it should be.
A database foundation consists of properly constructed
tables, fields and relationships - techies call the
process of optimizing the design "normalization". If you
properly normalize then you will have a solid platform on
which to build your application. If you do it
inadequately or haphazardly, it will be like a house with
a poorly poured foundation - it may work in the
beginning, but it will cause you major hassles as you
progress. It's very tempting when building a database
to slap some fields into a couple of tables and then
start blasting away with forms. You must avoid this
path at all costs. Rather, invest time up front,
to analyze your data needs and determine how to best
lay out the data.
Over the next few issues I'll give you concrete advice
on how to do this. This month, we'll use an example of
a database for a medical clinic to examine four key
symptoms -- red flags indicating that you have
problems waiting to happen
Symptom One: Tables go on forever |
 |
Do you find that you have one monster table in
your
database that has so many fields that to see them all
you have to scroll again and again? Do you find yourself
adding more and more fields to this table and ending up
with a queasy feeling that maybe this isn't the right
way to do things?
If you find that you have very few tables with loads of
fields in the tables, then you likely are not taking
advantage of the power of a database. Some people
make the mistake of developing an Access database
that resembles a spreadsheet they would build in
Excel. Databases are not Excel Spreadsheets where
you just keep on adding columns; rather they are sets
of related data. It's the existence of relationships that
begins to open the power of databases. A data item
needs to appear only once. With properly normalized
tables a single change appropriately ripples throughout
your application. To achieve this end you have to
analyze the data, determine how the various pieces of
data relate to each other (hence the term relational)
and then break them down into the appropriate related
tables.
When you are designing tables, always examine them in
the relationship window (Tools | Relationships). Of
course there is no hard and fast rule you can apply,
but
if you see that any one of your tables doesn't fit
comfortably on the screen, then you may have this
problem. We'll come back to this in future
issues and more closely examine how to best solve it;
as you may guess, normalization is the answer.
Although it can be a complex topic, we'll break it down
step by step to reach a solution.
|
Symptom Two: One Potato, Two Potato, Three Potato, Four - Repeating Fields |
 |
This one I see a lot, and it's due to a poor
understanding of how to construct the relationships.
For instance, let's say that you're building a database
to track various diagnoses (Dx) for a patient. It might
be tempting to set up a table with the fields
Diagnosis1, Diagnosis2, and Diagnosis3. Each patient
could then have 3 separate diagnoses. There are at
least two problems with this. Inevitably you'll have a
patient that needs to have a fourth diagnosis. As soon
as you get one of those, it's back to the tables to add
a field and back to your forms and reports to deal with
the new field - an expensive proposition. The second
problem comes about when its time to do analysis.
Let's say that you want to know how many patients
have a diagnosis of Cerebral Palsy. If you have one
table with three diagnoses for each patient, then you
have to write some complex SQL statements to
aggregate each field or write some code to loop
through each field in the table to look for Cerebral
Palsy. It's possible to do that, but it's neither easy nor
cheap.
The correct normalization, which builds flexibility into
your database, is shown in the picture below. The
critical table here is trelPatientDx. This table contains a
single Dx for each patient is in a separate row of the
table. I've included an additional field to set the priority
of diagnoses for a single patient called fldRank. You'll
need this field only if the order of diagnosis is important
(e.g. DSM IV diagnoses for the fellow psychologists out
there). With this construct, finding how many patients
have Cerebral Palsy is a snap. If a patient needs more
than 3 it's not a problem either, as a matter of fact
having 103 is easy as well. A patient can have as many
diagnoses as s/he requires, and there's no need to
change forms or reports if you've set them up with this
type of relationship to start with.

|
Symptom Three: The same data appears again, and again, and again and again ... (all in the same table) |
 |
Let's say you're tracking patients and their doctor with
the doctor's telephone number. You may be tempted to
have the fields Doctor and DrTelNumber in the patient
table. But if Dr. Jones is physician for 250 of the
patients in your database, she and her telephone
number would appear 250 times in the same table (a
major red flag!). Although you could do it this way,
what happens when the doctor gets a new telephone
number or the Area Code changes? You would have to
run a query that updates every record with the new
number. A better approach would be to set it up as
shown below, with a separate table to hold the
information about the doctor. The doctor is identified
in the patient table with the DrKey from tblDr. This is a
classic example of a "one to many" relationship (one
doctor has many patients). In this construct you
would simply go to the doctor table (or form) make a
single change there and it would appropriately ripple
throughout the database - a much easier and user-
friendly way to update data.

|
Symptom Four: 2 + 2 = 5 Storing data you can calculate as needed |
 |
This is a really common problem with big
consequences
so I'll give you two examples.
First Example:
Full Name stored as a field: Let's say you want to use a
person's full name in a report. You may decide to
create the fields FirstName, LastName and FullName.
The problem is that if Mary Smith gets married and
becomes Mary Jones, then two changes are necessary
to update the data rather than one; if someone forgets
to make that second change then you have a data
problem on your hands. The bottom line is that there is
no need for the FullName field. Whenever you need it,
simply calculate it using the concatenation function
(FirstName & " " & LastName). The ampersands will
concatenate the First Name and Last Name fields, the
quotes in the middle will put the space in between.
This solution preserves the integrity of the data and
saves space in your database.
Second Example: You want to set up a field to flag
the date of the next physical exam, which is always
one year from the last exam. You might be tempted to
have two fields: LastExamDate and NextExamDue. The
problem with this is two-fold:
- Anytime you are required to figure out one
date based on another date, you've created an
opportunity to make an error.
- If Last Exam Date was entered incorrectly,
someone must manually correct Last Exam Date and
recalculate Next Exam Due when the error is detected.
The biggest danger, of course, is that the user will
simply forget to recalculate the date the next exam is
due - once again you've opened the door to bad
data.
The solution to this is not quite as simple as
concatenation; you have to use the built in
function "DateAdd." Its purpose in life is exactly this
scenario. You can read up on that on your own - Press
F1 to bring up help and then do a search on DateAdd
and it will give you the particulars. Access is a
powerful program with the ability to create data based
on a calculation much like a formula in Excel. This is
one of the many examples
Conclusion and a Look Ahead
Proper analysis and creation of tables and fields and
proper establishment of relationships between tables
(normalization) is a critical exercise for all but the most
simple database applications. In this issue, we've
reviewed common symptoms of a shaky foundation.
Take extra time in the beginning to get your foundation
right and you'll save money, time and aggravation down
the road.
In the coming months we'll continue our series of how
to build a solid foundation as we talk more about
normalization and review the different kinds of table
relationships (one-to-one, one-to-many, and many-to-
many.)
|
|
Tip of the Month |
 |
To get a quick assessment of some ideas to normalize
your tables you can use the built in Table Analyzer
(From the Database Menu -- Tools | Analyze | Table.)
This feature will not automatically redesign your
database, but it will make some suggestions that you
can analyze to better approach a properly normalized
set of tables.
|
|