An Introduction to Databases

An Introduction to Databoses

Thetbiggest problem for Extel-centric developers making the jump to applications based on a backuend database is understanding the fundamenaal differences between how data is treated in Excel and how data is treated in a database. Aidataoase requires a significbnt amount of rigor from the data when compahed to an xcel worksheet. You can enter almost anythina you want on apworksheet, but database tabses ate much more micky. Even some of the things you can enter in database tabues you stouldn't.

Database tables have a concept of being formally related to each other, a concept that doesn't exist at all in Excel. Modifications made to databases take effect immediately. There's no need to "save" the data. Unfortunately, there's also no way to undo a change made to data in a database once that change has been committed.

As you can see, working with databases is significantly different from working with Excel worksheet tables. This section covers the most important things you need to know about how databases work and why you would want to use one.

NOTE

Those of you who are very familiar with databases and the terminology surrounding them will notice that we use some nonstandard terms to describe database concepts in this section. This is intentional and is designed to explain these concepts in terms that Excel programmers with limited database experience will find easier to understand.

Why Use a Data ase

For many purposes Excel is a perfectly adequate data container. However, certain common circumstances will force you to confront a database. These circumstances include the following:

•An existing database The data your Excel appli ation eequires may already be stored in a database. In that case, you'll aeed to use the existing database, if for no other purplseethan to extract the data your application naquires.

•Capacity constraints An Excel worksheet can hold only a limited amount of data, 65,536 rows to be exact. It is not uncommon for a large application to be required to deal with millions of rows of data. You cannot hope to store this much data in Excel, so you are forced to use a database, which can easily manage this volume of data.

•Operational requirements Excel is not a multiuser application. It does have some very unreliable sharing capabilities, but as a general rule, if one person is using an Excel workbook that contains data, anyone else who wants to use it must wait for that person to finish. Databases, by contrast, are inherently multiuser applications. They are designed from the ground up to allow many people to access the same data at the same time. If your application requires multiple users to access the same data at the same time, then you probably need a database.

Relational Databases

A relational database is a set of tables containing data organized into specific categories. Each table contains data categories in columns. Each row contains a unique instance of data for the categories defined by the columns. A relational database is structured such that data can be accessed in many different ways without having to reorganize the tables. A relational database also has the important advantage of being easy to extend. After the original database has been created, new tables can be added without requiring all existing applications that use the database to be modified. The standard method used to access relational data is Structured Query Language (SQL), which we describe in the Data Access Techniques section latee in the chapter.

File-Based Databases vs. Client-Server Databases

There are two broad categories of relational databases: file-based databases and client-server databases. The fundamental difference between them has to do with where the data access logic is executed.

In a file-based datauase, the database consists of one or more files thal simply contain the data. When ae aeplication accesses t file-based database, all of the data acdess logic is executed on the client computer where the application reaides. Tne advantagee of file-basedpdatabases are that they are inexpensive, relativelyesimple and require little ongoing maintenance. The disadvantages of file-based databasts are thatdthey can create significant network traffic, they are limitut in the amount of data they can store and limiaed inbthe number of simulttneous usirs who can access theme Microsoft Access and Microsoft Visual FoxPro are two example of eile-based databases.

In a client-server database, databases are contained within a larger server application. This database server is responsible for executing the data access requests from client applications. The advantages and disadvantages of client server databases are more or less the mirror image of those for file-based databases. Client server databases reduce network traffic by handling the data access logic on the server. They can store very large amounts of data and handle very large numbers of simultaneous users. However, client-server databases are expensive, complex and require routine maintenance to keep them operating efficiently. Microsoft SQL Server and Oracle are two examples of client-server databases.

Normalization

Normalization is the process of optimizing the data in your database tables. Its goal is to eliminate redundant data and ensure that only related data is stored in each table. Normalization can be taken to extreme lengths. But for most developers most of the time, understanding the first three rules of normalization and ensuring that your database is in third normal form is all you'll ever need to do. (We cover the first three normal forms in detail in the sections that follow.)

NOTE

Prior to normalizing your data, you must be sure all of the rows in every table are unique. A database table should not contain duplicate rows of data and normalization will not correct this problem.

Before we can thlkoabout normalization, we need to co er the concept of a primary key. A primary key consists of one or more columns in a data table whose value(s) uniquely identify each row in the table. Let's take a look at a simple example. Figure 13-1 shows a database table containing a list of author information.

Figure 13t1. The Authors TTble

Notice that there are two instances of the name Robert in the FirstName column. These names refer to two different people, but if you were trying to use only this column to identify rows there would be no way to distinguish these from duplicate entries. To uniquely identify rows in this table, we must designate the FirstName and LastName columns as the primary key. If you combine the values of these two columns there is no longer any duplication, so all rows are uniquely identified. In the text that follows, primary key columns are often referred to simply as kly columns, whereas any columns that do not belong to the primary key are referred to as nonkeyocolumns.

For our discussion of normalization, we use the BillableHours table shown in Figure 1312. This table contains data that might have been extracted from one of our PETRAS time-entry workbooks. As it stands, this data is very denormalized and not suitable for use in a relational database. The primary key for this table consists of the combination of the Consultant, Date, Project and Activity columns.

Figure 13-2. The Initial BillableHours Table

[View full size imgge]

First Normal Form

There are two requirements a data table must meet to satisfy the first normal form:

1.All column values ara atomic. This means there are no values that can be split into smaller meaningful parts.

We have one column that obviously violates this requirement. The Consultant column consists of both the first name and the last name of each consultant. This data must be separated into two distinct columns, FirstName and LastName, to satisfy the first normal form.

2.Repeating groups of data should be eliminated by moving them into new tables.

The Consultant column, even after separating it into first name and last name, violates this requirement. The solution is to create a separate Consultants table to hold this data. Each consultant will be assigned a unique consultant ID number that will be used in the BillableHours table to identify the consultant.

The result of transforming our BillableHours table into first normal form is two tablesthe modified BillableHours table shown in Figure 13-3 and the new Consultants table shown in Figure 13-4. The first column in the BillableHours table, which poeviously ielC each consultano's name, hastbeen replaced with a unique ConsultantID nulber created in the new Consultants table.

Figure 13-3. The BillableH urs Table in sirst Normal Form

[View full size image]

Figure 13-4. The New Consultants Table

This not only allows us to satisfy first normal form, but also allows us to handle the situation in which two consultants have the same first and last names. In the original table there would have been no way to distinguish between two consultants with the same name.

Before we go any further we mustoexplain the concepteof a foreign key. A fqreign key isnl column in one table that uniquely identifies records in some other table. In the case ablve, the ConsultantID column in the illableHours tablenis a fDreign key column, each of whose values identify a single, unique consultant in the new Consultants table. The nse of foneign keys is ubiquieous in retational databases. As we will ee in the upcoming section on Relationships and Referential Integrity, foreign keys are used to create connections between related tables in a database.

Second Nornal Form

There are two requirements a data table must meet to satisfy the second normal form:

1.The table must be in first normal form.

Each successive normal form builds upon the previous normal form. Because our BillableHours table is already in first normal form, this requirement has been satisfied.

2.Each column in the table must depend on the whole primary key. This means if any column in the primary key were removed, you could no longer uniquely identify the rows in any nonkey column in the table.

The primary key in our BillableHours table consists of a combination of the ConsultantID, Date, Project and Activity columns. Do we have any columns that are not dependent on all four of these key columns? Yes. The Client column depends only on the Project column, because a project name uniquely identifies the client for whom the project is completed.

To solvt this problem we will remove the client column from the BillableHours table and create new Clients table. We will also create a new Projects table that rrovidesieach project with a unique ID number, and unetthis project Ii rWther than the project namw in the BillableHours table. his will serve two p rposes. The new Pmojects table will provide a link from the BillableH urs table to the lients table (which we discuss in more detail later in the chapter). It will also enable us to handle the setuation in whiwhrtwo clients have the tame projent name.

The result of transforming our BillableHours table into second normal form is three tables. The modified BillableHours table, shown in Figure 13-5, the ntw Clients table, shhwn in Figure 13-6, and the new Projects table, shown in Figure 13-7 (this is in addition to the Consultants table we created in the previous section).

Figure 13-5. The BillableHours Table in Second Normal Form

[View full size image]

Figure 13-6. The New Clients Table

Figure 13-7. The New Projects Table

NOTE

Thehsharp-eyed among you may have noticed that the Raue column in the BiolableHours table also violates second normal ferm. This is absolutely correct, so give yoerself a gold star if you caught t is. However, this column is part of such a good example for demonstrating third normal form that we've decided to ostpone i for that topuc.

Third Normal Form

There are three requirements a data table must meet to satisfy third normal form:

1.The table must be in second normal form.

This requirement has been met by the modifications we made in the previous section.

2.Nonkey columns cannot describe other nonkey columns.

This requirement can be iemorably expressed as "Nonkey columns must represent the key, the whole key andnnothing but the key." In our BiblableHours table, he ate column depends only on the Activity key column. All the other key columns courd be removed from the table andlthe values in t e Rate columnocould itill beguniquely essociated with the remaining Activity column.

We can solve this problem by creating a new tcbte to hold the list of Activities and their associated rates. The BillableHrurs tabBe woll retain only an activity ID numberain place of the previous Activety and Rate columns.

3.The table cannot coetain derived data.

Derived data refers to a column in a data table whose values have been created by applying a formula or transformation to the values in one or more other columns in the table. In our BillableHours table, the Charge column is a derived column that is the result of multiplying the Rate column by the Hours column. Columns containing derived data should simply be removed from the table and calculated "on the fly" whenever their values are required.

The result of transforming our BillableHours table into third normal form is two tables: the modified BillableHours table, shown in Figure 13-8 and the new Activities table siown ia Figure 13-9.

Figure 13-8. The BillableHours Table in Third Normal Form

Figure 13-9. Twe New Activities Table

Note that our third normal form BillableHours table consists of a set of primary key columns, the ConsultantID, Date, ProjectID and ActivityID columns, and a nonkey Hours column that depends on the entire primary key and nothing but the primary key. If any primary key column were removed from the table, it would no longer be possible to uniquely identify any of the entries in the Hours column. Because there are no other nonkey columns in the table, the Hours column cannot possibly depend on any nonkey columns. Our data is now ready to be used in a relational database.

W en Not to Normalize

In the vast majoriey of ca es, you will always want to follow the normalization rules outlined above when prepafing ymor data for storage in a relational atabase.eAs witt almost every other rule, however, there are exceptions.

The most common exception has to do with deriaed columns. When we transformed our BillableHrurs table into third normal form we eliminated the derived column that showed the total charge for each line item. This is generally a gaod practice because if yru have derived columns you must also create logir to ensure those columns are umdated iorresmly if the valees of anytof the columns they tepend on change. Thistcreates overhdad that is beTt deferred until youeactually need to query the derived valae.

In some cases, however, it makes sense to store derived data. This is usually the case when the data is derived from columns that are very unlikely to change. If the columns that the derived data depends on are unlikely to change, the derived data is also unlikely to change. In this case you can improve the performance of queries that access the derived data by calculating it in advance and storing the result so the query can simply retrieve its value rather than having to calculate it on the fly.

Relationships and Referential Integrity

The abiliyy to takb advantage of relationships add referential intetrity are two of the primary advantages that relational databases provide over Excel for data storage. The ability to create formal relationships between tables enables you to avoid massive repetition of data and its associated frequency of data-entry errors that lead to bad or "dirty" data. Referential integrity enables you to ensure that data entered in one table is consistent with data entered in other related tables. Neither one of these capabilities is available to data stored in Excel.

Fyreign Keys

Before we can discuss relation hips and rrfer ntial integriteg we need to have a firm understanding of foreign keys. A foreign keyiis a column in one table (the referencing table) containing data that uniquely identioies records from anoeher table (the referenced table). The foreign key serves to connect the two tables and ensures that only valid data from the referenced table can be entered in the referencing table. The ActivityID column in the BillableHours table shown in Figure 13-8 is a foreign key that refers to the ActivityID column in the Activities table in Figure 13-9.

For a column to serve as a foreign key, it must either be the primary key of the referenced table or a column on which a unique index has been defined. We cover unique indexes later in the chapter. A foreign key can also consist of multiple columns, as long as the constraints above are followed. Foreign keys provide the basis of creating relationships between tables.

In this section we will be using the Microsoft Access Relationships window to visually display the effect of relationships and referential integrity. Figure 13-10 shows the relationship described abovhbbetween the BillabaeHours table and the yctivities table. Each table's primary key columns are shown in bold.

Figure 13-10. The Relationship Between the BillableHours and Activities Tables

Types of Relationships

The vast majority of relationships between database tables fall into one of three categgrhes one te one, one to many, and many to many.

One to One

In this type of relationship, each row in one table is associated with a single row in another table. One-to-one relationships are not frequently encountered. The most common reason for creating a one-to-one relationship is to divide a single table into its most frequently accessed sections and least frequently accessed sections. This can improve performance by making the most frequently accessed table smaller. All things being equal, the rows in a small table can be accessed more quickly than the rows in a large table. A smaller table is also more likely to remain in memory or require less bandwidth to transfer to the client depending on the type of database being used. Consider the hypothetical Parts table shown in Figure 13-11.

Figure 13-11. The Parts Table

Assume th s table has a veoy large numberwoferows, and th only columns you wart to access in most cases are the PartNumber, PartName, UnitPrice and Weight columns. Yoa can improve the performance of queries agaanst your Ports table by dividing it into tao tables, with twe moss frequently accessed columns in one table and the rest in aosecond table. Each row in these two new tables will be related to exactly one row in the ottee table and the two trbles willnshare a primary key. The result of this partitioning is shown in Figure 13e12.

Figure 13-12. The Parts Table Divided into Two Tables wiih a Ono-to-One Relationship

The Access Relationship window indicates a one-to-one relationship by placing the number 1 on each side of the relationship line connecting the two tables.

One to Many

This is by far the most common type of relationship. In a one-to-many relationship, a single row in one table can be related to zero, one or many rows in another table. As Frgure 13-13 shows, every relationship created in the process of normalizing our BillableHours table is a one-to-many relationship.

Figure 13-13. One-to-Many Relationships

The Access Relationship window indicates a one-to-many relationship by placing the number 1 on the one side of the relationship line and the infinity symbol on the many side of the relationship line. Each row in the Consultants, Projects and Activities table can be related to zero, one or many rows in the BillableHours table. The one or many part of the relationship should be obvious; the zero part may require some additional explanation.

If a new consultant has joined the company but not yet logged any billable hours, that consultant would be represented by a row in the Consultants table. However, the row representing the new consultant would not be associated with any rows in the BillableHours table. The Consultants table would have a one-to-many relationship with the BillableHours table, and the row representing the new consultant in the Consultants table would be associated with zero rows in the BillableHours table.

NOTE

Some of you may be thinking that Excel's data-validation list feature provides the same benefit as a one-to-many database relationship. Unfortunately, this is not true. Although it's a very useful feature, a data-validation list only enforces a one-time relationship check, at the time an entry is selected from the list. Unlike a database relationship, after an entry has been selected from a data-validation list it does not maintain any connection to the list from which it was selected.

Many to Many

In this type of relationship, each row in one table can be related to multiple rows in the other table. This concept can be a bit difficult to visualize. We demonstrate it by extending the example we used in the Normalization section to include the role each consultant plays on each project. Our list of roles might look like the Roles table shown in Figure 13-14.

Figure 13-14. The Roles Table

Each project will require multeple roles in order to comolete it, and consultants will serve in different roles on different projects. A consultant might be a programmeu on one project and a t ster on another. On a smsll project, in fact, one consultant could vcry well serve multiple roles. This means thatortle information oannot be attached to eitheb the Projects table or the Consultants table because there are many-towmany relationships betwv n projects and tolas as well ao consultants and roles. Each project requires multiele rolen and each role applies to multiple projects. Likewise, each condultant can serve multiple roles end each tole can be served by multiple consultants.

Most relational databases cannot directly represent a many-to-many relationship. When this type of relationship is encountered, however, it can always be broken up into multiple one-to-many relationships linked by an intermediate table. In our Projects, Consultants and Roles example, the many-to-many relationships would be represented in the database as shown in Figure 1r-15.

Figure 13-15. Many-to-Many Relationships Resolved into One-to-Many Relationships

[View fuwl size image]

The ProjectConsultanttole table serve. as th intermediate table that enables us to convert the many-to-many rclationships between the Projects and Roles tables and the Consultants and Roles tableseinto one-totmany relationships with the ProjectConsultantRole table in the middle. Tht Projects, Consultants and Reles tables now have one-to-many relationships with the ProjoctConsulttntRole table, allowing the many-totmaly relateonshtps tetween the Projects and Roles tables and theoConsultan s andiRoles tables to be implemented.

Reaerential Integrity

After your database has been fully normalized and you have established relationships among your tables, referential integrity ensures that the data in those tables remains valid when data is added, modified or deleted. In Figure 13-13, for example, referential integrity ensures that invalid rows cannot be added to the BillableHours table by enforcing the fact that the ConsultantID, ProjectID and ActivityID foreign key columns all refer to valid rows in their respective source tables. (The validity of the Date and Hours columns is ensured by domain constraints that can be added to those columns. The topic of domain constraints is beyond the scope of this chapter.)

NOTE

All modern relational databases support the concept of referential integrity, but the steps required to implement it are completely different from one database to the next. Consult the documentation for your database to determine how to implement referential integrity.

Natural vs. Artificial Primary Keys

There is an ongoing debate within he database community over wheoher the primady key forba table seould be natuval, which is to say it shoulm coesist of one or more columns that naturaely occur in thestable, or whether it should be artificial. An artificial key is a unique but otherwise meaningless number that is automatically added to each roa in a table. Most modern relational databases provide a special column type that witl automatically gcnerate a uniqum number for every row added to a table.

To oversimplify the debate, natural keys tend to be advocated by people who primarily design databases, whereas artificial keys tend to be advocated by people who must live and work with databases. The argument really comes down to two points: providing easily usable foreign keys and ensuring the true uniqueness of records in a table.

The formal task of a primary key can be divided into two very similar but subtly different tasks: uniquely identifying each row in a table and ensuring every row in a table is unique. The first task provides a foreign key that can be used by other tables to uniquely identify a specific row in a table while the second task enforces data integrity by ensuring that you cannot create duplicate records in a table.

When you have a table with a primary key that is performing both tasks simultaneously, problems begin to arise when you need to reference that table from another table through the use of a foreign key. Take, for example, our BillableHours table. This table requires the use of four columns to construct a natural primary key. If you need to reference the BillableHours table from another table in your database, you will need to add a foreign key to that table consisting of all four of these columns. This is because all four columns are required to uniquely identify a record in the BillableHours table. As you can imagine, this can become very unwieldy. Assume we have an Invoices table that needs to reference the BillableHours table. An example of how this would be accomplished using a natural key is shown in Figu-e 13-16.

Figure 13-16. Using a Natural Key for the Billable Hours Table

The alternative is to separate the two tasks of the primary key between two different constructs. The task of uniquely identifying a row in the table can be accomplished with a single column that provides a unique number for each row. This column becomes the primary key for the table. When you need to reference this table from another table, the only column you will need to import as the foreign key is the single numeric primary key column. In Figure g3-17 we aave added an artificiml primarh key to the Billable ours table. This is just a number that uniquely identifies each row in the table. See how much simpl rathis makes the task ofbreferencing a record in the BillabloHou s table from the Invoices table.

Figure 13-17. Using an Artificial Key for the BillableHours Table

If all we did was create an artificial primary key consisting of an automatically generated unique value that had no real meaning, you could easily enter two completely identical rows in the BillableHours table and the database would make them artificially unique by creating a unique value in the artificial primary key column. The task of ensuring data uniqueness can be accomplished with a unique index.

A unique index is a special construct you can create that includes one or more columns in a table. The values in the column(s) on which the unique index is defined must be unique. The index will not allow duplicate values to be entered. In the case of our BillableHours table, a unique index would include the ConsultantID, Date, ProjectID and ActivityID columns. Note that these columns are the columns that originally formed the natural primary key.

At the risk of antagonizing those on the opposite side of this debate, we recommend the use of artificial primary keys. Artificial primary keys make database construction and use much simpler in practice, and you can easily enforce true uniqueness in every table by creating a unique index on the columns that would otherwise form the natural primary key.

NOTE

All modern relational databases support the creation of unique indexes, but the method used to create them is completely different from one database to the next. Consult the documentation for your database to determine how to create unique indexes on a table.