Friday, December 7, 2007

Relational Database Management Systems

A relational database management system (RDBMS) is a software program you use to create, maintain, modify, and manipulate a relational database. Many RDBMS programs also provide the tools you need to create end-user applications that interact with the data stored in the database. Of course, the quality of an RDBMS is a direct function of the extent to which it supports the relational database model. Even among "true" RDBMSs, support for the relational database varies among vendors, and there is yet to be a full implementation of the relational model's potential. Despite this, all RDBMS programs continue to evolve and become more full-featured and powerful than ever before.

Since the early 1970s, a number of RDBMS programs have been produced by a variety of software vendors, encompassing various types of computer hardware, operating systems, and programming environments. As we continue our voyage into the beginning of the twenty-first century, it's safe to say that RDBMS programs are as ubiquitous and integrated into our daily lives as cellular phones.

In the earliest days of the relational database, RDBMSs were written for use on mainframe computers. (Didn't everything start on a mainframe?) Two RDBMS programs prevalent in the early 1970s were System R, developed by IBM at its San Jose Research Laboratory in California, and Interactive Graphics Retrieval System (INGRES), developed at the University of California at Berkeley. These two programs contributed greatly to the general appreciation of the relational model.

As the benefits of the relational database became more widely known, many companies decided to make a slow move from hierarchical and network database models to the relational database model, thus creating a need for more and better mainframe RDBMS programs. The 1980s saw the development of various commercial RDBMSs for mainframe computers, such as Oracle, developed by Oracle Corporation, and IBM's DB2.

The early to mid-1980s saw the rise of the personal computer, and with it the development of PC-based RDBMS programs. Some of the early entries in this category, such as dBase by Ashton-Tate and FoxPro from Fox Software, were nothing more than elementary file-based database-management systems. True PC-based RDBMS programs began to emerge with the introduction of R:BASE, originally developed by Microrim, and Paradox, originally developed by Ansa Software. Each of these products helped to spread the idea and potential of database management from the mainframe-dominated domain of information systems departments to the desktop of the common end user.

The need to share data became apparent as more and more users worked with databases throughout the late 1980s and early 1990s. The concept of a centrally located database that could be made available to multiple users seemed a very promising idea. This would certainly make data management and database security much easier to implement. Database vendors responded to this need by developing client/server RDBMS programs.

As Figure illustrates, the data in this type of system resides on a computer acting as a database server, and users interact with the data through applications residing on their own computer, or database client. The database developer uses the client/server RDBMS program to create and maintain the database and attendant end-user application programs. She implements data integrity and data security on the database server, giving her the ability to base a variety of user applications on the same set of data without affecting the data's integrity or security.


Client/server RDBMS programs have been widely used for quite some time to manage large volumes of shared data. Some of the more recent entries in the client/server RDBMS category are Microsoft SQL Server 2000 from Microsoft Corporation and Oracle9i Application Server, from Oracle Corporation.

The Relational Database Model

The relational database was first conceived in 1969 and has arguably become the most widely used database model in database management today. The father of the relational model, Dr. Edgar F. Codd, was an IBM research scientist in the late 1960s and was at that time looking into new ways to handle large amounts of data. His dissatisfaction with the database models and database products of the time led him to begin thinking of ways to apply the disciplines and structures of mathematics to solve the myriad of problems he had been encountering. Being a mathematician by profession, he strongly believed that he could apply specific branches of mathematics to solve problems, such as data redundancy, weak data integrity, and a database structure's overdependence on its physical implementation.

Dr. Codd formally presented his new relational model in a landmark work entitled "A Relational Model of Data for Large Shared Databanks"[1] in June of 1970. He based his new model on two branches of mathematics—set theory and first-order predicate logic. Indeed, the name of the model itself is derived from the term relation, which is part of set theory. (A widely held misconception is that the relational model derives its name from the fact that tables within a relational database can be related to one another.)

[1] Edgar F. Codd, "A Relational Model of Data for Large Shared Databanks," Communications of the ACM, June 1970, 377–87.

A relational database stores data in relations, which the user perceives as tables. Each relation is composed of tuples, or records, and attributes, or fields. (I'll use the terms tables, records, and fields throughout the remainder of the book.) The physical order of the records or fields in a table is completely immaterial, and each record in the table is identified by a field that contains a unique value. These are the two characteristics of a relational database that allow the data to exist independently of the way it is physically stored in the computer. As such, a user isn't required to know the physical location of a record in order to retrieve its data. This is unlike the hierarchical and network database models, in which knowing the layout of the structures is crucial to retrieving data.

The relational model categorizes relationships as one-to-one, one-to-many, and many-to-many. (These relationships are covered in detail in Chapter 10.) A relationship between a pair of tables is established implicitly through matching values of a shared field. In Figure for example, the CLIENTS and AGENTS tables are related via an AGENT ID field; a specific client is associated with an agent through a matching AGENT ID. Likewise, the ENTERTAINERS and ENGAGEMENTS tables are related via an ENTERTAINER ID; a record in the ENTERTAINERS table can be associated with a record in the ENGAGEMENTS through matching ENTERTAINER IDs.


As long as a user is familiar with the relationships among the tables in the database, he can access data in an almost unlimited number of ways. He can access data from tables that are directly related and from tables that are indirectly related. Consider the Agents database in Figure 1.5. Although the CLIENTS table is indirectly related to the ENGAGEMENTS table, the user can produce a list of clients and the entertainers who have performed for them. (Of course, it really depends on how the tables are actually structured, but I digress. This example serves our purpose for now.) He can do this easily because CLIENTS is directly related to ENGAGEMENTS and ENGAGEMENTS is directly related to ENTERTAINERS.

Retrieving Data

You retrieve data in a relational database by using Structured Query Language (SQL). SQL is the standard language used to create, modify, maintain, and query relational databases. Figure 1.6 shows a sample SQL query statement you can use to produce a list of all clients in the city of El Paso.

Figure 1.6 A sample SQL query statement.
SELECT ClientLastName, ClientFirstName, ClientPhoneNumber
FROM Clients
WHERE City = "El Paso"
ORDER BY ClientLastName, ClientFirstName;

The three components of a basic SQL query are the SELECT…FROM statement, the WHERE clause, and the ORDER BY clause. You use the SELECT clause to indicate the fields you want to use in the query and the FROM clause to indicate the table(s) to which the fields belong. You can filter the records the query returns by imposing criteria against one or more fields with the WHERE clause, and then sort the results in ascending or descending order with the ORDER BY clause.

Most of today's major relational database software programs incorporate various forms of SQL implementations, ranging from windows in which users can manually enter "raw" SQL statements to graphical tools that allow users to build queries using various graphic elements. For example, a user working with R:BASE Technologies's R:BASE can opt to build and execute SQL query statements directly from a command prompt, while someone using Microsoft Access may find it easier to build queries using Access's graphical query builder. Regardless of how the queries are built, the user can save them for future use.

It's not always necessary for you to know SQL in order to work with a database. If your database software provides a graphical query builder or you're using a custom-built application to work with the data in your database, you'll never need to write a single SQL statement. It's a good idea, however, for you to gain a basic understanding of SQL. It will help those of you using query-building tools to understand and troubleshoot the queries you create with these tools, and it will definitely be to your advantage should you need to work high-end database software programs, such as Oracle and Microsoft SQL Server.

Note

Although a detailed discussion of SQL is beyond the scope of this book, you should understand that SQL is a language directly related to the relational database model. If you have a desire or need to study SQL, you could start by reading my second book, SQL Queries for Mere Mortals, and then move on to any of the other SQL books that are on my recommended reading list in Appendix A.


Advantages of a Relational Database

The relational database provides a number of advantages over previous models, such as the following:

  • Built-in multilevel integrity: Data integrity is built into the model at the field level to ensure the accuracy of the data; at the table level to ensure that records are not duplicated and to detect missing primary key values; at the relationship level to ensure that the relationship between a pair of tables is valid; and at the business level to ensure that the data is accurate in terms of the business itself. (Integrity is discussed in detail as the design process unfolds.)

  • Logical and physical data independence from database applications: Neither changes a user makes to the logical design of the database, nor changes a database software vendor makes to the physical implementation of the database, will adversely affect the applications built upon it.

  • Guaranteed data consistency and accuracy: Data is consistent and accurate due to the various levels of integrity you can impose within the database. (This will become quite clear as you work through the design process.)

  • Easy data retrieval: At the user's command, data can be retrieved either from a particular table or from any number of related tables within the database. This enables a user to view information in an almost unlimited number of ways.

These and other advantages have proved beneficial to the business community and to all those who need to collect and manage data. Indeed, the relational database has become the database of choice in many circumstances.

Until recently, one perceived disadvantage of the relational database was that software programs based on it ran very slowly. This was not a fault of the relational model itself, but of the ancillary technology available at the time of the model's introduction. Processing speed, memory, and storage were simply insufficient to provide database software vendors with a platform on which to build a full implementation of the relational database, so the initial relational database software programs fell woefully short of their full potential. Since the early 1990s, however, advances in both hardware technology and software engineering have made processing speed an insignificant issue and have allowed vendors to make significant gains in their efforts to support the model more fully.

You'll learn more about the relational database model as you work through the design process presented in this book. Some of the topics you'll encounter include creating tables, establishing data integrity, working with relationships, and establishing business rules.

Early Database Models

In the days before the relational database model, two data models were commonly used to maintain and manipulate data—the hierarchical database model and the network database model.

Note

Although use of these models is rapidly waning, I've provided a brief overview of each for historical purposes. In an overall sense, I believe it is useful for you to know what preceded the relational model so that you have a basic understanding of what led to its creation and evolution.

In the following overview I briefly describe how the data in each model is structured and accessed, how the relationship between a pair of tables is represented, and one or two of the advantages or disadvantages of each model.


Some of the terms you'll encounter in this section are explained in more detail in Chapter 3, "Terminology."

The Hierarchical Database Model

Data in this type of database is structured hierarchically and is typically diagrammed as an inverted tree. A single table in the database acts as the "root" of the inverted tree and other tables act as the branches flowing from the root. Figure shows a diagram of a typical hierarchical database structure.


Agents Database

In the example shown in Figure an agent books several entertainers, and each entertainer has his own schedule. An agent also maintains a number of clients whose entertainment needs are met by the agent. A client books engagements through the agent and makes payments to the agent for his services.


A relationship in a hierarchical database is represented by the term parent/child. In this type of relationship, a parent table can be associated with one or more child tables, but a single child table can be associated with only one parent table. These tables are explicitly linked via a pointer or by the physical arrangement of the records within the tables. A user accesses data within this model by starting at the root table and working down through the tree to the target data. This access method requires the user to be very familiar with the structure of the database.

One advantage to using a hierarchical database is that a user can retrieve data very quickly because there are explicit links between the table structures. Another advantage is that referential integrity is built in and automatically enforced. This ensures that a record in a child table must be linked to an existing record in a parent table, and that a record deleted in the parent table will cause all associated records in the child table to be deleted as well.

A problem occurs in a hierarchical database when a user needs to store a record in a child table that is currently unrelated to any record in a parent table. Consider an example using the Agents database shown in Figure A user cannot enter a new entertainer in the ENTERTAINERS table until the entertainer is assigned to an agent in the AGENTS table. Recall that a record in a child table (in this case, ENTERTAINERS) must be related to a record in the parent table (AGENTS). Yet in real life, entertainers commonly sign up with the agency well before they are assigned to specific agents. This scenario is difficult to model in a hierarchical database. The rules can be bent without breaking them if a dummy agent record is inserted in the AGENTS table; however, this option is not really optimal.

This type of database cannot support complex relationships, and there is often a problem with redundant data. For example, there is a many-to-many relationship between clients and entertainers; an entertainer will perform for many clients, and a client will hire many entertainers. You can't directly model this type of relationship in a hierarchical database, so you'll have to introduce redundant data into both the SCHEDULE and ENGAGEMENTS tables.

  • The SCHEDULE table will now have client data (such as client name, address, and phone number) to show for whom and where each entertainer is performing. This particular data is redundant because it is currently stored in the CLIENTS table.

  • The ENGAGEMENTS table will now contain data on entertainers (such as entertainer name, phone number, and type of entertainer) to indicate which entertainers are performing for a given client. This data is redundant as well because it is currently stored in the ENTERTAINERS table.

The problem with this redundancy is that it opens up the possibility of allowing a user to enter a single piece of data inconsistently. This, in turn, can result in producing inaccurate information.

A user can solve this problem in a roundabout manner by creating one hierarchical database specifically for entertainers and another specifically for agents. The new Entertainers database will contain only the ENTERTAINERS table, and the revised Agents database will contain the AGENTS, CLIENTS, PAYMENTS, and ENGAGEMENTS tables. The SCHEDULE table is no longer needed in the Entertainers database because you can define a logical child relationship between the ENGAGEMENTS table in the Agents database and the ENTERTAINERS table in the Entertainers database. With this relationship in place, you can retrieve a variety of information, such as a list of booked entertainers for a given client or a performance schedule for a given entertainer. Figure shows a diagram of the new model.


As you see, a person designing a hierarchical database must be able to recognize the need to use this technique for a many-to-many relationship. Here the need is relatively obvious, but many relationships are more obscure and may not be discovered until very late in the design process or, more disturbingly, well after the database has been put into operation.

The hierarchical database lent itself well to the tape storage systems used by mainframes in the 1970s and was very popular in companies that used those systems. But, despite the fact that the hierarchical database provided fast and direct access to data and was useful in a number of circumstances, it was clear that a new database model was needed to address the growing problems of data redundancy and complex relationships among data.

The Network Database Model

The network database was, for the most part, developed as an attempt to address some of the problems of the hierarchical database. The structure of a network database is represented in terms of nodes and set structures. Figure shows a diagram of a typical network database.


Agents Database

In the example shown in Figure an agent represents a number of clients and manages a number of entertainers. Each client schedules any number of engagements and makes payments to the agent for his or her services. Each entertainer performs a number of engagements and may play a variety of musical styles.


A node represents a collection of records, and a set structure establishes and represents a relationship in a network database. It is a transparent construction that relates a pair of nodes together by using one node as an owner and the other node as a member. (This is a valuable improvement on the parent/child relationship.) A set structure supports a one-to-many relationship, which means that a record in the owner node can be related to one or more records in the member node, but a single record in the member node is related to only one record in the owner node. Additionally, a record in the member node cannot exist without being related to an existing record in the owner node. For example, a client must be assigned to an agent, but an agent with no clients can still be listed in the database. Figure shows a diagram of a basic set structure.


One or more sets (connections) can be defined between a specific pair of nodes, and a single node can also be involved in other sets with other nodes in the database. In Figure for instance, the CLIENTS node is related to the PAYMENTS node via the Make set structure. It is also related to the ENGAGEMENTS node via the Schedule set structure. Along with being related to the CLIENTS node, the ENGAGEMENTS node is related to the ENTERTAINERS node via the Perform set structure.

A user can access data within a network database by working through the appropriate set structures. Unlike the hierarchical database, where access must begin from a root table, a user can access data from within the network database, starting from any node and working backward or forward through related sets. Consider the Agents database in Figure once again. Say a user wants to find the agent who booked a specific engagement. She begins by locating the appropriate engagement record in the ENGAGEMENTS node, and then determines which client "owns" that engagement record via the Schedule set structure. Finally, she identifies the agent that "owns" the client record via the Represent set structure. The user can answer a wide variety of questions as long as she navigates properly through the appropriate set structures.

One advantage the network database provides is fast data access. It also allows users to create queries that are more complex than those they created using a hierarchical database. A network database's main disadvantage is that a user has to be very familiar with the structure of the database in order to work through the set structures. Consider the Agents database in Figure once again. It is incumbent on the user to be familiar with the appropriate set structures if she is to determine whether a particular engagement has been paid. Another disadvantage is that it is not easy to change the database structure without affecting the application programs that interact with it. Recall that a relationship is explicitly defined as a set structure in a network database. You cannot change a set structure without affecting the application programs that use this structure to navigate through the data. If you change a set structure, you must also modify all references made from within the application program to that structure.

Although the network database was clearly a step up from the hierarchical database, a few people in the database community believed that there must be a better way to manage and maintain large amounts of data. As each data model emerged, users found that they could ask more complex questions, thereby increasing the demands made upon the database. And so, we come to the relational database model.

Database Design for Mere Mortals

Types of Databases

What is a database? As you probably know, a database is an organized collection of data used for the purpose of modeling some type of organization or organizational process. It really doesn't matter whether you're using paper or a computer software program to collect and store the data. As long as you're gathering data in some organized manner for a specific purpose, you've got a database. Throughout the remainder of this discussion, we'll assume that you're using a computer software program to collect and maintain your data.

There are two types of databases found in database management, operational databases and analytical databases.

Operational databases are the backbone of many companies, organizations, and institutions throughout the world today. This type of database is primarily used in on-line transaction processing (OLTP) scenarios, that is, in situations where there is a need to collect, modify, and maintain data on a daily basis. The type of data stored in an operational database is dynamic, meaning that it changes constantly and always reflects up-to-the-minute information. Organizations, such as retail stores, manufacturing companies, hospitals and clinics, and publishing houses, use operational databases because their data is in a constant state of flux.

In contrast, analytical databases are primarily used in on-line analytical processing (OLAP) scenarios, where there is a need to store and track historical and time-dependent data. An analytical database is a valuable asset when there is a need to track trends, view statistical data over a long period of time, and make tactical or strategic business projections. This type of database stores static data, meaning that the data is never (or very rarely) modified. The information gleaned from an analytical database reflects a point-in-time snapshot of the data. Chemical labs, geological companies, and marketing-analysis firms are examples of organizations that use analytical databases.

Analytical databases often use data from operational databases as their main data source, so there can be some amount of association between them; nevertheless, operational and analytical databases fulfill very specific types of data-processing needs, and creating their structures requires radically different design methodologies. This book focuses on designing an operational database because it is still the most widely used type of database in the world today.