Jump to content

Third normal form

From Wikipedia, the free encyclopedia

Third normal form (3NF) is a level of database normalization defined by English computer scientist Edgar F. Codd. A relation (or table, in SQL) is in third normal form if it is in second normal form and also lacks non-key dependencies, meaning that no non-prime attribute is functionally dependent on (that is, contains a fact about) any other non-prime attribute. In other words, each non-prime attribute must depend solely and non-transitively on each candidate key.[1] William Kent summarised 3NF with the dictum that "a non-key field must provide a fact about the key, the whole key, and nothing but the key".[2][citation needed]

An example of a violation of 3NF would be a Patient relation with the attributes PatientID, DoctorID and DoctorName, in which DoctorName would depend first and foremost on DoctorID and only transitively on the key, PatientID (via DoctorID's dependency on PatientID). Such a design would cause a doctor's name to be redundantly duplicated across each of their patients. A database compliant with 3NF would store doctors' names in a separate Doctor relation which Patient could reference via a foreign key.

3NF was defined, along with 2NF (which forbids dependencies on proper subsets of composite keys), in Codd's paper "Further Normalization of the Data Base Relational Model" in 1971,[3] which came after 1NF's definition in "A Relational Model of Data for Large Shared Data Banks" in 1970.[citation needed] 3NF was itself followed by the definition of Boyce–Codd normal form in 1974, which seeks to prevent anomalies possible in relations with several overlapping composite keys.

Definition of third normal form

[edit]

Codd's definition states that a relation R is in 3NF if and only if it is in second normal form (2NF) and every non-prime attribute of R is non-transitively dependent on each candidate key. A non-prime attribute of R is an attribute that does not belong to any candidate key of R.[4]

Codd defines a transitive dependency of an attribute set Z on an attribute set X as a functional dependency chain XYZ that must be satisfied for some attribute set Y, where it is not the case that YX, and all three sets must be disjoint.[5]

A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if for each of its functional dependencies XY, at least one of the following conditions holds:[6][7][need quotation to verify]

  • X contains Y (that is, Y is a subset of X, meaning XY is a trivial functional dependency),
  • X is a superkey,
  • every element of Y \ X, the set difference between Y and X, is a prime attribute (i.e., each attribute in Y \ X is contained in some candidate key).

To rephrase Zaniolo's definition more simply, the relation is in 3NF if and only if for every non-trivial functional dependency X → Y, X is a superkey or Y \ X consists of prime attributes. Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent Boyce–Codd normal form (BCNF). BCNF simply eliminates the third alternative ("Every element of Y \ X, the set difference between Y and X, is a prime attribute.").

The definition offered by Zaniolo can be shown to be equivalent to the Codd definition in the following way: let X → A be a nontrivial functional dependency (i.e., one where X does not contain A) and let A be a non-prime attribute. Also let Y be a candidate key of R. Then Y → X. Therefore, A is not transitively dependent on Y if there is a functional dependency X → Y if and only if X is a superkey of R.

Example

[edit]

Design which violates 3NF

[edit]

The following relation, with the composite key {Name, Year}, fails to meet the requirements of 3NF. The non-prime attributes WinnerName and WinnerBirthdate are only transitively dependent on the composite key via their dependence on the non-prime attribute WinnerID. This creates redundancy and the potential for inconsistency in the case that a winner of multiple tornaments is accidentially given different dates of birth in different tuples.

Tornament
Name Year WinnerID WinnerName WinnerBirthdate
Indiana Invitational 1998 1 Al Fredrickson 1975-07-21
Cleveland Open 1999 2 Bob Albertson 1968-09-28
Des Moines Masters 1999 1 Al Fredrickson 1975-07-21
Indiana Invitational 1999 3 Chip Masterson 1977-03-14

Design which complies with 3NF

[edit]

To bring the relation into compliance with 3NF, WinnerID, WinnerName and WinnerBirthdate can be transferred to a separate table.

Tornament
Name Year WinnerID
Indiana Invitational 1998 1
Cleveland Open 1999 2
Des Moines Masters 1999 1
Indiana Invitational 1999 3
Winner
WinnerID Name Birthdate
1 Al Fredrickson 1975-07-21
2 Bob Albertson 1968-09-28
3 Chip Masterson 1977-03-14

Tornament's WinnerID attribute now acts as a foreign key referencing the primary key of Winner. Unlike before, it is not possible for a winner to be associated with multiple dates of birth.

"Nothing but the key"

[edit]

A paraphrase of Codd's definition of 3NF parodying the traditional oath to tell the truth in a court of law was given by William Kent: "a non-key field must provide a fact about the key, the whole key, and nothing but the key".[2] Requiring that non-key attributes be dependent on "the whole key" ensures compliance with 2NF, and further requiring their dependency on "nothing but the key" ensures compliance with 3NF. A common variation supplements the paraphrase with the addendum "so help me Codd".[8]

While the phrase is a useful mnemonic, the mention of only a single key makes fulfilling it necessary but not sufficient to satisfy 2NF and 3NF, both of which are concerned with all candidate keys of a relation and not just any one.[citation needed]

Christopher J. Date notes that, adapted to refer to all fields rather than just non-key fields, the summary can also encompass the slightly stronger Boyce–Codd normal form, in which prime attributes must not be functionally dependent at all.[9] Prime attributes are considered to provide a fact about the key in the sense of providing part or all of the key itself. (This rule applies only to functionally dependent attributes, as applying it to all attributes would implicitly prohibit composite keys, since each part of any such key would violate the "whole key" clause.)

Computation

[edit]

A relation can always be decomposed in third normal form, that is, the relation R is rewritten to projections R1, ..., Rn whose join is equal to the original relation. Further, this decomposition does not lose any functional dependency, in the sense that every functional dependency on R can be derived from the functional dependencies that hold on the projections R1, ..., Rn. What is more, such a decomposition can be computed in polynomial time.[10]

To decompose a relation into 3NF from 2NF, break the table into the canonical cover functional dependencies, then create a relation for every candidate key of the original relation which was not already a subset of a relation in the decomposition.[11]

Considerations for use in reporting environments

[edit]

While 3NF was ideal for machine processing, the segmented nature of the data model can be difficult to intuitively consume by a human user. Analytics via query, reporting, and dashboards were often facilitated by a different type of data model that provided pre-calculated analysis such as trend lines, period-to-date calculations (month-to-date, quarter-to-date, year-to-date), cumulative calculations, basic statistics (average, standard deviation, moving averages) and previous period comparisons (year ago, month ago, week ago) e.g. dimensional modeling and beyond dimensional modeling, flattening of stars via Hadoop and data science.[12][13] Hadley Wickham's "tidy data" framework is 3NF, with "the constraints framed in statistical language".[14]

See also

[edit]

References

[edit]
  1. ^ Codd, E. F. "Further Normalization of the Data Base Relational Model", p. 34.
  2. ^ a b Kent, William. "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26 (2), Feb. 1983, pp. 120–125.
  3. ^ Codd, E. F. "Further Normalization of the Data Base Relational Model". (Presented at Courant Computer Science Symposia Series 6, "Data Base Systems", New York City, May 24–25, 1971.) IBM Research Report RJ909 (August 31, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972.
  4. ^ Codd, p. 43.
  5. ^ Codd, p. 45–46.
  6. ^ Zaniolo, Carlo. "A New Normal Form for the Design of Relational Database Schemata". ACM Transactions on Database Systems 7(3), September 1982.
  7. ^ Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts (5th edition), p. 276–277.
  8. ^ The author of a 1989 book on database management credits one of his students with coming up with the "so help me Codd" addendum. Diehr, George. Database Management (Scott, Foresman, 1989), p. 331.
  9. ^ Date, C. J. An Introduction to Database Systems (7th ed.) (Addison Wesley, 2000), p. 379.
  10. ^ Serge Abiteboul, Richard B. Hull, Victor Vianu: Foundations of Databases. Addison-Wesley, 1995. http://webdam.inria.fr/Alice/ ISBN 0201537710. Theorem 11.2.14.
  11. ^ Hammo, Bassam. "Decomposition, 3NF, BCNF" (PDF). Archived (PDF) from the original on 2023-03-15.
  12. ^ "Comparisons between Data Warehouse modelling techniques – Roelant Vos". Roelant Vos. 12 February 2013. Retrieved 5 March 2018.
  13. ^ "Hadoop Data Modeling Lessons | EMC". InFocus Blog | Dell EMC Services. 23 September 2014. Retrieved 5 March 2018.
  14. ^ Wickham, Hadley (2014-09-12). "Tidy Data". Journal of Statistical Software. 59: 1–23. doi:10.18637/jss.v059.i10. ISSN 1548-7660.

Further reading

[edit]
[edit]