NORMALIZATION

In his seminal paper on the relational data model, E.F. Codd formulated a number of design principles for a relational database (Codd, 1970). These principles were originally formalized in terms of three normal forms: first normal form, second normal form and third normal form. The process of transforming a database design through these three normal forms is known as normalization. By the mid-1970s third normal form was shown to have certain inadequacies and a stronger normal form, known as Boyce–Codd normal form (BCNF), was introduced (Codd, 1974). Subsequently Fagin introduced fourth normal form and indeed fifth normal form (Fagin, 1977; 1979).
Normalization process is the establishment of the database structure so that most of the ambiguity can be removed. The normalization of starting the light phase (1NF) to most stringent (5NF). Usually only up to the level of 3NF or BCNF because already sufficient to generate the table-a table of good quality.


1. Why Normalize?
  • Optimization table structures
  • Increase the speed
  • Eliminate income the same data
  • More efficient use of storage media
  • Reduce redundancy
  • Avoiding anomalies (insertion anomalies, deletion anomalies, update anomalies).
  • Improved data integrity

2. Stages of Normalization

Normalization is carried out in the following steps:
  • Collect the data-set – the set of data-items
  • Transform the unnormalized data-set into tables in first normal form
  • Transform first normal form tables to second normal form
  • Transform second normal form tables to third normal form

Occasionally, the data may still be subject to anomalies in third normal form.
In this case, we may have to perform further steps:
  • Transform third normal form to Boyce–Codd normal form
  • Transform third normal form to fourth normal form
  • Transform fourth normal form to fifth normal form

The process of transforming an unnormalised data-set into a fully normalised (third normal form) database is frequently referred to as a process of non-loss decomposition. This is because we continually fragment our data structure into more and more tables without losing the fundamental relationships between data-items.

2.1 Determinancy/Dependency


Normalisation is the process of identifying the logical associations between data-items and designing a database which will represent such associations but without suffering the file maintenance anomalies.
The logical associations between data-items that point the database designer in the direction of a good database design are referred to as determinant or dependent relationships. Two data-items, A and B, are said to be in a determinant or dependent relationship if certain values of data-item B always appear with certain values of data-item A.
Determinancy/dependency also implies some direction in the association. If data-item A is the determinant data-item and B the dependent data-item then the direction of the association is from A to B and not vice versa. There are two major types of determinancy or its opposite dependency: functional (single-valued) determinancy, and non-functional (multi-valued) determinancy. We introduce here the concept of functional determinancy. Non-functional determinancy is discussed below.
Data-item B is said to be functionally dependent on data-item A if for every value of A there is one, unambiguous value for B. In such a relationship data-item A is referred to as the determinant data-item, while data-item B is referred to as the dependent data-item. Functional determinancy is so-called because it is modelled on the idea of a mathematical function. A function is a directed one-to-one mapping between the elements of one set and the elements of another set.


2.2 Unnormalized Data-Set to First Normal Form

The data-set represented in tabular form in section 18.2 is said to be an unnormalized data-set. This can be seen, for instance, if we choose the data-item moduleName as the key of this data-set and underline it to indicate this. Realistically, if we remove redundant information, we should represent the information as follows:



A given cell of the table for the attributes studentNo, studentName, assGrade and assType contains multiple values. Examining the table above we see that studentNo, studentName, assGrade and assType all repeat with respect to moduleName.

A relation is in first normal form if and only if every non-key attribute is functionally dependent upon the primary key

The attributes studentNo, studentName, assGrade and assType are clearly not functionally dependent on our chosen primary key moduleName. The attributes staffNo and staffName clearly are. This means that we form two tables: one for the functionally dependent attributes, and one for the non-dependent attributes. We declare a compound of moduleName, studentNo and assType to be the primary key of this second table.



2.3 First Normal Form to Second Normal Form

To move from first normal form to second normal form we remove part-key dependencies. This involves examining those tables that have a compound key and for each non-key data-item in the table asking the question: can the data-item be uniquely identified by part of the compound key?

A relation is in second normal form if and only if it is in first normal form and every non-key attribute is fully functionally dependent on the primary key

Take, for instance, the table named Assessments. Here we have a three-part compound key moduleName, studentNo and assType. We ask the question above for each of these data-items in relation to the non-key data-items studentName and assGrade. Clearly we need all the items of the key to tell us what the assessment grade is. ModuleName however has no influence on the
studentName. StudentNo alone determines studentName. Hence, we break out the determinant and dependent data-items into their own table. This leads to a decomposition of the tables as follows:



2.4 Second Normal Form to Third Normal Form

To move from second normal form to third normal form we remove inter-data dependencies. To do this we examine every table and ask of each pair of non-key data-items: is the value of data-item A dependent on the value of data-item B, or vice versa? If the answer is yes we split off the relevant data-items into a separate table.

A relation is in third normal form if and only if it is in second normal form and every non-key attribute is non-transitively dependent on the primary key

The only place where this is relevant in our present example is in the table called Modules. Here, staffNo determines staffName. StaffName is hence transitively dependent on moduleName. StaffNo is therefore asking to be a primary key. Hence, we create a separate table to be called Lecturers with staffNo as the primary key. This is illustrated below:




2.5 BCNF

BCNF is a stronger normal form than third normal form, designed to cover anomalies that arise when there is more than one candidate key in some set of data requirements.

Example:
Suppose we have introduced a scheme of majors and minors into our degree schemes at a university. The business rules relevant to that part of this domain covering majors are listed below:
  • Each student may major in several areas
  • A student has one tutor for each area
  • Each area has several tutors but a tutor advises in only one area
  • Each tutor advises several students in an area

A diagram incorporating all these business rules is given in figure below.



On the basis of these business rules, a schema is produced in third normal form, represented in the bracketing notation below:

Majors(studentNo, area, staffNo)

A set of sample data is provided in the table below.



This schema is in third normal form because there are no partial dependencies and no inter-data dependencies. However, anomalies will still arise when we come to update this relation. For instance:
  • Suppose student 123456 changes one of his or her majors from computer science to information systems. Doing this means that we lose information about staffNo 234 tutoring on computer science. This is an update anomaly.
  • Suppose we wish to insert a new row to establish the fact that staffNo 789 tutors on computer science. We cannot do this until at least one student takes this area as his or her major. This is an insertion anomaly.
  • Suppose student 345678 withdraws from the university. In removing the relevant row we lose information about staffNo 567 being a tutor in the area of information systems. This is a deletion anomaly.
These anomalies occur because there are two overlapping candidate keys in this problem. R.F. Boyce and E.F. Codd identified this problem and proposed a solution in terms of a stronger normal form known as BCNF. A relation is in BCNF if every determinant is a candidate key. The schema above can be converted into BCNF in one of two ways. The two schemas are presented in bracketing notation below:

Schema 1:
StudentTutors(studentNo, staffNo)
TutorAreas(staffNo, area)
Schema 2:
StudentTutors(studentNo, area)
TutorAreas(staffNo, area)

2.6 Fourth and Fifth Normal Form

Relations in the fourth normal form (NF 4) if the relation in BCNF and does not contain a lot of dependence values. To remove the dependency of many values from a relation, we divide the relationship into two new relations. Each relation contains two attributes that have a lot of relationship value.
Relations in fifth normal form (5NF) deal with a property called 'join' without any loss of information (lossless join). Fifth normal form (also called the 5 NF PJNF (projection join normal form). The case is very rare and appear difficult to detect in practice.




0 comments:

Post a Comment