Normalization. Concept and necessity of using. Modification anomalies. Examples
1. Normalization. Purpose and need to use
Normalization is a process (procedure) of converting database tables to a number of normal forms (NF) in order to avoid redundancy in the database, anomalies of data insertion, editing and deletion. The tables may have inefficient or inappropriate structures that need to be normalized. Normalization involves splitting the original table (relationship) into multiple new tables (relationship).
Applying the normalization mechanism correctly to a database has the following interrelated benefits:
- the cost of storing data (data redundancy) is minimized. In other words, the size of the database is reduced;
- with an increase in the size of the database, the time spent on accessing data does not increase so much;
- there are no modification anomalies in the database. Modification anomalies include data insertion, editing, and deletion anomalies.
The normalization process involves the use of so-called normal forms. To date, the following normal forms are known (Figure 1):
- first normal form (1NF);
- second normal form (2NF);
- third normal form (3NF);
- Boyes-Codd normal form (NFBK);
- fourth normal form (4NF);
- fifth normal form (5NF).
A database is considered properly designed (optimal or close to optimal) if it meets the requirements of normal forms. It is not necessary to apply all 5 normal forms. If the number of attributes (columns) in the database is small, then the use of the first three normal forms is sufficient. The relationship of normal forms is shown in Figure 1.
Figure 1. Relationship of normal forms
2. The concept of data redundancy. Example
Data redundancy occurs when the database table is not designed properly. In this case, the table contains repeating groups of data. Such data groups arise when an attempt is made to write more than one value into one cell of a table.
Example. Let there be given a database of accounting for the educational process in a certain educational institution, which is described by a table (one of the tables) with the following structure
Figure 2. The structure of the database tables of the institution
For example, the following data is entered into the table (fragment of the table).
Figure 3. Table with filled data. Redundancy of data
In the above table, data redundancy appears in the following definitions:
- to identify the student (column Student) the data group “Book number” – “Address” is repeated. It is enough to specify only the Student field, and keep the data on the grade book number and address in another related table. Then the data of the fields “Book number” and “Address” will not be repeated once more. As a consequence, the size of the database tables will be smaller;
- it is superfluous to indicate the department for the teacher, since the teacher can be assigned to only one department. It is advisable to save the name of the department and its relationship with the teacher in another (linked) table.
3. Insertion anomaly. Example
Insertion anomaly occurs when you need to add data to a table. Here, a situation may arise when, to insert data, you need to add unnecessary (non-existent) data. In other words, it is impossible to write data about one entity to the database without specifying data about another entity. This means that an insertion anomaly is the addition of unwanted or non-existent (fictitious) information about one entity at the time of insertion of information about another entity.
Example. The following database is specified, which is based on a single table. The table defines information about teachers (Teacher, Subject, Department), students (Student, Book Number, Address), student progress (Mark).
Figure 4. Table with data on educational performance
Suppose a new teacher of mathematics (columns Teacher, Discipline), who has recently been hired, needs to be added to this database. Otherwise, in such a database view, it will be impossible to add data. This means that when adding a teacher, you need to capitalize on non-existent student assessment data. This is the insertion anomaly.
Figure 5. An example of an insertion anomaly. Adding a teacher to the database requires specifying information about the student
The same can be said about the student. If you need to add a student to the database who will be graded after some time (at the end of the semester), then you need to benefit from the grade that he will receive from a discipline that is still being studied. The teacher may already be known at this point.
4. The editing anomaly. Example
There are times when the data in a certain cell in a database table needs to be edited (corrected). This could be due to, for example, typing errors or changes in some names over time for good reasons. If the corrected data is saved in one copy, then there is no problem. If the corrected data is saved in many cells of the table, then a so-called editing anomaly occurs.
This means that the editing anomaly occurs when duplicate data exists in the database table. It is difficult to update such data when editing it, since it is necessary to make changes to all table cells in which this data appears. If, when you change the repeated data in one cell, you do not change the same data in other cells, then the computer will perceive these data as different (unlike a person).
Editing anomaly is a forced need to change (update) data in the entire table in case of their change (update) in one cell of the table in order to avoid their two-digit interpretation.
Example. Let the table of the educational record database be given. Let the physics teacher Trump D. got married and changed her last name to Smith. Now, in all cells of the column (attribute) Teacher, you need to change the teacher’s name Trump D. on Smith D. (Figure 4).
Figure 6. The editing anomaly. Editing the same data in one cell requires changing this data in other cells
5. Removal anomaly. Example
Removal anomaly occurs when you need to delete data from a table. Removal anomaly is the loss of some data in a table while deleting other data in a table.
Example. Suppose that in the table of the database, the assessment in the discipline Informatics was mistakenly entered, which was transferred to the next semesters of study. Automatically, when deleting the line with the discipline “Informatics”, the line with data about the teacher (Wonder S.), who teaches this discipline and the name of the department (Mathematical disciplines), will be lost.
Figure 7. Removal anomaly. When you delete information about the student’s assessment, information about the teacher of the department is lost