Data normalization

The normalization process aims to minimize data duplications, avoid errors during data modifications, and simplify data queries from the database. The three fundamental normalization forms are known as:

First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)

In this reading, you will learn how to apply the rules that ensure that a database meets the criteria of these three normal forms.

The following example includes fictitious data required by a Medical Group Surgery based in London to generate relevant reports. Doctors work in multiple regions and various councils in London. And once patients book an appointment, they are given a slot ID at their local surgery. There might be multiple surgeries in the same council but with different postcodes, where one or more councils belong to a particular region. For example, East or West London.

Doctor ID	Doctor name	Region	Patient ID	Patient name
D1	Karl	West London	P1
P2
P3	Rami
Kim
Nora	3	Harrow	HA9SDE	A1
A2
A3	1500 1200 1600
D1	Karl	East London	P4
P5	Kamel
Sami	4	Hackney	E1 6AW	A1
A2	2500 1000
D2	Mark	East London	P5
P6	Sami
Norma	4	Hackney	E1 6AW	A3
A4	1500 2000
D2	Mark	West London	P7
P1	Rose
Rami	5	Harrow	HA862E	A4
A5	1000
1500

The data listed in the table are in an unnormalized form. Repeating groups of data appear in many cases, for instance, doctors, regions, and council names. There are also multiple instances of data stored in the same cell, such as with the patient name and total cost columns. This makes it difficult to update and query data. Moreover, it is not easy to choose a unique key and assign it as a primary key.

This unnormalized table can be written in SQL form as follows:

CREATE TABLE Surgery 
    (DoctorID VARCHAR(10), 
    DoctorName VARCHAR(50), 
    Region VARCHAR(20), 
    PatientID VARCHAR(10), 
    PatientName VARCHAR(50), 
    SurgeryNumber INT, Council VARCHAR(20), 
    Postcode VARCHAR(10), 
    SlotID VARCHAR(5), 
    TotalCost Decimal);

First normal form

To simplify the data structure of the surgery table, let’s apply the first normal form rules to enforce the data atomicity rule and eliminate unnecessary repeating data groups. The data atomicity rule means you can only have one single instance value of the column attribute in any table cell.

The atomicity problem only exists in the columns of data related to the patients. Therefore, it is important to create a new table for patient data to fix this. In other words, you can organize all data related to the patient entity in one separate table, where each column cell contains only one single instance of data, as depicted in the following example:

Patient ID	Patient name	Slot ID	Total Cost
P1	Rami	A1	1500
P2	Kim	A2	1200
P3	Nora	A3	1600
P4	Kamel	A1	2500
P5	Sami	A2	1000
P6	Norma	A5	2000
P7	Rose	A6	1000

This table includes one single instance of data in each cell, which makes it much simpler to read and understand. However, the patient table requires two columns, the patient ID and the Slot ID, to identify each record uniquely. This means that you need a composite primary key in this table. To create this table in SQL you can write the following code:

CREATE TABLE Patient
 (PatientID VARCHAR(10) NOT NULL, 
  PatientName VARCHAR(50), 
  SlotID VARCHAR(10) NOT NULL, 
  TotalCost Decimal,  
  CONSTRAINT PK_Patient 
  PRIMARY KEY (PatientID, SlotID));

Once you have removed the patient attributes from the main table, you just have the doctor ID, name, region, surgery number, council and postcode columns left in the table.

Doctor ID	Doctor name	Region	Surgery Number	Surgery council	Postcode
D1	Karl	West London	3	Harrow	HA9SDE
D1	Karl	East London	4	Hackney	E1 6AW
D2	Mark	West London	4	Hackney	E1 6AW
D2	Mark	East London	5	Harrow	HA862E

You may have noticed that the table also contains repeating groups of data in each column. You can fix this by separating the table into two tables of data: the doctor table and the surgery table, where each table deals with one specific entity.

First normal form

Doctor table