Thursday 18 October 2012

SQL Introduction, Summary and Normalization

SQL Introduction

SQL is a standard computer language for accessing and manipulating databases. SQL language mainly consists of 3 types of paradigms.

·         DDL (Data Definition Language)
·         DML (Data Manipulation Language)
·         DCL (Data Control Language)

SQL is the American National Standards Institute (ANSI) language for talking to Databases (though there are many additions to ANSI SQL). Some people pronounce it ess-que-ell, but I prefer to say it like sequel, since I can say it faster.

Simply put SQL is a way of asking questions, or giving orders, to a database. A database is just a collection of information. I'm a fan of examples, so here's a (not-so) hypothetical situation:

You keep a pad of paper next to the telephone for writing down people who call, and new phone numbers. You want to know how many times the school has called you this month. So you flip through the stack of paper and pull out all the pages with the school and this month on it. Your count them.
Your stack of paper is a database. You are searching and counting is a query. Your database works for you. But if you suddenly hire a call center, and need to find out how many times your office called a prospect, you'll probably want to move to a SQL database.

The biggest advantage of SQL is that you don't have to go through the stack of papers to find information you just need to know how to ask the question. And the question is always asked the same way:

select what you want to know
from what stack of paper
where something is on the page

Not to complicated. The above example would read like this
select count(when)
from phonecalls
where caller like '%school%' and date > now() - interval 1 months

What is SQL?

·         SQL stands for Structured Query Language
·         SQL allows you to access a database
·         SQL is ANSI standard computer language
  • SQL can execute queries against a database
  • SQL can retrieve data from a database L can insert new records in a database
  • SQL can delete records from a database
  • SQL can update records in a database
  • SQL is easy to learn
SQL DDL
The Data Definition Language (DDL) part of SQL permits database tables to be created or deleted. We can also define indexes (keys), specify links between tables, and impose constraints between database tables.
The most important DDL statements in SQL are:
  • CREATE TABLE - creates a new database table
  • ALTER TABLE - alters (changes) a database table
  • DROP TABLE - deletes a database table
  • CREATE INDEX - creates an index (search key)
  • DROP INDEX - deletes an index
SQL DML
SQL (Structured Query Language) is syntax for executing queries. But the SQL language also includes syntax to update, insert, and delete records.
These query and update commands together form the Data Manipulation Language (DML) part of SQL:
  1. SELECT - extracts data from a database table
  2. UPDATE - updates data in a database table
  3. DELETE - deletes data from a database table
  4. INSERT INTO - inserts new data into a database table
Common terms used in SQL commands
·         Primary Key (PK)     - uniquely identifies each tuple in the relation with no redundant attributes
·         Foreign Key (FK)                 - an attribute that is a PK in another relation
·         Composite Key                   - a PK consisting of multiple attributes
·         Candidate Key (CK)             - a set of attributes that can be used as a PK
·         Alternate Key (AK)               - a candidate key not designed as a PK
·         Super Key                           - a set of attributes that includes a CK and may include redundant key attributes
·         Prime Attribute                    - any attribute that belongs to a CK
·         Secondary Key                   - an attribute that is used as an index
·         Search Keys                       - An attribute used in a query
SQL reference
create table syntax
create table table_name (column_name1 data_type1, column_name2 data_type2, …, [
CONSTRAINT constraint_name CONSTRAINT_TYPE constraint_condition, …]);
update statement syntax
update table_expression_clause set_clause where_clause
select statement syntax
SELECT * | column_names_list FROM table_name WHERE_clause group_by_clause | order_by-clause;
delete statement syntax
delete FROM table_expression_clause WHERE_clause;
insert statement syntax
insert into table_expression_clause (column1, column2, …) values_clause

Normalization
Normalization is the process of non-loss decomposition of the database into tables so as to minimize the data redundancy and inconsistencies in the database.

We have to consider various constraints on the database so as to convert it into standard form viz. Integrity constraints and dependencies. Normalization is achieved through the series of steps. These steps are called normal forms.

There are various normal forms like 1NF, 2NF and 3NF. Generally database normalized at 3NF serves the purpose, but for complex database representations, more advance level of normalizations, viz. 4NF, BCNF, DKNF etc is done.

The database in one normal form is by default in previous normal forms. E.g. if it is in 3NF, it must be in 1NF and 2NF.



 




First Normal Form: A relation is in first normal form if it does not contain any repeating values or groups.
  • Eliminate duplicative columns from the same table.
  • Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
Second Normal Form: A relation is in second normal form if it is in 1NF and all non-key attributes are fully functional dependent on the primary key.
  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys.
Third Normal Form: A relation is in third normal form if it is in 2NF and every attribute is independent of all other non-key attributes.
  • Remove columns that are not dependent upon the primary key.
BCNF (Boyce-Codd Normal Form): A relation is in BCNF is if and only if every determinant is a candidate key. (A candidate is any attribute on which some other attribute is (fully) functionally dependent. 
Fourth normal form (4NF):
  • A relation is in 4NF if it has no multi-valued dependencies.
Finally we have Fifth Normal Form (5 NF): In some cases there may be no losses join decomposition into two relation schemas but there may be a losses join decomposition into more than two relation schemas. These cases are handled by the join dependency and fifth normal form, and it’s important to note that these cases occur very rarely and are difficult to detect in practice.
We can also always define stricter forms that take into account additional types of dependencies and constraints. The idea behind domain-key normal form is to specify, (theoretically, at least) the "ultimate normal form" that takes into account all possible dependencies and constraints. A relation is said to be in DKNF if all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and the key constraints specified on the relation.
For a relation in DKNF, it becomes very straightforward to enforce the constraints by simply checking that each attribute value in a tuple is of the appropriate domain and that every key constraint on the relation is enforced. However, it seems unlikely that complex constraints can be included in a DKNF relation; hence, its practical utility is limited. 
Consider a following table structure:
Stud_no
Stud_name
Stud_addr
teacher_code
Sub_code
Sub_name
Sub_teacher


In the above table definition, the attribute groups Sub_code, Sub_name represents repeatable group as one student can take many subjects, so that information is repeated for the same student.

We can convert above schema into 1NF by separating repeating columns (multi-valued attributes) into separate subject table.

Student                         Student_subject
Stud_no                        Stud_no (CPK)
Stud_name                    Sub_code (CPK)
Stud_addr                     Sub_name
                        teacher_code
                                    Sub_teacher

In above schema the attribute ‘Sub_name’ is not fully functionally dependant on composite key (Stud_no+Sub_code) of Student_subject table.

So we can achieve 2NF by separating the attributes not fully functionally dependant on primary key into separate tables e.g.

Student                         Student_subject                         Subject

Stud_no (PK)                 Stud_no (FK)                             Sub_code (PK)
Stud_name                    Sub_code (FK)                          Sub_name
Stud_addr                     teacher_code
                                    Sub_teacher

In the above schema, the attribute Sub_teacher, which represents teacher for that subject, can be derived from that table itself looking at the two columns Sub_code and teacher_code.

So we can achieve 3NF by removing such attribute, which is independent of non-key attributes. For ex:

Student                         Student_subject                         Subject

Stud_no                        Stud_no                                    Sub_code
Stud_name                    Sub_code                                  Sub_name
Stud_addr                     teacher_code

In the above schema, the attribute teacher_code represents the teacher for particular subject. This attribute is not the candidate key for the primary key of the table. We need to have one separate table called sub_teacher.

The above normalization is called BCNF (Boyce/Codd Normal form) e.g.

Student                         Student_subject             Subject             Sub_teacher

Stud_no                        Stud_no                        Sub_code          Sub_code         
Stud_name                    Sub_code                      Sub_name         teacher_code    
Stud_addr        

BCNF is improved form of 3NF.

Consider the schema where in students understanding in each subject is mentioned as one of the best, average, poor. The table looks somewhat like this:

Student_subject

Stud_no
Sub_code
Sub_interest

In addition to above normalization levels, we also have other levels viz. 4NF, 5NF etc. which are complex are not applied in normal situations.



DKNF (Domain Key Normal Form)


Exercise:
Following is the un-normalized database; normalize it up to BCNF through 1NF, 2NF, 3NF.

Following is structure of order table:

Order

Order_No.
Order_date
Customer_no
Item_no
Item_name
Qty_ordered
Rate_per_unit
Item_value

(Item_value can be calculated using  Qty_ordered * Rate_per_unit)

12 CODD Rules

Rule 0: For any system to be called as RDBMS, it must be able to manage database entirely through its relational capabilities.

Rule 1: All information in a RDBMS is represented explicitly in exactly one way, by value in the table.
According to this rule anything that does not exist in a table in the database does not exist at all.

Rule 2: Each and every datum is logically accessible through a combination of table name, column name, and primary key value.

Rule 3: Inapplicable or missing information can be represented through null values.

Rule 4: The description of the database is held in the same way as ordinary data that is in tables and columns and is accessible to authorized users.

Rule 5: There must be at least one language which is comprehensive in supporting data definition, view definition, data manipulation, integrity constraints, authorization, and transaction control.

Rule 6: All views that are theoretically updatable are updatable by the system.

Rule 7: The capability of handling a base or a derived table as a single operand applies not only to the retrieval of data but also to the insertion, update and deletion of data.

Rule 8: Application programs and terminal activity remain logically unimpaired whenever any changes are made in the storage representation or access method.

Rule 9: Application programs and terminal activity remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base tables.

Rule 10: All integrity constraints must be definable in the data sub language and storable in the catalogue, not in the application program.

Rule 11: The system must have a data sub-language which can support distributed databases without impairing application programs or terminal activities.

Rule 12: If the system has a low-level language this language cannot be used to bypass the integrity rules and constraints expressed in the higher level relational language.

1 comment: