Tutorials

Data Joins Unions

Views: 62037
Rating: 5/5
Votes: 11

Introduction

In this tutorial we look at using joins and unions in our database queries. Before we get to those, however, we'll first put our data into a format where we can take full advantage of the power of SQL joins. This process of removing redundant duplication and repetition of data is called "normalization". I am not covering it in any detail here but I will give you a rough-and-ready method of handling data relationships. (Google "data normalization" so you know exactly what you are doing. In practice you'll use first, second and third normal forms. Fourth and beyond are reserved for theoretical classroom exercises.)

This tutorial is primarily written for MySQL. Some adjustment to the queries may be required for other dialects of SQL.

The Scenario

Throughout this tutorial we will use a fictitious school database. The school's headmaster, Mr Beetsem-Daley, has commissioned us to build an intranet for pupil administration. He requires us to record which pupils are in each house and the subjects chosen for study by each pupil. His data is currently held in spreadsheet and looks like this:

+------------------+------------+------------------+----------------------------------------+
| pupil_name       | house_name | house_master     | chosen subjects                        |
+------------------+------------+------------------+----------------------------------------+
| Adam Simms       | Laker      | Frank Morrisson  | Economics, English, German             |
| Allan Blair      | Grace      | Charles Hadleigh | Economics, Geography, German, Physics  |
| Anna Hamilton    | Grace      | Charles Hadleigh | English, History                       |
| Anne Bailey      | Jardine    | Pauline Fforbes  | English, German, History, Physics      |
| Anthony Bell     | Cowdrey    | Robert Bingley   | Economics, Geography, History, Physics |
| Caroline Freeman | Grace      | Charles Hadleigh | Biology, Chemistry, Geography, History |
| David Powell     | Laker      | Frank Morrisson  | Economics, Maths, Physics              |
| Emma Watson      | Cowdrey    | Robert Bingley   | English, Geography                     |
| Gearge Wilson    | Laker      | Frank Morrisson  | Chemistry, Economics, English, History |
| Henry Irving     | Cowdrey    | Robert Bingley   | Biology, Chemistry, Maths              |
| Jane Morrison    | Laker      | Frank Morrisson  | Economics, English, Physics            |
| John Patterson   | Jardine    | Pauline Fforbes  | Geography, German                      |
| John Tully       | Jardine    | Pauline Fforbes  | Biology, English, History,Maths        |
| John Watson      | Grace      | Charles Hadleigh | Chemistry, Economics, English, German  |
| John Williams    | Grace      | Charles Hadleigh | English, German, History               |
| Margaret Norton  | Cowdrey    | Robert Bingley   | German, Physics                        |
| Mary Blake       | Cowdrey    | Robert Bingley   | German, History, Physics               |
| Mary Sheldon     | Jardine    | Pauline Fforbes  | Chemistry, Economics, English, History |
| Mary Whitehouse  | Grace      | Charles Hadleigh | English, Geography, History, Maths     |
| Michael Grove    | Jardine    | Pauline Fforbes  | Economics, English, Physics            |
| Peter Adamson    | Laker      | Frank Morrisson  | Chemistry, Economics, German, Physics  |
| Peter Appleby    | Jardine    | Pauline Fforbes  | Biology, Economics, German, Maths      |
| Wayne Jones      | Laker      | Frank Morrisson  | Biology, Economics, German, Maths      |
| William Smith    | Cowdrey    | Robert Bingley   | Biology, Economics, Maths, Physics     |
+------------------+------------+------------------+----------------------------------------+

Instantly we notice the repetition of the house and house-master names and the totally unsuitable comma-separated lists.

The Data

After weeks of intensive data analysis and data modelling we present our redesigned database to the headmaster. (OK, it took a few seconds but if you were doing this on a corporate scale it is going to take that timescale, identifying all the processes and data elements and how they relate to one another. Get this bit right and life is a lot easier down the line.)

We have immediately identified three main items of data (entities), namely pupil, house and subject and these will be our main tables. We give each entity a unique identifier (ID). From the spreadsheet you might have counted four entities but a pupil's house-master depends on the house to which the pupil belongs, and is therefore an item in the house table.

house               pupil            subject
============        ===========      ==========
houseID             pupilID          subjectID
house_name          pupil_name       subject
house_master

The Data Relationships

We look at each pair of entities and consider how they are related.

House - subject

We will assume there is no relationship between house and subject. Our examination of the processes showed no requirement. If Beetsem-Daley had wanted, say, a league table of house results for each subject then there would be. (As a professional, it's my duty to make him aware of the potential though. No harm in generating extra work).

House - pupil

A house has many pupils. A pupil belongs to one house. Here we have a simple one-to-many relationship between the two. We handle this by adding the houseID to the pupil record so we know to which house each pupil belongs. (Computers handle numbers much more efficiently than text so we will will use these numeric ids to link our data.)

Pupil - subject

A pupil can choose many subjects. A subject can be chosen by many pupils. Here we have a many-to-many relationship. We handle these by creating another entity (in this case we'll call it choice) to link pupil and subject. Each choice will contain the pupilID and the subjectID of a chosen subject. Actually, a subject can be chosen by none, one or many pupils. This doesn't affect our table design but it will affect how we query the data (more of which later).

The Database Schema

Our final design, showing the entity relationships, looks like this

house               pupil                 choice                 subject
============        ===========           ==========             ==========
houseID      --+    pupilID     ---+      id             +----   subjectID
house_name     |    pupil_name     +---<  pupilID        |       subject
house_master   +--< houseID               subjectID   >--+

Notice that our many-to-many relationship between pupil-subject has now become two simple one-to-many relationships between pupil-choice and subject-choice. In our original spreadsheet we could see which subjects were chosen by a pupil. Now, by looking at the same relationship in the opposite direction, we can easily find which pupils chose a particular subject.

Had we been required to store who teaches each subject then there would also be a teacher entity so that the teacher's name (and other teacher details) would be stored in only one place. In this case the house-master item would have been a teacherID linked to the teacher table and each subject would have a "taught_by" item, also linked to the teacher table.

Now we can start querying our database using joins.

Joins

We are going to look at the three types of join that you will need to use at some time or another

  • INNER JOIN
  • LEFT JOIN
  • cartesian join

You can just use "JOIN" instead of "INNER JOIN" as, by default, the INNER is implicit. Similarly, "LEFT JOIN" is actually "LEFT OUTER JOIN", but again, the "OUTER" is implicit.

At the end of this tutorial we will look at an exercise involving all three.

INNER JOIN

This is the most commonly used of the three and is use to get data from matching rows in different tables.
Let's look at an example from our database.

Problem: List pupils in Jardine house showing house name and pupil name.

Input

Our table data. Each pupil has their houseID as part of the data and each house has a houseID. As indicated in our DB schema link, this will be the basis of our "join condition". We want rows from the pupil table where its houseID matches the id from the house table.

house                                                    pupil (extracted sample)            
+---------+------------+                                 +---------+-----------------+
| houseID | house_name |                                 | houseID | pupil_name      |
+---------+------------+                                 +---------+-----------------+
|       1 | Grace      |                                 |    ...  |  ...            |
|       2 | Laker      |                                 |       4 | Mary Sheldon    |
|       3 | Cowdrey    |                                 |       3 | Mary Blake      |
|       4 | Jardine    |                                 |       4 | Michael Grove   |
+---------+------------+                                 |       1 | Mary Whitehouse |
                                                         |       3 | Margaret Norton |
                                                         |    ...  |  ...            |
                                                         +---------+-----------------+

Query

SELECT h.house_name, p.pupil_name                       /* define the table.columns you want to retrieve */
FROM house h                                            /* define the tables to be joined and on which   */
     INNER JOIN pupil p ON h.houseID = p.houseID        /*    columns they should be matched             */
WHERE h.house_name = 'Jardine'                          /* define the search criteria                    */
ORDER BY p.pupil_name                                   /* define the order of the sorted results        */

A note on style and syntax. I could have wrtten this query as

select house.house_name, pupil.pupil_name from house, pupil where house.house_name = 'jardine' and house.houseid = pupil.houseid order by pupil.pupil_name
  1. A common convention is upper case keywords and lower case table and column names
  2. Split the query over several lines. Not only is it more readable but when mysql_error() reports an error it tells you on which line.
  3. Don't use "... FROM A,B WHERE ...". Keep the join condition separate from the selection criteria with "...INNER JOIN ... ON"
  4. Use aliases for tables. Again for legibility but also the table name only occurs once. If you mis-spell or want to change it, it's easier than if it is repeated with every column. (Especially if your tablename is something like "residual_hypoglycaemic_oxidization_factors")

Results

There are six pupils with houseID=4 so six rows are returned, each with a pupil name and the matching house name

+------------+----------------+
| house_name | pupil_name     |
+------------+----------------+
| Jardine    | Anne Bailey    |
| Jardine    | John Patterson |
| Jardine    | John Tully     |
| Jardine    | Mary Sheldon   |
| Jardine    | Michael Grove  |
| Jardine    | Peter Appleby  |
+------------+----------------+

Problem : Which pupils chose Maths?

From our schema, to link from the subject table to the pupil table we need to link via the choice table. So we join from pupil to choice using the pupilID column and from subject to choice using the subjectID column. This time we'll use a slightly different syntax. Because the column names we are joining on are the same in both tables, we can can use the USING keyword

SELECT p.pupil_name
FROM pupil p
  INNER JOIN choice c USING (pupilID)
  INNER JOIN subject s USING (subjectID)
WHERE s.subject = 'Maths'

Results:

+-----------------+
| pupil_name      |
+-----------------+
| William Smith   |
| Henry Irving    |
| David Powell    |
| Wayne Jones     |
| John Tully      |
| Mary Whitehouse |
| Peter Appleby   |
+-----------------+

Problem : How many pupils chose each subject?

Our subject table contains the following

+-----------+-----------+
| subjectID | subject   |
+-----------+-----------+
|         1 | English   |
|         2 | German    |
|         3 | Geography |
|         4 | Maths     |
|         5 | Economics |
|         6 | Biology   |
|         7 | Chemistry |
|         8 | Physics   |
|         9 | History   |
|        10 | Computing |
+-----------+-----------+

We need to count the number of records in choice that match each subject

Query:

SELECT s.subjectID, s.subject, COUNT(c.pupilID) as total
FROM subject s 
INNER JOIN choice c USING (subjectID)
GROUP BY s.subjectID                    /* GROUP BY subjectID gives us a total for each subject */

Results:

+-----------+-----------+-------+
| subjectID | subject   | total |
+-----------+-----------+-------+
|         1 | English   |    12 |
|         2 | German    |    11 |
|         3 | Geography |     6 |
|         4 | Maths     |     7 |
|         5 | Economics |    13 |
|         6 | Biology   |     6 |
|         7 | Chemistry |     6 |
|         8 | Physics   |    10 |
|         9 | History   |    10 |
+-----------+-----------+-------+

There are ten subjects but this query gives only nine totals. Earlier I said subjects could be chosen by none, one or many pupils and the way we query the table would be affected. The INNER JOIN only totals matching records, so what if there is no match? How do we see that "computing" was chosen by no pupils?

Enter the LEFT JOIN.

LEFT JOIN

We use a left join when we want to return all selected rows from the left table regardless of whether there is a matching value in the right table. (LEFT merely denotes its postion in the query statement. If you have "A LEFT JOIN B" then "A", to the left of the "JOIN", is the left table)

Using the same query as the previous problem, except with a LEFT JOIN instead of an INNER JOIN

SELECT s.subjectID, s.subject, COUNT(c.pupilID) as total
FROM subject s 
LEFT JOIN choice c USING (subjectID)
GROUP BY s.subjectID

Results

+-----------+-----------+-------+
| subjectID | subject   | total |
+-----------+-----------+-------+
|         1 | English   |    12 |
|         2 | German    |    11 |
|         3 | Geography |     6 |
|         4 | Maths     |     7 |
|         5 | Economics |    13 |
|         6 | Biology   |     6 |
|         7 | Chemistry |     6 |
|         8 | Physics   |    10 |
|         9 | History   |    10 |
|        10 | Computing |     0 |    ------- now we get all subjects
+-----------+-----------+-------+

Where there is no matching row in the right table, a NULL value is returned in any columns selected from that table

SELECT s.subject, p.pupil_name
FROM subject s
	LEFT JOIN choice c USING (subjectID)
	LEFT JOIN pupil p USING (pupilID)
WHERE s.subject IN ('Maths','Computing')


Results:

+-----------+-----------------+
| subject   | pupil_name      |
+-----------+-----------------+
| Maths     | William Smith   |
| Maths     | Henry Irving    |
| Maths     | David Powell    |
| Maths     | Wayne Jones     |
| Maths     | John Tully      |
| Maths     | Mary Whitehouse |
| Maths     | Peter Appleby   |
| Computing | NULL            |
+-----------+-----------------+

We can use the NULL value when we need to know if there are records in one table with NO matching record in another table. So to find subjects not chosen by any pupil

SELECT s.subject
FROM subject s
LEFT JOIN choice c USING (subjectID)
WHERE c.subjectID IS NULL

Results:

+-----------+
| subject   |
+-----------+
| Computing |
+-----------+

WHERE and LEFT JOINS

One last word on LEFT JOINs, and something I still forget myself sometimes.

Suppose we want a list of all pupils and we want to indicate those taking geography (subject id 3). The obvious query is

SELECT p.pupil_name, c.subjectID
FROM pupil p
  LEFT JOIN choice c ON p.pupilID = c.pupilID
WHERE c.subjectID = 3

Unfortunately, this gives just a list of those taking geography and not all pupils as expected.

+------------------+-----------+
| pupil_name       | subjectID |
+------------------+-----------+
| John Patterson   |         3 |
| Allan Blair      |         3 |
| Caroline Freeman |         3 |
| Emma Watson      |         3 |
| Anthony Bell     |         3 |
| Mary Whitehouse  |         3 |
+------------------+-----------+

Since not all rows in the right table of a LEFT JOIN may not be present then it doesn't work if you have a WHERE condition on non-existent records. You need to make right-table conditions part of the join condition.

SELECT p.pupil_name, c.subjectID
FROM pupil p
  LEFT JOIN choice c ON p.pupilID = c.pupilID AND c.subjectID = 3

Now we get the expected results

+------------------+-----------+
| pupil_name       | subjectID |
+------------------+-----------+
| John Watson      |      NULL |
| Peter Adamson    |      NULL |
| William Smith    |      NULL |
| Mary Sheldon     |      NULL |
| Anna Hamilton    |      NULL |
| Gearge Wilson    |      NULL |
| Henry Irving     |      NULL |
| John Patterson   |         3 |
| Allan Blair      |         3 |
| Adam Simms       |      NULL |
| Mary Blake       |      NULL |
| Anne Bailey      |      NULL |
| Caroline Freeman |         3 |
| David Powell     |      NULL |
| Emma Watson      |         3 |
| Michael Grove    |      NULL |
| John Williams    |      NULL |
| Wayne Jones      |      NULL |
| Anthony Bell     |         3 |
| John Tully       |      NULL |
| Mary Whitehouse  |         3 |
| Jane Morrison    |      NULL |
| Margaret Norton  |      NULL |
| Peter Appleby    |      NULL |
+------------------+-----------+

Which just leaves the cartesian join

Cartesian join

This is not a join you will you use very often (although you may use one accidentally, especially when using the FROM A,B syntax for your inner joins and forgetting to put the join condition in the WHERE clause) but it does have its practical uses, as we shall see in the sample application at the end of this tutorial. (Another practical use is generating data)

A cartesian join is produced by omitting the join condition a join. In this case, every row in the first table is joined with every row in the second. For example, we'll just count the number of rows returned rather than list them (for obvious reasons)

SELECT COUNT(*) as total FROM pupil, choice, subject

Result

+-------+
| total |
+-------+
| 19440 |     24 pupils * 81 choices * 10 subjects
+-------+

Now we will look at UNIONS

UNION

We saw that joins had the effect of appending columns, selected from the tables, to each row, thus expanding the query horizontally. With unions the columns stay the same and the query results are expanded vertically by adding more rows.

To create a union we take a SELECT query, add "UNION" then add another SELECT query. E.g.

SELECT a, b FROM table1
UNION
SELECT x, y FROM table2

As stated earlier, the columns stay the same so we cannot select a, b, c in one part and then select only x, y in another. However, we can do this to maintain the number of columns.

SELECT a, b, c FROM table1
UNION
SELECT x, y, '' FROM table2

UNION ALL

By default, a union returns DISTINCT rows. This means that in our first example, if a, b contains "John", "Doe" and x, y also contains "John", "Doe" then it would appear only once in the results. If we want it to appear twice we use UNION ALL.

ORDER BY

The results will be sorted when all the results are in. The ORDER BY clause goes at the end of the whole query and applies to the whole result set.

Back to our school

"House" turns out to be a rather grand name for the ex-army barracks that contain the pupils' dormitories and the house-master's sleeping quarters. From a Health and Safety perspective, these dilapidated wooden structures are an accident waiting to happen. Good thing none of the pupils smoke cigarettes in their dorms. They have "No Smoking" signs to prevent that.

Anyway, I tactfully suggested that, as we had the data available, we could produce a register of each house as check lists to make sure all inhabitants were safely evacuated in the event of fire. These lists would need to include house-master and pupils for each house.

SELECT h.house_name, 'Housemaster' as status, h.house_master as name 
FROM house h
UNION ALL
SELECT h.house_name, 'Pupil' as status, p.pupil_name as name
FROM pupil p
  INNER JOIN house h USING (houseID)
ORDER BY house_name, status

Results

+------------+-------------+------------------+
| house_name | status      | name             |
+------------+-------------+------------------+
| Cowdrey    | Housemaster | Robert Bingley   |
| Cowdrey    | Pupil       | Henry Irving     |
| Cowdrey    | Pupil       | Mary Blake       |
| Cowdrey    | Pupil       | Emma Watson      |
| Cowdrey    | Pupil       | Anthony Bell     |
| Cowdrey    | Pupil       | William Smith    |
| Cowdrey    | Pupil       | Margaret Norton  |
| Grace      | Housemaster | Charles Hadleigh |
| Grace      | Pupil       | Mary Whitehouse  |
| Grace      | Pupil       | Anna Hamilton    |
| Grace      | Pupil       | Allan Blair      |
| Grace      | Pupil       | Caroline Freeman |
| Grace      | Pupil       | John Williams    |
| Grace      | Pupil       | John Watson      |
| Jardine    | Housemaster | Pauline Fforbes  |
| Jardine    | Pupil       | John Tully       |
| Jardine    | Pupil       | Mary Sheldon     |
| Jardine    | Pupil       | Peter Appleby    |
| Jardine    | Pupil       | John Patterson   |
| Jardine    | Pupil       | Anne Bailey      |
| Jardine    | Pupil       | Michael Grove    |
| Laker      | Housemaster | Frank Morrisson  |
| Laker      | Pupil       | David Powell     |
| Laker      | Pupil       | Wayne Jones      |
| Laker      | Pupil       | Peter Adamson    |
| Laker      | Pupil       | Jane Morrison    |
| Laker      | Pupil       | Gearge Wilson    |
| Laker      | Pupil       | Adam Simms       |
+------------+-------------+------------------+

Sample Application

Beetsem-Daley wants an input screen, similar to his original spreadsheet, where he can view and edit the subjects chosen by each pupil. Pupils will be listed in alphabetical order within their houses. Checkboxes will be used to indicate pupils' choices.

+------------+----------------+--------------------------------------------------------------+
|   House    |      Pupil     |       Biol Chem Comp Econ Engl Geog Germ Hist Math Phys      |
|  Jardine   |                |                                                              |
|            | Anne Bailey    |       [X]  [ ]  [ ]  [ ]  [ ]  [ ]  [X]  [X]  [ ]  [X]       |
|            | John Patterson |       [ ]  [ ]  [ ]  [ ]  [ ]  [X]  [X]  [ ]  [ ]  [ ]       |
|            | John Tully     |       [X]  [ ]  [ ]  [ ]  [X]  [ ]  [ ]  [X]  [X]  [ ]       |
|            | Mary Sheldon   |       [ ]  [X]  [ ]  [X]  [X]  [ ]  [ ]  [X]  [ ]  [ ]       |
|            | Michael Grove  |       [ ]  [ ]  [ ]  [X]  [X]  [ ]  [ ]  [ ]  [ ]  [X]       |
|            | Peter Appleby  |       [X]  [ ]  [ ]  [X]  [ ]  [ ]  [X]  [ ]  [X]  [ ]       |
+------------+----------------+--------------------------------------------------------------+

The sample code

<?php
$cnx = mysql_connect('localhost');     //  substute
mysql_select_db('jointute', $cnx);           //  your connection code

/**
* get the subjects and store in array for headings
*/
$subjects = array();
$res = mysql_query ("SELECT SUBSTRING(subject,1,4) FROM subject ORDER BY subject");
$subjectCount = mysql_num_rows($res);
while ($row = mysql_fetch_row($res)) $subjects[] = $row[0];
mysql_free_result($res);

/**
* get the pupil subject choices
* NOTE the cartesian join pupil-subject to get a row for each subject for each pupil
*   then a LEFT JOIN to choice to see if each pupil/subject combination was chosen
*/

$sql = "SELECT h.house_name, p.pupil_name, p.pupilID, s.subjectID, c.subjectID as chosen
		FROM pupil p
			INNER JOIN subject s
			INNER JOIN house h USING (houseID)
			LEFT JOIN choice c USING (pupilID, subjectID)
		ORDER BY h.house_name, p.pupil_name, s.subject";
$choice_res = mysql_query($sql);
?>
<html>
<head>
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
<meta name="generator" content="PhpED Version 4.5 (Build 4513)">
<title>Sample application</title>
<meta name="author" content="Barry Andrew">
<meta name="creation-date" content="06/29/2008">
<style type='text/css'>
       body, p, td {
       		font-family: verdana, sans-serif;
       		font-size : 10pt;
       		padding: 5px;
       }
       td {
       	    background-color: #FFF;
       }
       td.house {
       		background-color: #999;
       		color: white;
       		font-size : 12pt;
       		font-weight : 600;
       		border: 1pt solid gray;
       }
       td.cb {
       		text-align: center;
       }
       th {
       		background-color: #EEE;
       		color: black;
       		font-size : 10pt;
       		font-weight : 600;
       		padding: 5px;       
       }
       table {
       		background-color: #666;
       }
</style>
</head>
<body>
<form method='POST' action='updateChoices.php'>
<!--  updateChoices.php is outside the scope of the tutorial and is left as an exercise for the reader -->

<table width='90%' cellspacing='1'>

<?php 
	$prevHouse = '';
	$prevPupil = '';
	while (list($house, $pupil, $pid, $sid, $chosen) = mysql_fetch_row($choice_res))
	{
	    if ($prevHouse != $house)
	    {
	    	$span = $subjectCount+1;
	    	if ($prevPupil != '')
	    	{
	    		echo '</tr>'."\n";           // end previous row
	    	}
	    	echo "<tr><td class='house' colspan='$span'>House : $house</td></tr>\n";
			echo '<tr><th rowspan="2">Pupil</th><th colspan="'.$subjectCount.'">Subject</th></tr>'."\n";
			echo '<tr><th width="6%">' . join ('</th><th width="6%">', $subjects ) . '</th></tr>'."\n";
	    	$prevHouse = $house;
	    	$prevPupil = '';
	    }
	    if ($prevPupil != $pupil)
	    {
	    	if ($prevPupil != '')
	    	{
	    		echo '</tr>'."\n";           // end previous row
	    	}
	    	echo "<tr><td>$pupil</td>\n";
	    	$prevPupil = $pupil;
	    }
	    $checked = $chosen ? 'checked=checked' : '';
	    echo "<td class='cb'><input type='checkbox' name='choice[$pid][$sid]' value='1' $checked></td>\n";
	    
	}
	echo '</tr>'."\n";
	echo "<tr><th  colspan='$span'><input type='submit' name='btnSubmit' value='Save'></th></tr>\n";
?>
</table>
</form>
</body>
</html>

Hopefully I'll be back with a look at some more advanced queries