The SQL Subquery
Now that you've learned about SQL Joins, aggregate functions and the group by keyword, it's time we moved on to our final topic in our SQL tutorial series. Today you'll be learning all about SQL Subqueries, how to use them and when you should use a SQL subquery.
So, what is a subquery?
First and foremost, let's get the jargon out of the way. A subquery can also be referred to as a nested query. It's just like having a nested if
statement in your Java code. Essentially what you're doing with a subquery is you are constructing a regular old query (select
statement) which could be run all by itself if you wanted to, but instead of running it all by itself, you're jamming it into another query (select
statement) to give you more specific (filtered) results.
What's very important to note here is that the SQL subquery can almost always be re-written as a join
with a where
clause attached to it.
The advantage to using a subquery is that they tend to be easier to write and thus easier to understand if you're coming in as a new coder looking at an existing system. Of course, as is always universally true, when you have a piece of code that's easier to write and understand, it means that it's likely also less efficient. Unfortunately that statement holds true here as well. You can easily get carried away with SQL subqueries and start to write very inefficient scripts.
The rule I always live by is to write it so it works first, then refactor it to run faster and be more efficient. This is the test driven development mantra :)
What does a SQL Subquery look like?
I love to teach by example, so let's take a look at one.
Let's say we're working with a database table that deals with people and addresses. A Person
has a name
and an Address
has all the pertinent fields associated with it like so:
drop table if exists address; drop table if exists person; create table person ( person_id int(11) auto_increment primary key, first_name varchar(20), last_name varchar(20) ); create table address ( address_id int(11) auto_increment primary key, person_id int(11), address_line_1 varchar(100), address_line_2 varchar(50), city varchar(50), region varchar(50), zipcode varchar(10), country varchar(50), foreign key (person_id) references person (person_id) ); insert into person (first_name, last_name) values ('Trevor', 'Page'); insert into person (first_name, last_name) values ('Jane', 'Doe'); insert into person (first_name, last_name) values ('Jack', 'Johnson'); insert into person (first_name, last_name) values ('Jack', 'Black'); insert into person (first_name, last_name) values ('Dave', 'Matthews'); insert into person (first_name, last_name) values ('John', 'Doe'); insert into address (person_id, address_line_1, city, zipcode, region, country) values (1, '16 Dunwatson Dr', 'Toronto', 'Ontario', 'M1C 3L8' ,'Canada'); insert into address (person_id, address_line_1, city, zipcode, region, country) values (2, '1129 Bannock St', 'Denver', 'CO', '80204' ,'USA'); insert into address (person_id, address_line_1, city, region, country) values (3, 'Serna 307', 'Santa Ana', 'SON' ,'Mexico'); insert into address (person_id, address_line_1, city, zipcode, region, country) values (4, '462 Duke St', 'Glasgow City', 'G31 1QN', 'Glasgow','UK'); insert into address (person_id, address_line_1, address_line_2, city, zipcode, region, country) values (5, '1584 Fischer Hallman St', 'Kitchener', 'Unit 301', 'N2R 1C1', 'Ontario' ,'Canada'); insert into address (person_id, address_line_1, city, zipcode, region, country) values (6, '155 Northampton St', 'Rochester', 'New York', '14606' ,'USA');
Okay, so that script will create our database tables and populate it with some random information… Now onto the fun stuff, the subqueries!
From the information stored in our database, let's say we want to isolate the people that live in certain countries. Let's say that we only want to see people who live in Canada, what would that query look like?
Here's the big reveal!
select * From person where person_id in (select person_id from address where country = 'Canada');
Fairly unexciting stuff. What you should note here is the use of the in
keyword.
The in
keyword is one of the main keywords that we can use to create our subqueries.
How does the SQL Subquery work?
Let's breakdown the query above: we are searching for any Person
who lives in Canada. This is accomplished by writing two individual select
statements:
select * from person
select person_id from address where country = 'Canada'
We then “join” these two queries together using the in
keyword (thus utilizing the subquery structure).
Again, we want to know WHO lives in WHICH country. So that's why were have one query looking at the Person
table and one query looking at the Address
table. Make sense?
The next important thing to note about this subquery is the specific use of the person_id
key. You'll notice that we said where person_id in (select person_id from ...)
, when you're constructing subqueries and using the in
keyword, you need to make sure that you're matching IDs together (just like if you were doing a join statement).
Actually, speaking of joins, how about I show you what the equivalent query would look like as a join instead of a subquery?
select person.* from person join address on address.person_id = person.person_id where address.country = 'Canada';
As I mentioned at the beginning of this post, you're usually able to write out a typical subquery as a join too, but sometimes the ‘join' query can get a bit confusing. If you were to run the script directly above this paragraph, you'll see that the results are the same as the subquery script we were looking at earlier.
You should see this as a good thing, because if you can understand the ‘join' script above, then you should understand how the subquery works. The ‘join' statement invokes a join based on the person_id
from both the Person
and the Address
tables… which is exactly what our subquery does using the in
keyword… ie. select * from person where person_id in (select person_id from address)
Comparing SQL Subqueries to Joins
For the sake of clarity, how about I write the same query in both formats (as a join and as a subquery) and highlight which parts are functionally equivalent?
Sounds like a great idea, here you go:
where person_id in
(select person_id from address
where country = ‘Canada');
select person.* from person
join address on address.person_id = person.person_id
where address.country = ‘Canada';
So as you can see above, the black parts are where we choose what we actually want to see once all the filtering is done. The blue parts are used to join the two tables together, and the red parts are what we use to filter our result set.
Multiple Subqueries
Alright, so hopefully you get the basic idea of how subqueries work… now how to we use multiple subqueries in the same query?
Well, let's say we want to know all the people who live in North America (i.e. USA and Canada combined). How would we go about doing this? We could just use a subquery inside of a subquery like so:
-- people who live in North America select * From person where person_id in (select person_id from address where country in ('Canada', 'USA'));
Note the use of two different in
keywords. This means that we have two subqueries running in one script, they are just nested within each other. You can nest your subqueries many times over (I think up to 16 or something ridiculous like that)… but with great power comes great responsibility. The deeper you go with nested subqueries, the more strain you could be putting on the database system. Of course, this kind of strain is all relative, because if you have a very small dataset, then you can likely get away with inefficient scripts and not have to pay the price. Typically what happens is that you do your best to write great code/scripts, and as your system grows, the inefficient pieces will make themselves known and you'll fix them as they come up… not the greatest approach, but that's just how the real world works.
Other Types of Subqueries
So we've talked a lot about using the in
keyword, but there are others that can be used… for example, the not in
keyword can be helpful.
What if you want to find all the people who live outside of North America?
-- people who live outside of North America select * From person where person_id in (select person_id from address where country not in ('Canada', 'USA'));
Not the use of the not in
keyword in this query… this will find all address rows that are not associated with Canada or USA, then it will join with the Person
table via the person_id
and return their names.
Easy peasy!
There are a few other examples of subqueries, but I don't tend to use them very often (if at all), but if you're interested in learning more, there's a great article on subqueries here for your reading pleasure. This article focuses on MySQL (just like mine does) and will give you a brief overview on all the different ways to write subqueries in MySQL.