SQL How To ...
Select the first record in a group
(When a group by or select distinct isn't enough)
The Challenge
Let's say we have a log table containing a record of page views for an intranet:
DateStamp | IpAddress | Page |
---|---|---|
20/06/2011 | 192.168.0.10 | home |
21/06/2011 | 192.168.0.11 | about |
21/06/2011 | 192.168.0.10 | home |
22/06/2011 | 192.168.0.12 | home |
22/06/2011 | 192.168.0.10 | home |
23/06/2011 | 192.168.0.11 | about |
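For readers who want to run the queries in this article, the sample table could be created along the following lines. This is only a sketch: the column types and sizes are assumptions (the article does not state them), the dates are written in ISO format rather than the dd/mm/yyyy display format shown above, and some database engines may need a different date literal syntax or one INSERT statement per row.

```sql
-- Assumed schema: the column types are illustrative, not taken from the article.
CREATE TABLE LogTable (
    DateStamp DATE        NOT NULL,
    IpAddress VARCHAR(15) NOT NULL,
    Page      VARCHAR(50) NOT NULL
);

-- The six sample page views from the table above.
INSERT INTO LogTable (DateStamp, IpAddress, Page) VALUES
    ('2011-06-20', '192.168.0.10', 'home'),
    ('2011-06-21', '192.168.0.11', 'about'),
    ('2011-06-21', '192.168.0.10', 'home'),
    ('2011-06-22', '192.168.0.12', 'home'),
    ('2011-06-22', '192.168.0.10', 'home'),
    ('2011-06-23', '192.168.0.11', 'about');
```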
Querying this table for the earliest date on which each page was viewed requires only a simple group-by query:
SELECT Page, min(DateStamp) as FirstViewed
FROM LogTable
GROUP BY Page
which will give:
Page | FirstViewed |
---|---|
about | 21/06/2011 |
home | 20/06/2011 |
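But if we wanted to include the IP address associated with each of those first page views, then a simple group-by will no longer suffice: IpAddress is neither grouped on nor aggregated, and adding it to the GROUP BY would return a row for every page and IP address combination. Our query becomes significantly more complex.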
SQL to return the first record in a group
There are at least two different approaches which can be taken to return the first record in a group - in our working example the date when each page was first viewed and the IP address associated with that event.
The first approach is to use an embedded select. The second approach is to use an exclusion join.
First approach: An embedded select (part 1)
The embedded select approach uses an inner (embedded) select to find the value of interest (in our case the first view date for each page), and an outer query joined to it to pull back the full details (here, the corresponding IP address). Thus:
SELECT detail.Page, detail.DateStamp as FirstViewed, detail.IpAddress
FROM LogTable detail
INNER JOIN (
    SELECT Page, min(DateStamp) as FirstViewed
    FROM LogTable
    GROUP BY Page) inside
ON inside.Page = detail.Page
AND inside.FirstViewed = detail.DateStamp
This returns:
Page | FirstViewed | IpAddress |
---|---|---|
home | 20/06/2011 | 192.168.0.10 |
about | 21/06/2011 | 192.168.0.11 |
There are some real-world considerations and these are covered below under the heading "Real World Considerations".
Second approach: An exclusion join (part 1)
The idea of an exclusion join is to do a query where we join the table with itself, joining on the field we wish to "group" on, adding a comparison and then only pulling back those records where we know we have the top value.
Thus:
SELECT A.Page, A.DateStamp as FirstViewed, A.IpAddress
FROM LogTable A
LEFT JOIN LogTable B
ON A.Page = B.Page
AND A.DateStamp > B.DateStamp
WHERE B.DateStamp is null
which returns:
Page | FirstViewed | IpAddress |
---|---|---|
home | 20/06/2011 | 192.168.0.10 |
about | 21/06/2011 | 192.168.0.11 |
So, how does this exclusion join work? An exclusion join can be considered as having three parts: (1.) a left join, (2.) a join condition and (3.) a where clause.
(1.) The left join:
LEFT JOIN LogTable B
Because this is a left join, wherever no record satisfies the join condition the fields B.* will be null.
(2.) The join condition:
ON A.Page = B.Page
AND A.DateStamp > B.DateStamp
Since we are interested in the first view date for each page, the A.Page = B.Page will come as no surprise. The A.DateStamp > B.DateStamp may seem unintuitive; after all, we are after the first DateStamp, not the last. However, because this is a left join, each record in A is matched with every record in B for the same page that has an earlier DateStamp. Where A.DateStamp is at its smallest there is no such record in B, so in the returned results B.DateStamp will be null exactly when A.DateStamp is at its smallest.
(3.) The where clause:
WHERE B.DateStamp is null
This filters out all but those combinations where B.DateStamp is null, which from our join condition we know is when A.DateStamp is at its smallest.
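To see the three parts working together, it can help to run the join without the where clause and look at what B contributes to each row. The following is a minimal sketch against the sample data; the column aliases ADateStamp and BDateStamp are names introduced here purely for illustration.

```sql
-- The exclusion join with the WHERE clause removed, so every A/B pairing is visible.
-- ADateStamp and BDateStamp are illustrative aliases, not part of the original query.
SELECT A.Page,
       A.DateStamp AS ADateStamp,
       A.IpAddress,
       B.DateStamp AS BDateStamp
FROM LogTable A
LEFT JOIN LogTable B
  ON A.Page = B.Page
 AND A.DateStamp > B.DateStamp
ORDER BY A.Page, A.DateStamp, B.DateStamp
```

With the sample data, only the rows for "home" on 20/06/2011 and "about" on 21/06/2011 should show a null BDateStamp; every other record in A is paired with at least one earlier record in B, and it is those pairings that the where clause discards.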
Needless to say, for performance it is worth checking that you have an appropriate index on your table(s). In this example an index on "Page, DateStamp" would seem ideal.
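As a sketch, such an index might be created as follows. The index name is hypothetical, and whether the index actually helps depends on your database engine and data, so it is worth checking the query plan before and after.

```sql
-- Hypothetical index name; the columns follow the suggestion above.
CREATE INDEX IX_LogTable_Page_DateStamp
    ON LogTable (Page, DateStamp);
```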
Real world considerations
The above examples both work fine when there is a single matching record in each group. They return a single result per page only because, in our example data, the earliest record for each page happens to be unique. Consider what happens (again using our example data) where we are after the last record in the group - in this case the last view date for each page together with the IP address:
First approach: An embedded select (part 2)
SELECT detail.Page, detail.DateStamp as LastViewed, detail.IpAddress
FROM LogTable detail
INNER JOIN (
    SELECT Page, max(DateStamp) as LastViewed
    FROM LogTable
    GROUP BY Page) inside
ON inside.Page = detail.Page
AND inside.LastViewed = detail.DateStamp
This is the same as before but with "max(DateStamp)" instead of "min(DateStamp)" in the inner join and the field alias changed from "FirstViewed" to "LastViewed". However, rather than the two results we might expect, we now get three:
Page | LastViewed | IpAddress |
---|---|---|
home | 22/06/2011 | 192.168.0.12 |
home | 22/06/2011 | 192.168.0.10 |
about | 23/06/2011 | 192.168.0.11 |
This returns three results because two different IP addresses share the same date stamp for the last view of the "home" page. If we were to add a time component to our date-stamp field then such clashes would become less frequent, but they could still occur.
The above is correct - it is returning the right data - but what if only a single record were required? Say it is for a report where only a single record per page is wanted and it is acceptable to show only one of the tied records. If we were to qualify our original requirement to add that in the event of a clash we only want to see the lowest IpAddress, how might we do this?
Continuing with the embedded select approach, this adds considerable complexity. It is left to the reader to consider how best to approach this.
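One possible way to do it, offered only as a sketch and not necessarily the best approach, is to group the outer query as well and take the lowest IpAddress among the tied records:

```sql
-- Sketch: the embedded select finds the last view date per page, then the
-- outer GROUP BY collapses any tied records to the lowest IpAddress.
SELECT detail.Page,
       detail.DateStamp AS LastViewed,
       min(detail.IpAddress) AS IpAddress
FROM LogTable detail
INNER JOIN (
    SELECT Page, max(DateStamp) AS LastViewed
    FROM LogTable
    GROUP BY Page) inside
  ON inside.Page = detail.Page
 AND inside.LastViewed = detail.DateStamp
GROUP BY detail.Page, detail.DateStamp
```

This works here because only one extra field is wanted. If several detail fields were needed, taking the minimum of each separately could combine values from different records, which is where the complexity starts to grow.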
Second approach: An exclusion join (part 2)
Taking the initial SQL for returning the first viewed information and modifying it to return instead the last viewed we get:
SELECT A.Page, A.DateStamp as LastViewed, A.IpAddress
FROM LogTable A
LEFT JOIN LogTable B
ON A.Page = B.Page
AND A.DateStamp < B.DateStamp
WHERE B.DateStamp is null
and, like the embedded select approach, this returns:
Page | LastViewed | IpAddress |
---|---|---|
home | 22/06/2011 | 192.168.0.12 |
home | 22/06/2011 | 192.168.0.10 |
about | 23/06/2011 | 192.168.0.11 |
Extending the query to meet our additional requirement of returning only the lowest IP address in the event of a clash is a simple case of extending the "on" clause of the join, so that a record in A is also excluded when B holds a record for the same page and date with a lower IpAddress:
SELECT A.Page, A.DateStamp as LastViewed, A.IpAddress
FROM LogTable A
LEFT JOIN LogTable B
ON A.Page = B.Page
AND (A.DateStamp < B.DateStamp
    OR (A.DateStamp = B.DateStamp and A.IpAddress > B.IpAddress))
WHERE B.DateStamp is null
This now returns:
Page | LastViewed | IpAddress |
---|---|---|
home | 22/06/2011 | 192.168.0.10 |
about | 23/06/2011 | 192.168.0.11 |
Performance wise, what would be the best index to use in this instance? Probably a single index created on: Page, DateStamp, IpAddress.
About the author: Brian Cryer is a dedicated software developer and webmaster. For his day job he develops websites and desktop applications as well as providing IT services. He moonlights as a technical author and consultant.