SQL - Part 3 - Analytical Functions (Ranking, Rows, Over, ...)

Analytical Functions

Analytical functions are brought into existence to solve the problems that we have with Aggregate functions. The biggest disadvantage of these Aggregate functions is that they always reduce the number of results, but what if we want to be able to perform functions on all our results based on the results? (think at ranking, row_number, top N problems, …)

Here I will explain some of the Analytical function tricks that we can use to solve common problems.

Some of the database engine specific analytical functions:

  • rank() : Returns a sequential number for the result set, this differs from row_number because this function gives duplicates the same number.
  • dense_rank() : Does the same as rank(), only will it put gaps between the next result.
  • row_number() : Returns a sequential number for the result set.
  • ntile(n) : Distributes the result into groups, this returns the number to which the group belongs.

The over() function

One of the things I have to explain first before we are able to show special queries is the usage of the over() function. The over() function will execute a reporting function over the complete table without the need of a GROUP BY statement. We can also specify a PARTITION BY to this over function, which is going to allow us to perform a GROUP BY on the OVER function.

More details to this can be found here: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm

MIN MAX problems (also called minimax)

Let's say that we have a table that holds the values of the language usages of a country. What if we want to get the maximum spoken languages from those countries, and then get the language that is spoken the least in that country?

This is a great problem to use a over function in. You just select the maximum usages, and then you can add for example a column that will put an X for the language that is spoken the least. To do this we use a case structure that checks if the max usage of a language is equal to the min usage over the whole result set of the max usages.

Result:

SELECT iso, max(usage),
  CASE
    WHEN MAX(usage) = MIN(MAX(usage)) OVER() THEN 'x' ELSE ' '
  END
FROM languageusage
WHERE hasc IN('BE', 'DE')
GROUP BY iso

Our result after running the query:

iso | usage | min(max(usages)) :-: | ----: | :-------------- kur| 0.0049998134397970222 | ned| 0.60320897726154155 | tur| 0.026367175354153453 | fra| 0.33231617285165865 | ita| 0.02480559110148637 | sla| 0.0148750668507394 | ara| 0.015848016537060736| dui| 0.93813663669266079 | poo| 0.0034948944691118493| X spa| 0.0049217442661679302| grn| 0.0045023195651903531|

Creating order based on criteria

We want to be able to assign numbers based on some criteria.

Example: Select races, and assign a number based on:

  • rownumber for each resort, sorted by racedate within resort
  • rownumber for resort and racedate, sorted by resort, racedate (for all)
  • rownumber for racedate, sorted by racedate (for all)
SELECT resort, racedate
, rank() over(partition by resort order by racedate) "rownr / resort"
, rank() over(order by resort, racedate) "rownr by resort and racedate"
, rank() over(order by racedate) "rownr by racedate"
FROM races
WHERE extract(year from racedate) = 2008
GROUP BY resort, racedate
ORDER BY resort, racedate

Clipping

A rarely known technique for databases is called Clipping, clipping allows us to remove an extreme band of results from a result set. (For example, when we measured different points and we want to remove the extreme points from this resultset, then we can use clipping).

This is where ntile(n) comes in handy, we use it to divide our result set in groups, and then we select the groups that lie within boundaries.

Example: Let us select measure points of longitude and latitude, and we will remove the most extreme points.

Then the query becomes:

SELECT lev2, latitude, longitude,
  CASE 
    WHEN ntile(10) OVER (PARTITION BY lev2 ORDER BY latitude) BETWEEN 2 AND 9 
    THEN latitude 
  END x,
  CASE 
    WHEN ntile(10) OVER (PARTITION BY lev2 ORDER BY longitude) BETWEEN 2 AND 9 
    THEN longitude 
  END y
FROM cities
WHERE iso = 'BE'
AND lev2 IS NOT NULL
AND latitude IS NOT NULL
AND longitude IS NOT NULL;

Simulating WIDTH_BUCKET

WIDTHBUCKET is an oracle specific command that lets you construct equiwidth histograms, in which the histogram range is divided into intervals that have identical size. (More here: https://docs.oracle.com/cd/B1930601/server.102/b14200/functions214.htm)

Since this is only in oracle we want to simulate this so we can use it in other database engines. To simulate this we can use the function CEIL together with the OVER functions.

So then this becomes:

SELECT ceil(row_number() OVER(ORDER BY ... ) / 5.0)

Cumulative Total and Gliding Totals

If we want to be able to select the cumulative total of a resultset, then we can use this query:

SUM(points) OVER(ORDER BY points, name) AS cumul

However if we want to calculate the gliding total, then we need to use this:

SUM(points) OVER(ORDER BY date RANGE BETWEEN 90 PRECEDING AND CURRENT ROW) AS mov_average

Combined Select Queries

With combined select queries we mean the joining of 2 different select queries with the use of special operators such as union, union all, intersect, except, minus, …

>One thing to note is that even though this is explained in this article, it is not recommended to be used. Combined Select Queries are a feature of the database engines but are mostly slower then CTE's (Common Table Expressions) that is why I recommend using those rather then the combined select queries. This is also the reason why this portion of the article is relatively slow compared to the rest.

Union and Union ALL

The union operator will merge tables row by row but it will remove the duplicate rows. When we use union all then we will do the same as union, but this time we will include the duplicate rows.

Example:

SELECT hasc2, length
FROM boundaries
WHERE hasc1 = 'DE'
UNION ALL
SELECT hasc1, length
FROM boundaries
WHERE hasc2 = 'DE'
ORDER BY length

> Special Note: If you are using a database engine that does not support the rollup and cube functions, then you can simulate those using the union operator.

Intersect

The intersect operator will only keep the rows that the different queries have in common.

Example:

SELECT name,
  CASE
    WHEN MAX(SUM(points)) OVER(PARTITION BY gender, season) = SUM(points) THEN 'x'
    ELSE ' '
  END "result"
FROM Ranking
WHERE discipline IS NOT NULL
GROUP BY gender, season, name
  INTERSECT
SELECT name, 'X'
FROM competitors
WHERE nation = 'SWE'

Except of minus (difference)

The except of minus operators will take the first query and they will remove every row that is the result from the other queries.

Example:

SELECT hasc, population, area
FROM regios
WHERE SUBSTR(hasc, 1, 2) = 'BE' AND level = 3
  EXCEPT
SELECT parent, SUM(population), SUM(area)
FROM regios
WHERE SUBSTR(hasc, 1, 2) = 'BE' AND level = 4
GROUP BY parent