Relational Division in SQL The Easy Way

I recently studied SQL as part of an introductory course on databases. SQL itself is not particularly difficult to grasp, yet compared to relational algebra, the division operation is much more complex. In relational algebra, there is a division operator, which has no direct equivalent in SQL. This means that you’ll have to find a workaround. There are a number of ways to express division in SQL, and with the exception of one, they are all quite complex.

Division identifies attribute values from a relation that are paired with all of the values from another relation. Popular textbook examples are the identification of suppliers who deliver all parts of a particular color. An intuitive solution would be to count the number of distinct red parts, and then look at every distributor to find out which of those deliver all those parts. To express this in SQL, you have to use the set theoretic operators “having” and “group by”, and then you simply count the tuples meeting certain criteria.

Let’s say you have table T1 in front of you and want to find out which A’s have both b2 and b3. You can assume that b2 and b3 are the red parts. If you’ve only been exposed to standard textbook treatments of division in SQL, you may be surprised that the problem can be solved as simply as this:

SELECT A
FROM T1
WHERE B IN ( 
             SELECT B
             FROM T2 
           )
GROUP BY A
HAVING COUNT(*) = ( 
                    SELECT COUNT (*)
                    FROM T2
                  );

Of course you can add a Where clause to the last expression.

The previous example is quite easy to grasp. The same can’t be said about how SQL division is commonly taught. Standard database theory textbooks expose you to a statement that is doubly nested and peppered with two negations. No matter how smart you are, it takes longer to parse than the previous example. I think you can safely call the following a monstrosity:

SELECT DISTINCT x.A
FROM T1 AS x
WHERE NOT EXISTS (
                  SELECT *
                  FROM	T2 AS y
                  WHERE NOT EXISTS (
                                     SELECT *
                                     FROM T1 AS z
                                     WHERE (z.A=x.A) AND (z.B=y.B)
                                   )
                 );

Which one would you prefer?

The examples above are taken from the paper “A Simpler (and Better) SQL Approach to Relational Division” by Matos and Grasser, published in Journal of Information Systems Education, Vol. 13(2). I was quite happy to have come across that paper. Unfortunately, theirs is not a very well-known approach to SQL division. This is unfortunate, since it’s not only easier to grasp, but, as Matos and Grasser write, it also exhibits better computational performance.

15 thoughts on “Relational Division in SQL The Easy Way

  1. Nik

    No doubt. Your article really begs the question – do textbook authors teach the double negation way because they believe it superior or because they themselves don’t know any better.

    -Cheers

    Reply
    1. Gregor Ulm Post author

      It would be easy to say that they just don’t care. However, the reality in the textbook industry is that the person whose name is on the cover of the textbook is not necessarily the person who wrote it. In that case, the real authors arguably just didn’t know better. Further, at university there is the tendency to obfuscate the material to make it more difficult to grasp. This is particularly true if the material itself would not be overly challenging. The more trivial the field, the more complex it will be presented. Just open a textbook on sociology or human resources, if you need an example. Database theory has a reputation of being one of the easier subjects in the CS curriculum, so it could well be that teaching SQL division with double nesting and double negation is an attempt to make the subject appear to be more ‘esteemed’.

      Reply
    1. Gregor Ulm Post author

      I’m not sure what you want to express, and it would arguably helped if you cared to elaborate. Both queries result in the same output. In other words, they are equal. To be more precise: both queries exhibit the property of being extensionally equal. Since you seem to say that the second, convoluted, query performs relational division, while the first one does not, we therefore arrive at a contradiction, because, as a matter of fact, the first, much simpler expression, also performs relational division. Allow me to say that it would be incredibly foolish to prefer a more complex method (and slower one!) over a simpler one. Then again, foolishness has been in fashion in professional software development for decades, so you can enjoy the warm and fuzzy feeling of belonging to the majority.

      Reply
      1. Anonymous

        He might have been expecting exact division (aka without remainder), which your query doesn’t fulfill – nor does the textbook script, leaving him a fool nonetheless.
        Still, that’s no excuse to dish out longwinded passive-aggressive paragraphs from your high horse, you could have pointed out the kind of division you were aiming for instead.

        Reply
        1. Gregor Ulm Post author

          It seems you are projecting your own passive-aggressive behavior onto me. I can’t read minds, and neither can you, so you arguably shouldn’t assume what that person really wanted to express. Anyway, the article is quite specific about the operation it describes. I’m not sure how a remainder comes into play with SQL division or what this is even supposed to mean, considering that SQL division is about determining the set of entities A that interacts some set of entities B in a particular way. Feel free to elaborate.

          Reply
  2. JLG

    CREATE TABLE T1 (A integer, B integer);
    CREATE TABLE T2 (B integer);
    INSERT INTO T1 VALUES (1,1),(1,1);
    INSERT INTO T2 VALUES (1),(2);
    Simple division -> 1
    Real division -> nothing

    The solution you propose makes the (very strong) assumption that the couple (A,B) is unique in T1 (probably a PK). Otherwise you will count duplicates… In the paper that you cite (section 2.1), ” T1 represents a list of customers and the options they bought for their new cars”. Given their database design, a customer cannot buy two cars with the same option!

    Reply
  3. daniel

    The problem I have with the first querty is that it fails if u have “repeated elements”:

    ej:
    T1 T2
    A B C B
    1 1 1 1
    1 1 2 2
    1 2 3
    2 1 4

    I’ll get no results with the first one, because the first count(*) will return 3
    but A=1 will pass in the second one.

    Reply
  4. anonymous

    I’d be surprised if the first “simple” query is doing relational division. For A(a,b)/B(b), if there is an ‘a’ having multiple identical ‘b’s in the table A, this query will simply fail because the having count(*) will be larger than count in T2. I’ve verified this on a dataset. I think a more proper description of SQLRA division is here: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server/

    Reply
  5. Teves

    The problem with the first statement is that it doesn’t consider the duplicates so you will need some pre-processing like Relational Algebra does. The performance issue in the second statements is due to a double nested-query. It could possibly be solved by using NOT EXISTS (… EXCEPT …) instead.
    For example, Table T1: [A, milk, 9/2],
    [A, eggs, 9/3],
    [A, milk, 9/3],
    [B, eggs, 9/2],
    [B, eggs, 9/3]
    Talbe T2: [milk],

    Reply
    1. Teves

      Talbe T2: [milk],
      [eggs]

      If you write the first statement without pre-processing then it will not be correct,
      As the count(*) in group A is 3, and the count(*) in group B is 2.
      However, you can solve this by: (Assuming the attributes in T1 are name, items, date)
      SELECT name
      FROM T1 as x
      WHERE NOT EXISTS (
      SELECT *
      FROM T2
      EXCEPT
      SELECT items
      FROM T1
      WHERE name=x.name)

      Reply
    2. Teves

      Forget to say, I am not doing T1 divided by T2 but actually finding who bought all items in T2, which means T1(name, items) divided by T2.

      Reply
  6. Adam

    Let’s take it easy on the guy,
    he’s completely right. All that is necessary is that is to preface the query with a select distinct.
    could look something like this (for the simple case of one item):

    with t1p as (select distinct * from t1),
    total_items as (select count(*) from (select distinct * from t2) a)
    select A
    from t1p
    where B in (select B from t2)
    group by A
    having count(*)=(select * from total_items);

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Spammer prevention; the answer is an integer: * Time limit is exhausted. Please reload CAPTCHA.

This site uses Akismet to reduce spam. Learn how your comment data is processed.