Medical Nerds Blog Logo
medicalnerds.com

technology, stats and IT for medics

 

Finding rows in one table not present in another table with PostgreSQL

March 6th, 2010 by Mark · 2 Comments

In implementing a comprehensive SNOMED CT browser as part of a research clinical information system, I cache the results of several processor and disk-intensive queries to speed up lookup and SNOMED CT navigation. I was finding some odd results and found that my caching was missing for several items. I needed a way of comparing the contents of one table with another in PostgreSQL and finding out which rows were not present.

To do this, PostgreSQL supports set operations. For instance, I have two tables:


rsdb=# \d t_concept
                                      Table "public.t_concept"
        Column        |          Type          |                      Modifiers
----------------------+------------------------+-----------------------------------------------------
 concept_id           | bigint                 | not null default nextval('t_concept_seq'::regclass)
 concept_status_code  | integer                | not null
 ctv_id               | character varying(10)  | not null
 fully_specified_name | character varying(255) | not null
 is_primitive         | integer                | not null
 snomed_id            | character varying(10)  | not null


rsdb=# \d t_cached_parent_concepts
Table "public.t_cached_parent_concepts"
      Column       |  Type  | Modifiers
-------------------+--------+-----------
 child_concept_id  | bigint | not null
 parent_concept_id | bigint | not null

I want to identify rows in t_concept without a corresponding row in t_cached_parent_concepts using SQL:

Try this!


select concept_id from t_concept except select child_concept_id from t_cached_parent_concepts;

Now this doesn’t explain why my sophisticated caching system is not so sophisticated…..


rsdb=# select count(*) from t_concept;
 count
--------
 388289
(1 row)

rsdb=# select count(*) from (select concept_id from t_concept except select child_concept_id from t_cached_parent_concepts) as tmp;
 count
--------
 142960
(1 row)

Hmm.. that’s a lot of concepts without a cache….

 

Update 8th March 2010


rsdb=# select count(*) from t_concept;
 count
--------
 2
(1 row)

Now that’s better. But why isn’t there only a single root node with no parent concepts? Further investigation identifies a UK-only SNOMED term that hasn’t been classified correctly. A bug report will be filed with Connecting for Health!

Tags: Databases

2 responses so far ↓

  • 1 Gavin // Mar 7, 2010 at 11:36 am

    Personally I would have done this with a left join (although I use mysql for largely historical reasons). Is there a big performance advantage to this.

    A (mysql) left join would go something like
    select concept_id, parent_concept_id from t_concept left join t_cached_parent_concept on child_concept_id=concept_id where parent_concept_id=null

  • 2 Mark // Mar 7, 2010 at 2:19 pm

    Hi Gavin,

    For readers who don’t understand that, you’re referring to an anti-join SQL pattern in which you try to keep rows that do not match.

    What I’d use in this instance is:

    
    select count(concept_id) from t_concept left outer join t_cached_parent_concepts on (concept_id = child_concept_id) where child_concept_id is null;
    
    

    Indeed, looking at how PostgreSQL plans these queries….

    
    rsdb=# explain analyze select count(*) from (select concept_id from t_concept except select child_concept_id from t_cached_parent_concepts) as tmp;   
                                                                                    QUERY PLAN                                                                                
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=1210387.72..1210387.73 rows=1 width=0) (actual time=38151.555..38151.555 rows=1 loops=1)
       ->  Subquery Scan tmp  (cost=1170701.92..1208885.34 rows=600954 width=0) (actual time=28311.878..38135.527 rows=104395 loops=1)
             ->  SetOp Except  (cost=1170701.92..1202875.80 rows=600954 width=8) (actual time=28311.877..38108.796 rows=104395 loops=1)
                   ->  Sort  (cost=1170701.92..1186788.86 rows=6434776 width=8) (actual time=28311.869..36857.682 rows=6732362 loops=1)
                         Sort Key: "*SELECT* 1".concept_id
                         Sort Method:  external merge  Disk: 144808kB
                         ->  Append  (cost=0.00..179082.52 rows=6434776 width=8) (actual time=0.011..5881.833 rows=6732362 loops=1)
                               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..20505.08 rows=600954 width=8) (actual time=0.011..363.759 rows=601118 loops=1)
                                     ->  Seq Scan on t_concept  (cost=0.00..14495.54 rows=600954 width=8) (actual time=0.010..218.017 rows=601118 loops=1)
                               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..158577.44 rows=5833822 width=8) (actual time=0.026..4278.903 rows=6131244 loops=1)
                                     ->  Seq Scan on t_cached_parent_concepts  (cost=0.00..100239.22 rows=5833822 width=8) (actual time=0.025..2779.234 rows=6131244 loops=1)
     Total runtime: 38163.173 ms
    (12 rows)
    
    rsdb=# explain analyze select count(concept_id) from t_concept left outer join t_cached_parent_concepts on (concept_id = child_concept_id) where child_concept_id is null;
                                                                           QUERY PLAN                                                                       
    --------------------------------------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=256432.52..256432.53 rows=1 width=8) (actual time=9205.009..9205.009 rows=1 loops=1)
       ->  Hash Anti Join  (cost=195950.99..255400.75 rows=412705 width=8) (actual time=4724.080..9186.629 rows=104395 loops=1)
             Hash Cond: (t_concept.concept_id = t_cached_parent_concepts.child_concept_id)
             ->  Seq Scan on t_concept  (cost=0.00..14495.54 rows=600954 width=8) (actual time=0.011..217.181 rows=601118 loops=1)
             ->  Hash  (cost=100239.22..100239.22 rows=5833822 width=8) (actual time=4706.365..4706.365 rows=6131244 loops=1)
                   ->  Seq Scan on t_cached_parent_concepts  (cost=0.00..100239.22 rows=5833822 width=8) (actual time=0.036..2117.103 rows=6131244 loops=1)
     Total runtime: 9205.170 ms
    (7 rows)
    
    
    

    The anti-join is quicker by a considerable margin – a great performance advantage – thank you!

    
    rsdb=# select count(*) from (select concept_id from t_concept except select child_concept_id from t_cached_parent_concepts) as tmp;
     count  
    --------
     104395
    (1 row)
    
    rsdb=# select count(concept_id) from t_concept left outer join t_cached_parent_concepts on (concept_id = child_concept_id) where child_concept_id is null;
     count  
    --------
     104395
    (1 row)
    
    
    

Leave a Comment

(Don't forget to fill in the Captcha)