In implementing a comprehensive SNOMED CT browser as part of a research clinical information system, I cache the results of several processor and disk-intensive queries to speed up lookup and SNOMED CT navigation. I was finding some odd results and found that my caching was missing for several items. I needed a way of comparing the contents of one table with another in PostgreSQL and finding out which rows were not present.
To do this, PostgreSQL supports set operations. For instance, I have two tables:
rsdb=# \d t_concept
Table "public.t_concept"
Column | Type | Modifiers
----------------------+------------------------+-----------------------------------------------------
concept_id | bigint | not null default nextval('t_concept_seq'::regclass)
concept_status_code | integer | not null
ctv_id | character varying(10) | not null
fully_specified_name | character varying(255) | not null
is_primitive | integer | not null
snomed_id | character varying(10) | not null
rsdb=# \d t_cached_parent_concepts
Table "public.t_cached_parent_concepts"
Column | Type | Modifiers
-------------------+--------+-----------
child_concept_id | bigint | not null
parent_concept_id | bigint | not null
I want to identify rows in t_concept
without a corresponding row in t_cached_parent_concepts
using SQL:
Try this!
select concept_id from t_concept except select child_concept_id from t_cached_parent_concepts;
Now this doesn’t explain why my sophisticated caching system is not so sophisticated…..
rsdb=# select count(*) from t_concept;
count
--------
388289
(1 row)
rsdb=# select count(*) from (select concept_id from t_concept except select child_concept_id from t_cached_parent_concepts) as tmp;
count
--------
142960
(1 row)
Hmm.. that’s a lot of concepts without a cache….
Update 8th March 2010
rsdb=# select count(*) from t_concept;
count
--------
2
(1 row)
Now that’s better. But why isn’t there only a single root node with no parent concepts? Further investigation identifies a UK-only SNOMED term that hasn’t been classified correctly. A bug report will be filed with Connecting for Health!
2 responses so far ↓
1 Gavin // Mar 7, 2010 at 11:36 am
Personally I would have done this with a left join (although I use mysql for largely historical reasons). Is there a big performance advantage to this.
A (mysql) left join would go something like
select concept_id, parent_concept_id from t_concept left join t_cached_parent_concept on child_concept_id=concept_id where parent_concept_id=null
2 Mark // Mar 7, 2010 at 2:19 pm
Hi Gavin,
For readers who don’t understand that, you’re referring to an anti-join SQL pattern in which you try to keep rows that do not match.
What I’d use in this instance is:
Indeed, looking at how PostgreSQL plans these queries….
The anti-join is quicker by a considerable margin – a great performance advantage – thank you!
Leave a Comment
(Don't forget to fill in the Captcha)