Medical Nerds Blog Logo

technology, stats and IT for medics


Recursive SQL with PostgreSQL 8.4

March 4th, 2010 by Mark · 1 Comment

Representing hierarchical data in a relational database is easy. For instance, modelling the contents of a filesystem with directories and subdirectories is straightforward using self-joins on a parent key. The root node(s) are represented as those without a parent. Such a model is useful for modelling other types of data – in particular, hierarchies in which nodes can have multiple parents – for example, SNOMED CT is a hierarchical clinical terminology – the “Systematised Nomenclature of Medicine Clinical Terms”. The Connecting for Health website suggests this terminology is in widespread use by “all computers” in the NHS (website accessed 4th March 2010) but as those of us who work in the NHS will testify, this is not the case.

Those who work in the NHS will be more familiar with Read codes. These are primarily used in primary care information systems. The origins of terminology systems worldwide is a subject for another article, but Read codes are limited to a fixed hierarchy. Such a hierarchy results in duplicate terms as semantically some terms may be logically related to different terms in the hierarchy.

The structure of SNOMED CT is complex and again, beyond the scope of this article. Those who download the distribution will find a large number of explanatory technical and implementation guidelines. There are essentially three tables representing a Concept, a Description and a Relationship. There is a one to many relationship between Concept and Description – as such, one concept can have multiple descriptions. There is also a more complex relationship between Concept and Relationship with a relationship defined as a join table between related concepts. Each relationship links a “source concept” to a “target concept” and as such has an implicit direction – from parent to child. A relationship can have different types – themselves represented as a concept – the most commonly used concept reflects a “IS A” type of relationship. Thus, each Relationship has three Concept foreign keys – “source Concept”, “target Concept” and “type Concept”.

The greatest difficulty in implementing SNOMED CT in an application is its sheer size. I debated using only a small (manually managed) subset but instead decided to include all SNOMED CT terms available – including those in designated UK-specific subpacks such as that provided to document pharmaceutical products available in the UK. As such, careful optimisation of full-text searches and other queries is essential, with applications decided which subsets to use.

My web-based application supports automatic updating from multiple SNOMED CT distributions and as such grows very large. Even with only the core SNOMED CT distribution there are 388,289 concepts, 1,149,406 descriptions and 1,387,930 relationships.

How does this relate to the use of recursive SQL in PostgreSQL? My naive approach to implementing SNOMED CT was to iterate manually through an object graph of SNOMED CT as provided by a sophisticated object-relational mapping (ORM) – in this case the Enterprise Object Framework (EOF) from WebObjects. This works well for many use cases, but iterating recursively to find all child or parent concepts uses memory and processor time. There are not many use cases when you need to do this – certainly fetching all child concepts of a high-level SNOMED CT concept even as raw SQL is inappropriate. However, there is a need to find all parent concepts of a particular concept to understand its position in a semantic sense. For instance, understanding algorithmically than “Friedreich’s ataxia” is a “Degenerative disease”.

It certainly is possible to generate a cache of all parent concepts and keep this up to date which can sometimes be required if you clearly demonstrate that this is the rate-limiting step using profiling. However, I try to avoid early optimisation and as such, use recursive SQL to rapidly identify all parent concepts.

Recursive queries appeared in PostgreSQL only with the release of version 8.4. The documentation is here.

Here is my table structure:

rsdb=# \d t_concept
Table "public.t_concept"
Column        |          Type          |                      Modifiers
concept_id           | bigint                 | not null default nextval('t_concept_seq'::regclass)
concept_status_code  | integer                | not null
ctv_id               | character varying(10)  | not null
fully_specified_name | character varying(255) | not null
is_primitive         | integer                | not null
snomed_id            | character varying(10)  | not null

rsdb=# \d t_description
Table "public.t_description"
Column          |          Type          |                        Modifiers
concept_id              | bigint                 | not null
description_id          | bigint                 | not null default nextval('t_description_seq'::regclass)
description_status_code | integer                | not null
description_type_code   | integer                | not null
initial_capital_status  | character varying(10)  | not null
language_code           | character varying(10)  | not null
term                    | character varying(255) | not null

rsdb=# \d t_relationship
Table "public.t_relationship"
Column            |         Type          |                        Modifiers
characteristic_type          | integer               | not null
refinability                 | integer               | not null
relationship_group           | character varying(10) | not null
relationship_id              | bigint                | not null default nextval('t_relationship_seq'::regclass)
relationship_type_concept_id | bigint                | not null
source_concept_id            | bigint                | not null
target_concept_id            | bigint                | not null

And this is the SQL to fetch the concept_id of all parent concepts of the concept Friedreich’s ataxia (concept ID 10394003).

with recursive parent_concepts(concept_id) as (
select t0.concept_id from t_concept t0 inner join t_relationship t1 on t0.concept_id = t1.target_concept_id
where (t1.relationship_type_concept_id = 116680003 and t1.source_concept_id = 10394003)
union all
select t0.concept_id from t_concept t0,t_relationship t1, parent_concepts pc where t0.concept_id = t1.target_concept_id and
t1.source_concept_id = pc.concept_id and t1.relationship_type_concept_id = 116680003
select distinct(concept_id) from parent_concepts;

You might ask, why am I limiting the parent concepts to those linked by relationships with the concept type “116680003”? This is because that reflects the IS A relationship type concept.

A quick hack to get the fully specified names of the concepts…

rsdb=# select fully_specified_name from t_concept, (with recursive parent_concepts(concept_id) as (
rsdb(# select t0.concept_id from t_concept t0 inner join t_relationship t1 on t0.concept_id = t1.target_concept_id
rsdb(# where (t1.relationship_type_concept_id = 116680003 and t1.source_concept_id = 10394003)
rsdb(# union all
rsdb(# select t0.concept_id from t_concept t0,t_relationship t1, parent_concepts pc where t0.concept_id = t1.target_concept_id and
rsdb(# t1.source_concept_id = pc.concept_id and t1.relationship_type_concept_id = 116680003
rsdb(# )
rsdb(# select distinct(concept_id) from parent_concepts) as temp where t_concept.concept_id = temp.concept_id;

gives these results:

 Disorder of head (disorder)
 Clinical finding (finding)
 Cerebellar disorder (disorder)
 Disorder of body system (disorder)
 Spinal cord disorder (disorder)
 Disorder of nervous system (disorder)
 Disorder of back (disorder)
 Encephalomyelopathy (disorder)
 Head finding (finding)
 Degenerative disorder (disorder)
 Disorder of body cavity (disorder)
 Disease (disorder)
 Disorder of spine (disorder)
 Finding of back (finding)
 Finding of head and neck region (finding)
 Finding by site (finding)
 Disorder of spinal region (disorder)
 Disorder of the central nervous system (disorder)
 Degenerative disease of the central nervous system (disorder)
 Spinocerebellar disease (disorder)
 Finding of body region (finding)
 Degenerative brain disorder (disorder)
 Disorder by body site (disorder)
 Finding of spinal region (finding)
 Disorder of brain (disorder)

Note: I am not limiting concepts by whether they are “active” or not, as this is something I switch on and off at the application level.

Tags: Databases · Medical

1 response so far ↓

  • 1 Junai // Jan 19, 2011 at 12:55 pm

    Hi Can you please help me write a query which can use recursive SQL to generate output like this.
    Concept A has a Parent Concept B.
    table structure is
    concept name
    parent id

    id is used as foreign key on parentid in the same table

Leave a Comment

(Don't forget to fill in the Captcha)