Posts Tagged ‘SQL’

Inner/Outer Queries and Aliasing

Posted: March 18, 2012 in SQL
Tags:

A simple technique that I see few developers use is the inner outer query to present a final solution to a user.  In normal programming you don’t care about extraneous columns as they won’t be presented to the user however when generating reports from a database using raw SQL it’s a timesaver to not have to go back and delete columns.

Here’s an example:

select alertstr from (
   select rejdate || ' : ' ||
   to_char(round((sum(case when rejects is null then 0 else rejects end)/sum(totals)),2),990.99) ||
   '% rejected' as alertstr, round(sum(rejects)/sum(totals),2) as rejmodel from (
       select to_char(dtent,'MM/dd') as rejdate, count(1) as totals, null as rejects
       from activity
       where dtent between to_date('12/01/2007','MM/dd/yyyy') and
       to_date('12/10/2007','MM/dd/yyyy') and ttype in ('E','F')
       and customer = 1234 group by dtent
       union all
       select to_char(dtent,'MM/dd'), null, count(1)
       from activity
       where dtent between to_date('12/01/2007','MM/dd/yyyy') and
       to_date('12/10/2007','MM/dd/yyyy') and ttype in ('E','F')
       and customer = 1234 and ptype='E' group by dtent
   ) group by rejdate
) where rejmodel >= 10 and rejmodel < 20 

* Note – This is an actual production query so I replaced the table names, id’s and columns with fake data.

You’re probably looking at this query and saying to yourself what a mess.  But let’s look at the requirements.

  1. The query must only return one column and one row only.  The row results will be directly display to the user without any programmatic interfernece.
  2. The query must show all rejections on an arbitrary data span.
  3. The query will potentially be parsing on database tables that contain upwards of 50-100 million rows of data.

This is a unique situation in that you are displaying query results to the user without any further business logic but it happens occasionally.  In hindsight given enough time and concrete requirements I would have redesigned this feature set as the logic is far too encapsulated.

Examining the query the sequence of events occurs as:

  1. Pull data from the activity table and union this data on a second set from the same table.  Notice the columns, the first query is pulling totals and the second query in the union is pulling rejects.  When performing a union there is no reason to alias the columns after the first query.
  2. Since we’ve aliased the columns in the inner query from step 1, the outer query call pull this directly.  Select rejects pulls the third column from the inner query.  Aliasing is vital when working with inner/outer queries and is really the only way when aggregating data on the fly.  The second to last outer query concatenates data such as:
     select rejdate || ' : ' || to_char(round((sum(case when rejects is null then 0 else rejects end)/sum(totals)),2),990.99) || '% rejected' as alertstr
    
  3. In the final outer query instead of saying :
     select rejdate || ' : ' || to_char(round((sum(case when rejects is null then 0 else rejects end)/sum(totals)),2),990.99) || '% rejected'
    

    we just have to say

    select alertstr

    So now all this aliasing makes sense now.  This is also useful since in the where clause we are going to constrain based on the second query which is performing a round function.

If you break down what is occurring in this query you can see the power and flexibility that inner/outer aliasing provides.  It’s important to be able to visualize your data sets in your mind and you’ll be all set.

I ran into an interesting situation today helping a colleague with her homework.  This person does not have a background in programming or databases and is taking a series of courses related to HIT (health information technology).  In this course she was presented with an ERD (entity relationship diagram) which she had to modify based on the requirements of a given CCD (continuity of care document).  This is the diagram this person was given.

ERD Example

We have to remember this diagram was given to someone with almost no relational database experience.  I can clearly see a few issues where I would have done things differently.  I would assert that the technology one is using, and perhaps more importantly, the technology one has had experience with alters your view.  What do I mean by this?  A developer with 10 or more years has probably spent a lot of time with raw SQL before moving on to working with entities where a developer with only a year or two of experience may have never even worked with raw SQL.  A developer in a small firm with little QA that has had to run analysis queries at 3am versus a 9-5 developer at a large corporation with CMMI level 5 documentation and more than adequate QA will also change your view. This also can be determined by the technology.  I had a very experienced developer of 20+ years who had spent almost all of his time working on the Microsoft stack.  He only knew the very rudimentary basics of SQL as he always had a DBA to assist him with performance and optimization of queries (in fact I believe he just gave the DBA the basic of what was needed and depended on him).  This developer was able to easily pick up working with straight SQL and eventually created some advanced queries that were ingenious.

Getting back on topic, this diagram got me to thinking about the drawbacks of not using a sequence number on the row data and instead using a composite primary key.  In one of my past life’s we had a sequence number (acting as a persistent object identifier) on almost every table in the database.  I saw a discussion recently which was quite involved where the DBA’s where eschewing the concept of an identity column.  What are the advantages and disadvantages of using a sequence number on a table? From what I can tell:

Advantages:

  • Normalization
    • Join Dependency – DBA’s will often argue that a sequence number is superfluous and not needed.  A developer will often complain that it convolutes the join they need to code.  It really gets down to the details of the table.  In the ERD above to perform a join on the encounter to patient we have to reference two columns to obtain a truly unique id.  If the sequence number is a primary key we would be able to simplify this greatly.  In a typical entity this is a non-issue.
    • Consistency
      • Consistency across tables isn’t a big issue when using entities, you code once, and load the entity in the persistence manager and you’re done.  However, at 3am when a developer needs to perform an analysis quickly, half asleep it becomes an issue.  Say that you have to find a count of encounter per encounter date for patients with the last name starting with ‘A’.
 select count(e.encounter_id), to_char(e.encounter_date, ‘MM/dd/yyyy’) from person p
left outer join patient pa on pa.person_oid=p.person_oid and pa.person_id=p.person_id
left outer join encounter e on e.patient_oid=pa.patient_oid and e.patient_id=pa.patient_id
where upper(p.last_name) like ‘A%’ group by to_char(e.encounter_date, ‘MM/dd/yyyy’) 

Ugh, that’s gnarly. Why the separate table for a patient object? Remember I didn’t make these. Reformatted for simplicity we have.

Our SQL query now appears as:

select count(e.seqno), to_char(e.encounter_date, ‘MM/dd/yyyy’) from patient pa
left outer encounter e on e.patient=pa.seqno
where upper(pa.last_name) like ‘A%’ group by to_char(e.encounter_date, ‘MM/dd/yyyy’)

The query is simplified and is clearer to the developer at 3am.  Having consistent columns for uniqueness is advantageous.  The developer of course still has to recognize what index’s exist on the table but knowing the seqno is always an indexed column that can be directly referenced is advantageous.  In this case the LAST_NAME column would also most likely be indexed on the PATIENT table.

  • Abstraction from business modeling
    • Using a sequence number as a persistent object identifier abstracts your identity for the row from the business model logic.  Changing the business model logic will not affect the key and no changes will be necessary to other tables.  In the initial example the OID and ID combination’s may change.  A year later a new object model is implemented for the system, you could possibly have the same patient twice.  Once for the previous OID and another for the new OID.