Showing posts with label postgres.

Friday, September 20, 2013

postgres killing queries

Once in a while our database server gets terribly overloaded because people are running thousands of long-running queries against it.

1. How to find out who is generating queries:

select datname, client_addr from pg_stat_activity;


2. Killing queries from a specific IP:

select pg_terminate_backend(procpid) from pg_stat_activity where client_addr='IP';

This can be done as the postgres user.
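
An untested variation of the same idea: instead of killing everything from one IP, terminate only the queries that have been running for more than a few minutes. Note that on PostgreSQL 9.2 and later the column is called pid instead of procpid.

-- sketch: kill only long running queries from a given IP (procpid is pre-9.2, use pid on 9.2+)
select pg_terminate_backend(procpid)
from pg_stat_activity
where client_addr = 'IP'
  and now() - query_start > interval '5 minutes';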

Monday, March 11, 2013

Lift - getting access to a database

Since I recently started to play more and more with Lift, I've discovered that things are not always as easy as you'd think, and yet once you find a solution it usually turns out to have been easy all along.

For example, I wanted to connect to my database server, but had no desire to actually use any form of O/R mapping.

So after googling a bit and wondering how to do this, it turns out to be all rather simple.

In your Boot.scala file:


def boot {
    // where to search snippet
    LiftRules.addToPackages("code")

    // Build SiteMap
    LiftRules.setSiteMap(sitemap)

    //rest based stuff

    /**
     * connection url for lift
     */
    val vendor = new StandardDBVendor("org.postgresql.Driver","jdbc:postgresql://****/****",Full("username"),Full("password")) with ProtoDBVendor{
      override def maxPoolSize = 100
      override def doNotExpandBeyond = 200
    }
    
    DB.defineConnectionManager(DefaultConnectionIdentifier, vendor)

  }

This basically defines what to connect to and a connection pool with a max of 100 connections.

To actually access this connection and do something with it, well, this is remarkably easy:


          DB.performQuery("select * from data")

Tuesday, April 3, 2012

calculating mass spec similarity directly in postgres using stored procedures

Since I'm currently debugging a rather annoying bug, I need to be able to run the similarity calculation directly in the database, so I wrote the following stored procedure:




CREATE OR REPLACE FUNCTION binbase.calculatesimilarity (in unknown text, in library text) RETURNS float8 AS
$BODY$

DECLARE
    result float8;
    sameIons int[];
    sameSpectraRelativeValuesunknown float8[];
    sameSpectraAbsoluteValuesunknown float8[];

    sameSpectraRelativeValueslibrary float8[];
    sameSpectraAbsoluteValueslibrary float8[];

    unknownSpectra float8[];
    unknownSpectraRel float8[];

    librarySpectra float8[];
    librarySpectraRel float8[];

    unknownSpectraLength int :=0;

    f1 float8 := 0;
    f2 float8 := 0;

    lib float8 := 0;
    unk float8 := 0;

    sqrt1 float8 := 0;
    summ float8 := 0;
    summ4 float8 := 0;
    summ2 float8 := 0;
    summ3 float8 := 0;

    array_len int;
    sameIon int;

BEGIN

    unknownSpectra = convertSpectra(unknown);
    unknownSpectraRel = createRelavieSpectra(unknownSpectra);

    librarySpectra = convertSpectra(library);
    librarySpectraRel = createRelavieSpectra(librarySpectra);

    array_len = 1000;

    sameIon = 0;

    for i in 1 .. array_len
    LOOP
        -- this will contain all the identical ions --
        IF unknownSpectra[i] is not null and librarySpectra[i] is not null
        then
            sameIons[sameIon] = i;
            sameSpectraRelativeValuesunknown[sameIon] = unknownSpectraRel[i];
            sameSpectraAbsoluteValuesunknown[sameIon] = unknownSpectra[i];
            sameSpectraRelativeValueslibrary[sameIon] = librarySpectraRel[i];
            sameSpectraAbsoluteValueslibrary[sameIon] = librarySpectra[i];
            sameIon = sameIon + 1;
        END IF;
       
    END LOOP;


    -- calculate f1 --
    for i in 1 .. sameIon
    LOOP
        -- this will contain all the identical ions --
        IF sameIons[i] is not null 
        then
            sqrt1 = sqrt(sameSpectraRelativeValueslibrary[i] * sameSpectraRelativeValuesunknown[i]);
            summ4 = summ4 + (sqrt1 * sameIons[i]);

            IF i > 1
            THEN
                unk = sameSpectraRelativeValuesunknown[i]/sameSpectraRelativeValuesunknown[i-1];
                lib = sameSpectraRelativeValueslibrary[i]/sameSpectraRelativeValueslibrary[i-1];

                if unk <= lib
                then
                    summ = summ + (unk/lib);
                else
                    summ = summ + (lib/unk);
                end if;
            END IF;
        END IF;
    END LOOP;

    unknownSpectraLength = 0;

    for i in 1 .. array_len
    LOOP
        IF librarySpectra[i] is not null and librarySpectra[i] > 0
        then
            summ2 = summ2 + (librarySpectraRel[i] * i);
        END IF;
       
        IF unknownSpectra[i] is not null and unknownSpectra[i] > 0
        then
            unknownSpectraLength = unknownSpectraLength + 1;
            summ3 = summ3 + (unknownSpectraRel[i] * i);
        END IF;
    END LOOP;

    f1 = summ4 / sqrt(summ2 * summ3);
    f2 = 1.0/sameIon * summ;

    result = (1000.0/(unknownSpectraLength + sameIon))*((unknownSpectraLength * f1) + (sameIon * f2));

    RETURN result;

EXCEPTION
    WHEN division_by_zero THEN
        RAISE NOTICE 'caught division_by_zero';
        RETURN 0;
END;
$BODY$
LANGUAGE plpgsql;




It's definitely not the fastest or most beautiful implementation I could think of, but it's quick enough for now.

It definitely makes the debugging of my algorithm easier and allows me to test new algorithms quicker.
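
Just for reference, a call might look like the line below; the two spectra strings are made-up Leco-style 'ion:intensity' lists (see the convertSpectra post further down).

select binbase.calculatesimilarity('87:1500 88:300 101:2000', '87:1400 88:350 101:1900') as similarity;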

converting an absolute massspec to a relative massspec using postgres stored functions

The next step toward a working similarity calculation is to convert our absolute mass spec into a relative spectrum, which is rather simple using stored procedures:



-- calculates a relative spectra
create or replace function createRelavieSpectra(spectra float8[]) returns float8[] AS $$

DECLARE
result float8[1000];
array_len int;
maxValue float8 := 0;

BEGIN

array_len = array_upper(spectra,1);

for i in 1 .. array_len
LOOP
IF spectra[i] is not null and spectra[i] > maxValue
THEN
maxValue := spectra[i];
END IF;
END LOOP;

for i in 1 .. array_len
LOOP
if spectra[i] is not null
then
result[i] := spectra[i]/maxValue * 100;
end if;
END LOOP;

RETURN result;

END;
$$ LANGUAGE plpgsql;
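
An illustrative call with a made-up three element spectrum; the largest intensity becomes 100 and everything else is scaled relative to it.

select createRelavieSpectra(ARRAY[200, 500, 1000]::float8[]);
-- expected result: {20,50,100}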

converting massspecs to an array using postgres stored functions

Currently I'm in the process of writing a couple of stored procedures to query our BinBase database directly by spectral similarity, and the first step was to transform a standard Leco spectra string into a standard double array representing a spectrum.



-- converts a spectra string to a float array for further use --
create or replace function convertSpectra(spectra text) returns float8[] AS $$

DECLARE
result float8[1000];
spectraSplit text[];
ionSplit text[];
array_len int;

ion int;
value float8;

BEGIN

spectraSplit = regexp_split_to_array(spectra,' ');
array_len = array_upper(spectraSplit,1);

for i in 1 .. array_len
LOOP
ionSplit = regexp_split_to_array(spectraSplit[i],':');

ion := ionSplit[1];
value := ionSplit[2];


result[ion] = value;

END LOOP;

RETURN result;

END;
$$ LANGUAGE plpgsql;
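
An illustrative call with a made-up two ion spectrum; each 'ion:intensity' pair ends up at the array position of its ion, everything in between stays NULL.

select convertSpectra('87:1500 101:3000');
-- subscript 87 of the returned array holds 1500, subscript 101 holds 3000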

Friday, October 21, 2011

grouping dates in postgres using date_trunc

Well, recently I started to write some statistics functions for my miniX application, and so I was wondering how you can easily execute queries like:

give me the count of objects for the last 24h grouped by hour

which should be rather simple, if you know about the function 'date_trunc' in postgres.
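
A quick illustration of what date_trunc does (timestamp made up):

select date_trunc('hour', timestamp '2011-10-13 15:42:17');
-- 2011-10-13 15:00:00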

So 5 minutes later some sql appeared...


"select max(object_count) as object,date_trunc('hour',created_at) as date from bbstatistic_snapshot where object_name = ? and created_at between ? and ? group by date_trunc('hour',created_at) order by date_trunc('hour',created_at)"


which works rather well and gets the job done, but the speed is a touch slow.

Results should look like:


object | date
--------+---------------------
1742 | 2011-10-13 15:00:00
1742 | 2011-10-13 16:00:00
1742 | 2011-10-13 17:00:00
1742 | 2011-10-13 18:00:00
2102 | 2011-10-13 19:00:00
2102 | 2011-10-13 20:00:00
2057 | 2011-10-13 21:00:00
1899 | 2011-10-13 22:00:00
1803 | 2011-10-13 23:00:00
1742 | 2011-10-14 00:00:00
1742 | 2011-10-14 01:00:00
1742 | 2011-10-14 02:00:00
1742 | 2011-10-14 03:00:00
1742 | 2011-10-14 04:00:00
1742 | 2011-10-14 05:00:00
1742 | 2011-10-14 06:00:00

Friday, May 7, 2010

moving indexes in postgres to a new tablespace

Recently I upgraded my server and added an SSD RAID array of three 128GB disks, and for obvious reasons I want to have my indexes on it. So the first step was to create a new tablespace:


create tablespace tablespace_name owner role_name location '/path/to/directory'


and then creating a quick script to move all the indexes


select 'ALTER INDEX '||indexname||' SET TABLESPACE compoundindex ;' from pg_catalog.pg_indexes where schemaname = 'public' order by tablename;


The generated statements then just need to be executed.
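
The generated output looks roughly like this (index name taken from the partial index post below, tablespace name from the query above):

ALTER INDEX bin_retention_index SET TABLESPACE compoundindex ;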

Friday, March 5, 2010

Postgres - partial indexes and sort indexes

This morning brought a sad moment. Normally people ask me all the time about SQL optimization, SQL tuning, when to create an index, and so on.

This morning I had another DOOOH moment.

Basically, while trying to optimize BinBase even further, I noticed that some queries take an absurd amount of time to finish, like 500 seconds to query the bin table.

So I started to investigate and discovered that some of my queries were using seq scans over huge tables, for no apparent reason.

It turned out that the sort column was not indexed...


CREATE INDEX bin_retention_index ON bin (retention_index ASC NULLS LAST);


Query times improved greatly after executing this statement.

Now there was still a huge wait time for a certain query


explain select * from BIN where bin_id not in ( SELECT bin_id from SPECTRA where sample_id = 1630733 and bin_id is not null )ORDER BY retention_index ASC


and it turned out that indexes were never used for the 'is not null' part. After some research and a lot of head scratching I found that postgres supports partial indexes for exactly this case:


create index spectra_bin_id_partial_is_not_null on spectra (bin_id) where bin_id is not null


Afterward we saw some improvement, but the remaining slow data access is caused by something else: we always need the complete table minus a few compounds, and that operation takes some time.

Time to optimize the internal database lookup...

Saturday, February 20, 2010

postgres and insert performance with grails

Recently I spent a lot of time tuning a postgres database and an algorithm (based on grails) to try to insert 80M chemical compounds into a database, with reference checks and assurance of their uniqueness.


The main issue is that the inserts get slower and slower over time and end up taking too long. To improve the performance we started with the following approach:
  • more memory
  • database indexes (obvious)
  • flush the hibernate session every 100 inserts
  • flush the grails cache every 100 inserts
  • tune several postgres parameters
  • re-index the database every 10k inserts
  • enable vacuum on a 10k row basis (see the sketch after this list)
  • enable analyze on a 10k row basis
  • set auto vacuum to check every 10 minutes
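
A sketch of the kind of per-table settings behind the vacuum/analyze bullets; the values are illustrative and the table name compound is hypothetical (the 10-minute check interval corresponds to autovacuum_naptime, which is set in postgresql.conf):

alter table compound set (
    autovacuum_vacuum_threshold  = 10000,  -- vacuum after roughly 10k changed rows
    autovacuum_analyze_threshold = 10000   -- analyze after roughly 10k changed rows
);
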
But after all this work we are still stuck with the same problem: the inserts get gradually slower over time, which is most likely related to the index building time.

Graph over 1M inserts: TODO, it's still in the calculation...

Tuesday, February 9, 2010

Postgres and utf8

Recently I discovered that my postgres database is not running with UTF-8 support, which causes all kinds of headaches.

So what do you have to do to change this?

  • backup your database
  • set Lang to: LANG = en_US.UTF-8
  • execute initdb
  • recreate the database and specify the encoding: createdb --encoding=unicode
  • import your data

you should now have a working utf-8 database
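
A quick way to verify the result from psql:

SHOW server_encoding;
-- should now report UTF8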

Sunday, January 24, 2010

how to handle 400 Billion rows in postgres

Well, currently I'm trying to optimize a denoising and replacement algorithm. This includes calculating ion traces over thousands of files, each of which has 20k peaks or 1 million ions.
To make this more challenging we don't do this once, we do it during every BinBase export, since this is the base algorithm for the Zero replacement.

In short, we do the same calculation over and over again, for example calculating the minimum intensity for 100 files between 15 and 20 seconds for the ion 87.

In short, this screams for a SQL database.

Now how much data do we have?

  • 40k files
  • each file has around 20k spectra
  • each spectrum has up to 500 ions

which translates into

400 000 000 000 rows

Performance Issues and how to make Postgres handle this amount of data

The first attempt was to just store all these files into a single table using hibernate, which caused an OutOfMemory exception, surprise surprise.

So the next attempt was to use a stateless session, which worked quite well: up to 1 billion rows we had a query speed of under 20ms, quite nice for no tuning of the database. But once we got a bit over 1 billion rows the query speed got worse and worse. In short, this was not an acceptable solution.

So we tried to use partitions with postgres and define one table for each ion, inheriting from a master table.
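
This is roughly how such a partition can be set up; the names follow the pattern used in the rest of this post, it is not the exact script we used:

create table iontrace_100 (
    check (ion = 100)
) inherits (iontrace);

create index index_iontrace_100 on iontrace_100 (ion);

The master table, with the rewrite rules that route inserts into the correct partition, then looks like this: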


Table "public.iontrace"
Column | Type | Modifiers
------------+------------------+-----------
id | bigint | not null
intensity | double precision |
ion | integer |
spectra_id | bigint |
Indexes:
"iontrace_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"fk4f93923dc3b62c9a" FOREIGN KEY (spectra_id) REFERENCES spectra(id)
Rules:
rule_index_iontrace_100 AS
ON INSERT TO iontrace
WHERE new.ion = 100 DO INSTEAD INSERT INTO iontrace_100 (id, intensity, ion, spectra_id)
VALUES (new.id, new.intensity, new.ion, new.spectra_id)
rule_index_iontrace_101 AS
ON INSERT TO iontrace
WHERE new.ion = 101 DO INSTEAD INSERT INTO iontrace_101 (id, intensity, ion, spectra_id)
VALUES (new.id, new.intensity, new.ion, new.spectra_id)
rule_index_iontrace_102 AS
ON INSERT TO iontrace
WHERE new.ion = 102 DO INSTEAD INSERT INTO iontrace_102 (id, intensity, ion, spectra_id)



which didn't work with hibernate. It kept complaining about


org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1


so we had to write our own batcher, as was described here

a bit later we rewrote the example to actually work, since the batchSize field is private.


import java.lang.reflect.Field
import java.sql.PreparedStatement
import java.sql.SQLException

import org.hibernate.HibernateException
import org.hibernate.jdbc.BatchingBatcher
import org.hibernate.jdbc.BatchingBatcherFactory

/**
 * hibernate partition batcher
 */
public class HibernatePartionBatcher extends BatchingBatcher {

    public HibernatePartionBatcher(org.hibernate.jdbc.ConnectionManager connectionManager, org.hibernate.Interceptor interceptor) {
        super(connectionManager, interceptor)
    }

    @Override
    protected void doExecuteBatch(PreparedStatement ps) throws SQLException, HibernateException {

        // use reflection to access the private field of the super class, ugly but necessary
        Field field = this.getClass().getSuperclass().getDeclaredField("batchSize")
        field.setAccessible(true)

        int value = field.getInt(this)

        if (value != 0) {
            try {
                ps.executeBatch()
            }
            catch (RuntimeException re) {
                throw re;
            }
            finally {
                field.setInt(this, 0)
            }
        }
    }
}

/**
 * hibernate factory to initialize the batcher
 */
public class HibernatePartitionBatcherFactory extends BatchingBatcherFactory {

    public HibernatePartitionBatcherFactory() {
    }

    public org.hibernate.jdbc.Batcher createBatcher(org.hibernate.jdbc.ConnectionManager connectionManager, org.hibernate.Interceptor interceptor) {
        return new HibernatePartionBatcher(connectionManager, interceptor);
    }
}



and to register this in our groovy script


def hibProps = [
"hibernate.jdbc.factory_class": HibernatePartitionBatcherFactory.class.getName()
]



A first test with just a single sample reveals that we are executing a sequential scan over all partitions.


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..1163.25 rows=3374 width=28)
   ->  Append  (cost=0.00..1163.25 rows=3374 width=28)
         ->  Seq Scan on iontrace  (cost=0.00..10.46 rows=7 width=28)
               Filter: (ion = 105)
         ->  Index Scan using index_iontrace_20 on iontrace_20 iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)
         ->  Index Scan using index_iontrace_21 on iontrace_21 iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)
         ->  Index Scan using index_iontrace_22 on iontrace_22 iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)
         ->  Index Scan using index_iontrace_23 on iontrace_23 iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)
         ->  Index Scan using index_iontrace_24 on iontrace_24 iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)
         ->  Index Scan using index_iontrace_25 on iontrace_25 iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)



to avoid this you have to change a parameter in postgresql.


SET constraint_exclusion = on;


and the new result shows that we only work on the correct partition now


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..13.03 rows=14 width=28)
   ->  Append  (cost=0.00..13.03 rows=14 width=28)
         ->  Seq Scan on iontrace  (cost=0.00..10.46 rows=7 width=28)
               Filter: (ion = 105)
         ->  Index Scan using index_iontrace_105 on iontrace_105 iontrace  (cost=0.00..2.57 rows=7 width=28)
               Index Cond: (ion = 105)



it still executes a sequential scan, but now it only does so on the correct partition with the stored data.

Now, to reduce the likelihood of a sequential scan over the database, we will define indexes on all the tables (the script at the end of this post does this for you).

the new query plan is


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..4.15 rows=14 width=28)
   ->  Append  (cost=0.00..4.15 rows=14 width=28)
         ->  Index Scan using index_iontrace_ion on iontrace  (cost=0.00..1.57 rows=7 width=28)
               Index Cond: (ion = 105)
         ->  Index Scan using index_iontrace_105 on iontrace_105 iontrace  (cost=0.00..2.57 rows=7 width=28)
               Index Cond: (ion = 105)


The result is slightly different since by now we have generated more data in the database, sorry for that. But you can see that the query plan is much more efficient with the index.

The important point is that you need an index on ion for every partition and for the main 'iontrace' table; if you don't have an index on the main table you will keep getting sequential scans.
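
For reference, the two index definitions the plan above relies on look like this (names taken from the plan):

create index index_iontrace_ion on iontrace (ion);
create index index_iontrace_105 on iontrace_105 (ion);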

The next snag we encountered was the speed of the actual data insert, since the check constraints seem to be rather slow and expensive.

First, this is a hibernate insert into the main data table:


LOG: duration: 8.961 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]
LOG: duration: 8.951 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]
LOG: duration: 8.966 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]


Second this is plain old sql inserted into the main iontrace table:


LOG: duration: 6.608 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1418.0,237,700013)]
LOG: duration: 6.610 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1373.0,238,700013)]
LOG: duration: 6.605 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1383.0,239,700013)]


Third this is plain old sql inserted in each partition:


LOG: duration: 0.055 ms statement: EXECUTE [PREPARE: insert into iontrace_475(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),27.0,475,680203)]
LOG: duration: 0.058 ms statement: EXECUTE [PREPARE: insert into iontrace_489(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),13.0,489,680203)]
LOG: duration: 0.051 ms statement: EXECUTE [PREPARE: insert into iontrace_496(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),13.0,496,680203)]
LOG: duration: 0.138 ms statement: EXECUTE [PREPARE: insert into Spectra (file_id, retentionTime, id) values ($1, $2, $3)]


From this we can see that it makes the most sense to use plain sql to insert the data into the related partitions and to avoid the checks during the insert phase.

For this we used the following approach


//build the ion trace
Statement statement = session.connection().createStatement()

for (int mass = 0; mass < mz.length; mass++) {
    if (mass >= beginMass && mass <= endMass) {
        if (intensity[mass] > 0) {
            IonTrace trace = new IonTrace(ion: mz[mass], intensity: intensity[mass])
            trace.spectra = spectra

            statement.addBatch("insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),${intensity[mass]},${(int)(mz[mass])},${spectra.id})")
        }
    }
}

statement.executeBatch()


which worked quite well. As a conclusion I removed the complete hibernate layer from this application, since we had lost most of its benefits by now.

To further optimize the insert speed we replaced the Statements with PreparedStatements, which improved the speed of the inserts by close to 50%.


LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]



Storage

As you can imagine this number of rows can take up quite some space; an initial estimate says it's roughly 20TB with the current table schema.

That is 20 times more than the files occupy on the hard drive.

So what can you do to keep the storage space down (see the sketch after this list)?
  • use the smallest data types possible
  • only index what actually needs to be indexed
  • remove columns which are not necessary; for example we don't need a primary key on the ion trace table
  • estimate the amount of data that will actually be stored in the database; for example we won't store billions of spectra, so we can use integer instead of bigint
  • store intensities as integer and not as double.
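
A sketch of what a slimmed-down trace table could look like under these rules (not the schema we actually deployed):

create table iontrace_slim (
    intensity integer,   -- intensities as integer instead of double precision
    ion smallint,        -- ions stay well below 1000, so smallint is enough
    spectra_id integer   -- we won't store billions of spectra, so integer instead of bigint
    -- no primary key: ion trace rows are never addressed individually
);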

To estimate the required storage we used the following query, which roughly gives us the storage needed for 4 years of netcdf files.


netcdf-repository=# SELECT pg_size_pretty(pg_database_size('netcdf-repository')/(select count(*) from netcdf)* 40000) as "estimated database size";
estimated database size
-------------------------
16 TB
(1 row)


And a server for this would currently cost around $7000.



Wednesday, November 18, 2009

SQL - small mistakes, big impact

The last couple of days I was fighting with a database which became slower and slower and now needed close to 24h to execute a rather simple query...

After working for hours on different parameters and tuning the database, including setting up a spare server to see if a dump and restore would fix the issue, I got the time down to 5 minutes.

Sounds good?

Well, not really, since I needed an execution time in the range of a couple of seconds. So I actually looked at the sql statement which caused the issues and found a gigantic error in it...

the statement:


select * from BIN where bin_id not in ( SELECT bin_id from SPECTRA where sample_id = 424105 AND SPECTRA.bin_id = BIN.bin_id)


and the correct statement would be:


select * from BIN a where a.bin_id not in ( SELECT bin_id from SPECTRA where sample_id = 424105 and bin_id is not null) ORDER BY retention_index ASC


the problem?


SPECTRA.bin_id = BIN.bin_id


This little part is totally wrong in this statement: it makes the subquery correlated, so it is re-executed for every row of BIN instead of being evaluated once, and that causes a seq scan over a table with 70 million entries against a table with 6000 entries. It does nothing for the query, and I guess it was left over from a copy/paste accident 6 years ago when I wrote this script originally.

let's compare some numbers:

explain before:

 Seq Scan on bin  (cost=0.00..459513.42 rows=3044 width=1290)
   Filter: (NOT (subplan))
   SubPlan
     ->  Bitmap Heap Scan on spectra  (cost=75.20..75.42 rows=2 width=8)
           Recheck Cond: ((sample_id = 424105::double precision) AND (bin_id = ($0)::double precision))
           ->  BitmapAnd  (cost=75.20..75.20 rows=2 width=0)
                 ->  Bitmap Index Scan on spectra_sample_id  (cost=0.00..18.99 rows=5696 width=0)
                       Index Cond: (sample_id = 424105::double precision)
                 ->  Bitmap Index Scan on spectra_binid  (cost=0.00..56.18 rows=17658 width=0)
                       Index Cond: (bin_id = ($0)::double precision)

explain after:

 Sort  (cost=1765.47..1766.23 rows=3044 width=1290)
   Sort Key: retention_index
   ->  Seq Scan on bin a  (cost=647.21..1747.86 rows=3044 width=1290)
         Filter: (NOT (hashed subplan))
         SubPlan
           ->  Index Scan using spectra_sample_id on spectra  (cost=0.00..646.42 rows=3138 width=8)
                 Index Cond: (sample_id = 424105::double precision)
                 Filter: (bin_id IS NOT NULL)


In real-world numbers this translates to 24 hours for the statement before, compared to 4 seconds for the statement after...
...

Monday, July 13, 2009

What's more expensive, an IN or a BETWEEN statement?

This morning I spent a couple of hours tuning and testing different SQL statements to find the fastest way to fetch some data for a real-time application.

While doing this I discovered that the IN query costs slightly more than the BETWEEN query.
It's not a lot, but still something.

IN query:


explain select sample_name,class,date_of_import,date from samples where visible = 'TRUE' and sample_name like '%q%' and date in( '5/26/2009 12:00:00 AM', '6/11/2009 12:00:00 AM','5/26/2009 12:00:00 AM','2009-06-01 12:00:00 AM') order by 1 desc


result:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=8126.07..8126.42 rows=139 width=37)
   Sort Key: sample_name
   ->  Seq Scan on samples  (cost=0.00..8121.12 rows=139 width=37)
         Filter: (((sample_name)::text ~~ '%q%'::text) AND ((visible)::text = 'TRUE'::text) AND (date = ANY ('{"2009-05-26 00:00:00","2009-06-11 00:00:00","2009-05-26 00:00:00","2009-06-01 00:00:00"}'::timestamp without time zone[])))

4 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms]

[Executed: 7/13/09 1:11:46 PM PDT ] [Execution: 2/ms]





BETWEEN query:


explain select sample_name,class,date_of_import,date from samples where visible = 'TRUE' and sample_name like '%q%' and date between '5/26/2009 12:00:00 AM' and '6/11/2009 12:00:00 AM' order by 1 desc


Result:


QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=8123.15..8123.32 rows=67 width=37)
   Sort Key: sample_name
   ->  Seq Scan on samples  (cost=0.00..8121.12 rows=67 width=37)
         Filter: (((sample_name)::text ~~ '%q%'::text) AND (date >= '2009-05-26 00:00:00'::timestamp without time zone) AND (date <= '2009-06-11 00:00:00'::timestamp without time zone) AND ((visible)::text = 'TRUE'::text))

4 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms]

[Executed: 7/13/09 12:53:32 PM PDT ] [Execution: 2/ms]