Tuesday, December 21, 2010

registering jboss to start by default and conform to chkconfig

recently I had yet another BinBase install to do and it turned out that my standard jboss script is not chkconfig compatible.

So it was time to do a rewrite.


#! /bin/bash
#
# jboss    Bring up/down the jboss server
#
# chkconfig: 345 20 80
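# (start the service in runlevels 3, 4 and 5, with start priority 20 and stop priority 80)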
# description: Starts and stops the jboss server
#
# /etc/rc.d/init.d/jboss
# See how we were called.

#define where jboss is - this is the directory containing directories log, bin, conf etc
JBOSS_HOME=/share/apps/jboss

#define the user under which jboss will run, or use 'RUNASIS' to run as the current user
JBOSS_USER=${JBOSS_USER:-"root"}

#make sure java is in your path
JAVAPTH=/usr/java/latest/bin/java

#configuration to use, usually one of 'minimal', 'default', 'all'
JBOSS_CONF=${JBOSS_CONF:-"all"}
JBOSS_HOST=0.0.0.0

JBOSS_CONSOLE=$JBOSS_HOME/console.log
#if JBOSS_HOST specified, use -b to bind jboss services to that address
JBOSS_BIND_ADDR=${JBOSS_HOST:+"-b $JBOSS_HOST"}

#define the classpath for the shutdown class
JBOSSCP=${JBOSSCP:-"$JBOSS_HOME/bin/shutdown.jar:$JBOSS_HOME/client/jnet.jar"}

#define the script to use to start jboss
JBOSSSH=${JBOSSSH:-"$JBOSS_HOME/bin/run.sh -c $JBOSS_CONF $JBOSS_BIND_ADDR"}

if [ "$JBOSS_USER" = "RUNASIS" ]; then
SUBIT=""
else
SUBIT="su - $JBOSS_USER -c "
fi

if [ -n "$JBOSS_CONSOLE" -a ! -d "$JBOSS_CONSOLE" ]; then
# ensure the file exists
touch $JBOSS_CONSOLE
if [ ! -z "$SUBIT" ]; then
chown $JBOSS_USER $JBOSS_CONSOLE
fi
fi

Saturday, December 18, 2010

grails - providing your own plugins configuration

sometimes you want to share some services in a plugin, which need to be configured at runtime.

The easiest way is to provide a simple file in the grails conf directory of your plugin and just read and access it at runtime.

How to?

our little helper:


package binbase.core

import grails.util.GrailsUtil
import org.codehaus.groovy.grails.commons.GrailsApplication

/**
* Created by IntelliJ IDEA.
* User: wohlgemuth
* Date: 12/18/10
* Time: 1:00 AM
* To change this template use File | Settings | File Templates.
*/
class BinBaseConfigReader {

private static ConfigObject config = initialize()

/**
* initializes the object
* @return
*/
static ConfigObject initialize() {
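// loads the compiled BinBaseConfig.groovy from the plugin's conf directory off the classpath and slurps it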
return new ConfigSlurper().parse(new GroovyClassLoader(BinBaseConfigReader.class.getClassLoader()).loadClass('BinBaseConfig'))
}

/**
* returns the server
* @return
*/
static String getServer() {
if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_TEST)) {
return config.binbase.test.server.toString()
}
else if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_DEVELOPMENT)) {
return config.binbase.development.server.toString()
}
else if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_PRODUCTION)) {
return config.binbase.production.server.toString()

}
else {
throw new RuntimeException("unexpected environment found")
}
}
}



our configuration


/**
* this file contains the binbase configuration for this plugin.
*
*/
binbase {
production {
key = "dasdasd"
server = "10.1.1.1"
}
test {
key = "dasdasd"
server = "10.2.2.2"
}
development {
key = "dasdasd"
server = "127.0.0.1"
}
}


and to access it in, say, your spring configuration?


jndiBinBaseTemplate(org.springframework.jndi.JndiTemplate) {

environment = [
"java.naming.factory.initial": "org.jnp.interfaces.NamingContextFactory",
"java.naming.factory.url.pkgs": "org.jboss.naming:org.jnp.interfaces",
"java.naming.provider.url": "${BinBaseConfigReader.getServer()}:1099".toString()
]
}


easy, isn't it?

grails - testing service, which accesses ejb's

tonight I just wanted to write a quick grails plugin to access my BinBase database system.

Since I'm a smart developer I start with a simple spring dsl file and write a test.



// Place your Spring DSL code here
beans = {

jndiBinBaseTemplate(org.springframework.jndi.JndiTemplate) {

environment = [
"java.naming.factory.initial": "org.jnp.interfaces.NamingContextFactory",
"java.naming.factory.url.pkgs": "org.jboss.naming:org.jnp.interfaces",
"java.naming.provider.url": "localhost:1099"
]
}

/**
* connection to the cluster configuration
*/
clusterConfigService(SimpleRemoteStatelessSessionProxyFactoryBean) { bean ->
businessInterface = "edu.ucdavis.genomics.metabolomics.binbase.cluster.ejb.delegate.ClusterConfigService"
jndiName = "clusterservice/ClusterConfigServiceBean/remote"
jndiTemplate = ref(jndiBinBaseTemplate)
}

}


and the test code is simple enough



class BinBaseSchedulingServiceTests extends GroovyTestCase {


ClusterConfigService clusterConfigService

protected void setUp() {
super.setUp()
}

protected void tearDown() {
super.tearDown()
}

public void testTest(){
assertTrue(clusterConfigService.username != null)
}
}


the variable 'clusterConfigService' is a stateless session bean in a jboss container and I know it works...

except grails is complaining:



org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'clusterConfigService': Invocation of init method failed; nested exception is javax.naming.NameNotFoundException: Name [clusterservice/ClusterConfigServiceBean/remote] not bound; 0 bindings: []
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1412)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
at org.codehaus.groovy.grails.commons.spring.ReloadAwareAutowireCapableBeanFactory.doCreateBean(ReloadAwareAutowireCapableBeanFactory.java:135)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:456)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:291)
at
....lots and lots of more...
Caused by: javax.naming.NameNotFoundException: Name [clusterservice/ClusterConfigServiceBean/remote] not bound; 0 bindings: []
at org.springframework.mock.jndi.SimpleNamingContext.lookup(SimpleNamingContext.java:132)
at javax.naming.InitialContext.lookup(InitialContext.java:392)
at org.springframework.jndi.JndiTemplate$1.doInContext(JndiTemplate.java:154)
at org.springframework.jndi.JndiTemplate.execute(JndiTemplate.java:87)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:152)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:178)
at org.springframework.jndi.JndiLocatorSupport.lookup(JndiLocatorSupport.java:95)
at org.springframework.jndi.JndiObjectLocator.lookup(JndiObjectLocator.java:105)
at org.springframework.ejb.access.AbstractRemoteSlsbInvokerInterceptor.lookup(AbstractRemoteSlsbInvokerInterceptor.java:100)
at org.springframework.ejb.access.AbstractSlsbInvokerInterceptor.refreshHome(AbstractSlsbInvokerInterceptor.java:122)
at
...lots and lots of more...
[23:59:35,938] [ERROR] [main] [StackTrace] [Sanitizing stacktrace:]
javax.naming.NameNotFoundException: Name [clusterservice/ClusterConfigServiceBean/remote] not bound; 0 bindings: []
at org.springframework.mock.jndi.SimpleNamingContext.lookup(SimpleNamingContext.java:132)
at javax.naming.InitialContext.lookup(InitialContext.java:392)
at org.springframework.jndi.JndiTemplate$1.doInContext(JndiTemplate.java:154)
at org.springframework.jndi.JndiTemplate.execute(JndiTemplate.java:87)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:152)




hmm, doesn't make much sense. The reason for the failure is that grails decides to mock the JNDI access, which is normally ok, but in our case really not desired.

After all we want to access the application server context!

But it keeps jumping to the 'org.springframework.mock.jndi.SimpleNamingContext' mocking context.

So the only solution is this slightly dirty hack...


/***
* we do not need mocking in this plugin...
*/
if (org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder) {
org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder.deactivate()
}



and promptly it passes.

Since we only want to change this for test cases, we add yet another condition and our resources.groovy file ends up looking like


import org.springframework.ejb.access.SimpleRemoteStatelessSessionProxyFactoryBean
import grails.util.GrailsUtil
import org.codehaus.groovy.grails.commons.GrailsApplication

/** *
* we do not need mocking in this plugin...
*/
if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_TEST)) {
println "warning disabling mocking context!!! to make remote ejb's work"
if (org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder) {
org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder.deactivate()
}
}

// Place your Spring DSL code here
beans = {

jndiBinBaseTemplate(org.springframework.jndi.JndiTemplate) {

environment = [
"java.naming.factory.initial": "org.jnp.interfaces.NamingContextFactory",
"java.naming.factory.url.pkgs": "org.jboss.naming:org.jnp.interfaces",
"java.naming.provider.url": "eros.fiehnlab.ucdavis.edu:1099"
]
}

/**
* connection to the cluster configuration
*/
clusterConfigService(SimpleRemoteStatelessSessionProxyFactoryBean) { bean ->
businessInterface = "edu.ucdavis.genomics.metabolomics.binbase.cluster.ejb.delegate.ClusterConfigService"
jndiName = "clusterservice/ClusterConfigServiceBean/remote"
jndiTemplate = ref(jndiBinBaseTemplate)
}

}

Thursday, December 16, 2010

Grails 1.3.5 webtests - custom steps to reset the database

once in a while I run into the problem that I want to reset the complete database during webtests or take a snapshot of the current database while running the test.

So we use the great dbunit plugin to dump the database in one step:


import javax.sql.DataSource
import grails.plugin.remotecontrol.RemoteControl
import org.dbunit.database.DatabaseConnection
import org.dbunit.dataset.IDataSet
import org.dbunit.database.IDatabaseConnection
import org.dbunit.dataset.xml.FlatXmlDataSet
/**
* Created by IntelliJ IDEA.
* User: wohlgemuth
* Date: Nov 24, 2010
* Time: 3:46:47 PM
* To change this template use File | Settings | File Templates.
*/
class DumpDatabaseStep extends com.canoo.webtest.steps.Step {

String fileName = "result.xml"
/**
* resets the database
*/
void doExecute() {

RemoteControl remote = new RemoteControl()
try {

def outputFile = fileName
remote {
File file = new File("target/test-reports/data")
file.mkdirs()

DataSource dataSource = ctx.dataSource

IDatabaseConnection connection = new DatabaseConnection(dataSource.connection)

IDataSet fullDataSet = connection.createDataSet()

File out = new File(file, outputFile)

int counter = 1;

boolean run = out.exists()
while (run) {
out = new File(file, "${counter}-${outputFile}")

run = out.exists()
if (counter == 100) {
println "killed after 100 runs.."
run = false
}
counter++
}

FlatXmlDataSet.write(fullDataSet, new FileOutputStream(out))
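// return null, the remote-control plugin sends the closure result back to the test JVM and the data set itself is not something we want to serialize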

null
}
}
catch (Exception e) {
e.printStackTrace()

throw e
}

}
}


and also to reset the complete database in another step


import grails.plugin.remotecontrol.RemoteControl
import javax.sql.DataSource
import groovy.sql.Sql

/**
* simple step which resets the complete database
*/
class ResetDatabaseStep extends com.canoo.webtest.steps.Step {

/**
* resets the database
*/
void doExecute() {

try {
DbUnitOperator.create()


RemoteControl remote = new RemoteControl()

remote.exec {
try {
DataSource dataSource = ctx.dataSource
Sql sql = Sql.newInstance(dataSource)

sql.execute("DROP SEQUENCE MINIX_ID")
sql.execute("CREATE SEQUENCE MINIX_ID START WITH 1000 INCREMENT BY 1")
}
catch (Exception e) {
e.printStackTrace()
throw e
}
}


}
catch (Exception e) {
println "error: ${e.getMessage()}"
e.printStackTrace()
throw new RuntimeException(e)
}

}
}

Grails 1.3.5 webtests and JQuery trouble

recently I migrated my prototype code over to JQuery and promptly all my webtests failed with a lot of error messages and exceptions.

Like:


"TypeError: Cannot find function createComment in
object [object]. (FunctionalTestCase.groovy#372)"


So during some googling I found these reports:

old htmlunit version

replacing version

Which led to me reporting:

my approach

and the final solution was to put this into your 'BuildConfig.groovy'


plugin("webtest") {
test('net.sourceforge.htmlunit:htmlunit:2.8-SNAPSHOT') {
excludes 'xalan'
excludes 'xml-apis'
}
test('com.canoo.webtest:webtest:3.1-SNAPSHOT') {
excludes 'xalan'
excludes 'xml-apis'
}
test('xalan:xalan:2.7.0') {
excludes 'xml-apis'
}
}

Tuesday, November 23, 2010

grails many to many

currently grails many to many relationships are giving me a lot of grief with exceptions like



org.hibernate.StaleObjectStateException:
Row was updated or deleted by another transaction (or unsaved-value mapping was
incorrect): [ClassName#id]



or constraint exceptions.

Which all turned out to be related to the optimistic locking failing from time to time, so it needs to be disabled with version = false in the domain mapping.
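
In a domain class this looks roughly like the sketch below; the class and property names are made up, only the mapping block with version false matters:

class Compound {
    String name
    static hasMany = [samples: Sample]
    static mapping = {
        version false   // disable optimistic locking for this class
    }
}

class Sample {
    String label
    static hasMany = [compounds: Compound]
    static belongsTo = Compound
    static mapping = {
        version false
    }
}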

References:

disable optimistic locking
delete many to many relations

Thursday, September 30, 2010

BinBase - generate histograms

sql statements to generate histograms, to learn more about your data

similarity


select count("range"), "range" from
(
  select
    case
      when match > 500 and match <= 550 then '501 - 550'
      when match > 550 and match <= 600 then '550 - 600'
      when match > 600 and match <= 650 then '600 - 650'
      when match > 650 and match <= 700 then '650 - 700'
      when match > 700 and match <= 750 then '700 - 750'
      when match > 750 and match <= 800 then '750 - 800'
      when match > 800 and match <= 850 then '800 - 850'
      when match > 850 and match <= 900 then '850 - 900'
      when match > 900 and match <= 950 then '900 - 950'
      when match > 950 and match <= 1000 then '950 - 1000'
    end AS "range"
  from spectra a, bin b
  where a.bin_id = b.bin_id and b.export = 'FALSE'
) a
group by "range"


purity


select count("range"), "range" from
(
  select
    case
      when a.purity > 0 and a.purity <= 0.1 then '0 - 0.1'
      when a.purity > 0.1 and a.purity <= 0.5 then '0.1 - 0.5'
      when a.purity > 0.5 and a.purity <= 0.9 then '0.5 - 0.9'
      when a.purity > 0.9 and a.purity <= 1.3 then '0.9 - 1.3'
      when a.purity > 1.3 and a.purity <= 1.7 then '1.3 - 1.7'
      when a.purity > 1.7 and a.purity <= 2.1 then '1.7 - 2.1'
      when a.purity > 2.1 and a.purity <= 2.5 then '2.1 - 2.5'
      when a.purity > 2.5 then '> 2.5'
    end AS "range"
  from spectra a, bin b
  where a.bin_id = b.bin_id and b.export = 'TRUE'
) a
group by "range"


signal noise


Tuesday, September 28, 2010

BinBase - reset database

simple sql script to reset the binbase database


delete from bin where bin_id not in (select bin_id from standard)
GO
delete from spectra where bin_id not in (select bin_id from standard)
GO
delete from correction_data
GO
delete from samples where sample_id not in (select sample_id from bin)
GO
delete from rawdata
GO
delete from result_link
GO
delete from result
GO
delete from runtime
GO
delete from quantification
GO

Monday, September 27, 2010

BinBase - simulate the sifter algorithm directly on the database

currently I'm generating statistics for the BinBase system and so needed to simulate the annotation algorithm in sql.

It was rather simple....



-- 0.05 settings
select
count(*) as "count",
lower('0 < 0.05') as purity, lower('> 500') as signalNoise,
lower('> 800') as similarity

from spectra where purity > 0 and purity < 0.05 and signal_noise > 500 and match > 800

UNION

select
count(*) as "count",
lower('0 < 0.05') as purity, lower('> 50 && < 500') as signalNoise, lower('> 700') as similarity

from spectra where purity > 0 and purity < 0.05 and signal_noise > 50 and signal_noise <= 500 and match > 700


UNION

select
count(*) as "count",
lower('0 < 0.05') as purity, lower('> 5 && < 50') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 0.05 and signal_noise > 5 and signal_noise <= 50 and match > 600

-- 0.3 settings

UNION

select
count(*) as "count",
lower('0 < 0.3') as purity, lower('> 500') as signalNoise,
lower('> 700') as similarity

from spectra where purity > 0 and purity < 0.3 and signal_noise > 500 and match > 700

UNION

select
count(*) as "count",
lower('0 < 0.3') as purity, lower('> 50 && < 500') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 0.3 and signal_noise > 50 and signal_noise <= 500 and match > 600

UNION

select
count(*) as "count",
lower('0 < 0.3') as purity, lower('> 5 && < 50') as signalNoise, lower('> 500') as similarity

from spectra where purity > 0 and purity < 0.3 and signal_noise > 5 and signal_noise <= 50 and match > 500

-- all other settings

UNION

select
count(*) as "count",
lower('> 0.3') as purity,
lower('> 500') as signalNoise,
lower('> 600') as similarity

from spectra where purity > 0.3 and signal_noise > 500 and match > 600

UNION

select
count(*) as "count",
lower('> 0.3') as purity,
lower('> 50 && < 500') as signalNoise, lower('> 500') as similarity

from spectra where purity > 0.3 and signal_noise > 50 and signal_noise <= 500 and match > 500

UNION

select
count(*) as "count",
lower('> 0.3') as purity,
lower('> 5 && < 50') as signalNoise, lower('> 450') as similarity

from spectra where purity > 0.3 and signal_noise > 5 and signal_noise <= 50 and match > 450

-- total --

union

select
count(*) as "count",
lower('no filter') as purity,
lower('no filter') as signalNoise,
lower('no filter') as similarity

from spectra



leco 2.22 settings


-- 0.1 settings
select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 250') as signalNoise,
lower('> 800') as similarity

from spectra where purity > 0 and purity < 0.1 and signal_noise > 250 and match > 800

UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 25 && < 250') as signalNoise, lower('> 700') as similarity

from spectra where purity > 0 and purity < 0.1 and signal_noise > 25 and signal_noise <= 250 and match > 700


UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 3 && < 25') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 0.1 and signal_noise > 3 and signal_noise <= 25 and match > 600

-- 1.5 settings

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 250') as signalNoise,
lower('> 700') as similarity

from spectra where purity > 0 and purity < 1.5 and signal_noise > 250 and match > 700

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 25 && < 250') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 1.5 and signal_noise > 25 and signal_noise <= 250 and match > 600

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 3 && < 25') as signalNoise, lower('> 500') as similarity

from spectra where purity > 0 and purity < 1.5 and signal_noise > 3 and signal_noise <= 25 and match > 500

-- all other settings

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 250') as signalNoise,
lower('> 600') as similarity

from spectra where purity > 1.5 and signal_noise > 250 and match > 600

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 3 && < 250') as signalNoise, lower('> 500') as similarity

from spectra where purity > 1.5 and signal_noise > 3 and signal_noise <= 250 and match > 500

UNION

-- total --

select
count(*) as "count",
lower('no filter') as purity,
lower('no filter') as signalNoise,
lower('no filter') as similarity

from spectra


and as a query to filter polysiloxanes out of the result for leco 2.22




-- 0.1 settings
select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 250') as signalNoise,
lower('> 800') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 0.1 and a.signal_noise > 250 and match > 800 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 25 && < 250') as signalNoise, lower('> 700') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 0.1 and a.signal_noise > 25 and a.signal_noise <= 250 and match > 700 and a.bin_id = b.bin_id and b.export = 'TRUE'


UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 3 && < 25') as signalNoise, lower('> 600') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 0.1 and a.signal_noise > 3 and a.signal_noise <= 25 and match > 600 and a.bin_id = b.bin_id and b.export = 'TRUE'

-- 1.5 settings

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 250') as signalNoise,
lower('> 700') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 1.5 and a.signal_noise > 250 and match > 700 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 25 && < 250') as signalNoise, lower('> 600') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 1.5 and a.signal_noise > 25 and a.signal_noise <= 250 and match > 600 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 3 && < 25') as signalNoise, lower('> 500') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 1.5 and a.signal_noise > 3 and a.signal_noise <= 25 and match > 500 and a.bin_id = b.bin_id and b.export = 'TRUE'

-- all other settings

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 250') as signalNoise,
lower('> 600') as similarity

from spectra a, bin b where a.purity > 1.5 and a.signal_noise > 250 and match > 600 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 3 && < 250') as signalNoise, lower('> 500') as similarity

from spectra a, bin b where a.purity > 1.5 and a.signal_noise > 3 and a.signal_noise <= 250 and match > 500 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

-- total --

select
count(*) as "count",
lower('no filter') as purity,
lower('no filter') as signalNoise,
lower('no filter') as similarity

from spectra a, bin b where a.bin_id = b.bin_id and b.export = 'TRUE'

Wednesday, September 22, 2010

BinBase queries

short overview of binbase queries to generate statistics



count of spectra

select count(*) from spectra a, samples b where a.sample_id = b.sample_id and visible = 'TRUE'

count of spectra with no annotations

select count(*) from spectra a, samples b where a.sample_id = b.sample_id and visible = 'TRUE' and bin_id is null

count of samples

select count(*) from samples where visible = 'TRUE'

known species

select distinct value from sample_info a, meta_key b where a.key_id = b.key_id and b.key = 'Species'

count of species

select count(distinct value) from sample_info a, meta_key b where a.key_id = b.key_id and b.key = 'Species'

sample count for species

select count(a.sample_id),a.value from sample_info a, meta_key b,samples c where a.key_id = b.key_id and b.key = 'Species' and a.sample_id = c.sample_id and c.visible = 'TRUE' group by a.value

bin generation by species

select a.value,count(d.spectra_id) from sample_info a, meta_key b,samples c,bin d where a.key_id = b.key_id and b.key = 'Species' and a.sample_id = c.sample_id and d.sample_id = c.sample_id group by a.value

ri correction successful

select count(*) from samples where visible = 'TRUE' and correction_failed = 'FALSE'

named bins

select count(bin_id) from bin where name = TO_CHAR(bin_id,'FM9999999999')

distribution of similarities


select count("range"), "range" from (
  select
    case
      when match >= 0 and match <= 100 then '0 - 100'
      when match > 100 and match <= 200 then '101 - 200'
      when match > 200 and match <= 300 then '201 - 300'
      when match > 300 and match <= 400 then '301 - 400'
      when match > 400 and match <= 500 then '401 - 500'
      when match > 500 and match <= 600 then '501 - 600'
      when match > 600 and match <= 700 then '601 - 700'
      when match > 700 and match <= 800 then '701 - 800'
      when match > 800 and match <= 900 then '801 - 900'
      when match > 900 and match <= 1000 then '901 - 1000'
    end AS "range"
  from spectra
) a group by "range"

Wednesday, September 8, 2010

grails - provide a dynamically generated file as download

today I had the rather simple request

generate a file on the fly and provide it as download.

Well the easiest solution was to just create a closure in a controller and set the content type in there.



def download = {

File file = new File(params.file)
response.setHeader "Content-disposition", "attachment; filename=${file.name}.txt"
response.contentType = 'text/plain'
response.outputStream << file.text
response.outputStream.flush()
}

Wednesday, August 25, 2010

dynamic fields with groovy

one of the features I love is to just define new fields in objects whenever I want.

like:



def test = ["a","b","c"]

Object a = new Object()

test.each{

a.metaClass."${it}" = it
}



generates the following fields in the object a

a.a = "a"
a.b = "b"
a.c = "c"

works incredibly well with the grails export plugin :)

Friday, July 23, 2010

Bash - One line based For Loop

today I needed to start 40 jobs in a row and the easiest way I could think of was a one line bash script to do this for me


for i in `seq 40`; do ./node.sh; done


tada...

Tuesday, July 13, 2010

finding all external reference names in pubchem sdf files

currently I'm working on a simple way to find all external reference names in the pubchem sdf files, which is rather trivial, but time consuming:



cat *.sdf | grep PUBCHEM_EXT_DATASOURCE_NAME -A 1 | grep -v PUBCHEM_EXT_DATASOURCE_NAME | grep -v '\-\-' | sort | uniq



once this finishes it should give us a list of all possible data sources from pubchem as a unique list of names.


...
Ambinter
Burnham Center for Chemical Genomics
Calbiochem
CC_PMLSC
ChEBI
ChemSpider
DiscoveryGate
Emory University Molecular Libraries Screening Center
InFarmatik
KUMGM
LipidMAPS
MICAD
MLSMR
MMDB
MTDP
Nature Chemical Biology
NCGC
NIAID
NMMLSC
NMRShiftDB
ORST SMALL MOLECULE SCREENING CENTER
PCMD
ProbeDB
Prous Science Drugs of the Future
R&D Chemicals
Sigma-Aldrich
Specs
SRMLSC
Structural Genomics Consortium
The Scripps Research Institute Molecular Screening Center
Thomson Pharma
UM-BBD
UPCMLD
...

Monday, July 5, 2010

ubuntu - stuck in graphical mode without an editor

well for some reason I thought it was smart to install the gdm package on a freshly installed ubuntu linux server in a VMware console.

20 Seconds later it turned out that I had

- no terminal application
- could not switch to a virtual terminal, vmware kept ignoring the commands
- had ssh disabled

In short, I was stuck. Till I remembered I could write a simple script to just install xterm and execute it. But wait, I needed to enter parameters and passwords after the prompt.

Fast forward

#! /bin/bash
echo 'password' | sudo -S apt-get --yes install xterm > log.txt

as a script in a simple text file, which was set to be executable, gave me all I needed to install my xterm and be happy again.

Tuesday, June 22, 2010

making a jmx depend on an ear

this morning I discovered that my JBoss application server sometimes deploys my jmx beans before the ear they depend on. The fix is, as always, easy: just define the dependency in your jboss-service.xml




<server>
<mbean
code="edu.ucdavis.genomics.metabolomics.binbase.bci.server.jmx.LifeTimeEvictionPolicyConfigJMX"
name="binbase.cache:service=LifeTimeEvictionPolicy">
<depends>jboss.j2ee:ear=clusterservice.ear</depends>
</mbean>
</server>

Thursday, June 17, 2010

Getting the hostname for the local machine

there are always simple things I get stuck with, sometimes simple things like just getting the hostname for a local ip address in java.

solutions:



InetAddress addr = InetAddress.getLocalHost();
String hostName = addr.getCanonicalHostName();
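
and if you need the hostname for some other ip address rather than the local machine, the same api works; a small sketch (the address is just an example):

InetAddress addr = InetAddress.getByName("10.1.1.1");
String hostName = addr.getCanonicalHostName();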

Wednesday, June 16, 2010

sure enough, shortly after the iPad with the unlimited data plan got released, they changed the plan to

2GB A month

what a joke...

Data per month Price per month
250MB/month $14.99
2GB/month $25.00

now I'm glad that I cancelled my pre-order for one. What does 2GB get me these days? Nothing...

Thursday, May 13, 2010

renaming all files in a directory

some of the simple things in life




for x in *.php5; do n=${x/.php5/.php}; mv $x $n; done



taken from here

http://www.debianadmin.com/rename-multiple-files-to-another-extension-in-linux.html

Friday, May 7, 2010

moving indexes in postgres to a new tablespace

Recently I upgraded my server and added an SSD raid system of 3 128GB disks, and for obvious reasons I want to have my indexes on it. So the first step was to create a new tablespace


create tablespace tablespace_name owner owner_name location '/path/to/directory'


and then creating a quick script to move all the indexes


select 'ALTER INDEX '||indexname||' SET TABLESPACE compoundindex ;' from pg_catalog.pg_indexes where schemaname = 'public' order by tablename;


whose output then just needs to be executed

Tuesday, May 4, 2010

using perl to replace all occurrences of a word with another in large files

rather simple, we want to replace all occurrences of the word 'volatile' with 'volatile2' in a large text file and so just execute


perl -i -p -e 's{volatile}{volatile2}g' /mnt/storage/data/export


I can't believe I'm using perl again...

flushing BinBase

once in a while you think you want to reprocess all data in a binbase database and for this reason you want to flush all the content except the retention index standards. This should never be required, but if you decided to regenerate all the bins with different settings, well you gotta do what you gotta do.

here is a quick sql script to do this for you



delete from comments

go

delete from sample_info

go

delete from meta_key

go

delete from metainformation

go

delete from runtime

go

delete from bin_compare

go

delete from virtual_bin

go

delete from bin a where a.bin_id not in (select bin_id from standard)

go

delete from samples a where a.sample_id not in (select sample_id from bin)

go

delete from reference_class

go

delete from reference

go

delete from result_link

go

delete from bin_ratio

go

delete from bin_references

go

delete from classification

go

delete from configuration where configuration_id not in (select configuration_id from samples)

go

delete from spectra where sample_id not in (select sample_id from samples)

go

delete from qualitycontrol

go

delete from result

go

delete from standard_hist

go

delete from structure

go

delete from substance_classes

go

delete from synonyme

go

delete from user_validation

go


Wednesday, March 24, 2010

griffon

The last couple of days I spent developing a quick application in griffon and have to say this is a rather impressive framework. Especially since it's a young framework with a version of 0.3

what I like?

  • using the grails approach
  • very simple to use
  • saves a lot of time in the initial setup of the project
  • clear separation into controller/model/views
what I'm missing?
  • better support in IDE's
  • no SWT support
  • no gorm (it will come sooner or later)
  • documentation, with the current documentation i can't use it for production code

Thursday, March 11, 2010

i just love groovy for tasks like this...

sometimes you have these annoying little tasks on your hands and I always forget the bash syntax and it’s been a while since I seriously played with python.

So I thought, hey let’s try groovysh.

task?

  • read wordlist
  • convert to lowercase
  • remove parts of words
solution

new File("en_US.dic").text.split("\n").each{ new File("result.txt").append "${it.split("/")[0].toLowerCase()}\n" }

ok after a bit more coding I actually wrote a groovy script

  • read file
  • read file with words we do not want in word list
  • convert all values to lowercase
  • remove duplicated values
  • save to output file
really nothing fancy, but get’s the job done and took like 3 minutes to write and won’t be used again.

Set<String> cache = new HashSet<String>()
Set<String> result = new HashSet<String>()

new File("whitelist.txt").text.split("\n").each{ it.split("\t").eachWithIndex{

        String s, int index ->

                if(index > 0){
                        cache.add(s.toLowerCase())
                }
        }
}

new File("blacklist.txt").text.split("\n").each{
        if(cache.contains(it.toLowerCase()) == false){
                result.add(it.toLowerCase())
        }
}

File out = new File("wordlist.txt")

result.each{
        out.append it
        out.append "\n"
}


Wednesday, March 10, 2010

calculating exact molar masses using the cdk 1.3.1

I needed a simple way to calculate the exact mass for a couple of million compounds and so decided to give it another try with the CDK.

After googling a bit I found something that put me in the right direction and worked with the current version of the cdk, so it was rather simple

import org.openscience.cdk.DefaultChemObjectBuilder
import org.openscience.cdk.interfaces.IMolecularFormula
import org.openscience.cdk.tools.manipulator.MolecularFormulaManipulator

class ConvertInchi {

/**
* converts an inchi code to a molar mass
* @param inchi
* @return
*/
public static double convertInchiToMolareMass(String inchi) {

IMolecularFormula moleculeFormula = MolecularFormulaManipulator.getMolecularFormula(convertInchiToMolecularFormula(inchi),DefaultChemObjectBuilder.getInstance())

return MolecularFormulaManipulator.getTotalExactMass(moleculeFormula)

}

/**
*
* converts an inchi to a molecular formula
* @param
inchi
* @return
*/
public static String convertInchiToMolecularFormula(String inchi) {
return inchi.split("/")[1]

}
}
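
a quick usage sketch, here with the standard InChI for ethanol (C2H6O), which should come out at roughly 46.04:

def mass = ConvertInchi.convertInchiToMolareMass("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3")
println mass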

Friday, March 5, 2010

Postgres - partial indexes and sort indexes

this morning was a sad moment. Normally people ask me all the time about SQL optimization or SQL tuning or when to create an index and so on.

This morning i had another DOOOH moment.

Basically while trying to optimize BinBase even further I noticed that some queries take an absurd amount of time to finish. Like 500 seconds to query the bin table.

So I started to investigate and discovered that some of my queries are using seq scans over huge tables. For no apparent reason.

It turned out that the sort column was not indexed...


CREATE INDEX bin_retention_index ON bin (retention_index ASC NULLS LAST);


which was greatly improved after executing this statement.

Now there was still a huge wait time for a certain query


explain select * from BIN where bin_id not in ( SELECT bin_id from SPECTRA where sample_id = 1630733 and bin_id is not null )ORDER BY retention_index ASC


and it turned out that indexes were never used for the 'is not null' part. After some research and a lot of head scratching it turned out that postgres supports partial indexes for exactly this case.


create index spectra_bin_id_partial_is_not_null on spectra (bin_id) where bin_id is not null


Now afterwards we saw some improvement, but the actual slow data access is caused by something else. We always need the complete table minus a few compounds, and this operation takes some time.

Time to optimize the internal database lookup...

Thursday, March 4, 2010

rocks linux cluster - adding a new parallel environment

by default rocks ships with a couple of environments, which execute stuff on different nodes. But sometimes you just want to have a node all to yourself and take over all its slots.

To do this you can just create a new environment which gives you a defined number of cpus for a specified job.


  1. create a file which describes the parallel environment like this

     pe_name threaded
     slots 999
     user_lists NONE
     xuser_lists NONE
     start_proc_args /bin/true
     stop_proc_args /bin/true
     allocation_rule $pe_slots
     control_slaves FALSE
     job_is_first_task TRUE
     urgency_slots min
     accounting_summary FALSE

  2. register this on the head node

     qconf -Ap file.txt

  3. add it to the list of available environments

     qconf -mq all.q
     pe_list make mpich mpi orte threaded

  4. test it with qlogin

     qlogin -pe threaded 4

Wednesday, March 3, 2010

SetupX - tuning and the missing indexes

after analyzing a couple of hundred sql statements in setupx I noticed that there is no real use of indexes for some reason. Why there are no indexes escapes my mind, but since we prefer higher query speed I suggest the creation of the following indexes (which are far from perfect)

create index pubdata_relatedExperimentID_index on pubdata(relatedExperimentID)
create index NCBIClassifierID_ncbiID_index on NCBIClassifierID(ncbiID)
create index formobject_question_index on formobject(question)
create index formobject_discriminator_index on formobject(discriminator)
create index formobject_value_index on formobject(value(100))
create index query_querytype_index on query(querytype)
create index query_userid_index on query(userID)
create index datafile_source_index on datafile (source)
create index cache_experiment_id_index on cache (experimentID)
create index user_password_index on user (passwd)
create index user_username_index on user (username)
create index BlockedIP_blockedIp_index on BlockedIP(blockedIP)

this should improve the performance nicely and is currently being applied to our production system.

SetupX - rawdata access

ever had the desire to access the rawdata in SetupX, because the gui doesn't give you easy access, or any access at all, to it?

First you need to understand what's happening in the background. The approach used for setupx looks like a mapping from a tree to a table structure. Which is a valid approach and keeps it very flexible in theory, but is not a real practical solution. Especially since it's a real pain to write queries for.



Get all species:

select distinct value from formobject where question = 'species' and value != ''

Get all organs:

select distinct lower(value) from formobject where question in ( 'organ','organ name','Organs','Organ specification' ) and value != ''

we need to use lower, because people have a 'strange' way of spelling things


work in progress!

Tuesday, March 2, 2010

jboss 4.2.1GA and Java6 and jboss based webservice

thanks to this report I was able to fix this issue. Well it's not a fix, it's a workaround...

Ariela Hui - 13/Aug/08 04:23 PM
I was able to solve this problem. This is what environment I have:
winXP
JDK 1.6.0
JBoss 4.2.1

In the [jboss_home]/lib/endorsed add:
jaxb-api.jar
jboss-jaxrpc.jar
jboss-jaxws.jar
jboss-saaj.jar

copy them from [jboss_home]/server/default/lib

I've also added to endorsed jboss-logging-spi.jar
This I copied from jboss-5.0.0 client folder. 

Monday, March 1, 2010

scala/groovy on rocks linux

well since there is no scala/groovy roll for rocks we need to install it the traditional way.

  • go into the directory /share/apps on the frontend
  • if apps doesn't exist create it
  • copy your scala/groovy tgz there
  • gunzip and untar it
  • edit your extend-compute.xml as shown here
  • add a new file modification section like this


<file name="/etc/profile" mode="append">

GROOVY_HOME=/share/apps/groovy
SCALA_HOME=/share/apps/scala

export GROOVY_HOME
export SCALA_HOME

PATH=$GROOVY_HOME/bin:$PATH
PATH=$SCALA_HOME/bin:$PATH

export PATH

</file>


  • rebuild your dist as shown here
  • reinstall you nodes as shown here

Friday, February 26, 2010

backup! we got REG Expressions!

recently I got handed an interesting task
please extract all words out of a text...
ok simple enough, we know regular expressions and we know word boundaries.

So we just do

\b\w+\b


and this simple expression applied to the following sentence
Glucose (Glc), a monosaccharide (or simple sugar) also...
gives us the following list
  • Glucose
  • Glc
  • a
  • monosaccharide
  • or simple
  • sugar
  • also
Sadly the sentence didn't stop at this and continued to include the following tricky statement...
...including as much as possible chemical names
Ok, time to read up on reg expressions in groovy and java
Now let's discover a regular expression which helps us with this.

Ok let's try this again with a different sentence
...as glucose, only one of which (D-glucose) is biologically...
our first expression would miss
  • D-glucose
and return for this
  • D
  • glucose
so we need to modify it a bit to include the first separation. So it becomes

\b(\w\-)*\w+\b

and the day is saved till we try a new sentence and try to discover compounds like

  1. 1,3-Diaminopropane
  2. N-(3S-hydroxydecanoyl)-L-serine
  3. 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane
  4. 2-(allyloxy)-1,3,5-trimethylbenzene
  5. 3-hydroxy-2-butanone
  6. 3,3'-Oxybis(1-propene)
  7. 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane
  8. 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
  9. 1,6,9,13-tetraoxadispiro[4.2.4.2]tetradecane
  10. 3-(N, N-Diethylamino)-1-propylamine
  11. D-Glucose
  12. (R)-3-Hydroxybutyric acid
Or to make it more realistic, find all the words in this completely pointless and scientifically wrong text
 bunch of rumble to find 1,3-Diaminopropane in D-Glucose and 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane.
 It's also nice to have 3,3'-Oxybis(1-propene) or (R)-3-Hydroxybutyric acid. Last but not least I'm a huge fan of 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane.
 Also it's a great feeling if we can find (glucose) in brackets without finding statements like (help i'm surrounded by brackets).

So do you see a pattern here?
  • everything in () or [] or {} can be part of a chemical so we use ((\[.*\])|(\(.*\))) for this part 
  • everything separated by a ',' and followed by another character ending with a dash can be a chemical, so we use (\w+(,\w+)*\-) for this part
  • it ends all with a word \w+ or a ) (masked as \) )
so this expression would work for all these

(\([\w\+]+(,\w+)*\)-)?\b[(\w+(,\w+[\'])*\-)*((\[.*\])|(\(.*\))|(\{.*\}))*\w+]+(\b)( (acid)|(anhydride)|(\sbenzoate)|(\sketone)|(\sether)|(\sester)|(\scyanide))?


except
  1. 3-(N, N-Diethylamino)-1-propylamine
  2. glucose instead we get 'glucose)'
no solution for 1 or 2 yet. Still trying to figure it out.

now the nicest thing is the groovy closure + match example to get all the content in a text.


def match = (text =~ pattern)


congrats, now all your words are in the match variable! text is just a string containing our text.
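
for completeness, a tiny sketch of pulling the matched words out of the matcher, using the earlier sentence and the simple dash-aware expression from above:

def text = "as glucose, only one of which (D-glucose) is biologically"
def pattern = /\b(\w\-)*\w+\b/

// each match is a list because the pattern contains a group, element 0 is the full match
def words = (text =~ pattern).collect { it[0] }
println words
// [as, glucose, only, one, of, which, D-glucose, is, biologically]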

Wednesday, February 24, 2010

rocks linux cluster - mounting an nfs share on all nodes

after setting up the latest cluster I tried to provide a couple of nfs shares to all nodes, since users demanded this.

Well in rocks linux it's rather simple, once you understand the concept behind it.

So, a step by step tutorial.

  • go to the profile directory
  • cd /export/rocks/install/site-profiles/5.3/nodes/
  • make a copy of the skeleton file
  • cp skeleton.xml extend-compute.xml
  • edit file to tell it that we need to create a directory and add a line to the fstab. The right place for this is in the post section


    mkdir -p /mnt/share

    <file name="/etc/fstab" mode="append">
    server:/mount /mnt/share nfs defaults 0 0
    </file>

  • change back to the main install dir
  • cd /export/rocks/install
  • rebuild the rocks distribution
  • rocks create distro
  • rebuild nodes
  • ssh compute-0-0 '/boot/kickstart/cluster-kickstart'

congratulations, if you did everything right your node should now boot up and have the directory mounted.

playing around with threads

currently I got back to my hobby and am playing a bit with multithreading to tune an algorithm.

So the first step was to write a simple class to test if the threading api works on my system, and what is better than calculating primes?



import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Test {

public static void main(String args[]) throws InterruptedException {

ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (int i = 0; i < 500000000; i++) {
final int postion = i;
service.execute(new Runnable() {
@Override
public void run() {
try {
int i = postion;
int i1 = (int) Math.ceil(Math.sqrt(i));
boolean isPrimeNumber = false;
while (i1 > 1) {

if ((i != i1) && (i % i1 == 0)) {
isPrimeNumber = false;
break;
}
else if (!isPrimeNumber) {
isPrimeNumber = true;
}

--i1;
}

if (isPrimeNumber) {
System.out.println(Thread.currentThread().getName() + " - prime " + i);
}

}
catch (Exception e) {
e.printStackTrace();
}
}
});
}
service.shutdown();
service.awaitTermination(4000, TimeUnit.DAYS);
}
}




and the output is nicely:

pool-1-thread-2 - prime 469237
pool-1-thread-2 - prime 469241
pool-1-thread-2 - prime 469253
pool-1-thread-4 - prime 466553
pool-1-thread-4 - prime 469267
pool-1-thread-4 - prime 469279
pool-1-thread-2 - prime 469283
pool-1-thread-3 - prime 467869
pool-1-thread-3 - prime 469303
pool-1-thread-2 - prime 469321

while all 4 CPU's are at 100% use.

translation, java executor api seems to work quite well.

Now time to tune the binbase algorithm...

Saturday, February 20, 2010

postgres and insert performance with grails

recently I spent a lot of time tuning a postgres database and an algorithm (based on grails) to try to insert 80M chemical compounds into a database with reference checks and assurance of their uniqueness.


The main issue is that the inserts get slower and slower over time and end up taking too long. To improve the performance we started with the following approach:
  • more memory
  • database indexes (obvious)
  • flush the hibernate session every 100 inserts
  • flush the grails cache every 100 inserts
  • tune several postgres parameters
  • re-index the database every 10k inserts
  • enable vacuum on a 10k row basis
  • enable analyze on a 10k row basis
  • set auto vacuum to check every 10 minutes
But after all this work we are still stuck with a problem, the inserts get gradually slower over time, which can be related to the index building time.

Graph over 1M inserts: TODO, it's still in the calculation...
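
for reference, the session and cache flushing from the list above looks roughly like this; a sketch only, assuming an injected sessionFactory and a made-up Compound domain class (the property instance map cleanup is the usual grails 1.x batching idiom):

class CompoundImportService {

    def sessionFactory

    void importAll(List rows) {
        def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP

        rows.eachWithIndex { row, i ->
            new Compound(name: row.name).save()

            if (i % 100 == 0) {
                // push the pending inserts out and evict the first level cache
                sessionFactory.currentSession.flush()
                sessionFactory.currentSession.clear()
                // clear grails' per domain class property instance map as well
                propertyInstanceMap.get().clear()
            }
        }
    }
}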

Tuesday, February 9, 2010

Postgres and utf8

Recently I discovered that my postgres database is not running with UTF-8 support, which causes all kinds of headaches,

so what do you have to do to change this?

  • backup your database
  • set Lang to: LANG = en_US.UTF-8
  • execute initdb
  • recreate the database and specify the encoding: createdb --encoding=unicode
  • import your data

you should now have a working utf-8 database

Wednesday, February 3, 2010

does it have to be this way? (muss das sein?)

[INFO] [compiler:compile]
[INFO] Compiling 399 source files to /Users/wohlgemuth/Documents/workspace-private/oscar3-chem/branches/cdk-1.3.1/target/classes
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Compilation failure
Failure executing javac, but could not parse the error:
An exception has occurred in the compiler (1.5.0_20). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report.  Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file org/openscience/cdk/annotations/TestClass.class not found



Failure executing javac, but could not parse the error:
An exception has occurred in the compiler (1.5.0_20). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report.  Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file org/openscience/cdk/annotations/TestClass.class not found

I mean come on, do I really have to deal with compiler bugs now...

correction


never mind, forgot the cdk annotations library....



cdk-maven-mojos

recently I'm doing an awful lot with the CDK library and since I always 'like' to work with maven I thought it's time to write a couple of mojos to help me with the CDK work.

The first one on the list is a mojo which deploys the cdk library to my local maven repository and can be found under google code.

There are most likely more mojo's to come as I work more and more with the cdk.

Tuesday, February 2, 2010

regular expression for common chemical identifiers

This is basically a small collection of regular expressions which I use from time to time to distinguish chemical identifiers. Please feel free to add more to the list to make it grow and become more complete.


The first line is the name, the second is the valid groovy/java version. All are validated with at least a thousand examples.

std inchi 

InChI=1S/([^/]+)(?:/[^/]+)*\\S

std inchiKey 

[A-Z]{14}-[A-Z]{10}-[A-Z,0-9]


CAS

\\d{1,7}-\\d\\d-\\d

KEGG

C\d{5}

LipidMaps

LMFA[0-9]{8}

HMDB

HMDB[0-9]*
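
and a quick groovy sketch of how I use these, here with the inchikey and cas patterns (the aspirin identifiers below are just example strings):

def inchiKeyPattern = /[A-Z]{14}-[A-Z]{10}-[A-Z,0-9]/
def casPattern = /\d{1,7}-\d\d-\d/

assert "BSYNRYMUTXBXSQ-UHFFFAOYSA-N" ==~ inchiKeyPattern
assert "50-78-2" ==~ casPattern
assert !("not an identifier" ==~ casPattern)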

Sunday, January 24, 2010

how to handle 400 billion rows in postgres

well currently I'm trying to optimize a denoising and replacement algorithm. This includes calculating ion traces over thousands of files which have 20k peaks or 1 million ions each.
To make this more challenging we don't do this once, we do it during every BinBase export, since this is the base algorithm for the zero replacement.

In short, we do the same calculation over and over again, like calculating the minimum intensity for 100 files between 15 and 20 seconds for ion 87.

In short, this shouts: use a SQL database.

Now how much data do we have?

  • 40k files
  • each file has around 20k spectra
  • each spectrum has up to 500 ions

which translates into

400 000 000 000 rows

Performance Issues and how to make Postgres handle this amount of data

The first attempt was to just store all these files into a single table using hibernate. Which caused an OutOfMemory exception, surprise surprise.

So the next attempt was to use a stateless session, which worked quite well, and up to 1 billion rows we had a query speed of under 20ms. Quite nice for no tuning of the database. But once we hit a bit over 1 billion rows the query speed got worse and worse. In short, this was not an acceptable solution.

So we tried to use partitions with postgres and define one table for each ion, inheriting from a master table.


Table "public.iontrace"
Column | Type | Modifiers
------------+------------------+-----------
id | bigint | not null
intensity | double precision |
ion | integer |
spectra_id | bigint |
Indexes:
"iontrace_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"fk4f93923dc3b62c9a" FOREIGN KEY (spectra_id) REFERENCES spectra(id)
Rules:
rule_index_iontrace_100 AS
ON INSERT TO iontrace
WHERE new.ion = 100 DO INSTEAD INSERT INTO iontrace_100 (id, intensity, ion, spectra_id)
VALUES (new.id, new.intensity, new.ion, new.spectra_id)
rule_index_iontrace_101 AS
ON INSERT TO iontrace
WHERE new.ion = 101 DO INSTEAD INSERT INTO iontrace_101 (id, intensity, ion, spectra_id)
VALUES (new.id, new.intensity, new.ion, new.spectra_id)
rule_index_iontrace_102 AS
ON INSERT TO iontrace
WHERE new.ion = 102 DO INSTEAD INSERT INTO iontrace_102 (id, intensity, ion, spectra_id)



which didn't work with hibernate. It kept complaining about


org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1


so we had to write our own batcher as was described here

a bit later we rewrote the example to actually work, since the field batchSize is set to private.


/**
* hibernate partion batcher
*/
public class HibernatePartionBatcher extends BatchingBatcher {

public HibernatePartionBatcher(org.hibernate.jdbc.ConnectionManager connectionManager, org.hibernate.Interceptor interceptor) {
super(connectionManager, interceptor)
}


@Override
protected void doExecuteBatch(PreparedStatement ps) throws SQLException, HibernateException {

//use reflection to access the private field of the super class, ugly but necessary
Field field = this.getClass().getSuperclass().getDeclaredField("batchSize")
field.setAccessible(true)

int value = field.getInt(this)

if (value != 0) {
try {
ps.executeBatch()
}
catch (RuntimeException re) {
throw re;
}
finally {
field.setInt(this, 0)
}
}
}
}

/**
* hibernate factory to create our partition batcher
*/
public class HibernatePartitionBatcherFactory extends BatchingBatcherFactory {

public HibernatePartitionBatcherFactory() {
}

public org.hibernate.jdbc.Batcher createBatcher(org.hibernate.jdbc.ConnectionManager connectionManager, org.hibernate.Interceptor interceptor) {
return new HibernatePartionBatcher(connectionManager, interceptor);
}
}



and to register this in our groovy script


def hibProps = [
"hibernate.jdbc.factory_class": HibernatePartitionBatcherFactory.class.getName()
]



A first test with just a single sample reveals that we are executing a sequential scan over all partitions.


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Result (cost=0.00..1163.25 rows=3374 width=28)
-> Append (cost=0.00..1163.25 rows=3374 width=28)
-> Seq Scan on iontrace (cost=0.00..10.46 rows=7 width=28)
Filter: (ion = 105)
-> Index Scan using index_iontrace_20 on iontrace_20 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_21 on iontrace_21 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_22 on iontrace_22 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_23 on iontrace_23 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_24 on iontrace_24 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_25 on iontrace_25 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)



to avoid this you have to change a parameter in postgresql.


SET constraint_exclusion = on;


and the new result shows that we only work on the correct partition now


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Result (cost=0.00..13.03 rows=14 width=28)
-> Append (cost=0.00..13.03 rows=14 width=28)
-> Seq Scan on iontrace (cost=0.00..10.46 rows=7 width=28)
Filter: (ion = 105)
-> Index Scan using index_iontrace_105 on iontrace_105 iontrace (cost=0.00..2.57 rows=7 width=28)
Index Cond: (ion = 105)



it still executes a sequential scan, but now it only does it on the correct partition with the stored data.

Now to reduce the likelihood of a sequential scan over the database we will define indexes on all the tables (the script at the end of this post does this for you)

the new query plan is


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Result (cost=0.00..4.15 rows=14 width=28)
-> Append (cost=0.00..4.15 rows=14 width=28)
-> Index Scan using index_iontrace_ion on iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_105 on iontrace_105 iontrace (cost=0.00..2.57 rows=7 width=28)
Index Cond: (ion = 105)


The result is slightly different since by now we have generated more data in the database. Sorry for that. But you can see that the query plan is much more efficient with the index.

It is important that you have an index on ion for every partition and on the main 'iontrace' table; if you don't have an index on the main table you will keep getting sequential scans.

The next snag we encountered was the speed of the actual data insert, since the check constraints seem to be rather slow and expensive.

First this is a hibernate insert on the main datatable:


LOG: duration: 8.961 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]
LOG: duration: 8.951 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]
LOG: duration: 8.966 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]


Second this is plain old sql inserted into the main iontrace table:


LOG: duration: 6.608 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1418.0,237,700013)]
LOG: duration: 6.610 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1373.0,238,700013)]
LOG: duration: 6.605 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1383.0,239,700013)]


Third this is plain old sql inserted in each partition:


LOG: duration: 0.055 ms statement: EXECUTE [PREPARE: insert into iontrace_475(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),27.0,475,680203)]
LOG: duration: 0.058 ms statement: EXECUTE [PREPARE: insert into iontrace_489(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),13.0,489,680203)]
LOG: duration: 0.051 ms statement: EXECUTE [PREPARE: insert into iontrace_496(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),13.0,496,680203)]
LOG: duration: 0.138 ms statement: EXECUTE [PREPARE: insert into Spectra (file_id, retentionTime, id) values ($1, $2, $3)]


From this we can see that it makes the most sense to use plain SQL to insert the data directly into the related partitions, so the check constraints are avoided during the insert phase.

For this we used the following approach:


//build the ion trace
Statement statement = session.connection().createStatement()

for (int mass = 0; mass < mz.length; mass++) {
    if (mass >= beginMass && mass <= endMass) {
        if (intensity[mass] > 0) {
            IonTrace trace = new IonTrace(ion: mz[mass], intensity: intensity[mass])
            trace.spectra = spectra

            //write the row straight into the partition for this ion, bypassing the check constraints
            statement.addBatch("insert into iontrace_${(int) mz[mass]}(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),${intensity[mass]},${(int) mz[mass]},${spectra.id})")
        }
    }
}

statement.executeBatch()


This worked quite well. As a consequence I removed the complete Hibernate layer from this application, since we had lost most of its benefits by now.

To further optimize the insert speed we replaced the Statements with PreparedStatements, which improved it by close to 50%.


LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
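
For illustration, here is a minimal sketch of what such a PreparedStatement based batch insert could look like. It is not the exact production code: the variable names (session, mz, intensity, beginMass, endMass, spectra) are simply reused from the Statement example above, and keeping one prepared statement per partition is an assumption.

//one PreparedStatement per partition, reused for every row of that ion
Map<Integer, PreparedStatement> statements = [:]

for (int mass = 0; mass < mz.length; mass++) {
    if (mass >= beginMass && mass <= endMass && intensity[mass] > 0) {
        int ion = (int) mz[mass]

        PreparedStatement ps = statements[ion]
        if (ps == null) {
            ps = session.connection().prepareStatement(
                "insert into iontrace_" + ion + "(id, intensity, ion, spectra_id) values (nextval('seq_trace'),?,?,?)")
            statements[ion] = ps
        }

        ps.setDouble(1, intensity[mass])
        ps.setInt(2, ion)
        ps.setLong(3, spectra.id)
        ps.addBatch()
    }
}

//flush all batches and release the statements
statements.values().each { PreparedStatement ps ->
    ps.executeBatch()
    ps.close()
}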



Storage

As you can imagine, this number of rows can take up quite some space. An initial estimate says it's roughly 20 TB with the current table schema.

That is 20 times more than the files occupy on the hard drive.

So what can you do to keep the storage requirements down? (A sketch of a slimmed-down table definition follows the list.)
  • use the smallest data types possible
  • only index what actually needs to be indexed
  • remove columns which are not necessary; for example, we don't need a primary key on the ion trace table
  • estimate the amount of data you will realistically store; since we won't store billions of spectra, we can use integer instead of bigint
  • store intensities as integer and not as double
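
As an illustration of these rules, here is a sketch of how a slimmed-down iontrace table could be defined. This is not the schema used earlier in the post (note that the surrogate id column from the insert examples is dropped, per the list above), and the connection settings are placeholders.

import groovy.sql.Sql

//placeholder connection settings
def sql = Sql.newInstance('jdbc:postgresql://localhost/netcdf-repository',
        'username', 'password', 'org.postgresql.Driver')

sql.execute('''
    CREATE TABLE iontrace (
        spectra_id integer  NOT NULL,  -- integer is enough, we will not store billions of spectra
        ion        smallint NOT NULL,  -- the m/z value, far below the smallint limit
        intensity  integer  NOT NULL   -- stored as integer instead of double
    )
''')

sql.close()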

To estimate the required storage we used the following query: it divides the current database size by the number of netcdf files already loaded and multiplies that per-file size by 40,000 files, which corresponds to roughly 4 years of netcdf files.


netcdf-repository=# SELECT pg_size_pretty(pg_database_size('netcdf-repository')/(select count(*) from netcdf)* 40000) as "estimated database size";
estimated database size
-------------------------
16 TB
(1 row)


A server with this amount of storage would currently cost around $7,000.


-END-

Friday, January 22, 2010

protecting jboss


In the last couple of weeks we released more and more of our BinBase tools to make it possible to actually work with the database, which means we have to protect our data better.

Since I had no time at work for this and JBoss doesn't provide a convenient way to do it, I decided to write a little tool which does this for me.

So I created yet another Google Code project, called 'jboss-ip-filter', which does nothing more than provide an interceptor that intercepts all method calls and checks whether the caller's IP is in a list of registered IP addresses.

Features
  • protect EJB 3.x services
  • protect EJB 2.x services
  • IPs can be defined as regular expressions to support subnets
Configuration/Installation

First you need to download the latest release and copy it into the JBoss library directory of your chosen configuration.

Afterwards you need to register the interceptor in the jboss configuration.

Example

vim /usr/local/jboss/server/all/conf/standardjboss.xml

Go to the part about the container configurations and register the interceptor in the first position for every ejb configuration you want to protect.

The name of the class is: 'com.blogspot.codingandmore.jboss.filter.SessionInterceptor'


<container-configuration>
<container-name>Standard CMP 2.x EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-unified-invoker</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<insert-after-ejb-post-create>false</insert-after-ejb-post-create>
<call-ejb-store-on-clean>true</call-ejb-store-on-clean>
<container-interceptors>
<interceptor>com.blogspot.codingandmore.jboss.filter.SessionInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor>org.jboss.ejb.plugins.CallValidationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor</interceptor>
<interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntitySynchronizationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.cmp.jdbc.JDBCRelationInterceptor</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool</instance-pool>
<instance-cache>org.jboss.ejb.plugins.InvalidableEntityInstanceCache</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.cmp.jdbc.JDBCStoreManager</persistence-manager>
<locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock</locking-policy>
<container-cache-conf>
<cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy</cache-policy>
<cache-policy-conf>
<min-capacity>50</min-capacity>
<max-capacity>1000000</max-capacity>
<overager-period>300</overager-period>
<max-bean-age>600</max-bean-age>
<resizer-period>400</resizer-period>
<max-cache-miss-period>60</max-cache-miss-period>
<min-cache-miss-period>1</min-cache-miss-period>
<cache-load-factor>0.75</cache-load-factor>
</cache-policy-conf>
</container-cache-conf>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
<commit-option>B</commit-option>
</container-configuration>


After this is done you need to restart your server; on the next start it generates a properties file in the directory the server was started from. In this file you configure the IP addresses that are allowed to connect.

For example, if you started the server from the bin directory, the file will be found there:


vim /usr/local/jboss/bin/ip-filter-config.properties


The IP address of the local host is always registered.

The following two lines allow the host '128.120.136.154' to connect but refuse connections to the EJBs from any other host.


128.120.136.154 = true
\b(?:\d{1,3}\.){3}\d{1,3}\b = false
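
To make the allow/deny semantics of this file a bit more tangible, here is a small Groovy sketch of one plausible way to evaluate such rules. This is not the actual jboss-ip-filter code; it only illustrates the reading that every key is a regular expression and a caller is accepted as soon as one matching pattern is mapped to true.

/**
 * sketch only, not the jboss-ip-filter implementation:
 * a caller is allowed if at least one pattern matching its IP maps to 'true'
 */
boolean isAllowed(String callerIp, File ruleFile) {
    boolean allowed = false

    ruleFile.eachLine { line ->
        if (!line.contains('=') || line.trim().startsWith('#')) return  //skip blanks, comments and malformed lines

        def (pattern, value) = line.split('=', 2)*.trim()
        if (callerIp.matches(pattern) && value.toBoolean()) {
            allowed = true
        }
    }
    return allowed
}

//with the two rules above only 128.120.136.154 gets through
def rules = new File('/usr/local/jboss/bin/ip-filter-config.properties')
assert  isAllowed('128.120.136.154', rules)
assert !isAllowed('10.0.0.1', rules)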


If you encounter any problems, please don't hesitate to contact me and I will try to help.