Tuesday, December 21, 2010

registering jboss to start by default and conform to chkconfig

recently I had yet another BinBase install to do and it turned out that my standard jboss script is not chkconfig compatible.

So it was time to do a rewrite.


#! /bin/bash
#
# jboss    Bring up/down the jboss server
#
# chkconfig: 345 20 80
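# (start the service in runlevels 3, 4 and 5, with start priority 20 and stop priority 80)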
# description: Starts and stops the jboss server
#
# /etc/rc.d/init.d/jboss
# See how we were called.

#define where jboss is - this is the directory containing directories log, bin, conf etc
JBOSS_HOME=/share/apps/jboss

#define the user under which jboss will run, or use 'RUNASIS' to run as the current user
JBOSS_USER=${JBOSS_USER:-"root"}

#make sure java is in your path
JAVAPTH=/usr/java/latest/bin/java

#configuration to use, usually one of 'minimal', 'default', 'all'
JBOSS_CONF=${JBOSS_CONF:-"all"}
JBOSS_HOST=0.0.0.0

JBOSS_CONSOLE=$JBOSS_HOME/console.log
#if JBOSS_HOST specified, use -b to bind jboss services to that address
JBOSS_BIND_ADDR=${JBOSS_HOST:+"-b $JBOSS_HOST"}

#define the classpath for the shutdown class
JBOSSCP=${JBOSSCP:-"$JBOSS_HOME/bin/shutdown.jar:$JBOSS_HOME/client/jnet.jar"}

#define the script to use to start jboss
JBOSSSH=${JBOSSSH:-"$JBOSS_HOME/bin/run.sh -c $JBOSS_CONF $JBOSS_BIND_ADDR"}

if [ "$JBOSS_USER" = "RUNASIS" ]; then
SUBIT=""
else
SUBIT="su - $JBOSS_USER -c "
fi

if [ -n "$JBOSS_CONSOLE" -a ! -d "$JBOSS_CONSOLE" ]; then
# ensure the file exists
touch $JBOSS_CONSOLE
if [ ! -z "$SUBIT" ]; then
chown $JBOSS_USER $JBOSS_CONSOLE
fi
fi

Saturday, December 18, 2010

grails - providing your own plugins configuration

sometimes you want to share some services in a plugin, which need to be configured at runtime.

The easiest way is to provide a simple file in the grails conf directory of your plugin and just read and access it at runtime.

How to?

our little helper:


package binbase.core

import grails.util.GrailsUtil
import org.codehaus.groovy.grails.commons.GrailsApplication

/**
* Created by IntelliJ IDEA.
* User: wohlgemuth
* Date: 12/18/10
* Time: 1:00 AM
* To change this template use File | Settings | File Templates.
*/
class BinBaseConfigReader {

private static ConfigObject config = initialize()

/**
* initializes the object
* @return
*/
static ConfigObject initialize() {
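// loads the compiled BinBaseConfig.groovy from the plugin's conf directory off the classpath and slurps it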
return new ConfigSlurper().parse(new GroovyClassLoader(BinBaseConfigReader.class.getClassLoader()).loadClass('BinBaseConfig'))
}

/**
* returns the server
* @return
*/
static String getServer() {
if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_TEST)) {
return config.binbase.test.server.toString()
}
else if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_DEVELOPMENT)) {
return config.binbase.development.server.toString()
}
else if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_PRODUCTION)) {
return config.binbase.production.server.toString()

}
else {
throw new RuntimeException("unexpected environment found")
}
}
}



our configuration


/**
* this file contains the binbase configuration for this plugin.
*
*/
binbase {
production {
key = "dasdasd"
server = "10.1.1.1"
}
test {
key = "dasdasd"
server = "10.2.2.2"
}
development {
key = "dasdasd"
server = "127.0.0.1"
}
}


and to access it in, say, your spring configuration?


jndiBinBaseTemplate(org.springframework.jndi.JndiTemplate) {

environment = [
"java.naming.factory.initial": "org.jnp.interfaces.NamingContextFactory",
"java.naming.factory.url.pkgs": "org.jboss.naming:org.jnp.interfaces",
"java.naming.provider.url": "${BinBaseConfigReader.getServer()}:1099".toString()
]
}


easy, isn't it?

grails - testing service, which accesses ejb's

tonight I just wanted to write a quick grails plugin to access my BinBase database system.

Since I'm a smart developer I start with a simple spring dsl file and write a test.



// Place your Spring DSL code here
beans = {

jndiBinBaseTemplate(org.springframework.jndi.JndiTemplate) {

environment = [
"java.naming.factory.initial": "org.jnp.interfaces.NamingContextFactory",
"java.naming.factory.url.pkgs": "org.jboss.naming:org.jnp.interfaces",
"java.naming.provider.url": "localhost:1099"
]
}

/**
* connection to the cluster configuration
*/
clusterConfigService(SimpleRemoteStatelessSessionProxyFactoryBean) { bean ->
businessInterface = "edu.ucdavis.genomics.metabolomics.binbase.cluster.ejb.delegate.ClusterConfigService"
jndiName = "clusterservice/ClusterConfigServiceBean/remote"
jndiTemplate = ref(jndiBinBaseTemplate)
}

}


and the test code is simple enough



class BinBaseSchedulingServiceTests extends GroovyTestCase {


ClusterConfigService clusterConfigService

protected void setUp() {
super.setUp()
}

protected void tearDown() {
super.tearDown()
}

public void testTest(){
assertTrue(clusterConfigService.username != null)
}
}


the variable 'clusterConfigService' is a stateless session bean in a jboss container and I know it works...

except grails is complaining:



org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'clusterConfigService': Invocation of init method failed; nested exception is javax.naming.NameNotFoundException: Name [clusterservice/ClusterConfigServiceBean/remote] not bound; 0 bindings: []
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1412)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
at org.codehaus.groovy.grails.commons.spring.ReloadAwareAutowireCapableBeanFactory.doCreateBean(ReloadAwareAutowireCapableBeanFactory.java:135)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:456)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:291)
at
....lots and lots of more...
Caused by: javax.naming.NameNotFoundException: Name [clusterservice/ClusterConfigServiceBean/remote] not bound; 0 bindings: []
at org.springframework.mock.jndi.SimpleNamingContext.lookup(SimpleNamingContext.java:132)
at javax.naming.InitialContext.lookup(InitialContext.java:392)
at org.springframework.jndi.JndiTemplate$1.doInContext(JndiTemplate.java:154)
at org.springframework.jndi.JndiTemplate.execute(JndiTemplate.java:87)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:152)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:178)
at org.springframework.jndi.JndiLocatorSupport.lookup(JndiLocatorSupport.java:95)
at org.springframework.jndi.JndiObjectLocator.lookup(JndiObjectLocator.java:105)
at org.springframework.ejb.access.AbstractRemoteSlsbInvokerInterceptor.lookup(AbstractRemoteSlsbInvokerInterceptor.java:100)
at org.springframework.ejb.access.AbstractSlsbInvokerInterceptor.refreshHome(AbstractSlsbInvokerInterceptor.java:122)
at
...lots and lots of more...
[23:59:35,938] [ERROR] [main] [StackTrace] [Sanitizing stacktrace:]
javax.naming.NameNotFoundException: Name [clusterservice/ClusterConfigServiceBean/remote] not bound; 0 bindings: []
at org.springframework.mock.jndi.SimpleNamingContext.lookup(SimpleNamingContext.java:132)
at javax.naming.InitialContext.lookup(InitialContext.java:392)
at org.springframework.jndi.JndiTemplate$1.doInContext(JndiTemplate.java:154)
at org.springframework.jndi.JndiTemplate.execute(JndiTemplate.java:87)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:152)




hmm, doesn't make much sense. The reason for the failure is that grails decides to mock the JNDI access, which is normally ok, but in our case really not desired.

After all we want to access the application server context!

But it keeps jumping to the 'org.springframework.mock.jndi.SimpleNamingContext' mocking context.

So the only solution is this slightly dirty hack...


/***
* we do not need mocking in this plugin...
*/
if (org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder) {
org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder.deactivate()
}



and promptly it passes.

Since we only want to change this for test cases, we add yet another condition and our resources.groovy file ends up looking like


import org.springframework.ejb.access.SimpleRemoteStatelessSessionProxyFactoryBean
import grails.util.GrailsUtil
import org.codehaus.groovy.grails.commons.GrailsApplication

/** *
* we do not need mocking in this plugin...
*/
if (GrailsUtil.getEnvironment().equals(GrailsApplication.ENV_TEST)) {
println "warning disabling mocking context!!! to make remote ejb's work"
if (org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder) {
org.springframework.mock.jndi.SimpleNamingContextBuilder.currentContextBuilder.deactivate()
}
}

// Place your Spring DSL code here
beans = {

jndiBinBaseTemplate(org.springframework.jndi.JndiTemplate) {

environment = [
"java.naming.factory.initial": "org.jnp.interfaces.NamingContextFactory",
"java.naming.factory.url.pkgs": "org.jboss.naming:org.jnp.interfaces",
"java.naming.provider.url": "eros.fiehnlab.ucdavis.edu:1099"
]
}

/**
* connection to the cluster configuration
*/
clusterConfigService(SimpleRemoteStatelessSessionProxyFactoryBean) { bean ->
businessInterface = "edu.ucdavis.genomics.metabolomics.binbase.cluster.ejb.delegate.ClusterConfigService"
jndiName = "clusterservice/ClusterConfigServiceBean/remote"
jndiTemplate = ref(jndiBinBaseTemplate)
}

}

Thursday, December 16, 2010

Grails 1.3.5 webtests - custom steps to reset the database

once in a while I run into the problem that I want to reset the complete database during webtests or take a snapshot of the current database while running the test.

So we use the great dbunit plugin to dump the database in one step:


import javax.sql.DataSource
import grails.plugin.remotecontrol.RemoteControl
import org.dbunit.database.DatabaseConnection
import org.dbunit.dataset.IDataSet
import org.dbunit.database.IDatabaseConnection
import org.dbunit.dataset.xml.FlatXmlDataSet
/**
* Created by IntelliJ IDEA.
* User: wohlgemuth
* Date: Nov 24, 2010
* Time: 3:46:47 PM
* To change this template use File | Settings | File Templates.
*/
class DumpDatabaseStep extends com.canoo.webtest.steps.Step {

String fileName = "result.xml"
/**
* resets the database
*/
void doExecute() {

RemoteControl remote = new RemoteControl()
try {

def outputFile = fileName
remote {
File file = new File("target/test-reports/data")
file.mkdirs()

DataSource dataSource = ctx.dataSource

IDatabaseConnection connection = new DatabaseConnection(dataSource.connection)

IDataSet fullDataSet = connection.createDataSet()

File out = new File(file, outputFile)

int counter = 1;

boolean run = out.exists()
while (run) {
out = new File(file, "${counter}-${outputFile}")

run = out.exists()
if (counter == 100) {
println "killed after 100 runs.."
run = false
}
counter++
}

FlatXmlDataSet.write(fullDataSet, new FileOutputStream(out))
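// return null, the remote-control plugin sends the closure result back to the test JVM and the data set itself is not something we want to serialize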

null
}
}
catch (Exception e) {
e.printStackTrace()

throw e
}

}
}


and also to reset the complete database in another step


import grails.plugin.remotecontrol.RemoteControl
import javax.sql.DataSource
import groovy.sql.Sql

/**
* simple step which resets the complete database
*/
class ResetDatabaseStep extends com.canoo.webtest.steps.Step {

/**
* resets the database
*/
void doExecute() {

try {
DbUnitOperator.create()


RemoteControl remote = new RemoteControl()

remote.exec {
try {
DataSource dataSource = ctx.dataSource
Sql sql = Sql.newInstance(dataSource)

sql.execute("DROP SEQUENCE MINIX_ID")
sql.execute("CREATE SEQUENCE MINIX_ID START WITH 1000 INCREMENT BY 1")
}
catch (Exception e) {
e.printStackTrace()
throw e
}
}


}
catch (Exception e) {
println "error: ${e.getMessage()}"
e.printStackTrace()
throw new RuntimeException(e)
}

}
}

Grails 1.3.5 webtests and JQuery trouble

recently I migrated my prototype code over to JQuery and promptly all my webtests failed with a lot of error messages and exceptions.

Like:


"TypeError: Cannot find function createComment in
object [object]. (FunctionalTestCase.groovy#372)"


So during some googling I found these reports:

old htmlunit version

replacing version

Which led to me reporting:

my approach

and the final solution was to put this into your 'BuildConfig.groovy'


plugin("webtest") {
test('net.sourceforge.htmlunit:htmlunit:2.8-SNAPSHOT') {
excludes 'xalan'
excludes 'xml-apis'
}
test('com.canoo.webtest:webtest:3.1-SNAPSHOT') {
excludes 'xalan'
excludes 'xml-apis'
}
test('xalan:xalan:2.7.0') {
excludes 'xml-apis'
}
}

Tuesday, November 23, 2010

grails many to many

currently grails many to many relationships are giving me a lot of grief with exceptions like



org.hibernate.StaleObjectStateException:
Row was updated or deleted by another transaction (or unsaved-value mapping was
incorrect): [ClassName#id]



or constraint exceptions.

Which all turned out to be related to the optimistic locking failing from time to time, so it needs to be disabled with version = false in the domain mapping.
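
In a domain class this looks roughly like the sketch below; the class and property names are made up, only the mapping block with version false matters:

class Compound {
    String name
    static hasMany = [samples: Sample]
    static mapping = {
        version false   // disable optimistic locking for this class
    }
}

class Sample {
    String label
    static hasMany = [compounds: Compound]
    static belongsTo = Compound
    static mapping = {
        version false
    }
}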

References:

disable optimistic locking
delete many to many relations

Thursday, September 30, 2010

BinBase - generate histograms

sql statements to generate histograms, to learn more about your data

similarity


select count("range"), "range" from
(
  select
    case
      when match > 500 and match <= 550 then '501 - 550'
      when match > 550 and match <= 600 then '550 - 600'
      when match > 600 and match <= 650 then '600 - 650'
      when match > 650 and match <= 700 then '650 - 700'
      when match > 700 and match <= 750 then '700 - 750'
      when match > 750 and match <= 800 then '750 - 800'
      when match > 800 and match <= 850 then '800 - 850'
      when match > 850 and match <= 900 then '850 - 900'
      when match > 900 and match <= 950 then '900 - 950'
      when match > 950 and match <= 1000 then '950 - 1000'
    end AS "range"
  from spectra a, bin b
  where a.bin_id = b.bin_id and b.export = 'FALSE'
) a
group by "range"


purity


select count("range"), "range" from
(
  select
    case
      when a.purity > 0 and a.purity <= 0.1 then '0 - 0.1'
      when a.purity > 0.1 and a.purity <= 0.5 then '0.1 - 0.5'
      when a.purity > 0.5 and a.purity <= 0.9 then '0.5 - 0.9'
      when a.purity > 0.9 and a.purity <= 1.3 then '0.9 - 1.3'
      when a.purity > 1.3 and a.purity <= 1.7 then '1.3 - 1.7'
      when a.purity > 1.7 and a.purity <= 2.1 then '1.7 - 2.1'
      when a.purity > 2.1 and a.purity <= 2.5 then '2.1 - 2.5'
      when a.purity > 2.5 then '> 2.5'
    end AS "range"
  from spectra a, bin b
  where a.bin_id = b.bin_id and b.export = 'TRUE'
) a
group by "range"


signal noise


Tuesday, September 28, 2010

BinBase - reset database

simple sql script to reset the binbase database


delete from bin where bin_id not in (select bin_id from standard)
GO
delete from spectra where bin_id not in (select bin_id from standard)
GO
delete from correction_data
GO
delete from samples where sample_id not in (select sample_id from bin)
GO
delete from rawdata
GO
delete from result_link
GO
delete from result
GO
delete from runtime
GO
delete from quantification
GO

Monday, September 27, 2010

BinBase - simulate the sifter algorithm directly on the database

currently I'm generating statistics for the BinBase system and so needed to simulate the annotation algorithm in sql.

It was rather simple....



-- 0.05 settings
select
count(*) as "count",
lower('0 < 0.05') as purity, lower('> 500') as signalNoise,
lower('> 800') as similarity

from spectra where purity > 0 and purity < 0.05 and signal_noise > 500 and match > 800

UNION

select
count(*) as "count",
lower('0 < 0.05') as purity, lower('> 50 && < 500') as signalNoise, lower('> 700') as similarity

from spectra where purity > 0 and purity < 0.05 and signal_noise > 50 and signal_noise <= 500 and match > 700


UNION

select
count(*) as "count",
lower('0 < 0.05') as purity, lower('> 5 && < 50') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 0.05 and signal_noise > 5 and signal_noise <= 50 and match > 600

-- 0.3 settings

UNION

select
count(*) as "count",
lower('0 < 0.3') as purity, lower('> 500') as signalNoise,
lower('> 700') as similarity

from spectra where purity > 0 and purity < 0.3 and signal_noise > 500 and match > 700

UNION

select
count(*) as "count",
lower('0 < 0.3') as purity, lower('> 50 && < 500') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 0.3 and signal_noise > 50 and signal_noise <= 500 and match > 600

UNION

select
count(*) as "count",
lower('0 < 0.3') as purity, lower('> 5 && < 50') as signalNoise, lower('> 500') as similarity

from spectra where purity > 0 and purity < 0.3 and signal_noise > 5 and signal_noise <= 50 and match > 500

-- all other settings

UNION

select
count(*) as "count",
lower('> 0.3') as purity,
lower('> 500') as signalNoise,
lower('> 600') as similarity

from spectra where purity > 0.3 and signal_noise > 500 and match > 600

UNION

select
count(*) as "count",
lower('> 0.3') as purity,
lower('> 50 && < 500') as signalNoise, lower('> 500') as similarity

from spectra where purity > 0.3 and signal_noise > 50 and signal_noise <= 500 and match > 500

UNION

select
count(*) as "count",
lower('> 0.3') as purity,
lower('> 5 && < 50') as signalNoise, lower('> 450') as similarity

from spectra where purity > 0.3 and signal_noise > 5 and signal_noise <= 50 and match > 450

-- total --

union

select
count(*) as "count",
lower('no filter') as purity,
lower('no filter') as signalNoise,
lower('no filter') as similarity

from spectra



leco 2.22 settings


-- 0.1 settings
select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 250') as signalNoise,
lower('> 800') as similarity

from spectra where purity > 0 and purity < 0.1 and signal_noise > 250 and match > 800

UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 25 && < 250') as signalNoise, lower('> 700') as similarity

from spectra where purity > 0 and purity < 0.1 and signal_noise > 25 and signal_noise <= 250 and match > 700


UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 3 && < 25') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 0.1 and signal_noise > 3 and signal_noise <= 25 and match > 600

-- 1.5 settings

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 250') as signalNoise,
lower('> 700') as similarity

from spectra where purity > 0 and purity < 1.5 and signal_noise > 250 and match > 700

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 25 && < 250') as signalNoise, lower('> 600') as similarity

from spectra where purity > 0 and purity < 1.5 and signal_noise > 25 and signal_noise <= 250 and match > 600

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 3 && < 25') as signalNoise, lower('> 500') as similarity

from spectra where purity > 0 and purity < 1.5 and signal_noise > 3 and signal_noise <= 25 and match > 500

-- all other settings

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 250') as signalNoise,
lower('> 600') as similarity

from spectra where purity > 1.5 and signal_noise > 250 and match > 600

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 3 && < 250') as signalNoise, lower('> 500') as similarity

from spectra where purity > 1.5 and signal_noise > 3 and signal_noise <= 250 and match > 500

UNION

-- total --

select
count(*) as "count",
lower('no filter') as purity,
lower('no filter') as signalNoise,
lower('no filter') as similarity

from spectra


and as a query to filter polysiloxanes out of the result for leco 2.22




-- 0.1 settings
select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 250') as signalNoise,
lower('> 800') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 0.1 and a.signal_noise > 250 and match > 800 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 25 && < 250') as signalNoise, lower('> 700') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 0.1 and a.signal_noise > 25 and a.signal_noise <= 250 and match > 700 and a.bin_id = b.bin_id and b.export = 'TRUE'


UNION

select
count(*) as "count",
lower('0 < 0.1') as purity, lower('> 3 && < 25') as signalNoise, lower('> 600') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 0.1 and a.signal_noise > 3 and a.signal_noise <= 25 and match > 600 and a.bin_id = b.bin_id and b.export = 'TRUE'

-- 1.5 settings

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 250') as signalNoise,
lower('> 700') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 1.5 and a.signal_noise > 250 and match > 700 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 25 && < 250') as signalNoise, lower('> 600') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 1.5 and a.signal_noise > 25 and a.signal_noise <= 250 and match > 600 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('0 < 1.5') as purity, lower('> 3 && < 25') as signalNoise, lower('> 500') as similarity

from spectra a, bin b where a.purity > 0 and a.purity < 1.5 and a.signal_noise > 3 and a.signal_noise <= 25 and match > 500 and a.bin_id = b.bin_id and b.export = 'TRUE'

-- all other settings

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 250') as signalNoise,
lower('> 600') as similarity

from spectra a, bin b where a.purity > 1.5 and a.signal_noise > 250 and match > 600 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

select
count(*) as "count",
lower('> 1.5') as purity,
lower('> 3 && < 250') as signalNoise, lower('> 500') as similarity

from spectra a, bin b where a.purity > 1.5 and a.signal_noise > 3 and a.signal_noise <= 250 and match > 500 and a.bin_id = b.bin_id and b.export = 'TRUE'

UNION

-- total --

select
count(*) as "count",
lower('no filter') as purity,
lower('no filter') as signalNoise,
lower('no filter') as similarity

from spectra a, bin b where a.bin_id = b.bin_id and b.export = 'TRUE'

Wednesday, September 22, 2010

BinBase queries

short overview of binbase queries to generate statistics



count of spectra

select count(*) from spectra a, samples b where a.sample_id = b.sample_id and visible = 'TRUE'

count of spectra with no annotations

select count(*) from spectra a, samples b where a.sample_id = b.sample_id and visible = 'TRUE' and bin_id is null

count of samples

select count(*) from samples where visible = 'TRUE'

known species

select distinct value from sample_info a, meta_key b where a.key_id = b.key_id and b.key = 'Species'

count of species

select count(distinct value) from sample_info a, meta_key b where a.key_id = b.key_id and b.key = 'Species'

sample count for species

select count(a.sample_id),a.value from sample_info a, meta_key b,samples c where a.key_id = b.key_id and b.key = 'Species' and a.sample_id = c.sample_id and c.visible = 'TRUE' group by a.value

bin generation by species

select a.value,count(d.spectra_id) from sample_info a, meta_key b,samples c,bin d where a.key_id = b.key_id and b.key = 'Species' and a.sample_id = c.sample_id and d.sample_id = c.sample_id group by a.value

ri correction successful

select count(*) from samples where visible = 'TRUE' and correction_failed = 'FALSE'

named bins

select count(bin_id) from bin where name = TO_CHAR(bin_id,'FM9999999999')

distribution of similarities


select count("range"), "range" from (
  select
    case
      when match >= 0 and match <= 100 then '0 - 100'
      when match > 100 and match <= 200 then '101 - 200'
      when match > 200 and match <= 300 then '201 - 300'
      when match > 300 and match <= 400 then '301 - 400'
      when match > 400 and match <= 500 then '401 - 500'
      when match > 500 and match <= 600 then '501 - 600'
      when match > 600 and match <= 700 then '601 - 700'
      when match > 700 and match <= 800 then '701 - 800'
      when match > 800 and match <= 900 then '801 - 900'
      when match > 900 and match <= 1000 then '901 - 1000'
    end AS "range"
  from spectra
) a group by "range"

Wednesday, September 8, 2010

grails - provide a dynamically generated file as download

today I had the rather simple request

generate a file on the fly and provide it as download.

Well the easiest solution was to just create a closure in a controller and set the content type in there.



def download = {

File file = new File(params.file)
response.setHeader "Content-disposition", "attachment; filename=${file.name}.txt"
response.contentType = 'text/plain'
response.outputStream << file.text
response.outputStream.flush()
}

Wednesday, August 25, 2010

dynamic fields with groovy

one of the features I love is to just define new fields in objects whenever I want.

like:



def test = ["a","b","c"]

Object a = new Object()

test.each{

a.metaClass."${it}" = it
}



generates the following fields in the object a

a.a = "a"
a.b = "b"
a.c = "c"

works incredibly well with the grails export plugin :)

Friday, July 23, 2010

Bash - One line based For Loop

today I needed to start 40 jobs in a row and the easiest way I could think of was a one line bash script to do this for me


for i in `seq 40`; do ./node.sh; done


tada...

Tuesday, July 13, 2010

finding all external reference names in pubchem sdf files

currently I'm working on a simple way to find all external reference names in the pubchem sdf files, which is rather trivial, but time consuming:



cat *.sdf | grep PUBCHEM_EXT_DATASOURCE_NAME -A 1 | grep -v PUBCHEM_EXT_DATASOURCE_NAME | grep -v '\-\-' | sort | uniq



once this finishes it should give us a list of all possible data sources from pubchem as a unique list of names.


...
Ambinter
Burnham Center for Chemical Genomics
Calbiochem
CC_PMLSC
ChEBI
ChemSpider
DiscoveryGate
Emory University Molecular Libraries Screening Center
InFarmatik
KUMGM
LipidMAPS
MICAD
MLSMR
MMDB
MTDP
Nature Chemical Biology
NCGC
NIAID
NMMLSC
NMRShiftDB
ORST SMALL MOLECULE SCREENING CENTER
PCMD
ProbeDB
Prous Science Drugs of the Future
R&D Chemicals
Sigma-Aldrich
Specs
SRMLSC
Structural Genomics Consortium
The Scripps Research Institute Molecular Screening Center
Thomson Pharma
UM-BBD
UPCMLD
...

Monday, July 5, 2010

ubuntu - stuck in graphical mode without an editor

well for some reason I thought it was smart to install the gdm package on a freshly installed ubuntu linux server in a VMware console.

20 Seconds later it turned out that I had

- no terminal application
- could not switch to a virtual terminal, vmware kept ignoring the commands
- had ssh disabled

In short, I was stuck. Till I remembered I could write a simple script to just install xterm and execute it. But wait, I needed to enter parameters and passwords after the prompt.

Fast forward

#! /bin/bash
echo 'password' | sudo -S apt-get --yes install xterm > log.txt

as a script in a simple text file, which was set to be executable, gave me all I needed to install my xterm and be happy again.

Tuesday, June 22, 2010

making a jmx depend on an ear

this morning I discovered that my JBoss application server sometimes deploys my jmx beans before the ear they depend on. The fix is, as always, easy: just define the dependency in your jboss-service.xml




<server>
<mbean
code="edu.ucdavis.genomics.metabolomics.binbase.bci.server.jmx.LifeTimeEvictionPolicyConfigJMX"
name="binbase.cache:service=LifeTimeEvictionPolicy">
<depends>jboss.j2ee:ear=clusterservice.ear</depends>
</mbean>
</server>

Thursday, June 17, 2010

Getting the hostname for the local machine

there are always simple things I get stuck with, sometimes simple things like just getting the hostname for a local ip address in java.

solutions:



InetAddress addr = InetAddress.getLocalHost();
String hostName = addr.getCanonicalHostName();
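
and if you need the hostname for some other ip address rather than the local machine, the same api works; a small sketch (the address is just an example):

InetAddress addr = InetAddress.getByName("10.1.1.1");
String hostName = addr.getCanonicalHostName();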

Wednesday, June 16, 2010

sure enough, shortly after the iPad with the unlimited data plan got released, they changed the plan to

2GB A month

what a joke...

Data per month Price per month
250MB/month $14.99
2GB/month $25.00

now I'm glad that I cancelled my pre-order for one. What does 2GB get me these days? Nothing...

Thursday, May 13, 2010

renaming all files in a directory

some of the simple things in life




for x in *.php5; do n=${x/.php5/.php}; mv $x $n; done



taken from here

http://www.debianadmin.com/rename-multiple-files-to-another-extension-in-linux.html

Friday, May 7, 2010

moving indexes in postgres to a new tablespace

Recently I upgraded my server and added an SSD raid system of 3 128GB disks, and for obvious reasons I want to have my indexes on it. So the first step was to create a new tablespace


create tablespace tablespace_name owner owner_name location '/path/to/directory'


and then creating a quick script to move all the indexes


select 'ALTER INDEX '||indexname||' SET TABLESPACE compoundindex ;' from pg_catalog.pg_indexes where schemaname = 'public' order by tablename;


whose output then just needs to be executed

Tuesday, May 4, 2010

using perl to replace all occurrences of a word with another in large files

rather simple, we want to replace all occurrences of the word 'volatile' with 'volatile2' in a large text file and so just execute


perl -i -p -e 's{volatile}{volatile2}g' /mnt/storage/data/export


I can't believe I'm using perl again...

flushing BinBase

once in a while you think you want to reprocess all data in a binbase database and for this reason you want to flush all the content except the retention index standards. This should never be required, but if you decided to regenerate all the bins with different settings, well you gotta do what you gotta do.

here is a quick sql script to do this for you



delete from comments

go

delete from sample_info

go

delete from meta_key

go

delete from metainformation

go

delete from runtime

go

delete from bin_compare

go

delete from virtual_bin

go

delete from bin a where a.bin_id not in (select bin_id from standard)

go

delete from samples a where a.sample_id not in (select sample_id from bin)

go

delete from reference_class

go

delete from reference

go

delete from result_link

go

delete from bin_ratio

go

delete from bin_references

go

delete from classification

go

delete from configuration where configuration_id not in (select configuration_id from samples)

go

delete from spectra where sample_id not in (select sample_id from samples)

go

delete from qualitycontrol

go

delete from result

go

delete from standard_hist

go

delete from structure

go

delete from substance_classes

go

delete from synonyme

go

delete from user_validation

go


Wednesday, March 24, 2010

griffon

The last couple of days I spent developing a quick application in griffon and have to say this is a rather impressive framework. Especially since it's a young framework with a version of 0.3

what I like?

  • using the grails approach
  • very simple to use
  • saves a lot of time in the initial setup of the project
  • clear separation into controller/model/views
what I'm missing?
  • better support in IDE's
  • no SWT support
  • no gorm (it will come sooner or later)
  • documentation, with the current documentation i can't use it for production code

Thursday, March 11, 2010

i just love groovy for tasks like this...

sometimes you have these annoying little tasks on your hands and I always forget the bash syntax and it’s been a while since I seriously played with python.

So I thought, hey let’s try groovysh.

task?

  • read wordlist
  • convert to lowercase
  • remove parts of words
solution

new File("en_US.dic").text.split("\n").each{ new File("result.txt").append "${it.split("/")[0].toLowerCase()}\n" }

ok after a bit more coding I actually wrote a groovy script

  • read file
  • read file with words we do not want in word list
  • convert all values to lowercase
  • remove duplicated values
  • save to output file
really nothing fancy, but get’s the job done and took like 3 minutes to write and won’t be used again.

Set<String> cache = new HashSet<String>()
Set<String> result = new HashSet<String>()

new File("whitelist.txt").text.split("\n").each{ it.split("\t").eachWithIndex{

        String s, int index ->

                if(index > 0){
                        cache.add(s.toLowerCase())
                }
        }
}

new File("blacklist.txt").text.split("\n").each{
        if(cache.contains(it.toLowerCase()) == false){
                result.add(it.toLowerCase())
        }
}

File out = new File("wordlist.txt")

result.each{
        out.append it
        out.append "\n"
}


Wednesday, March 10, 2010

calculating exact molar masses using the cdk 1.3.1

I needed a simple way to calculate the exact mass for a couple of million compounds and so decided to give it another try with the CDK.

After googling a bit I found something that put me in the right direction and worked with the current version of the cdk, so it was rather simple

import org.openscience.cdk.DefaultChemObjectBuilder
import org.openscience.cdk.interfaces.IMolecularFormula
import org.openscience.cdk.tools.manipulator.MolecularFormulaManipulator

class ConvertInchi {

/**
* converts an inchi code to a molar mass
* @param inchi
* @return
*/
public static double convertInchiToMolareMass(String inchi) {

IMolecularFormula moleculeFormula = MolecularFormulaManipulator.getMolecularFormula(convertInchiToMolecularFormula(inchi),DefaultChemObjectBuilder.getInstance())

return MolecularFormulaManipulator.getTotalExactMass(moleculeFormula)

}

/**
*
* converts an inchi to a molecular formula
* @param
inchi
* @return
*/
public static String convertInchiToMolecularFormula(String inchi) {
return inchi.split("/")[1]

}
}
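
a quick usage sketch, here with the standard InChI for ethanol (C2H6O), which should come out at roughly 46.04:

def mass = ConvertInchi.convertInchiToMolareMass("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3")
println mass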

Friday, March 5, 2010

Postgres - partial indexes and sort indexes

this morning was a sad moment. Normally people ask me all the time about SQL optimization or SQL tuning or when to create an index and so on.

This morning i had another DOOOH moment.

Basically while trying to optimize BinBase even further I noticed that some queries take an absurd amount of time to finish. Like 500 seconds to query the bin table.

So I started to investigate and discovered that some of my queries are using seq scans over huge tables. For no apparent reason.

It turned out that the sort column was not indexed...


CREATE INDEX bin_retention_index ON bin (retention_index ASC NULLS LAST);


which was greatly improved after executing this statement.

Now there was still a huge wait time for a certain query


explain select * from BIN where bin_id not in ( SELECT bin_id from SPECTRA where sample_id = 1630733 and bin_id is not null )ORDER BY retention_index ASC


and it turned out that indexes were never used for the 'is not null' part. After some research and a lot of head scratching it turned out that postgres supports partial indexes for exactly this case.


create index spectra_bin_id_partial_is_not_null on spectra (bin_id) where bin_id is not null


Now afterwards we saw some improvement, but the actual slow data access is caused by something else. We always need the complete table minus a few compounds, and this operation takes some time.

Time to optimize the internal database lookup...

Thursday, March 4, 2010

rocks linux cluster - adding a new parallel environment

by default rocks ships with a couple of environments, which execute stuff on different nodes. But sometimes you just want to have a node all to yourself and take over all its slots.

To do this you can just create a new environment which gives you a defined number of cpus for a specified job.


  1. create a file which describes the parallel environment like this

     pe_name threaded
     slots 999
     user_lists NONE
     xuser_lists NONE
     start_proc_args /bin/true
     stop_proc_args /bin/true
     allocation_rule $pe_slots
     control_slaves FALSE
     job_is_first_task TRUE
     urgency_slots min
     accounting_summary FALSE

  2. register this on the head node

     qconf -Ap file.txt

  3. add it to the list of available environments

     qconf -mq all.q
     pe_list make mpich mpi orte threaded

  4. test it with qlogin

     qlogin -pe threaded 4

Wednesday, March 3, 2010

SetupX - tuning and the missing indexes

after analyzing a couple of hundred sql statements in setupx I noticed that there is no real use of indexes for some reason. Why there are no indexes escapes my mind, but since we prefer higher query speed I suggest the creation of the following indexes (which are far from perfect)

create index pubdata_relatedExperimentID_index on pubdata(relatedExperimentID)
create index NCBIClassifierID_ncbiID_index on NCBIClassifierID(ncbiID)
create index formobject_question_index on formobject(question)
create index formobject_discriminator_index on formobject(discriminator)
create index formobject_value_index on formobject(value(100))
create index query_querytype_index on query(querytype)
create index query_userid_index on query(userID)
create index datafile_source_index on datafile (source)
create index cache_experiment_id_index on cache (experimentID)
create index user_password_index on user (passwd)
create index user_username_index on user (username)
create index BlockedIP_blockedIp_index on BlockedIP(blockedIP)

this should improve the performance nicely and is currently being applied to our production system.

SetupX - rawdata access

ever had the desire to access the rawdata in SetupX, because the gui doesn't give you easy access, or any access at all, to it?

First you need to understand what's happening in the background. The approach used for setupx looks like a mapping from a tree to a table structure. Which is a valid approach and keeps it very flexible in theory, but is not a real practical solution. Especially since it's a real pain to write queries for.



Get all species:

select distinct value from formobject where question = 'species' and value != ''

Get all organs:

select distinct lower(value) from formobject where question in ( 'organ','organ name','Organs','Organ specification' ) and value != ''

we need to use lower, because people have a 'strange' way of spelling things


work in progress!

Tuesday, March 2, 2010

jboss 4.2.1GA and Java6 and jboss based webservice

thanks to this report I was able to fix this issue. Well it's not a fix, it's a workaround...

Ariela Hui - 13/Aug/08 04:23 PM
I was able to solve this problem. This is what environment I have:
winXP
JDK 1.6.0
JBoss 4.2.1

In the [jboss_home]/lib/endorsed add:
jaxb-api.jar
jboss-jaxrpc.jar
jboss-jaxws.jar
jboss-saaj.jar

copy them from [jboss_home]/server/default/lib

I've also added to endorsed jboss-logging-spi.jar
This I copied from jboss-5.0.0 client folder. 

Monday, March 1, 2010

scala/groovy on rocks linux

well since there is no scala/groovy roll for rocks we need to install it the traditional way.

  • go into the directory /share/apps on the frontend
  • if apps doesn't exist create it
  • copy your scala/groovy tgz there
  • gunzip and untar it
  • edit your extend-compute.xml as shown here
  • add a new file modification section like this


<file name="/etc/profile" mode="append">

GROOVY_HOME=/share/apps/groovy
SCALA_HOME=/share/apps/scala

export GROOVY_HOME
export SCALA_HOME

PATH=$GROOVY_HOME/bin:$PATH
PATH=$SCALA_HOME/bin:$PATH

export PATH

</file>


  • rebuild your dist as shown here
  • reinstall you nodes as shown here

Friday, February 26, 2010

backup! we got REG Expressions!

recently I got handed an interesting task
please extract all words out of a text...
ok simple enough, we know regular expressions and we know word boundaries.

So we just do

\b\w+\b


and this simple expression applied to the following sentence
Glucose (Glc), a monosaccharide (or simple sugar) also...
gives us the following list
  • Glucose
  • Glc
  • a
  • monosaccharide
  • or simple
  • sugar
  • also
Sadly the sentence didn't stop at this and continued to include the following tricky statement...
...including as much as possible chemical names
Ok, time to read up on reg expressions in groovy and java
Now let's discover a regular expression which helps us with this.

Ok let's try this again with a different sentence
...as glucose, only one of which (D-glucose) is biologically...
our first expression would miss
  • D-glucose
and return for this
  • D
  • glucose
so we need to modify it a bit to include the first separation. So it becomes

\b(\w\-)*\w+\b

and the day is saved till we try a new sentence and try to discover compounds like

  1. 1,3-Diaminopropane
  2. N-(3S-hydroxydecanoyl)-L-serine
  3. 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane
  4. 2-(allyloxy)-1,3,5-trimethylbenzene
  5. 3-hydroxy-2-butanone
  6. 3,3'-Oxybis(1-propene)
  7. 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane
  8. 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
  9. 1,6,9,13-tetraoxadispiro[4.2.4.2]tetradecane
  10. 3-(N, N-Diethylamino)-1-propylamine
  11. D-Glucose
  12. (R)-3-Hydroxybutyric acid
Or to make it more realistic, find all the words in this completely pointless and scientifically wrong text
 bunch of rumble to find 1,3-Diaminopropane in D-Glucose and 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane.
 It's also nice to have 3,3'-Oxybis(1-propene) or (R)-3-Hydroxybutyric acid. Last but not least I'm a huge fan of 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane.
 Also it's a great feeling if we can find (glucose) in brackets without finding statements like (help i'm surrounded by brackets).

So do you see a pattern here?
  • everything in () or [] or {} can be part of a chemical so we use ((\[.*\])|(\(.*\))) for this part 
  • everything separated by a ',' and followed by another character ending with a dash can be a chemical, so we use (\w+(,\w+)*\-) for this part
  • it ends all with a word \w+ or a ) (masked as \) )
so this expression would work for all these

(\([\w\+]+(,\w+)*\)-)?\b[(\w+(,\w+[\'])*\-)*((\[.*\])|(\(.*\))|(\{.*\}))*\w+]+(\b)( (acid)|(anhydride)|(\sbenzoate)|(\sketone)|(\sether)|(\sester)|(\scyanide))?


except
  1. 3-(N, N-Diethylamino)-1-propylamine
  2. glucose instead we get 'glucose)'
no solution for 1 or 2 yet. Still trying to figure it out.

now the nicest thing is the groovy closure + match example to get all the content in a text.


def match = (text =~ pattern)


congrats, now all your words are in the match variable! text is just a string containing our text.
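
for completeness, a tiny sketch of pulling the matched words out of the matcher, using the earlier sentence and the simple dash-aware expression from above:

def text = "as glucose, only one of which (D-glucose) is biologically"
def pattern = /\b(\w\-)*\w+\b/

// each match is a list because the pattern contains a group, element 0 is the full match
def words = (text =~ pattern).collect { it[0] }
println words
// [as, glucose, only, one, of, which, D-glucose, is, biologically]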

Wednesday, February 24, 2010

rocks linux cluster - mounting an nfs share on all nodes

after setting up the latest cluster I tried to provide a couple of nfs shares to all nodes, since users demanded this.

Well in rocks linux it's rather simple, once you understand the concept behind it.

So, a step by step tutorial.

  • go to the profile directory
  • cd /export/rocks/install/site-profiles/5.3/nodes/
  • make a copy of the skeleton file
  • cp skeleton.xml extend-compute.xml
  • edit file to tell it that we need to create a directory and add a line to the fstab. The right place for this is in the post section


    mkdir -p /mnt/share

    <file name="/etc/fstab" mode="append">
    server:/mount /mnt/share nfs defaults 0 0
    </file>

  • change back to the main install dir
  • cd /export/rocks/install
  • rebuild the rocks distribution
  • rocks create distro
  • rebuild nodes
  • ssh compute-0-0 '/boot/kickstart/cluster-kickstart'

congratulations, if you did everything right your node should now boot up and have the directory mounted.

playing around with threads

currently I got back to my hobby and am playing a bit with multithreading to tune an algorithm.

So the first step was to write a simple class to test if the threading api works on my system, and what is better than calculating primes?



import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Test {

public static void main(String args[]) throws InterruptedException {

ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (int i = 0; i < 500000000; i++) {
final int postion = i;
service.execute(new Runnable() {
@Override
public void run() {
try {
int i = postion;
int i1 = (int) Math.ceil(Math.sqrt(i));
boolean isPrimeNumber = false;
while (i1 > 1) {

if ((i != i1) && (i % i1 == 0)) {
isPrimeNumber = false;
break;
}
else if (!isPrimeNumber) {
isPrimeNumber = true;
}

--i1;
}

if (isPrimeNumber) {
System.out.println(Thread.currentThread().getName() + " - prime " + i);
}

}
catch (Exception e) {
e.printStackTrace();
}
}
});
}
service.shutdown();
service.awaitTermination(4000, TimeUnit.DAYS);
}
}




and the output is nicely:

pool-1-thread-2 - prime 469237
pool-1-thread-2 - prime 469241
pool-1-thread-2 - prime 469253
pool-1-thread-4 - prime 466553
pool-1-thread-4 - prime 469267
pool-1-thread-4 - prime 469279
pool-1-thread-2 - prime 469283
pool-1-thread-3 - prime 467869
pool-1-thread-3 - prime 469303
pool-1-thread-2 - prime 469321

while all 4 CPU's are at 100% use.

translation, java executor api seems to work quite well.

Now time to tune the binbase algorithm...

Saturday, February 20, 2010

postgres and insert performance with grails

recently I spent a lot of time tuning a postgres database and an algorithm (based on grails) to try to insert 80M chemical compounds into a database with reference checks and assurance of their uniqueness.


The main issue is that the inserts get slower and slower over time and end up taking too long. To improve the performance we started with the following approach:
  • more memory
  • database indexes (obvious)
  • flush the hibernate session every 100 inserts
  • flush the grails cache every 100 inserts
  • tune several postgres parameters
  • re-index the database every 10k inserts
  • enable vacuum on a 10k row basis
  • enable analyze on a 10k row basis
  • set auto vacuum to check every 10 minutes
But after all this work we are still stuck with a problem, the inserts get gradually slower over time, which can be related to the index building time.

Graph over 1M inserts: TODO, it's still in the calculation...
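
for reference, the session and cache flushing from the list above looks roughly like this; a sketch only, assuming an injected sessionFactory and a made-up Compound domain class (the property instance map cleanup is the usual grails 1.x batching idiom):

class CompoundImportService {

    def sessionFactory

    void importAll(List rows) {
        def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP

        rows.eachWithIndex { row, i ->
            new Compound(name: row.name).save()

            if (i % 100 == 0) {
                // push the pending inserts out and evict the first level cache
                sessionFactory.currentSession.flush()
                sessionFactory.currentSession.clear()
                // clear grails' per domain class property instance map as well
                propertyInstanceMap.get().clear()
            }
        }
    }
}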

Tuesday, February 9, 2010

Postgres and utf8

Recently I discovered that my postgres database is not running with UTF-8 support, which causes all kinds of headaches,

so what do you have to do to change this?

  • backup your database
  • set Lang to: LANG = en_US.UTF-8
  • execute initdb
  • recreate the database and specify the encoding: createdb --encoding=unicode
  • import your data

you should now have a working utf-8 database

Wednesday, February 3, 2010

does it have to be this way? (muss das sein?)

[INFO] [compiler:compile]
[INFO] Compiling 399 source files to /Users/wohlgemuth/Documents/workspace-private/oscar3-chem/branches/cdk-1.3.1/target/classes
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Compilation failure
Failure executing javac, but could not parse the error:
An exception has occurred in the compiler (1.5.0_20). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report.  Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file org/openscience/cdk/annotations/TestClass.class not found



Failure executing javac, but could not parse the error:
An exception has occurred in the compiler (1.5.0_20). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report.  Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file org/openscience/cdk/annotations/TestClass.class not found

I mean come on, do I really have to deal with compiler bugs now...

correction


never mind, forgot the cdk annotations library....



cdk-maven-mojos

recently I'm doing an awful lot with the CDK library and since I always 'like' to work with maven I thought it's time to write a couple of mojos to help me with the CDK work.

The first one on the list is a mojo which deploys the cdk library to my local maven repository and can be found under google code.

There are most likely more mojo's to come as I work more and more with the cdk.

Tuesday, February 2, 2010

regular expression for common chemical identifiers

This is basically a small collection of regular expressions which I use from time to time to distinguish chemical identifiers. Please feel free to add more to the list to make it grow and become more complete.


The first line is the name, the second is the valid groovy/java version. All are validated with at least a thousand examples.

std inchi 

InChI=1S/([^/]+)(?:/[^/]+)*\\S

std inchiKey 

[A-Z]{14}-[A-Z]{10}-[A-Z,0-9]


CAS

\\d{1,7}-\\d\\d-\\d

KEGG

C\d{5}

LipidMaps

LMFA[0-9]{8}

HMDB

HMDB[0-9]*
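
and a quick groovy sketch of how I use these, here with the inchikey and cas patterns (the aspirin identifiers below are just example strings):

def inchiKeyPattern = /[A-Z]{14}-[A-Z]{10}-[A-Z,0-9]/
def casPattern = /\d{1,7}-\d\d-\d/

assert "BSYNRYMUTXBXSQ-UHFFFAOYSA-N" ==~ inchiKeyPattern
assert "50-78-2" ==~ casPattern
assert !("not an identifier" ==~ casPattern)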

Sunday, January 24, 2010

how to handle 400 billion rows in postgres

well currently I'm trying to optimize a denoising and replacement algorithm. This includes calculating ion traces over thousands of files which have 20k peaks or 1 million ions each.
To make this more challenging we don't do this once, we do it during every BinBase export, since this is the base algorithm for the zero replacement.

In short, we do the same calculation over and over again, like calculating the minimum intensity for 100 files between 15 and 20 seconds for ion 87.

In short, this shouts: use a SQL database.

Now how much data do we have?

  • 40k files
  • each file has around 20k spectra
  • each spectrum has up to 500 ions

which translates into

400 000 000 000 rows

Performance Issues and how to make Postgres handle this amount of data

The first attempt was to just store all these files into a single table using hibernate. Which caused an OutOfMemory exception, surprise surprise.

So the next attempt was to use a stateless session, which worked quite well, and up to 1 billion rows we had a query speed of under 20ms. Quite nice for no tuning of the database. But once we hit a bit over 1 billion rows the query speed got worse and worse. In short, this was not an acceptable solution.

So we tried to use partitions with postgres and define one table for each ion, inheriting from a master table.


Table "public.iontrace"
Column | Type | Modifiers
------------+------------------+-----------
id | bigint | not null
intensity | double precision |
ion | integer |
spectra_id | bigint |
Indexes:
"iontrace_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"fk4f93923dc3b62c9a" FOREIGN KEY (spectra_id) REFERENCES spectra(id)
Rules:
rule_index_iontrace_100 AS
ON INSERT TO iontrace
WHERE new.ion = 100 DO INSTEAD INSERT INTO iontrace_100 (id, intensity, ion, spectra_id)
VALUES (new.id, new.intensity, new.ion, new.spectra_id)
rule_index_iontrace_101 AS
ON INSERT TO iontrace
WHERE new.ion = 101 DO INSTEAD INSERT INTO iontrace_101 (id, intensity, ion, spectra_id)
VALUES (new.id, new.intensity, new.ion, new.spectra_id)
rule_index_iontrace_102 AS
ON INSERT TO iontrace
WHERE new.ion = 102 DO INSTEAD INSERT INTO iontrace_102 (id, intensity, ion, spectra_id)



which didn't work with hibernate. It kept complaining about


org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1


so we had to write our own batcher as was described here

a bit later we rewrote the example to actually work, since the field batchSize is set to private.


/**
* hibernate partion batcher
*/
public class HibernatePartionBatcher extends BatchingBatcher {

public HibernatePartionBatcher(org.hibernate.jdbc.ConnectionManager connectionManager, org.hibernate.Interceptor interceptor) {
super(connectionManager, interceptor)
}


@Override
protected void doExecuteBatch(PreparedStatement ps) throws SQLException, HibernateException {

//use reflection to access the private field of the super class, ugly but necessary
Field field = this.getClass().getSuperclass().getDeclaredField("batchSize")
field.setAccessible(true)

int value = field.getInt(this)

if (value != 0) {
try {
ps.executeBatch()
}
catch (RuntimeException re) {
throw re;
}
finally {
field.setInt(this, 0)
}
}
}
}

/**
* hibernate factory to create our partition batcher
*/
public class HibernatePartitionBatcherFactory extends BatchingBatcherFactory {

public HibernatePartitionBatcherFactory() {
}

public org.hibernate.jdbc.Batcher createBatcher(org.hibernate.jdbc.ConnectionManager connectionManager, org.hibernate.Interceptor interceptor) {
return new HibernatePartionBatcher(connectionManager, interceptor);
}
}



and to register this in our groovy script


def hibProps = [
"hibernate.jdbc.factory_class": HibernatePartitionBatcherFactory.class.getName()
]



A first test with just a single sample reveals that we are executing a sequential scan over all partitions.


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Result (cost=0.00..1163.25 rows=3374 width=28)
-> Append (cost=0.00..1163.25 rows=3374 width=28)
-> Seq Scan on iontrace (cost=0.00..10.46 rows=7 width=28)
Filter: (ion = 105)
-> Index Scan using index_iontrace_20 on iontrace_20 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_21 on iontrace_21 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_22 on iontrace_22 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_23 on iontrace_23 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_24 on iontrace_24 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_25 on iontrace_25 iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)



to avoid this you have to change a parameter in postgresql.


SET constraint_exclusion = on;


and the new result shows that we only work on the correct partition now


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Result (cost=0.00..13.03 rows=14 width=28)
-> Append (cost=0.00..13.03 rows=14 width=28)
-> Seq Scan on iontrace (cost=0.00..10.46 rows=7 width=28)
Filter: (ion = 105)
-> Index Scan using index_iontrace_105 on iontrace_105 iontrace (cost=0.00..2.57 rows=7 width=28)
Index Cond: (ion = 105)



it still executes a sequential scan, but now it only does it on the correct partition with the stored data.

Now to reduce the likelihood of a sequential scan over the database we will define indexes on all the tables (the script at the end of this post does this for you)

the new query plan is


explain SELECT * from iontrace where ion = 105;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Result (cost=0.00..4.15 rows=14 width=28)
-> Append (cost=0.00..4.15 rows=14 width=28)
-> Index Scan using index_iontrace_ion on iontrace (cost=0.00..1.57 rows=7 width=28)
Index Cond: (ion = 105)
-> Index Scan using index_iontrace_105 on iontrace_105 iontrace (cost=0.00..2.57 rows=7 width=28)
Index Cond: (ion = 105)


The result is slightly different since by now we have generated more data in the database. Sorry for that. But you can see that the query plan is much more efficient with the index.

It is important that you have an index on ion for every partition and on the main 'iontrace' table; if you don't have an index on the main table you will keep getting sequential scans.

The next snag we encountered was the speed of the actual data insert, since the check constraints seem to be rather slow and expensive.

First this is a hibernate insert on the main datatable:


LOG: duration: 8.961 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]
LOG: duration: 8.951 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]
LOG: duration: 8.966 ms statement: EXECUTE [PREPARE: insert into IonTrace (intensity, ion, spectra_id, id) values ($1, $2, $3, $4)]


Second this is plain old sql inserted into the main iontrace table:


LOG: duration: 6.608 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1418.0,237,700013)]
LOG: duration: 6.610 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1373.0,238,700013)]
LOG: duration: 6.605 ms statement: EXECUTE [PREPARE: insert into iontrace(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),1383.0,239,700013)]


Third this is plain old sql inserted in each partition:


LOG: duration: 0.055 ms statement: EXECUTE [PREPARE: insert into iontrace_475(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),27.0,475,680203)]
LOG: duration: 0.058 ms statement: EXECUTE [PREPARE: insert into iontrace_489(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),13.0,489,680203)]
LOG: duration: 0.051 ms statement: EXECUTE [PREPARE: insert into iontrace_496(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),13.0,496,680203)]
LOG: duration: 0.138 ms statement: EXECUTE [PREPARE: insert into Spectra (file_id, retentionTime, id) values ($1, $2, $3)]


From this we can see that it makes the most sense to use plain SQL to insert the data directly into the related partitions, so the check constraints are avoided during the insert phase.

For this we used the following approach:


//build the ion trace
Statement statement = session.connection().createStatement()

for (int mass = 0; mass < mz.length; mass++) {
    if (mass >= beginMass && mass <= endMass) {
        if (intensity[mass] > 0) {
            IonTrace trace = new IonTrace(ion: mz[mass], intensity: intensity[mass])
            trace.spectra = spectra

            //write the row straight into the partition for this ion, bypassing the check constraints
            statement.addBatch("insert into iontrace_${(int) mz[mass]}(id, intensity, ion, spectra_id) values (nextval('SEQ_TRACE'),${intensity[mass]},${(int) mz[mass]},${spectra.id})")
        }
    }
}

statement.executeBatch()


This worked quite well. As a consequence I removed the complete Hibernate layer from this application, since we had lost most of its benefits by now.

To further optimize the insert speed we replaced the Statements with PreparedStatements, which improved it by close to 50%.


LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
LOG: duration: 0.035 ms statement: EXECUTE [PREPARE: insert into iontrace_127(id, intensity, ion, spectra_id) values (nextval('seq_trace'),$1,$2,$3)]
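
For illustration, here is a minimal sketch of what such a PreparedStatement based batch insert could look like. It is not the exact production code: the variable names (session, mz, intensity, beginMass, endMass, spectra) are simply reused from the Statement example above, and keeping one prepared statement per partition is an assumption.

//one PreparedStatement per partition, reused for every row of that ion
Map<Integer, PreparedStatement> statements = [:]

for (int mass = 0; mass < mz.length; mass++) {
    if (mass >= beginMass && mass <= endMass && intensity[mass] > 0) {
        int ion = (int) mz[mass]

        PreparedStatement ps = statements[ion]
        if (ps == null) {
            ps = session.connection().prepareStatement(
                "insert into iontrace_" + ion + "(id, intensity, ion, spectra_id) values (nextval('seq_trace'),?,?,?)")
            statements[ion] = ps
        }

        ps.setDouble(1, intensity[mass])
        ps.setInt(2, ion)
        ps.setLong(3, spectra.id)
        ps.addBatch()
    }
}

//flush all batches and release the statements
statements.values().each { PreparedStatement ps ->
    ps.executeBatch()
    ps.close()
}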



Storage

As you can imagine, this number of rows can take up quite some space. An initial estimate says it's roughly 20 TB with the current table schema.

That is 20 times more than the files occupy on the hard drive.

So what can you do to keep the storage requirements down? (A sketch of a slimmed-down table definition follows the list.)
  • use the smallest data types possible
  • only index what actually needs to be indexed
  • remove columns which are not necessary; for example, we don't need a primary key on the ion trace table
  • estimate the amount of data you will realistically store; since we won't store billions of spectra, we can use integer instead of bigint
  • store intensities as integer and not as double
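
As an illustration of these rules, here is a sketch of how a slimmed-down iontrace table could be defined. This is not the schema used earlier in the post (note that the surrogate id column from the insert examples is dropped, per the list above), and the connection settings are placeholders.

import groovy.sql.Sql

//placeholder connection settings
def sql = Sql.newInstance('jdbc:postgresql://localhost/netcdf-repository',
        'username', 'password', 'org.postgresql.Driver')

sql.execute('''
    CREATE TABLE iontrace (
        spectra_id integer  NOT NULL,  -- integer is enough, we will not store billions of spectra
        ion        smallint NOT NULL,  -- the m/z value, far below the smallint limit
        intensity  integer  NOT NULL   -- stored as integer instead of double
    )
''')

sql.close()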

To estimate the required storage we used the following query: it divides the current database size by the number of netcdf files already loaded and multiplies that per-file size by 40,000 files, which corresponds to roughly 4 years of netcdf files.


netcdf-repository=# SELECT pg_size_pretty(pg_database_size('netcdf-repository')/(select count(*) from netcdf)* 40000) as "estimated database size";
estimated database size
-------------------------
16 TB
(1 row)


A server with this amount of storage would currently cost around $7,000.


-END-

Friday, January 22, 2010

protecting jboss


In the last couple of weeks we released more and more of our BinBase tools to make it possible to actually work with the database, which means we have to protect our data better.

Since I had no time at work for this and JBoss doesn't provide a convenient way to do it, I decided to write a little tool which does this for me.

So I created yet another Google Code project, called 'jboss-ip-filter', which does nothing more than provide an interceptor that intercepts all method calls and checks whether the caller's IP is in a list of registered IP addresses.

Features
  • protect EJB 3.x services
  • protect EJB 2.x services
  • IPs can be defined as regular expressions to support subnets
Configuration/Installation

First you need to download the latest release and copy it into the JBoss library directory of your chosen configuration.

Afterwards you need to register the interceptor in the jboss configuration.

Example

vim /usr/local/jboss/server/all/conf/standardjboss.xml

Go to the part about the container configurations and register the interceptor in the first position for every ejb configuration you want to protect.

The name of the class is: 'com.blogspot.codingandmore.jboss.filter.SessionInterceptor'


<container-configuration>
<container-name>Standard CMP 2.x EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-unified-invoker</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<insert-after-ejb-post-create>false</insert-after-ejb-post-create>
<call-ejb-store-on-clean>true</call-ejb-store-on-clean>
<container-interceptors>
<interceptor>com.blogspot.codingandmore.jboss.filter.SessionInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor>org.jboss.ejb.plugins.CallValidationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor</interceptor>
<interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntitySynchronizationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.cmp.jdbc.JDBCRelationInterceptor</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool</instance-pool>
<instance-cache>org.jboss.ejb.plugins.InvalidableEntityInstanceCache</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.cmp.jdbc.JDBCStoreManager</persistence-manager>
<locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock</locking-policy>
<container-cache-conf>
<cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy</cache-policy>
<cache-policy-conf>
<min-capacity>50</min-capacity>
<max-capacity>1000000</max-capacity>
<overager-period>300</overager-period>
<max-bean-age>600</max-bean-age>
<resizer-period>400</resizer-period>
<max-cache-miss-period>60</max-cache-miss-period>
<min-cache-miss-period>1</min-cache-miss-period>
<cache-load-factor>0.75</cache-load-factor>
</cache-policy-conf>
</container-cache-conf>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
<commit-option>B</commit-option>
</container-configuration>


After this is done you need to restart your server; on the next start it generates a properties file in the directory the server was started from. In this file you configure the IP addresses that are allowed to connect.

For example, if you started the server from the bin directory, the file will be found there:


vim /usr/local/jboss/bin/ip-filter-config.properties


The IP address of the local host is always registered.

The following two lines allow the host '128.120.136.154' to connect but refuse connections to the EJBs from any other host.


128.120.136.154 = true
\b(?:\d{1,3}\.){3}\d{1,3}\b = false
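
To make the allow/deny semantics of this file a bit more tangible, here is a small Groovy sketch of one plausible way to evaluate such rules. This is not the actual jboss-ip-filter code; it only illustrates the reading that every key is a regular expression and a caller is accepted as soon as one matching pattern is mapped to true.

/**
 * sketch only, not the jboss-ip-filter implementation:
 * a caller is allowed if at least one pattern matching its IP maps to 'true'
 */
boolean isAllowed(String callerIp, File ruleFile) {
    boolean allowed = false

    ruleFile.eachLine { line ->
        if (!line.contains('=') || line.trim().startsWith('#')) return  //skip blanks, comments and malformed lines

        def (pattern, value) = line.split('=', 2)*.trim()
        if (callerIp.matches(pattern) && value.toBoolean()) {
            allowed = true
        }
    }
    return allowed
}

//with the two rules above only 128.120.136.154 gets through
def rules = new File('/usr/local/jboss/bin/ip-filter-config.properties')
assert  isAllowed('128.120.136.154', rules)
assert !isAllowed('10.0.0.1', rules)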


If you encounter any problems, please don't hesitate to contact me and I will try to help.