Friday, February 26, 2010

backup! we got REG Expressions!

recently I got handed an interesting task:
please extract all words out of a text...
ok, simple enough, we know regular expressions and we know word boundaries.

So we just do

\b\w+\b


and this simple expression applied to the following sentence
Glucose (Glc), a monosaccharide (or simple sugar) also...
gives us the following list
  • Glucose
  • Glc
  • a
  • monosaccharide
  • or
  • simple
  • sugar
  • also
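For reference, here is a minimal Java sketch of that first attempt. The class name WordBoundaryDemo and the helper findAll are made up for this example; only java.util.regex from the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {

    // Collects every match of the regular expression, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Glucose (Glc), a monosaccharide (or simple sugar) also...";
        // backslashes are doubled because the pattern lives in a Java string literal
        System.out.println(findAll("\\b\\w+\\b", text));
        // prints [Glucose, Glc, a, monosaccharide, or, simple, sugar, also]
    }
}
```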
Sadly the task description didn't stop there and continued with the following tricky statement...
...including as many chemical names as possible
Ok, time to read up on regular expressions in groovy and java.
Now let's discover a regular expression which helps us with this.

Ok let's try this again with a different sentence
...as glucose, only one of which (D-glucose) is biologically...
our first expression would miss
  • D-glucose
and instead return
  • D
  • glucose
so we need to modify it a bit to include a leading single character followed by a dash. So it becomes

\b(\w\-)*\w+\b
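
A quick way to compare the two expressions on this sentence is a small Java sketch like the one below. The class name HyphenPrefixDemo and the helper findAll are hypothetical; the patterns are the two from this post, with backslashes doubled for Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenPrefixDemo {

    // Collects every match of the regular expression, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "...as glucose, only one of which (D-glucose) is biologically...";

        // plain word boundaries split the name at the dash
        System.out.println(findAll("\\b\\w+\\b", text));
        // prints [as, glucose, only, one, of, which, D, glucose, is, biologically]

        // allowing "single character + dash" prefixes keeps the name together
        System.out.println(findAll("\\b(\\w\\-)*\\w+\\b", text));
        // prints [as, glucose, only, one, of, which, D-glucose, is, biologically]
    }
}
```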

and the day is saved, until we try a new sentence and try to discover compounds like

  1. 1,3-Diaminopropane
  2. N-(3S-hydroxydecanoyl)-L-serine
  3. 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane
  4. 2-(allyloxy)-1,3,5-trimethylbenzene
  5. 3-hydroxy-2-butanone
  6. 3,3'-Oxybis(1-propene)
  7. 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane
  8. 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
  9. 1,6,9,13-tetraoxadispiro[4.2.4.2]tetradecane
  10. 3-(N, N-Diethylamino)-1-propylamine
  11. D-Glucose
  12. (R)-3-Hydroxybutyric acid
Or to make it more realistic, find all the words in this completely pointless and scientifically wrong text
 bunch of rumble to find 1,3-Diaminopropane in D-Glucose and 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane.
 It's also nice to have 3,3'-Oxybis(1-propene) or (R)-3-Hydroxybutyric acid. Last bot not least I'm a huge fan or 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane.
 Also it's a great feeling if we can find (glucose) in brakets without finding statement like (help i'm surrounded by brackets).

So do you see a pattern here?
  • everything in () or [] or {} can be part of a chemical so we use ((\[.*\])|(\(.*\))) for this part 
  • everything separated by a ',' and followed by another character ending with a dash can be a chemical, so we use (\w+(,\w+)*\-) for this part
  • it all ends with a word \w+ or a ) (escaped as \) )
so this expression would work for all these

(\([\w\+]+(,\w+)*\)-)?\b[(\w+(,\w+[\'])*\-)*((\[.*\])|(\(.*\))|(\{.*\}))*\w+]+(\b)( (acid)|(anhydride)|(\sbenzoate)|(\sketone)|(\sether)|(\sester)|(\scyanide))?


except
  1. 3-(N, N-Diethylamino)-1-propylamine
  2. (glucose): instead of 'glucose' we get 'glucose)'
no solution for 1 or 2 yet. Still trying to figure it out.

now the nicest thing is the groovy closure + match example to get all the content in a text.


def match = (text =~ pattern)


congrats, now all your words are in the match variable! text is just a string containing our text, and pattern holds the regular expression.

Wednesday, February 24, 2010

rocks linux cluster - mounting an nfs share on all nodes

after setting up the latest cluster I tried to provide a couple of nfs shares to all nodes, since the users demanded this.

Well, in rocks linux it's rather simple once you understand the concept behind it.

So here is a step-by-step tutorial.

  • go to the profile directory
  • cd /export/rocks/install/site-profiles/5.3/nodes/
  • make a copy of the skeleton file
  • cp skeleton.xml extend-compute.xml
  • edit the file to tell it that we need to create a directory and add a line to the fstab. The right place for this is the post section


    mkdir -p /mnt/share

    <file name="/etc/fstab" mode="append">
    server:/mount /mnt/share nfs defaults 0 0
    </file>

  • change back to the main install dir
  • cd /export/rocks/install
  • rebuild the rocks distribution
  • rocks create distro
  • rebuild nodes
  • ssh compute-0-0 '/boot/kickstart/cluster-kickstart'

congratulations, if you did everything right your node should now boot up and have the share mounted.

playing around with threads

currently I got back to my hobby and am playing a bit with multithreading to tune an algorithm.

So the first step was to write a simple class to test if the threading api works on my system, and what is better for that than calculating primes?



import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Test {

    public static void main(String args[]) throws InterruptedException {

        // one worker thread per available cpu core
        ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        for (int i = 0; i < 500000000; i++) {
            // the anonymous class can only capture final locals
            final int position = i;
            service.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        int i = position;
                        // trial division, counting down from ceil(sqrt(i)) to 2
                        int i1 = (int) Math.ceil(Math.sqrt(i));
                        boolean isPrimeNumber = false;
                        while (i1 > 1) {

                            if ((i != i1) && (i % i1 == 0)) {
                                isPrimeNumber = false;
                                break;
                            }
                            else if (!isPrimeNumber) {
                                isPrimeNumber = true;
                            }

                            --i1;
                        }

                        if (isPrimeNumber) {
                            System.out.println(Thread.currentThread().getName() + " - prime " + i);
                        }

                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        // accept no new tasks, then wait for the queued ones to finish
        service.shutdown();
        service.awaitTermination(4000, TimeUnit.DAYS);
    }
}




and the output looks nice:

pool-1-thread-2 - prime 469237
pool-1-thread-2 - prime 469241
pool-1-thread-2 - prime 469253
pool-1-thread-4 - prime 466553
pool-1-thread-4 - prime 469267
pool-1-thread-4 - prime 469279
pool-1-thread-2 - prime 469283
pool-1-thread-3 - prime 467869
pool-1-thread-3 - prime 469303
pool-1-thread-2 - prime 469321

while all 4 CPUs are at 100% use.

translation: the java executor api seems to work quite well.
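
One thing to keep in mind with the test above: it hands the executor one Runnable per number, and a pool created with newFixedThreadPool keeps waiting tasks in an unbounded queue. A possible variation, sketched below under my own assumptions (the class ChunkedPrimes, the helpers isPrime and countPrimes and the chunk size are all made up for this example), is to give each task a whole range of numbers and collect a count per range instead of printing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedPrimes {

    // Plain trial division up to sqrt(n).
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Counts the primes in [0, limit), one task per range of chunkSize numbers.
    static int countPrimes(int limit, int chunkSize) throws Exception {
        ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int start = 0; start < limit; start += chunkSize) {
                final int from = start;
                final int to = Math.min(limit, start + chunkSize);
                Callable<Integer> task = () -> {
                    int count = 0;
                    for (int n = from; n < to; n++) {
                        if (isPrime(n)) {
                            count++;
                        }
                    }
                    return count;
                };
                results.add(service.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // blocks until that range is done
            }
            return total;
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countPrimes(1000000, 10000));
    }
}
```

With a limit of 1,000,000 and chunks of 10,000 this queues 100 tasks rather than a million, which keeps the queue small while still keeping all cores busy.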

Now time to tune the binbase algorithm...

Saturday, February 20, 2010

postgres and insert performance with grails

recently I spent a lot of time tuning a postgres database and an algorithm (based on grails) to try to insert 80M chemical compounds into a database, with reference checks and assurance of their uniqueness.


The main issue is that the inserts get slower and slower over time and end up taking too long. To improve the performance we started with the following approach:
  • more memory
  • database indexes (obvious)
  • flush the hibernate session every 100 inserts
  • flush the grails cache every 100 inserts
  • tune several postgres parameters
  • reindex the database every 10k inserts
  • run vacuum every 10k rows
  • run analyze every 10k rows
  • set auto vacuum to check every 10 minutes
But after all this work we are still stuck with the problem: the inserts get gradually slower over time, which may be related to the index building time.
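
To illustrate the "flush every 100 inserts" step from the list above, here is a minimal Java sketch of the batching pattern. It is not the actual grails code: SessionLike is a hypothetical stand-in interface (Hibernate's Session has save(), flush() and clear() methods that play these roles), and CountingSession only exists to make the example runnable without a database.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchInsertSketch {

    // Hypothetical stand-in for the persistence session used by the real code.
    interface SessionLike<T> {
        void save(T entity);
        void flush();  // push pending inserts to the database
        void clear();  // drop cached entities so the session does not grow
    }

    // Saves all entities, flushing and clearing the session every batchSize inserts.
    static <T> int insertAll(Iterable<T> entities, SessionLike<T> session, int batchSize) {
        int inserted = 0;
        for (T entity : entities) {
            session.save(entity);
            inserted++;
            if (inserted % batchSize == 0) {
                session.flush();
                session.clear();
            }
        }
        // final flush for whatever is left over
        session.flush();
        session.clear();
        return inserted;
    }

    // Test double that only counts the calls.
    static class CountingSession implements SessionLike<String> {
        int saves = 0;
        int flushes = 0;
        public void save(String entity) { saves++; }
        public void flush() { flushes++; }
        public void clear() { }
    }

    public static void main(String[] args) {
        List<String> compounds = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            compounds.add("compound-" + i);
        }
        CountingSession session = new CountingSession();
        insertAll(compounds, session, 100);
        System.out.println(session.saves + " saves, " + session.flushes + " flushes");
        // prints 250 saves, 3 flushes
    }
}
```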

Graph over 1M inserts: TODO, it's still being calculated...

Tuesday, February 9, 2010

Postgres and utf8

Recently I discovered that my postgres database is not running with UTF-8 support, which causes all kinds of headaches,

so what do you have to do to change this?

  • back up your database
  • set the LANG environment variable: LANG=en_US.UTF-8
  • execute initdb
  • recreate the database and specify the encoding: createdb --encoding=unicode
  • import your data

you should now have a working utf-8 database

Wednesday, February 3, 2010

does it have to be this way?

[INFO] [compiler:compile]
[INFO] Compiling 399 source files to /Users/wohlgemuth/Documents/workspace-private/oscar3-chem/branches/cdk-1.3.1/target/classes
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Compilation failure
Failure executing javac, but could not parse the error:
An exception has occurred in the compiler (1.5.0_20). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report.  Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file org/openscience/cdk/annotations/TestClass.class not found



Failure executing javac, but could not parse the error:
An exception has occurred in the compiler (1.5.0_20). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report.  Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file org/openscience/cdk/annotations/TestClass.class not found

I mean come on, do I really have to deal with compiler bugs now...

correction


nvm forgot the cdk annotations library....



cdk-maven-mojos

recently I'm doing an awful lot with the CDK library, and since I always 'like' to work with maven I thought it's time to write a couple of mojos to help me with the CDK work.

The first one on the list is a mojo which deploys the cdk library to my local maven repository; it can be found on google code.

There are most likely more mojos to come as I work more and more with the cdk.

Tuesday, February 2, 2010

regular expressions for common chemical identifiers

This is basically a small collection of regular expressions which I use from time to time to distinguish chemical identifiers. Please feel free to add more to the list to make it grow and become more complete.


The first line is the name, the second is the valid groovy/java version. All are validated with at least a thousand examples

std inchi 

InChI=1S/([^/]+)(?:/[^/]+)*\\S

std inchiKey 

[A-Z]{14}-[A-Z]{10}-[A-Z0-9]


CAS

\\d{1,7}-\\d\\d-\\d

KEGG

C\\d{5}

LipidMaps

LMFA[0-9]{8}

HMDB

HMDB[0-9]*
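
As a rough sketch of how the expressions above could be used together in Java: the class IdentifierDemo and the method classify are made up for this example, and the sample identifiers in main are only there to show the formats.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class IdentifierDemo {

    // name -> compiled pattern, in the order of the list above
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("std inchi", Pattern.compile("InChI=1S/([^/]+)(?:/[^/]+)*\\S"));
        // last character class written without a comma, which inside [] would match a literal ','
        PATTERNS.put("std inchiKey", Pattern.compile("[A-Z]{14}-[A-Z]{10}-[A-Z0-9]"));
        PATTERNS.put("CAS", Pattern.compile("\\d{1,7}-\\d\\d-\\d"));
        PATTERNS.put("KEGG", Pattern.compile("C\\d{5}"));
        PATTERNS.put("LipidMaps", Pattern.compile("LMFA[0-9]{8}"));
        PATTERNS.put("HMDB", Pattern.compile("HMDB[0-9]*"));
    }

    // Returns the name of the first pattern matching the whole string, or "unknown".
    static String classify(String id) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(id).matches()) {
                return entry.getKey();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String[] samples = {
            "InChI=1S/H2O/h1H2",
            "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
            "7732-18-5",
            "C00001",
            "LMFA01010001",
            "HMDB0002111",
            "glucose"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

matches() requires the whole string to fit the pattern; to pull identifiers out of running text, find() on a Matcher would be the call to use instead.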