Friday, February 26, 2010

backup! we got REG Expressions!

Recently I was handed an interesting task:
please extract all words out of a text...
OK, simple enough: we know regular expressions and we know word boundaries.

So we just do

\b\w+\b


and this simple expression applied to the following sentence
Glucose (Glc), a monosaccharide (or simple sugar) also...
gives us the following list
  • Glucose
  • Glc
  • a
  • monosaccharide
  • or
  • simple
  • sugar
  • also
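For anyone who wants to reproduce the list, here is a minimal Java sketch of that first attempt. The class and method names (SimpleWordExtractor, extractWords) are made up for illustration; only the expression and the sentence come from this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class SimpleWordExtractor {
    // The expression from above; backslashes are doubled inside a Java string literal.
    static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    // Collects every non-overlapping match of WORD in the text, in order of appearance.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "Glucose (Glc), a monosaccharide (or simple sugar) also...";
        // prints [Glucose, Glc, a, monosaccharide, or, simple, sugar, also]
        System.out.println(extractWords(sentence));
    }
}
```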
Sadly the task description didn't stop there and continued with the following tricky requirement...
...including as many chemical names as possible
OK, time to read up on regular expressions in Groovy and Java.
Now let's discover a regular expression which helps us with this.

Ok let's try this again with a different sentence
...as glucose, only one of which (D-glucose) is biologically...
our first expression would miss
  • D-glucose
and return this instead
  • D
  • glucose
so we need to modify it a bit to include the leading separation (a single character followed by a dash). So it becomes

\b(\w\-)*\w+\b
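
A small Java sketch of this second attempt, under the same assumptions as before (the names DashedWordExtractor and extractWords are invented for the example). It also runs the expression against one of the harder names to show where it starts to fall apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class DashedWordExtractor {
    // Zero or more "single word character + dash" prefixes, then a plain word.
    static final Pattern WORD = Pattern.compile("\\b(\\w\\-)*\\w+\\b");

    // Collects every non-overlapping match of WORD in the text, in order of appearance.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [as, glucose, only, one, of, which, D-glucose, is, biologically]
        System.out.println(extractWords(
                "...as glucose, only one of which (D-glucose) is biologically..."));

        // The comma is not covered, so the name is split in two:
        // prints [1, 3-Diaminopropane]
        System.out.println(extractWords("1,3-Diaminopropane"));
    }
}
```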

and the day is saved, until we try a new sentence and try to discover compounds like

  1. 1,3-Diaminopropane
  2. N-(3S-hydroxydecanoyl)-L-serine
  3. 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane
  4. 2-(allyloxy)-1,3,5-trimethylbenzene
  5. 3-hydroxy-2-butanone
  6. 3,3'-Oxybis(1-propene)
  7. 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane
  8. 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
  9. 1,6,9,13-tetraoxadispiro[4.2.4.2]tetradecane
  10. 3-(N, N-Diethylamino)-1-propylamine
  11. D-Glucose
  12. (R)-3-Hydroxybutyric acid
Or, to make it more realistic, find all the words in this completely pointless and scientifically wrong text
 bunch of rumble to find 1,3-Diaminopropane in D-Glucose and 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane.
 It's also nice to have 3,3'-Oxybis(1-propene) or (R)-3-Hydroxybutyric acid. Last but not least I'm a huge fan of 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane.
 Also it's a great feeling if we can find (glucose) in brackets without finding statements like (help i'm surrounded by brackets).

So do you see a pattern here?
  • everything in () or [] or {} can be part of a chemical name, so we use ((\[.*\])|(\(.*\))|(\{.*\})) for this part
  • numbers or letters separated by a ',' and ending with a dash (such as 1,3-) can be part of a chemical name, so we use (\w+(,\w+)*\-) for this part
  • it all ends with a word \w+ or a ) (escaped as \) )
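
To get a feeling for the second building block on its own, here is a rough Java sketch. It is only an illustration, not the final expression: the class PrefixPartCheck and the method isCovered are invented names, and the pattern is simply the comma-and-dash part repeated and followed by a closing word.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class PrefixPartCheck {
    // Assumption for this sketch: the "(\w+(,\w+)*\-)" part from the second bullet,
    // repeated any number of times and followed by the closing word "\w+".
    static final Pattern PREFIXED_NAME = Pattern.compile("(\\w+(,\\w+)*\\-)*\\w+");

    // True only if the whole name, not just a part of it, fits the pattern.
    static boolean isCovered(String name) {
        return PREFIXED_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCovered("1,3-Diaminopropane"));     // prints true
        System.out.println(isCovered("3-hydroxy-2-butanone"));   // prints true
        System.out.println(isCovered("D-Glucose"));              // prints true
        // The apostrophe and the parentheses are not covered by this part alone.
        System.out.println(isCovered("3,3'-Oxybis(1-propene)")); // prints false
    }
}
```

The last line shows why the bracket part and the apostrophe handling are still needed on top of it.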
Putting these parts together, the following expression works for all of the names above

(\([\w\+]+(,\w+)*\)-)?\b[(\w+(,\w+[\'])*\-)*((\[.*\])|(\(.*\))|(\{.*\}))*\w+]+(\b)( (acid)|(anhydride)|(\sbenzoate)|(\sketone)|(\sether)|(\sester)|(\scyanide))?


except
  1. 3-(N, N-Diethylamino)-1-propylamine
  2. (glucose), where we get 'glucose)' instead of 'glucose'
No solution for 1 or 2 yet. Still trying to figure it out.

Now the nicest thing is the Groovy match operator, which together with a closure gets all the matches out of a text.


def match = (text =~ pattern)


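Congrats, now all your words are available through the match variable! It is a java.util.regex.Matcher, so you get the words by iterating over it. text is just a String containing our text, and pattern is the expression from above.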
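
If you are stuck with plain Java, the same thing looks roughly like the sketch below. Groovy's =~ operator creates a java.util.regex.Matcher, and this does it by hand. The names MatchCollector and findAll are made up for the example, and Matcher.results() assumes Java 9 or newer.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
class MatchCollector {
    // Compiles the pattern, runs it over the text and returns every match as a String.
    static List<String> findAll(String text, String pattern) {
        return Pattern.compile(pattern)
                .matcher(text)
                .results()                 // stream of all matches, Java 9+
                .map(MatchResult::group)   // the full matched text of each one
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [D-Glucose, and, water]
        System.out.println(findAll("D-Glucose and water", "\\b(\\w\\-)*\\w+\\b"));
    }
}
```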
