Please extract all words out of a text... OK, simple enough: we know regular expressions and we know word boundaries.
So we just do
\b\w+\b
and this simple expression, applied to the following sentence
Glucose (Glc), a monosaccharide (or simple sugar) also...
gives us the following list
- Glucose
- Glc
- a
- monosaccharide
- or
- simple
- sugar
- also
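As a minimal sketch of this first attempt, here is the same expression in plain Java (Groovy runs on the same java.util.regex engine). The class name WordBoundaryDemo and the helper findAll are made up for illustration and are not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class WordBoundaryDemo {

    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sentence = "Glucose (Glc), a monosaccharide (or simple sugar) also";
        // In a Java string literal every backslash of the regex is doubled.
        System.out.println(findAll("\\b\\w+\\b", sentence));
        // prints [Glucose, Glc, a, monosaccharide, or, simple, sugar, also]
    }
}
```

The brackets and the comma are dropped because they are not word characters, which is fine for plain words but is exactly what hurts once chemical names show up.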
...including as many chemical names as possible.
OK, time to read up on regular expressions in Groovy and Java.
Now let's discover a regular expression which helps us with this.
OK, let's try this again with a different sentence
...as glucose, only one of which (D-glucose) is biologically...
Our first expression would miss
- D-glucose
and return this instead
- D
- glucose
so we need to modify it a bit to include the part before the first separator, the hyphen. So it becomes
\b(\w\-)*\w+\b
and the day is saved, until we try a new sentence and try to discover compounds like
- 1,3-Diaminopropane
- N-(3S-hydroxydecanoyl)-L-serine
- 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane
- 2-(allyloxy)-1,3,5-trimethylbenzene
- 3-hydroxy-2-butanone
- 3,3'-Oxybis(1-propene)
- 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane
- 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
- 1,6,9,13-tetraoxadispiro[4.2.4.2]tetradecane
- 3-(N, N-Diethylamino)-1-propylamine
- D-Glucose
- (R)-3-Hydroxybutyric acid
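To see both the fix and its limits, here is a small sketch that runs the first expression and the hyphen-aware one side by side. The class HyphenPrefixDemo and its helper are hypothetical names chosen for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class HyphenPrefixDemo {

    // The first attempt and the hyphen-aware variant from the text.
    static final Pattern PLAIN = Pattern.compile("\\b\\w+\\b");
    static final Pattern HYPHEN = Pattern.compile("\\b(\\w\\-)*\\w+\\b");

    // Collects every non-overlapping match, left to right.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sentence = "as glucose, only one of which (D-glucose) is biologically";
        System.out.println(findAll(PLAIN, sentence));
        // prints [as, glucose, only, one, of, which, D, glucose, is, biologically]
        System.out.println(findAll(HYPHEN, sentence));
        // prints [as, glucose, only, one, of, which, D-glucose, is, biologically]

        // The comma is still a hard stop, so the locant list gets cut off.
        System.out.println(findAll(HYPHEN, "1,3-Diaminopropane"));
        // prints [1, 3-Diaminopropane]
    }
}
```

The (\w\-)* part only accepts a single word character before each hyphen, so it covers D-glucose but not names with commas, brackets or longer prefixes.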
Or, to make it more realistic, find all the words in this completely pointless and scientifically wrong text
bunch of rumble to find 1,3-Diaminopropane in D-Glucose and 1,1,1,2,2,3,3,4,4-nonafluoro-4-(1,1,2,2,3,3,4,4,4-nonafluorobutoxy)butane. It's also nice to have 3,3'-Oxybis(1-propene) or (R)-3-Hydroxybutyric acid. Last but not least I'm a huge fan of 3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane. Also it's a great feeling if we can find (glucose) in brackets without finding statements like (help i'm surrounded by brackets).
So do you see a pattern here?
- everything in () or [] or {} can be part of a chemical, so we use ((\[.*\])|(\(.*\))) for this part (the {} case is handled the same way, with (\{.*\}))
- everything separated by a ',' and followed by another character, ending with a dash, can be part of a chemical, so we use (\w+(,\w+)*\-) for this part
- it all ends with a word \w+ or a ) (escaped as \) )
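The building blocks from the list above can be tried in isolation before gluing them together. This is only a sketch: the class BuildingBlocksDemo and its helper are hypothetical names, and the inputs are taken from the compound list and the first example sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BuildingBlocksDemo {

    // Bracketed part: anything between [ ] or ( ).
    static final Pattern BRACKETED = Pattern.compile("((\\[.*\\])|(\\(.*\\)))");
    // Comma-separated prefix that ends with a dash.
    static final Pattern PREFIX = Pattern.compile("(\\w+(,\\w+)*\\-)");

    // Collects every non-overlapping match, left to right.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String name = "3,9-divinyl-2,4,8,10-tetraoxaspiro[5.5]undecane";
        // Joined with " | " because the matches themselves contain commas.
        System.out.println(String.join(" | ", findAll(PREFIX, name)));
        // prints 3,9- | divinyl- | 2,4,8,10-
        System.out.println(String.join(" | ", findAll(BRACKETED, name)));
        // prints [5.5]

        // .* is greedy: it runs from the first "(" to the last ")".
        String sentence = "Glucose (Glc), a monosaccharide (or simple sugar) also";
        System.out.println(String.join(" | ", findAll(BRACKETED, sentence)));
        // prints (Glc), a monosaccharide (or simple sugar)
    }
}
```

The last call is worth keeping in mind: because .* is greedy, a bracket block can swallow ordinary prose between two unrelated pairs of brackets.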
So this expression would work for all of these
(\([\w\+]+(,\w+)*\)-)?\b[(\w+(,\w+[\'])*\-)*((\[.*\])|(\(.*\))|(\{.*\}))*\w+]+(\b)( (acid)|(anhydride)|(\sbenzoate)|(\sketone)|(\sether)|(\sester)|(\scyanide))?
except
- 3-(N, N-Diethylamino)-1-propylamine
- glucose in brackets, where instead we get 'glucose)'
No solution for either of these two cases yet. Still trying to figure it out.
Now the nicest thing is the Groovy closure + match example to get all the content in a text.
def match = (text =~ pattern)
Congrats, now all your words are in the match variable! Here text is just a string containing our text.
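For readers on plain Java, a rough equivalent is sketched below. Groovy's =~ operator hands back a java.util.regex.Matcher, and since Java 9 that matcher can be turned into a stream of results. The class MatchCollectDemo and the method words are hypothetical names, and the simpler hyphen-aware expression stands in for the full pattern, which still has open cases.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
class MatchCollectDemo {

    // Returns the full text of every match, in order of appearance.
    static List<String> words(Pattern pattern, String text) {
        return pattern.matcher(text)
                .results()                 // Java 9+: one MatchResult per successful find()
                .map(MatchResult::group)   // group() is the whole match, not a sub-group
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in pattern (an assumption for this sketch): the hyphen-aware expression.
        Pattern pattern = Pattern.compile("\\b(\\w\\-)*\\w+\\b");
        String text = "find D-Glucose in (glucose)";
        System.out.println(words(pattern, text));
        // prints [find, D-Glucose, in, glucose]
    }
}
```

Swapping in a different expression only means changing the string passed to Pattern.compile; the collecting code stays the same.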