Using Regex fragmenter to extract paragraphs

Mark Ferguson Fri, 12 Dec 2008 13:38:18 -0800

Hello,

I am trying to use the regex fragmenter and am having a hard time getting
the results I want. I am trying to get fragments that start on a word
character and end on punctuation, but the fragments being returned don't
seem to adjust to the pattern at all, even though I've provided a large
slop. Here are the relevant parameters I'm using. Maybe someone can help
point out where I've gone wrong:


<str name="hl.fragsize">500</str>
<str name="hl.fragmenter">regex</str>
<str name="hl.regex.slop">0.8</str>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="q">chinese</str>

This should match between 400 and 600 characters, beginning with a word
character and ending with one of . ! or ?. Here is an example of a typical
result:

. Check these pictures out. Nine panda cubs on display for the first time
Thursday in southwest China. They're less than a year old. They just
recently stopped nursing. There are only 1,600 of these guys left in the
mountain forests of central China, another 120 in <span
class='hl'>Chinese</span> breeding facilities and zoos. And they're about 20
that live outside China in zoos. They exist almost entirely on bamboo. They
can live to be 30 years old. And these little guys will eventually get much
bigger. They'll grow

As you can see, it starts with a period and ends on a word character!
It's almost as if the fragments are just coming out as they will and the
regex isn't doing anything at all, yet the results are different when I use
the gap fragmenter. In the above result I don't see any reason why it
shouldn't have stripped out the leading period and the last two words.
There is plenty of room in the slop and in the regex pattern. Please help me
figure out what I'm doing wrong...
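
To make the intent concrete, here is a rough sketch, outside Solr, of the
kind of match I'm expecting. It is only an illustration under a few
assumptions: it uses Python's re module rather than the Java regex engine
Solr runs on, it uses a scaled-down length range so a short sample works,
and it writes the bounded repetition as .{lo,hi}, which is my reading of
the intent and not the exact pattern from my config. The function name and
the sample text are made up for the example.

```python
import re


def expected_fragment(text, lo, hi):
    """Return the first span that starts on a word character, runs for
    lo..hi further characters, and ends on '.', '!' or '?'.

    Returns None when no such span exists. This is a hypothetical helper
    for illustration only; it is not part of Solr or its highlighter.
    """
    # DOTALL lets '.' cross line breaks, since fragments may span lines.
    pattern = re.compile(r"\w.{%d,%d}[.!?]" % (lo, hi), re.DOTALL)
    match = pattern.search(text)
    return match.group(0) if match else None


# Shortened version of the fragment I actually got back.
sample = ". Check these pictures out. Nine panda cubs on display. They'll grow"

# Scaled-down bounds (assumption): 20..60 instead of 400..600.
fragment = expected_fragment(sample, 20, 60)
print(fragment)
```

With a match like this, the leading period and the trailing "They'll grow"
would both be dropped, which is what I was hoping the regex fragmenter
would do with my real pattern and slop.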

Thanks a lot,

Mark Ferguson
