i have a program written in Icon that does basic sentence splitting. with about 5
heuristics and one small lookup table, i can get well over 90% accuracy doing sentence
boundary detection on email. for well edited English text, like newswires, i can
manage closer to 99%. this is all that is
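The kind of heuristic splitter Herb describes might look like this in Java (a minimal sketch only — the three rules and the abbreviation table below are illustrative guesses, not his Icon code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleSentenceSplitter {
    // small lookup table: tokens whose trailing '.' should not end a sentence
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
            Arrays.asList("mr", "mrs", "dr", "vs", "etc", "e.g", "i.e"));

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && c != '!' && c != '?') continue;
            // heuristic 1: boundary char must be followed by whitespace, then an uppercase letter
            int j = i + 1;
            while (j < text.length() && Character.isWhitespace(text.charAt(j))) j++;
            if (j == i + 1 || j >= text.length()) continue;       // no whitespace after, or end of text
            if (!Character.isUpperCase(text.charAt(j))) continue;
            // heuristic 2: the token before '.' must not be a known abbreviation
            String prev = lastToken(text, i).toLowerCase();
            if (c == '.' && ABBREVIATIONS.contains(prev)) continue;
            // heuristic 3: don't break after a bare number, e.g. a numbered-list item "3."
            if (c == '.' && i > 0 && Character.isDigit(text.charAt(i - 1))) continue;
            sentences.add(text.substring(start, i + 1).trim());
            start = j;
        }
        if (start < text.length()) sentences.add(text.substring(start).trim());
        return sentences;
    }

    private static String lastToken(String text, int end) {
        int k = end;
        while (k > 0 && !Character.isWhitespace(text.charAt(k - 1))) k--;
        return text.substring(k, end);
    }
}
```

A handful of rules like these plus a small table is roughly the shape of the 90%+ approach described above; the residual errors come from sentence-initial lowercase, unusual abbreviations, and quoted punctuation.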
i am stuck with company policy with respect to open source project participation. this
is why i am dropping some fairly detailed hints of what has to be done instead of
doing it myself. this policy may change in the next year, but by then, i will have to
be working with a solution and not just
i am not implying rejection of a match across sentence boundaries; i am saying that it
receives a lower score than a match contained within a single sentence.
Herb
-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:15 PM
To: Lucene Users List
you cannot layer sentence boundary detection on top of Lucene and post-process the hit
list without effectively building a completely new search engine index. if i am going
to go to that much trouble, there is no point in using Lucene at all.
Herb
-Original Message-
From: Tatu Saloranta
respecting sentence boundaries and using them to affect a document's score in the
ranking algorithm requires linguistic knowledge, not NLP knowledge. think about it.
Herb
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 9:13 PM
To:
now you're talking. this is one way of doing it. you need to work out a heuristic to
increment the counter enough that a misrecognized long sentence won't trigger this.
however, one can argue that a sentence that contains 1000 words can't possibly be
about one topic.
Herb
-Original
There is already an implementation in the Java API for sentence boundary
detection. The BreakIterator in the java.text package has this to say about
sentence splitting:
Sentence boundary analysis allows selection with correct interpretation of
periods within numbers and abbreviations, and
i don't know what the Java implementation is like but the C++ one is very fast.
Herb
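For reference, the java.text.BreakIterator mentioned above can be exercised with a few lines; this is a minimal usage sketch, not a claim about its accuracy in production:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    /** Split text into sentences using the JDK's built-in sentence iterator. */
    public static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // walk boundary pairs: each (start, end) delimits one sentence
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("The value is 3.14 today. All good.")) {
            System.out.println(s);
        }
    }
}
```

As the quoted documentation says, the period inside "3.14" is not treated as a boundary, so this input yields two sentences.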
-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:39 AM
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in
Hi,
it actually is quite nice and it can be used in production for such things as have
been discussed
lately in this group.
If you want to play it safe: the iterator breaks at dots after numbers (e.g. "15.
March"); the precision
of the algorithm can be increased if you never break after a
My only concern with this being integrated into lucene is that it be done in
a way that doesn't make its use mandatory. Lucene is powerful enough that
it can be used for a lot of cases where NLP doesn't make any sense. For
example, I think that sentence boundaries would severely screw up the
show an example document.
Herb
-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
Re: Vector Space Model in Lucene?])
My only
Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query - there are times that an exact term match is desired
without the query expansion that an NLP process normally requires.
Joe
- Original
i said nothing about NLP. in fact my specific statements exclude NLP. the
processing i am describing covers a linguistic observation and a constraint. a
sequence of terms in the query receives a higher score when it occurs inside a single
sentence than when it crosses a sentence
I'm not sure I can share a sample, but the specific situation I'm thinking
of is when you have data that doesn't exist within a sentence, for example
the name, address, etc of a company. Some foreign companies have funky
punctuation within their names and addresses.
I'd have to see the results
Joe Paulsen wrote:
Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query - there are times that an exact term match is desired
without the query expansion that an NLP process normally requires.
then sentence detection at indexing time shouldn't see them as sentences. no sentence
detection is run on the query terms.
Herb...
-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:27 AM
To: 'Lucene Users List'
Subject: RE: Contributing to
the core of the search engine has to have certain capabilities, however, because they
are next to impossible to add as a layer on top with any efficiency. detecting
sentence boundaries outside the core search engine is really hard to do without
building another search engine index. if i have to
Dmitry once contributed a nice beefy patch that added Term Vector
support to Lucene. While we never integrated the changes (for no good
reason), I do recall that the patch was nice and elegant, because it
allowed one to turn Term Vector support on/off at indexing time.
If turned on, Lucene would
Hello,
I am considering using the document id in order to implement a fast 'join' during
relational search.
My first question is: should I steer clear of this altogether? And why? If not, I
need to know which Lucene operations can cause document ids to change.
I am assuming that the
Tate Avery wrote:
My first question is: should I steer clear of this altogether?
No, I think this is appropriate.
If not, I need to know which Lucene operations can cause document ids to change.
I am assuming that the following can cause potential changes:
1) Add document
2)
Karsten Konrad wrote:
I was wondering whether we could, while indexing, make use of this by
increasing the position counter by a large number, let's say 1000,
whenever we encounter a sentence separator (Note, this is not trivial;
not every '.' ends a sentence etc. etc. etc.). Thus, searching
any arbitrary number you pick will be broken by some document someone puts into the
system.
Herb
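The position-gap idea quoted above can be sketched in plain Java, independent of any particular Lucene version (the class name and the gap of 1000 are illustrative only, and the whitespace tokenization is deliberately naive):

```java
import java.util.ArrayList;
import java.util.List;

/** A (term, position) pair; in Lucene terms, the gap would become a position increment. */
class PositionedToken {
    final String term;
    final int position;
    PositionedToken(String term, int position) { this.term = term; this.position = position; }
}

public class SentenceGapAssigner {
    // illustrative; as noted above, any fixed gap can be beaten by a long enough sentence
    static final int SENTENCE_GAP = 1000;

    /** Assign positions to whitespace-split tokens, jumping by SENTENCE_GAP after . ! or ? */
    public static List<PositionedToken> assign(String text) {
        List<PositionedToken> out = new ArrayList<>();
        int pos = 0;
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) continue;
            boolean endsSentence = raw.endsWith(".") || raw.endsWith("!") || raw.endsWith("?");
            String term = raw.replaceAll("[.!?]+$", "").toLowerCase();
            if (!term.isEmpty()) out.add(new PositionedToken(term, pos));
            pos += endsSentence ? SENTENCE_GAP : 1;   // big jump across a sentence boundary
        }
        return out;
    }
}
```

With this scheme, a proximity query with a slop smaller than the gap can never match terms in different sentences, which is the effect Karsten is after.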
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector
you could use the negative of the actual value.
Herb
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
This is exactly the
Ah, I see. You have an absolute interpretation. I am more relative. I
think we're talking about a heuristic, not a law.
Matches within a sentence are scored higher than those that are not.
And the closer matching the terms are, whether within the same sentence
or not, the greater the score.
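The heuristic described above can be written down as a toy scoring function (the decay shape and the bonus constant are made up for illustration, not tuned values):

```java
public class ProximityScore {
    /**
     * Toy score for one pair of matched term positions: closer terms score
     * higher regardless of sentence, and a same-sentence match gets a
     * multiplicative bonus on top. Constants are illustrative only.
     */
    public static double score(int pos1, int pos2, boolean sameSentence) {
        int distance = Math.abs(pos1 - pos2);
        double base = 1.0 / (1.0 + distance);      // decays smoothly with distance
        return sameSentence ? 2.0 * base : base;   // within-sentence bonus
    }
}
```

This matches both halves of the statement: a same-sentence pair always beats the same pair across a boundary, and among cross-sentence pairs the closer one still wins.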
Dear Brian:
From the list I found that you have JavaCC 2.0. Would you please send
the package to me? I could not find it anywhere else.
Thanks
Jianshuo
If you create two parallel indices (to use different parsing methods for
instance), and always add and delete documents in parallel, will the
document IDs always correspond in both indices? And could optimization
destroy any such invariance?
-Original Message-
From: Doug Cutting
If they're optimized at different times then the document ids could get
out of sync, as the optimized version will have deleted documents
removed, while the un-optimized one won't.
Also, for add/delete to keep document ids in sync you need to also be
sure to use the same mergeFactor.
Doug
Hi,
So we've implemented both suggestions and it made a big difference.
You can see a Beta sample at
http://www.fastbuzz.com/search/index.jsp
We have around 7,000,000 items in the index.
What we did:
1. Instead of using msec granularity, we're using hour granularity for
date searches. This
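The hour-granularity trick can be sketched as a small utility that renders a timestamp as a lexicographically sortable term truncated to the hour (class and method names here are illustrative, not from the poster's code):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class HourGranularity {
    /**
     * Render a millisecond timestamp at hour granularity as a sortable
     * index term: all timestamps within the same hour map to one term,
     * which drastically shrinks the number of distinct date terms.
     */
    public static String hourTerm(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHH");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));   // fix the zone so terms are stable
        return fmt.format(new Date(millis));
    }
}
```

Fewer distinct terms means far cheaper range queries over dates, which is presumably where the big difference reported above comes from.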
On Monday 17 November 2003 07:40, Chong, Herb wrote:
i don't know what the Java implementation is like but the C++ one is very
fast.
...
I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn
On Monday 17 November 2003 08:39, Chong, Herb wrote:
the core of the search engine has to have certain capabilities, however,
because they are next to impossible to add as a layer on top with any
efficiency. detecting sentence boundaries outside the core search engine is
really hard to do