RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i have a program written in Icon that does basic sentence splitting. with about 5 heuristics and one small lookup table, i can get well over 90% accuracy doing sentence boundary detection on email. for well edited English text, like newswires, i can manage closer to 99%. this is all that is

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
i am stuck with company policy with respect to open source project participation. this is why i am dropping some fairly detailed hints of what has to be done instead of doing it myself. this policy may change in the next year, but by then, i will have to be working with a solution and not just

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i am not implying rejection of a match across sentence boundaries, i am saying that it receives a lower score than a match within a sentence boundary. Herb -Original Message- From: Tatu Saloranta [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 8:15 PM To: Lucene Users List

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
you cannot layer sentence boundary detection on top of Lucene and post process the hit list without effectively building a completely new search engine index. if i am going to go to this trouble, there is no point to using Lucene at all. Herb -Original Message- From: Tatu Saloranta

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
respecting sentence boundaries and using them to affect a document's score in the ranking algorithm requires linguistic knowledge, not NLP knowledge. think about it. Herb -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 9:13 PM To:

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
now you're talking. this is one way of doing it. you need to work out a heuristic to increment the counter enough that a misrecognized long sentence won't trigger this. however, one can argue that a sentence that contains 1000 words can't possibly be about one topic. Herb -Original

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Philippe Laflamme
There is already an implementation in the Java API for sentence boundary detection. The BreakIterator in the java.text package has this to say about sentence splitting: Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and
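For anyone who wants to try this, here is a minimal sketch of sentence splitting with java.text.BreakIterator. The class name SentenceSplitter and the sample text are made up for illustration; only BreakIterator.getSentenceInstance, setText, first and next come from the JDK. How well it copes with abbreviations and numbers is something you would have to check on your own data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; not part of Lucene or the JDK.
class SentenceSplitter {

    // Splits text into sentences using the JDK's sentence BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each [start, end) range is one sentence, including trailing whitespace.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```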

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i don't know what the Java implementation is like but the C++ one is very fast. Herb -Original Message- From: Philippe Laflamme [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 9:39 AM To: Lucene Users List Subject: RE: inter-term correlation [was Re: Vector Space Model in

AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Karsten Konrad
Hi, it actually is quite nice and it can be used in production for such things as have been discussed lately in this group. If you want to play it safe: The iterator breaks at dots after numbers (e.g. 15. March). The precision of the algorithm can be increased if you never break after a

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Dan Quaroni
My only concern with this being integrated into Lucene is that it be done in a way that doesn't make its use mandatory. Lucene is powerful enough that it can be used for a lot of cases where NLP doesn't make any sense. For example, I think that sentence boundaries would severely screw up the

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
show an example document. Herb -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 9:48 AM To: 'Lucene Users List' Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?]) My only

Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Joe Paulsen
Hope this isn't out of context - but Dan makes a very valid point. Besides the potential performance slowdown if NLP was always applied to a user's query - there are times that an exact term match is desired without the query expansion that an NLP process normally requires. Joe - Original

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
there is nothing i said about NLP. in fact my specific statements exclude NLP. the processing i am describing covers a linguistic observation and a constraint. a sequence of terms in the query receives a higher score when it occurs inside a single sentence than when it crosses a sentence

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Dan Quaroni
I'm not sure I can share a sample, but the specific situation I'm thinking of is when you have data that doesn't exist within a sentence, for example the name, address, etc of a company. Some foreign companies have funky punctuation within their names and addresses. I'd have to see the results

Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Andrzej Bialecki
Joe Paulsen wrote: Hope this isn't out of context - but Dan makes a very valid point. Besides the potential performance slowdown if NLP was always applied to a user's query - there are times that an exact term match is desired without the query expansion that an NLP process normally requires.

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
then sentence detection at indexing time shouldn't see them as sentences. no sentence detection is run on the query terms. Herb... -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 10:27 AM To: 'Lucene Users List' Subject: RE: Contributing to

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
the core of the search engine has to have certain capabilities, however, because they are next to impossible to add as a layer on top with any efficiency. detecting sentence boundaries outside the core search engine is really hard to do without building another search engine index. if i have to

RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Otis Gospodnetic
Dmitry once contributed a nice beefy patch that added Term Vector support to Lucene. While we never integrated the changes (for no good reason), I do recall that the patch was nice and elegant, because it allowed one to turn Term Vector support on/off at indexing time. If turned on, Lucene would

Which operations change document ids?

2003-11-17 Thread Tate Avery
Hello, I am considering using the document id in order to implement a fast 'join' during relational search. My first question is: should I steer clear of this altogether? And why? If not, I need to know which Lucene operations can cause document ids to change. I am assuming that the

Re: Which operations change document ids?

2003-11-17 Thread Doug Cutting
Tate Avery wrote: My first question is: should I steer clear of this altogether? No, I think this is appropriate. If not, I need to know which Lucene operations can cause document ids to change. I am assuming that the following can cause potential changes: 1) Add document 2)

Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Karsten Konrad wrote: I was wondering whether we could, while indexing, make use of this by increasing the position counter by a large number, let's say 1000, whenever we encounter a sentence separator (Note, this is not trivial; not every '.' ends a sentence etc. etc. etc.). Thus, searching
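The position-gap idea can be sketched without Lucene at all. The code below is a toy model, not Lucene code: the class SentenceGapPositions, the gap of 1000 and the crude tokenizer are all assumptions made for illustration. In Lucene itself the equivalent hook would presumably be the token position increment set by the analyzer, which is not shown here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy positional index; hypothetical, for illustration only.
class SentenceGapPositions {

    // Arbitrary gap added to the position counter at each sentence boundary.
    static final int SENTENCE_GAP = 1000;

    // Maps each lower-cased term to the list of positions where it occurs.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
        sentences.setText(text);
        int position = 0;
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            // Crude tokenizer: split on anything that is not a letter or digit.
            for (String raw : text.substring(start, end).split("[^A-Za-z0-9]+")) {
                if (raw.isEmpty()) {
                    continue;
                }
                String term = raw.toLowerCase(Locale.ROOT);
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
                position++;
            }
            // Push the next sentence far away so small-slop matches cannot span it.
            position += SENTENCE_GAP;
        }
        return positions;
    }

    // True if some occurrence of a lies within slop positions of some occurrence of b.
    static boolean within(Map<String, List<Integer>> idx, String a, String b, int slop) {
        List<Integer> empty = Collections.emptyList();
        for (int pa : idx.getOrDefault(a, empty)) {
            for (int pb : idx.getOrDefault(b, empty)) {
                if (Math.abs(pa - pb) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With a slop smaller than the gap, "cat" and "sat" in the same sentence match while "sat" and "dog" in adjacent sentences do not. A sentence longer than the gap, or a slop larger than it, defeats the scheme, which is the weakness of picking any fixed number.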

RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
any arbitrary number you pick will be broken by some document someone puts into the system. Herb -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 2:56 PM To: Lucene Users List Subject: Re: AW: inter-term correlation [was Re: Vector

RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
you could use the negative of the actual value. Herb -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 2:56 PM To: Lucene Users List Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?] This is exactly the

Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Ah, I see. You have an absolute interpretation. I am more relative. I think we're talking about a heuristic, not a law. Matches within a sentence are scored higher than those that are not. And the closer the matching terms are, whether within the same sentence or not, the greater the score.
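One way to picture the relative interpretation is a score that decays with distance and is damped when the match crosses a sentence boundary. The formula, the 0.5 penalty and the class name ProximityScore are invented here purely for illustration; nothing in the thread says Lucene scores this way.

```java
// Hypothetical scoring sketch; the formula and constants are assumptions.
class ProximityScore {

    // Arbitrary damping factor applied when the two terms sit in different sentences.
    static final double CROSS_SENTENCE_PENALTY = 0.5;

    // distance: number of positions between the two matching terms (0 or more).
    static double score(int distance, boolean sameSentence) {
        double proximity = 1.0 / (1 + distance);   // closer terms score higher
        return sameSentence ? proximity : proximity * CROSS_SENTENCE_PENALTY;
    }
}
```

A cross-sentence match is still scored, just lower than the same match inside one sentence, and closer terms always beat more distant ones under the same sentence condition.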

javacc2.0

2003-11-17 Thread Jianshuo Niu
Dear Brain: From the list I found that you have the javacc2.0. Would you please send the package to me? I could not find it anywhere else. Thanks Jianshuo

RE: Which operations change document ids?

2003-11-17 Thread Jamie Stallwood
If you create two parallel indices (to use different parsing methods for instance), and always add and delete documents in parallel, will the document IDs always correspond in both indices? And could optimization destroy any such invariance? -Original Message- From: Doug Cutting

Re: Which operations change document ids?

2003-11-17 Thread Doug Cutting
If they're optimized at different times then the document ids could get out of sync, as the optimized version will have deleted documents removed, while the un-optimized one won't. Also, for add/delete to keep document ids in sync you need to also be sure to use the same mergeFactor. Doug
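To make the point about optimization concrete, here is a toy model of what happens to ids when deleted documents are squeezed out. This is not Lucene code; the class DocIdCompaction and its renumber method are hypothetical and only mimic the renumbering effect described above.

```java
// Toy model of id renumbering when deleted documents are removed; hypothetical.
class DocIdCompaction {

    // Returns, for each old document id, its id after compaction,
    // or -1 if the document was deleted and no longer exists.
    static int[] renumber(boolean[] deleted) {
        int[] newIds = new int[deleted.length];
        int next = 0;
        for (int oldId = 0; oldId < deleted.length; oldId++) {
            newIds[oldId] = deleted[oldId] ? -1 : next++;
        }
        return newIds;
    }
}
```

If document 1 of three is deleted, the old document 2 becomes document 1 after compaction. An index that has not been compacted still calls it document 2, so two parallel indices disagree until both have been optimized.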

Re: AW: Slow response time with datefilter

2003-11-17 Thread Dror Matalon
Hi, So we've implemented both suggestions and it made a big difference. You can see a Beta sample at http://www.fastbuzz.com/search/index.jsp We have around 7,000,000 items in the index. What we did: 1. Instead of using msec granularity, we're using hour granularity for date searches. This
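A coarser date granularity can be sketched as below. The class HourGranularity and the zero-padded term format are assumptions for illustration, not what the poster's code does. The general idea is that hour buckets give far fewer distinct date terms than milliseconds, and zero-padding keeps the string order the same as the numeric order for non-negative timestamps.

```java
import java.util.Locale;

// Hypothetical helper for indexing dates at hour granularity.
class HourGranularity {

    static final long MILLIS_PER_HOUR = 60L * 60L * 1000L;

    // Number of whole hours since the epoch for the given timestamp.
    static long toHourBucket(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_HOUR);
    }

    // Fixed-width term so that lexicographic order matches numeric order.
    // Assumes a non-negative timestamp (on or after 1970-01-01).
    static String toIndexTerm(long epochMillis) {
        return String.format(Locale.ROOT, "%010d", toHourBucket(epochMillis));
    }
}
```

All timestamps within the same hour map to the same term, so a date range over the index touches at most one term per hour.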

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 07:40, Chong, Herb wrote: i don't know what the Java implementation is like but the C++ one is very fast. ... I personally do not have any experience with the BreakIterator in Java. Has anyone used it in any production environment? I'd be very interested to learn

Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 08:39, Chong, Herb wrote: the core of the search engine has to have certain capabilities, however, because they are next to impossible to add as a layer on top with any efficiency. detecting sentence boundaries outside the core search engine is really hard to do