Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 08:39, Chong, Herb wrote:
> the core of the search engine has to have certain capabilities, however,
> because they are next to impossible to add as a layer on top with any
> efficiency. detecting sentence boundaries outside the core search engine is
> really hard to do without building another search engine index. if i have
> to do that, there is no point in using Lucene.

It's also good to know what exactly constitutes the core; I would assume that 
analyzer implementations are not part of it per se, as long as the core knows how
to use analyzers. And as long as the index structure has some way to store the 
information needed (perhaps by using the existing notion of distances between 
tokens, which allows both overlapping tokens and gaps, as someone 
suggested?), the core need not know the specifics of how analyzers determine 
structural (sentence etc.) boundaries.
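A minimal sketch of what such an analyzer-side hook could look like, assuming the
token position-increment facility alluded to above (class and method names follow
later Lucene releases and may differ; the gap size and the boundary test are
illustrative only):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter: leaves a large positional gap at each sentence boundary so
// that sloppy phrase queries rarely match across sentences.
public class SentenceGapFilter extends TokenFilter {
    private static final int SENTENCE_GAP = 1000; // arbitrary; see the discussion below
    private boolean previousEndedSentence = false;

    public SentenceGapFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) {
            return null;
        }
        if (previousEndedSentence) {
            // The gap is applied to the first token of the new sentence, so terms in
            // different sentences end up far apart in the position space.
            token.setPositionIncrement(token.getPositionIncrement() + SENTENCE_GAP);
        }
        previousEndedSentence = endsSentence(token);
        return token;
    }

    // Stand-in for whatever boundary detection the analyzer really uses
    // (BreakIterator, a rule-based splitter, etc.).
    private boolean endsSentence(Token token) {
        String text = token.termText();
        return text.endsWith(".") || text.endsWith("!") || text.endsWith("?");
    }
}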

To me this seems like one of many issues where it's possible to retain the 
distinction between the Lucene kernel (the lean, mean core) and more specialized 
functionality; highlighting was another example.

-+ Tatu +-





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 07:40, Chong, Herb wrote:
> i don't know what the Java implementation is like but the C++ one is very
> fast.
...
>> I personally do not have any experience with the BreakIterator in Java. Has
>> anyone used it in any production environment? I'd be very interested to
>> learn more about its efficiency.

Even if that implementation weren't fast (which it should be), it should be 
fairly easy to implement one that is pretty much as efficient as any of the basic 
tokenizers; i.e. not much slower than the cost of a full scan over the text plus 
the token-creation overhead.
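For reference, sentence splitting with java.text.BreakIterator is just a single
forward scan over the text; a minimal, Lucene-independent sketch:

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceSplitDemo {
    public static void main(String[] args) {
        String text = "The index stores sentence boundaries. Queries can then "
                    + "respect them. This is the third sentence.";

        // One pass over the text; boundaries come back as character offsets.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println("[" + text.substring(start, end).trim() + "]");
        }
    }
}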

-+ Tatu +-






Re: AW: Slow response time with datefilter

2003-11-17 Thread Dror Matalon
Hi,

So we've implemented both suggestions and it made a big difference.
You can see a Beta sample at

http://www.fastbuzz.com/search/index.jsp

We have around 7,000,000 items in the index.

What we did:
1. Instead of using msec granularity, we're using hour granularity for
date searches. This reduced search times from tens of seconds to 2-5
seconds. Not ideal, but ...
2. We cache the results. So if you're looking for items in the last 15
days and then do a "next", it'll save the filter using
CachingWrapperFilter and reuse it, resulting in much faster times the
second time.
This reduces the times from the above 2-5 seconds to 0.2 - 0.8 msecs.

One of the challenges though is that since the index is updated in real
time, we can't cache for very long. We'll probably have to set up a
mechanism to "seed" the cache before the "new" index becomes available.
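For anyone trying the same thing, a rough sketch of the indexing-side change
(the field name and the yyyyMMddHH format are illustrative assumptions, not our
exact code):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HourGranularityDates {
    // One indexed term per hour instead of per millisecond keeps the number of
    // distinct date terms small, so a "last 15 days" filter only walks a few
    // hundred terms instead of millions.
    private static final SimpleDateFormat HOUR_FORMAT = new SimpleDateFormat("yyyyMMddHH");

    public static void addDateField(Document doc, Date published) {
        doc.add(Field.Keyword("pubHour", HOUR_FORMAT.format(published)));
    }
}

On the search side, the caching part is just wrapping whatever filter covers the
date window in a CachingWrapperFilter and holding on to that instance while the
same searcher is in use, so that "next page" requests reuse the cached bit set
instead of re-walking the date terms.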

Regards,

Dror



On Sat, Nov 15, 2003 at 11:03:13AM -0800, Dror Matalon wrote:
> After posting the original email, I started wondering if that's the
> issue: the fact that we store the timestamp down to the millisecond rather
> than at a more reasonable granularity. Day granularity is too coarse for
> us, but minutes, and possibly hours, should work.
> 
> I'll report once we've tested some more.
> 
> Regards,
> 
> Dror
> 
> On Sat, Nov 15, 2003 at 12:25:47PM -0500, Erik Hatcher wrote:
> > On Saturday, November 15, 2003, at 11:38  AM, Karsten Konrad wrote:
> > >If the number of different date terms causes this effect, why not "round"
> > >the date to the nearest or next midnight while indexing. Thus, filtering
> > >for the last 15 days would require walking over 15-17 different date terms.
> > >If you don't do this, the number of different terms will be the same as
> > >the number of documents you indexed, explaining the slowing down when you
> > >have more results.
> > 
> > I wholeheartedly concur.  And in fact I don't use the Keyword(String,
> > Date) thing at all if I just need to represent a date.  I use YYYYMMDD
> > as a String instead.  It's just too fiddly to deal with dates using the
> > built-in handling of it.
> > 
> > Erik
> > 
> > 
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: Which operations change document ids?

2003-11-17 Thread Doug Cutting
If they're optimized at different times then the document ids could get 
out of sync, as the optimized version will have deleted documents 
removed, while the un-optimized one won't.

Also, for add/delete to keep document ids in sync, you also need to be 
sure to use the same mergeFactor.

Doug
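A small sketch of keeping two parallel indices in lockstep along those lines
(paths and analyzer choice are placeholders):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class ParallelIndexing {
    public static void addPair(Document forA, Document forB) throws IOException {
        boolean create = false;  // true only when building the indices from scratch
        IndexWriter a = new IndexWriter("/index/parsedA", new StandardAnalyzer(), create);
        IndexWriter b = new IndexWriter("/index/parsedB", new StandardAnalyzer(), create);

        // Identical mergeFactor, so segment merges happen after the same additions.
        a.mergeFactor = 10;
        b.mergeFactor = 10;

        // Every add (and every delete, via the IndexReaders) must happen in the
        // same order on both indices for the document ids to stay aligned.
        a.addDocument(forA);
        b.addDocument(forB);

        // Optimize both at the same point, so deleted ids are reclaimed identically.
        a.optimize();
        b.optimize();

        a.close();
        b.close();
    }
}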

Jamie Stallwood wrote:
If you create two parallel indices (to use different parsing methods for
instance), and always add and delete documents in parallel, will the
document IDs always correspond in both indices? And could optimization
destroy any such invariance?


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: 17 November 2003 19:51
To: Lucene Users List
Subject: Re: Which operations change document ids?
Tate Avery wrote:

My first question is:  should I steer clear of this altogether?


No, I think this is appropriate.


If not, I need to know which Lucene operations can cause document ids to
change.

I am assuming that the following can cause potential changes:
 1) Add document
 2) Optimize index
What else could cause a document id to change?


Nothing.  And even these can only cause an id to change if there have
been deletions.

Could delete provoke a doc id change?


Not when you perform the delete.  Later, when you add to or optimize the
index, the ids for deleted documents are reclaimed.

And, I am assuming that the following DO NOT change the document id:

 1) Query the index


That is correct.

Document ids never change within an instance of IndexReader.  When you
open a new index reader you should usually assume that ids have changed.
Doug



RE: Which operations change document ids?

2003-11-17 Thread Jamie Stallwood
If you create two parallel indices (to use different parsing methods for
instance), and always add and delete documents in parallel, will the
document IDs always correspond in both indices? And could optimization
destroy any such invariance?



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: 17 November 2003 19:51
To: Lucene Users List
Subject: Re: Which operations change document ids?

Tate Avery wrote:
> My first question is:  should I steer clear of this altogether?

No, I think this is appropriate.

> If not, I need to know which Lucene operations can cause document ids to
change.
>
> I am assuming that the following can cause potential changes:
>   1) Add document
>   2) Optimize index
>
> What else could cause a document id to change?

Nothing.  And even these can only cause an id to change if there have
been deletions.

> Could delete provoke a doc id change?

Not when you perform the delete.  Later, when you add to or optimize the
index, the ids for deleted documents are reclaimed.

> And, I am assuming that the following DO NOT change the document id:
>
>   1) Query the index

That is correct.

Document ids never change within an instance of IndexReader.  When you
open a new index reader you should usually assume that ids have changed.

Doug





javacc2.0

2003-11-17 Thread Jianshuo Niu
Dear Brian:

From the list I found that you have JavaCC 2.0. Would you please send
the package to me? I could not find it anywhere else.

Thanks

Jianshuo





Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Ah, I see.  You have an absolute interpretation.  I am more relative.  I 
think we're talking about a heuristic, not a law.

Matches within a sentence are scored higher than those that span a 
sentence boundary.  And the closer together the matching terms are, whether 
within the same sentence or not, the greater the score.  Given these two 
principles, at some point, as sentences get longer, a close match across 
sentence boundaries should probably score substantially higher than a very 
distant match within a sentence.  Thus missing some distant yet still 
within-sentence matches in very long sentences probably won't substantially 
alter the ranking.  Is 100 long enough?  Perhaps not.  But 1000 is 
certainly plenty long.

Doug

Chong, Herb wrote:
any arbitrary number you pick will be broken by some document someone puts into the system.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug



RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
you could use the negative of the actual value.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]


This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug




RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
any arbitrary number you pick will be broken by some document someone puts into the 
system.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug





Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Karsten Konrad wrote:
> I was wondering whether we could, while indexing, make use of this by
> increasing the position counter by a large number, let's say 1000,
> whenever we encounter a sentence separator (Note, this is not trivial;
> not every '.' ends a sentence etc. etc. etc.). Thus, searching for
>
> "income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain
>
> would find "income tax gain" as usual, but would boost all texts
> where the phrases involved appear within sentence boundaries.

This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug



Re: Which operations change document ids?

2003-11-17 Thread Doug Cutting
Tate Avery wrote:
> My first question is:  should I steer clear of this altogether?

No, I think this is appropriate.

> If not, I need to know which Lucene operations can cause document ids to change.
>
> I am assuming that the following can cause potential changes:
>   1) Add document
>   2) Optimize index
>
> What else could cause a document id to change?

Nothing.  And even these can only cause an id to change if there have 
been deletions.

> Could delete provoke a doc id change?

Not when you perform the delete.  Later, when you add to or optimize the 
index, the ids for deleted documents are reclaimed.

> And, I am assuming that the following DO NOT change the document id:
>
>   1) Query the index

That is correct.

Document ids never change within an instance of IndexReader.  When you 
open a new index reader you should usually assume that ids have changed.

Doug
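In other words, a doc-id based join table is only safe to cache per IndexReader
instance; a rough sketch of that discipline (the join-key field name is made up
for illustration, it is not from the thread):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;

public class DocIdJoinTable {
    private final IndexReader reader;   // the ids below are only valid for this reader
    private final String[] joinKeys;    // document id -> external join key

    public DocIdJoinTable(IndexReader reader, String keyField) throws IOException {
        this.reader = reader;
        this.joinKeys = new String[reader.maxDoc()];
        for (int id = 0; id < reader.maxDoc(); id++) {
            if (!reader.isDeleted(id)) {
                joinKeys[id] = reader.document(id).get(keyField);
            }
        }
    }

    public String keyFor(int docId) {
        return joinKeys[docId];
    }

    // When the index is reopened, throw this table away and build a new one
    // against the new IndexReader; the old ids can no longer be trusted.
    public boolean isValidFor(IndexReader current) {
        return current == reader;
    }
}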



Which operations change document ids?

2003-11-17 Thread Tate Avery
Hello,

I am considering using the document id in order to implement a fast 'join' during 
relational search.

My first question is:  should I steer clear of this altogether?  And why?  If not, I 
need to know which Lucene operations can cause document ids to change.

I am assuming that the following can cause potential changes:

1) Add document
- since it might trigger a merge

2) Optimize index
- since it does trigger a merge

3) Update document
- since it is a delete + add

What else could cause a document id to change?  Could delete provoke a doc id change?

And, I am assuming that the following DO NOT change the document id:

1) Query the index


Also, am I missing any others that will or will not cause a document id to change?  

Thank you,

Tate


P.S. It appears (to me) that the SearchBean (in lucene sandbox) sorting makes use of 
the Hits.id(int _n) method.  How does it cope, if at all, with changes to the 
underlying document ids?




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Otis Gospodnetic
Dmitry once contributed a nice beefy patch that added Term Vector
support to Lucene.  While we never integrated the changes (for no good
reason), I do recall that the patch was nice and elegant, because it
allowed one to turn Term Vector support on/off at indexing time.

If turned on, Lucene would collect information about terms and documents
that allows term vectors to be built.  If turned off, Lucene would create
only its normal index files.

If you can provide something like that, I bet a lot of people would be
interested.  I have a feeling this won't get done unless you do it,
though.

Otis
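The indexing-time switch being described would presumably look like an extra
per-field flag; a purely hypothetical sketch (the last constructor argument is
not in the current Lucene release, it only illustrates the on/off idea):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TermVectorSketch {
    public static Document makeDocument(String text) {
        Document doc = new Document();
        // Hypothetical: store=false, index=true, tokenize=true, and a final flag
        // asking Lucene to also record a term vector for this field.  With the
        // flag off, only the normal index files would be written.
        doc.add(new Field("contents", text, false, true, true, true));
        return doc;
    }
}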


--- "Chong, Herb" <[EMAIL PROTECTED]> wrote:
> the core of the search engine has to have certain capabilities,
> however, because they are next to impossible to add as a layer on top
> with any efficiency. detecting sentence boundaries outside the core
> search engine is really hard to do without building another search
> engine index. if i have to do that, there is no point in using
> Lucene.
> 
> Herb...
> 
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 17, 2003 10:26 AM
> To: Lucene Users List
> Subject: Re: Contributing to Lucene (was RE: inter-term correlation
> [was R e: Vector Space Model in Lucene?])
> 
> 
> Query expansion can (and I believe should) be done efficiently
> outside 
> the core of the search engine. After all, it's a process of changing the 
> query according to some expansion/rewriting algorithms, but it is
> still 
> the unchanged search engine that in the end has to answer the new
> query...
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> 





RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
the core of the search engine has to have certain capabilities, however, because they 
are next to impossible to add as a layer on top with any efficiency. detecting 
sentence boundaries outside the core search engine is really hard to do without 
building another search engine index. if i have to do that, there is no point in using 
Lucene.

Herb...

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:26 AM
To: Lucene Users List
Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector 
Space Model in Lucene?])


Query expansion can (and I believe should) be done efficiently outside 
the core of the search engine. After all, it's a process of changing the 
query according to some expansion/rewriting algorithms, but it is still 
the unchanged search engine that in the end has to answer the new query...

-- 
Best regards,
Andrzej Bialecki




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
then sentence detection at indexing time shouldn't see them as sentences. no sentence 
detection is run on the query terms.

Herb...

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:27 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


I'm not sure I can share a sample, but the specific situation I'm thinking
of is when you have data that doesn't exist within a sentence, for example
the name, address, etc of a company.  Some foreign companies have funky
punctuation within their names and addresses.  




Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Andrzej Bialecki
Joe Paulsen wrote:

> Hope this isn't out of context - but Dan makes a very valid point.
> Besides the potential performance slowdown if NLP were always applied to
> a user's query - there are times that an exact term match is desired
> without the "query expansion" that an NLP process normally requires.

Query expansion can (and I believe should) be done efficiently outside 
the core of the search engine. After all, it's a process of changing the 
query according to some expansion/rewriting algorithms, but it is still 
the unchanged search engine that in the end has to answer the new query...

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Dan Quaroni
I'm not sure I can share a sample, but the specific situation I'm thinking
of is when you have data that doesn't exist within a sentence, for example
the name, address, etc. of a company.  Some foreign companies have funky
punctuation within their names and addresses.

I'd have to see the results to know if the NLP would mess anything up.  If
all it did was weight the results, then perhaps it wouldn't, but it's also
possible that it would.  Basically my concern is that it would mess up the
use of Lucene for non-sentence-based applications that might contain
punctuation.

On the whole I think adding NLP to Lucene is a good idea because the
vast majority of the applications of Lucene would benefit from it.  Making
it optional could be a good way to maintain the current power of Lucene and
perhaps also retain the speed, depending on the performance of the NLP
functionality and the needs of the user.


-Original Message-
From: Chong, Herb [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:01 AM
To: Lucene Users List
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


show an example document.

Herb

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


My only concern with this being integrated into lucene is that it be done in
a way that doesn't make its use mandatory.  Lucene is powerful enough that
it can be used for a lot of cases where NLP doesn't make any sense.  For
example, I think that sentence boundaries would severely screw up the
project I recently did using lucene because there are no sentences, but
there is punctuation.




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
there is nothing i said about NLP. in fact my specific statements exclude NLP. the 
processing i am describing covers a linguistic observation and a constraint. a 
sequence of terms in the query receives a higher score when it occurs inside a single 
sentence than when it crosses a sentence boundary. also, there are many situations 
where NLP processing doesn't do any query expansion and reduces the number of possible 
documents that a query can match, thereby speeding up search. query expansion is only 
one way to use NLP, and i am not even interested in NLP changes to Lucene.

Herb

-Original Message-
From: Joe Paulsen [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:12 AM
To: Lucene Users List
Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query - there are times that an exact term match is desired
without the "query expansion" that an NLP process normally requires.




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
we're talking a 2-3% slowdown if this is done right.

Herb...

-Original Message-
From: Joe Paulsen [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:12 AM
To: Lucene Users List
Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query - there are times that an exact term match is desired
without the "query expansion" that an NLP process normally requires.

Joe




Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Joe Paulsen
Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query - there are times that an exact term match is desired
without the "query expansion" that an NLP process normally requires.

Joe

- Original Message - 
From: "Chong, Herb" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, November 17, 2003 10:00 AM
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


show an example document.

Herb

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


My only concern with this being integrated into lucene is that it be
done in
a way that doesn't make its use mandatory.  Lucene is powerful enough
that
it can be used for a lot of cases where NLP doesn't make any sense.  For
example, I think that sentence boundaries would severely screw up the
project I recently did using lucene because there are no sentences, but
there is punctuation.




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
show an example document.

Herb

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


My only concern with this being integrated into lucene is that it be done in
a way that doesn't make its use mandatory.  Lucene is powerful enough that
it can be used for a lot of cases where NLP doesn't make any sense.  For
example, I think that sentence boundaries would severely screw up the
project I recently did using lucene because there are no sentences, but
there is punctuation.




RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Dan Quaroni
My only concern with this being integrated into Lucene is that it be done in
a way that doesn't make its use mandatory.  Lucene is powerful enough that
it can be used for a lot of cases where NLP doesn't make any sense.  For
example, I think that sentence boundaries would severely screw up the
project I recently did using Lucene because there are no sentences, but
there is punctuation.

--- "Chong, Herb" <[EMAIL PROTECTED]> wrote:
> that concept is that multiword queries
> are mostly multiword terms and they can't cross sentence boundaries
> according to the rules of English.




AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Karsten Konrad

Hi,

It actually is quite nice, and it can be used in production for the kinds of things
that have been discussed lately in this group.

If you want to play it safe: the iterator breaks at dots after numbers (e.g. "15.
March"), but the precision of the algorithm can be increased if you never break
after a number.

The implementation is fast.
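A sketch of that "never break after a number" safeguard, wrapping the standard
iterator (purely illustrative):

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SafeSentenceSplitter {
    // Returns sentence boundary offsets, dropping any break that the standard
    // iterator places right after a number (e.g. "15. March").
    public static List splitBoundaries(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.getDefault());
        it.setText(text);
        it.first();
        List boundaries = new ArrayList();
        for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
            int prev = b - 1;
            // Skip back over whitespace between the period and the boundary.
            while (prev > 0 && Character.isWhitespace(text.charAt(prev))) {
                prev--;
            }
            boolean afterNumber = prev > 0
                && text.charAt(prev) == '.'
                && Character.isDigit(text.charAt(prev - 1));
            if (!afterNumber) {
                boundaries.add(new Integer(b));
            }
        }
        return boundaries;
    }
}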

Regards,

Karsten

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com


-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Monday, 17 November 2003 15:39
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]


There is already an implementation in the Java API for sentence boundary detection. 
The BreakIterator in the java.text package has this to say about sentence splitting:

"Sentence boundary analysis allows selection with correct interpretation of periods 
within numbers and abbreviations, and trailing punctuation marks such as quotation 
marks and parentheses." 
http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM: 
http://oss.software.ibm.com/icu/index.html
It supports many languages.

I personally do not have any experience with the BreakIterator in Java. Has anyone 
used it in any production environment? I'd be very interested to learn more about its 
efficiency.

Regards,
Phil

> -Original Message-
> From: Chong, Herb [mailto:[EMAIL PROTECTED]
> Sent: November 17, 2003 08:53
> To: Lucene Users List
> Subject: RE: inter-term correlation [was Re: Vector Space Model in 
> Lucene?]
>
>
> i have a program written in Icon that does basic sentence splitting. 
> with about 5 heuristics and one small lookup table, i can get well 
> over 90% accuracy doing sentence boundary detection on email. for well 
> edited English text, like newswires, i can manage closer to 99%. this 
> is all that is needed for significantly improving a search engine's 
> performance when the query engine respects sentence boundaries. 
> incidentally, the GATE Information Extraction framework cites some 
> references that indicate that for named entity feature extraction, 
> their system can exceed the ability of trained humans to detect and 
> classify named entities if only one person does the detection.
> collaborating humans are still better, but no-one has the time in
> practical applications.
>
> you probably know, since you know about Markov chains, that within 
> sentence term correlation, and hence the language model, is different 
> than across sentences. linguists have known this for a very long time. 
> it isn't hard to put this capability into a search engine, but it 
> absolutely breaks down unless there is sentence boundary information 
> stored for use at query time.
>
> Herb
>
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 5:54 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model in 
> Lucene?]
>
>
> Well ... Sure, nothing can replace a human mind. But believe it or 
> not, there are studies which show that even human experts can 
> significantly differ in their opinions on what are key-phrases for a 
> given text. So, the results are never clear cut with humans either...
>
> So, in this sense a heuristic tool for sentence splitting and 
> key-phrase detection can go long ways. For example, the application I 
> mentioned, uses quite a few heuristic rules (+ Markov chains as a 
> heavier ammunition :-), and it comes up with the following phrases for 
> your email discussion (the text quoted below):
>
> (lang=EN): NLP, trainable rule-based tagging, natural language 
> processing, apache, NLP expert
>
> Now, this set of key-phrases does reflect the main noun-phrases in the 
> text... which means I have a practical and tangible benefit from NLP. 
> QED ;-)
>
> Best regards,
> Andrzej
>





RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i don't know what the Java implementation is like but the C++ one is very fast.

Herb

-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:39 AM
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]


I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn more about its efficiency.

Regards,
Phil




RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Philippe Laflamme
There is already an implementation in the Java API for sentence boundary
detection. The BreakIterator in the java.text package has this to say about
sentence splitting:

"Sentence boundary analysis allows selection with correct interpretation of
periods within numbers and abbreviations, and trailing punctuation marks
such as quotation marks and parentheses."
http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM:
http://oss.software.ibm.com/icu/index.html
It supports many languages.

I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn more about its efficiency.

Regards,
Phil

> -Original Message-
> From: Chong, Herb [mailto:[EMAIL PROTECTED]
> Sent: November 17, 2003 08:53
> To: Lucene Users List
> Subject: RE: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> i have a program written in Icon that does basic sentence
> splitting. with about 5 heuristics and one small lookup table, i
> can get well over 90% accuracy doing sentence boundary detection
> on email. for well edited English text, like newswires, i can
> manage closer to 99%. this is all that is needed for
> significantly improving a search engine's performance when the
> query engine respects sentence boundaries. incidentally, the GATE
> Information Extraction framework cites some references that
> indicate that for named entity feature extraction, their system
> can exceed the ability of trained humans to detect and classify
> named entities if only one person does the detection.
> collaborating humans are still better, but no-one has the time in
> practical applications.
>
> you probably know, since you know about Markov chains, that
> within sentence term correlation, and hence the language model,
> is different than across sentences. linguists have known this for
> a very long time. it isn't hard to put this capability into a
> search engine, but it absolutely breaks down unless there is
> sentence boundary information stored for use at query time.
>
> Herb
>
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 5:54 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> Well ... Sure, nothing can replace a human mind. But believe it or not,
> there are studies which show that even human experts can significantly
> differ in their opinions on what are key-phrases for a given text. So,
> the results are never clear cut with humans either...
>
> So, in this sense a heuristic tool for sentence splitting and key-phrase
> detection can go long ways. For example, the application I mentioned,
> uses quite a few heuristic rules (+ Markov chains as a heavier
> ammunition :-), and it comes up with the following phrases for your
> email discussion (the text quoted below):
>
> (lang=EN): NLP, trainable rule-based tagging, natural language
> processing, apache, NLP expert
>
> Now, this set of key-phrases does reflect the main noun-phrases in the
> text... which means I have a practical and tangible benefit from NLP.
> QED ;-)
>
> Best regards,
> Andrzej
>





RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
now you're talking. this is one way of doing it. you need to work out a heuristic to 
increment the counter enough that a misrecognized long sentence won't trigger this. 
however, one can argue that a sentence that contains 1000 words can't possibly be 
about one topic.

Herb

-Original Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in
Lucene?]

Anyway, Herb is right, sentence boundaries do carry a meaning and the 
linguistic rule could be phrased as: "Constituents (Concepts) mentioned 
in one sentence together have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator (Note, this is not trivial; 
not every '.' ends a sentence etc. etc. etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries - I 
assume that a sentence with 100 words would be pretty unlikely,
but still within the 1000-word separation done by increasing the
position. No linguistics necessary, actually, but it is an application
of a linguistic rule!
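For reference, the query side of this needs nothing new in Lucene; a sketch of
issuing such a query (the field name and analyzer are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SentenceScopedQuery {
    public static Query build() throws Exception {
        // A slop of 100 stays well under the 1000-position gap inserted at
        // sentence boundaries, so the sloppy phrases effectively only match
        // within a single sentence.
        String q = "\"income tax\"~100 \"tax gain\"~100 \"income tax gain\"~100 income tax gain";
        return QueryParser.parse(q, "contents", new StandardAnalyzer());
    }
}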




RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
respecting sentence boundaries and using them to affect a document's score in the 
ranking algorithm requires linguistic knowledge, not NLP knowledge. think about it.

Herb

-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 9:13 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


What you can do is use a POS tagger (e.g. a maximum entropy model based 
tagger, or a Brill tagger if you just have English) and use a data mining 
algorithm to weight your terms.
Maybe you can use a hidden Markov model for that.

You could build this on top of Lucene; it shouldn't be that difficult.

But maybe I'm misunderstanding you...




RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
you cannot layer sentence boundary detection on top of Lucene and post-process the hit 
list without effectively building a completely new search engine index. if i am going 
to go to this trouble, there is no point in using Lucene at all.

Herb

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:30 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Hmmh? You implied that there are some useful distance heuristics (words
five or more words apart correlate much less), and others have pointed out that 
Lucene has many useful components.

Building a more complex system from small components is usually considered a 
Good Thing (tm), not an "ad hoc solution". In fact, I would guess most 
experienced people around here start with the Lucene defaults and build their 
own systems by gradually customizing more and more of the pieces.
It may be that there are actual fundamental problems with Lucene, regarding the 
approach you'd prefer, but I don't think it makes sense to brush off 
suggestions regarding distance & fuzzy/sloppy queries by claiming they are 
"just hacks".




RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i am not implying rejection of a match across sentence boundaries, i am saying that it 
receives a lower score than a match within a sentence boundary.

Herb

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:15 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Isn't that quite a strict interpretation, however? There are many cases where 
linguistically separate sentences do have strong dependencies; in the web world, 
simple things like list items may be very closely related. Put another way:
it may not be trivially easy to detect sentence boundaries, nor is it certain 
that what is a boundary from a language viewpoint really is a hard boundary 
from a semantic perspective. And are there not varying levels of separation 
(sentences close to each other often are related, back references being 
common), not just one, between sentences?




RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
i am stuck with company policy with respect to open source project participation. this 
is why i am dropping some fairly detailed hints of what has to be done instead of 
doing it myself. this policy may change in the next year, but by then, i will have to 
be working with a solution and not just looking for one.

Herb

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 6:45 PM
To: Lucene Users List
Subject: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space 
Model in Lucene?])


Hello Herb,

I don't approve of several teasing, mean, etc. emails I saw from a few
people.  This is a serious and polite email. :)

It sounds like you know about NLP and see places where Lucene could be
improved.  Lucene is open source and free, and could benefit from
knowledgeable people like you.  Are you interested in contributing some
computational linguistics smarts, either as improvement of Lucene core
(if improvements are such that they don't make Lucene use more
difficult and its code significantly more complex and harder to
maintain), or as an add-on module, or some kind of an extension, or
even just as application built on top of Lucene, all of which could and
would live outside of Lucene's core?

Otis




RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i have a program written in Icon that does basic sentence splitting. with about 5 
heuristics and one small lookup table, i can get well over 90% accuracy doing sentence 
boundary detection on email. for well edited English text, like newswires, i can 
manage closer to 99%. this is all that is needed for significantly improving a search 
engine's performance when the query engine respects sentence boundaries. incidentally, 
the GATE Information Extraction framework cites some references that indicate that for 
named entity feature extraction, their system can exceed the ability of trained humans 
to detect and classify named entities if only one person does the detection. 
collaborating humans are still better, but no-one has the time in practical 
applications.

you probably know, since you know about Markov chains, that within sentence term 
correlation, and hence the language model, is different than across sentences. 
linguists have known this for a very long time. it isn't hard to put this capability 
into a search engine, but it absolutely breaks down unless there is sentence boundary 
information stored for use at query time.

Herb

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 5:54 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


Well ... Sure, nothing can replace a human mind. But believe it or not, 
there are studies which show that even human experts can significantly 
differ in their opinions on what are key-phrases for a given text. So, 
the results are never clear cut with humans either...

So, in this sense a heuristic tool for sentence splitting and key-phrase 
detection can go long ways. For example, the application I mentioned, 
uses quite a few heuristic rules (+ Markov chains as a heavier 
ammunition :-), and it comes up with the following phrases for your 
email discussion (the text quoted below):

(lang=EN): NLP, trainable rule-based tagging, natural language 
processing, apache, NLP expert

Now, this set of key-phrases does reflect the main noun-phrases in the 
text... which means I have a practical and tangible benefit from NLP. 
QED ;-)

Best regards,
Andrzej
