List
Subject: Re: inter-term correlation [was Re: Vector Space Model
in Lucene?]
On Monday 17 November 2003 07:40, Chong, Herb wrote:
i don't know what the Java implementation is like but the C++ one is very
fast.
...
I personally do not have any experience with the BreakIterator
[was Re: Vector Space Model in Lucene?]
In terms of speed I would tend to agree with you.
My question regarding efficiency was directed more towards the quality of
the results it provides. Is the BreakIterator breaking on correct sentence
boundaries or is it being confused by dots at the end
: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 5:54 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Well ... Sure, nothing can replace a human mind. But believe it or not,
there are studies which show that even human
looking for one.
Herb
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 6:45 PM
To: Lucene Users List
Subject: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space
Model in Lucene?])
Hello Herb,
I don't
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Isn't that quite strict interpretation, however? There are many cases where
linguistically separate sentences do have strong dependencies; in the web world
simple things like list items may be very closely related. Put
[mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:30 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Hmmh? You implied that there are some useful distance heuristics (words
5 words apart or more correlate much less), and others have
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
What you can do is use a POS tagger (e.g. a maximum-entropy-based tagger,
or a Brill tagger if you only have English) and use a data mining
algorithm to weight your terms.
Maybe you can use a hidden
Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in
Lucene?]
Anyway, Herb is right, sentence boundaries do carry a meaning and the
linguistic rule could
: RE: inter-term correlation [was Re: Vector Space Model
in Lucene?]
i have a program written in Icon that does basic sentence
splitting. with about 5 heuristics and one small lookup table, i
can get well over 90% accuracy doing sentence boundary detection
on email. for well edited English
i don't know what the Java implementation is like but the C++ one is very fast.
Herb
-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:39 AM
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model
]
www.xtramind.com
-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 15:39
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
There is already an implementation in the Java API
My only concern with this being integrated into lucene is that it be done in
a way that doesn't make its use mandatory. Lucene is powerful enough that
it can be used for a lot of cases where NLP doesn't make any sense. For
example, I think that sentence boundaries would severely screw up the
show an example document.
Herb
-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
Re: Vector Space Model in Lucene?])
My only
Message -
From: Chong, Herb [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:00 AM
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
Re: Vector Space Model in Lucene?])
show an example document.
Herb
-Original Message
-Original Message-
From: Joe Paulsen [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:12 AM
To: Lucene Users List
Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was
Re: Vector Space Model in Lucene?])
Hope this isn't out of context - but Dan makes a very
and the needs of the user.
-Original Message-
From: Chong, Herb [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:01 AM
To: Lucene Users List
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
Re: Vector Space Model in Lucene?])
show an example document.
Herb
Joe Paulsen wrote:
Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query, there are times when an exact term match is desired
without the query expansion that an NLP process normally requires.
to Lucene (was RE: inter-term correlation [was
Re: Vector Space Model in Lucene?])
I'm not sure I can share a sample, but the specific situation I'm thinking
of is when you have data that doesn't exist within a sentence, for example
the name, address, etc of a company. Some foreign companies have
to do that, there is no point in using
Lucene.
Herb...
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:26 AM
To: Lucene Users List
Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector
Space Model
correlation
[was Re: Vector Space Model in Lucene?])
Query expansion can (and I believe should) be done efficiently outside
the core of the search engine. After all, it's a process of changing the
query according to some expansion/rewriting algorithms, but it is still
the unchanged search
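As a rough illustration of rewriting a query before it ever reaches the search core, here is a minimal sketch in plain Java. The class name QueryExpander, the synonym table, and the choice to emit an OR group in query-parser style are all assumptions made for this example; nothing here comes from the thread or from Lucene itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rewrites a query string before handing it to the
// search engine. The synonym table is supplied by the caller.
final class QueryExpander {

    /** Replaces each term that has synonyms with a parenthesised OR group. */
    static String expand(String query, Map<String, List<String>> synonyms) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.trim().toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty query yields no clauses
            }
            List<String> alternatives = synonyms.getOrDefault(term, List.of());
            if (alternatives.isEmpty()) {
                clauses.add(term); // no expansion known: keep the exact term
            } else {
                clauses.add("(" + term + " OR " + String.join(" OR ", alternatives) + ")");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = Map.of("tax", List.of("levy"));
        System.out.println(expand("capital gains tax", synonyms));
        // prints: capital gains (tax OR levy)
    }
}
```

Because the expansion lives outside the engine, a caller who wants an exact term match can simply skip the expand step and pass the original query through.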
Karsten Konrad wrote:
I was wondering whether we could, while indexing, make a use of this by
increasing the position counter by a large number, let's say 1000,
whenever we encounter a sentence separator (Note, this is not trivial;
not every '.' ends a sentence etc. etc. etc.). Thus, searching
Space Model in Lucene?]
This is exactly the sort of approach I was advocating in earlier
messages. (Although I think you'd only need to increase the position
counter by 101 for the first word in each sentence.) Herb Chong didn't
seem to think this was appropriate, but I never understood why.
Doug
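The position-gap idea can be sketched without any Lucene code at all. The following is a minimal, self-contained Java illustration under stated assumptions: sentences are already split, tokens are split on non-word characters, and the names SentenceGapPositions, index and near are made up for this example. It only models the arithmetic; in Lucene the equivalent effect would come from the token position increment during analysis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "bump the position counter at sentence starts".
final class SentenceGapPositions {

    // Assumed gap: larger than any proximity distance (slop) we intend to allow.
    static final int SENTENCE_GAP = 101;

    /** Maps each token to its positions; the first token of every later sentence jumps by SENTENCE_GAP. */
    static Map<String, List<Integer>> index(List<String> sentences) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        int position = -1;
        boolean firstSentence = true;
        for (String sentence : sentences) {
            boolean firstToken = true;
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                position += (firstToken && !firstSentence) ? SENTENCE_GAP : 1;
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
                firstToken = false;
            }
            firstSentence = false;
        }
        return postings;
    }

    /** True if some occurrence of b follows some occurrence of a within maxDistance positions. */
    static boolean near(Map<String, List<Integer>> postings, String a, String b, int maxDistance) {
        for (int pa : postings.getOrDefault(a, List.of())) {
            for (int pb : postings.getOrDefault(b, List.of())) {
                int distance = pb - pa;
                if (distance > 0 && distance <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> p = index(List.of("Hello Herb.", "Have a nice day!"));
        System.out.println(near(p, "herb", "have", 100)); // false: the gap separates the sentences
        System.out.println(near(p, "have", "nice", 2));   // true: same sentence, two positions apart
    }
}
```

With a gap of 101, any proximity match allowing a distance of up to 100 positions stays inside one sentence, which is why a gap just above the largest permitted distance is enough.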
you could use the negative of the actual value.
Herb
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
This is exactly
PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
This is exactly the sort of approach I was advocating in earlier
messages. (Although I think you'd only need to increase the position
counter
On Monday 17 November 2003 07:40, Chong, Herb wrote:
i don't know what the Java implementation is like but the C++ one is very
fast.
...
I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn
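For anyone who wants to try the Java BreakIterator on their own text, a minimal sketch follows. The class name SentenceSplitter and the sample string are assumptions for illustration only; how well the default rules cope with abbreviations, decimals and dates is exactly the open question here, so the example prints whatever the iterator decides rather than claiming a result.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper around java.text.BreakIterator's sentence instance.
final class SentenceSplitter {

    /** Returns the trimmed, non-empty sentences that BreakIterator finds in text. */
    static List<String> split(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Sample with awkward dots: an abbreviation, a decimal and a date.
        String sample = "Hello Herb. Have a nice day! Pi is 3.14, e.g. on 28.10.2003.";
        for (String sentence : split(sample, Locale.ENGLISH)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Running it over a sample of real documents and counting wrong breaks by hand is probably the quickest way to judge the quality of the boundaries for a given corpus.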
On Monday 17 November 2003 08:39, Chong, Herb wrote:
the core of the search engine has to have certain capabilities, however,
because they are next to impossible to add as a layer on top with any
efficiency. detecting sentence boundaries outside the core search engine is
really hard to do
[mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Herb
Hmm... Are you perhaps familiar with some open system which doesn't? I'm
curious because one of my projects (already using Lucene) could benefit
from
you are in flame mode anyway now :)
Regards,
Karsten
-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 20:04
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
On Nov 14, 2003
. my project at the time was cancelled after TREC-7 and so there haven't been any new developments.
Herb
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Herb
Really? And what model is used/implemented by Lucene?
THX
Leo
Otis Gospodnetic wrote:
Lucene does not implement vector space model.
Otis
--- [EMAIL PROTECTED] wrote:
Hi,
does Lucene implement a Vector Space Model? If yes, does anybody have an
example of how to use it?
Cheers,
Ralf
--
NEU
does it matter? vector space is only one of several important ones.
Herb
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Really? And what model is used
. November 2003 14:35
To: Lucene Users List
Subject: RE: Vector Space Model in Lucene?
does it matter? vector space is only one of several important ones.
Herb
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Really? And what model is used/implemented by Lucene?
THX
Leo
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands
-Original Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 9:08 AM
To: Lucene Users List
Subject: AW: Vector Space Model in Lucene?
what are these several other important ones?
to me, vector space implies thinking inside the box.
Herb...
like all vector space models i have come across, Lucene ignores interterm correlation.
Herb
Chong, Herb wrote:
like all vector space models i have come across, Lucene ignores interterm correlation.
Herb
Hmm... Are you perhaps familiar with some open system which doesn't? I'm
curious because one of my projects (already using Lucene) could benefit
from such a feature. Right now I'm
: Re: Vector Space Model in Lucene?
Herb
Hmm... Are you perhaps familiar with some open system which doesn't? I'm
curious because one of my projects (already using Lucene) could benefit
from such a feature. Right now I'm using a bastardized version of Markov
chains, but it's more of a hack
Incorporating inter-term correlation into Lucene isn't that hard; I've
done it. Nor is it incompatible with the vector-space model. I'm not
happy with the specific correlation metric that I picked, which is why
I'm not eager to generally release the code I wrote, but I think that
the basic
14, 2003 1:14 PM
To: Lucene Users List
Subject: inter-term correlation [was Re: Vector Space Model in Lucene?]
Incorporating inter-term correlation into Lucene isn't that hard; I've
done it. Nor is it incompatible with the vector-space model. I'm not
happy with the specific correlation metric
-
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:14 PM
To: Lucene Users List
Subject: inter-term correlation [was Re: Vector Space Model in Lucene?]
Incorporating inter-term correlation into Lucene isn't that hard; I've
done it. Nor is it incompatible
: Friday, November 14, 2003 1:53 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Not sure what you mean by terms can't cross sentence boundaries. If
you're only using single-word terms, that's trivially true. What is it
that you're trying
On Friday, November 14, 2003, at 01:13 PM, Chong, Herb wrote:
if you didn't have to change the index then you haven't got all the
factors needed to do it well. terms can't cross sentence boundaries
and the index doesn't store sentence boundaries.
You mean if you have text like this: Hello Herb.
, 2003 1:52 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
You mean if you have text like this: "Hello Herb. Have a nice day!",
you want to prevent phrase queries for "herb have"? You could prevent
sentence boundary crossing with clever use
On Nov 14, 2003, at 19:50, Chong, Herb wrote:
if you are handling inter correlation properly, then terms can't cross
sentence boundaries.
Could you not break down your document along sentence boundaries? If you
manage to figure out what a sentence is, that is.
if you are not paying attention to
On Friday, November 14, 2003, at 02:02 PM, Chong, Herb wrote:
if i just run this query against a million-document newswire index, i
know i am going to get lots of hits. the phrase "capital gains tax"
hits a lot fewer documents, but is overrestrictive. the fact that the
three terms occur next to
.
Herb
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Herb
Hmm... Are you perhaps familiar with some open system which doesn't? I'm
curious
, November 14, 2003 2:10 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
With Lucene's analysis process, you can assign a position increment to
tokens. The default value is 1, meaning it's the next position. Phrase
queries default to a slop of 0
Space Model in
Lucene?]
On Nov 14, 2003, at 19:50, Chong, Herb wrote:
if you are handling inter correlation properly, then terms can't cross
sentence boundaries.
Could you not break down your document along sentence boundaries? If you
manage to figure out what a sentence
On Nov 14, 2003, at 20:27, Dror Matalon wrote:
I might be the only person on the list who's having a hard time
following this discussion.
Nope. I don't understand a word of what those guys are talking about
either :)
Would one of you wise folks care to point me
to a good dummies, also known as
:28 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Hi,
I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good dummies, also known as an executive summary, resource about
the theoretical
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote:
Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject but a lot has come out of it.
A lot of what? If statements? :)
More specifically, Rule-based taggers
On Friday, November 14, 2003, at 02:32 PM, Chong, Herb wrote:
when people type in multiword queries, mostly they are interested in
phrases in the linguistic sense. phrases don't cross sentence
boundaries. you need certain features in the index and in the ranking
algorithm to capture that
Subject: Re: Vector Space Model in Lucene?
In the Lucene-sense of things, sounds like you're after one Document
per sentence. You then get your boundaries automatically as well as
the distance weighting through the coord() Similarity function. At
least that seems like a close approximation
On Friday, November 14, 2003, at 02:54 PM, Chong, Herb wrote:
it solves one part of the problem, but there are a lot of sentences in
a typical document. you'll need to composite a rank of a document from
its constituent sentences then. there are less drastic ways to solve
the problem. the
analysis. Maybe someone out there has some experience
they might want to share with us?
Thanks,
Phil
-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: November 14, 2003 14:36
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model
Chong, Herb wrote:
since i am working now on financial news, here is an example:
capital gains tax
if i just run this query against a million-document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact
, but implementing it can be.
Herb
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:08 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
I get the feeling you're looking for reasons that Lucene is inadequate.
This may
Space Model in Lucene?
This all sounds wonderfully exotic, but, from all the different
esoteric approaches you ever tried, what, if anything, made a concrete
and noticeable impact on the quality of your search?
On Nov 14, 2003, at 21:16, Chong, Herb wrote:
if you know what TREC is, you know what i meant earlier. this isn't
exotic technology, this is close to 15 year old technology.
This is not really what I asked. What I would be interested to know is
what approach you consider to provide the biggest
to implement efficiently.
Herb...
-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:20 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
This is not really what I asked. What I would be interested to know is
what approach you
. there is psychology of query creation too and that is one thing i am taking advantage of.
Herb
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:15 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model
:33 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Certainly there are lots of scoring algorithms that one cannot easily
implement with Lucene. I'm just not yet clear on what you need to do
that Lucene cannot support
Leo Galambos wrote:
There are other (more trivial) problems as well. One geek from UFAL (our
NLP lab) reported that it was a hard problem to find the boundaries, or
rather, to say whether a dot is a dot or something else, i.e. blah,
i.e. blah i.b.m. i.p. pavlov 3.14 28.10.2003 etc.
On the
On Nov 14, 2003, at 21:14, Philippe Laflamme wrote:
Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject but a lot has come out of it.
A lot of what? If statements? :)
Yes... just like every software boils down
PA,
But Lucene is a low-level indexing library.
I'm sure most people here will agree that Lucene is much more than a
_low-level_ indexing library.
Maybe it is just a library, but it is definitely the *highest level* search
technology available on the web for free.
You ride roughshod over the
Well ... Sure, nothing can replace a human mind. But believe it or not,
there are studies which show that even human experts can significantly
differ in their opinions on what are key-phrases for a given text. So,
the results are never clear cut with humans either...
So, in this sense a
Herb,
On Friday 14 November 2003 13:39, Chong, Herb wrote:
you're describing ad-hoc solutions to a problem that have an effect, but
not one that is easily predictable. one can concoct all sorts of
combinations of the query operators that would have something of the effect
that i am describing.
Lucene does not implement vector space model.
Otis
--- [EMAIL PROTECTED] wrote:
Hi,
does Lucene implement a Vector Space Model? If yes, does anybody have an
example of how to use it?
Cheers,
Ralf
Hi,
does Lucene implement a Vector Space Model? If yes, does anybody have an
example of how to use it?
Cheers,
Ralf