Re: Lucene internal document number?

2004-08-06 Thread Karsten Konrad

Hi,



I have a short question regarding Lucene's internal document numbers: 
can you give me an idea where they are written into the index and how 
they are generated?


I am not 100% sure about the technical design, only
from my experience with Lucene:

The numbers depend on when the document was indexed. 
The older the document, the smaller the number. All 
documents are numbered from 0 to n-1 where n is the 
number of documents the current reader sees. There
are never any gaps in this numbering.

There is, to my knowledge, no explicit point where
these numbers are written in the index. Think of
positions in a list - they are not part of the
list itself. You have to take into account that
these numbers may change for documents after 
any deletions in the index.
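For illustration, a minimal sketch (assuming the Lucene 1.x IndexReader API; the index path and field name are hypothetical) that walks documents by their internal number:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DocNumberWalk {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index"); // hypothetical index path
        // Internal numbers run from 0 to maxDoc()-1 for this reader.
        for (int i = 0; i < reader.maxDoc(); i++) {
            // Until deletions are merged away, some numbers may point to
            // deleted slots; a later merge renumbers the survivors.
            if (reader.isDeleted(i)) continue;
            Document doc = reader.document(i);
            System.out.println(i + ": " + doc.get("title")); // "title" is a hypothetical field
        }
        reader.close();
    }
}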

Regards,

Karsten


--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

Xtramind Technologies GmbH 
Stuhlsatzenhausweg 3 
D-66123 Saarbrücken

Phone +49 (681) 3 02-51 13 
Fax +49 (681) 3 02-51 09
[EMAIL PROTECTED] 
www.xtramind.com

Visit us!
DMS |  Hall 2, Booth 2705 |  07.-09. September 2004 |  Messe Essen |  www.dmsexpo.de



-Original Message-
From: B. Grimm [Eastbeam GmbH] [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 6, 2004 13:42
To: [EMAIL PROTECTED]
Subject: Lucene internal document number?


Hi there,
I looked around in the source but I don't get it. I also read the 
FAQ, and I know that the numbers are incremental for each index, start at 
0, and change when optimizing and so on...

I looked at the doc writers in Lucene, but I don't get the point where 
the numbers are assigned and written (I assume by using writeVInt() or 
something like that).

It would be very kind if anyone could tell me which line in which file I 
have to look at.

Thanks in advance and kind regards from Berlin, Germany.

Bastian

-- 
With kind regards,
Bastian Grimm







Re: How to access information from a part of the index

2004-07-09 Thread Karsten Konrad

Hi,

Why don't you just use two indexes? You probably do not have to index the test set at 
all.

If you have two or more subsets, just use filters that only match the subsets you 
are interested in. Counting the documents in one of the subsets that contain a certain 
term then becomes a search over the filtered index plus counting the 
number of results. Filters are quite
efficient.
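As a rough sketch of this idea (assuming the Lucene 1.4-era Filter API; the class name, field name and document numbers are hypothetical):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

/** Restricts a search to a fixed set of internal document numbers.
 *  Beware: internal numbers can change after deletions and merges. */
public class SubsetFilter extends Filter {
    private int[] docs;
    public SubsetFilter(int[] docs) { this.docs = docs; }
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < docs.length; i++) {
            bits.set(docs[i]); // only these documents can match
        }
        return bits;
    }
}

Counting the documents of subset {3,5} that contain "hello" would then look roughly like:
searcher.search(new TermQuery(new Term("contents", "hello")), new SubsetFilter(new int[]{3, 5})).length()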

Hope this helps,

Karsten


--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

Xtramind Technologies GmbH 
Stuhlsatzenhausweg 3 
D-66123 Saarbrücken

Phone +49 (681) 3 02-51 13 
Fax +49 (681) 3 02-51 09
[EMAIL PROTECTED] 
www.xtramind.com




-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 9, 2004 11:22
To: [EMAIL PROTECTED]
Subject: How to access information from a part of the index


Hello,
for my thesis I have to use a Lucene index for a text categorization program. For that I 
need to split the index in two, so I have a learning set and a 
validation set. The problem is that I don't know how to ask Lucene to give 
me, for example, the number of documents IN ONLY ONE of these subsets 
containing a specific term.
For example, I would like to get the number of documents containing the term hello in a 
subset of documents. This subset is a set of document numbers ({5,3}, where the 
complete index contains documents {0,1,2,3,4,5}).
How can I do this in an efficient way?
I tried to get all documents containing the term and then verify which documents 
belong to my subset. However, it appears that this is very slow. Thanks in 
advance, Claude Libois





Re: clustering results

2004-04-11 Thread Karsten Konrad

Hi (danger: shameless advertising below),

our partner, Brox IT-Solutions, is using our - XtraMind Technologies GmbH - clustering 
for implementing meta-search clustering of search results à la Vivisimo. Check out:

http://www.anyfinder.de/

The clustering is done on the snippets coming from search engines, but the original 
version that we still use in our own products is based on modified Lucene indexes, as 
these can efficiently handle lots of information on texts and terms. Our clustering 
engine not only clusters search results, but also performs trend recognition for 
competitive intelligence and similar tasks, though not too many people require such 
specialized features.

Brox's price models for this engine may be interesting for those who find other 
products too expensive; it also works with all existing search engines, not only 
Lucene. 

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com







-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 11, 2004 19:03
To: Lucene Users List
Subject: Re: clustering results


I got all excited reading the subject line "clustering results", but this isn't 
really clustering, is it?  This is more sorting.  Does anyone know of any work 
within Lucene (or another indexer) to do actual subject clustering (i.e. like 
Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)?  It would 
be pretty awesome if Lucene had such an ability; I know there aren't a whole lot 
of clustering options, and the commercial products are very expensive.  
Anyhow, just curious.

A brief definition of clustering: "automatically organizing search or database 
query results into meaningful hierarchical folders ... transforming long lists 
of search results into categorized information without any clumsy pre-processing of 
the source documents."

I'm not sure how it would be done...?  Based off of top Term Frequencies for a 
document?

-K

Quoting Michael A. Schoen [EMAIL PROTECTED]:

 So as Venu pointed out, sorting doesn't seem to help the problem. If 
 we have to walk the result set, access docs and dedupe using brute 
 force, we're better off w/ the standard order by relevance.
 
 If you've got an example of this type of clustering done in a more 
 efficient way, that'd be great.
 
 Any other ideas?
 
 
 - Original Message -
 From: Erik Hatcher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Saturday, April 10, 2004 12:35 AM
 Subject: Re: clustering results
 
 
  On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
   I have an index of urls, and need to display the top 10 results 
   for a given query, but want to display only 1 result per domain. 
   It seems that using either Hits or a HitCollector, I'll need to 
   access the doc, grab the domain field (I'll have it parse ahead of 
   time) and only take/display documents that are unique.
  
   A significant percentage of the time I expect I may have to access 
   thousands of results before I find 10 in unique domains. Is there 
   a faster approach that won't require accessing thousands of 
   documents?
 
  I have examples of this that I can post when I have more time, but a 
  quick pointer... check out the overloaded IndexSearcher.search() 
  methods which accept a Sort.  You can do really really interesting 
  slicing and dicing, I think, using it.  Try this one on for size:
 
   example.displayHits(allBooks,
   new Sort(new SortField[]{
  new SortField("category"),
  SortField.FIELD_SCORE,
  new SortField("pubmonth", SortField.INT, true)
   }));
 
  Be clever indexing the piece you want to group on - I think you may 
  find this the solution you're looking for.
 
  Erik
 
 
  
 
 









Re: Paid support for Lucene

2004-01-30 Thread Karsten Konrad


and eHatcher Solutions would be happy to as well :))


I don't think that one can be in much better hands 
here :)

Anyway, for mid-size to larger projects around 
the use of any search engine in Germany, I can recommend 
Brox IT-Solutions (http://www.brox.de/). They use
a nice, flexible framework in which you can apply Lucene
plus other optional search engines (I think they 
now have some 10 engines to choose from, with many tools
like summarization that work with all these engines). 

With the help of this framework, integrating Lucene
into an existing setup and enhancing or replacing other 
search engines can be done without programming one's 
leg off.

I know them because they use my clustering algorithm
when doing meta-searches. See

http://searchdemo.brox.de/

(search for Lucene - the clustering is geared towards German
though!)

Regards,

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 29, 2004 19:46
To: Lucene Users List
Subject: Re: Paid support for Lucene


and eHatcher Solutions would be happy to as well :))




On Jan 29, 2004, at 12:16 PM, Ryan Ackley wrote:
 I know of two:

 http://superlinksoftware.com
 http://jboss.org

 - Original Message -
 From: Boris Goldowsky [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Thursday, January 29, 2004 12:04 PM
 Subject: Paid support for Lucene


 Strangely, the web site does not seem to list any vendors who provide 
 incident support for Lucene.  That can't be right, can it?

 Can anyone point me to organizations that would be willing to provide 
 support for Lucene issues?

 Thanks,
 Boris
 --
 Boris Goldowsky
 [EMAIL PROTECTED]
 www.goldowsky.com/consulting







Re: Copy Directory to Directory function (backup)

2004-01-15 Thread Karsten Konrad

Hi,

an elegant method is to create an empty directory and merge
the index to be copied into it, using IndexWriter's
addIndexes(Directory[]) method. This way, you do not have to deal with files
at all.
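A minimal sketch of this (assuming the Lucene 1.x API; the paths and class name are hypothetical, and addIndexes() also optimizes the result):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexBackup {
    public static void main(String[] args) throws Exception {
        Directory source = FSDirectory.getDirectory("index", false);   // existing index
        IndexWriter writer = new IndexWriter("backup", new StandardAnalyzer(), true); // fresh, empty target
        writer.addIndexes(new Directory[] { source }); // copy by merging
        writer.close();
        source.close();
    }
}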

Regards,

Karsten

-Original Message-
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 15, 2004 13:28
To: [EMAIL PROTECTED]
Subject: Copy Directory to Directory function (backup)


Hi,
I would like to back up an index.

1) My first idea is to make a system copy of all the files,
but in the FSDirectory class there is no public method to tell where the 
directory is located. A simple method like

public File getDirectoryFile() {
    return directory;
}

would be great.

2) So I decided to write a copy(Directory source, Directory target) method. 
I have seen the openFile() and createFile() methods, but 
I don't know how to use them (see my function below; it throws an Exception):

private void copy(Directory source, Directory target) throws IOException {
    String[] files = source.list();
    for (int i = 0; i < files.length; i++) {
        InputStream in = source.openFile(files[i]);
        OutputStream out = target.createFile(files[i]);
        byte c;

        while ((c = in.readByte()) != -1) {
            out.writeByte(c);
        }
        in.close();
        out.close();
    }
}

Could someone help me, please? 
Nico 




Re: Probabilistic Model in Lucene - possible?

2003-12-03 Thread Karsten Konrad

Hi,


I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.


Sorry, I have no idea about how to use a probabilistic approach with 
Lucene, but if anyone does so, I would like to know, too. 

I am currently puzzled by a related question: I would like to know
if there are any approaches to get a confidence value for relevance 
rather than a ranking. I.e., it would be nice to have a ranking 
weight whose value has some kind of semantics such that we could 
compare results from different queries. Can probabilistic approaches 
do anything like this? 

Any help appreciated,

Karsten



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 3, 2003 15:13
To: [EMAIL PROTECTED]
Subject: Probabilistic Model in Lucene - possible?


Hello group,

from the very inspiring conversations with Karsten I know that Lucene is based on a 
Vector Space Model. I am just wondering if it would be possible to turn this into a 
probabilistic model approach. Of course I do know that I cannot change the underlying 
indexing and searching principles. However, it would be possible to change the index 
term weight to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I 
would need to implement another similarity algorithm.

I would highly appreciate it if the experts here (especially Karsten or
Chong) would look at my idea and tell me if this would be possible. If yes, how much effort 
would need to go into that? I am sure there are many other issues which I have not 
considered...

Kind Regards,
Ralf








Re: Document Similarity

2003-12-03 Thread Karsten Konrad

Hi,

 Do they produce same ranking results? 

No; Lucene's operations on query weight and length normalization are not
equivalent to a vanilla cosine in vector space.

 I guess the 2nd approach will be more precise but slow.

Query similarity 
will indeed be faster, but may actually not be worse. A straightforward 
cosine without IDF weighting of terms (which Lucene does apply) will almost certainly 
be less precise if you have documents of different lengths - word
occurrence probabilities in texts of different lengths vary greatly,
and the cosine of independent longer texts will often be greater than 
that of texts that actually have the same topic but are short, just because 
of randomly found non-content words.

If, on the other hand, you choose the right TF/IDF weighting of 
terms, the cosine in this warped vector space could (a) be 
equivalent to the one Lucene computes - this requires some work - or 
(b) might even get better on average.

However, the last time I counted, there were about 250 different 
TF/IDF formulas around in IR publications, machine learning,
computational linguistics and so on. Performance depends on domain
and language. 

But if I were you, I would just start playing and have fun with
the stuff...

Karsten


-Original Message-
From: Jing Su [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 2, 2003 18:12
To: [EMAIL PROTECTED]
Subject: Document Similarity



Hi,

I have read some posts in the user/developer archives about Lucene-based document 
similarity comparison. In summary, two approaches are
mentioned:

1 - Construct a query from the document;
2 - Treat each document as a vector, then rank according to their distance 
(cosine).

Do they produce the same ranking results? Is there any other way to do this? I guess the 2nd 
approach will be more precise but slow.

Thanks.

Jing




Re: Real Boolean Model in Lucene?

2003-12-01 Thread Karsten Konrad

Hi,


My Question: Does Lucene use TF/IDF for getting this? (which would mean it does not 
use the boolean model for the boolean query...)


Lucene indeed uses TF/IDF with length normalization for fields and documents. 

However, Lucene is downward compatible with the Boolean Model, where
documents are represented as 0/1-vectors in vector space. Ranking just 
adds weights to the elements of the result set, so the underlying 
interpretation of a query result can still be that of a 
Propositional/Boolean model. If a document appears in the result, 
its tokens evaluate the query (which actually is a propositional 
formula formed over words and phrases) to true. The representation
of documents is more complex in Lucene than required for the Boolean
Model, and as a result, Lucene can efficiently handle phrases and 
proximity searches, but these seem to be compatible extensions -
if you can do it in the Boolean Model, you can do it in Lucene :)

One place where Lucene is not 100% compatible with a basic Boolean Model is that 
full negation is a bit tricky - you cannot simply ask for all documents that 
do not contain a certain term unless you also have some term that appears in all 
documents. Not a big deal, really. 

If TF/IDF weighting is a problem for you, implementing the Similarity 
interface allows you to remove all references to length normalization 
and document frequencies.
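A minimal sketch of that route (assuming the Similarity API as it looks in Lucene 1.4; the class name is mine): a Similarity that switches off TF, IDF and length normalization, so every matching document scores alike up to coordination and boosts:

import org.apache.lucene.search.Similarity;

public class BooleanSimilarity extends Similarity {
    public float lengthNorm(String fieldName, int numTerms) { return 1.0f; } // no length normalization
    public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }           // presence only
    public float sloppyFreq(int distance) { return 1.0f; }
    public float idf(int docFreq, int numDocs) { return 1.0f; }              // no document frequencies
    public float coord(int overlap, int maxOverlap) { return 1.0f; }
}

You would install it via setSimilarity() on both the searcher and the writer, if your Lucene version provides those methods.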

Regards,

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 1, 2003 13:11
To: [EMAIL PROTECTED]
Subject: Real Boolean Model in Lucene?


Hi,

is it possible to use a real boolean model in Lucene for searching? When one uses 
the QueryParser with a boolean query (e.g. dog AND horse), one gets a list of 
documents from the Hits object. However, these documents have a ranking (score).

My question: Does Lucene use TF/IDF for getting this? (Which would mean it does not 
use the boolean model for the boolean query...)

How can one use a boolean model search, where all results have score=1? Example?

Cheers,
Ralph






Re: Re: Real Boolean Model in Lucene?

2003-12-01 Thread Karsten Konrad

Hello Ralf,


According to your description, Lucene basically maps the boolean query into the vector 
space and measures the cosine similarity towards other documents in the vector space. 
If I understood you correctly, you mean that if a document is found by Lucene based on a 
boolean query, it is relevant (boolean true). If it is not returned, it was boolean 
false. The score sits on top of it and can be used for ranking. If I wanted to use a 
true boolean model, I would therefore just need to ignore the score of the Hits 
document. Did I understand correctly?


Yes, I think that this is indeed pretty close to some theoretical foundation: the 
Boolean Model explains which documents fit a query, while some appropriate 
(Lucene's is good!) similarity function in vector space yields the ranking.

Now hell would be the place for me where I would have to prove that Lucene's 
ranking is exactly equivalent to some transformation of vector space and then 
using the *cosine* for the ranking. It can't really be, as Lucene sometimes 
returns scores > 1.0, and only some ruthless normalisation keeps them within 
0.0 to 1.0. In other words, there are still some rough corners in Lucene where 
a good theorist could find some work.

Could we leave this topic aside until some suicid.. err, I mean enthusiastic fellow
tries to work out a really good theory?

Regards,

Karsten





-Original Message-
From: Ralf B [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 1, 2003 14:28
To: Lucene Users List
Subject: Re: Re: Real Boolean Model in Lucene?


Hi Karsten,

I want to thank you for your qualified answer, as well as your answer from the 14th of 
November, where you agreed with me that Lucene is basically a VSM implementation. 
Sometimes it is difficult to make the link between the clear theory and its 
implementation.

According to your description, Lucene basically maps the boolean query into the vector 
space and measures the cosine similarity towards other documents in the vector space. 
If I understood you correctly, you mean that if a document is found by Lucene based on a 
boolean query, it is relevant (boolean true). If it is not returned, it was boolean 
false. The score sits on top of it and can be used for ranking. If I wanted to use a 
true boolean model, I would therefore just need to ignore the score of the Hits 
document. Did I understand correctly?

I agree that nobody really wants to do that. My question was intended to find out more 
about the theory implemented within Lucene.

Cheers,
Ralph


 

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Karsten Konrad

Hi,

it actually is quite nice, and it can be used in production for such things 
as have been discussed lately in this group. 

If you want to play it safe: the iterator breaks at dots after numbers 
(e.g. "15. March"); the precision of the algorithm can be increased if you 
never break after a number.

The implementation is fast.
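For reference, a small self-contained example of the class being discussed (plain J2SE; the sample text echoes the number-dot pitfall mentioned above):

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceSplit {
    public static void main(String[] args) {
        String text = "We met on 15. March. It was a short meeting.";
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.GERMANY);
        it.setText(text);
        int start = it.first();
        // Walk boundary to boundary and print each detected sentence.
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            System.out.println("[" + text.substring(start, end).trim() + "]");
            start = end;
        }
    }
}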

Regards,

Karsten

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com


-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 17, 2003 15:39
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]


There is already an implementation in the Java API for sentence boundary detection. 
The BreakIterator in the java.text package has this to say about sentence splitting:

Sentence boundary analysis allows selection with correct interpretation of periods 
within numbers and abbreviations, and trailing punctuation marks such as quotation 
marks and parentheses. 
http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM: 
http://oss.software.ibm.com/icu/index.html
It supports many languages.

I personally do not have any experience with the BreakIterator in Java. Has anyone 
used it in any production environment? I'd be very interested to learn more about its 
efficiency.

Regards,
Phil

 -Original Message-
 From: Chong, Herb [mailto:[EMAIL PROTECTED]
 Sent: November 17, 2003 08:53
 To: Lucene Users List
 Subject: RE: inter-term correlation [was Re: Vector Space Model in 
 Lucene?]


 I have a program written in Icon that does basic sentence splitting. 
 With about 5 heuristics and one small lookup table, I can get well 
 over 90% accuracy doing sentence boundary detection on email. For well-
 edited English text, like newswires, I can manage closer to 99%. This 
 is all that is needed for significantly improving a search engine's 
 performance when the query engine respects sentence boundaries. 
 Incidentally, the GATE Information Extraction framework cites some 
 references that indicate that for named entity feature extraction, 
 their system can exceed the ability of trained humans to detect and 
 classify named entities if only one person does the detection.
 Collaborating humans are still better, but no one has the time in
 practical applications.

 You probably know, since you know about Markov chains, that within-
 sentence term correlation, and hence the language model, is different 
 from that across sentences. Linguists have known this for a very long time. 
 It isn't hard to put this capability into a search engine, but it 
 absolutely breaks down unless there is sentence boundary information 
 stored for use at query time.

 Herb

 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
 Sent: Friday, November 14, 2003 5:54 PM
 To: Lucene Users List
 Subject: Re: inter-term correlation [was Re: Vector Space Model in 
 Lucene?]


 Well ... Sure, nothing can replace a human mind. But believe it or 
 not, there are studies which show that even human experts can 
 significantly differ in their opinions on what are key-phrases for a 
 given text. So, the results are never clear cut with humans either...

 So, in this sense a heuristic tool for sentence splitting and 
 key-phrase detection can go long ways. For example, the application I 
 mentioned, uses quite a few heuristic rules (+ Markov chains as a 
 heavier ammunition :-), and it comes up with the following phrases for 
 your email discussion (the text quoted below):

 (lang=EN): NLP, trainable rule-based tagging, natural language 
 processing, apache, NLP expert

 Now, this set of key-phrases does reflect the main noun-phrases in the 
 text... which means I have a practical and tangible benefit from NLP. 
 QED ;-)

 Best regards,
 Andrzej




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-15 Thread Karsten Konrad


Rules of linguistics? Is there such a thing? :)


Yes there are. How can you expect communication (the goal of
the game that natural language is about) to work if the game 
has no rules? 

Anyway, Herb is right: sentence boundaries do carry meaning, and the 
linguistic rule could be phrased as: constituents (concepts) mentioned 
together in one sentence have a closer relation than those that are not.

I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator (note, this is not trivial; 
not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 "income tax gain"

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries - I 
assume that a sentence with 100 words would be pretty unlikely,
but still well within the 1000-word separation created by increasing the
position. No linguistics necessary, actually, but it is an application
of a linguistic rule!
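A hedged sketch of that indexing trick (assuming a Lucene version whose Token supports position increments, and a tokenizer that still emits sentence-final punctuation - most analyzers strip it, so treat this purely as an illustration; the class name is mine):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SentenceGapFilter extends TokenFilter {
    private static final int GAP = 1000;  // the large position jump between sentences
    private boolean gapPending = false;

    public SentenceGapFilter(TokenStream in) { super(in); }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        if (gapPending) {
            // First token of a new sentence: add the gap.
            t.setPositionIncrement(t.getPositionIncrement() + GAP);
            gapPending = false;
        }
        String text = t.termText();
        if (text.endsWith(".") || text.endsWith("!") || text.endsWith("?")) {
            gapPending = true; // naive: not every '.' ends a sentence
        }
        return t;
    }
}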


Sure. But my take on this, is that pigs will fly before NLP turns into 
a predictable science :)


You mean like physics (new models every 10 years), biology (same),
medicine (er.. cancer research anyone?), or chemistry (the result could be
verified in 8 of 10 experiments...). What does predictability mean
to you? What sciences besides mathematics give you 100% certainty? 

But I guess you are in flame mode anyway now :)

Regards,

Karsten 


-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 14, 2003 20:04
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]



On Nov 14, 2003, at 19:50, Chong, Herb wrote:

 if you are handling inter correlation properly, then terms can't cross
 sentence boundaries.

Could you not break down your document along sentence boundaries? If you 
manage to figure out what a sentence is, that is.

 if you are not paying attention to sentence boundaries, then you are
 not following rules of linguistics.

Rules of linguistics? Is there such a thing? :)

PA.







Sentence dependencies (was: inter-term relation)

2003-11-15 Thread Karsten Konrad

Hello,


There are many cases where 
linguistically separate sentences do have strong dependencies; in web world 
simple things like list items may be very closely related. Put another way; 
it may not be trivially easy to detect sentence boundaries, nor is it certain 
that what (from language viewpoint) is a boundary really is hard boundary 
from semantic perspective? And are there not varying levels of separation 
(sentences close to each other often are related, back references being 
common), not just one, between sentences?


There is a computational linguistics theory that deals with such questions, 
Rhetorical Structure Theory; see http://www.sil.org/~mannb/rst/. Basically, 
each text is seen as a hierarchical structure formed from a few rhetorical 
relations. Interestingly, some relations are not too hard to guess once your 
text is semi-structured already (the relation between a paragraph header and 
its paragraph is a rhetorical one, for instance, and an HTML list is a sequence 
of sentences connected by the list relation, and so forth). 

Applying such theories to Lucene would require quite a lot of work while
analysing the texts, but I suspect Lucene could be convinced to work
on such structures and boost the relation of terms more if they appear
within closer RST-structure connections.

Regards,

Karsten

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com


-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED] 
Sent: Saturday, November 15, 2003 02:15
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


On Friday 14 November 2003 11:50, Chong, Herb wrote:
 if you are handling inter correlation properly, then terms can't cross 
 sentence boundaries. if you are not paying attention to sentence 
 boundaries, then you are not following rules of linguistics.

Isn't that quite strict interpretation, however? There are many cases where 
linguistically separate sentences do have strong dependencies; in web world 
simple things like list items may be very closely related. Put another way; it may not 
be trivially easy to detect sentence boundaries, nor is it certain 
that what (from language viewpoint) is a boundary really is hard boundary 
from semantic perspective? And are there not varying levels of separation 
(sentences close to each other often are related, back references being 
common), not just one, between sentences?

As to storing boundaries in the index: am I naive if I suggest just marker 
tokens that could easily be used to mark boundaries (sentence, paragraph, 
section)? Code that uses that information would obviously need to know the 
details of the marking used, but would it be infeasible to use such in-band 
information?

-+ Tatu +-





Re: Slow response time with datefilter

2003-11-15 Thread Karsten Konrad


 Not only is the query slow, but it seems to be slower the more results 
 it returns.


 Any suggestions?


If you have a lot of terms in that range, 
you can see that there is obviously some cycles spinning to do the work 
needed.


If the number of different date terms causes this effect, why not round
the date to the nearest or next midnight while indexing? Then filtering 
for the last 15 days would require walking over only 15-17 different date terms. 
If you don't do this, the number of different terms will be the same as
the number of documents you indexed, which explains the slowdown when you 
have more results.
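A minimal sketch of the rounding (assuming the Lucene 1.x DateField API; the class and field names are hypothetical):

import java.util.Calendar;
import java.util.Date;
import org.apache.lucene.document.DateField;

public class DateRounding {
    /** Rounds a date down to local midnight, so all documents from one
     *  day share a single indexed term. */
    public static String roundedDateValue(Date dtstamp) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(dtstamp);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return DateField.dateToString(cal.getTime());
    }
}

You would then index with doc.add(Field.Keyword("dtstamp", DateRounding.roundedDateValue(dtstamp))) instead of the raw timestamp.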

Regards,

Karsten



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Saturday, November 15, 2003 17:31
To: Lucene Users List
Subject: Re: Slow response time with datefilter


On Friday, November 14, 2003, at 07:16  PM, Dror Matalon wrote:
 We're seeing slow response time when we apply datefilter. A search 
 that takes 7 msec with no datefilter takes 368 msec when I filter on 
 the last fifteen days, and 632 msec on the last 30 days.

 Initially we saved doing document.add(Field.Keyword("dtstamp", 
 dtstamp));

 and then changed to doing document.add(Field.Keyword("dtstamp",
 DateField.dateToString(dtstamp)));

 where dtstamp is a java.util.Date

Both of the above lines of code are equivalent.  This is where having 
open-source is handy :)

   public static final Field Keyword(String name, Date value) {
 return new Field(name, DateField.dateToString(value), true, true, 
false);
   }

 We search doing the following:

   days_ago_value = Long.parseLong(days); //could throw
 NumberFormatException
   days_ago_value = new java.util.Date().getTime() - (days_ago_value * 
 8640L);
   hits = indexSearcher.search(query, DateFilter.After("dtstamp", 
   days_ago_value));

DateFilter itself is walking all the terms in the range you provide 
before executing the query.  If you have a lot of terms in that range, 
you can see that there is obviously some cycles spinning to do the work 
needed.

 Not only is the query slow, but it seems to be slower the more results 
 it returns.


 Any suggestions?

If this date range is pretty static, you could (in Lucene's CVS 
codebase) wrap the DateFilter with a CachingWrapperFilter.  Or you 
could construct a long-lived instance of an equivalent QueryFilter and 
reuse it across multiple queries.  You would likely see dramatic 
differences using either of these approaches.
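A hedged sketch of the second suggestion (the field name "dtstamp" and class name are hypothetical; QueryFilter caches its bit set per reader, so reusing one instance avoids re-walking the terms):

import org.apache.lucene.document.DateField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeQuery;

public class RecentFilter {
    /** Builds a reusable filter for the last 'days' days. */
    public static Filter lastDays(int days) {
        long now = System.currentTimeMillis();
        long from = now - days * 24L * 60L * 60L * 1000L;
        return new QueryFilter(new RangeQuery(
            new Term("dtstamp", DateField.timeToString(from)),
            new Term("dtstamp", DateField.timeToString(now)),
            true)); // inclusive range
    }
}

Build it once, keep the instance around, and pass it to every search: searcher.search(query, recentFilter).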

Erik





Re: Vector Space Model in Lucene?

2003-11-14 Thread Karsten Konrad

Hi,


 vector space is only one of several important ones.


what are these several other important ones?

While Lucene does not give an explicit vector space
representation - you cannot efficiently access the vector 
of one document - the index's basic representation is
a reduction of each document to its terms and 
frequencies, hence a mapping into a vector space, and hence
a vector space model. The relative term weights (TF/IDF) warp the
space and the vectors, but all of Lucene's search operations 
are nevertheless operations on a vector space model 
(ok, maybe phrase search is a bit different, as it 
requires an extension by position information).

E.g., searching for a term means finding all vectors that have 
a certain common dimension, and ranking means weighting these 
relative to their angle in vector space. 

KK

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com


-Original Message-
From: Chong, Herb [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 14, 2003 14:35
To: Lucene Users List
Subject: RE: Vector Space Model in Lucene?


Does it matter? Vector space is only one of several important ones.

Herb

-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?


Really? And what model is used/implemented by Lucene?

THX
Leo




Re: Negative boosting?

2003-09-11 Thread Karsten Konrad

I have done negative boosts; it does work, but you must construct your
query terms accordingly. I found the results somewhat unintuitive -
the mixture of negative and positive boosts (mostly 1.0), TF/IDF 
and document length normalization will very often make documents 
that you did not expect to be relevant more relevant. 
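For illustration, a minimal sketch of constructing such a query by hand (the field names, terms and boost value are hypothetical; Query.setBoost() accepts any float, including negative ones):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NegativeBoost {
    public static Query build() {
        BooleanQuery q = new BooleanQuery();
        TermQuery wanted = new TermQuery(new Term("contents", "lucene"));
        TermQuery damped = new TermQuery(new Term("contents", "draft"));
        damped.setBoost(-0.5f);      // negative boost on an optional clause
        q.add(wanted, true, false);  // required clause
        q.add(damped, false, false); // optional; pulls the score down when it matches
        return q;
    }
}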

Regards,

Karsten

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com

-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 11, 2003 16:05
To: Lucene Users Group
Subject: Negative boosting?


I've often found the use of query-based boosting to be very beneficial.  This is 
particularly so when it's easy to identify the term that I want to stand out as a 
primary selector.  

However, I've come across quite a few other cases where it would be easier (and more 
logical) to apply a negative boost - to de-emphasize the match when the term is 
present.  

Is it possible to apply a negative boost (it doesn't seem to work), and if not, would 
it break anything significant if that were added?

Regards,

Terry




Re: Exceptions while Updating an Index

2003-08-28 Thread Karsten Konrad

Hi,

it is very easy to provoke the errors you describe
when you open many alternating writers and 
readers on Windows.

You can circumvent this problem by using fewer
writer and reader objects: e.g., first delete
all documents that need updating, then add all the
updated documents. Or use a second index
only for the writing and merge it into the first
after you have deleted the updated documents
there.
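A minimal sketch of the first variant (assuming the Lucene 1.x API; the "id" field, class name and updates map are hypothetical):

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BatchUpdate {
    /** updates maps document id -> replacement Document. */
    public static void updateAll(String indexPath, Map updates) throws IOException {
        // Phase 1: one reader deletes all stale documents.
        IndexReader reader = IndexReader.open(indexPath);
        for (Iterator it = updates.keySet().iterator(); it.hasNext();) {
            reader.delete(new Term("id", (String) it.next()));
        }
        reader.close();
        // Phase 2: one writer adds all replacements.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        for (Iterator it = updates.values().iterator(); it.hasNext();) {
            writer.addDocument((Document) it.next());
        }
        writer.close();
    }
}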

Regards,

Karsten



-Original Message-
From: Wilton, Reece [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 27, 2003 23:18
To: Lucene Users List
Subject: Exceptions while Updating an Index


Hi,

I am getting exceptions because Lucene can't rename files.  Here are a couple of the 
exceptions that I'm getting:
 - java.io.IOException: couldn't rename _6lr.tmp to _6lr.del
 - java.io.IOException: couldn't rename segments.new to segments

I am able to index many documents successfully on my Windows machine. The problem 
occurs for me during the updating process.  My updating process goes like this:

  for (each xml file i want to index) {
// create new document
parse the xml file
populate a new Lucene document with the fields from my XML file

// remove old document from index
open an index reader
delete the term from the index   // this successfully deletes the
one document
close the index reader

// add new document to index
open an index writer
add the document to the index writer
close the index writer
  }
   
Any ideas on how to stop these exceptions from occurring?  No other process is reading 
or writing to the index while this process is running.

Thanks,
Reece




Mysterious bugs...

2003-06-24 Thread Karsten Konrad

Hi,

after indexing 238000 Documents on a Linux box, we get the
following error:

Caused by: java.lang.IllegalStateException: docs out of order
at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)

Another error message we sometimes see (not reproducible) is: 
IOException: No buffer space available.


Does anybody know the cause of these problems? Thanks!

Karsten




Re: Analyzers, Queries: three questions

2003-06-11 Thread Karsten Konrad

Hi,



1) How can I search untokenized fields? Do I have to pass my query 
through a NullAnalyzer?


No, the contents of an untokenized (i.e., keyword) field
are stored as one Lucene token. Hence, you must build such
a token from your query and use a
TermQuery to search for it.

In general, tokenize the fields over which you search, unless
you want to treat field contents as identifiers (e.g.,
unique document names or such).
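For illustration (the index path, field and value are hypothetical; note the term text must match the stored keyword exactly, untouched by any analyzer):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class KeywordSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index"); // hypothetical path
        // One token, exactly as indexed by Field.Keyword():
        Hits hits = searcher.search(new TermQuery(new Term("title", "Workers HowTo")));
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}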



2) How can I pass the value of a field through an Analyzer before 
storing it?


A text field is automatically analyzed and tokenized by the given
analyzer; you do not have to do it manually. However, you
could preprocess your text in any way you want before that happens - 
simply apply your operations to the content you index, but make sure that
you use a compatible analyzer when searching.



3) How can I fine-tune my query, e.g. by saying that for searching 
within the contents fields I want to pass the query through an Analyzer, 
for searching within the title field, however, I don't want the Analyzer 
pass. And I want a Hit if either field provides it.


Unfortunately, you cannot specify different analyzers for different fields. 
You could process your query after parsing by traversing and manipulating
the query object; this method requires some programming, though, and with
the power of Lucene's default query language, you might end up with a lot
of work here.
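One workaround in this spirit is to skip MultiFieldQueryParser and combine the pieces yourself - roughly like this sketch (assuming the sandbox SnowballAnalyzer and hypothetical field names):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PerFieldQuery {
    public static Query build(String userInput) throws Exception {
        BooleanQuery combined = new BooleanQuery();
        // "contents" goes through the analyzer...
        Query contents = QueryParser.parse(userInput, "contents",
                                           new SnowballAnalyzer("English"));
        combined.add(contents, false, false);
        // ...while "title" is matched verbatim, with no analyzer pass.
        combined.add(new TermQuery(new Term("title", userInput)), false, false);
        return combined; // a hit in either field suffices
    }
}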

Regards,

Karsten

-Original Message-
From: Ulrich Mayring [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 11, 2003 11:50
To: [EMAIL PROTECTED]
Subject: Analyzers, Queries: three questions


Hi folks,

I'm using the Snowball analyzer to index my documents. As an example I 
took the Tomcat documentation, which includes a document with the title 
"Workers HowTo". I put this string in a field called "title", within 
which I later do my query (of course again with the same SnowballAnalyzer).

At first I indexed the field as a Keyword (== not tokenized) and Lucene 
later couldn't find it when I searched for "Workers HowTo". I found out 
that tokenization apparently includes application of the Analyzer, so if 
I put my query through an Analyzer, then the field to search must be 
tokenized. Hence my first question:

1) How can I search untokenized fields? Do I have to pass my query 
through a NullAnalyzer?

Next I made the title field a Text field, so it is tokenized. Now Lucene 
finds the document, but with a low score of 0.27. Sure enough, browsing 
the index showed me that the value of the title field is stored 
unanalyzed, i.e. "Workers HowTo" - exactly as retrieved from the 
document. On the other hand, after parsing the query, the query is 
actually transformed to (title:worker title:howto). This does of 
course not give an exact match, hence (I guess) the low score, and my 
next questions:

2) How can I pass the value of a field through an Analyzer before 
storing it?

3) How can I fine-tune my query, e.g. by saying that for searching 
within the contents fields I want to pass the query through an Analyzer, 
for searching within the title field, however, I don't want the Analyzer 
pass. And I want a Hit if either field provides it.

Currently I'm using the MultiFieldQueryParser, but that only allows one 
Analyzer for all the fields.

Thank you very much in advance for any pointers,

Ulrich






Re: Re: Analyzers, Queries: three questions

2003-06-11 Thread Karsten Konrad

Hi,

field contents indexed with Field.Text are stored 
verbatim in the index - thus, you can get back the 
original text when you access it using stringValue(). 

This has nothing to do with how the text is 
indexed, i.e., how it is tokenized and stored in
the index. You probably have a token "workers" and 
one "howto", both pointing to this text (that's why 
it is called an inverted index: the words point to
the text). Your analyzer does this tokenization
for you.

If you search using the query parser, you
can only do this on indexed fields, e.g.,
those indexed with Field.Text or Field.UnStored. 
If you store a text as a keyword,
you must construct a TermQuery and search
with it. Thus, you would actually get a
term (title, "Workers HowTo").

Regards,

Karsten

-Original Message-
From: Ulrich Mayring [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 11, 2003 13:36
To: [EMAIL PROTECTED]
Subject: Re: Re: Analyzers, Queries: three questions


Karsten Konrad wrote:
 
 2) How can I pass the value of a field through an Analyzer before 
 storing it?
 
 A text field is automatically analyzed and tokenized by the given
 analyzer, you do not have to do it manually.

Well, but if I browse my index I see all the terms stored in the 
original form. I use this code:

doc.add(Field.Text("title", "Workers HowTo"));
...
// Build and execute Query, so that only the above document is found
Document d = hits.doc(0);
Field field = d.getField("title");
System.out.println(field.name() + "," + field.stringValue());

This outputs "title,Workers HowTo" - the untokenized, unanalyzed form.

So, what's wrong here?

cheers,

Ulrich






Re: DBDirectory available for download

2003-06-03 Thread Karsten Konrad

Thanks,

do you already have some numbers on how it compares to the
file system implementation, i.e., how fast indexing
and searching are?

Regards,

Karsten

-Original Message-
From: Anthony Eden [mailto:[EMAIL PROTECTED]
Sent: Monday, June 2, 2003 22:23
To: Lucene Users List
Subject: DBDirectory available for download


Version 1.0 of the DBDirectory library, which implements a Directory
that can store indices in a database, is now available for download.
There are two versions:

   Tar GZIP:
http://www.anthonyeden.com/download/lucene-dbdirectory-1.0.tar.gz
   ZIP: http://www.anthonyeden.com/download/lucene-dbdirectory-1.0.zip

The source code is included.  Please read the README file for
instructions on using DBDirectory.  I have only tested it with MySQL but
would be happy to add other database scripts if anyone would like to
submit them.  Please post any questions here on the mailing list.

Otis, is there anything left to do to get this into the sandbox?
Additionally, how will I maintain the code if it is in the sandbox?
Will I get write access to the part of the CVS repository which would
house DBDirectory?  I currently have all of the code in my private CVS.

Sincerely,
Anthony Eden







Re: Search for similar terms

2003-06-02 Thread Karsten Konrad


Hi,

the expensive part of the algorithm is the comparison
of two terms using the Levenshtein edit distance, which is 
done for all terms - with possibly horrible consequences for 
performance on large indexes.

With:

  TermEnum enum = reader.terms(new Term(field, start));

you get a term enumerator that starts at the given start prefix.

Use this to compute a term enumerator that starts near the
term(s) you are looking for. In the termCompare method,
you should make sure that the prefix is the same and that
the length of the terms to compare is not too different.

Like, e.g.:

if ((field == term.field()) && target.startsWith(start)) {
  int targetlen = target.length();

  if (Math.abs(textlen - targetlen) < 5) {
    int dist = editDistance(text, target, textlen, targetlen);

    distance = 1 - ((double)dist / (double)Math.min(textlen, targetlen));
    // ...
  }
}

The modification I propose here has some downsides - if a typo
occurs at the beginning of a word, you will not get a proper
result.

I am not sure on this, but I think that term enumeration could be
much more efficient for purposes like this if the terms(Term t) method 
would only enumerate terms of the same field as t. As far as I
understand this comment, the enumeration goes over all terms
after t:

  /** Returns an enumeration of all terms after a given term.
The enumeration is ordered by Term.compareTo().  Each term
is greater than all that precede it in the enumeration.
   */
  public abstract TermEnum terms(Term t) throws IOException;

I haven't found a way to stop the enumeration once I am sure that
the input term can not match any more :)
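One way to stop early, for what it's worth (a sketch against the Lucene 1.x TermEnum API; the class name is mine): since the enumeration is ordered by field and then text, you can break as soon as the field or the prefix no longer matches:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class PrefixScan {
    public static void scan(IndexReader reader, String field, String prefix)
            throws IOException {
        field = field.intern(); // Term.field() is interned, so == comparison works
        TermEnum terms = reader.terms(new Term(field, prefix));
        try {
            do {
                Term t = terms.term();
                // Once we leave the field or the prefix, nothing later can match.
                if (t == null || t.field() != field || !t.text().startsWith(prefix)) break;
                // ... compare t.text() by edit distance here ...
            } while (terms.next());
        } finally {
            terms.close();
        }
    }
}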

Regards,

Karsten










-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED]
Sent: Monday, June 2, 2003 13:17
To: Karsten Konrad
Cc: Lucene Users List
Subject: Re: Search for similar terms


 have a look at the FuzzyTermEnum class in Lucene.


The FuzzyTermEnum class is truly useful... if I could get it to be a
bit faster. By faster I mean something on the order of one second for a
half-gigabyte index; currently the best I get is five seconds.


What I am trying to accomplish:

- If a query does not yield any results, choose and display out of all
similar terms the one which occurs most often in the index.


What I have tried so far:

- Required the first three characters to match exactly, excluding them from the
similarity search (time reduced from 15s to 5s).
- Increased FUZZY_THRESHOLD to 1.75 (no significant effect on time).
- Only executed termCompare for terms with a higher frequency than the
best matching term seen so far (no effect).


Observations:

- Time seems to be independent of the frequency of a term.


Any further ideas would be greatly appreciated!

Also (dear committers...), it would be great if FuzzyTermEnum could be
subclassed, rather than having to resort to copy paste (the class is
final).


--
Eric Jain





Re: Search for similar terms

2003-05-31 Thread Karsten Konrad

Hi,

please have a look at the FuzzyTermEnum class in Lucene.
There is an impressive implementation of the Levenshtein distance 
there that you can use; simply set the fuzzy threshold higher 
than 0.5 (0.75 seems to work fine) and modify the 
termCompare method such that the last term produced is always
the one which you consider the best, i.e., which has the smallest
edit distance but the highest idf. 

You can greatly speed up the computation by making sure in your
termCompare method that you only compare terms by Levenshtein that
have at least a common prefix of a few characters, say 3 or 4. 
Thus, it will repair "notebok" into "notebook", but not "nitebook" 
into "notebook". Most spelling errors seem to appear at the end of 
a word, so the restriction is not unreasonable. 

I use a similar method for auto-expanding dubious terms on large 
indexes (> 1 GB), and the performance is still quite good.

Regards,

Karsten





-Original Message-
From: Dario Dentale [mailto:[EMAIL PROTECTED]
Sent: Friday, May 30, 2003 19:05
To: Lucene Users List
Subject: Re: Search for similar terms


Thanks, for the answer.

I was searching for a solution based not on a dictionary, but on the list of
terms (with their relative frequencies) contained in the Lucene index.

In this way (I think) I can obtain more significant results:
I can use this method for multiple languages (without a dictionary for each, and
without knowing which language is used in the query string),
and especially for out-of-dictionary terms (e.g., on an e-commerce site you
can find "Nikon coolpix", which is not in a dictionary).

I was searching for some algorithm that can calculate the similarity
coefficient between two terms; multiplying it
by the term's frequency in the indexed documents would give a score.

Do you think that this is the wrong way?

Regards,
Dario

- Original Message - 
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, May 30, 2003 3:51 PM
Subject: Re: Search for similar terms



 Perform the Lucene search.  If you get no or few hits, send the query term
 to a spell checker, like ispell.  Echo the alternative spelling(s) to the
 user.

 DaveB





From: Dario Dentale [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Search for similar terms
Date: 05/30/03 05:15 AM
Please respond to Lucene Users List






 Hi,
 does anybody know the best way to implement in Lucene a functionality
 (that Google has) like this:

 Search text -> notebok

 Answer -> Did you mean: notebook?

 Thanks,
 Dario








