Re: TFIDF Implementation

2004-12-15 Thread Christoph Kiefer
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. As a quick-and-dirty
illustration I wrote a class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think the
current implementation will not be usable in this way. I also think I
will switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.
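For comparison, here is a rough sketch (untested, names illustrative) of what
a sparse representation could look like -- keeping only the non-zero TF*IDF
cells in a plain map instead of the dense Jama matrix:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class SparseTFIDFSketch {

	/** Maps "termIndex:docId" -> Double(tf*idf); only non-zero cells are stored. */
	public static Map build(IndexReader reader) throws IOException {
		Map cells = new HashMap();
		int numDocs = reader.numDocs();
		TermEnum terms = reader.terms();
		for (int t = 0; terms.next(); t++) {
			Term term = terms.term();
			double idf = Math.log((double) numDocs / (double) terms.docFreq()) / Math.log(10.0);
			TermDocs td = reader.termDocs(term);
			while (td.next()) {
				cells.put(t + ":" + td.doc(), new Double(td.freq() * idf));
			}
			td.close();
		}
		terms.close();
		return cells;
	}
}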

What do you think?

Best,
Christoph

-- 
Christoph Kiefer

Department of Informatics, University of Zurich

Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import Jama.Matrix;

/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {
	
	private File indexDir = null;
	private File dataDir = null;
	private String target = "";
	private String query = "";
	private int targetDocumentNumber = -1;
	private final String ME = this.getClass().getName();
	private int fileCounter = 0;
	
	public TFIDF_Lucene( String indexDir, String dataDir, String target, String query ) {
		this.indexDir = new File(indexDir);
		this.dataDir = new File(dataDir);
		this.target = target;
		this.query = query;
	}
	
	public String getName() {
		return "TFIDF_Lucene_Similarity_Measure";
	}
	
	private void makeIndex() {
		try {
			IndexWriter writer = new IndexWriter(indexDir, new SnowballAnalyzer( "English", StopAnalyzer.ENGLISH_STOP_WORDS ), false);
			indexDirectory(writer, dataDir);
			writer.optimize();
			writer.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
	
	private void indexDirectory(IndexWriter writer, File dir) {
		File[] files = dir.listFiles();
		for (int i=0; i < files.length; i++) {
			File f = files[i];
			if (f.isDirectory()) {
				indexDirectory(writer, f);  // recurse
			} else if (f.getName().endsWith(".txt")) {
				indexFile(writer, f);
			}
		}
	}
	
	private void indexFile(IndexWriter writer, File f) {
		try {
			System.out.println( "Indexing " + f.getName() + ", " + (fileCounter++) );
			String name = f.getCanonicalPath();
			//System.out.println(name);
			Document doc = new Document();
			doc.add( Field.Text( "contents", new FileReader(f), true ) );
			writer.addDocument( doc );
			
			if ( name.matches( dataDir + "/" + target + ".txt" ) ) {
				targetDocumentNumber = writer.docCount();
			}
			
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
	
	public Matrix getTFIDFMatrix(File indexDir) throws IOException {
		Directory fsDir = FSDirectory.getDirectory( indexDir, false );
		IndexReader reader = IndexReader.open( fsDir );
		
		int numberOfTerms = 0;
		int numberOfDocuments = reader.numDocs();
		
		TermEnum allTerms = reader.terms();
		for ( ; allTerms.next(); ) {
			allTerms.term();
			numberOfTerms++;
		}
		
		System.out.println( "Total number of terms in index is " + numberOfTerms );
		System.out.println( "Total number of documents in index is " + numberOfDocuments );
		
		double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
		
		for ( int i = 0; i < numberOfTerms; i++ ) {
			for ( int j = 0; j < numberOfDocuments; j++ ) {
				TFIDFMatrix[i][j] = 0.0;
			}
		}
		
		allTerms = reader.terms();
		for ( int i = 0 ; allTerms.next(); i++ ) {
			
			Term term = allTerms.term();
			TermDocs td = reader.termDocs(term);
			for ( ; td.next(); ) {
				TFIDFMatrix[i][td.doc()] = td.freq();
			}
			
		}
		
		allTerms = reader.terms();
		for ( int i = 0 ; allTerms.next(); i++ ) {
			for ( int j = 0; j < numberOfDocuments; j++ ) {
				double tf = TFIDFMatrix[i][j];
				double docFreq = (double)allTerms.docFreq();
				double idf = ( Math.log( (double)numberOfDocuments / docFreq ) ) / 2.30258509299405;
				//System.out.println( "Term: " + i + " Document " + j + " TF/DocFreq/IDF: " + tf + " " + docFreq + " " + idf );
				TFIDFMatrix[i][j] = tf * idf;
			}
		}
		
		reader.close();
		return new Matrix(TFIDFMatrix);
	}

Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table... do SELECT * FROM table ... OFFSET=X
LIMIT=Y.

Don't close the IndexWriter - use a single instance.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequency of segment merges, and the
maximal size of index segments with 3 IndexWriter parameters,
described fairly verbosely in the javadocs.
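Roughly, on the Java side I mean something like this (untested sketch; the
paging SQL, driver URL, column names, and parameter values are only examples
-- adjust them for your DB):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchedDbIndexer {
	public static void main(String[] args) throws Exception {
		// one IndexWriter for the whole run
		IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
		// the three parameters mentioned above (see the IndexWriter javadocs)
		writer.minMergeDocs = 1000;              // Documents buffered in RAM per segment
		writer.mergeFactor = 10;                 // how many segments get merged at once
		writer.maxMergeDocs = Integer.MAX_VALUE; // cap on the size of a merged segment

		Connection con = DriverManager.getConnection("jdbc:somedb://host/db", "user", "pw");
		Statement st = con.createStatement();
		int batch = 10000;
		for (int offset = 0; ; offset += batch) {
			ResultSet rs = st.executeQuery("SELECT id, title, body FROM mytable ORDER BY id"
					+ " LIMIT " + batch + " OFFSET " + offset);
			boolean any = false;
			while (rs.next()) {
				any = true;
				Document doc = new Document();
				doc.add(Field.Keyword("id", rs.getString("id")));
				doc.add(Field.Text("title", rs.getString("title")));
				doc.add(Field.UnStored("body", rs.getString("body")));
				writer.addDocument(doc);
			}
			rs.close();
			if (!any) break;   // no more rows
		}
		writer.optimize();     // only once, at the very end
		writer.close();
		con.close();
	}
}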

Since you are using the .Net version, you should really consult
dotLucene guy(s).  Running under the profiler should also tell you
where the time and memory go.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

 Thanks Otis!
 
 What do you mean by building it in batches? Does it
 mean I should close the IndexWriter every 1000 rows
 and reopen it? Does that release references to the
 document objects so that they can be
 garbage-collected?
 
 I'm calling optimize() only at the end.
 
 I agree that 1500 documents is very small. I'm
 building the index on a PC with 512 megs, and the
 indexing process is quickly gobbling up around 400
 megs when I index around 1800 documents and the whole
 machine is grinding to a virtual halt. I'm using the
 latest DotLucene .NET port, so maybe there's a memory
 leak in it.
 
 I have experience with AltaVista search (acquired by
 FastSearch), and I used to call MakeStable() every
 20,000 documents to flush memory structures to disk.
 There doesn't seem to be an equivalent in Lucene.
 
 -- Homam
 
 
 
 
 
 
 --- Otis Gospodnetic [EMAIL PROTECTED]
 wrote:
 
  Hello,
  
  There are a few things you can do:
  
  1) Don't just pull all rows from the DB at once.  Do
  that in batches.
  
  2) If you can get a Reader from your SqlDataReader,
  consider this:
 

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
  
  3) Give the JVM more memory to play with by using
  -Xms and -Xmx JVM
  parameters
  
  4) See IndexWriter's minMergeDocs parameter.
  
  5) Are you calling optimize() at some point by any
  chance?  Leave that
  call for the end.
  
  1500 documents with 30 columns of short
  String/number values is not a
  lot.  You may be doing something else not Lucene
  related that's slowing
  things down.
  
  Otis
  
  
  --- Homam S.A. [EMAIL PROTECTED] wrote:
  
   I'm trying to index a large number of records from
  the
   DB (a few millions). Each record will be stored as
  a
   document with about 30 fields, most of them are
   UnStored and represent small strings or numbers.
  No
   huge DB Text fields.
   
   But I'm running out of memory very fast, and the
   indexing is slowing down to a crawl once I hit
  around
   1500 records. The problem is each document is
  holding
   references to the string objects returned from
   ToString() on the DB field, and the IndexWriter is
   holding references to all these document objects
  in
    memory, so the garbage collector isn't getting a
  chance
   to clean these up.
   
   How do you guys go about indexing a large DB
  table?
   Here's a snippet of my code (this method is called
  for
   each record in the DB):
   
   private void IndexRow(SqlDataReader rdr,
  IndexWriter
   iw) {
 Document doc = new Document();
 for (int i = 0; i < BrowseFieldNames.Length; i++)
  {
 doc.Add(Field.UnStored(BrowseFieldNames[i],
   rdr.GetValue(i).ToString()));
 }
 iw.AddDocument(doc);
   }
   
   
   
   
 
   __ 
   Do you Yahoo!? 
   Yahoo! Mail - Find what you need with new enhanced
  search.
   http://info.mail.yahoo.com/mail_250
   
  
 
 -
   To unsubscribe, e-mail:
  [EMAIL PROTECTED]
   For additional commands, e-mail:
  [EMAIL PROTECTED]
   
   
  
  
 
 -
  To unsubscribe, e-mail:
  [EMAIL PROTECTED]
  For additional commands, e-mail:
  [EMAIL PROTECTED]
  
  
 
 
 
   
 __ 
 Do you Yahoo!? 
 Take Yahoo! Mail with you! Get it on your mobile phone. 
 http://mobile.yahoo.com/maildemo 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing a large number of DB records

2004-12-15 Thread Garrett Heaver
Hi Homan

I had a similar problem as you in that I was indexing A LOT of data

Essentially how I got round it was to batch the index.

What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge the temporary index into the live index (which also
optimizes the live index), then delete the temporary index. On the next loop
I'd only query rows from the db above the id in the maxdoc of the live index
and set the max rows of the query to 10,000,
i.e

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
Index.MaxDoc()} ORDER BY [id_field] ASC

By ensuring that the documents go into the index sequentially your problem is
solved, and memory usage on mine (dotlucene 1.3) is low.
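Roughly like this (a Java sketch of the same loop; my real code is dotlucene/C#,
and the paths and the helper method here are made-up placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BatchMergeSketch {
	public static void main(String[] args) throws Exception {
		// assumes the live index already exists on disk
		Directory liveDir = FSDirectory.getDirectory("/path/to/live-index", false);
		while (true) {
			// 1) build a temporary index holding the next batch of up to 10,000 docs
			Directory tempDir = new RAMDirectory();
			IndexWriter temp = new IndexWriter(tempDir, new StandardAnalyzer(), true);
			int added = addNextBatchFromDb(temp);  // placeholder for the paging query above
			temp.close();
			if (added == 0) {
				break;  // nothing left in the DB
			}
			// 2) merge the batch into the live index; addIndexes() also optimizes it
			IndexWriter live = new IndexWriter(liveDir, new StandardAnalyzer(), false);
			live.addIndexes(new Directory[] { tempDir });
			live.close();
		}
	}

	// placeholder: run the SELECT described above and add one Document per row
	private static int addNextBatchFromDb(IndexWriter temp) {
		return 0;
	}
}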

Regards
Garrett

-Original Message-
From: Homam S.A. [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the
DB (a few millions). Each record will be stored as a
document with about 30 fields, most of them are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is each document is holding
references to the string objects returned from
ToString() on the DB field, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter
iw) {
Document doc = new Document();
for (int i = 0; i < BrowseFieldNames.Length; i++) {
doc.Add(Field.UnStored(BrowseFieldNames[i],
rdr.GetValue(i).ToString()));
}
iw.AddDocument(doc);
}





__ 
Do you Yahoo!? 
Yahoo! Mail - Find what you need with new enhanced search.
http://info.mail.yahoo.com/mail_250

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this approach really includes some unnecessary extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homan
 
 I had a similar problem as you in that I was indexing A LOT of data
 
 Essentially how I got round it was to batch the index.
 
 What I was doing was to add 10,000 documents to a temporary index,
 use
 addIndexes() to merge to temporary index into the live index (which
 also
 optimizes the live index) then delete the temporary index. On the
 next loop
 I'd only query rows from the db above the id in the maxdoc of the
 live index
 and set the max rows of the query to 10,000
 i.e
 
 SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 Ensuring that the documents go into the index sequentially your
 problem is
 solved and memory usage on mine (dotlucene 1.3) is low
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB field, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter
 iw) {
   Document doc = new Document();
   for (int i = 0; i < BrowseFieldNames.Length; i++) {
   doc.Add(Field.UnStored(BrowseFieldNames[i],
 rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 __ 
 Do you Yahoo!? 
 Yahoo! Mail - Find what you need with new enhanced search.
 http://info.mail.yahoo.com/mail_250
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: C# Ports

2004-12-15 Thread Ben Litchfield


I have created a DLL from the lucene jars for use in the PDFBox project.
It uses IKVM (http://www.ikvm.net) to create a DLL from a jar.

The binary version can be found here
http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip

This includes the ant script used to create the DLL files.

This method is by far the easiest way to port it; see previous posts about
advantages and disadvantages.

Ben


On Wed, 15 Dec 2004, Garrett Heaver wrote:

 I was just wondering what tools (JLCA?) people are using to port Lucene to
 c# as I'd be well interested in converting things like snowball stemmers,
 wordnet etc.



 Thanks

 Garrett



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize.  The pure vector space model implements a cosine in the strictly positive sector of the coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query).
I question whether such scores are more meaningful.  Yes, such scores 
would be guaranteed to be between zero and one, but would 0.8 really be 
meaningful?  I don't think so.  Do you have pointers to research which 
demonstrates this?  E.g., when such a scoring method is used, that 
thresholding by score is useful across queries?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 19:29, Mike Snare wrote:

 In my case, the words are keywords that must remain as is, searchable
 with the hyphen in place. It was easy enough to modify the tokenizer
 to do what I need, so I'm not really asking for help there. I'm
 really just curious as to why it is that a-1 is considered a single
 token, but a-b is split.

a-1 is considered a typical product name that needs to be unchanged 
(there's a comment in the source that mentions this). Indexing 
hyphen-word as two tokens has the advantage that it can then be found 
with the following queries:
hyphen-word (will be turned into a phrase query internally)
"hyphen word" (phrase query)
(it cannot be found searching for hyphenword, however).
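A quick way to see the difference is to run both forms through
StandardAnalyzer and print the tokens (untested snippet; the expected output
in the comment is what I'd anticipate, not something I just ran):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
	public static void main(String[] args) throws Exception {
		TokenStream ts = new StandardAnalyzer().tokenStream("f",
				new StringReader("a-1 hyphen-word"));
		for (Token t = ts.next(); t != null; t = ts.next()) {
			System.out.println(t.termText() + " [" + t.type() + "]");
		}
		// expected: "a-1" survives as a single token, while "hyphen-word"
		// comes out as the two tokens "hyphen" and "word"
	}
}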

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: A question about scoring function in Lucene

2004-12-15 Thread Chris Hostetter
: I question whether such scores are more meaningful.  Yes, such scores
: would be guaranteed to be between zero and one, but would 0.8 really be
: meaningful?  I don't think so.  Do you have pointers to research which
: demonstrates this?  E.g., when such a scoring method is used, that
: thresholding by score is useful across queries?

I freely admit that I'm way out of my league on these scoring discussions,
but I believe what the OP was referring to was not any intrinsic benefit in
having a score between 0 and 1, but of having a uniform normalization of
scores regardless of search terms.

For example, using the current scoring equation, if i do a search for
Doug Cutting and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to Doug Cutting

If I then do a search for Chris Hostetter and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1

...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally good)

However, I *cannot* say either of the following:
  x) document #9 is as relevant for Chris Hostetter as document #1 is
 relevant to Doug Cutting
  y) document #5 is equally relevant to both Chris Hostetter and
 Doug Cutting


I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x & y.

If they are correct, then I for one can see a definite benefit in that.
If for no other reason than in making minimum score thresholds more
meaningful.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
 a-1 is considered a typical product name that needs to be unchanged
 (there's a comment in the source that mentions this). Indexing
 hyphen-word as two tokens has the advantage that it can then be found
 with the following queries:
 hyphen-word (will be turned into a phrase query internally)
 "hyphen word" (phrase query)
 (it cannot be found searching for hyphenword, however).

Sure.  But phrase queries are slower than a single word query.  In my
case, using the standard analyzer prior to my modification caused a
single (hyphenated) word query to take upwards of 10 seconds (1M+
documents with ~400K terms).  The exact same search with the new
Analyzer takes .5 seconds (granted the new tokenization caused a
significant reduction in the number of terms).  Also, the phrase query
would place the same value on a doc that simply had the two words as on a
doc that had the hyphenated version, wouldn't it?  This seems odd.

In addition, why do we assume that a-1 is a typical product name but
a-b isn't?

I am in no way second-guessing or suggesting a change; it just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is oft the case, that this is just one of those things one has to
accept.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.

I believe Chuck submitted some code that fixes this, which also helps
with MultiSearcher, where you have to have this constant score in order
to properly order hits from different Searchers, but I didn't dare to
touch that code without further studying, for which I didn't have time.

Otis


--- Doug Cutting [EMAIL PROTECTED] wrote:

 Chuck Williams wrote:
  I believe the biggest problem with Lucene's approach relative to
 the pure vector space model is that Lucene does not properly
 normalize.  The pure vector space model implements a cosine in the
 strictly positive sector of the coordinate space.  This is guaranteed
 intrinsically to be between 0 and 1, and produces scores that can be
 compared across distinct queries (i.e., 0.8 means something about
 the result quality independent of the query).
 
 I question whether such scores are more meaningful.  Yes, such scores
 
 would be guaranteed to be between zero and one, but would 0.8 really
 be 
 meaningful?  I don't think so.  Do you have pointers to research
 which 
 demonstrates this?  E.g., when such a scoring method is used, that 
 thresholding by score is useful across queries?
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Erik Hatcher
On Dec 15, 2004, at 3:14 PM, Mike Snare wrote:
[...]
In addition, why do we assume that a-1 is a typical product name but
a-b isn't?
I am in no way second-guessing or suggesting a change, It just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is oft the case, that this is just one of those things one has to
accept.
It is one of those things we have to accept... or in this case write 
our own analyzer.  An Analyzer is a very special and custom choice.  
StandardAnalyzer is a general purpose one, but quite insufficient in 
many cases.  Like QueryParser.  We're lucky to have these kitchen-sink 
pieces in Lucene to get us going quickly, but digging deeper we often 
need custom solutions.

I'm working on indexing the e-book of Lucene in Action.  I'll blog up 
the details of this in the near future as case-study material, but 
here's the short version...

I got the PDF file, ran pdftotext on it.  Many words are split across 
lines with a hyphen.  Often these pieces should be combined with the 
hyphen removed.  Sometimes, though, these words are to be split.  The 
scenario is different than yours, because I want the hyphens gone - 
though sometimes they are a separator and sometimes they should be 
removed.  It depends.  I wrote a custom analyzer with several custom 
filters in the pipeline... dashes are originally kept in the stream, 
and a later filter combines two tokens and looks it up in an exception 
list and either combines it or leaves it separate.  StandardAnalyzer 
would have wreaked havoc.
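In outline, the rejoining filter looks roughly like this (heavily simplified
and untested, with a made-up exception list -- not the code from the book
project; it assumes a tokenizer upstream that keeps the trailing hyphen on
the first piece of a word broken across a line):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HyphenJoinFilter extends TokenFilter {

	// words that should be rejoined with the hyphen removed (illustrative list)
	private static final Set JOIN_AS_ONE_WORD = new HashSet();
	static {
		JOIN_AS_ONE_WORD.add("tokenizer");   // e.g. "token-" + "izer" at a line break
	}

	private Token pending = null;  // buffers the second piece when the words stay separate

	public HyphenJoinFilter(TokenStream in) {
		super(in);
	}

	public Token next() throws IOException {
		if (pending != null) {
			Token t = pending;
			pending = null;
			return t;
		}
		Token first = input.next();
		if (first == null || !first.termText().endsWith("-")) {
			return first;  // nothing to rejoin
		}
		Token second = input.next();
		if (second == null) {
			return first;
		}
		String stem = first.termText().substring(0, first.termText().length() - 1);
		String joined = stem + second.termText();
		if (JOIN_AS_ONE_WORD.contains(joined)) {
			// the two pieces were really one word: emit it as a single token
			return new Token(joined, first.startOffset(), second.endOffset());
		}
		// genuinely separate words: emit the first piece (minus the hyphen), buffer the second
		pending = second;
		return new Token(stem, first.startOffset(), first.endOffset());
	}
}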

The results of my work will soon be available to all to poke at, but 
for now a screenshot is all I have public:

http://www.lucenebook.com
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
I'll try to address all the comments here.

The normalization I proposed a while back on lucene-dev is specified.
Its properties can be analyzed, so there is no reason to guess about
them.

Re. Hoss's example and analysis, yes, I believe it can be demonstrated
that the proposed normalization would make certain absolute statements
like x and y meaningful.  However, it is not a panacea -- there would be
some limitations in these statements.

To see what could be said meaningfully, it is necessary to recall a
couple detailed aspects of the proposal:
  1.  The normalization would not change the ranking order or the ratios
among scores in a single result set from what they are now.  Only two
things change:  the query normalization constant, and the ad hoc final
normalization in Hits is eliminated because the scores are intrinsically
between 0 and 1.  Another way to look at this is that the sole purpose
of the normalization is to set the score of the highest-scoring result.
Once this score is set, all the other scores are determined since the
ratios of their scores to that of the top-scoring result do not change
from today.  Put simply, Hoss's explanation is correct.
  2.  There are multiple ways to normalize and achieve property 1.  One
simple approach is to set the top score based on the boost-weighted
percentage of query terms it matches (assuming, for simplicity, the
query is an OR-type BooleanQuery).  So if all boosts are the same, the
top score is the percentage of query terms matched.  If there are
boosts, then these cause the terms to have a corresponding relative
importance in the determination of this percentage.

More complex normalization schemes would go further and allow the tf's
and/or idf's to play a role in the determination of the top score -- I
didn't specify details here and am not sure how good a thing that would
be to do.  So, for now, let's just consider the properties of the simple
boost-weighted-query-term percentage normalization.
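Concretely, the simple scheme amounts to nothing more than the following
arithmetic (illustrative sketch only -- the real change would live inside
the scorers, not in a helper class like this):

public class TopScoreNormalization {
	/**
	 * The top score becomes the boost-weighted fraction of query terms that
	 * the best-scoring document matches; every other score keeps its current
	 * ratio to this top score.
	 *
	 * @param boosts  boost of each query term
	 * @param matched whether the top-scoring document matched that term
	 */
	public static float topScore(float[] boosts, boolean[] matched) {
		float matchedWeight = 0f;
		float totalWeight = 0f;
		for (int i = 0; i < boosts.length; i++) {
			totalWeight += boosts[i];
			if (matched[i]) {
				matchedWeight += boosts[i];
			}
		}
		return totalWeight == 0f ? 0f : matchedWeight / totalWeight;
	}
}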

Hoss's example could be interpreted as single-term phrases "Doug
Cutting" and "Chris Hostetter", or as two-term BooleanQuery's.
Considering both of these cases illustrates the absolute-statement
properties and limitations of the proposed normalization.

If single-term PhraseQuery's, then the top score will always be 1.0
assuming the phrase matches (while the other results have arbitrary
fractional scores based on the tfidf ratios as today).  If the queries
are BooleanQuery's with no boosts, then the top score would be 1.0 or
0.5 depending on whether one or two terms were matched.  This is
meaningful.

In Lucene today, the top score is not meaningful.  It will always be 1.0
if the highest intrinsic score is >= 1.0.  I believe this could happen,
for example, in a two-term BooleanQuery that matches only one term (if
the tf on the matched document for that term is high enough).

So, to be concrete, a score of 1.0 with the proposed normalization
scheme would mean that all query terms are matched, while today a score
of 1.0 doesn't really tell you anything.  Certain absolute statements
can therefore be made with the new scheme.  This makes the
absolute-threshold monitored search application possible, along with the
segregating and filtering applications I've previously mentioned (call
out good results and filter out bad results by using absolute
thresholds).

These analyses are simplified by using only BooleanQuery's, but I
believe the properties carry over generally.

Doug also asked about research results.  I don't know of published
research on this topic, but I can again repeat an experience from
InQuira.  We found that end users benefited from a search experience
where good results were called out and bad results were downplayed or
filtered out.  And we managed to achieve this with absolute thresholding
through careful normalization (of a much more complex scoring
mechanism).  To get a better intuitive feel for this, think about how you
react to a search where all the results suck, but there is no visual
indication of this that is any different from a search that returns
great results.

Otis raised the patch I submitted for MultiSearcher.  This addresses a
related problem, in that the current MultiSearcher does not rank results
equivalently to a single unified index -- specifically it fails Daniel
Naber's test case.  However, this is just a simple bug whose fix doesn't
require the new normalization.  I submitted a patch to fix that bug,
along with a caveat that I'm not sure the patch is complete, or even
consistent with the intentions of the author of this mechanism.

I'm glad to see this topic is generating some interest, and apologize if
anything I've said comes across as overly abrasive.  I use and really
like Lucene.  I put a lot of focus on creating a great experience for
the end user, and so am perhaps more concerned about quality of results
and certain UI aspects than most other users.

Chuck

   -Original Message-
   From: Doug Cutting [mailto:[EMAIL 

Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-15 Thread Nader Henein
This is an OS file system error, not a Lucene issue (not for this board).
Google it for Gentoo specifically and you get a whole bunch of results, one
of which is this thread on the Gentoo Forums:
http://forums.gentoo.org/viewtopic.php?t=9620

Good Luck
Nader Henein
Karthik N S wrote:
Hi Guys
Could somebody tell me why I am getting this exception, please?
Sys Specifications
O/s Linux Gentoo
Appserver Apache Tomcat/4.1.24
Jdk build 1.4.2_03-b02
Lucene 1.4.1 ,2, 3
Note: This exception is displayed on every 2nd query after Tomcat is
started.
java.io.IOException: Stale NFS file handle
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:307)
   at
org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420)
   at
org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
   at
org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(Compou
ndFileReader.java:220)
   at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
   at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
   at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
   at
org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142)
   at
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
   at
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
   at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
   at
org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253)
   at
org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
   at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
   at
org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.
java:47)
   at org.apache.lucene.search.Query.weight(Query.java:86)
   at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
   at
org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:
251)


 WITH WARM REGARDS
 HAVE A NICE DAY
 [ N.S.KARTHIK]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: A question about scoring function in Lucene

2004-12-15 Thread Nhan Nguyen Dang
Thanks for your answer.
In the Lucene scoring function they use only norm_q,
but for one query norm_q is the same for all
documents, so norm_q actually does not affect the score.
But norm_d is different: each document has a different
norm_d, and it affects the score of document d for query q.
If you drop it, the score information is not correct
anymore, or it is not the vector space model anymore. Could
you explain it a little bit?

I think that it's expensive to compute in incremental
indexing, because when one document is added the idf of
each term changes. But dropping it is not a good choice.

What is the role of norm_d_t?
Nhan.

--- Chuck Williams [EMAIL PROTECTED] wrote:

 Nhan,
 
 Re.  your two differences:
 
 1 is not a difference.  Norm_d and Norm_q are both
 independent of t, so summing over t has no effect on
 them.  I.e., Norm_d * Norm_q is constant wrt the
 summation, so it doesn't matter if the sum is over
 just the numerator or over the entire fraction, the
 result is the same.
 
 2 is a difference.  Lucene uses Norm_q instead of
 Norm_d because Norm_d is too expensive to compute,
 especially in the presence of incremental indexing. 
 E.g., adding or deleting any document changes the
 idf's, so if Norm_d was used it would have to be
 recomputed for ALL documents.  This is not feasible.
 
 Another point you did not mention is that the idf
 term is squared (in both of your formulas).  Salton,
 the originator of the vector space model, dropped
 one idf factor from his formula as it improved
 results empirically.  More recent theoretical
 justifications of tf*idf provide intuitive
 explanations of why idf should only be included
 linearly.  tf is best thought of as the real vector
 entry, while idf is a weighting term on the
 components of the inner product.  E.g., see the
 excellent paper by Robertson, "Understanding inverse
 document frequency: on theoretical arguments for
 IDF", available here: 
 http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
 if you sign up for an eval.
 
 It's easy to correct for idf^2 by using a custom
 Similarity that takes a final square root.
 
 Chuck
 
-Original Message-
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:32 PM
To: Lucene Users List
Subject: Re: A question about scoring function
 in Lucene

Lucene uses the vector space model. To
 understand that:

-Read section 2.1 of Space optimizations for
 Total Ranking paper
(Linked
here
 http://lucene.sourceforge.net/publications.html)
-Read section 6 to 6.4 of
   

http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of
   

http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

 Hi all,
 Lucene score document based on the correlation
 between
 the query q and document t:
 (this is raw function, I don't pay attention
 to the
 boost_t, coord_q_d factor)

 score_d = sum_t( tf_q * idf_t / norm_q * tf_d
 * idf_t
 / norm_d_t)  (*)

 Could anybody explain it in detail ? Or are
 there any
 papers, documents about this function ?
 Because:

 I have also read the book: Modern Information
 Retrieval, author: Ricardo Baeza-Yates and
 Berthier
 Ribeiro-Neto, Addison Wesley (Hope you have
 read it
 too). In page 27, they also suggest a scoring
 funtion
 for vector model based on the correlation
 between
 query q and document d as follow (I use
 different
 symbol):

 score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )   (**)

 where weight_t_d = tf_d * idf_t
   weight_t_q = tf_q * idf_t
   norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
   norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

 (**) =>  score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / ( norm_d * norm_q )   (***)

 The two functions, (*) and (***), have 2
 differences:
 1. in (***), the sum_t is just for the
 numerator but
 in the (*), the sum_t is for everything. So,
 with
 norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
 calculated twice. Is this right? please
 explain.

 2. No factor that define norms of the
 document: norm_d
 in the function (*). Can you explain this.
 what is the
 role of factor norm_d_t ?

 One more question: could anybody give me
 documents,
 papers that explain this function in detail.
 so when I
 apply Lucene for my system, I can adapt the
 document,
 and the field so that I still receive the
 correct
 scoring information from Lucene .

 Best regard,
 Thanks every body,

 =
 Đặng Nhân

   

-
To unsubscribe, e-mail:
 [EMAIL 

C# Ports

2004-12-15 Thread Garrett Heaver
I was just wondering what tools (JLCA?) people are using to port Lucene to
c# as I'd be well interested in converting things like snowball stemmers,
wordnet etc.

 

Thanks

Garrett



RE: C# Ports

2004-12-15 Thread George Aroush
Hi Garrett,

If you are referring to dotLucene
(http://sourceforge.net/projects/dotlucene/) then I can tell you how -- not
too long ago I posted on this list how I ported 1.4 and 1.4.3 to C#, please
search the list for the answer -- you can't just use JLCA.

As for snowball, I have already started work on it.  The port is done,
but I have to test, etc. and I am too tied up right now with my work.
However, I plan to release it before end of this month, so if you can wait,
do wait, otherwise feel free to take the steps that I did to port Lucene to
C#.

Regards,

-- George Aroush

-Original Message-
From: Garrett Heaver [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 15, 2004 5:58 AM
To: [EMAIL PROTECTED]
Subject: C# Ports

I was just wondering what tools (JLCA?) people are using to port Lucene to
c# as I'd be well interested in converting things like snowball stemmers,
wordnet etc.

 

Thanks

Garrett



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring 
model to deviate from the pure vector space model.  However, including norm_d 
would cause other problems -- e.g., with short queries, as are typical in 
reality, the resulting scores with norm_d would all be extremely small.  You 
are also correct that since norm_q is invariant, it does not affect relevance 
ranking.  Norm_q is simply part of the normalization of final scores.  There 
are many different formulas for scoring and relevance ranking in IR.  All of 
these have some intuitive justification, but in the end can only be evaluated 
empirically.  There is no correct formula.

I believe the biggest problem with Lucene's approach relative to the pure 
vector space model is that Lucene does not properly normalize.  The pure vector 
space model implements a cosine in the strictly positive sector of the 
coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and 
produces scores that can be compared across distinct queries (i.e., 0.8 means 
something about the result quality independent of the query).

Lucene does not have this property.  Its formula produces scores of arbitrary 
magnitude depending on the query.  The results cannot be compared meaningfully 
across queries; i.e., 0.8 means nothing intrinsically.  To keep final scores 
between 0 and 1, Lucene introduces an ad hoc query-dependent final 
normalization in Hits:  viz., it divides all scores by the highest score if the 
highest score happens to be greater than 1.  This makes it impossible for an 
application to properly inform its users about the quality of the results, to 
cut off bad results, etc.  Applications may do that, but in fact what they are 
doing is random, not what they think they are doing.

I've proposed a fix for this -- there was a long thread on Lucene-dev.  It is 
possible to revise Lucene's scoring to keep its efficiency, keep its current 
per-query relevance ranking, and yet intrinsically normalize its scores so that 
they are meaningful across queries.  I posted a fairly detailed spec of how to 
do this in the Lucene-dev thread.  I'm hoping to have time to build it and 
submit it as a proposed update to Lucene, but it is a large effort that would 
involve changing just about every scoring class in Lucene.  I'm not sure it 
would be incorporated even if I did it as that would take considerable work 
from a developer.  There doesn't seem to be much concern about these various 
scoring and relevancy ranking issues among the general Lucene community.

Chuck

   -Original Message-
   From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 15, 2004 1:18 AM
   To: Lucene Users List
   Subject: RE: A question about scoring function in Lucene
   
   Thank for your answer,
   In Lucene scoring function, they use only norm_q,
   but for one query, norm_q is the same for all
   documents.
   So norm_q is actually not effect the score.
   But norm_d is different, each document has a different
   norm_d; it effect the score of document d for query q.
   If you drop it, the score information is not correct
   anymore or it not space vector model anymore.  Could
   you explain it a little bit.
   
   I think that it's expensive to computed in incremetal
   indexing because when one document is added, idf of
   each term changed. But drop it is not a good choice.
   
   What is the role of norm_d_t ?
   Nhan.
   
   --- Chuck Williams [EMAIL PROTECTED] wrote:
   
Nhan,
   
Re.  your two differences:
   
1 is not a difference.  Norm_d and Norm_q are both
independent of t, so summing over t has no effect on
them.  I.e., Norm_d * Norm_q is constant wrt the
summation, so it doesn't matter if the sum is over
just the numerator or over the entire fraction, the
result is the same.
   
2 is a difference.  Lucene uses Norm_q instead of
Norm_d because Norm_d is too expensive to compute,
especially in the presence of incremental indexing.
E.g., adding or deleting any document changes the
idf's, so if Norm_d was used it would have to be
recomputed for ALL documents.  This is not feasible.
   
Another point you did not mention is that the idf
term is squared (in both of your formulas).  Salton,
the originator of the vector space model, dropped
one idf factor from his formula as it improved
results empirically.  More recent theoretical
justifications of tf*idf provide intuitive
explanations of why idf should only be included
linearly.  tf is best thought of as the real vector
entry, while idf is a weighting term on the
components of the inner product.  E.g., seen the
excellent paper by Robertson, Understanding inverse
document frequency: on theoretical arguments for
IDF, available here:
http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
if you sign up for an eval.
   
It's easy to correct for idf^2 by 

Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
I am writing a tool that uses lucene, and I immediately ran into a
problem searching for words that contain internal hyphens (dashes).
After looking at the StandardTokenizer, I saw that it was because
there is no rule that will match <ALPHA> <P> <ALPHA> or <ALPHANUM> <P>
<ALPHANUM>.  Based on what I can tell from the source, every other
term in a word containing any of the following (.,/-_) must contain at
least one digit.

I was wondering if someone could shed some light on why it was deemed
necessary to prevent indexing a word like 'word-with-hyphen' without
first splitting it into its constituent parts.  The only reason I can
think of (and the only one I've found) is to handle hyphenated words
at line breaks, although my first thought would be that this would be
undesired behavior, since a word that was broken due to a line break
should actually be reconstructed, and not split.

In my case, the words are keywords that must remain as is, searchable
with the hyphen in place.  It was easy enough to modify the tokenizer
to do what I need, so I'm not really asking for help there.  I'm
really just curious as to why it is that a-1 is considered a single
token, but a-b is split.

Anyone care to elaborate?

Thanks,
-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote:
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I quick
and dirty wrote a class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be useable in this way. I also think I
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.
What do you think?
I don't have any deep thoughts, just a few questions/ideas...
[1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are 
your classes, right? They are not in the mail, but presumably the source 
isn't needed.

[2] Does the problem boil down to this line and the memory usage?
double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
Thus using a sparse matrix would be a win, and so would using floats 
instead of doubles?

[3] Prob minor, but in getTFIDFMatrix() you might be able to ignore stop 
words, as you do so later in getSimilarity().

[4] You can also consider using Colt, or possibly even JUNG:
http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html
http://jung.sourceforge.net/doc/api/index.html
[5]
Related to #2, can you precalc the matrix and store it on disk, or is 
your index too dynamic?

[6] Also, in similar kinds of calculations I've seen code that filters 
out low-frequency terms, e.g. ignore all terms that don't occur in at 
least 5 docs.
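For [6], that filter is just a guard inside the term loop, roughly (untested;
the threshold of 5 and the names are only examples):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class LowFreqFilterExample {
	/** Fills tf values only for terms that occur in at least minDocFreq docs. */
	public static void fillTermFrequencies(IndexReader reader, double[][] matrix,
			int minDocFreq) throws IOException {
		TermEnum allTerms = reader.terms();
		for (int i = 0; allTerms.next(); i++) {
			if (allTerms.docFreq() < minDocFreq) {
				continue;  // leave row i all zeros for rare terms
			}
			Term term = allTerms.term();
			TermDocs td = reader.termDocs(term);
			while (td.next()) {
				matrix[i][td.doc()] = td.freq();
			}
			td.close();
		}
		allTerms.close();
	}
}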

-- Dave

Best,
Christoph


/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import Jama.Matrix;
/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

private File indexDir = null;
private File dataDir = null;
private String target = "";
private String query = "";
private int targetDocumentNumber = -1;
private final String ME = this.getClass().getName();
private int fileCounter = 0;

public TFIDF_Lucene( String indexDir, String dataDir, String target, 
String query ) {
this.indexDir = new File(indexDir);
this.dataDir = new File(dataDir);
this.target = target;
this.query = query;
}

public String getName() {
return "TFIDF_Lucene_Similarity_Measure";
}

private void makeIndex() {
try {
IndexWriter writer = new IndexWriter(indexDir, new 
SnowballAnalyzer( "English", StopAnalyzer.ENGLISH_STOP_WORDS ), false);
indexDirectory(writer, dataDir);
writer.optimize();
writer.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}

private void indexDirectory(IndexWriter writer, File dir) {
File[] files = dir.listFiles();
for (int i=0; i < files.length; i++) {
File f = files[i];
if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
} else if (f.getName().endsWith(".txt")) {
indexFile(writer, f);
}
}
}

private void indexFile(IndexWriter writer, File f) {
try {
System.out.println( "Indexing " + f.getName() + ", " + 
(fileCounter++) );
String name = f.getCanonicalPath();
//System.out.println(name);
Document doc = new Document();
doc.add( Field.Text( "contents", new FileReader(f), 
true ) );
writer.addDocument( doc );

 

Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.
Right, but the question is, would a single score threshold be effective 
for all queries, or would one need a separate score threshold for each 
query?  My hunch is that the latter is better, regardless of the scoring 
algorithm.

Also, just because Lucene's default scoring does not guarantee scores 
between zero and one does not necessarily mean that these scores are 
less meaningful.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
Doug Cutting and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to Doug Cutting
If I then do a search for Chris Hostetter and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1
...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally good)
However, I *cannot* say either of the following:
  x) document #9 is as relevant for Chris Hostetter as document #1 is
 relevant to Doug Cutting
  y) document #5 is equally relevant to both Chris Hostetter and
 Doug Cutting
That's right.  Thanks for the nice description of the issue.
I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x & y.
And I am not convinced that, with the changes Chuck describes, one can 
be any more confident of x and y.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 21:14, Mike Snare wrote:

 Also, the phrase query
 would place the same value on a doc that simply had the two words as a
 doc that had the hyphenated version, wouldn't it? This seems odd.

Not if these words are spelling variations of the same concept, which 
doesn't seem unlikely.

 In addition, why do we assume that a-1 is a typical product name but
 a-b isn't?

Maybe for a-b, but what about English words like half-baked?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


File locking using java.nio.channels.FileLock

2004-12-15 Thread John Wang
Hi:

  When is Lucene planning on moving to Java 1.4+?

   I see there are some problems caused by the current lock file
implementation, e.g. Bug# 32171. The problems would be easily fixed by
using the java.nio.channels.FileLock object.
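For illustration, the kind of thing I mean (untested sketch, not a patch; the
lock file name and error handling here are only placeholders):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class NioWriteLock {
	private RandomAccessFile raf;
	private FileLock lock;

	/** Tries to acquire an exclusive lock on a lock file in the index directory. */
	public boolean obtain(File indexDir) throws IOException {
		raf = new RandomAccessFile(new File(indexDir, "write.lock"), "rw");
		lock = raf.getChannel().tryLock();  // returns null if another process holds it
		return lock != null;
	}

	public void release() throws IOException {
		if (lock != null) {
			lock.release();
		}
		if (raf != null) {
			raf.close();
		}
	}
}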

Thanks

-John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]