Re: TFIDF Implementation

2004-12-15 Thread Christoph Kiefer
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I quick
and dirty wrote a class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be useable in this way. I also think I
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.

What do you think?

Best,
Christoph

-- 
Christoph Kiefer

Department of Informatics, University of Zurich

Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import Jama.Matrix;

/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {
	
	private File indexDir = null;
	private File dataDir = null;
	private String target = ;
	private String query = ;
	private int targetDocumentNumber = -1;
	private final String ME = this.getClass().getName();
	private int fileCounter = 0;
	
	public TFIDF_Lucene( String indexDir, String dataDir, String target, String query ) {
		this.indexDir = new File(indexDir);
		this.dataDir = new File(dataDir);
		this.target = target;
		this.query = query;
	}
	
	public String getName() {
		return TFIDF_Lucene_Similarity_Measure;
	}
	
	private void makeIndex() {
		try {
			IndexWriter writer = new IndexWriter(indexDir, new SnowballAnalyzer( English, StopAnalyzer.ENGLISH_STOP_WORDS ), false);
			indexDirectory(writer, dataDir);
			writer.optimize();
			writer.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
	
	private void indexDirectory(IndexWriter writer, File dir) {
		File[] files = dir.listFiles();
		for (int i=0; i  files.length; i++) {
			File f = files[i];
			if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
			} else if (f.getName().endsWith(.txt)) {
indexFile(writer, f);
			}
		}
	}
	
	private void indexFile(IndexWriter writer, File f) {
		try {
			System.out.println( Indexing  + f.getName() + ,  + (fileCounter++) );
			String name = f.getCanonicalPath();
			//System.out.println(name);
			Document doc = new Document();
			doc.add( Field.Text( contents, new FileReader(f), true ) );
			writer.addDocument( doc );
			
			if ( name.matches( dataDir + / + target + .txt ) ) {
targetDocumentNumber = writer.docCount();
			}
			
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
	
	public Matrix getTFIDFMatrix(File indexDir) throws IOException {
		Directory fsDir = FSDirectory.getDirectory( indexDir, false );
		IndexReader reader = IndexReader.open( fsDir );
		
		int numberOfTerms = 0;
		int numberOfDocuments = reader.numDocs();
		
		TermEnum allTerms = reader.terms();
		for ( ; allTerms.next(); ) {
			allTerms.term();
			numberOfTerms++;
		}
		
		System.out.println( Total number of terms in index is  + numberOfTerms );
		System.out.println( Total number of documents in index is  + numberOfDocuments );
		
		double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
		
		for ( int i = 0; i  numberOfTerms; i++ ) {
			for ( int j = 0; j  numberOfDocuments; j++ ) {
TFIDFMatrix[i][j] = 0.0;
			}
		}
		
		allTerms = reader.terms();
		for ( int i = 0 ; allTerms.next(); i++ ) {
			
			Term term = allTerms.term();
			TermDocs td = reader.termDocs(term);
			for ( ; td.next(); ) {
TFIDFMatrix[i][td.doc()] = td.freq();
			}
			
		}
		
		allTerms = reader.terms();
		for ( int i = 0 ; allTerms.next(); i++ ) {
			for ( int j = 0; j  numberOfDocuments; j++ ) {
double tf = TFIDFMatrix[i][j];
double docFreq = (double)allTerms.docFreq();
double idf = ( Math.log( (double)numberOfDocuments / docFreq ) ) / 2.30258509299405;
//System.out.println( Term:  + i +  Document  + j +  TF/DocFreq/IDF:  + tf +   + docFreq +   + idf);
TFIDFMatrix[i][j] = tf * idf;
			}
		}
		
		reader.close();
		return 

Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote:
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I quick
and dirty wrote a class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be useable in this way. I also think I
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.
What do you think?
I don't have any deep thoughts, just a few questions/ideas...
[1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are 
your classes right, which are not in the mail, but presumably the source 
isn't needed.

[2] Does the problem boil down to this line and the memory usage?
double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
Thus using a sparse matrix would be a win, and so would using floats 
instead of doubles?

[3] Prob minor, but in getTFIDFMatrix() you might be able to ignore stop 
words, as you do so later in getSimilarity().

[4] You can also consider using Colt possibly even JUNG:
http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html
http://jung.sourceforge.net/doc/api/index.html
[5]
Related to #2, can you precalc the matrix and store it on disk, or is 
your index too dynamic?

[6] Also, in similar kinds of calculations I've seen code that filters 
out low frequency terms e.g. ignore all terms that don't occur in at 
least 5 docs.

-- Dave

Best,
Christoph


/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import Jama.Matrix;
/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

private File indexDir = null;
private File dataDir = null;
private String target = ;
private String query = ;
private int targetDocumentNumber = -1;
private final String ME = this.getClass().getName();
private int fileCounter = 0;

public TFIDF_Lucene( String indexDir, String dataDir, String target, 
String query ) {
this.indexDir = new File(indexDir);
this.dataDir = new File(dataDir);
this.target = target;
this.query = query;
}

public String getName() {
return TFIDF_Lucene_Similarity_Measure;
}

private void makeIndex() {
try {
IndexWriter writer = new IndexWriter(indexDir, new 
SnowballAnalyzer( English, StopAnalyzer.ENGLISH_STOP_WORDS ), false);
indexDirectory(writer, dataDir);
writer.optimize();
writer.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}

private void indexDirectory(IndexWriter writer, File dir) {
File[] files = dir.listFiles();
for (int i=0; i  files.length; i++) {
File f = files[i];
if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
} else if (f.getName().endsWith(.txt)) {
indexFile(writer, f);
}
}
}

private void indexFile(IndexWriter writer, File f) {
try {
System.out.println( Indexing  + f.getName() + ,  + 
(fileCounter++) );
String name = f.getCanonicalPath();
//System.out.println(name);
Document doc = new Document();
doc.add( Field.Text( contents, new FileReader(f), 
true ) );
writer.addDocument( doc );

 

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
Christoph,
I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox 
Ot oh, sorry, I'll try to get this checked in soonish. For me it's 
always one thing to do a prelim version of a piece of code, but another 
matter to get it correctly packasged.

so I've attached it here. Just repackage and test.

An alternate approach to find similar docs is to use all (possibly 
unique) tokens in the  source doc to form a large query. This is code I use:

'srch' is the entire untokenized text of the source doc
'a' is the analyzer you want to use
'field' is the field you want to search on e.g. contents or body
'stop' is an opt set of stop words to ignore
It returns a query, which you then use to search for similar docs, and 
then in the return result you need to make sure you ignore the source 
doc, which will prob come back 1st. You can use stemming, synonyms, or 
fuzzy expansion for each term too.

public static Query formSimilarQuery( 
String srch,			Analyzer a,	String field,	Set stop)
throws org.apache.lucene.queryParser.ParseException, IOException
{	
	TokenStream ts = a.tokenStream( foo, new StringReader( srch));
	org.apache.lucene.analysis.Token t;
	BooleanQuery tmp = new BooleanQuery();
	Set already = new HashSet();
	while ( (t = ts.next()) != null)
	{
		String word = t.termText();
		if ( stop != null 
			 stop.contains( word)) continue;
		if ( ! already.add( word)) continue;
		TermQuery tq = new TermQuery( new Term( field, word));
		tmp.add( tq, false, false);
	}
	return tmp;

}

Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Does anybody has done this before 
or could point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Otis Gospodnetic wrote:
You can also see 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2item=source
Well done, uses a term vector, instead of reparsing the orig doc, to 
form the similarity query. Also I like the way you exclude the source 
doc in the query, I didn't think of doing that in my code.

I don't trust calling vector.size() and vector.getTerms() within the 
loop but I haven't looked at the code to see if it calculates  the 
results each time or caches them...
Otis
--- Bruce Ritchie [EMAIL PROTECTED] wrote:

Christoph,
I'm not entirely certain if this is what you want, but a while back
David Spencer did code up a 'More Like This' class which can be used
for generating similarities between documents. I can't seem to find
this class in the sandbox so I've attached it here. Just repackage
and test.
Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed
index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Does anybody has done this before 
or could point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: TFIDF Implementation

2004-12-14 Thread Otis Gospodnetic
You can also see 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2item=source

Otis

--- Bruce Ritchie [EMAIL PROTECTED] wrote:

 Christoph,
 
 I'm not entirely certain if this is what you want, but a while back
 David Spencer did code up a 'More Like This' class which can be used
 for generating similarities between documents. I can't seem to find
 this class in the sandbox so I've attached it here. Just repackage
 and test.
 
 
 Regards,
 
 Bruce Ritchie
 http://www.jivesoftware.com/   
 
  -Original Message-
  From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
  Sent: December 14, 2004 11:45 AM
  To: Lucene Users List
  Subject: TFIDF Implementation
  
  Hi,
  My current task/problem is the following: I need to implement 
  TFIDF document term ranking using Jakarta Lucene to compute a 
  similarity rank between arbitrary documents in the constructed
 index.
  I saw from the API that there are similar functions already 
  implemented in the class Similarity and DefaultSimilarity but 
  I don't know exactly how to use them. At the time my index 
  has about 25000 (small) documents and there are about 75000 
  terms stored in total.
  Now, my question is simple. Does anybody has done this before 
  or could point me to another location for help?
  
  Thanks for any help in advance.
  Christoph 
  
  --
  Christoph Kiefer
  
  Department of Informatics, University of Zurich
  
  Office: Uni Irchel 27-K-32
  Phone:  +41 (0) 44 / 635 67 26
  Email:  [EMAIL PROTECTED]
  Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
  
  
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  
 
 
-
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
 
  You can also see 'Books like this' example from here 
  
 https://secure.manning.com/catalog/view.php?book=hatcher2item=source
 
 Well done, uses a term vector, instead of reparsing the orig 
 doc, to form the similarity query. Also I like the way you 
 exclude the source doc in the query, I didn't think of doing 
 that in my code.

I agree, it's a good way to exclude the source doc.
 
 I don't trust calling vector.size() and vector.getTerms() 
 within the loop but I haven't looked at the code to see if it 
 calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 


Regards,

Bruce Ritchie

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
  From the code I looked at, those calls don't recalculate on 
 every call. 
 
 I was referring to this fragment below from BooksLikeThis.docsLike(), 
 and was mentioning it as the javadoc 
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in
 dex/TermFreqVector.html 
 does not say that the values returned by size() and getTerms() are 
 cached, and while the impl may cache them (haven't checked) it's not 
 guarenteed, thus it's safer to put the size() and getTerms() call 
 outside the loop.
 
   for (int j = 0; j  vector.size(); j++) {
TermQuery tq = new TermQuery(
new Term(subject, vector.getTerms()[j]));

I agree on your overall point that it's probably best to put those calls 
outside of the loop, I was just saying that I did look at the implementation 
and the calls do not recalculate anything. I'm sorry I didn't explain myself 
clearly enough.


Regards,

Bruce Ritchie

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
From the code I looked at, those calls don't recalculate on 
every call. 

I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in
dex/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guarenteed, thus it's safer to put the size() and getTerms() call 
outside the loop.

 for (int j = 0; j  vector.size(); j++) {
  TermQuery tq = new TermQuery(
  new Term(subject, vector.getTerms()[j]));

I agree on your overall point that it's probably best to put those calls outside of the loop, I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.
Oh oh oh, sorry, 10-4, no prob.

Regards,
Bruce Ritchie
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
 

You can also see 'Books like this' example from here 

https://secure.manning.com/catalog/view.php?book=hatcher2item=source
Well done, uses a term vector, instead of reparsing the orig 
doc, to form the similarity query. Also I like the way you 
exclude the source doc in the query, I didn't think of doing 
that in my code.

I agree, it's a good way to exclude the source doc.
 

I don't trust calling vector.size() and vector.getTerms() 
within the loop but I haven't looked at the code to see if it 
calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 
I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guarenteed, thus it's safer to put the size() and getTerms() call 
outside the loop.

 for (int j = 0; j  vector.size(); j++) {
  TermQuery tq = new TermQuery(
  new Term(subject, vector.getTerms()[j]));

Regards,
Bruce Ritchie
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
Christoph,

I'm not entirely certain if this is what you want, but a while back David 
Spencer did code up a 'More Like This' class which can be used for generating 
similarities between documents. I can't seem to find this class in the sandbox 
so I've attached it here. Just repackage and test.


Regards,

Bruce Ritchie
http://www.jivesoftware.com/   

 -Original Message-
 From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
 Sent: December 14, 2004 11:45 AM
 To: Lucene Users List
 Subject: TFIDF Implementation
 
 Hi,
 My current task/problem is the following: I need to implement 
 TFIDF document term ranking using Jakarta Lucene to compute a 
 similarity rank between arbitrary documents in the constructed index.
 I saw from the API that there are similar functions already 
 implemented in the class Similarity and DefaultSimilarity but 
 I don't know exactly how to use them. At the time my index 
 has about 25000 (small) documents and there are about 75000 
 terms stored in total.
 Now, my question is simple. Does anybody has done this before 
 or could point me to another location for help?
 
 Thanks for any help in advance.
 Christoph 
 
 --
 Christoph Kiefer
 
 Department of Informatics, University of Zurich
 
 Office: Uni Irchel 27-K-32
 Phone:  +41 (0) 44 / 635 67 26
 Email:  [EMAIL PROTECTED]
 Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

TFIDF Implementation

2004-12-14 Thread Christoph Kiefer
Hi,
My current task/problem is the following: I need to implement TFIDF
document term ranking using Jakarta Lucene to compute a similarity rank
between arbitrary documents in the constructed index.
I saw from the API that there are similar functions already implemented
in the class Similarity and DefaultSimilarity but I don't know exactly
how to use them. At the time my index has about 25000 (small) documents
and there are about 75000 terms stored in total.
Now, my question is simple. Does anybody has done this before or could
point me to another location for help?

Thanks for any help in advance.
Christoph 

-- 
Christoph Kiefer

Department of Informatics, University of Zurich

Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]