Re: TFIDF Implementation
David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For illustration I quick and dirty wrote a class to perform that task. It uses the Jama.Matrix class to represent the similarity matrix at the moment. For show and tell I attached it to this email. Unfortunately it doesn't perform very well. My index stores about 25000 docs with a total of 75000 terms. The similarity matrix is very sparse but nevertheless needs about 1'875'000'000 entries!!! I think this current implementation will not be useable in this way. I also think I switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason. What do you think? Best, Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html /* * Created on Dec 14, 2004 */ package ch.unizh.ifi.ddis.simpack.measure.featurevectors; import java.io.File; import java.io.FileReader; import java.io.IOException; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.StopAnalyzer; import org.apache.lucene.analysis.snowball.SnowballAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermDocs; import org.apache.lucene.index.TermEnum; import org.apache.lucene.store.Directory; import org.apache.lucene.store.FSDirectory; import Jama.Matrix; /** * @author Christoph Kiefer */ public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure { private File indexDir = null; private File dataDir = null; private String target = ; private String query = ; private int targetDocumentNumber = -1; private final String ME = this.getClass().getName(); private int fileCounter = 0; public TFIDF_Lucene( String indexDir, String dataDir, String target, String query ) { this.indexDir = new File(indexDir); this.dataDir = new File(dataDir); this.target = target; this.query = query; } public String getName() { return TFIDF_Lucene_Similarity_Measure; } private void makeIndex() { try { IndexWriter writer = new IndexWriter(indexDir, new SnowballAnalyzer( English, StopAnalyzer.ENGLISH_STOP_WORDS ), false); indexDirectory(writer, dataDir); writer.optimize(); writer.close(); } catch (Exception ex) { ex.printStackTrace(); } } private void indexDirectory(IndexWriter writer, File dir) { File[] files = dir.listFiles(); for (int i=0; i files.length; i++) { File f = files[i]; if (f.isDirectory()) { indexDirectory(writer, f); // recurse } else if (f.getName().endsWith(.txt)) { indexFile(writer, f); } } } private void indexFile(IndexWriter writer, File f) { try { System.out.println( Indexing + f.getName() + , + (fileCounter++) ); String name = f.getCanonicalPath(); //System.out.println(name); Document doc = new Document(); doc.add( Field.Text( contents, new FileReader(f), true ) ); writer.addDocument( doc ); if ( name.matches( dataDir + / + target + .txt ) ) { targetDocumentNumber = writer.docCount(); } } catch (Exception ex) { ex.printStackTrace(); } } public Matrix getTFIDFMatrix(File indexDir) throws IOException { Directory fsDir = FSDirectory.getDirectory( indexDir, false ); IndexReader reader = IndexReader.open( fsDir ); int numberOfTerms = 0; int numberOfDocuments = reader.numDocs(); TermEnum allTerms = reader.terms(); for ( ; allTerms.next(); ) { allTerms.term(); numberOfTerms++; } System.out.println( Total number of terms in index is + numberOfTerms ); System.out.println( Total number of documents in index is + numberOfDocuments ); double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments]; for ( int i = 0; i numberOfTerms; i++ ) { for ( int j = 0; j numberOfDocuments; j++ ) { TFIDFMatrix[i][j] = 0.0; } } allTerms = reader.terms(); for ( int i = 0 ; allTerms.next(); i++ ) { Term term = allTerms.term(); TermDocs td = reader.termDocs(term); for ( ; td.next(); ) { TFIDFMatrix[i][td.doc()] = td.freq(); } } allTerms = reader.terms(); for ( int i = 0 ; allTerms.next(); i++ ) { for ( int j = 0; j numberOfDocuments; j++ ) { double tf = TFIDFMatrix[i][j]; double docFreq = (double)allTerms.docFreq(); double idf = ( Math.log( (double)numberOfDocuments / docFreq ) ) / 2.30258509299405; //System.out.println( Term: + i + Document + j + TF/DocFreq/IDF: + tf + + docFreq + + idf); TFIDFMatrix[i][j] = tf * idf; } } reader.close(); return
Re: TFIDF Implementation
Christoph Kiefer wrote: David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For illustration I quick and dirty wrote a class to perform that task. It uses the Jama.Matrix class to represent the similarity matrix at the moment. For show and tell I attached it to this email. Unfortunately it doesn't perform very well. My index stores about 25000 docs with a total of 75000 terms. The similarity matrix is very sparse but nevertheless needs about 1'875'000'000 entries!!! I think this current implementation will not be useable in this way. I also think I switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason. What do you think? I don't have any deep thoughts, just a few questions/ideas... [1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are your classes right, which are not in the mail, but presumably the source isn't needed. [2] Does the problem boil down to this line and the memory usage? double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments]; Thus using a sparse matrix would be a win, and so would using floats instead of doubles? [3] Prob minor, but in getTFIDFMatrix() you might be able to ignore stop words, as you do so later in getSimilarity(). [4] You can also consider using Colt possibly even JUNG: http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html http://jung.sourceforge.net/doc/api/index.html [5] Related to #2, can you precalc the matrix and store it on disk, or is your index too dynamic? [6] Also, in similar kinds of calculations I've seen code that filters out low frequency terms e.g. ignore all terms that don't occur in at least 5 docs. -- Dave Best, Christoph /* * Created on Dec 14, 2004 */ package ch.unizh.ifi.ddis.simpack.measure.featurevectors; import java.io.File; import java.io.FileReader; import java.io.IOException; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.StopAnalyzer; import org.apache.lucene.analysis.snowball.SnowballAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermDocs; import org.apache.lucene.index.TermEnum; import org.apache.lucene.store.Directory; import org.apache.lucene.store.FSDirectory; import Jama.Matrix; /** * @author Christoph Kiefer */ public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure { private File indexDir = null; private File dataDir = null; private String target = ; private String query = ; private int targetDocumentNumber = -1; private final String ME = this.getClass().getName(); private int fileCounter = 0; public TFIDF_Lucene( String indexDir, String dataDir, String target, String query ) { this.indexDir = new File(indexDir); this.dataDir = new File(dataDir); this.target = target; this.query = query; } public String getName() { return TFIDF_Lucene_Similarity_Measure; } private void makeIndex() { try { IndexWriter writer = new IndexWriter(indexDir, new SnowballAnalyzer( English, StopAnalyzer.ENGLISH_STOP_WORDS ), false); indexDirectory(writer, dataDir); writer.optimize(); writer.close(); } catch (Exception ex) { ex.printStackTrace(); } } private void indexDirectory(IndexWriter writer, File dir) { File[] files = dir.listFiles(); for (int i=0; i files.length; i++) { File f = files[i]; if (f.isDirectory()) { indexDirectory(writer, f); // recurse } else if (f.getName().endsWith(.txt)) { indexFile(writer, f); } } } private void indexFile(IndexWriter writer, File f) { try { System.out.println( Indexing + f.getName() + , + (fileCounter++) ); String name = f.getCanonicalPath(); //System.out.println(name); Document doc = new Document(); doc.add( Field.Text( contents, new FileReader(f), true ) ); writer.addDocument( doc );
Re: TFIDF Implementation
Bruce Ritchie wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox Ot oh, sorry, I'll try to get this checked in soonish. For me it's always one thing to do a prelim version of a piece of code, but another matter to get it correctly packasged. so I've attached it here. Just repackage and test. An alternate approach to find similar docs is to use all (possibly unique) tokens in the source doc to form a large query. This is code I use: 'srch' is the entire untokenized text of the source doc 'a' is the analyzer you want to use 'field' is the field you want to search on e.g. contents or body 'stop' is an opt set of stop words to ignore It returns a query, which you then use to search for similar docs, and then in the return result you need to make sure you ignore the source doc, which will prob come back 1st. You can use stemming, synonyms, or fuzzy expansion for each term too. public static Query formSimilarQuery( String srch, Analyzer a, String field, Set stop) throws org.apache.lucene.queryParser.ParseException, IOException { TokenStream ts = a.tokenStream( foo, new StringReader( srch)); org.apache.lucene.analysis.Token t; BooleanQuery tmp = new BooleanQuery(); Set already = new HashSet(); while ( (t = ts.next()) != null) { String word = t.termText(); if ( stop != null stop.contains( word)) continue; if ( ! already.add( word)) continue; TermQuery tq = new TermQuery( new Term( field, word)); tmp.add( tq, false, false); } return tmp; } Regards, Bruce Ritchie http://www.jivesoftware.com/ -Original Message- From: Christoph Kiefer [mailto:[EMAIL PROTECTED] Sent: December 14, 2004 11:45 AM To: Lucene Users List Subject: TFIDF Implementation Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and DefaultSimilarity but I don't know exactly how to use them. At the time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Does anybody has done this before or could point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: TFIDF Implementation
Otis Gospodnetic wrote: You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I didn't think of doing that in my code. I don't trust calling vector.size() and vector.getTerms() within the loop but I haven't looked at the code to see if it calculates the results each time or caches them... Otis --- Bruce Ritchie [EMAIL PROTECTED] wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test. Regards, Bruce Ritchie http://www.jivesoftware.com/ -Original Message- From: Christoph Kiefer [mailto:[EMAIL PROTECTED] Sent: December 14, 2004 11:45 AM To: Lucene Users List Subject: TFIDF Implementation Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and DefaultSimilarity but I don't know exactly how to use them. At the time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Does anybody has done this before or could point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Otis --- Bruce Ritchie [EMAIL PROTECTED] wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test. Regards, Bruce Ritchie http://www.jivesoftware.com/ -Original Message- From: Christoph Kiefer [mailto:[EMAIL PROTECTED] Sent: December 14, 2004 11:45 AM To: Lucene Users List Subject: TFIDF Implementation Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and DefaultSimilarity but I don't know exactly how to use them. At the time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Does anybody has done this before or could point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I didn't think of doing that in my code. I agree, it's a good way to exclude the source doc. I don't trust calling vector.size() and vector.getTerms() within the loop but I haven't looked at the code to see if it calculates the results each time or caches them... From the code I looked at, those calls don't recalculate on every call. Regards, Bruce Ritchie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in dex/TermFreqVector.html does not say that the values returned by size() and getTerms() are cached, and while the impl may cache them (haven't checked) it's not guarenteed, thus it's safer to put the size() and getTerms() call outside the loop. for (int j = 0; j vector.size(); j++) { TermQuery tq = new TermQuery( new Term(subject, vector.getTerms()[j])); I agree on your overall point that it's probably best to put those calls outside of the loop, I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough. Regards, Bruce Ritchie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: TFIDF Implementation
Bruce Ritchie wrote: From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in dex/TermFreqVector.html does not say that the values returned by size() and getTerms() are cached, and while the impl may cache them (haven't checked) it's not guarenteed, thus it's safer to put the size() and getTerms() call outside the loop. for (int j = 0; j vector.size(); j++) { TermQuery tq = new TermQuery( new Term(subject, vector.getTerms()[j])); I agree on your overall point that it's probably best to put those calls outside of the loop, I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough. Oh oh oh, sorry, 10-4, no prob. Regards, Bruce Ritchie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: TFIDF Implementation
Bruce Ritchie wrote: You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I didn't think of doing that in my code. I agree, it's a good way to exclude the source doc. I don't trust calling vector.size() and vector.getTerms() within the loop but I haven't looked at the code to see if it calculates the results each time or caches them... From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html does not say that the values returned by size() and getTerms() are cached, and while the impl may cache them (haven't checked) it's not guarenteed, thus it's safer to put the size() and getTerms() call outside the loop. for (int j = 0; j vector.size(); j++) { TermQuery tq = new TermQuery( new Term(subject, vector.getTerms()[j])); Regards, Bruce Ritchie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test. Regards, Bruce Ritchie http://www.jivesoftware.com/ -Original Message- From: Christoph Kiefer [mailto:[EMAIL PROTECTED] Sent: December 14, 2004 11:45 AM To: Lucene Users List Subject: TFIDF Implementation Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and DefaultSimilarity but I don't know exactly how to use them. At the time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Does anybody has done this before or could point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
TFIDF Implementation
Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and DefaultSimilarity but I don't know exactly how to use them. At the time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Does anybody has done this before or could point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]