TermFrequency for a String
IndexReader.getTermFreqVectors(2)[0].getTermFrequencies()[5]; In the above example, Lucene gives me the term frequency of the 5th term (say "planet") in the term frequency vector of corpus document 2. But I need to get the term frequency for a specified term using its string value, e.g. the term frequency of the term "planet" (i.e. specified by its string value "planet", not by its position 5). Is there any way to do this? I highly appreciate your kind reply!
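For what it's worth, in Lucene 2.9 the TermFreqVector interface itself offers indexOf(String term), so getTermFrequencies()[tfv.indexOf("planet")] should do this when the term is present. The lookup can also be sketched in plain Java over the parallel arrays that getTerms() (sorted lexicographically) and getTermFrequencies() return; the class name and sample values below are illustrative, not from the original post:

```java
import java.util.Arrays;

// Plain-Java sketch of looking up a term's frequency by its string value,
// mirroring the parallel arrays returned by TermFreqVector.getTerms()
// (sorted lexicographically) and getTermFrequencies().
public class TermFreqLookup {

    // Returns the frequency of `term`, or 0 if the term is absent.
    public static int freqOf(String[] terms, int[] freqs, String term) {
        int pos = Arrays.binarySearch(terms, term); // terms are sorted
        return pos >= 0 ? freqs[pos] : 0;
    }

    public static void main(String[] args) {
        String[] terms = {"earth", "planet", "star"};
        int[] freqs = {3, 7, 2};
        System.out.println(freqOf(terms, freqs, "planet")); // 7
    }
}
```

Binary search works here only because getTerms() returns the terms sorted; for an unsorted copy a linear scan would be needed.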
Total of term frequencies
Hi, Is there any way to get the total count of terms in the term frequency vector (tfv)? I need to calculate the normalized term frequency of each term in my tfv. I know how to obtain the length of the tfv, but that doesn't work since I need to count duplicate occurrences as well. Highly appreciate your kind response.
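A plain-Java sketch of the calculation being asked about: sum the frequency array to get the field's total token count (duplicates included, unlike getTerms().length), then divide. The class name and sample values are illustrative:

```java
public class NormalizedTf {
    // Total token count of the field = sum over the tf vector's frequencies
    // (counts duplicate occurrences, unlike getTerms().length).
    public static int totalCount(int[] freqs) {
        int total = 0;
        for (int f : freqs) total += f;
        return total;
    }

    // Normalized term frequency of one term in the field.
    public static double normalizedTf(int freq, int[] freqs) {
        return (double) freq / totalCount(freqs);
    }

    public static void main(String[] args) {
        int[] freqs = {3, 7, 2};                    // from getTermFrequencies()
        System.out.println(totalCount(freqs));      // 12
        System.out.println(normalizedTf(7, freqs)); // ≈ 0.583
    }
}
```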
Only term frequencies
Hi, I have a document collection with hundreds of documents. I need to know the term frequency of a given query term in each document. I know that 'hit.score' gives me the Lucene score for each document (and it includes term frequency as well). But I need only the term frequencies in each document. How can I do this? I highly appreciate your kind response.
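One way to picture the data being asked for: per-document (docId, frequency) pairs, which in Lucene 2.x would come from IndexReader.termDocs(new Term(field, term)) via its doc() and freq() methods, with no scoring involved. The plain-Java mock below builds the same mapping from raw text just to make the shape concrete; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of per-document term frequencies: the same
// (docId, freq) pairs that Lucene 2.x's IndexReader.termDocs(term)
// lets you iterate with doc() and freq().
public class PerDocTermFreq {

    // term -> (docId -> frequency)
    public static Map<String, Map<Integer, Integer>> index(String[] docs) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(token, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum); // count one occurrence
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] docs = {"blue house blue sky", "green house"};
        System.out.println(index(docs).get("blue")); // {0=2}
    }
}
```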
Re: hit.score
Thanks Adrien. On Mon, Mar 27, 2017 at 6:56 PM, Adrien Grand <jpou...@gmail.com> wrote:

> You can use IndexSearcher.explain to see how the score was computed.
>
> Le lun. 27 mars 2017 à 14:46, Manjula Wijewickrema <manjul...@gmail.com> a écrit :
>
>> Hi, Can someone help me to understand the value given by 'hit.score' in Lucene. I indexed a single document with five different words with different frequencies and tried to understand this value. However, it doesn't seem to be normalized term frequency or tf-idf. I am using Lucene 2.9.1. Any help would be highly appreciated.
hit.score
Hi, Can someone help me to understand the value given by 'hit.score' in Lucene. I indexed a single document with five different words with different frequencies and tried to understand this value. However, it doesn't seem to be normalized term frequency or tf-idf. I am using Lucene 2.9.1. Any help would be highly appreciated.
Why hit is 0 for bigrams?
Hi, I tried to index bigrams from a document and the system gave me the following output with the frequencies of the bigrams (output 1):

array size:15 array terms are:{contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1}

For this I used the following code in the createIndex() method:

ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
sw.setOutputUnigrams(false);

Then I tried to search the indexed bigrams of the same document using the following code in the searchIndex() method:

IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
Query query = queryParser.parse(terms[pos[freqs.length - q1]]);
System.out.println("Query: " + query);
Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.length());

For this, the system gave me the following output (output 2):

Query: contents:manjula contents:assist Number of hits: 0
Query: contents:sabaragamuwa contents:univers Number of hits: 0
Query: contents:univers contents:main Number of hits: 0
Query: contents:main contents:librari Number of hits: 0

If someone can, please explain to me: (1) why 'contents: /1' is included in the array as an array element (output 1); (2) why the system returns the query as 'contents:manjula contents:assist' instead of 'manjula assist' (output 2); (3) why the number of hits is given as 0 instead of their frequencies (output 2). I highly appreciate your kind reply. Manjula.
bigram problem
Hi, Could you please explain to me how to determine the tf-idf score for bigrams. My program is able to index and search bigrams correctly, but it does not calculate the tf-idf for bigrams. If someone can, please help me to resolve this. Regards, Manjula.
Re: bigram problem
Dear Parnab, Thanks a lot for your guidance. I prefer to follow the second method, as I have already indexed the bigrams using ShingleAnalyzerWrapper. But I have no idea about how to use NGramTokenizer here. So, could you please write one or two lines of code which show how to use NGramTokenizer for bigrams. Thanks, Manjula. On Wed, Jul 2, 2014 at 7:05 PM, parnab kumar parnab.2...@gmail.com wrote: TF is straightforward; you can simply count the number of occurrences in the doc by simple string matching. For IDF you need to know the total number of docs in the collection and the number of docs containing the bigram. reader.maxDoc() will give you the total number of docs in the collection. To calculate the number of docs containing the bigram, use a phrase query with the slop factor set to 0. The number of docs returned by the IndexSearcher with the phrase query will be the number of docs containing the bigram. I hope this is fine. Alternatively, use NGramTokenizer (where n=2 in your case) while indexing. In that case, each bigram can be interpreted as a normal Lucene term. Thanks, Parnab On Wed, Jul 2, 2014 at 8:45 AM, Manjula Wijewickrema manjul...@gmail.com wrote: Hi, Could you please explain to me how to determine the tf-idf score for bigrams. My program is able to index and search bigrams correctly, but it does not calculate the tf-idf for bigrams. If someone can, please help me to resolve this. Regards, Manjula.
Why bigram tf-idf is 0?
Hi, In my programme, I tried to select the most relevant document based on bigrams. The system gives me the following output:

{contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1}

The frequencies of the bigrams are also correctly identified by the system. But the tf-idf scores of these bigrams are given as 0. However, the same programme gives the correct tf-idf values for unigrams. Following is the code snippet that I wrote to determine the tf-idf of bigrams:

for (int q1 = 1; q1 < NB + 1; q1++) { // NB = number of bigrams
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(terms[pos[freqs.length - q1]]);
    Hits hits = indexSearcher.search(query);
    Iterator<Hit> it = hits.iterator();
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
        Document doc = indexSearcher.doc(hit.doc);
        tfidf[q1 - 1] = hit.score;
    }
}

Here, hit.score should give the tf-idf value of each bigram. Why is it given as 0? If someone can, please explain to me how to resolve this problem. Thanks, Manjula.
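As a sanity check, the per-term tf-idf can be computed by hand with the formulas Lucene's DefaultSimilarity uses (tf = sqrt(freq); idf = 1 + ln(numDocs/(docFreq+1)), with numDocs from reader.maxDoc() and docFreq from a phrase-query search count, per Parnab's advice in the thread above). If the hand computation is nonzero but hit.score is 0, the problem is in how the query reaches the index, not in the scoring itself. A small illustrative sketch (class name mine):

```java
public class BigramTfIdf {
    // tf and idf as in Lucene's DefaultSimilarity.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static double tfIdf(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // e.g. "manjula assist" occurs 4 times in the doc, in 1 of 1 docs:
        System.out.println(tfIdf(4, 1, 1)); // ≈ 0.6137
    }
}
```

Note that idf(1, 1) = 1 + ln(1/2) ≈ 0.3069, which is exactly the idf factor that shows up in the explain() output quoted elsewhere in this archive.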
Re: ShingleAnalyzerWrapper question
Dear Steve, It works. Thanks. On Wed, Jun 11, 2014 at 6:18 PM, Steve Rowe sar...@gmail.com wrote: You should give sw rather than analyzer in the IndexWriter constructor. Steve www.lucidworks.com On Jun 11, 2014 2:24 AM, Manjula Wijewickrema manjul...@gmail.com wrote: Hi, In my programme, I can index and search a document based on unigrams. I modified the code as follows to obtain the results based on bigrams. However, it did not give me the desired output.

public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually", "after", "allow", "almost", "already", "also", "although", "always", "am", "an", "and", "any", "anybody"}; // only a portion
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);
    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }
    w.optimize();
    w.close();
}

Still the output is: {contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3, name/1, sabaragamuwa/1, univers/1}

If anybody can, please help me to obtain the correct output. Thanks, Manjula.
ShingleAnalyzerWrapper question
Hi, In my programme, I can index and search a document based on unigrams. I modified the code as follows to obtain the results based on bigrams. However, it did not give me the desired output.

public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually", "after", "allow", "almost", "already", "also", "although", "always", "am", "an", "and", "any", "anybody"}; // only a portion
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);
    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }
    w.optimize();
    w.close();
}

Still the output is: {contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3, name/1, sabaragamuwa/1, univers/1}

If anybody can, please help me to obtain the correct output. Thanks, Manjula.
Re: Is it wrong to create index writer on each query request.
Hi, What are the other disadvantages (other than the time factor) of creating the index for every request? Manjula. On Thu, Jun 5, 2014 at 2:34 PM, Aditya findbestopensou...@gmail.com wrote: Hi Rajendra, You should NOT create an index writer for every request. Is it time consuming to update the index writer when a new document comes? No. Regards, Aditya www.findbestopensource.com On Thu, Jun 5, 2014 at 12:24 PM, Rajendra Rao rajendra@launchship.com wrote: I have a system in which documents and queries come frequently. I am creating an index writer in memory every time for each query request. I want to know: is it good to separate index writing and loading from query requests? Is it good to save the index writer on hard disk? Is it time consuming to update the index writer when a new document comes?
Re: Phrase indexing and searching
Hi Steve, Thanks for the reply. Could you please simply let me know how to embed ShingleFilter in the code for both indexing and searching? Because different people suggest different snippets of code, and they did not do the job. Thanks, Manjula. On Mon, Dec 23, 2013 at 8:42 PM, Steve Rowe sar...@gmail.com wrote: Hi Manjula, Sounds like ShingleFilter will do what you want: http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html Steve www.lucidworks.com On Dec 22, 2013 11:25 PM, Manjula Wijewickrema manjul...@gmail.com wrote: Dear All, My Lucene programme is able to index single words and search the most matching documents (based on term frequencies) from a corpus for the input document. Now I want to index two-word phrases and search the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful; split it into phrases (say two-term phrases) like: blue house / house very / very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Phrase indexing and searching
Dear All, My Lucene programme is able to index single words and search the most matching documents (based on term frequencies) from a corpus for the input document. Now I want to index two-word phrases and search the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful; split it into phrases (say two-term phrases) like: blue house / house very / very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Phrase indexing and searching
Dear list, My Lucene programme is able to index single words and search the most matching documents (based on term frequencies) from a corpus for the input document. Now I want to index two-word phrases and search the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful; split it into phrases (say two-term phrases) like: blue house / house very / very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Re: Editing StopWordList
Hi Gupta, Thanx a lot for your reply. But I could not understand whether I could modify (by adding more words) the default stop word list, or whether I have to make a new list as an array as follows:

public String[] NEW_STOP_WORDS = {"a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or", "s", "such", "t", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with", "inc", "incorporated", "co.", "ltd", "ltd.", "we", "you", "your", "us", etc. };

and then call it as follows:

SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.NEW_STOP_WORDS);

Am I correct? If not, could you explain to me how I can do this? Thanx in advance. Manjula. On Tue, Dec 21, 2010 at 10:36 AM, Anshum ansh...@gmail.com wrote: Hi Manjula, You could initialize the Analyzer using a modified stop word set. Use StopAnalyzer.ENGLISH_STOP_WORDS_SET to get the default stop set and then add your own words to it. You could then initialize the analyzer using this new stop set instead of the default stop set. Hope that helps. -- Anshum Gupta http://ai-cafe.blogspot.com On Tue, Dec 21, 2010 at 9:20 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, 1) In my application, I need to add more words to the stop word list. Therefore, is it possible to add more words to the default Lucene stop word list? 2) If it is possible, how can I do this? Appreciate any comment from you. Thanks, Manjula.
Editing StopWordList
Hi, 1) In my application, I need to add more words to the stop word list. Therefore, is it possible to add more words to the default Lucene stop word list? 2) If it is possible, how can I do this? Appreciate any comment from you. Thanks, Manjula.
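Anshum's suggestion in the reply above amounts to a set union: copy the default stop set, add your own words, and hand the combined set to the analyzer's constructor. Sketched in plain Java (the stand-in default list below replaces StopAnalyzer.ENGLISH_STOP_WORDS_SET so the example stays self-contained):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of building an extended stop set: copy a default list, add your
// own words, then pass the combined set to the analyzer's constructor.
public class StopWords {

    public static Set<String> extendedStopSet(Set<String> defaults, String... extra) {
        Set<String> combined = new HashSet<>(defaults); // don't mutate the default set
        combined.addAll(Arrays.asList(extra));
        return combined;
    }

    public static void main(String[] args) {
        // Stand-in for StopAnalyzer.ENGLISH_STOP_WORDS_SET:
        Set<String> defaults = new HashSet<>(Arrays.asList("a", "and", "the"));
        Set<String> stops = extendedStopSet(defaults, "inc", "ltd");
        System.out.println(stops.size()); // 5
    }
}
```

Copying first matters because the library's default set may be unmodifiable.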
Re: Analyzer
Dear Erick, Thanx for your information. Manjula. On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson erickerick...@gmail.com wrote: WhitespaceAnalyzer does just that, splits the incoming stream on white space. From the javadocs for StandardAnalyzer: A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents: - Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. - Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. - Recognizes email addresses and internet hostnames as one token. Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. Best, Erick On Tue, Nov 30, 2010 at 12:06 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi Steve, Thanx a lot for your reply. Yes, there are only two classes, and it's correct the way you have understood the problem. As you instructed, I tried WhitespaceAnalyzer for querying (instead of StandardAnalyzer) and it seems to me that it gives better results than StandardAnalyzer. So could you please let me know the differences between StandardAnalyzer and WhitespaceAnalyzer. I highly appreciate your response. Thanx. Manjula. On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe sar...@syr.edu wrote: Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail.
I suspect that you could benefit from reading the book Lucene in Action, 2nd edition: http://www.manning.com/hatcher3/ You would also likely benefit from using Luke, the Lucene index browser, to better understand your indexes' contents and debug how queries match documents: http://code.google.com/p/luke/ I think your question is whether you're using Analyzers correctly. It sounds like you are creating two separate indexes (one for each of your classes), and you're using SnowballAnalyzer on the indexing side for both indexes, and StandardAnalyzer on the query side. The usual advice is to use the same Analyzer on both the query and the index side. But it appears to be the case that you are taking stemmed index terms from your index #1 and then querying index #2 using these stemmed terms. If this is true, then you want the query-time analyzer in your second index not to change the query terms. You'll likely get better results using WhitespaceAnalyzer, which tokenizes on whitespace and does no further analysis, rather than StandardAnalyzer. Steve -Original Message- From: manjula wijewickrema [mailto:manjul...@gmail.com] Sent: Monday, November 29, 2010 4:32 AM To: java-user@lucene.apache.org Subject: Analyzer Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to search for the most relevant document to the document indexed in the first one. In the first Java class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second Java class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term of the first class as the search term parameter to the searchIndex method of the second class).
Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and then to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct or not? I highly appreciate any comment. Thanx. Manjula.
Analyzer
Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to search for the most relevant document to the document indexed in the first one. In the first Java class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second Java class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term of the first class as the search term parameter to the searchIndex method of the second class). Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and then to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct or not? I highly appreciate any comment. Thanx. Manjula.
Re: Analyzer
Hi Steve, Thanx a lot for your reply. Yes, there are only two classes, and it's correct the way you have understood the problem. As you instructed, I tried WhitespaceAnalyzer for querying (instead of StandardAnalyzer) and it seems to me that it gives better results than StandardAnalyzer. So could you please let me know the differences between StandardAnalyzer and WhitespaceAnalyzer. I highly appreciate your response. Thanx. Manjula. On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe sar...@syr.edu wrote: Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail. I suspect that you could benefit from reading the book Lucene in Action, 2nd edition: http://www.manning.com/hatcher3/ You would also likely benefit from using Luke, the Lucene index browser, to better understand your indexes' contents and debug how queries match documents: http://code.google.com/p/luke/ I think your question is whether you're using Analyzers correctly. It sounds like you are creating two separate indexes (one for each of your classes), and you're using SnowballAnalyzer on the indexing side for both indexes, and StandardAnalyzer on the query side. The usual advice is to use the same Analyzer on both the query and the index side. But it appears to be the case that you are taking stemmed index terms from your index #1 and then querying index #2 using these stemmed terms. If this is true, then you want the query-time analyzer in your second index not to change the query terms. You'll likely get better results using WhitespaceAnalyzer, which tokenizes on whitespace and does no further analysis, rather than StandardAnalyzer.
Steve -Original Message- From: manjula wijewickrema [mailto:manjul...@gmail.com] Sent: Monday, November 29, 2010 4:32 AM To: java-user@lucene.apache.org Subject: Analyzer Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to search for the most relevant document to the document indexed in the first one. In the first Java class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second Java class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term of the first class as the search term parameter to the searchIndex method of the second class). Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and then to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct or not? I highly appreciate any comment. Thanx. Manjula.
Re: Databases
Hi, Thanks a lot for your information. Regards, Manjula. On Fri, Jul 23, 2010 at 12:48 PM, tarun sapra t.sapr...@gmail.com wrote: You can use Hibernate Search to maintain the synchronization between the Lucene index and a MySQL RDBMS. On Fri, Jul 23, 2010 at 11:16 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, Normally, when I am building my index directory for indexed documents, I simply keep the files to index in a directory called 'filesToIndex'. So in this case, I do not use any standard database management system such as MySQL or any other. 1) Would it be possible to use MySQL or another DBMS for the purpose of managing indexed documents in Lucene? 2) Is it necessary to follow that kind of methodology with Lucene? 3) If we do not use such a database management system, will there be any disadvantages with a large number of indexed files? Appreciate any reply from you. Thanks, Manjula. -- Thanks & Regards, Tarun Sapra
Databases
Hi, Normally, when I am building my index directory for indexed documents, I simply keep the files to index in a directory called 'filesToIndex'. So in this case, I do not use any standard database management system such as MySQL or any other. 1) Would it be possible to use MySQL or another DBMS for the purpose of managing indexed documents in Lucene? 2) Is it necessary to follow that kind of methodology with Lucene? 3) If we do not use such a database management system, will there be any disadvantages with a large number of indexed files? Appreciate any reply from you. Thanks, Manjula.
Re: scoring and index size
Hi Koji, Thanks for your information. Manjula On Fri, Jul 9, 2010 at 5:04 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (10/07/09 19:30), manjula wijewickrema wrote: Uwe, thanx for your comments. Following is the code I used in this case. Could you please let me know where I have to insert the UNLIMITED field length, and how? Thanx again! Manjula Manjula, You can set the UNLIMITED field length in the IW constructor: http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#IndexWriter%28org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexWriter.MaxFieldLength%29 Koji -- http://www.rondhuit.com/en/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MaxFieldLength
Ok Erick, the answer is there. If no document exceeds the default MaxFieldLength, then no document will be truncated, even if we increase the number of documents in the index. Am I correct? Thanx for your commitment. Manjula. On Tue, Jul 13, 2010 at 3:57 AM, Erick Erickson erickerick...@gmail.com wrote: I'm not sure I understand your question. The number of documents has no bearing on the field length of each, which is what the max field length is all about. You can change the value here by calling IndexWriter.setMaxFieldLength to something shorter than the default. So no, if no document exceeds the default (terms, not characters), no document will be truncated. The 10,000 limit also has no bearing on how much space indexing a document takes as long as there are fewer than 10,000 terms. That is, a document with 5,000 terms will take up just as much space with any MaxFieldLength >= 5,000. HTH, Erick On Mon, Jul 12, 2010 at 4:00 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, I have seen that, once the field length of a document goes over a certain limit ( http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH gives it as 10,000 terms by default), Lucene truncates those documents. Is there any possibility of documents being truncated if we increase the number of indexed documents (assume there are no individual documents which exceed the default MaxFieldLength of Lucene)? Thanx, Manjula.
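Erick's point can be made concrete: truncation is per field, cutting the token stream after maxFieldLength terms, so occurrences past the cut-off are simply never indexed, no matter how many documents the index holds. A small plain-Java illustration (names and numbers are mine, not from the thread):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration of DEFAULT_MAX_FIELD_LENGTH truncation: only the first
// maxFieldLength tokens of a field are indexed, so occurrences of a term
// past the cut-off silently vanish from its indexed frequency.
public class FieldTruncation {

    public static long countIndexed(List<String> tokens, String term, int maxFieldLength) {
        return tokens.stream()
                     .limit(maxFieldLength) // the writer stops indexing here
                     .filter(term::equals)
                     .count();
    }

    public static void main(String[] args) {
        // 12 tokens, with "metaphysics" at positions 2 and 11 (0-based):
        List<String> tokens = new ArrayList<>(Collections.nCopies(12, "filler"));
        tokens.set(2, "metaphysics");
        tokens.set(11, "metaphysics");
        System.out.println(countIndexed(tokens, "metaphysics", 10)); // 1 (2nd is cut off)
        System.out.println(countIndexed(tokens, "metaphysics", 12)); // 2
    }
}
```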
Re: Why not normalization?
Hi Rebecca, Thanks for your valuable comments. Yes, I observed that, once the number of terms in the document goes up, the fieldNorm value goes down correspondingly. I think, therefore, there won't be any fault due to the variation of the total number of terms in the document. Am I right? Manjula. On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson bec.wat...@gmail.com wrote: hi, 1) Although Lucene uses tf to calculate scoring, it seems to me that term frequency has not been normalized. Even if I index several documents, it does not normalize the tf value. Therefore, since the total number of words in indexed documents varies, can't there be a fault in Lucene's scoring? tf = term frequency, i.e. the number of times the term appears in the document, while idf = inverse document frequency - a measure of how rare a term is, i.e. related to how many documents the term appears in. If term1 occurs more frequently in a document, i.e. tf is higher, you want to weight the document higher when you search for term1. But if term1 is a very frequent term, i.e. in lots of documents, then it's probably not as important to an overall search (where we have term1, term2 etc.), so you want to downweight it (that's where idf comes in). Then the normalisations like length normalisation (allowing 'fair' scoring across varied field lengths) come in too. The tf-idf scoring formula used by Lucene is a scoring method that's been around a long, long time... there are competing scoring metrics, but that's an IR thing and not an argument you want to start on the Lucene lists! :) These are IR ('information retrieval') concepts, and you might want to start by going through tf-idf scoring / some explanations of this kind of scoring: http://en.wikipedia.org/wiki/Tf%E2%80%93idf http://wiki.apache.org/lucene-java/InformationRetrieval 2) What is the formula to calculate this fieldNorm value?
In terms of how Lucene implements its tf-idf scoring, you can see here: http://lucene.apache.org/java/3_0_2/scoring.html Also, the Lucene in Action book is a really good book if you are starting out with Lucene (and will save you a lot of grief with understanding Lucene / setting up your application!). It covers all the basics and then moves on to more advanced stuff, and has lots of code examples too: http://www.manning.com/hatcher2/ Hope that helps, bec :)
scoring and index size
Hi, I ran a small programme to see the way Lucene scores a single indexed document. The explain() method gave me the following results.

*** Searching for 'metaphysics' Number of hits: 1 0.030706111 0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of: 10.246951 = tf(termFreq(contents:metaphys)=105) 0.30685282 = idf(docFreq=1, maxDocs=1) 0.009765625 = fieldNorm(field=contents, doc=0) *

But I encountered the following problems: 1) In this case, I did not change anything about boost values. So shouldn't fieldNorm = 1/sqrt(terms in field)? (I noticed in the Lucene email archive that the default boost value is 1.) 2) But even if I manually calculate the value for fieldNorm (as 1/sqrt(terms in field)), it only approximately matches the value given by the system for fieldNorm. Can this be due to encode/decode precision loss of the norm? 3) My indexed document consisted of a total of 19078 words, including 125 occurrences of the word 'metaphysics' (i.e. my query; I input a single-term query). But as you can see in the above output, the system gives only 105 counts for the word 'metaphysics'. However, once I removed some part of my indexed document, counted the occurrences of 'metaphysics', and checked against the system results, I noticed that with the reduced text the system counts it correctly. Why this kind of behaviour? Is there any limitation on indexed documents? If somebody can, please help me to solve these problems. Thanks! Manjula.
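Regarding question 2, this is very likely norm encode/decode precision loss: Lucene stores each norm in a single byte with a 3-bit mantissa (see SmallFloat in the Lucene sources), so fieldNorm values are heavily quantized. The sketch below reproduces that encoding. Note that 1/sqrt(10000) = 0.01 decodes to exactly the 0.009765625 in the explain() output above, which also fits question 3: a 19078-word document truncated at the default 10,000-term MaxFieldLength would lose the later occurrences of 'metaphysics' (this reading of the numbers is my inference, not something the thread confirms):

```java
public class NormEncoding {
    // One-byte norm encoding with a 3-bit mantissa, as in Lucene's
    // SmallFloat.floatToByte315 / byte315ToFloat.
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float fieldNorm = (float) (1.0 / Math.sqrt(10000)); // 0.01
        float decoded = byte315ToFloat(floatToByte315(fieldNorm));
        System.out.println(decoded); // 0.009765625 — matches the explain() output
    }
}
```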
Re: scoring and index size
Uwe, thanx for your comments. Following is the code I used in this case. Could you please let me know where I have to insert the UNLIMITED field length, and how? Thanx again! Manjula

--code--

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";
    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        //searchIndex("rice AND milk");
        searchIndex("metaphysics");
        //searchIndex("banana");
        //searchIndex("foo");
    }

    public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();
            //contents#setOmitNorms(true);
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));
            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    public static void searchIndex(String searchString) throws IOException, ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());
        TopDocs results = indexSearcher.search(query, 10);
        ScoreDoc[] hits1 = results.scoreDocs;
        for (ScoreDoc hit : hits1) {
            Document doc = indexSearcher.doc(hit.doc);
            //System.out.printf("%5.3f %s%n", hit.score, doc.get(FIELD_CONTENTS));
            System.out.println(hit.score);
            //Searcher.explain("rice", 0);
            //System.out.println(indexSearcher.explain(query, 0));
        }
        System.out.println(indexSearcher.explain(query, 0));
        //System.out.println(indexSearcher.explain(query, 1));
        //System.out.println(indexSearcher.explain(query, 2));
        //System.out.println(indexSearcher.explain(query, 3));
        Iterator<Hit> it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }
}

On Fri, Jul 9, 2010 at 1:06 PM, Uwe Schindler u...@thetaphi.de wrote: Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the number of terms per document is limited. The calculation precision is limited by the float norm encoding, but also if your analyzer removed stop words, so the norm is not what you expect?
- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -----Original Message----- From: manjula wijewickrema [mailto:manjul...@gmail.com] Sent: Friday, July 09, 2010 9:21 AM To: java-user@lucene.apache.org Subject: scoring and index size [quoted original message snipped; it appears in full at the start of this thread]
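For what it's worth, the reported fieldNorm can be inverted to estimate how many terms Lucene actually counted for the field. With all boosts at their default of 1, fieldNorm = lengthNorm = 1/sqrt(terms in field), so terms ≈ 1/fieldNorm². For 0.009765625 that implies roughly 10,486 terms, which is plausibly consistent with the default MaxFieldLength.LIMITED cap of 10,000 terms per field (plus the coarse one-byte norm encoding), rather than the document's full 19,078 words. A pure-arithmetic sketch (the implied count is an inference, not stated in the thread):

```java
public class ImpliedTermCount {
    public static void main(String[] args) {
        // With default boosts, fieldNorm = lengthNorm = 1/sqrt(terms in field),
        // so the term count Lucene saw is approximately 1 / fieldNorm^2.
        double fieldNorm = 0.009765625;               // taken from the explain() output
        double impliedTerms = 1.0 / (fieldNorm * fieldNorm);
        System.out.println(impliedTerms);             // ~10485.76, i.e. ~10,486 terms
    }
}
```

Passing IndexWriter.MaxFieldLength.UNLIMITED to the IndexWriter constructor (as Uwe suggests) lifts the 10,000-term default so all 19,078 words get indexed.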
Re: Why not normalization?
Thanks On Fri, Jul 9, 2010 at 1:10 PM, Uwe Schindler u...@thetaphi.de wrote: Thanks for your valuable comments. Yes, I observed that once the number of terms in the document goes up, the fieldNorm value goes down correspondingly. I think, therefore, there won't be any fault due to the variation of the total number of terms in the document. Am I right? With the current scoring model advanced statistics are not available. There are currently some approaches to add BM25 support to Lucene, for which the index format needs to be enhanced to contain more statistics (number of terms per document, average number of terms per document, ...). On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson bec.wat...@gmail.com wrote: hi, 1) Although Lucene uses tf to calculate scoring, it seems to me that term frequency has not been normalized. Even if I index several documents, it does not normalize the tf value. Therefore, since the total number of words in the indexed documents varies, can't there be a fault in Lucene's scoring? tf = term frequency, i.e. the number of times the term appears in the document, while idf (inverse document frequency) is a measure of how rare a term is, i.e. related to how many documents the term appears in. If term1 occurs more frequently in a document, i.e. tf is higher, you want to weight the document higher when you search for term1. But if term1 is a very frequent term, i.e. in lots of documents, then it's probably not as important to an overall search (where we have term1, term2 etc.), so you want to downweight it (that's where idf comes in). Then the normalisations like length normalisation (allowing for 'fair' scoring across varied field lengths) come in too. The tf-idf scoring formula used by Lucene is a scoring method that's been around a long, long time... there are competing scoring metrics, but that's an IR thing and not an argument you want to start on the Lucene lists!
:) These are IR ('information retrieval') concepts and you might want to start by going through the tf-idf scoring / some explanations of this kind of scoring. http://en.wikipedia.org/wiki/Tf%E2%80%93idf http://wiki.apache.org/lucene-java/InformationRetrieval 2) What is the formula to calculate this fieldNorm value? In terms of how Lucene implements its tf-idf scoring, you can see here: http://lucene.apache.org/java/3_0_2/scoring.html Also, the Lucene in Action book is a really good book if you are starting out with Lucene (and will save you a lot of grief with understanding Lucene / setting up your application!). It covers all the basics and then moves on to more advanced stuff, and has lots of code examples too: http://www.manning.com/hatcher2/ Hope that helps, bec :) - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
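As a concrete check of the formulas on that scoring page, the explain() output earlier in this digest (termFreq=105, docFreq=1, maxDocs=1, fieldNorm=0.009765625) can be reproduced by hand from the default DefaultSimilarity pieces: tf = sqrt(freq), idf = ln(maxDocs/(docFreq+1)) + 1. A pure-arithmetic sketch, with the fieldNorm taken as given since it comes from the quantized norm byte:

```java
public class FieldWeightByHand {
    public static void main(String[] args) {
        int termFreq = 105, docFreq = 1, maxDocs = 1;
        double tf = Math.sqrt(termFreq);                               // 10.246951
        double idf = Math.log(maxDocs / (double) (docFreq + 1)) + 1.0; // 0.30685282
        double fieldNorm = 0.009765625;                                // decoded norm byte
        double fieldWeight = tf * idf * fieldNorm;                     // ~0.030706111
        System.out.println(tf + " * " + idf + " * " + fieldNorm + " = " + fieldWeight);
    }
}
```

All three factors match the explain() output to the printed precision, which is a handy way to convince yourself which formula each line corresponds to.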
Re: Lucene Scoring
Dear Ian, Thanks a lot for your reply. The way you proposed works correctly and solved half of my problem. Once I ran the program, the system gave me the following output: *** Searching for 'milk' Number of hits: 1 0.13287117 0.13287117 = (MATCH) fieldWeight(contents:milk in 0), product of: 1.7320508 = tf(termFreq(contents:milk)=3) 0.30685282 = idf(docFreq=1, maxDocs=1) 0.25 = fieldNorm(field=contents, doc=0) Hit: D:\JADE\work\MobilNet\Lucene291\filesToIndex\deron-foods.txt *** Here I have no problem calculating the values for tf and idf, but I have no idea how to calculate fieldNorm. According to http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int) I think norm(t,d) gives the value for fieldNorm, and in my case the system returns the value lengthNorm(field) for norm(t,d). 1) Am I correct? 2) If so, could you please let me know the way (formula) of calculating lengthNorm(field)? (I checked several documents and code samples to understand this, but was unable to find the mathematical formula behind this method.) 3) If lengthNorm(field) is not what is behind fieldNorm, then how do I calculate fieldNorm? Please help me to resolve this matter. Manjula. On Tue, Jul 6, 2010 at 12:47 PM, Ian Lea ian@gmail.com wrote: You are calling the explain method incorrectly. You need something like System.out.println(indexSearcher.explain(query, 0)); See the javadocs for details. -- Ian. On Tue, Jul 6, 2010 at 7:39 AM, manjula wijewickrema manjul...@gmail.com wrote: Dear Grant, Thanks a lot for your guidance. As you mentioned, I tried to use the explain() method to get explanations for the relevant scoring. But once I call the explain() method, the system indicates the following error: 'The method explain(Query,int) in the type Searcher is not applicable for the arguments (String, int)'.
In my code I call the explain() method as follows: Searcher.explain("rice", 0); Possibly something is wrong with my way of passing parameters. In my case, I have chosen "rice" as my query and indexed only one document. Could you please let me know what's wrong with this? I have also included the code. Thanks, Manjula [quoted LuceneDemo code snipped; the same listing appears in full in the original message of this thread]
Re: Lucene Scoring
Dear Grant, Thanks a lot for your guidance. As you mentioned, I tried to use the explain() method to get explanations for the relevant scoring. But once I call the explain() method, the system indicates the following error: 'The method explain(Query,int) in the type Searcher is not applicable for the arguments (String, int)'. In my code I call the explain() method as follows: Searcher.explain("rice", 0); Possibly something is wrong with my way of passing parameters. In my case, I have chosen "rice" as my query and indexed only one document. Could you please let me know what's wrong with this? I have also included the code. Thanks, Manjula

code--

import org.apache.lucene.search.Searcher;

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    searchIndex("rice");
  }

  public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      // System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
      System.out.println(hit.score);
      Searcher.explain("rice", 0);
    }
    Iterator<Hit> it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}

On Mon, Jul 5, 2010 at 7:46 PM, Grant Ingersoll gsing...@apache.org wrote: On Jul 5, 2010, at 5:02 AM, manjula wijewickrema wrote: Hi, In my application I input only a single-term query (at one time) and get back the corresponding scores for those queries. But I am struggling a little to understand Lucene scoring. I have referred to http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html and some other pages to resolve my questions, but some still remain. 1) Why does it take the square root of the frequency as the tf value and the square of the idf value in the score function?
Somewhat arbitrary, I suppose, but I think someone way back did some tests and decided it performed best in general. More importantly, the point of the Similarity class is that you can override these if you desire. 2) If I enter a single-term query, what will be returned by coord(q,d)? Since there is always one term in the query, I think it should always be 1. Am I correct? Should be. You can run the explain() method to confirm. 3) I am also struggling to understand sumOfSquaredWeights (in queryNorm(q)). As I understand it, this value depends on the nature of the query we input, and depending on that it uses different query types such as TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, etc. But if I always use a single-term query, which of the above will the system select? The queryNorm is an attempt at making scores comparable across queries. Again, I'd try the explain() method to see the practical aspects of how it affects the score. See http://lucene.apache.org/java/2_4_0/scoring.html for more info on scoring. -Grant - To unsubscribe, e-mail: java-user-unsubscr
Lucene Scoring
Hi, In my application I input only a single-term query (at one time) and get back the corresponding scores for those queries. But I am struggling a little to understand Lucene scoring. I have referred to http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html and some other pages to resolve my questions, but some still remain. 1) Why does it take the square root of the frequency as the tf value and the square of the idf value in the score function? 2) If I enter a single-term query, what will be returned by coord(q,d)? Since there is always one term in the query, I think it should always be 1. Am I correct? 3) I am also struggling to understand sumOfSquaredWeights (in queryNorm(q)). As I understand it, this value depends on the nature of the query we input, and depending on that it uses different query types such as TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, etc. But if I always use a single-term query, which of the above will the system select? If somebody can please help me to resolve these problems. Appreciate any reply from you. Regards, Manjula
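For a single-term query the practical effect of these factors is small: coord(q,d) = 1/1 = 1, and with the default Similarity queryNorm(q) = 1/sqrt(sumOfSquaredWeights), where for a lone TermQuery with default boost the sum of squared weights is just (idf * boost)², so queryNorm cancels one idf factor and the final score reduces to the fieldWeight shown by explain(). A pure-arithmetic sketch, with the idf value taken from the explain() outputs in this digest:

```java
public class SingleTermQueryNorm {
    public static void main(String[] args) {
        double idf = 0.30685282;   // idf(docFreq=1, maxDocs=1) from explain()
        double boost = 1.0;        // default query boost
        double sumOfSquaredWeights = (idf * boost) * (idf * boost);
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);   // = 1/idf
        System.out.println(queryNorm);
        // score = coord * (idf * queryNorm) * (tf * idf * fieldNorm)
        //       = tf * idf * fieldNorm, since coord = 1 and queryNorm = 1/idf,
        // which is why hit.score equals the fieldWeight line of explain() here.
    }
}
```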
Re: How to get file names instead of paths?
Dear Ian, The snippet you suggested works nicely. Thanks a lot for your kind help. Manjula. On Fri, Jun 11, 2010 at 4:00 PM, Ian Lea ian@gmail.com wrote: Something like this:

File f = new File(path);
String fn = f.getName();
return fn.substring(0, fn.lastIndexOf('.'));

-- Ian. On Fri, Jun 11, 2010 at 11:20 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, Using the following program I was able to get the entire file path of the indexed files which matched the given queries. But my intention is to get only the file names, even without the .txt extension, as I need to send these file names as labels to another application. So please let me know how I can get only the file names in the following code. Thanks in advance! Manjula. [quoted LuceneDemo code snipped; it appears in full in the original message of this thread] - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
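Ian's suggestion in runnable form: java.io.File.getName() strips the directory part (no file needs to exist on disk), and substring up to the last '.' removes the extension. The path below is hypothetical, standing in for the stored FIELD_PATH value:

```java
import java.io.File;

public class FileNameLabel {
    public static void main(String[] args) {
        String path = "/home/manjula/filesToIndex/deron-foods.txt"; // hypothetical path
        File f = new File(path);
        String fn = f.getName();                             // "deron-foods.txt"
        String label = fn.substring(0, fn.lastIndexOf('.')); // "deron-foods"
        System.out.println(label);
    }
}
```

Note that fn.lastIndexOf('.') returns -1 if the name has no extension, so production code should guard against that before calling substring.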
How to get file names instead of paths?
Hi, Using the following program I was able to get the entire file path of the indexed files which matched the given queries. But my intention is to get only the file names, even without the .txt extension, as I need to send these file names as labels to another application. So please let me know how I can get only the file names in the following code. Thanks in advance! Manjula. My code:

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    searchIndex("rice");
    searchIndex("milk");
    searchIndex("banana");
    searchIndex("foo");
  }

  public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
    }
    Iterator<Hit> it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}
Re: Arrange terms[i]
Dear Grant, Thanks for your reply. Manjula On Mon, May 24, 2010 at 4:37 PM, Grant Ingersoll gsing...@apache.org wrote: On May 20, 2010, at 5:15 AM, manjula wijewickrema wrote: Hi, I wrote a program to get the frequencies and terms of an indexed document. The output comes as follows; if I print: + tfv[0] Output: array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1, sampl/1, term/4, test/1} In the same way I can print terms[i] and freqs[i], but the problem is that while printing terms[i], the output (array elements) comes in English alphabetical order (as above), and freqs[i] is also arranged according to that particular order. Is there a way to arrange terms[i] according to the ascending/descending order of their frequencies? Yes, have a look at the TermVectorMapper. You will need to implement a variation of this to build up the data structures you need. -Grant - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Problem of getTermFrequencies()
Thanks On Mon, May 17, 2010 at 10:19 PM, Grant Ingersoll gsing...@apache.org wrote: Note, depending on your downstream use, you may consider using a TermVectorMapper that allows you to construct your own data structures as needed. -Grant On May 17, 2010, at 3:16 PM, Ian Lea wrote: terms and freqs are arrays. Try terms[i] and freqs[i]. -- Ian. On Mon, May 17, 2010 at 12:23 PM, manjula wijewickrema manjul...@gmail.com wrote: [quoted question and Testing code snipped; they appear in full in the original 'Problem of getTermFrequencies()' message of this thread] - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
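A related question in this digest asks for the total count of terms in the term frequency vector (needed for a normalized term frequency): the vector's length only counts distinct terms, but summing getTermFrequencies() counts duplicate occurrences too. A pure-Java sketch with the frequencies hard-coded from the Display output of this thread, since running the real thing needs a Lucene index:

```java
public class NormalizedTf {
    public static void main(String[] args) {
        // Frequencies from the Display output:
        // capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1
        String[] terms = {"capabl", "code", "frequenc", "lucen", "over", "sampl", "term", "test"};
        int[] freqs = {1, 2, 1, 2, 1, 1, 1, 1};
        int total = 0;
        for (int f : freqs) total += f;          // 10 tokens in the field, duplicates included
        for (int i = 0; i < terms.length; i++)
            System.out.println(terms[i] + " " + freqs[i] / (double) total);
        // e.g. "code" -> 2/10 = 0.2
    }
}
```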
Arrange terms[i]
Hi, I wrote a program to get the frequencies and terms of an indexed document. The output comes as follows; if I print: + tfv[0] Output: array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1, sampl/1, term/4, test/1} In the same way I can print terms[i] and freqs[i], but the problem is that while printing terms[i], the output (array elements) comes in English alphabetical order (as above), and freqs[i] is also arranged according to that particular order. Is there a way to arrange terms[i] according to the ascending/descending order of their frequencies? Thanks in advance. Manjula
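Besides the TermVectorMapper route, one simple way is to sort an index array over the parallel terms[]/freqs[] arrays so the pairing stays intact. A pure-Java sketch using the frequencies shown in the output above (hard-coded for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortByFreq {
    public static void main(String[] args) {
        final String[] terms = {"capabl", "code", "frequenc", "lucen", "over", "sampl", "term", "test"};
        final int[] freqs   = {1, 2, 1, 4, 1, 1, 4, 1};
        // Sort positions by descending frequency instead of the arrays themselves,
        // so terms[i] <-> freqs[i] stays aligned.
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) { return freqs[b] - freqs[a]; }
        });
        for (int i : order) System.out.println(terms[i] + "/" + freqs[i]);
        // lucen/4, term/4, code/2, then the freq-1 terms
        // (Arrays.sort on objects is stable, so ties keep their alphabetical order)
    }
}
```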
Re: How to call high freq. terms using the HighFreqTerms class
hi Erick, Thanks On Sat, May 15, 2010 at 5:37 PM, Erick Erickson erickerick...@gmail.com wrote: It looks like a stand-alone program, so you don't call it. You probably want to get the source code and take a look at how that program works to get an idea of how to do what you want. See the instructions here for getting the source: http://wiki.apache.org/lucene-java/HowToContribute HTH Erick On Sat, May 15, 2010 at 1:49 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, I am struggling with using the HighFreqTerms class for the purpose of finding high-frequency terms in my index. My target is to get the high-frequency terms in an indexed document (a single document). To do that I have added the org.apache.lucene.misc package to my project. I think up to that point I am correct, but after that I have no idea how to call this in my code. Although I have looked in the Lucene email archive, I was unable to find a hint regarding the call of this class. If anybody can please give me sample code for using this class (and the relevant methods) in a way that suits my purpose. I appreciate your kind help. Thanks Manjula
Problem of getTermFrequencies()
Hi, I wrote some code intended to display the indexed terms of a single document and get their term frequencies. Although it displays the terms in the index, it does not give the term frequencies; instead it displays 'frequencies are:[...@80fa6f'. What is the reason for this? The code I have written and the display are given below.

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.TermFreqVector;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class Testing {

  public static void main(String[] args) throws IOException, ParseException {
    // StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    try {
      Directory directory = new RAMDirectory();
      IndexWriter w = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
      Document doc = new Document();
      String text = "This is a sample codes code for testing lucene's capabilities over lucene term frequencies";
      doc.add(new Field("title", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
      w.addDocument(doc);
      w.close();
      IndexReader ir = IndexReader.open(directory);
      TermFreqVector[] tfv = ir.getTermFreqVectors(0);
      // for (int xy = 0; xy < tfv.length; xy++) {
      String[] terms = tfv[0].getTerms();
      int[] freqs = tfv[0].getTermFrequencies();
      // System.out.println("terms are:" + tfv[xy]);
      // System.out.println("length is:" + terms.length);
      System.out.println("array terms are:" + tfv[0]);
      System.out.println("terms are:" + terms);
      System.out.println("frequencies are:" + freqs);
      // }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Display:

array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1}
terms are:[Ljava.lang.String;@1e13d52
frequencies are:[...@80fa6f

If somebody can please help me to get the desired output. Thanks, Manjula.
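The '[Ljava.lang.String;@1e13d52' and '[...@80fa6f' output is just Java's default array toString (type tag plus hash code), not a Lucene problem. Indexing with terms[i]/freqs[i] as suggested in the replies works, and so does java.util.Arrays.toString. A runnable sketch with hard-coded arrays standing in for the term vector data:

```java
import java.util.Arrays;

public class PrintArrays {
    public static void main(String[] args) {
        String[] terms = {"capabl", "code", "frequenc"};
        int[] freqs = {1, 2, 1};
        System.out.println(terms);                  // default toString: [Ljava.lang.String;@...
        System.out.println(Arrays.toString(terms)); // [capabl, code, frequenc]
        System.out.println(Arrays.toString(freqs)); // [1, 2, 1]
        for (int i = 0; i < terms.length; i++)      // or pair them up explicitly
            System.out.println(terms[i] + "/" + freqs[i]);
    }
}
```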
Re: Problem of getTermFrequencies()
Dear Ian, I changed it as you said and now it is working nicely. Thanks a lot for your kind help. Manjula

On Mon, May 17, 2010 at 6:46 PM, Ian Lea ian@gmail.com wrote:

terms and freqs are arrays. Try terms[i] and freqs[i]. -- Ian.

On Mon, May 17, 2010 at 12:23 PM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, I wrote a code to display the indexed terms of a single document and get their term frequencies. Although it displays the terms in the index, it does not give the term frequencies. Instead it displays 'frequencies are:[...@80fa6f'. What is the reason for this? The code I have written is in my original message above, and the display is as follows.

Display:
array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1}
terms are:[Ljava.lang.String;@1e13d52
frequencies are:[...@80fa6f

If somebody can please help me get the desired output. Thanks, Manjula.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
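Ian's fix is to index the two parallel arrays together instead of printing the array references themselves (which is what produces the `[...@80fa6f` output). A minimal, self-contained sketch of the corrected loop, with the term and frequency values from the "array terms are:" output above hard-coded in place of the getTerms()/getTermFrequencies() calls:

```java
public class TermFreqPrint {
    public static void main(String[] args) {
        // Stand-ins for tfv[0].getTerms() and tfv[0].getTermFrequencies(),
        // hard-coded here with the values shown in the output above.
        String[] terms = {"capabl", "code", "frequenc", "lucen", "over", "sampl", "term", "test"};
        int[] freqs = {1, 2, 1, 2, 1, 1, 1, 1};

        // The arrays are parallel: freqs[i] is the frequency of terms[i].
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + "/" + freqs[i]);
        }
    }
}
```

With the real vector in place of the hard-coded arrays, the same loop prints each stemmed term next to its in-document count.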
Re: Error of the code
Hi Ian, Thanks for your reply. vector.size() returns the total number of indexed terms in the index. However, I was able to run the program and get the results finally with your help. Thanks a lot. Manjula

On Thu, May 13, 2010 at 6:52 PM, Ian Lea ian@gmail.com wrote:

What does vector.size() return? You don't appear to be doing anything with the String term in "for (String term : vector.getTerms())" - presumably you intend to. -- Ian.

On Thu, May 13, 2010 at 1:16 PM, manjula wijewickrema manjul...@gmail.com wrote:

Dear Ian, Thanks a lot for your immediate reply. As you mentioned, I replaced the lines as follows:

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

Now the error has vanished, thanks. But I still can't see the results, although I have moved those lines after iwriter.close(). What is the reason for this? Sample code after the modifications:

...
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
iwriter.addDocument(doc);
iwriter.close();
IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
int size = vector.size();
for (String term : vector.getTerms())
    System.out.println("size = " + size);
IndexSearcher isearcher = new IndexSearcher(directory, true);
...

I appreciate your kind cooperation. Manjula

On Thu, May 13, 2010 at 3:45 PM, Ian Lea ian@gmail.com wrote:

You need to replace this:

TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");

with

IndexReader ir = whatever(...);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

And you'll need to move it to after the writer.close() call if you want it to see the doc you've just added. -- Ian.
On Thu, May 13, 2010 at 11:07 AM, manjula wijewickrema manjul...@gmail.com wrote:

Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). To do that I used the code shown in my original message, but there is a compile-time error in it that I can't figure out. Could somebody guide me on what's wrong with it? Compile-time error I got: "Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader."
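A note on the vector.size() point discussed in this thread: size() (like getTerms().length) counts distinct terms only, so a normalized term frequency needs the sum of the frequency array rather than its length. A small pure-Java sketch, where the freqs values are hypothetical stand-ins for what getTermFrequencies() would return:

```java
public class NormalizedTf {
    public static void main(String[] args) {
        // Hypothetical stand-in for vector.getTermFrequencies().
        int[] freqs = {1, 2, 1, 2, 1, 1, 1, 1};

        // vector.size() corresponds to freqs.length: the distinct-term count.
        int distinctTerms = freqs.length;

        // Total tokens = sum of per-term frequencies (counts duplicates too).
        int totalTokens = 0;
        for (int f : freqs) totalTokens += f;

        // Normalized term frequency of term i: freqs[i] / totalTokens.
        double tfOfTerm1 = (double) freqs[1] / totalTokens;

        System.out.println(distinctTerms + " distinct terms, "
            + totalTokens + " total tokens, tf = " + tfOfTerm1);
    }
}
```

The distinction matters exactly because a document with duplicated words has a larger token total than its distinct-term count.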
Access indexed terms
Hi, Is it possible to put the indexed terms into an array in Lucene? For example, imagine I have indexed a single document in Lucene and now I want to access those terms in the index. Is it possible to retrieve (call) those terms as array elements? If it is possible, then how? Thanks, Manjula
Re: Access indexed terms
Hi Andrzej, Thanks for the reply. But as you have mentioned, creating arrays for indexed terms seems to be a little difficult. My intention here is to find the term frequencies of an indexed document. I can find the term frequency of a particular term (given as a query) if I specify the term in the code. But what I really want is to get the term frequency (or even the number of times it appears in the document) of all indexed terms (or the high-frequency terms) without naming them in the code. Is there an alternative way to do that? Thanks, Manjula

On Fri, May 14, 2010 at 4:00 PM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-05-14 11:35, manjula wijewickrema wrote: Hi, Is it possible to put the indexed terms into an array in Lucene? For example, imagine I have indexed a single document in Lucene and now I want to access those terms in the index. Is it possible to retrieve (call) those terms as array elements? If it is possible, then how?

In short: unless you created a TermFrequencyVector when adding the document, the answer is "with great difficulty". For working code that does this, see here: http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/DocReconstructor.java If you really need this kind of access in your application, then add your documents with term vectors with offsets and positions. Even then, depending on the Analyzer you used, the process is lossy - some input data that was discarded by the Analyzer is simply no longer available.

-- Best regards, Andrzej Bialecki, http://www.sigram.com, Contact: info at sigram dot com
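When term vectors were stored at indexing time (Andrzej's recommendation above), the arrays Manjula is after come straight off the TermFreqVector. A sketch against the Lucene 2.9 API; the directory variable, document id 0, field name "contents", and the probe term "planet" are assumptions for illustration:

```java
// Sketch, Lucene 2.9.x: assumes doc 0 was indexed with Field.TermVector.YES
// (or WITH_POSITIONS_OFFSETS) on a field named "contents".
IndexReader reader = IndexReader.open(directory);
TermFreqVector vector = reader.getTermFreqVector(0, "contents");
if (vector != null) {                       // null if no term vector was stored
    String[] terms = vector.getTerms();     // distinct terms, sorted
    int[] freqs = vector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i] + " occurs " + freqs[i] + " time(s)");
    }
    // Frequency of a specific term looked up by its string value:
    int idx = vector.indexOf("planet");     // -1 if the term is absent
    if (idx >= 0) {
        System.out.println("planet: " + freqs[idx]);
    }
}
reader.close();
```

The indexOf(String) lookup also covers the case of asking for a term's frequency by its string value rather than by its array position.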
Re: Access indexed terms
Dear Andrzej, Thanks for your valuable help. I also noticed this HighFreqTerms approach in the Lucene email archive and tried to use it. To do that, I downloaded lucene-misc-2.9.1.jar and added the org.apache.lucene.misc package to my project. Now I think I have to call this HighFreqTerms class in my code, but I was unable to find any guidance on how to do it. Could you please be kind enough to tell me how I can use this class in my code? Thanks, Manjula

On Fri, May 14, 2010 at 6:16 PM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-05-14 14:24, manjula wijewickrema wrote: But what I really want is to get the term frequency (or even the number of times it appears in the document) of all indexed terms (or the high-frequency terms) without naming them in the code. Is there an alternative way to do that?

Yes, see the discussion here: https://issues.apache.org/jira/browse/LUCENE-2393

-- Best regards, Andrzej Bialecki, http://www.sigram.com, Contact: info at sigram dot com
How to call high frequency terms using the HighFreqTerms class
Hi, I am struggling with using the HighFreqTerms class to find high-frequency terms in my index. My target is to get the high-frequency terms in an indexed document (a single document). To do that I have added the org.apache.lucene.misc package to my project. I think up to that point I am correct, but after that I have no idea how to call this in my code. Although I have looked in the Lucene email archive, I was unable to find a hint about calling this class. Could anybody please give me a sample code for using this class (and the relevant methods) in a way that suits my purpose? I appreciate your kind help. Thanks, Manjula
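For a single document, the top terms can also be had without the misc HighFreqTerms class (which ranks terms across the whole index by document frequency): sort the document's own term vector by frequency. A sketch against the Lucene 2.9 API; the directory variable, document id 0, and field name "contents" are assumptions:

```java
// Sketch, Lucene 2.9.x: the two most frequent terms of one document,
// taken from its stored term vector.
IndexReader reader = IndexReader.open(directory);
TermFreqVector vector = reader.getTermFreqVector(0, "contents");
final String[] terms = vector.getTerms();
final int[] freqs = vector.getTermFrequencies();

// Sort indices of the parallel arrays by descending frequency.
Integer[] order = new Integer[terms.length];
for (int i = 0; i < order.length; i++) order[i] = i;
java.util.Arrays.sort(order, new java.util.Comparator<Integer>() {
    public int compare(Integer a, Integer b) { return freqs[b] - freqs[a]; }
});

// Print the first and second highest-occurring terms.
for (int k = 0; k < Math.min(2, order.length); k++) {
    System.out.println(terms[order[k]] + " -> " + freqs[order[k]]);
}
reader.close();
```

Sorting an index array rather than the term array itself keeps the terms/freqs pairing intact.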
Re: Class_for_HighFrequencyTerms
thanks

On Tue, May 11, 2010 at 3:31 PM, adam.salt...@gmail.com wrote:

Sounds like your path is messed up and you're not using Maven correctly. Start with the jar version that contains the class you require and use a Maven POM to correctly resolve dependencies. Adam. Sent using BlackBerry® from Orange

-Original Message- From: manjula wijewickrema manjul...@gmail.com Date: Tue, 11 May 2010 15:13:12 To: java-user@lucene.apache.org Subject: Re: Class_for_HighFrequencyTerms

Dear Erick, I looked for it and even added IndexReader.java and TermFreqVector.java from http://www.jarvana.com/jarvana/search?search_type=classjava_class=org.apache.lucene.index.IndexReader . But after adding them, the system indicated a lot of errors in the source code IndexReader.java (e.g. DirectoryOwningReader cannot be resolved to a type, indexCommit cannot be resolved to a type, SegmentInfos cannot be resolved, TermEnum cannot be resolved to a type, etc.). I am using Lucene 2.9.1, and this particular website lists this source code under version 2.9.1 of Lucene. What is the reason for this kind of scenario? Do I have to add another JAR file? (To solve this I even added lucene-core-2.9.1-sources.jar, but nothing happened.) Please be kind enough to reply. Thanks, Manjula

On Tue, May 11, 2010 at 1:26 AM, Erick Erickson erickerick...@gmail.com wrote:

Have you looked at TermFreqVector? Best, Erick

On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, If I index a document (single document) in Lucene, how can I get the term frequencies (even the first and second highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
Error of the code
Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). To do that I have used the following code, but there is a compile-time error in it that I can't figure out. Could somebody guide me on what's wrong with it?

Compile-time error I got: "Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader."

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import java.io.IOException;

public class DemoTest {
  public static void main(String[] args) {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    try {
      Directory directory = new RAMDirectory();
      IndexWriter iwriter = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));
      Document doc = new Document();
      String text = "This is the text to be indexed.";
      doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
      iwriter.addDocument(doc);
      TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");
      int size = vector.size();
      for (String term : vector.getTerms())
        System.out.println("size = " + size);
      iwriter.close();
      IndexSearcher isearcher = new IndexSearcher(directory, true);
      QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
      Query query = parser.parse(text);
      ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
      System.out.println("hits.length(1) = " + hits.length);
      // Iterate through the results:
      for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        System.out.println("hitDoc.get(\"fieldname\") (This is the text to be indexed) = " + hitDoc.get("fieldname"));
      }
      isearcher.close();
      directory.close();
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Thanks in advance, Manjula
Re: Error of the code
Dear Ian, Thanks a lot for your immediate reply. As you mentioned, I replaced the lines as follows:

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

Now the error has vanished, thanks. But I still can't see the results, although I have moved those lines after iwriter.close(). What is the reason for this? Sample code after the modifications:

...
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
iwriter.addDocument(doc);
iwriter.close();
IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
int size = vector.size();
for (String term : vector.getTerms())
    System.out.println("size = " + size);
IndexSearcher isearcher = new IndexSearcher(directory, true);
...

I appreciate your kind cooperation. Manjula

On Thu, May 13, 2010 at 3:45 PM, Ian Lea ian@gmail.com wrote:

You need to replace this:

TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");

with

IndexReader ir = whatever(...);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

And you'll need to move it to after the writer.close() call if you want it to see the doc you've just added. -- Ian.

On Thu, May 13, 2010 at 11:07 AM, manjula wijewickrema manjul...@gmail.com wrote:

Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). To do that I used the code shown in my original message, but there is a compile-time error in it that I can't figure out. Compile-time error I got: "Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader."
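Once the reader is opened correctly after the writer is closed, the per-document frequency of a single named term can also be read directly from the postings via TermDocs, with no scoring involved. A sketch against the Lucene 2.9 API; the directory variable and the field/term values are assumptions:

```java
// Sketch, Lucene 2.9.x: raw in-document counts of one term,
// for every document that contains it.
IndexReader ir = IndexReader.open(directory);
TermDocs td = ir.termDocs(new Term("fieldname", "text"));
while (td.next()) {
    // td.freq() is the number of occurrences of the term in doc td.doc().
    System.out.println("doc " + td.doc() + ": freq " + td.freq());
}
td.close();
ir.close();
```

Unlike the term-vector route, this works even when no term vectors were stored, since it reads the inverted index itself.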
Re: Class_for_HighFrequencyTerms
Dear Erick, I looked for it and even added IndexReader.java and TermFreqVector.java from http://www.jarvana.com/jarvana/search?search_type=classjava_class=org.apache.lucene.index.IndexReader . But after adding them, the system indicated a lot of errors in the source code IndexReader.java (e.g. DirectoryOwningReader cannot be resolved to a type, indexCommit cannot be resolved to a type, SegmentInfos cannot be resolved, TermEnum cannot be resolved to a type, etc.). I am using Lucene 2.9.1, and this particular website lists this source code under version 2.9.1 of Lucene. What is the reason for this kind of scenario? Do I have to add another JAR file? (To solve this I even added lucene-core-2.9.1-sources.jar, but nothing happened.) Please be kind enough to reply. Thanks, Manjula

On Tue, May 11, 2010 at 1:26 AM, Erick Erickson erickerick...@gmail.com wrote:

Have you looked at TermFreqVector? Best, Erick

On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, If I index a document (single document) in Lucene, how can I get the term frequencies (even the first and second highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
Re: Trace only exactly matching terms!
Hi Anshum, Erick, As you mentioned, I used SnowballAnalyzer for stemming purposes. It worked nicely. Thanks a lot for your guidance. Manjula.

On Fri, May 7, 2010 at 8:27 PM, Erick Erickson erickerick...@gmail.com wrote:

The other approach is to use a stemmer both at index and query time. BTW, it's very easy to make a custom analyzer by chaining together the Tokenizer and as many filters (e.g. PorterStemFilter) as needed, essentially composing your analyzer from various pre-built Lucene parts. HTH, Erick

On Fri, May 7, 2010 at 9:07 AM, Anshum ansh...@gmail.com wrote:

Hi Manjula, Yes, Lucene by default would only tackle exact term matches unless you use a custom analyzer to expand the index/query. -- Anshum Gupta, http://ai-cafe.blogspot.com "The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw."

On Fri, May 7, 2010 at 2:22 PM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, I am using Lucene 2.9.1. I downloaded and ran the 'HelloLucene.java' class, modifying the input document and user query in various ways. Once I set the document sentence to 'Lucene in actions' instead of 'Lucene in action' and gave the query 'action', the program did not show 'Lucene in actions' as a hit! What is the reason for this? Why doesn't it treat the word 'actions' as a hit? Does Lucene identify only exactly matching words? Thanks, Manjula
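The stemming setup that resolved this thread can be sketched as follows: the same analyzer must be used at both index and query time so that "actions" and "action" reduce to the same term. This follows the constructor style used elsewhere in this archive (Lucene 2.9; the directory variable and field name "contents" are assumptions):

```java
// Sketch: one analyzer shared by indexing and querying, so the stemmer
// maps "actions" and "action" to the same indexed term.
SnowballAnalyzer analyzer =
    new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);

IndexWriter writer = new IndexWriter(directory, analyzer, true,
    IndexWriter.MaxFieldLength.UNLIMITED);
// ... add documents, then writer.close() ...

QueryParser parser =
    new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
Query query = parser.parse("action");  // also matches docs that said "actions"
```

If the analyzers differ between the two sides, the stems in the index and in the query no longer line up, which reproduces the "no hit" symptom described above.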
Class_for_HighFrequencyTerms
Hi, If I index a document (single document) in Lucene, how can I get the term frequencies (even the first and second highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
Trace only exactly matching terms!
Hi, I am using Lucene 2.9.1. I downloaded and ran the 'HelloLucene.java' class, modifying the input document and user query in various ways. Once I set the document sentence to 'Lucene in actions' instead of 'Lucene in action' and gave the query 'action', the program did not show 'Lucene in actions' as a hit! What is the reason for this? Why doesn't it treat the word 'actions' as a hit? Does Lucene identify only exactly matching words? Thanks, Manjula
Term/Phrase frequencies
Hi, I am new to Lucene. If I want to know the term or phrase frequency in an input document, is that possible through Lucene? Thanks, Manjula
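Single-term frequencies can be read from the term vector or the postings as in the earlier threads. For a phrase, one simple check is to run a PhraseQuery and ask the searcher to explain the score, since the explanation surfaces the phrase-frequency component. A sketch against the Lucene 2.9 API; the directory variable, field name "contents", and the phrase words are assumptions:

```java
// Sketch, Lucene 2.9.x: match a two-word phrase and inspect how often
// it contributed to each hit's score.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("contents", "term"));
pq.add(new Term("contents", "frequency"));

IndexSearcher searcher = new IndexSearcher(directory, true);
ScoreDoc[] hits = searcher.search(pq, null, 10).scoreDocs;
for (ScoreDoc hit : hits) {
    // The explanation tree includes the phrase-frequency factor.
    System.out.println(searcher.explain(pq, hit.doc));
}
searcher.close();
```

Note that phrase matching requires position information in the index, which the standard analyzed fields record by default.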