TermFrequency for a String
IndexReader.getTermFreqVectors(2)[0].getTermFrequencies()[5]; In the above example, Lucene gives me the term frequency of the 5th term (say "planet") in the term frequency vector of corpus document 2. But I need to get the term frequency for a specified term using its string value, e.g. the term frequency of the term "planet" (i.e. specified by its string value "planet", not by its position 5). Is there any way to do this? I highly appreciate your kind reply!
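For what it's worth, in Lucene 2.9 the TermFreqVector interface itself offers indexOf(String term), so getTermFrequencies()[tfv.indexOf("planet")] should do this when the term is present. The lookup can also be sketched in plain Java over the parallel arrays that getTerms() (sorted lexicographically) and getTermFrequencies() return; the class name and sample values below are illustrative, not from the original post:

```java
import java.util.Arrays;

// Plain-Java sketch of looking up a term's frequency by its string value,
// mirroring the parallel arrays returned by TermFreqVector.getTerms()
// (sorted lexicographically) and getTermFrequencies().
public class TermFreqLookup {

    // Returns the frequency of `term`, or 0 if the term is absent.
    public static int freqOf(String[] terms, int[] freqs, String term) {
        int pos = Arrays.binarySearch(terms, term); // terms are sorted
        return pos >= 0 ? freqs[pos] : 0;
    }

    public static void main(String[] args) {
        String[] terms = {"earth", "planet", "star"};
        int[] freqs = {3, 7, 2};
        System.out.println(freqOf(terms, freqs, "planet")); // 7
    }
}
```

Binary search works here only because getTerms() returns the terms sorted; for an unsorted copy a linear scan would be needed.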
Total of term frequencies
Hi, Is there any way to get the total count of terms in the term frequency vector (tfv)? I need to calculate the normalized term frequency of each term in my tfv. I know how to obtain the length of the tfv, but that doesn't work since I need to count duplicate occurrences as well. Highly appreciate your kind response.
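A plain-Java sketch of the calculation being asked about: sum the frequency array to get the field's total token count (duplicates included, unlike getTerms().length), then divide. The class name and sample values are illustrative:

```java
public class NormalizedTf {
    // Total token count of the field = sum over the tf vector's frequencies
    // (counts duplicate occurrences, unlike getTerms().length).
    public static int totalCount(int[] freqs) {
        int total = 0;
        for (int f : freqs) total += f;
        return total;
    }

    // Normalized term frequency of one term in the field.
    public static double normalizedTf(int freq, int[] freqs) {
        return (double) freq / totalCount(freqs);
    }

    public static void main(String[] args) {
        int[] freqs = {3, 7, 2};                    // from getTermFrequencies()
        System.out.println(totalCount(freqs));      // 12
        System.out.println(normalizedTf(7, freqs)); // ≈ 0.583
    }
}
```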
Only term frequencies
Hi, I have a document collection with hundreds of documents. I need to know the term frequency of a given query term in each document. I know that 'hit.score' gives me the Lucene score for each document (and it includes term frequency as well). But I need only the term frequencies in each document. How can I do this? I highly appreciate your kind response.
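One way to picture the data being asked for: per-document (docId, frequency) pairs, which in Lucene 2.x would come from IndexReader.termDocs(new Term(field, term)) via its doc() and freq() methods, with no scoring involved. The plain-Java mock below builds the same mapping from raw text just to make the shape concrete; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of per-document term frequencies: the same
// (docId, freq) pairs that Lucene 2.x's IndexReader.termDocs(term)
// lets you iterate with doc() and freq().
public class PerDocTermFreq {

    // term -> (docId -> frequency)
    public static Map<String, Map<Integer, Integer>> index(String[] docs) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(token, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum); // count one occurrence
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] docs = {"blue house blue sky", "green house"};
        System.out.println(index(docs).get("blue")); // {0=2}
    }
}
```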
Re: hit.score
Thanks Adrien. On Mon, Mar 27, 2017 at 6:56 PM, Adrien Grand <jpou...@gmail.com> wrote:

> You can use IndexSearcher.explain to see how the score was computed.
>
> Le lun. 27 mars 2017 à 14:46, Manjula Wijewickrema <manjul...@gmail.com> a écrit :
>
>> Hi, Can someone help me to understand the value given by 'hit.score' in Lucene. I indexed a single document with five different words with different frequencies and tried to understand this value. However, it doesn't seem to be normalized term frequency or tf-idf. I am using Lucene 2.9.1. Any help would be highly appreciated.
hit.score
Hi, Can someone help me to understand the value given by 'hit.score' in Lucene. I indexed a single document with five different words with different frequencies and tried to understand this value. However, it doesn't seem to be normalized term frequency or tf-idf. I am using Lucene 2.9.1. Any help would be highly appreciated.
Why hit is 0 for bigrams?
Hi, I tried to index bigrams from a document and the system gave me the following output with the frequencies of the bigrams (output 1):

array size:15 array terms are:{contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1}

For this I used the following code in the createIndex() method:

ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
sw.setOutputUnigrams(false);

Then I tried to search the indexed bigrams of the same document using the following code in the searchIndex() method:

IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
Query query = queryParser.parse(terms[pos[freqs.length - q1]]);
System.out.println("Query: " + query);
Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.length());

For this, the system gave me the following output (output 2):

Query: contents:manjula contents:assist Number of hits: 0
Query: contents:sabaragamuwa contents:univers Number of hits: 0
Query: contents:univers contents:main Number of hits: 0
Query: contents:main contents:librari Number of hits: 0

If someone can, please explain to me: (1) why 'contents: /1' is included in the array as an array element (output 1); (2) why the system returns the query as 'contents:manjula contents:assist' instead of 'manjula assist' (output 2); (3) why the number of hits is given as 0 instead of their frequencies (output 2). I highly appreciate your kind reply. Manjula.
bigram problem
Hi, Could you please explain to me how to determine the tf-idf score for bigrams. My program is able to index and search bigrams correctly, but it does not calculate the tf-idf for bigrams. If someone can, please help me to resolve this. Regards, Manjula.
Re: bigram problem
Dear Parnab, Thanks a lot for your guidance. I prefer to follow the second method, as I have already indexed the bigrams using ShingleAnalyzerWrapper. But I have no idea about how to use NGramTokenizer here. So, could you please write one or two lines of code which show how to use NGramTokenizer for bigrams. Thanks, Manjula. On Wed, Jul 2, 2014 at 7:05 PM, parnab kumar parnab.2...@gmail.com wrote: TF is straightforward; you can simply count the number of occurrences in the doc by simple string matching. For IDF you need to know the total number of docs in the collection and the number of docs containing the bigram. reader.maxDoc() will give you the total number of docs in the collection. To calculate the number of docs containing the bigram, use a phrase query with the slop factor set to 0. The number of docs returned by the IndexSearcher with the phrase query will be the number of docs containing the bigram. I hope this is fine. Alternatively, use NGramTokenizer (where n=2 in your case) while indexing. In that case, each bigram can be interpreted as a normal Lucene term. Thanks, Parnab On Wed, Jul 2, 2014 at 8:45 AM, Manjula Wijewickrema manjul...@gmail.com wrote: Hi, Could you please explain to me how to determine the tf-idf score for bigrams. My program is able to index and search bigrams correctly, but it does not calculate the tf-idf for bigrams. If someone can, please help me to resolve this. Regards, Manjula.
Why bigram tf-idf is 0?
Hi, In my programme, I tried to select the most relevant document based on bigrams. The system gives me the following output:

{contents: /1, assist librarian/1, assist manjula/2, assist sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers sabaragamuwa/1}

The frequencies of the bigrams are also correctly identified by the system. But the tf-idf scores of these bigrams are given as 0. However, the same programme gives the correct tf-idf values for unigrams. Following is the code snippet that I wrote to determine the tf-idf of bigrams:

for (int q1 = 1; q1 < NB + 1; q1++) { // NB = number of bigrams
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(terms[pos[freqs.length - q1]]);
    Hits hits = indexSearcher.search(query);
    Iterator<Hit> it = hits.iterator();
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
        Document doc = indexSearcher.doc(hit.doc);
        tfidf[q1 - 1] = hit.score;
    }
}

Here, hit.score should give the tf-idf value of each bigram. Why is it given as 0? If someone can, please explain to me how to resolve this problem. Thanks, Manjula.
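As a sanity check, the per-term tf-idf can be computed by hand with the formulas Lucene's DefaultSimilarity uses (tf = sqrt(freq); idf = 1 + ln(numDocs/(docFreq+1)), with numDocs from reader.maxDoc() and docFreq from a phrase-query search count, per Parnab's advice in the thread above). If the hand computation is nonzero but hit.score is 0, the problem is in how the query reaches the index, not in the scoring itself. A small illustrative sketch (class name mine):

```java
public class BigramTfIdf {
    // tf and idf as in Lucene's DefaultSimilarity.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static double tfIdf(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // e.g. "manjula assist" occurs 4 times in the doc, in 1 of 1 docs:
        System.out.println(tfIdf(4, 1, 1)); // ≈ 0.6137
    }
}
```

Note that idf(1, 1) = 1 + ln(1/2) ≈ 0.3069, which is exactly the idf factor that shows up in the explain() output quoted elsewhere in this archive.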
Re: ShingleAnalyzerWrapper question
Dear Steve, It works. Thanks. On Wed, Jun 11, 2014 at 6:18 PM, Steve Rowe sar...@gmail.com wrote: You should give sw rather than analyzer in the IndexWriter constructor. Steve www.lucidworks.com On Jun 11, 2014 2:24 AM, Manjula Wijewickrema manjul...@gmail.com wrote: Hi, In my programme, I can index and search a document based on unigrams. I modified the code as follows to obtain the results based on bigrams. However, it did not give me the desired output.

public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually", "after", "allow", "almost", "already", "also", "although", "always", "am", "an", "and", "any", "anybody"}; // only a portion
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);
    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }
    w.optimize();
    w.close();
}

Still the output is: {contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3, name/1, sabaragamuwa/1, univers/1}

If anybody can, please help me to obtain the correct output. Thanks, Manjula.
ShingleAnalyzerWrapper question
Hi, In my programme, I can index and search a document based on unigrams. I modified the code as follows to obtain the results based on bigrams. However, it did not give me the desired output.

public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually", "after", "allow", "almost", "already", "also", "although", "always", "am", "an", "and", "any", "anybody"}; // only a portion
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);
    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }
    w.optimize();
    w.close();
}

Still the output is: {contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3, name/1, sabaragamuwa/1, univers/1}

If anybody can, please help me to obtain the correct output. Thanks, Manjula.
Re: Is it wrong to create index writer on each query request.
Hi, What are the other disadvantages (other than the time factor) of creating the index for every request? Manjula. On Thu, Jun 5, 2014 at 2:34 PM, Aditya findbestopensou...@gmail.com wrote: Hi Rajendra, You should NOT create an index writer for every request. Is it time consuming to update the index writer when a new document comes? No. Regards, Aditya www.findbestopensource.com On Thu, Jun 5, 2014 at 12:24 PM, Rajendra Rao rajendra@launchship.com wrote: I have a system in which documents and queries come frequently. I am creating an index writer in memory every time for each query request. I want to know: is it good to separate index writing and loading from query requests? Is it good to save the index writer on hard disk? Is it time consuming to update the index writer when a new document comes?
Re: Phrase indexing and searching
Hi Steve, Thanks for the reply. Could you please simply let me know how to embed ShingleFilter in the code for both indexing and searching? Because different people suggest different snippets of code, and they did not do the job. Thanks, Manjula. On Mon, Dec 23, 2013 at 8:42 PM, Steve Rowe sar...@gmail.com wrote: Hi Manjula, Sounds like ShingleFilter will do what you want: http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html Steve www.lucidworks.com On Dec 22, 2013 11:25 PM, Manjula Wijewickrema manjul...@gmail.com wrote: Dear All, My Lucene programme is able to index single words and search the most matching documents (based on term frequencies) from a corpus for the input document. Now I want to index two-word phrases and search the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful; split it into phrases (say two-term phrases) like: blue house / house very / very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Phrase indexing and searching
Dear All, My Lucene programme is able to index single words and search the most matching documents (based on term frequencies) from a corpus for the input document. Now I want to index two-word phrases and search the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful; split it into phrases (say two-term phrases) like: blue house / house very / very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Phrase indexing and searching
Dear list, My Lucene programme is able to index single words and search the most matching documents (based on term frequencies) from a corpus for the input document. Now I want to index two-word phrases and search the matching corpus documents (based on phrase frequencies) for the input document. ex:- input document: blue house is very beautiful; split it into phrases (say two-term phrases) like: blue house / house very / very beautiful etc. Is it possible to do this with Lucene? If so, how can I do it? Thanks, Manjula.
Re: Editing StopWordList
Hi Gupta, Thanx a lot for your reply. But I could not understand whether I could modify (by adding more words) the default stop word list, or whether I have to make a new list as an array as follows:

public String[] NEW_STOP_WORDS = {"a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or", "s", "such", "t", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with", "inc", "incorporated", "co.", "ltd", "ltd.", "we", "you", "your", "us", etc. };

and then call it as follows:

SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.NEW_STOP_WORDS);

Am I correct? If not, could you explain to me how I can do this? Thanx in advance. Manjula. On Tue, Dec 21, 2010 at 10:36 AM, Anshum ansh...@gmail.com wrote: Hi Manjula, You could initialize the Analyzer using a modified stop word set. Use StopAnalyzer.ENGLISH_STOP_WORDS_SET to get the default stop set and then add your own words to it. You could then initialize the analyzer using this new stop set instead of the default stop set. Hope that helps. -- Anshum Gupta http://ai-cafe.blogspot.com On Tue, Dec 21, 2010 at 9:20 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, 1) In my application, I need to add more words to the stop word list. Therefore, is it possible to add more words to the default Lucene stop word list? 2) If it is possible, how can I do this? Appreciate any comment from you. Thanks, Manjula.
Editing StopWordList
Hi, 1) In my application, I need to add more words to the stop word list. Therefore, is it possible to add more words to the default Lucene stop word list? 2) If it is possible, how can I do this? Appreciate any comment from you. Thanks, Manjula.
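Anshum's suggestion in the reply above amounts to a set union: copy the default stop set, add your own words, and hand the combined set to the analyzer's constructor. Sketched in plain Java (the stand-in default list below replaces StopAnalyzer.ENGLISH_STOP_WORDS_SET so the example stays self-contained):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of building an extended stop set: copy a default list, add your
// own words, then pass the combined set to the analyzer's constructor.
public class StopWords {

    public static Set<String> extendedStopSet(Set<String> defaults, String... extra) {
        Set<String> combined = new HashSet<>(defaults); // don't mutate the default set
        combined.addAll(Arrays.asList(extra));
        return combined;
    }

    public static void main(String[] args) {
        // Stand-in for StopAnalyzer.ENGLISH_STOP_WORDS_SET:
        Set<String> defaults = new HashSet<>(Arrays.asList("a", "and", "the"));
        Set<String> stops = extendedStopSet(defaults, "inc", "ltd");
        System.out.println(stops.size()); // 5
    }
}
```

Copying first matters because the library's default set may be unmodifiable.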
Re: Analyzer
Dear Erick, Thanx for your information. Manjula. On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson erickerick...@gmail.com wrote: WhitespaceAnalyzer does just that, splits the incoming stream on white space. From the javadocs for StandardAnalyzer: A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents: - Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. - Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. - Recognizes email addresses and internet hostnames as one token. Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. Best, Erick On Tue, Nov 30, 2010 at 12:06 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi Steve, Thanx a lot for your reply. Yes, there are only two classes, and it's correct the way you have understood the problem. As you instructed, I tried WhitespaceAnalyzer for querying (instead of StandardAnalyzer) and it seems to me that it gives better results than StandardAnalyzer. So could you please let me know the differences between StandardAnalyzer and WhitespaceAnalyzer. I highly appreciate your response. Thanx. Manjula. On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe sar...@syr.edu wrote: Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail.
I suspect that you could benefit from reading the book Lucene in Action, 2nd edition: http://www.manning.com/hatcher3/ You would also likely benefit from using Luke, the Lucene index browser, to better understand your indexes' contents and debug how queries match documents: http://code.google.com/p/luke/ I think your question is whether you're using Analyzers correctly. It sounds like you are creating two separate indexes (one for each of your classes), and you're using SnowballAnalyzer on the indexing side for both indexes, and StandardAnalyzer on the query side. The usual advice is to use the same Analyzer on both the query and the index side. But it appears to be the case that you are taking stemmed index terms from your index #1 and then querying index #2 using these stemmed terms. If this is true, then you want the query-time analyzer in your second index not to change the query terms. You'll likely get better results using WhitespaceAnalyzer, which tokenizes on whitespace and does no further analysis, rather than StandardAnalyzer. Steve -Original Message- From: manjula wijewickrema [mailto:manjul...@gmail.com] Sent: Monday, November 29, 2010 4:32 AM To: java-user@lucene.apache.org Subject: Analyzer Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to search for the most relevant document to the document indexed in the first one. In the first Java class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second Java class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term of the first class as the search term parameter to the searchIndex method of the second class).
Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and then to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct or not? I highly appreciate any comment. Thanx. Manjula.
Analyzer
Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to search for the most relevant document to the document indexed in the first one. In the first Java class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second Java class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term of the first class as the search term parameter to the searchIndex method of the second class). Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and then to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct or not? I highly appreciate any comment. Thanx. Manjula.
Re: Analyzer
Hi Steve, Thanx a lot for your reply. Yes, there are only two classes, and it's correct the way you have understood the problem. As you instructed, I tried WhitespaceAnalyzer for querying (instead of StandardAnalyzer) and it seems to me that it gives better results than StandardAnalyzer. So could you please let me know the differences between StandardAnalyzer and WhitespaceAnalyzer. I highly appreciate your response. Thanx. Manjula. On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe sar...@syr.edu wrote: Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail. I suspect that you could benefit from reading the book Lucene in Action, 2nd edition: http://www.manning.com/hatcher3/ You would also likely benefit from using Luke, the Lucene index browser, to better understand your indexes' contents and debug how queries match documents: http://code.google.com/p/luke/ I think your question is whether you're using Analyzers correctly. It sounds like you are creating two separate indexes (one for each of your classes), and you're using SnowballAnalyzer on the indexing side for both indexes, and StandardAnalyzer on the query side. The usual advice is to use the same Analyzer on both the query and the index side. But it appears to be the case that you are taking stemmed index terms from your index #1 and then querying index #2 using these stemmed terms. If this is true, then you want the query-time analyzer in your second index not to change the query terms. You'll likely get better results using WhitespaceAnalyzer, which tokenizes on whitespace and does no further analysis, rather than StandardAnalyzer.
Steve -Original Message- From: manjula wijewickrema [mailto:manjul...@gmail.com] Sent: Monday, November 29, 2010 4:32 AM To: java-user@lucene.apache.org Subject: Analyzer Hi, In my work, I am using Lucene and two Java classes. In the first one, I index a document, and in the second one, I try to search for the most relevant document to the document indexed in the first one. In the first Java class, I use SnowballAnalyzer in the createIndex method and StandardAnalyzer in the searchIndex method, and pass the highest-frequency terms into the second Java class. In the second class, I use SnowballAnalyzer in the createIndex method (this index is for the collection of documents to be searched; it is my database) and StandardAnalyzer in the searchIndex method (I pass the most frequently occurring term of the first class as the search term parameter to the searchIndex method of the second class). Using Analyzers in this manner, what I want to do is stemming and stop-word removal in both indexes (in both classes), and then to search for those few high-frequency words (of the first index) in the second index. So, if my intention is clear to you, could you please let me know whether the way I have used Analyzers is correct or not? I highly appreciate any comment. Thanx. Manjula.
Re: Databases
Hi, Thanks a lot for your information. Regards, Manjula. On Fri, Jul 23, 2010 at 12:48 PM, tarun sapra t.sapr...@gmail.com wrote: You can use Hibernate Search to maintain the synchronization between the Lucene index and a MySQL RDBMS. On Fri, Jul 23, 2010 at 11:16 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, Normally, when I am building my index directory for indexed documents, I simply keep the files to index in a directory called 'filesToIndex'. So in this case, I do not use any standard database management system such as MySQL or any other. 1) Would it be possible to use MySQL or another DBMS for the purpose of managing indexed documents in Lucene? 2) Is it necessary to follow that kind of methodology with Lucene? 3) If we do not use such a database management system, will there be any disadvantages with a large number of indexed files? Appreciate any reply from you. Thanks, Manjula. -- Thanks & Regards, Tarun Sapra
Databases
Hi, Normally, when I am building my index directory for indexed documents, I simply keep the files to index in a directory called 'filesToIndex'. So in this case, I do not use any standard database management system such as MySQL or any other. 1) Would it be possible to use MySQL or another DBMS for the purpose of managing indexed documents in Lucene? 2) Is it necessary to follow that kind of methodology with Lucene? 3) If we do not use such a database management system, will there be any disadvantages with a large number of indexed files? Appreciate any reply from you. Thanks, Manjula.
Re: scoring and index size
Hi Koji, Thanks for your information. Manjula On Fri, Jul 9, 2010 at 5:04 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (10/07/09 19:30), manjula wijewickrema wrote: Uwe, thanx for your comments. Following is the code I used in this case. Could you please let me know where I have to insert the UNLIMITED field length, and how? Thanx again! Manjula Manjula, You can set the UNLIMITED field length in the IW constructor: http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#IndexWriter%28org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexWriter.MaxFieldLength%29 Koji -- http://www.rondhuit.com/en/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MaxFieldLength
Ok Erick, the answer is there. If no document exceeds the default MaxFieldLength, then no document will be truncated, even if we increase the number of documents in the index. Am I correct? Thanx for your commitment. Manjula. On Tue, Jul 13, 2010 at 3:57 AM, Erick Erickson erickerick...@gmail.com wrote: I'm not sure I understand your question. The number of documents has no bearing on the field length of each, which is what the max field length is all about. You can change the value here by calling IndexWriter.setMaxFieldLength to something shorter than the default. So no, if no document exceeds the default (terms, not characters), no document will be truncated. The 10,000 limit also has no bearing on how much space indexing a document takes as long as there are fewer than 10,000 terms. That is, a document with 5,000 terms will take up just as much space with any MaxFieldLength >= 5,000. HTH, Erick On Mon, Jul 12, 2010 at 4:00 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, I have seen that, once the field length of a document goes over a certain limit ( http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH gives it as 10,000 terms by default), Lucene truncates those documents. Is there any possibility of documents being truncated if we increase the number of indexed documents (assume there are no individual documents which exceed the default MaxFieldLength of Lucene)? Thanx, Manjula.
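Erick's point can be made concrete: truncation is per field, cutting the token stream after maxFieldLength terms, so occurrences past the cut-off are simply never indexed, no matter how many documents the index holds. A small plain-Java illustration (names and numbers are mine, not from the thread):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration of DEFAULT_MAX_FIELD_LENGTH truncation: only the first
// maxFieldLength tokens of a field are indexed, so occurrences of a term
// past the cut-off silently vanish from its indexed frequency.
public class FieldTruncation {

    public static long countIndexed(List<String> tokens, String term, int maxFieldLength) {
        return tokens.stream()
                     .limit(maxFieldLength) // the writer stops indexing here
                     .filter(term::equals)
                     .count();
    }

    public static void main(String[] args) {
        // 12 tokens, with "metaphysics" at positions 2 and 11 (0-based):
        List<String> tokens = new ArrayList<>(Collections.nCopies(12, "filler"));
        tokens.set(2, "metaphysics");
        tokens.set(11, "metaphysics");
        System.out.println(countIndexed(tokens, "metaphysics", 10)); // 1 (2nd is cut off)
        System.out.println(countIndexed(tokens, "metaphysics", 12)); // 2
    }
}
```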
Re: Why not normalization?
Hi Rebecca, Thanks for your valuable comments. Yes, I observed that, once the number of terms in the document goes up, the fieldNorm value goes down correspondingly. I think, therefore, there won't be any fault due to the variation of the total number of terms in the document. Am I right? Manjula. On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson bec.wat...@gmail.com wrote: hi, 1) Although Lucene uses tf to calculate scoring, it seems to me that term frequency has not been normalized. Even if I index several documents, it does not normalize the tf value. Therefore, since the total number of words in indexed documents varies, can't there be a fault in Lucene's scoring? tf = term frequency, i.e. the number of times the term appears in the document, while idf = inverse document frequency - a measure of how rare a term is, i.e. related to how many documents the term appears in. If term1 occurs more frequently in a document, i.e. tf is higher, you want to weight the document higher when you search for term1. But if term1 is a very frequent term, i.e. in lots of documents, then it's probably not as important to an overall search (where we have term1, term2 etc.), so you want to downweight it (that's where idf comes in). Then the normalisations like length normalisation (allowing 'fair' scoring across varied field lengths) come in too. The tf-idf scoring formula used by Lucene is a scoring method that's been around a long, long time... there are competing scoring metrics, but that's an IR thing and not an argument you want to start on the Lucene lists! :) These are IR ('information retrieval') concepts, and you might want to start by going through tf-idf scoring / some explanations of this kind of scoring: http://en.wikipedia.org/wiki/Tf%E2%80%93idf http://wiki.apache.org/lucene-java/InformationRetrieval 2) What is the formula to calculate this fieldNorm value?
In terms of how Lucene implements its tf-idf scoring, you can see here: http://lucene.apache.org/java/3_0_2/scoring.html Also, the Lucene in Action book is a really good book if you are starting out with Lucene (and will save you a lot of grief with understanding Lucene / setting up your application!). It covers all the basics and then moves on to more advanced stuff, and has lots of code examples too: http://www.manning.com/hatcher2/ Hope that helps, bec :)
scoring and index size
Hi, I ran a small programme to see the way Lucene scores a single indexed document. The explain() method gave me the following results.

*** Searching for 'metaphysics' Number of hits: 1 0.030706111 0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of: 10.246951 = tf(termFreq(contents:metaphys)=105) 0.30685282 = idf(docFreq=1, maxDocs=1) 0.009765625 = fieldNorm(field=contents, doc=0) *

But I encountered the following problems: 1) In this case, I did not change anything about boost values. So shouldn't fieldNorm = 1/sqrt(terms in field)? (I noticed in the Lucene email archive that the default boost value is 1.) 2) But even if I manually calculate the value for fieldNorm (as 1/sqrt(terms in field)), it only approximately matches the value given by the system for fieldNorm. Can this be due to encode/decode precision loss of the norm? 3) My indexed document consisted of a total of 19078 words, including 125 occurrences of the word 'metaphysics' (i.e. my query; I input a single-term query). But as you can see in the above output, the system gives only 105 counts for the word 'metaphysics'. However, once I removed some part of my indexed document, counted the occurrences of 'metaphysics', and checked against the system results, I noticed that with the reduced text the system counts it correctly. Why this kind of behaviour? Is there any limitation on indexed documents? If somebody can, please help me to solve these problems. Thanks! Manjula.
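Regarding question 2, this is very likely norm encode/decode precision loss: Lucene stores each norm in a single byte with a 3-bit mantissa (see SmallFloat in the Lucene sources), so fieldNorm values are heavily quantized. The sketch below reproduces that encoding. Note that 1/sqrt(10000) = 0.01 decodes to exactly the 0.009765625 in the explain() output above, which also fits question 3: a 19078-word document truncated at the default 10,000-term MaxFieldLength would lose the later occurrences of 'metaphysics' (this reading of the numbers is my inference, not something the thread confirms):

```java
public class NormEncoding {
    // One-byte norm encoding with a 3-bit mantissa, as in Lucene's
    // SmallFloat.floatToByte315 / byte315ToFloat.
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float fieldNorm = (float) (1.0 / Math.sqrt(10000)); // 0.01
        float decoded = byte315ToFloat(floatToByte315(fieldNorm));
        System.out.println(decoded); // 0.009765625 — matches the explain() output
    }
}
```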
Re: scoring and index size
Uwe, thanx for your comments. Following is the code I used in this case. Could you please let me know where I have to insert the UNLIMITED field length, and how? Thanx again! Manjula

--code--

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";
    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        //searchIndex("rice AND milk");
        searchIndex("metaphysics");
        //searchIndex("banana");
        //searchIndex("foo");
    }

    public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();
            //contents#setOmitNorms(true);
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));
            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    public static void searchIndex(String searchString) throws IOException, ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());
        TopDocs results = indexSearcher.search(query, 10);
        ScoreDoc[] hits1 = results.scoreDocs;
        for (ScoreDoc hit : hits1) {
            Document doc = indexSearcher.doc(hit.doc);
            //System.out.printf("%5.3f %s%n", hit.score, doc.get(FIELD_CONTENTS));
            System.out.println(hit.score);
            //Searcher.explain("rice", 0);
            //System.out.println(indexSearcher.explain(query, 0));
        }
        System.out.println(indexSearcher.explain(query, 0));
        //System.out.println(indexSearcher.explain(query, 1));
        //System.out.println(indexSearcher.explain(query, 2));
        //System.out.println(indexSearcher.explain(query, 3));
        Iterator<Hit> it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }
}

On Fri, Jul 9, 2010 at 1:06 PM, Uwe Schindler u...@thetaphi.de wrote: Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the number of terms per document is limited. The calculation precision is limited by the float norm encoding, but also if your analyzer removed stop words, so the norm is not what you expect?
- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -----Original Message----- From: manjula wijewickrema [mailto:manjul...@gmail.com] Sent: Friday, July 09, 2010 9:21 AM To: java-user@lucene.apache.org Subject: scoring and index size [quoted original message snipped; it appears in full at the start of this thread]
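For what it's worth, the reported fieldNorm can be inverted to estimate how many terms Lucene actually counted for the field. With all boosts at their default of 1, fieldNorm = lengthNorm = 1/sqrt(terms in field), so terms ≈ 1/fieldNorm². For 0.009765625 that implies roughly 10,486 terms, which is plausibly consistent with the default MaxFieldLength.LIMITED cap of 10,000 terms per field (plus the coarse one-byte norm encoding), rather than the document's full 19,078 words. A pure-arithmetic sketch (the implied count is an inference, not stated in the thread):

```java
public class ImpliedTermCount {
    public static void main(String[] args) {
        // With default boosts, fieldNorm = lengthNorm = 1/sqrt(terms in field),
        // so the term count Lucene saw is approximately 1 / fieldNorm^2.
        double fieldNorm = 0.009765625;               // taken from the explain() output
        double impliedTerms = 1.0 / (fieldNorm * fieldNorm);
        System.out.println(impliedTerms);             // ~10485.76, i.e. ~10,486 terms
    }
}
```

Passing IndexWriter.MaxFieldLength.UNLIMITED to the IndexWriter constructor (as Uwe suggests) lifts the 10,000-term default so all 19,078 words get indexed.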
Re: Why not normalization?
Thanks On Fri, Jul 9, 2010 at 1:10 PM, Uwe Schindler u...@thetaphi.de wrote: Thanks for your valuable comments. Yes, I observed that once the number of terms in the document goes up, the fieldNorm value goes down correspondingly. I think, therefore, there won't be any fault due to the variation of the total number of terms in the document. Am I right? With the current scoring model advanced statistics are not available. There are currently some approaches to add BM25 support to Lucene, for which the index format needs to be enhanced to contain more statistics (number of terms per document, average number of terms per document, ...). On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson bec.wat...@gmail.com wrote: hi, 1) Although Lucene uses tf to calculate scoring, it seems to me that term frequency has not been normalized. Even if I index several documents, it does not normalize the tf value. Therefore, since the total number of words in the indexed documents varies, can't there be a fault in Lucene's scoring? tf = term frequency, i.e. the number of times the term appears in the document, while idf (inverse document frequency) is a measure of how rare a term is, i.e. related to how many documents the term appears in. If term1 occurs more frequently in a document, i.e. tf is higher, you want to weight the document higher when you search for term1. But if term1 is a very frequent term, i.e. in lots of documents, then it's probably not as important to an overall search (where we have term1, term2 etc.), so you want to downweight it (that's where idf comes in). Then the normalisations like length normalisation (allowing for 'fair' scoring across varied field lengths) come in too. The tf-idf scoring formula used by Lucene is a scoring method that's been around a long, long time... there are competing scoring metrics, but that's an IR thing and not an argument you want to start on the Lucene lists!
:) These are IR ('information retrieval') concepts and you might want to start by going through the tf-idf scoring / some explanations of this kind of scoring. http://en.wikipedia.org/wiki/Tf%E2%80%93idf http://wiki.apache.org/lucene-java/InformationRetrieval 2) What is the formula to calculate this fieldNorm value? In terms of how Lucene implements its tf-idf scoring, you can see here: http://lucene.apache.org/java/3_0_2/scoring.html Also, the Lucene in Action book is a really good book if you are starting out with Lucene (and will save you a lot of grief with understanding Lucene / setting up your application!). It covers all the basics and then moves on to more advanced stuff, and has lots of code examples too: http://www.manning.com/hatcher2/ Hope that helps, bec :) - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
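As a concrete check of the formulas on that scoring page, the explain() output earlier in this digest (termFreq=105, docFreq=1, maxDocs=1, fieldNorm=0.009765625) can be reproduced by hand from the default DefaultSimilarity pieces: tf = sqrt(freq), idf = ln(maxDocs/(docFreq+1)) + 1. A pure-arithmetic sketch, with the fieldNorm taken as given since it comes from the quantized norm byte:

```java
public class FieldWeightByHand {
    public static void main(String[] args) {
        int termFreq = 105, docFreq = 1, maxDocs = 1;
        double tf = Math.sqrt(termFreq);                               // 10.246951
        double idf = Math.log(maxDocs / (double) (docFreq + 1)) + 1.0; // 0.30685282
        double fieldNorm = 0.009765625;                                // decoded norm byte
        double fieldWeight = tf * idf * fieldNorm;                     // ~0.030706111
        System.out.println(tf + " * " + idf + " * " + fieldNorm + " = " + fieldWeight);
    }
}
```

All three factors match the explain() output to the printed precision, which is a handy way to convince yourself which formula each line corresponds to.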
Re: Lucene Scoring
Dear Ian, Thanks a lot for your reply. The way you proposed works correctly and solved half of my problem. Once I ran the program, the system gave me the following output: *** Searching for 'milk' Number of hits: 1 0.13287117 0.13287117 = (MATCH) fieldWeight(contents:milk in 0), product of: 1.7320508 = tf(termFreq(contents:milk)=3) 0.30685282 = idf(docFreq=1, maxDocs=1) 0.25 = fieldNorm(field=contents, doc=0) Hit: D:\JADE\work\MobilNet\Lucene291\filesToIndex\deron-foods.txt *** Here I have no problem calculating the values for tf and idf, but I have no idea how to calculate fieldNorm. According to http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int) I think norm(t,d) gives the value for fieldNorm, and in my case the system returns the value lengthNorm(field) for norm(t,d). 1) Am I correct? 2) If so, could you please let me know the way (formula) of calculating lengthNorm(field)? (I checked several documents and code samples to understand this, but was unable to find the mathematical formula behind this method.) 3) If lengthNorm(field) is not what is behind fieldNorm, then how do I calculate fieldNorm? Please help me to resolve this matter. Manjula. On Tue, Jul 6, 2010 at 12:47 PM, Ian Lea ian@gmail.com wrote: You are calling the explain method incorrectly. You need something like System.out.println(indexSearcher.explain(query, 0)); See the javadocs for details. -- Ian. On Tue, Jul 6, 2010 at 7:39 AM, manjula wijewickrema manjul...@gmail.com wrote: Dear Grant, Thanks a lot for your guidance. As you mentioned, I tried to use the explain() method to get explanations for the relevant scoring. But once I call the explain() method, the system indicates the following error: 'The method explain(Query,int) in the type Searcher is not applicable for the arguments (String, int)'.
In my code I call the explain() method as follows: Searcher.explain("rice", 0); Possibly something is wrong with my way of passing parameters. In my case, I have chosen "rice" as my query and indexed only one document. Could you please let me know what's wrong with this? I have also included the code. Thanks, Manjula [quoted LuceneDemo code snipped; the same listing appears in full in the original message of this thread]
Re: Lucene Scoring
Dear Grant, Thanks a lot for your guidance. As you mentioned, I tried to use the explain() method to get explanations for the relevant scoring. But once I call the explain() method, the system indicates the following error: 'The method explain(Query,int) in the type Searcher is not applicable for the arguments (String, int)'. In my code I call the explain() method as follows: Searcher.explain("rice", 0); Possibly something is wrong with my way of passing parameters. In my case, I have chosen "rice" as my query and indexed only one document. Could you please let me know what's wrong with this? I have also included the code. Thanks, Manjula

code--

import org.apache.lucene.search.Searcher;

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    searchIndex("rice");
  }

  public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      // System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
      System.out.println(hit.score);
      Searcher.explain("rice", 0);
    }
    Iterator<Hit> it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}

On Mon, Jul 5, 2010 at 7:46 PM, Grant Ingersoll gsing...@apache.org wrote: On Jul 5, 2010, at 5:02 AM, manjula wijewickrema wrote: Hi, In my application I input only a single-term query (at one time) and get back the corresponding scores for those queries. But I am struggling a little to understand Lucene scoring. I have referred to http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html and some other pages to resolve my questions, but some still remain. 1) Why does it take the square root of the frequency as the tf value and the square of the idf value in the score function?
Somewhat arbitrary, I suppose, but I think someone way back did some tests and decided it performed best in general. More importantly, the point of the Similarity class is that you can override these if you desire. 2) If I enter a single-term query, what will be returned by coord(q,d)? Since there is always one term in the query, I think it should always be 1. Am I correct? Should be. You can run the explain() method to confirm. 3) I am also struggling to understand sumOfSquaredWeights (in queryNorm(q)). As I understand it, this value depends on the nature of the query we input, and depending on that it uses different query types such as TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, etc. But if I always use a single-term query, which of the above will the system select? The queryNorm is an attempt at making scores comparable across queries. Again, I'd try the explain() method to see the practical aspects of how it affects the score. See http://lucene.apache.org/java/2_4_0/scoring.html for more info on scoring. -Grant - To unsubscribe, e-mail: java-user-unsubscr
Lucene Scoring
Hi, In my application I input only a single-term query (at one time) and get back the corresponding scores for those queries. But I am struggling a little to understand Lucene scoring. I have referred to http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html and some other pages to resolve my questions, but some still remain. 1) Why does it take the square root of the frequency as the tf value and the square of the idf value in the score function? 2) If I enter a single-term query, what will be returned by coord(q,d)? Since there is always one term in the query, I think it should always be 1. Am I correct? 3) I am also struggling to understand sumOfSquaredWeights (in queryNorm(q)). As I understand it, this value depends on the nature of the query we input, and depending on that it uses different query types such as TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, etc. But if I always use a single-term query, which of the above will the system select? If somebody can please help me to resolve these problems. Appreciate any reply from you. Regards, Manjula
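For a single-term query the practical effect of these factors is small: coord(q,d) = 1/1 = 1, and with the default Similarity queryNorm(q) = 1/sqrt(sumOfSquaredWeights), where for a lone TermQuery with default boost the sum of squared weights is just (idf * boost)², so queryNorm cancels one idf factor and the final score reduces to the fieldWeight shown by explain(). A pure-arithmetic sketch, with the idf value taken from the explain() outputs in this digest:

```java
public class SingleTermQueryNorm {
    public static void main(String[] args) {
        double idf = 0.30685282;   // idf(docFreq=1, maxDocs=1) from explain()
        double boost = 1.0;        // default query boost
        double sumOfSquaredWeights = (idf * boost) * (idf * boost);
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);   // = 1/idf
        System.out.println(queryNorm);
        // score = coord * (idf * queryNorm) * (tf * idf * fieldNorm)
        //       = tf * idf * fieldNorm, since coord = 1 and queryNorm = 1/idf,
        // which is why hit.score equals the fieldWeight line of explain() here.
    }
}
```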
Re: How to get file names instead of paths?
Dear Ian, The snippet you suggested works nicely. Thanks a lot for your kind help. Manjula. On Fri, Jun 11, 2010 at 4:00 PM, Ian Lea ian@gmail.com wrote: Something like this:

File f = new File(path);
String fn = f.getName();
return fn.substring(0, fn.lastIndexOf('.'));

-- Ian. On Fri, Jun 11, 2010 at 11:20 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, Using the following program I was able to get the entire file path of the indexed files which matched the given queries. But my intention is to get only the file names, even without the .txt extension, as I need to send these file names as labels to another application. So please let me know how I can get only the file names in the following code. Thanks in advance! Manjula. [quoted LuceneDemo code snipped; it appears in full in the original message of this thread] - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
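Ian's suggestion in runnable form: java.io.File.getName() strips the directory part (no file needs to exist on disk), and substring up to the last '.' removes the extension. The path below is hypothetical, standing in for the stored FIELD_PATH value:

```java
import java.io.File;

public class FileNameLabel {
    public static void main(String[] args) {
        String path = "/home/manjula/filesToIndex/deron-foods.txt"; // hypothetical path
        File f = new File(path);
        String fn = f.getName();                             // "deron-foods.txt"
        String label = fn.substring(0, fn.lastIndexOf('.')); // "deron-foods"
        System.out.println(label);
    }
}
```

Note that fn.lastIndexOf('.') returns -1 if the name has no extension, so production code should guard against that before calling substring.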
How to get file names instead of paths?
Hi, Using the following program I was able to get the entire file path of the indexed files which matched the given queries. But my intention is to get only the file names, even without the .txt extension, as I need to send these file names as labels to another application. So please let me know how I can get only the file names in the following code. Thanks in advance! Manjula. My code:

public class LuceneDemo {

  public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
  public static final String INDEX_DIRECTORY = "indexDirectory";
  public static final String FIELD_PATH = "path";
  public static final String FIELD_CONTENTS = "contents";

  public static void main(String[] args) throws Exception {
    createIndex();
    searchIndex("rice");
    searchIndex("milk");
    searchIndex("banana");
    searchIndex("foo");
  }

  public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    boolean recreateIndexIfExists = true;
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
      Document document = new Document();
      String path = file.getCanonicalPath();
      document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
      Reader reader = new FileReader(file);
      document.add(new Field(FIELD_CONTENTS, reader));
      indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
  }

  public static void searchIndex(String searchString) throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "'");
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    TopDocs results = indexSearcher.search(query, 10);
    ScoreDoc[] hits1 = results.scoreDocs;
    for (ScoreDoc hit : hits1) {
      Document doc = indexSearcher.doc(hit.doc);
      System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
    }
    Iterator<Hit> it = hits.iterator();
    while (it.hasNext()) {
      Hit hit = it.next();
      Document document = hit.getDocument();
      String path = document.get(FIELD_PATH);
      System.out.println("Hit: " + path);
    }
  }
}
Re: Arrange terms[i]
Dear Grant, Thanks for your reply. Manjula On Mon, May 24, 2010 at 4:37 PM, Grant Ingersoll gsing...@apache.org wrote: On May 20, 2010, at 5:15 AM, manjula wijewickrema wrote: Hi, I wrote a program to get the frequencies and terms of an indexed document. The output comes as follows; if I print: + tfv[0] Output: array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1, sampl/1, term/4, test/1} In the same way I can print terms[i] and freqs[i], but the problem is that while printing terms[i], the output (array elements) comes in English alphabetical order (as above), and freqs[i] is also arranged according to that particular order. Is there a way to arrange terms[i] according to the ascending/descending order of their frequencies? Yes, have a look at the TermVectorMapper. You will need to implement a variation of this to build up the data structures you need. -Grant - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Problem of getTermFrequencies()
Thanks On Mon, May 17, 2010 at 10:19 PM, Grant Ingersoll gsing...@apache.org wrote: Note, depending on your downstream use, you may consider using a TermVectorMapper that allows you to construct your own data structures as needed. -Grant On May 17, 2010, at 3:16 PM, Ian Lea wrote: terms and freqs are arrays. Try terms[i] and freqs[i]. -- Ian. On Mon, May 17, 2010 at 12:23 PM, manjula wijewickrema manjul...@gmail.com wrote: [quoted question and Testing code snipped; they appear in full in the original 'Problem of getTermFrequencies()' message of this thread] - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
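A related question in this digest asks for the total count of terms in the term frequency vector (needed for a normalized term frequency): the vector's length only counts distinct terms, but summing getTermFrequencies() counts duplicate occurrences too. A pure-Java sketch with the frequencies hard-coded from the Display output of this thread, since running the real thing needs a Lucene index:

```java
public class NormalizedTf {
    public static void main(String[] args) {
        // Frequencies from the Display output:
        // capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1
        String[] terms = {"capabl", "code", "frequenc", "lucen", "over", "sampl", "term", "test"};
        int[] freqs = {1, 2, 1, 2, 1, 1, 1, 1};
        int total = 0;
        for (int f : freqs) total += f;          // 10 tokens in the field, duplicates included
        for (int i = 0; i < terms.length; i++)
            System.out.println(terms[i] + " " + freqs[i] / (double) total);
        // e.g. "code" -> 2/10 = 0.2
    }
}
```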
Arrange terms[i]
Hi, I wrote a program to get the frequencies and terms of an indexed document. The output comes as follows; if I print: + tfv[0] Output: array terms are:{title: capabl/1, code/2, frequenc/1, lucen/4, over/1, sampl/1, term/4, test/1} In the same way I can print terms[i] and freqs[i], but the problem is that while printing terms[i], the output (array elements) comes in English alphabetical order (as above), and freqs[i] is also arranged according to that particular order. Is there a way to arrange terms[i] according to the ascending/descending order of their frequencies? Thanks in advance. Manjula
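Besides the TermVectorMapper route, one simple way is to sort an index array over the parallel terms[]/freqs[] arrays so the pairing stays intact. A pure-Java sketch using the frequencies shown in the output above (hard-coded for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortByFreq {
    public static void main(String[] args) {
        final String[] terms = {"capabl", "code", "frequenc", "lucen", "over", "sampl", "term", "test"};
        final int[] freqs   = {1, 2, 1, 4, 1, 1, 4, 1};
        // Sort positions by descending frequency instead of the arrays themselves,
        // so terms[i] <-> freqs[i] stays aligned.
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) { return freqs[b] - freqs[a]; }
        });
        for (int i : order) System.out.println(terms[i] + "/" + freqs[i]);
        // lucen/4, term/4, code/2, then the freq-1 terms
        // (Arrays.sort on objects is stable, so ties keep their alphabetical order)
    }
}
```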
Re: How to call high freq. terms using the HighFreqTerms class
hi Erick, Thanks On Sat, May 15, 2010 at 5:37 PM, Erick Erickson erickerick...@gmail.com wrote: It looks like a stand-alone program, so you don't call it. You probably want to get the source code and take a look at how that program works to get an idea of how to do what you want. See the instructions here for getting the source: http://wiki.apache.org/lucene-java/HowToContribute HTH Erick On Sat, May 15, 2010 at 1:49 AM, manjula wijewickrema manjul...@gmail.com wrote: Hi, I am struggling with using the HighFreqTerms class for the purpose of finding high-frequency terms in my index. My target is to get the high-frequency terms in an indexed document (a single document). To do that I have added the org.apache.lucene.misc package to my project. I think up to that point I am correct, but after that I have no idea how to call this in my code. Although I have looked in the Lucene email archive, I was unable to find a hint regarding the call of this class. If anybody can please give me sample code for using this class (and the relevant methods) in a way that suits my purpose. I appreciate your kind help. Thanks Manjula
Problem of getTermFrequencies()
Hi, I wrote some code intended to display the indexed terms of a single document and get their term frequencies. Although it displays the terms in the index, it does not give the term frequencies; instead it displays 'frequencies are:[...@80fa6f'. What is the reason for this? The code I have written and the display are given below.

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.TermFreqVector;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class Testing {

  public static void main(String[] args) throws IOException, ParseException {
    // StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    try {
      Directory directory = new RAMDirectory();
      IndexWriter w = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
      Document doc = new Document();
      String text = "This is a sample codes code for testing lucene's capabilities over lucene term frequencies";
      doc.add(new Field("title", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
      w.addDocument(doc);
      w.close();
      IndexReader ir = IndexReader.open(directory);
      TermFreqVector[] tfv = ir.getTermFreqVectors(0);
      // for (int xy = 0; xy < tfv.length; xy++) {
      String[] terms = tfv[0].getTerms();
      int[] freqs = tfv[0].getTermFrequencies();
      // System.out.println("terms are:" + tfv[xy]);
      // System.out.println("length is:" + terms.length);
      System.out.println("array terms are:" + tfv[0]);
      System.out.println("terms are:" + terms);
      System.out.println("frequencies are:" + freqs);
      // }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Display:

array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1}
terms are:[Ljava.lang.String;@1e13d52
frequencies are:[...@80fa6f

If somebody can please help me to get the desired output. Thanks, Manjula.
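The '[Ljava.lang.String;@1e13d52' and '[...@80fa6f' output is just Java's default array toString (type tag plus hash code), not a Lucene problem. Indexing with terms[i]/freqs[i] as suggested in the replies works, and so does java.util.Arrays.toString. A runnable sketch with hard-coded arrays standing in for the term vector data:

```java
import java.util.Arrays;

public class PrintArrays {
    public static void main(String[] args) {
        String[] terms = {"capabl", "code", "frequenc"};
        int[] freqs = {1, 2, 1};
        System.out.println(terms);                  // default toString: [Ljava.lang.String;@...
        System.out.println(Arrays.toString(terms)); // [capabl, code, frequenc]
        System.out.println(Arrays.toString(freqs)); // [1, 2, 1]
        for (int i = 0; i < terms.length; i++)      // or pair them up explicitly
            System.out.println(terms[i] + "/" + freqs[i]);
    }
}
```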
Re: Problem of getTermFrequencies()
Dear Ian, I changed it as you said and now it is working nicely. Thanks a lot for your kind help. Manjula

On Mon, May 17, 2010 at 6:46 PM, Ian Lea ian@gmail.com wrote:

terms and freqs are arrays. Try terms[i] and freqs[i]. -- Ian.

On Mon, May 17, 2010 at 12:23 PM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, I wrote a code to display the indexed terms of a single document and get their term frequencies. Although it displays the terms in the index, it does not give the term frequencies. Instead it displays 'frequencies are:[...@80fa6f'. What is the reason for this? The code I have written is in my original message above, and the display is as follows.

Display:
array terms are:{title: capabl/1, code/2, frequenc/1, lucen/2, over/1, sampl/1, term/1, test/1}
terms are:[Ljava.lang.String;@1e13d52
frequencies are:[...@80fa6f

If somebody can please help me get the desired output. Thanks, Manjula.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
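Ian's fix is to index the two parallel arrays together instead of printing the array references themselves (which is what produces the `[...@80fa6f` output). A minimal, self-contained sketch of the corrected loop, with the term and frequency values from the "array terms are:" output above hard-coded in place of the getTerms()/getTermFrequencies() calls:

```java
public class TermFreqPrint {
    public static void main(String[] args) {
        // Stand-ins for tfv[0].getTerms() and tfv[0].getTermFrequencies(),
        // hard-coded here with the values shown in the output above.
        String[] terms = {"capabl", "code", "frequenc", "lucen", "over", "sampl", "term", "test"};
        int[] freqs = {1, 2, 1, 2, 1, 1, 1, 1};

        // The arrays are parallel: freqs[i] is the frequency of terms[i].
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + "/" + freqs[i]);
        }
    }
}
```

With the real vector in place of the hard-coded arrays, the same loop prints each stemmed term next to its in-document count.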
Re: Error of the code
Hi Ian, Thanks for your reply. vector.size() returns the total number of indexed terms in the index. However, I was able to run the program and get the results finally with your help. Thanks a lot. Manjula

On Thu, May 13, 2010 at 6:52 PM, Ian Lea ian@gmail.com wrote:

What does vector.size() return? You don't appear to be doing anything with the String term in "for (String term : vector.getTerms())" - presumably you intend to. -- Ian.

On Thu, May 13, 2010 at 1:16 PM, manjula wijewickrema manjul...@gmail.com wrote:

Dear Ian, Thanks a lot for your immediate reply. As you mentioned, I replaced the lines as follows:

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

Now the error has vanished, thanks. But I still can't see the results, although I have moved those lines after iwriter.close(). What is the reason for this? Sample code after the modifications:

...
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
iwriter.addDocument(doc);
iwriter.close();
IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
int size = vector.size();
for (String term : vector.getTerms())
    System.out.println("size = " + size);
IndexSearcher isearcher = new IndexSearcher(directory, true);
...

I appreciate your kind cooperation. Manjula

On Thu, May 13, 2010 at 3:45 PM, Ian Lea ian@gmail.com wrote:

You need to replace this:

TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");

with

IndexReader ir = whatever(...);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

And you'll need to move it to after the writer.close() call if you want it to see the doc you've just added. -- Ian.
On Thu, May 13, 2010 at 11:07 AM, manjula wijewickrema manjul...@gmail.com wrote:

Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). To do that I used the code shown in my original message, but there is a compile-time error in it that I can't figure out. Could somebody guide me on what's wrong with it? Compile-time error I got: "Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader."
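A note on the vector.size() point discussed in this thread: size() (like getTerms().length) counts distinct terms only, so a normalized term frequency needs the sum of the frequency array rather than its length. A small pure-Java sketch, where the freqs values are hypothetical stand-ins for what getTermFrequencies() would return:

```java
public class NormalizedTf {
    public static void main(String[] args) {
        // Hypothetical stand-in for vector.getTermFrequencies().
        int[] freqs = {1, 2, 1, 2, 1, 1, 1, 1};

        // vector.size() corresponds to freqs.length: the distinct-term count.
        int distinctTerms = freqs.length;

        // Total tokens = sum of per-term frequencies (counts duplicates too).
        int totalTokens = 0;
        for (int f : freqs) totalTokens += f;

        // Normalized term frequency of term i: freqs[i] / totalTokens.
        double tfOfTerm1 = (double) freqs[1] / totalTokens;

        System.out.println(distinctTerms + " distinct terms, "
            + totalTokens + " total tokens, tf = " + tfOfTerm1);
    }
}
```

The distinction matters exactly because a document with duplicated words has a larger token total than its distinct-term count.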
Access indexed terms
Hi, Is it possible to put the indexed terms into an array in Lucene? For example, imagine I have indexed a single document in Lucene and now I want to access those terms in the index. Is it possible to retrieve (call) those terms as array elements? If it is possible, then how? Thanks, Manjula
Re: Access indexed terms
Hi Andrzej, Thanks for the reply. But as you have mentioned, creating arrays for indexed terms seems to be a little difficult. My intention here is to find the term frequencies of an indexed document. I can find the term frequency of a particular term (given as a query) if I specify the term in the code. But what I really want is to get the term frequency (or even the number of times it appears in the document) of all indexed terms (or the high-frequency terms) without naming them in the code. Is there an alternative way to do that? Thanks, Manjula

On Fri, May 14, 2010 at 4:00 PM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-05-14 11:35, manjula wijewickrema wrote: Hi, Is it possible to put the indexed terms into an array in Lucene? For example, imagine I have indexed a single document in Lucene and now I want to access those terms in the index. Is it possible to retrieve (call) those terms as array elements? If it is possible, then how?

In short: unless you created a TermFrequencyVector when adding the document, the answer is "with great difficulty". For working code that does this, see here: http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/DocReconstructor.java If you really need this kind of access in your application, then add your documents with term vectors with offsets and positions. Even then, depending on the Analyzer you used, the process is lossy - some input data that was discarded by the Analyzer is simply no longer available.

-- Best regards, Andrzej Bialecki, http://www.sigram.com, Contact: info at sigram dot com
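When term vectors were stored at indexing time (Andrzej's recommendation above), the arrays Manjula is after come straight off the TermFreqVector. A sketch against the Lucene 2.9 API; the directory variable, document id 0, field name "contents", and the probe term "planet" are assumptions for illustration:

```java
// Sketch, Lucene 2.9.x: assumes doc 0 was indexed with Field.TermVector.YES
// (or WITH_POSITIONS_OFFSETS) on a field named "contents".
IndexReader reader = IndexReader.open(directory);
TermFreqVector vector = reader.getTermFreqVector(0, "contents");
if (vector != null) {                       // null if no term vector was stored
    String[] terms = vector.getTerms();     // distinct terms, sorted
    int[] freqs = vector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i] + " occurs " + freqs[i] + " time(s)");
    }
    // Frequency of a specific term looked up by its string value:
    int idx = vector.indexOf("planet");     // -1 if the term is absent
    if (idx >= 0) {
        System.out.println("planet: " + freqs[idx]);
    }
}
reader.close();
```

The indexOf(String) lookup also covers the case of asking for a term's frequency by its string value rather than by its array position.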
Re: Access indexed terms
Dear Andrzej, Thanks for your valuable help. I also noticed this HighFreqTerms approach in the Lucene email archive and tried to use it. To do that, I downloaded lucene-misc-2.9.1.jar and added the org.apache.lucene.misc package to my project. Now I think I have to call this HighFreqTerms class in my code, but I was unable to find any guidance on how to do it. Could you please be kind enough to tell me how I can use this class in my code? Thanks, Manjula

On Fri, May 14, 2010 at 6:16 PM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-05-14 14:24, manjula wijewickrema wrote: But what I really want is to get the term frequency (or even the number of times it appears in the document) of all indexed terms (or the high-frequency terms) without naming them in the code. Is there an alternative way to do that?

Yes, see the discussion here: https://issues.apache.org/jira/browse/LUCENE-2393

-- Best regards, Andrzej Bialecki, http://www.sigram.com, Contact: info at sigram dot com
How to call high frequency terms using the HighFreqTerms class
Hi, I am struggling with using the HighFreqTerms class to find high-frequency terms in my index. My target is to get the high-frequency terms in an indexed document (a single document). To do that I have added the org.apache.lucene.misc package to my project. I think up to that point I am correct, but after that I have no idea how to call this in my code. Although I have looked in the Lucene email archive, I was unable to find a hint about calling this class. Could anybody please give me a sample code for using this class (and the relevant methods) in a way that suits my purpose? I appreciate your kind help. Thanks, Manjula
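For a single document, the top terms can also be had without the misc HighFreqTerms class (which ranks terms across the whole index by document frequency): sort the document's own term vector by frequency. A sketch against the Lucene 2.9 API; the directory variable, document id 0, and field name "contents" are assumptions:

```java
// Sketch, Lucene 2.9.x: the two most frequent terms of one document,
// taken from its stored term vector.
IndexReader reader = IndexReader.open(directory);
TermFreqVector vector = reader.getTermFreqVector(0, "contents");
final String[] terms = vector.getTerms();
final int[] freqs = vector.getTermFrequencies();

// Sort indices of the parallel arrays by descending frequency.
Integer[] order = new Integer[terms.length];
for (int i = 0; i < order.length; i++) order[i] = i;
java.util.Arrays.sort(order, new java.util.Comparator<Integer>() {
    public int compare(Integer a, Integer b) { return freqs[b] - freqs[a]; }
});

// Print the first and second highest-occurring terms.
for (int k = 0; k < Math.min(2, order.length); k++) {
    System.out.println(terms[order[k]] + " -> " + freqs[order[k]]);
}
reader.close();
```

Sorting an index array rather than the term array itself keeps the terms/freqs pairing intact.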
Re: Class_for_HighFrequencyTerms
thanks

On Tue, May 11, 2010 at 3:31 PM, adam.salt...@gmail.com wrote:

Sounds like your path is messed up and you're not using Maven correctly. Start with the jar version that contains the class you require and use a Maven POM to correctly resolve dependencies. Adam. Sent using BlackBerry® from Orange

-Original Message- From: manjula wijewickrema manjul...@gmail.com Date: Tue, 11 May 2010 15:13:12 To: java-user@lucene.apache.org Subject: Re: Class_for_HighFrequencyTerms

Dear Erick, I looked for it and even added IndexReader.java and TermFreqVector.java from http://www.jarvana.com/jarvana/search?search_type=classjava_class=org.apache.lucene.index.IndexReader . But after adding them, the system indicated a lot of errors in the source code IndexReader.java (e.g. DirectoryOwningReader cannot be resolved to a type, indexCommit cannot be resolved to a type, SegmentInfos cannot be resolved, TermEnum cannot be resolved to a type, etc.). I am using Lucene 2.9.1, and this particular website lists this source code under version 2.9.1 of Lucene. What is the reason for this kind of scenario? Do I have to add another JAR file? (To solve this I even added lucene-core-2.9.1-sources.jar, but nothing happened.) Please be kind enough to reply. Thanks, Manjula

On Tue, May 11, 2010 at 1:26 AM, Erick Erickson erickerick...@gmail.com wrote:

Have you looked at TermFreqVector? Best, Erick

On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, If I index a document (single document) in Lucene, how can I get the term frequencies (even the first and second highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
Error of the code
Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). To do that I have used the following code, but there is a compile-time error in it that I can't figure out. Could somebody guide me on what's wrong with it?

Compile-time error I got: "Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader."

Code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import java.io.IOException;

public class DemoTest {
  public static void main(String[] args) {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    try {
      Directory directory = new RAMDirectory();
      IndexWriter iwriter = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));
      Document doc = new Document();
      String text = "This is the text to be indexed.";
      doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
      iwriter.addDocument(doc);
      TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");
      int size = vector.size();
      for (String term : vector.getTerms())
        System.out.println("size = " + size);
      iwriter.close();
      IndexSearcher isearcher = new IndexSearcher(directory, true);
      QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
      Query query = parser.parse(text);
      ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
      System.out.println("hits.length(1) = " + hits.length);
      // Iterate through the results:
      for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        System.out.println("hitDoc.get(\"fieldname\") (This is the text to be indexed) = " + hitDoc.get("fieldname"));
      }
      isearcher.close();
      directory.close();
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Thanks in advance, Manjula
Re: Error of the code
Dear Ian, Thanks a lot for your immediate reply. As you mentioned, I replaced the lines as follows:

IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

Now the error has vanished, thanks. But I still can't see the results, although I have moved those lines after iwriter.close(). What is the reason for this? Sample code after the modifications:

...
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
iwriter.addDocument(doc);
iwriter.close();
IndexReader ir = IndexReader.open(directory);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");
int size = vector.size();
for (String term : vector.getTerms())
    System.out.println("size = " + size);
IndexSearcher isearcher = new IndexSearcher(directory, true);
...

I appreciate your kind cooperation. Manjula

On Thu, May 13, 2010 at 3:45 PM, Ian Lea ian@gmail.com wrote:

You need to replace this:

TermFreqVector vector = IndexReader.getTermFreqVector(0, "fieldname");

with

IndexReader ir = whatever(...);
TermFreqVector vector = ir.getTermFreqVector(0, "fieldname");

And you'll need to move it to after the writer.close() call if you want it to see the doc you've just added. -- Ian.

On Thu, May 13, 2010 at 11:07 AM, manjula wijewickrema manjul...@gmail.com wrote:

Dear All, I am trying to get the term frequencies (through TermFreqVector) of a document (using Lucene 2.9.1). To do that I used the code shown in my original message, but there is a compile-time error in it that I can't figure out. Compile-time error I got: "Cannot make a static reference to the non-static method getTermFreqVector(int, String) from the type IndexReader."
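Once the reader is opened correctly after the writer is closed, the per-document frequency of a single named term can also be read directly from the postings via TermDocs, with no scoring involved. A sketch against the Lucene 2.9 API; the directory variable and the field/term values are assumptions:

```java
// Sketch, Lucene 2.9.x: raw in-document counts of one term,
// for every document that contains it.
IndexReader ir = IndexReader.open(directory);
TermDocs td = ir.termDocs(new Term("fieldname", "text"));
while (td.next()) {
    // td.freq() is the number of occurrences of the term in doc td.doc().
    System.out.println("doc " + td.doc() + ": freq " + td.freq());
}
td.close();
ir.close();
```

Unlike the term-vector route, this works even when no term vectors were stored, since it reads the inverted index itself.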
Re: Class_for_HighFrequencyTerms
Dear Erick, I looked for it and even added IndexReader.java and TermFreqVector.java from http://www.jarvana.com/jarvana/search?search_type=classjava_class=org.apache.lucene.index.IndexReader . But after adding them, the system indicated a lot of errors in the source code IndexReader.java (e.g. DirectoryOwningReader cannot be resolved to a type, indexCommit cannot be resolved to a type, SegmentInfos cannot be resolved, TermEnum cannot be resolved to a type, etc.). I am using Lucene 2.9.1, and this particular website lists this source code under version 2.9.1 of Lucene. What is the reason for this kind of scenario? Do I have to add another JAR file? (To solve this I even added lucene-core-2.9.1-sources.jar, but nothing happened.) Please be kind enough to reply. Thanks, Manjula

On Tue, May 11, 2010 at 1:26 AM, Erick Erickson erickerick...@gmail.com wrote:

Have you looked at TermFreqVector? Best, Erick

On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, If I index a document (single document) in Lucene, how can I get the term frequencies (even the first and second highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
Re: Trace only exactly matching terms!
Hi Anshum, Erick, As you mentioned, I used SnowballAnalyzer for stemming purposes. It worked nicely. Thanks a lot for your guidance. Manjula.

On Fri, May 7, 2010 at 8:27 PM, Erick Erickson erickerick...@gmail.com wrote:

The other approach is to use a stemmer both at index and query time. BTW, it's very easy to make a custom analyzer by chaining together the Tokenizer and as many filters (e.g. PorterStemFilter) as needed, essentially composing your analyzer from various pre-built Lucene parts. HTH, Erick

On Fri, May 7, 2010 at 9:07 AM, Anshum ansh...@gmail.com wrote:

Hi Manjula, Yes, Lucene by default would only tackle exact term matches unless you use a custom analyzer to expand the index/query. -- Anshum Gupta, http://ai-cafe.blogspot.com "The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw."

On Fri, May 7, 2010 at 2:22 PM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, I am using Lucene 2.9.1. I downloaded and ran the 'HelloLucene.java' class, modifying the input document and user query in various ways. Once I set the document sentence to 'Lucene in actions' instead of 'Lucene in action' and gave the query 'action', the program did not show 'Lucene in actions' as a hit! What is the reason for this? Why doesn't it treat the word 'actions' as a hit? Does Lucene identify only exactly matching words? Thanks, Manjula
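The stemming setup that resolved this thread can be sketched as follows: the same analyzer must be used at both index and query time so that "actions" and "action" reduce to the same term. This follows the constructor style used elsewhere in this archive (Lucene 2.9; the directory variable and field name "contents" are assumptions):

```java
// Sketch: one analyzer shared by indexing and querying, so the stemmer
// maps "actions" and "action" to the same indexed term.
SnowballAnalyzer analyzer =
    new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);

IndexWriter writer = new IndexWriter(directory, analyzer, true,
    IndexWriter.MaxFieldLength.UNLIMITED);
// ... add documents, then writer.close() ...

QueryParser parser =
    new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
Query query = parser.parse("action");  // also matches docs that said "actions"
```

If the analyzers differ between the two sides, the stems in the index and in the query no longer line up, which reproduces the "no hit" symptom described above.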
Class_for_HighFrequencyTerms
Hi, If I index a document (single document) in Lucene, how can I get the term frequencies (even the first and second highest-occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
Trace only exactly matching terms!
Hi, I am using Lucene 2.9.1. I downloaded and ran the 'HelloLucene.java' class, modifying the input document and user query in various ways. Once I set the document sentence to 'Lucene in actions' instead of 'Lucene in action' and gave the query 'action', the program did not show 'Lucene in actions' as a hit! What is the reason for this? Why doesn't it treat the word 'actions' as a hit? Does Lucene identify only exactly matching words? Thanks, Manjula
Term/Phrase frequencies
Hi, I am new to Lucene. If I want to know the term or phrase frequency in an input document, is that possible through Lucene? Thanks, Manjula
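Single-term frequencies can be read from the term vector or the postings as in the earlier threads. For a phrase, one simple check is to run a PhraseQuery and ask the searcher to explain the score, since the explanation surfaces the phrase-frequency component. A sketch against the Lucene 2.9 API; the directory variable, field name "contents", and the phrase words are assumptions:

```java
// Sketch, Lucene 2.9.x: match a two-word phrase and inspect how often
// it contributed to each hit's score.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("contents", "term"));
pq.add(new Term("contents", "frequency"));

IndexSearcher searcher = new IndexSearcher(directory, true);
ScoreDoc[] hits = searcher.search(pq, null, 10).scoreDocs;
for (ScoreDoc hit : hits) {
    // The explanation tree includes the phrase-frequency factor.
    System.out.println(searcher.explain(pq, hit.doc));
}
searcher.close();
```

Note that phrase matching requires position information in the index, which the standard analyzed fields record by default.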