Writing a document using two different Analyzers
Hello! I have a Document with two fields: one I would like to index with SimpleAnalyzer, the other with StandardAnalyzer. Is there a simple way to do it? Thanks -- Paulo E. A. Silveira Caelum Ensino e Soluções em Java http://www.caelum.com.br/
Re: Writing a document using two different Analyzers
On 25 May 2007, at 09:32, Paulo Silveira wrote: > I have a Document with two fields: one I would like to index with SimpleAnalyzer, the other with StandardAnalyzer. Is there a simple way to do it? PerFieldAnalyzerWrapper: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html -- karl
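A minimal sketch of wiring up PerFieldAnalyzerWrapper, assuming the Lucene 2.x API; the field name "title" and the index path are illustrative, not from the thread:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class TwoAnalyzerIndexer {
        public static void main(String[] args) throws Exception {
            // StandardAnalyzer is the default for any field not registered below.
            PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            // The "title" field alone is analyzed with SimpleAnalyzer.
            analyzer.addAnalyzer("title", new SimpleAnalyzer());
            // The wrapper is handed to IndexWriter like any other Analyzer.
            IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
            // ... addDocument() calls ...
            writer.close();
        }
    }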
Re: number of times the keyword match
Hi Grant, Is there any code example for this case? Thanks, Anny On 5/15/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Yes, have a look at the SpanQuery functionality. -Grant On May 15, 2007, at 3:05 AM, Anny Bridge wrote: > Hi all, > When doing a search with Lucene, can I get the number of times the keyword matches within a specific document? > Thanks in advance. > Best Regards, > Anny. -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
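One way to get per-document hit counts from spans -- a sketch assuming the Lucene 2.x Spans API; the field name, term, and index path are illustrative:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    public class MatchCounter {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            SpanTermQuery query = new SpanTermQuery(new Term("contents", "albino"));
            // Each next() advances to one match occurrence, so tallying
            // occurrences per doc() yields the per-document match count.
            Spans spans = query.getSpans(reader);
            Map counts = new HashMap();
            while (spans.next()) {
                Integer doc = new Integer(spans.doc());
                Integer c = (Integer) counts.get(doc);
                counts.put(doc, new Integer(c == null ? 1 : c.intValue() + 1));
            }
            System.out.println(counts);
            reader.close();
        }
    }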
Re: Writing a document using two different Analyzers
On 5/25/07, karl wettin <[EMAIL PROTECTED]> wrote: > PerFieldAnalyzerWrapper > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html That was fast! Thanks! -- Paulo E. A. Silveira Caelum Ensino e Soluções em Java http://www.caelum.com.br/
Setting the maximum number of documents in a lucene segment
Hello, I am trying to change the maximum number of documents in a Lucene segment. By default it seems to be 10. When I have a mergeFactor of, say, 10, then on average Lucene merges segments after every 100 added documents. I want each segment to contain more than the default 10 documents, because I need to minimize merging. Is there a way to achieve this? writer.setMaxBufferedDocs(largeValue) does not do the trick (I think because in my case the writer is flushed and closed after a few updates). Does anyone know whether it is possible to raise the default number of documents a segment can contain? Thanks in advance, Ard Schrijvers -- Hippo Oosteinde 11 1017WT Amsterdam The Netherlands Tel +31 (0)20 5224466 - [EMAIL PROTECTED] / http://www.hippo.nl
RE: Setting the maximum number of documents in a lucene segment
> I am trying to change the maximum number of documents in a > Lucene segment. By default it seems to be 10. Correction: 10 for the smallest (just created) segments, of course; merged segments will obviously contain many more documents. Ard
Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts
Grant writes: Have a look at the DisjunctionMaxQuery, I think it might help, although I am not sure it will fully cover your case.

The definition for DisjunctionMaxQuery is provided at this URL: http://incubator.apache.org/lucene.net/docs/2.1/Lucene.Net.Search.DisjunctionMaxQuery.html. Grossly doing editorial cuts of the synopsis text, we end up with this simplified description: 'This is useful when searching for a word in multiple fields ... if the query is "albino elephant" this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields.'

First off, thanks Grant -- I hadn't even considered the possibility of what happens if multiple fields in the _same_ document matched. That's an intriguing case, indeed. However, for my particular dataset, I only have the one field containing the contents of the document, so unless I've missed an alternate way of using it, I'm not sure how I should apply it to my specific case.

For clarification, what I'm trying to do is make sure that if a document uses a single term many times, it doesn't drown out a document that uses more of the search terms, though less frequently, when the scores are returned. Take a document that says: "Albino. Albino. Albino. Albino. Albino. Albino. Albino!" Right there, that's seven hits on albino, so this must _really_ be a document about albino. Take a document that says "Albino elephant." and nothing more. This only has two keyword hits. What I want is for the returned results not to go "Oh, 7 is more than 2, let's return the Albino document first." Instead, I'm looking for "This document matched 2 of the things he was looking for, albino and also elephant, while the other document only matched 1 of the things he was looking for -- 2 is more than 1, so give 'Albino elephant.' the best score."

Thanks, -wls

ps. I wasn't even aware DisjunctionMaxQuery existed; is there a resource that describes the purpose of BooleanQuery, DisjunctionMaxQuery, and others in a simple reference? For instance, if I go to the BooleanQuery page, http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/BooleanQuery.html, it doesn't even say "sum of the field scores" -- maybe I'm looking in the wrong place, but for someone new to the API, it's very hard to figure out what class you want when it's unclear what specific effect it has on scoring.
Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts
In reading the math for scoring at the bottom of http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html, it appears that if I can make tf() and idf() -- term frequency and inverse document frequency respectively -- both return 1, then coord(), which becomes the primary factor of the product, is what I'm looking for. Would anyone have enough knowledge to confirm / deny? -wls
multiple tokens at the same position
Hi, In Nutch we have a use case in which we need to store tokens with their original text plus their stemmed form plus their canonical form (through some asciification). From my understanding of Lucene, it makes sense to write a TokenStream which generates several tokens for each "word", but places all the tokens for the "word" at the same position with Token#setPositionIncrement(0). This way we could search over this field using any form (stemmed, canonical, original) of the "word". Actually I have two questions here. First, is there any way to avoid matching stemmed or canonical forms in a phrase query? Second, it seems that adding multiple forms of the "word"s alters statistical calculations for scoring, especially tf and idf, because the frequency of the root form of the word is incremented for each word with that root form. Is there any way we could avoid that?
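A sketch of the kind of TokenFilter described here, stacking a stemmed form at the same position as the original, assuming the Lucene 2.x TokenStream API; the stem() method is a crude placeholder, not a real stemmer:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class StackedStemFilter extends TokenFilter {
        private Token pending; // stemmed token waiting to be emitted

        public StackedStemFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token original = input.next();
            if (original == null) return null;
            String stemmed = stem(original.termText());
            if (!stemmed.equals(original.termText())) {
                pending = new Token(stemmed,
                                    original.startOffset(), original.endOffset());
                pending.setPositionIncrement(0); // same position as the original
            }
            return original;
        }

        private String stem(String term) {
            // Placeholder: plug in a real stemmer (e.g. a Porter stemmer) here.
            return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
        }
    }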
Re: multiple tokens at the same position
Hi Enis, Enis Soztutar wrote: > In nutch we have a use case in which we need to store tokens with their original text plus their stemmed form plus their canonical form [...] is there any way to avoid matching stemmed or canonical forms in a phrase query? [...] adding multiple forms of the "word"s alters statistical calculations for scoring [...] Answering both questions: Couldn't you just use a different field for each form? -- Steve Rowe Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp
Re: multiple tokens at the same position
On 5/25/07, Steven Rowe <[EMAIL PROTECTED]> wrote: > Answering both questions: Couldn't you just use a different field for each form? Yes, indeed we could, but it brings other problems, for example increasing the index size, extending the query to search multiple fields, etc.
Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts
On 5/25/07, Walt Stoneburner <[EMAIL PROTECTED]> wrote: > It appears that if I can make tf() and idf(), term frequency and inverse document frequency respectively, both return 1, then coord(), which is now the primary factor of the product, is what I'm looking for. Pretty close, I think. There is still the length normalization factor that biases short fields over long. That's calculated at index time, and stored in the "norm" along with the boost (they are multiplied together). You can change the Similarity during indexing, or you can completely knock out norms via Field.setOmitNorms(true). -Yonik
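A sketch of a Similarity along these lines, flattening tf, idf, and length normalization so that coord() dominates the score; this assumes the Lucene 2.x DefaultSimilarity method signatures:

    import org.apache.lucene.search.DefaultSimilarity;

    public class CoordOnlySimilarity extends DefaultSimilarity {
        // Every matching term counts once, however often it repeats.
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }

        // Rare and common terms are weighted identically.
        public float idf(int docFreq, int numDocs) {
            return 1.0f;
        }

        // Neutralize the length normalization Yonik mentions. Note this
        // value is baked into the index at write time, so it must be set
        // on the IndexWriter too (or norms omitted via setOmitNorms).
        public float lengthNorm(String fieldName, int numTokens) {
            return 1.0f;
        }
    }

At search time it would be installed with searcher.setSimilarity(new CoordOnlySimilarity()).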
Re: Setting the maximum number of documents in a lucene segment
Hello Ard, What you are after is a higher mergeFactor and probably also a higher maxBufferedDocs. Is indexing performance the concern? Don't go crazy with a super high (e.g. 100+) mergeFactor unless you really have the number of open files on your server(s) set to a solid/high number. maxBufferedDocs can typically be set to a much higher number, depending on the size of the documents you are trying to index and the amount of heap the JVM has to work with. There is also a new API for explicit flushes of in-memory documents while indexing, to control memory consumption. Otis -- Lucene Consulting -- http://lucene-consulting.com/ - Original Message From: Ard Schrijvers <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, May 25, 2007 8:40:26 AM Subject: RE: Setting the maximum number of documents in a lucene segment > I want each segment to contain more than the default 10 documents, because I need to minimize merging. > Is there a way to achieve this? writer.setMaxBufferedDocs(largeValue) does not do the trick [...]
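A sketch of the tuning Otis describes, using the Lucene 2.x IndexWriter setters; the concrete values and index path are illustrative only:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class TunedIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            // Let more segments accumulate per level before merging;
            // watch the OS open-file limit at high values.
            writer.setMergeFactor(100);
            // Buffer more documents in RAM before flushing a new on-disk
            // segment -- this only helps if the writer stays open across
            // many addDocument() calls, which was Ard's stumbling block.
            writer.setMaxBufferedDocs(1000);
            // ... addDocument() calls ...
            writer.close();
        }
    }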
Re: multiple tokens at the same position
I can only speak to the "avoid matching stemmed or canonical forms" part... Yes, but you've got to do some fancy dancing when you index, something like adding a special signifier to, say, the original token. I'll ignore the canonical part of your question for the sake of brevity. Consider indexing "running". You'd index "run" and "running$". Now, whenever you care about the original token, you append the '$' to the term and search for that. This has one other advantage. Say you index the term "run" with the above scheme. If you don't do something like adding the $ to the original, you can't distinguish between getting a hit on the stem or not. That is, you can't distinguish between a hit where the original word was "run" and one where the original was "running". This may be important for "exact match". Best Erick On 5/25/07, Enis Soztutar <[EMAIL PROTECTED]> wrote: > In Nutch we have a use case in which we need to store tokens with their original text plus their stemmed form plus their canonical form [...]
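Building on the earlier filter sketch, the marker idea only changes which token gets decorated; the '$' convention is Erick's, while the fragment below is hypothetical, assuming the Lucene 2.x Token API:

    // Inside a TokenFilter's next(): emit the stemmed form as the primary
    // token and stack the '$'-marked original at the same position.
    Token original = input.next();
    if (original == null) return null;
    String text = original.termText();
    Token marked = new Token(text + "$",
                             original.startOffset(), original.endOffset());
    marked.setPositionIncrement(0); // same position as the stemmed token
    pending = marked;               // returned on the following next() call
    return new Token(stem(text), original.startOffset(), original.endOffset());

At query time an exact-match search looks up running$ while a stemmed search looks up run.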
Indexing help needed
I've been working on this for a while. I am trying to get the demo code that comes with Lucene to index OpenOffice documents. I've looked at LIUS code and at Nutch code, but can't find an easy way, so I am digging into the code. I wrote a KcmiDocument class that returns a Document. In it I do a doc.add() where I specify "contents" and a FileReader: /* * Add the contents of the file to a field named "contents". Specify a Reader, * so that the text of the file is tokenized and indexed, but not stored. * Note that FileReader expects the file to be in the system's default encoding. * If that's not the case, searching for special characters will fail. * FileReader is the key; we need to add the correct reader for non-text formats. */ doc.add(new Field("contents", new FileReader(f))); Now if I could just add a file reader for OpenOffice, say OOFileReader(), that unzips and does all the DOM stuff, then everything would work and the code changes would be minimal, right? My question is, am I correct in my thinking? And if so, does anyone know of an OOFileReader? If I am not correct, what am I missing here? It is kind of important that I learn how to add different file types like OO or AutoCAD, so we can make a build (with Lucene) or buy call. Thanks to all that try to help me out. Jim S P.S. If I get it working I will be happy to post the code.
Re: Indexing help needed
jim shirreffs wrote: > P.S. If I get it working I will be happy to post the code. If you looked at the code in Nutch, you can take most of the parse-oo plugin verbatim, because all this plugin does is extract the text content and metadata from OO files. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: multiple tokens at the same position
: Yes, indeed we could but it brings other problems, for example increasing : the index size, and extending the query to search for multiple fields, etc. 1) If you index both the raw and stemmed forms, your index is going to grow to roughly the same size regardless of whether the stem and the raw form are in the same field or different fields. 2) For a particular user action, if you only want to search for one form (either raw or stemmed) then your query doesn't have to search over multiple fields -- it just has to search on whichever field you care about for the particular user action. If you want to search for *both* the stemmed and raw forms in a single query, then the complexity of the query is the same regardless of whether the two clauses are on the same field or different fields. : > > or canonical forms to a phrase query. Moreover it seems that adding : > > multiple forms of the "word"s alters statistical calculations for : > > scoring, especially for tf and idf, because the frequency of the root : > > form of the word is incremented at each word with that root form. Is It's a matter of opinion whether this is "right" or not ... if you are storing both the raw and stemmed forms then in theory your tf/idf numbers now both represent "twice" what they normally would, and it balances out. "dog" is not only a raw word but also its own stem, so a tf(docA,dog)=2 for one real instance of "dog" is just as correct as a doc that contains "dogs" and gets tf(docB,dog)=1 and tf(docB,dogs)=1. If you disagree with this line of thinking, a simple way to fix the problem is to use a TokenFilter that removes any tokens at the same position which contain the same text... http://lucene.apache.org/solr/api/org/apache/solr/analysis/RemoveDuplicatesTokenFilter.html -Hoss
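For illustration, a filter with the behavior Hoss describes might look like the sketch below; this is a hypothetical reimplementation against the Lucene 2.x API, not the Solr class behind the link:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class DedupSamePositionFilter extends TokenFilter {
        private Set seenAtPosition = new HashSet();

        public DedupSamePositionFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            for (Token t = input.next(); t != null; t = input.next()) {
                if (t.getPositionIncrement() > 0) {
                    // A new position starts: forget terms seen at the last one.
                    seenAtPosition.clear();
                }
                if (seenAtPosition.add(t.termText())) {
                    return t; // first occurrence of this text at this position
                }
                // Duplicate at the same position: drop it. Its increment is 0,
                // so skipping it does not shift the positions of later tokens.
            }
            return null;
        }
    }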
Re: multiple tokens at the same position
On 5/25/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: > 1) If you index both the raw and stemmed forms, your index is going to grow to roughly the same size regardless of whether the stem and the raw form are in the same field or different fields. Well, using a single field I think we do not need to store the stemmed form of the word if it is indeed the same as the original, so the index will not grow to twice the size. However, one could say that using more than one field it is also possible to not index the root forms twice (but I am not sure how to query them). > 2) For a particular user action, if you only want to search for one form (either raw or stemmed) then your query doesn't have to search over multiple fields [...] If we also stem the query from the user, then in fact we need to search only for the stemmed form. But it is then necessary to determine the language of the query (if possible) in order to stem it. > It's a matter of opinion whether this is "right" or not ... if you are storing both the raw and stemmed forms then in theory your tf/idf numbers now both represent "twice" what they normally would, and it balances out. You are right about that calculation, but I was wondering about tf and idf in cases where we do not store the root forms twice. In that case I think the tf of the root forms increases, while that of the derived forms decreases (since the total number of terms nearly doubles). > If you disagree with this line of thinking, a simple way to fix the problem is to use a TokenFilter that removes any tokens at the same position which contain the same text... > http://lucene.apache.org/solr/api/org/apache/solr/analysis/RemoveDuplicatesTokenFilter.html Thanks for the pointer. I think Erick's way of indexing solves both problems 1 and 2. However, it is effectively no different from storing the forms in separate fields. How would the query performance differ between the case where we index all the forms in one field (without storing root forms twice) and query only this field, and the case where we index the forms in separate fields and run the query against all of them?
Re: Indexing help needed
Thanks for the advice. I just don't see where in the Lucene code I should plug OOParser into Lucene. I've walked the code in LIUS and Nutch (moving on to Solr) trying to find common objects. If I can find common objects in Lucene and Nutch I'll know where to plug in. Lucene objects look like this: IndexWriter, Analyzer, StandardAnalyzer, Document, Reader, FileReader, StringReader, DocumentWriter. But when I search through the Nutch or LIUS code I cannot find these objects. LIUS uses reflection, so I'm not going to find anything in the code, but unfortunately the liusConfig.xml is incomplete and I cannot find the class names for the OpenOffice stuff in it. This is all very frustrating since it should be relatively easy to add support for unsupported formats. The Lucene code is very nice, the LIUS code less so. Lucene seems to be set up to drop in new file formats; I just do not know where to drop them in or what kind of objects need to be dropped in. Oh well, guess I will code up a Reader that just spits out "Here I am" a few hundred times and see what happens. LOL. Thank you for the reply and advice. jim s - Original Message - From: "Andrzej Bialecki" <[EMAIL PROTECTED]> Sent: Friday, May 25, 2007 1:10 PM Subject: Re: Indexing help needed > If you looked at the code in Nutch, you can take most of the parse-oo plugin verbatim, because all this plugin does is extract the text content and metadata from OO files.
Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts
I know you have a solution already that I agree with, but I do think the DisjunctionMaxQuery could serve as the start for writing your own Query that did what you want. Why would you want to? Well, maybe you have other ways you want to search as well and don't want to mess with a custom Similarity, omitting norms, etc., or having to duplicate your fields to support both. Just a thought. Also, see below... On May 25, 2007, at 9:49 AM, Walt Stoneburner wrote: > ps. I wasn't even aware DisjunctionMaxQuery existed; is there a resource that describes the purpose of BooleanQuery, DisjunctionMaxQuery, and others in a simple reference? http://lucene.apache.org/java/docs/scoring.html has some links to the search javadocs, which contain info on the queries. > For instance, if I go to the BooleanQuery page, http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/BooleanQuery.html, it doesn't even say "sum of the field scores" -- maybe I'm looking in the wrong place, but for someone new to the API, it's very hard to figure out what class you want when it's unclear what specific effect it has on scoring. Good point. I am _hoping_ to focus on documentation this year, but I have been saying that for a while now and it is already almost June! I guess, at a minimum, you should write up a bug on how to improve it; even better is a patch! Lucene in Action has good docs on the query types, but, of course, that requires a purchase, so it is less than satisfactory, as good as the book is.
Re: multiple tokens at the same position
Another (obvious) option is to use two indexes and direct the query to the appropriate index depending on the search specification. Of course you double your space requirements, but you're basically going to do that anyway if you use two fields. I chose this for the slight benefit of fewer fields on the index of interest (I need my norms and have *plenty* of fields). My first approach was to just put the stemmed form at the same position as the unstemmed form, but I did not like the options involved in choosing to search just stemmed or just unstemmed. In the end I decided the space savings were just not worth the hassle. - Mark On 5/25/07, Enis Soztutar <[EMAIL PROTECTED]> wrote: > I think Erick's way of indexing solves both problems 1 and 2. However, it is effectively no different from storing the forms in separate fields. How would the query performance differ between the case where we index all the forms in one field (without storing root forms twice) and query only this field, and the case where we index the forms in separate fields and run the query against all of them?
Re: Indexing help needed
jim shirreffs wrote: > Thanks for the advice. I just don't see where in the Lucene code I should plug OOParser into Lucene. I've walked the code in LIUS and Nutch (moving on to Solr) trying to find common objects. If I can find common objects in Lucene and Nutch I'll know where to plug in. You seem to be somewhat confused about what Lucene really is. It's just a library, not an application. It's up to you to provide the logic and glue, or to extend any existing demo application to accommodate your needs. It's also a _plain_ _text_ search library. So if you want to index anything else you need to first convert it to a plain-text format. That's essentially what OOParser does in Nutch: it extracts data from OO documents and converts it to plain text. Disregard the other stuff in that plugin - it has to do with how Nutch passes this data to storage (and indexing takes place in a completely different step, so you won't find it there). Just use the parts that extract plain-text data - and then use this plain text to add fields to Lucene documents. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
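A sketch of the glue being described: extract plain text first, then hand it to Lucene as field values. extractTextFromOO() is a placeholder for whatever parser you adopt (e.g. code lifted from Nutch's parse-oo plugin), and the paths and field names are illustrative:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class OOIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            File f = new File(args[0]);
            // The conversion step Lucene itself does not do for you.
            String plainText = extractTextFromOO(f);
            Document doc = new Document();
            doc.add(new Field("path", f.getPath(),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", plainText,
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }

        private static String extractTextFromOO(File f) {
            // Placeholder: unzip the OO file, parse content.xml, strip markup.
            return "";
        }
    }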
Search Query AND OR for Title and Description Fields
I have a title field and a description field indexed. Now I want to search for "object oriented programming" in either the title or the description using a Lucene search query. How do I do this?
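One way, sketched with the Lucene 2.x MultiFieldQueryParser; the field names follow the question, while the index path is a placeholder:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class TitleDescriptionSearch {
        public static void main(String[] args) throws Exception {
            String[] fields = {"title", "description"};
            MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());
            // Quoting keeps the words together as a phrase; the parser
            // expands the query to match in either field (an OR by default).
            Query query = parser.parse("\"object oriented programming\"");
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " matching documents");
            searcher.close();
        }
    }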