date:20100222

Boost Problem (again), need example !

2010-02-22 Thread pdaures


Hi,
I know that there are many topics about scoring issues, but I couldn't find an
answer in them.
This is the problem:
Imagine I'm a teacher, and I have to index all the results, comments and
scores of my students.

Student :
String name (eg : John Smith)
String comments : (eg: John is a good student, but he needs to be more self
confident bla bla bla)
float score (eg : 98)

I have to index all the students, and when I use the search class I want to
get the best students first. So, if John Smith is a better student than John
Mickael, when I search for "John" I want to get John Smith BEFORE John Mickael.

To do that, I'm using BooleanQuery to search in name and comment fields.

First, I thought I could use the method Document.setBoost(float boost)
while indexing each student, with boost = Student.score. But the result was not
what I expected; it didn't work correctly.

Then I thought I could use a FunctionQuery to search:
FunctionQuery functionQuery = new FunctionQuery(new ReverseOrdFieldSource("score"));
But the result was still incorrect.

I don't know what I'm doing wrong. Could you help me find a solution?
Thank you :)
-- 
View this message in context: 
http://old.nabble.com/Boost-Problem-%28again%29%2C-need-example-%21-tp27684388p27684388.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Boost Problem (again), need example !

2010-02-22 Thread Ian Lea

Can't you simply sort by descending score (your score, not lucene's)?
Seems to me that would give you what you are asking for.

The setBoost() method is unlikely to work consistently because it only
influences the score rather than setting it.  If your John Mickael doc
happens to have a higher Lucene score, because of the normal
idf/tf/etc stuff, then the setBoost() with a higher value for John
Smith may well not be enough to force John Smith to the top.
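
The point about a boost only influencing the score can be illustrated with a toy model. This is a minimal sketch in plain Java, not Lucene code: it assumes the final score is simply a text relevance value multiplied by the boost, and the class name, the relevance numbers and the boost of 60 for John Mickael are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model only; none of this is Lucene API. */
final class BoostInfluenceSketch {

    /** One search hit in the model. */
    static final class Hit {
        final String name;
        final float relevance; // stands in for the tf/idf part of the score (made-up value)
        final float boost;     // stands in for the value passed to setBoost()

        Hit(String name, float relevance, float boost) {
            this.name = name;
            this.relevance = relevance;
            this.boost = boost;
        }

        /** Assumed combination: the boost multiplies the relevance. */
        float score() {
            return relevance * boost;
        }
    }

    /** Hypothetical data: the better student has the weaker text match. */
    static List<Hit> sampleHits() {
        return Arrays.asList(
                new Hit("John Smith", 0.4f, 98f),
                new Hit("John Mickael", 0.9f, 60f));
    }

    /** Returns the hit names ordered by descending modelled score. */
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        List<String> names = new ArrayList<>();
        for (Hit h : sorted) {
            names.add(h.name);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : rank(sampleHits())) {
            System.out.println(name);
        }
    }
}
```

In this model John Mickael scores 0.9 * 60 = 54 and John Smith scores 0.4 * 98 = 39.2, so the student with the lower boost still comes first.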

I don't know enough about function queries to help you much there but
FieldScoreQuery might work.  I can't see any sign of class
FunctionQuery in the 3.0.0 core package so am not clear what that is.


--
Ian.




RE: Boost Problem (again), need example !

2010-02-22 Thread Uwe Schindler

It's CustomScoreQuery in 2.9 and 3.0.

Please wait for 2.9.2 and 3.0.1, which contain an important API change that 
makes this experimental query type work correctly with the new per-segment 
search. You can test the release artifacts of both new versions here: 
http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/

With e.g. ValueSourceQuery you can score your documents using a separate 
numeric field from your documents (it uses FieldCache).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


range of scores : queryNorm()

2010-02-22 Thread Smith G

Hello,
I have observed that even if we change boosting drastically, scores are
still normalized at the end because of the queryNorm value. Is there
anything regarding queryNorm that we can rely on, for example that a score
will always be under 10 or some other fixed value? The main objective is to
provide scores in a fixed range to a partner. Has anyone experienced
anything like this? Is it possible to do so?

Have you ever seen a strange situation where, for a particular query, the
result scores were really high compared to the usual ones? If yes, I would
like to know the factor that affected the scores so drastically, because it
may help me to proceed or to understand such cases.
Thanks

(NOTE: I am sorry, I have also posted this in the Solr group; there were no
replies, and I feel this place is even more apt.)
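
No reply to this question appears in this digest. One client-side option, which is purely an assumption here and not something a list member confirmed, is to rescale the scores of each result list against its top score. The sketch below is plain Java with hypothetical names; it does not touch queryNorm or any Lucene API.

```java
/** Hypothetical helper; rescales one result list into the range [0, upperBound]. */
final class ScoreRangeSketch {

    /**
     * Divides every raw score by the highest raw score in the list and
     * multiplies by upperBound, so the best hit always gets upperBound.
     */
    static float[] scaleToRange(float[] rawScores, float upperBound) {
        float top = 0f;
        for (float s : rawScores) {
            top = Math.max(top, s);
        }
        float[] scaled = new float[rawScores.length];
        if (top <= 0f) {
            return scaled; // empty list or no positive score: leave everything at 0
        }
        for (int i = 0; i < rawScores.length; i++) {
            scaled[i] = rawScores[i] / top * upperBound;
        }
        return scaled;
    }

    public static void main(String[] args) {
        float[] scaled = scaleToRange(new float[] {8f, 4f, 2f}, 10f);
        for (float s : scaled) {
            System.out.println(s);
        }
    }
}
```

Values rescaled this way are only meaningful within a single result list; a 10 for one query says nothing about a 10 for another query.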


RE: Boost Problem (again), need example !

2010-02-22 Thread pdaures


Hi!
Thank you for your help.
I think I'm not using CustomScoreQuery correctly when I do a "search".

BooleanQuery combinedQuery = new BooleanQuery();
combinedQuery.add(textQuery, Occur.MUST);
combinedQuery.add(titleQuery, Occur.MUST);

CustomScoreQuery customQuery = new CustomScoreQuery(combinedQuery,
    new FieldScoreQuery(BOOST_FIELD, Type.INT));

indexSearcher.search(..., customQuery, ).

To index the BOOST_FIELD, I do this:
Field boostField = new Field(BOOST_FIELD, Integer.toString(boost),
    Field.Store.YES, Field.Index.ANALYZED.NO);


Is that correct ?
Thank you





-- 
View this message in context: 
http://old.nabble.com/Boost-Problem-%28again%29%2C-need-example-%21-tp27684388p27685594.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Boost Problem (again), need example !

2010-02-22 Thread Ian Lea

boostField needs to be indexed to be used in the FieldScoreQuery.

Are you now using one of the latest releases that Uwe mentioned,
with fixes for CustomScoreQuery?

And unless you provide your own implementation of
CustomScoreQuery.customScore() I think that you are still not
guaranteed to get what you want since the default implementation is to
calculate the score as subQueryScore * valSrcScore.
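
The difference between the default combination and a custom one can be sketched in plain Java. This is a model under stated assumptions, not the Lucene class: the type names are hypothetical, and the two-argument customScore method here only imitates the idea of CustomScoreQuery.customScore(), whose real signature should be checked in the javadoc of the Lucene version in use.

```java
/** Plain-Java model of score combination; not the Lucene API. */
final class CustomScoreSketch {

    /** Models the default described above: subQueryScore * valSrcScore. */
    static class ProductScore {
        float customScore(float subQueryScore, float valSrcScore) {
            return subQueryScore * valSrcScore;
        }
    }

    /** Hypothetical override that ignores the text score completely. */
    static class FieldOnlyScore extends ProductScore {
        @Override
        float customScore(float subQueryScore, float valSrcScore) {
            return valSrcScore;
        }
    }

    public static void main(String[] args) {
        ProductScore product = new ProductScore();
        ProductScore fieldOnly = new FieldOnlyScore();
        // Made-up numbers: (text score, student score) for two hits.
        System.out.println(product.customScore(0.9f, 60f) > product.customScore(0.4f, 98f));
        System.out.println(fieldOnly.customScore(0.9f, 60f) > fieldOnly.customScore(0.4f, 98f));
    }
}
```

With the product, the hit with student score 60 can still outrank the one with 98 when its text score is high enough; with the field-only variant the ranking follows the student score alone.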


--
Ian.



RE: Boost Problem (again), need example !

2010-02-22 Thread Uwe Schindler

The simple fix for that is to wrap the subQuery using: new 
ConstantScoreQuery(new QueryWrapperFilter(query)). After that, its score is 
constant and only the ValueSource determines the ranking.

I recommend using NumericField for indexing this boost (no storing needed, 
only indexing, precisionStep=Integer.MAX_VALUE). Otherwise (if using a standard 
Field), the boost field does not need to be stored, but it must be indexed as 
NOT_ANALYZED.
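
Why a constant sub-query score helps can be shown with a small model. This is a sketch in plain Java with hypothetical names, assuming the default product combination mentioned earlier in the thread and a positive constant; it does not call ConstantScoreQuery or any other Lucene class.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Plain-Java model; not the Lucene API. */
final class ConstantScoreSketch {

    /** Assumed default combination: sub-query score times field value. */
    static float combine(float subQueryScore, float fieldValue) {
        return subQueryScore * fieldValue;
    }

    /**
     * Returns document positions ordered by descending combined score when
     * every hit has the same (positive) sub-query score.
     */
    static Integer[] rankWithConstantSubScore(float constant, float[] fieldValues) {
        Integer[] order = new Integer[fieldValues.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
                Comparator.comparingDouble((Integer i) -> combine(constant, fieldValues[i]))
                        .reversed());
        return order;
    }

    public static void main(String[] args) {
        // Made-up student scores for three matching documents.
        Integer[] order = rankWithConstantSubScore(1.0f, new float[] {60f, 98f, 75f});
        System.out.println(Arrays.toString(order));
    }
}
```

Because every hit is multiplied by the same constant, the ordering is the ordering of the field values themselves, which is what the original question asked for.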




RE: Boost Problem (again), need example !

2010-02-22 Thread pdaures


It WORKS!

Thank you so much. I spent a lot of time trying to do that, so thank you again!



Re: PayloadNearSpanScorer explain method

2010-02-22 Thread Peter Keegan

Patch is in JIRA: LUCENE-2272

On Wed, Feb 17, 2010 at 8:40 PM, Peter Keegan wrote:

> Yes, I will provide a patch. Our new proxy server has broken my access to
> the svn repository, though :-(
>
>
> On Tue, Feb 16, 2010 at 1:12 PM, Grant Ingersoll wrote:
>
>> That sounds reasonable.  Patch?
>>
>> On Feb 15, 2010, at 10:29 AM, Peter Keegan wrote:
>>
>> > The 'explain' method in PayloadNearSpanScorer assumes the
>> > AveragePayloadFunction was used. I don't see an easy way to override this
>> > because 'payloadsSeen' and 'payloadScore' are private/protected. It seems
>> > like the 'PayloadFunction' interface should have an 'explain' method that
>> > the Scorer could call. Any thoughts?
>> >
>> > Peter
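
The suggestion quoted above, letting the payload function explain its own result so the scorer does not have to assume an average, can be sketched as follows. This is plain Java with hypothetical names and signatures; it is not the content of the LUCENE-2272 patch and not the real PayloadFunction API.

```java
/** Plain-Java model of the proposed design; not the Lucene API. */
final class PayloadExplainSketch {

    /** Hypothetical stand-in for a payload function that can describe itself. */
    interface PayloadFunctionModel {
        float docScore(int payloadsSeen, float payloadScore);

        String explain(int payloadsSeen, float payloadScore);
    }

    /** Averages the accumulated payload score over the payloads seen. */
    static final class AverageModel implements PayloadFunctionModel {
        public float docScore(int payloadsSeen, float payloadScore) {
            return payloadsSeen > 0 ? payloadScore / payloadsSeen : 1f;
        }

        public String explain(int payloadsSeen, float payloadScore) {
            return "average(" + payloadScore + " / " + payloadsSeen + ") = "
                    + docScore(payloadsSeen, payloadScore);
        }
    }

    /** Treats payloadScore as the running maximum of the payloads seen. */
    static final class MaxModel implements PayloadFunctionModel {
        public float docScore(int payloadsSeen, float payloadScore) {
            return payloadsSeen > 0 ? payloadScore : 1f;
        }

        public String explain(int payloadsSeen, float payloadScore) {
            return "max = " + docScore(payloadsSeen, payloadScore);
        }
    }

    /** The scorer delegates instead of hard-coding the average formula. */
    static String explainWith(PayloadFunctionModel function, int payloadsSeen, float payloadScore) {
        return function.explain(payloadsSeen, payloadScore);
    }

    public static void main(String[] args) {
        System.out.println(explainWith(new AverageModel(), 4, 10f));
        System.out.println(explainWith(new MaxModel(), 4, 10f));
    }
}
```

With this shape, swapping the function changes both the score and its explanation without the scorer needing access to private counters.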

Re: Boost Problem (again), need example !

2010-02-22 Thread Erick Erickson

I still don't understand why a simple sort, as suggested by Ian, wouldn't work.
It'd be a lot more reliable than fiddling with doc scores if you want a strict
ordering on a particular field (make sure it's untokenized, though).
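
The strict ordering described here can be modelled in plain Java. This is a sketch with hypothetical names and made-up data, not Lucene code: matching is reduced to a case-insensitive substring test on the name, and the sort uses only the student's own score. In Lucene itself the equivalent would be a Sort on the score field passed to the search call, which this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java model of "filter by text, order by a field"; not the Lucene API. */
final class StudentSortSketch {

    static final class Student {
        final String name;
        final int score;

        Student(String name, int score) {
            this.name = name;
            this.score = score;
        }
    }

    /** Returns the names of matching students, best student score first. */
    static List<String> searchSortedByScore(List<Student> all, String term) {
        String needle = term.toLowerCase();
        List<Student> hits = new ArrayList<>();
        for (Student s : all) {
            if (s.name.toLowerCase().contains(needle)) {
                hits.add(s);
            }
        }
        // The text match decides membership only; the order comes from the score.
        hits.sort(Comparator.comparingInt((Student s) -> s.score).reversed());
        List<String> names = new ArrayList<>();
        for (Student s : hits) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
                new Student("John Mickael", 60),
                new Student("Mary Jones", 99),
                new Student("John Smith", 98));
        System.out.println(searchSortedByScore(students, "john"));
    }
}
```

Here the text relevance plays no part in the ordering, so John Smith is listed before John Mickael regardless of how well each document matches.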

Erick

On Mon, Feb 22, 2010 at 8:19 AM, pdaures  wrote:

>
> It WORKS !
>
> Thank you so much, I spent a lot of time trying to do that, thank you again
> !
>
>
> Uwe Schindler wrote:
> >
> > The simple fix for that is to wrap the subQuery using: new
> > ConstantScoreQuery(new QueryWrapperFilter(query)) - after that its score
> > is constant and the ValueSource only scores.
> >
> > I recommend to use NumericField for indexing this boost (no storing
> > needed, only indexing, precisionStep=Integer.MAX_VALUE). Else (if using
> > standard Field) the boost field does not need to be "stored", it must be
> > indexed as NOT_ANALYZED.
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> >> -Original Message-
> >> From: Ian Lea [mailto:ian@gmail.com]
> >> Sent: Monday, February 22, 2010 12:26 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Boost Problem (again), need example !
> >>
> >> boostField needs to be indexed to be used in the FieldScoreQuery.
> >>
> >> Are you now using one of the the latest releases that Uwe mentioned,
> >> with fixes for CustomScoreQuery?
> >>
> >> And unless you provide your own implementation of
> >> CustomScoreQuery.customScore() I think that you are still not
> >> guaranteed to get what you want since the default implementation is to
> >> calculate the score as subQueryScore * valSrcScore.
> >>
> >>
> >> --
> >> Ian.
> >>
> >>
> >> On Mon, Feb 22, 2010 at 11:00 AM, pdaures 
> >> wrote:
> >> >
> >> > HI !
> >> > Thank you for your help.
> >> > I think I don't use CustomScoreQuery correctly when I do a "search".
> >> >
> >> > BooleanQuery combinedQuery = new BooleanQuery();
> >> > combinedQuery.add(textQuery, Occur.MUST);
> >> > combinedQuery.add(titleQuery, Occur.MUST);
> >> >
> >> > CustomScoreQuery customQuery = new CustomScoreQuery(combinedQuery,new
> >> > FieldScoreQuery(BOOST_FIELD,Type.INT));
> >> >
> >> > indexSearcher.search(..., customQuery, ).
> >> >
> >> > in order to index the BOOST_FIELD, I do that :
> >> > Field boostField = new Field(BOOST_FIELD, Integer.toString(boost),
> >> > Field.Store.YES, Field.Index.ANALYZED.NO);
> >> >
> >> >
> >> > Is that correct ?
> >> > Thank you
> >> >
> >> >
> >> >
> >> >
> >> > Uwe Schindler wrote:
> >> >>
> >> >> It's CustomScoreQuery in 2.9 and 3.0.
> >> >>
> >> >> Please wait for 2.9.2 and 3.0.1 for an important API change in this
> >> >> experimental query type to work correctly with the new per-segment
> >> >> search!
> >> >> You can test the release artifacts of both new versions here:
> >> >> http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/
> >> >>
> >> >> With e.g. ValueSourceQuery you can score your documents using a
> >> >> separate numeric field from your documents (it uses FieldCache).
> >> >>
> >> >> Uwe
> >> >>
> >> >> -
> >> >> Uwe Schindler
> >> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> >> http://www.thetaphi.de
> >> >> eMail: u...@thetaphi.de
> >> >>
> >> >>> -Original Message-
> >> >>> From: Ian Lea [mailto:ian@gmail.com]
> >> >>> Sent: Monday, February 22, 2010 10:33 AM
> >> >>> To: java-user@lucene.apache.org
> >> >>> Subject: Re: Boost Problem (again), need example !
> >> >>>
> >> >>> Can't you simply sort by descending score (your score, not lucene's)?
> >> >>> Seems to me that would give you what you are asking for.
> >> >>>
> >> >>> The setBoost() method is unlikely to work consistently because it only
> >> >>> influences the score rather than setting it.  If your John Mickeal doc
> >> >>> happens to have a higher lucene score, because of the normal
> >> >>> idf/tf/etc stuff, then the setBoost() with a higher value for John
> >> >>> Smith may well not be enough to force John Smith to the top.
> >> >>>
> >> >>> I don't know enough about function queries to help you much there but
> >> >>> FieldScoreQuery might work.  I can't see any sign of class
> >> >>> FunctionQuery in the 3.0.0 core package so am not clear what that is.
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Ian.
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Mon, Feb 22, 2010 at 8:54 AM, pdaures 
> >> >>> wrote:
> >> >>> >
> >> >>> > Hi,
> >> >>> > I know that there are many topics about scoring issues, but I didn't
> >> >>> > find an answer in the topics.
> >> >>> > This is the problem :
> >> >>> > Imagine I'm a teacher, and I have to index all the results, comments
> >> >>> > and score about students.
> >> >>> >
> >> >>> > Student :
> >> >>> > String name (eg : John Smith)
> >> >>> > String comments : (eg: John is a good student, but he needs to be
> >> >>> > more self confident bla bla bla)
> >> >>> > float score (eg :

Re: range of scores : queryNorm()

2010-02-22 Thread Ian Lea

> I have observed that even if we change boosting
> drastically, scores are being normalized at the end because of the
> queryNorm value. Is there anything (regarding queryNorm) that
> we can rely on?

Dunno.

> like score will always be under 10

No.

> or some fixed  value ?

I think not.

> The main objective is to provide scores in a fixed range to
> the partner. Have you experienced anything like this? Is it
> possible to do so?

You could normalize the scores yourself, probably most easily in a
pass through them once the search has completed.  Beware of comparing
scores across searches and indexes.
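
A minimal sketch of such a post-search pass, in plain Java with no Lucene
dependency. The class name ScoreNormalizer and the choice to divide by the
maximum score are assumptions made for illustration, not something from this
thread or from the Lucene API; the input array stands in for the raw scores
collected from the hits of one search.

```java
/** Hypothetical helper: rescales the raw scores of ONE search into [0, 1]. */
final class ScoreNormalizer {
    private ScoreNormalizer() {}

    /**
     * Divides every score by the largest score, so the best hit maps to 1.0.
     * If no score is positive, an array of zeros is returned.
     */
    static float[] normalizeByMax(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float[] out = new float[rawScores.length];
        if (max <= 0f) {
            return out; // nothing to scale by: leave everything at 0
        }
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] / max;
        }
        return out;
    }
}
```

The result only says how each hit compares with the best hit of the same
search, so the values are still not comparable across searches or indexes.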

> Have you experienced any strange situation where, for a
> particular query, result scores were really high compared to the usual?

Not me, but I rarely look at scores directly.  I just care that the
right docs get found.


--
Ian.


Re: range of scores : queryNorm()

2010-02-22 Thread Erick Erickson

Could you back up a step and tell us what the upper-level
task you're trying to accomplish is? That is, why the partner
wants the number?

Because the raw score in Lucene is only relevant within that
single query, and then only for ranking. The normalized score
*is* in a fixed range already, between 0 and 1. Would it serve
to just modify that and send it back to the partner?

Erick

On Mon, Feb 22, 2010 at 5:26 AM, Smith G  wrote:

> Hello,
> I have observed that even if we change boosting
> drastically, scores are being normalized at the end because of the
> queryNorm value. Is there anything (regarding queryNorm) that
> we can rely on, like the score always being under 10 or some fixed
> value? The main objective is to provide scores in a fixed range to
> the partner. Have you experienced anything like this? Is it
> possible to do so?
> Have you experienced any strange situation where, for a
> particular query, result scores were really high compared to the usual?
> If yes, I would like to know the factor that affected the scores
> drastically, because it may help me to proceed or understand the
> cases.
> Thanks
>
> (NOTE: I am sorry, I have also posted in the Solr group; there were no
> replies, and I feel this place is even more apt.)
>

Scanning docs at index time

2010-02-22 Thread Nigel

I'd like to scan documents as they're being indexed, to find out immediately
if any of them match certain queries.  The goal is to find out if there are
any new hits for these queries as soon as possible, without re-searching the
index over and over (which would be inefficient, and higher latency).  The
documents still need to be indexed (not just scanned) so they can be
searched later with different queries not known at index time.

The indexing throughput is in the tens of millions per day, and there are
maybe a thousand queries or so to be matched.  So this has to work pretty
fast.  (-:  Fortunately the number and size of fields are both fairly small.

This scanning could of course be completely decoupled from the indexing
process.  But my thinking was that since we already have the documents in
hand, and we'll be analyzing various fields in the course of indexing, we
could ideally reuse those token streams somehow for this on-the-fly scanning
process.

I took a look at the org.apache.lucene.index.memory.MemoryIndex class in
contrib.  It looks like that would work, but I'm not sure if it's the most
appropriate solution (for one thing, it would have to re-analyze all the
fields).  Has anyone here done something similar and/or know of other
classes that would be suitable?

Thanks,
Chris

IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Peter Keegan

Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
regular intervals:

1. FSDirectory dir = FSDirectory.open(java.io.File);
2. dir.setLockFactory(new SingleInstanceLockFactory());
3. IndexWriter writer = new IndexWriter(dir, Analyzer, false, maxFieldLen)
4. writer.getReader().getVersion();
5. writer.prepareCommit();
6. writer.getReader().getVersion();
7. writer.commit();
8. writer.close();

I'm using the version number to keep external data in synch with the index.
Usually, the version number from (6) is 1 greater than from (4) and the
version from (4) equals the version from the previous (6). At least once a
day, however, the version from (4) is 1 greater than from the previous (6).
What would explain this sporadic behavior of version numbers?

Thanks,
Peter

Re: IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Jason Rutherglen

Peter,

Perhaps other concurrent operations?

Jason

On Tue, Feb 23, 2010 at 10:43 AM, Peter Keegan  wrote:
> Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
> regular intervals:
>
> 1. FSDirectory dir = FSDirectory.open(java.io.File);
> 2. dir.setLockFactory(new SingleInstanceLockFactory());
> 3. IndexWriter writer = new IndexWriter(dir, Analyzer, false, maxFieldLen)
> 4. writer.getReader().getVersion();
> 5. writer.prepareCommit();
> 6. writer.getReader().getVersion();
> 7. writer.commit();
> 8. writer.close();
>
> I'm using the version number to keep external data in synch with the index.
> Usually, the version number from (6) is 1 greater than from (4) and the
> version from (4) equals the version from the previous (6). At least once a
> day, however, the version from (4) is 1 greater than from the previous (6).
> What would explain this sporadic behavior of version numbers?
>
> Thanks,
> Peter
>


Re: IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Michael McCandless

That's curious.

It's only on prepareCommit (or, commit, if you didn't first prepare,
since that will call prepareCommit internally) that this version
should increase.

Is there only 1 thread doing this?

Oh, and, are you passing false for autoCommit?

Mike

On Mon, Feb 22, 2010 at 11:43 AM, Peter Keegan  wrote:
> Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
> regular intervals:
>
> 1. FSDirectory dir = FSDirectory.open(java.io.File);
> 2. dir.setLockFactory(new SingleInstanceLockFactory());
> 3. IndexWriter writer = new IndexWriter(dir, Analyzer, false, maxFieldLen)
> 4. writer.getReader().getVersion();
> 5. writer.prepareCommit();
> 6. writer.getReader().getVersion();
> 7. writer.commit();
> 8. writer.close();
>
> I'm using the version number to keep external data in synch with the index.
> Usually, the version number from (6) is 1 greater than from (4) and the
> version from (4) equals the version from the previous (6). At least once a
> day, however, the version from (4) is 1 greater than from the previous (6).
> What would explain this sporadic behavior of version numbers?
>
> Thanks,
> Peter
>


Re: IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Peter Keegan

Only one writer thread and one writer process.
I'm calling IndexWriter(Directory d, Analyzer a, boolean create,
MaxFieldLength mfl), which sets autocommit=false.

Peter

On Mon, Feb 22, 2010 at 12:24 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> That's curious.
>
> It's only on prepareCommit (or, commit, if you didn't first prepare,
> since that will call prepareCommit internally) that this version
> should increase.
>
> Is there only 1 thread doing this?
>
> Oh, and, are you passing false for autoCommit?
>
> Mike
>
> On Mon, Feb 22, 2010 at 11:43 AM, Peter Keegan 
> wrote:
> > Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
> > regular intervals:
> >
> > 1. FSDirectory dir = FSDirectory.open(java.io.File);
> > 2. dir.setLockFactory(new SingleInstanceLockFactory());
> > 3. IndexWriter writer = new IndexWriter(dir, Analyzer, false, maxFieldLen)
> > 4. writer.getReader().getVersion();
> > 5. writer.prepareCommit();
> > 6. writer.getReader().getVersion();
> > 7. writer.commit();
> > 8. writer.close();
> >
> > I'm using the version number to keep external data in synch with the index.
> > Usually, the version number from (6) is 1 greater than from (4) and the
> > version from (4) equals the version from the previous (6). At least once a
> > day, however, the version from (4) is 1 greater than from the previous (6).
> > What would explain this sporadic behavior of version numbers?
> >
> > Thanks,
> > Peter
> >
>

Re: IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Michael McCandless

Well I'm at a loss then.  The version should only increment on commit.

Can you make it all happen when infoStream is on, and post back?

Mike

On Mon, Feb 22, 2010 at 12:35 PM, Peter Keegan  wrote:
> Only one writer thread and one writer process.
> I'm calling IndexWriter(Directory d, Analyzer a, boolean create,
> MaxFieldLength mfl), which sets autocommit=false.
>
> Peter
>
> On Mon, Feb 22, 2010 at 12:24 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> That's curious.
>>
>> It's only on prepareCommit (or, commit, if you didn't first prepare,
>> since that will call prepareCommit internally) that this version
>> should increase.
>>
>> Is there only 1 thread doing this?
>>
>> Oh, and, are you passing false for autoCommit?
>>
>> Mike
>>
>> On Mon, Feb 22, 2010 at 11:43 AM, Peter Keegan 
>> wrote:
>> > Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
>> > regular intervals:
>> >
>> > 1. FSDirectory dir = FSDirectory.open(java.io.File);
>> > 2. dir.setLockFactory(new SingleInstanceLockFactory());
>> > 3. IndexWriter writer = new IndexWriter(dir, Analyzer, false, maxFieldLen)
>> > 4. writer.getReader().getVersion();
>> > 5. writer.prepareCommit();
>> > 6. writer.getReader().getVersion();
>> > 7. writer.commit();
>> > 8. writer.close();
>> >
>> > I'm using the version number to keep external data in synch with the index.
>> > Usually, the version number from (6) is 1 greater than from (4) and the
>> > version from (4) equals the version from the previous (6). At least once a
>> > day, however, the version from (4) is 1 greater than from the previous (6).
>> > What would explain this sporadic behavior of version numbers?
>> >
>> > Thanks,
>> > Peter
>> >
>>
>


Re: IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Peter Keegan

I'm pretty sure there are flushes and segment merges going on, but as you
said, that shouldn't affect the version increment. I'll see what I can do to
get infoStream output.

Thanks,
Peter

On Mon, Feb 22, 2010 at 2:30 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Well I'm at a loss then.  The version should only increment on commit.
>
> Can you make it all happen when infoStream is on, and post back?
>
> Mike
>
> On Mon, Feb 22, 2010 at 12:35 PM, Peter Keegan 
> wrote:
> > Only one writer thread and one writer process.
> > I'm calling IndexWriter(Directory d, Analyzer a, boolean create,
> > MaxFieldLength mfl), which sets autocommit=false.
> >
> > Peter
> >
> > On Mon, Feb 22, 2010 at 12:24 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> That's curious.
> >>
> >> It's only on prepareCommit (or, commit, if you didn't first prepare,
> >> since that will call prepareCommit internally) that this version
> >> should increase.
> >>
> >> Is there only 1 thread doing this?
> >>
> >> Oh, and, are you passing false for autoCommit?
> >>
> >> Mike
> >>
> >> On Mon, Feb 22, 2010 at 11:43 AM, Peter Keegan 
> >> wrote:
> >> > Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
> >> > regular intervals:
> >> >
> >> > 1. FSDirectory dir = FSDirectory.open(java.io.File);
> >> > 2. dir.setLockFactory(new SingleInstanceLockFactory());
> >> > 3. IndexWriter writer = new IndexWriter(dir, Analyzer, false, maxFieldLen)
> >> > 4. writer.getReader().getVersion();
> >> > 5. writer.prepareCommit();
> >> > 6. writer.getReader().getVersion();
> >> > 7. writer.commit();
> >> > 8. writer.close();
> >> >
> >> > I'm using the version number to keep external data in synch with the index.
> >> > Usually, the version number from (6) is 1 greater than from (4) and the
> >> > version from (4) equals the version from the previous (6). At least once a
> >> > day, however, the version from (4) is 1 greater than from the previous (6).
> >> > What would explain this sporadic behavior of version numbers?
> >> >
> >> > Thanks,
> >> > Peter
> >> >
> >>

can IndexWriter.addIndexes de-dupe documents?

2010-02-22 Thread jchang


When I call IndexWriter.addIndexes, is there anything I can do to make it
filter out duplicates based on a certain field (or group of fields)?  If I
know that the id field of the document is unique, can I make addIndexes know
that if it finds a new document with the same id, the new one is valid and
the old one should be overwritten (or deleted and the new one added in its
place)?

I don't see anything like a unique constraint in the Field class; I know
Lucene is not a SQL database, but I just wanted to check to make sure I'm
not missing anything.


-- 
View this message in context: 
http://old.nabble.com/can-IndexWriter.addIndexes-de-dupe-documents--tp27694763p27694763.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: can IndexWriter.addIndexes de-dupe documents?

2010-02-22 Thread Michael McCandless

addIndexes doesn't make this possible.

Maybe add the indexes but then make a 2nd pass to dedup?
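
The logic of such a second pass could look roughly like the sketch below. It
is plain Java on in-memory collections, not Lucene code: the Doc class, the
DedupPass name and the "last one seen wins" rule are all assumptions made for
illustration. Against a real index the same decision would drive deletes of
the stale documents; IndexWriter.updateDocument(Term, Document) performs a
delete-then-add by term when documents are added one at a time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical second pass: keep one document per unique id. */
final class DedupPass {
    private DedupPass() {}

    /** Hypothetical stand-in for an indexed document. */
    static final class Doc {
        final String id;
        final String body;

        Doc(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /**
     * Returns one document per id. When several documents share an id, the
     * one that appears last in the input wins (an assumed rule). The output
     * keeps the order in which each id was first seen.
     */
    static List<Doc> keepLatestPerId(List<Doc> docs) {
        Map<String, Doc> latest = new LinkedHashMap<String, Doc>();
        for (Doc d : docs) {
            latest.put(d.id, d); // a later document replaces the earlier one
        }
        return new ArrayList<Doc>(latest.values());
    }
}
```

Which duplicate should win depends on the application; "last one wins" only
makes sense when the merge order reflects document age.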

Mike

On Mon, Feb 22, 2010 at 4:26 PM, jchang  wrote:
>
> When I call IndexWriter.addIndexes, is there anything I can do to make it
> filter out duplicates based on a certain field (or group of fields)?  If I
> know that the id field of the document is unique, can I make addIndexes know
> that if it finds a new document with the same id, the new one is valid and
> the old one should be overwritten (or deleted and the new one added in its
> place)?
>
> I don't see anything like a unique constraint in the Field class; I know
> Lucene is not a SQL database, but I just wanted to check to make sure I'm
> not missing anything.
>


Re: can IndexWriter.addIndexes de-dupe documents?

2010-02-22 Thread Erick Erickson

What sorts of rules would govern which one should be
kept? Say you were adding three indexes and there
was a document in each that was identical. Which one
should be kept?

I suspect any rule would be wrong at least part of
the time.

FWIW
Erick

On Mon, Feb 22, 2010 at 5:02 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> addIndexes doesn't make this possible.
>
> Maybe add the indexes but then make a 2nd pass to dedup?
>
> Mike
>
> On Mon, Feb 22, 2010 at 4:26 PM, jchang  wrote:
> >
> > When I call IndexWriter.addIndexes, is there anything I can do to make it
> > filter out duplicates based on a certain field (or group of fields)?  If I
> > know that the id field of the document is unique, can I make addIndexes know
> > that if it finds a new document with the same id, the new one is valid and
> > the old one should be overwritten (or deleted and the new one added in its
> > place)?
> >
> > I don't see anything like a unique constraint in the Field class; I know
> > Lucene is not a SQL database, but I just wanted to check to make sure I'm
> > not missing anything.
> >

Re: Scanning docs at index time

2010-02-22 Thread Apoorv Sharma

I don't know of any suitable classes, but if these are ordered
queries, some simple code could easily be written.

On Mon, Feb 22, 2010 at 9:59 PM, Nigel  wrote:

> I'd like to scan documents as they're being indexed, to find out immediately
> if any of them match certain queries.  The goal is to find out if there are
> any new hits for these queries as soon as possible, without re-searching the
> index over and over (which would be inefficient, and higher latency).  The
> documents still need to be indexed (not just scanned) so they can be
> searched later with different queries not known at index time.
>
> The indexing throughput is in the tens of millions per day, and there are
> maybe a thousand queries or so to be matched.  So this has to work pretty
> fast.  (-:  Fortunately the number and size of fields are both fairly small.
>
> This scanning could of course be completely decoupled from the indexing
> process.  But my thinking was that since we already have the documents in
> hand, and we'll be analyzing various fields in the course of indexing, we
> could ideally reuse those token streams somehow for this on-the-fly scanning
> process.
>
> I took a look at the org.apache.lucene.index.memory.MemoryIndex class in
> contrib.  It looks like that would work, but I'm not sure if it's the most
> appropriate solution (for one thing, it would have to re-analyze all the
> fields).  Has anyone here done something similar and/or know of other
> classes that would be suitable?
>
> Thanks,
> Chris
>
