Sorting a Lucene index

2010-08-18 Thread Shelly_Singh
Hi,

I have a Lucene index that contains a numeric field along with certain other 
fields. The order of incoming documents is random and unpredictable. As a 
result, while creating the index, I end up adding docs in random order with 
respect to the numeric field value.

For example, documents may be added in the following order:
12,y,d
100,o,p
1,x,y
23,u,i
31,v,m
22,b,m
109,k,l

My requirement is that at search time, I want the documents in order of the 
numeric field.
One option is to sort on the numeric field at query time.
But this may be a costly operation.

Hence, I am trying to find out whether there is some way for the stored index 
itself to be kept sorted.

Please help.

Thanks and Regards,

Shelly Singh
Center For KNowledge Driven Information Systems, Infosys
Email: shelly_si...@infosys.com
Phone: (M) 91 992 369 7200, (VoIP)2022978622





TermQuery and ConstantScoreQuery on TermsFilter

2010-08-18 Thread Shelly_Singh
Hi,

In my Lucene index, I want to search on a field, but the score or order of the 
returned documents is not important. What matters is which documents are 
returned.

As I do not need scoring or even the default sorting (order by doc id), what is 
the best way to write the query?

I compared the performance of two options - TermQuery and ConstantScoreQuery 
over a TermsFilter. I was expecting a smaller search time with the 
ConstantScoreQuery over the TermsFilter, but it has turned out otherwise.
Please help me understand this behavior.

Thanks and Regards,

Shelly Singh
Center For KNowledge Driven Information Systems, Infosys
Email: shelly_si...@infosys.com
Phone: (M) 91 992 369 7200, (VoIP)2022978622






Re: Sorting a Lucene index

2010-08-18 Thread Anshum
Hi Shelly,
The search results returned are sorted either by relevance, index order, a
stored field, or a custom order.
Since you say you would not be able to maintain the index order, you would
have to do the sort at run time.
Sorting on a stored field is not costly and you may use it comfortably. Btw,
are you facing any actual issues with sort time, or is that a presumption?
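
For reference, a run-time sort on the numeric field is only a couple of lines
(a sketch against the Lucene 2.9/3.0 API; the field and term names are
placeholders, and "searcher" is assumed to be an already-open IndexSearcher):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;

  // "num" is the un-analyzed numeric field to order by.
  Sort byNum = new Sort(new SortField("num", SortField.INT));
  TopDocs hits = searcher.search(new TermQuery(new Term("text", "foo")), null, 100, byNum);
  // hits.scoreDocs are now ordered by the value of "num" instead of by relevance.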

--
Anshum Gupta
http://ai-cafe.blogspot.com


On Wed, Aug 18, 2010 at 5:12 PM, Shelly_Singh wrote:

> Hi,
>
> I have a Lucene index that contains a numeric field along with certain
> other fields. The order of incoming documents is random and un-predictable.
> As a result, while creating an index, I end up adding docs in random order
> with respect to the numeric field value.
>
> For example, documents may be added in following order:
> 12,y,d
> 100,o,p
> 1,x,y
> 23,u,i
> 31,v,m
> 22,b,m
> 109,k,l
>
> My requirement is that at search time, I want the documents in order of the
> numeric field.
> One, option is to do a score/sort on the numeric field.
> But, this may be a costly operation.
>
> Hence, I am trying to find if there is some way, such that, my stored index
> is sorted by itself.
>
> Please help.
>
> Thanks and Regards,
>
> Shelly Singh
> Center For KNowledge Driven Information Systems, Infosys
> Email: shelly_si...@infosys.com
> Phone: (M) 91 992 369 7200, (VoIP)2022978622
>
>
>
>


Re: TermQuery and ConstantScoreQuery on TermsFilter

2010-08-18 Thread Ian Lea
Hard to say - there are many factors involved in searching.  I'd just
use the easiest queries that were fast enough.  If you want a better
answer more info would be useful.  For starters:

What version of Lucene?
How big is the index?
How many hits?
Exactly what do the queries look like (q.toString())?
How are you constructing the filters?
Are the filters cached?

And don't forget to ignore the times of the first few searches on a
new Searcher.
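
For reference, the two variants being compared look roughly like this (a
sketch against the Lucene 2.9/3.0 API; TermsFilter lives in the contrib
queries module, and the field/term names are placeholders):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.CachingWrapperFilter;
  import org.apache.lucene.search.ConstantScoreQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TermsFilter;   // contrib/queries jar

  // Option 1: a plain TermQuery - still computes a score for each hit.
  Query tq = new TermQuery(new Term("field", "value"));

  // Option 2: constant score over a filter - no scoring, but the filter builds
  // its doc-id set up front, which only pays off if the filter is reused.
  TermsFilter tf = new TermsFilter();
  tf.addTerm(new Term("field", "value"));
  Query csq = new ConstantScoreQuery(new CachingWrapperFilter(tf));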


--
Ian.


On Wed, Aug 18, 2010 at 12:47 PM, Shelly_Singh  wrote:
> Hi,
>
> In my index lucene index, I want to search on a field, but the score or order 
> of returned documents is not important. What is important is which documents 
> are returned.
>
> As, I do not need score or even default sorting(order  by docid), what is the 
> best way to write a query.
>
> I compared performance of two options - TermQuery and ConstantScoreQuery on 
> TermsFilter. I was expecting smaller search time with ConstantScoreQuery with 
> TermsFilter, but it has turned out otherwise.
> Please help me understand this behavior.
>
> Thanks and Regards,
>
> Shelly Singh
> Center For KNowledge Driven Information Systems, Infosys
> Email: shelly_si...@infosys.com
> Phone: (M) 91 992 369 7200, (VoIP)2022978622
>
>
>
>




Re: "Natural sorting" of documents in a Lucene index - possible?

2010-08-18 Thread Michel Nadeau
Cool, so I'll try these things -

* Replace timestamps with YYYYMMDD - will minimize the unique term count;
* Use NumericFields for dates and numbers - will remove all string sorting.
Thanks guys!
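
For reference, the NumericField change looks roughly like this (a sketch
against the Lucene 2.9/3.0 API; the "count" field name and the values are just
examples):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.search.NumericRangeQuery;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;

  // Index time: a trie-encoded numeric value instead of a zero-padded string.
  Document doc = new Document();
  doc.add(new NumericField("count", Field.Store.YES, true).setIntValue(46));

  // Search time: range queries and numeric sorting work without padding.
  NumericRangeQuery range = NumericRangeQuery.newIntRange("count", 10, 200, true, true);
  Sort byCount = new Sort(new SortField("count", SortField.INT));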

--

But - to come back to my original question... is there any way to have a
"natural order" of documents other than the DocId in Lucene? For example, is
there any way to have an index automatically sorted on a specific field,
like:

DocId     Count     Data
-------------------------
  5         1       First test
  1         3       Otter
  8         4       Test
  2         8       Aloha
 10        11       Zulu
  9        17       Bingo
  3        46       Alpha test
  6       112       Tango
  4       120       Charlie test
  7       200       Kiwi

Notice that DocId and Data are in random order, but Count is sorted. That would be
the 'natural order' in the index, and searching for 'test' would return (in
that order):

DocId     Count     Data
-------------------------
  5         1       First test
  3        46       Alpha test
  4       120       Charlie test

Already sorted on the Count.

Thanks!

- Mike
aka...@gmail.com


On Tue, Aug 17, 2010 at 4:08 PM, Ian Lea  wrote:

> Using NumericField for dates and other numbers is likely to help a
> lot, and removes padding problems.  I'd try that first, or just sort
> the top n hits yourself.
>
>
> --
> Ian.
>
>
> On Tue, Aug 17, 2010 at 8:46 PM, Michel Nadeau  wrote:
> > I could at least drop hours/mins/sec, we don't need them, so my timestamp
> > could become 'YYYYMMDD', that would cut the number of unique terms at
> least
> > for dates.
> >
> > What about my other question about numbers : *" We do pad our numbers
> with
> > zeros though (for example: 10 becomes 0010, etc.) because we had
> trouble
> > with sorting (100 was smaller than 2) ; is that considered as "string
> > sorting" ? This might explain a part of the problem."* ? Thanks.
> >
> > - Mike
> > aka...@gmail.com
> >
> >
> > On Tue, Aug 17, 2010 at 3:40 PM, Erick Erickson  >wrote:
> >
> >> Hmmm, I glossed over your comment about sorting the top 250. There's
> >> no reason that wouldn't work.
> >>
> >> Well, one way for, say, dates is to store separate fields: YYYY, MM, DD,
> >> HH, MM, SS, MS. That gives you, say, 100 year terms + 12 month terms
> >> + 31 day terms + etc. for a very small total. You pay the price though by
> >> having to change your queries and sorts to respect all 6 fields...
> >>
> >> But I'd only really go there after seeing if other options don't work.
> >>
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Aug 17, 2010 at 3:35 PM, Michel Nadeau 
> wrote:
> >>
> >> > Would our approach to limit the search top 250 documents (and then
> sort
> >> > these 250 documents) work fine ? Or even 250 unique terms with a lot
> of
> >> > users is bad on memory when sorting ?
> >> >
> >> > We didn't look at trie fields - I will do though, thanks for the tip !
> >> >
> >> > We do store the original 'Data' field (only the 'SearchableData' field
> is
> >> > analyzed, all other fields are not analyzed), the users mainly sort on
> >> > numeric values; not a lot on string values (in fact I could completely
> >> drop
> >> > the sort by string feature). We do pad our numbers with zeros though
> (for
> >> > example: 10 becomes 0010, etc.) because we had trouble with
> sorting
> >> > (100
> >> > was smaller than 2) ; is that considered as "string sorting" ? This
> might
> >> > explain a part of the problem.
> >> >
> >> > Why/how would I reduce the count of unique terms?
> >> >
> >> >
> >> > - Mike
> >> > aka...@gmail.com
> >> >
> >> >
> >> > On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson <
> erickerick...@gmail.com
> >> > >wrote:
> >> >
> >> > > If you have tens of millions of documents, almost all with unique
> >> fields
> >> > > that you're sorting on, you'll chew through memory like there's no
> >> > > tomorrow.
> >> > >
> >> > > Have you looked at trie fields? See:
> >> > >
> >> > >
> >> >
> >>
> http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
> >> > >
> >> > > I'm a little concerned that the user can sort on Data. Any field
> used
> >> for
> >> > > sorting
> >> > > should NOT be analyzed, so unless you are indexing "Data"
> unanalyzed,
> >> > > that's
> >> > > a problem. And if you are sorting on strings unique to each
> document,
> >> > > that's
> >> > > also a memory hog. Not to mention whether capitalization counts.
> >> > >
> >> > > You might enumerate the terms in your index for each of the sortable
> >> > fields
> >> > > to figure out what the total number of unique terms each is and use
> >> that
> >> > as
> >> > > a basis for reducing their count
> >> > >
> >> > > HTH
> >> > > Erick
> >> > >
> >> > > On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau 
> >> wrote:
> >> > >
> >> > > > Hi Erick,
> >> > > >
> >> > > > Here's some more details about our structure. First here's an
> example
> >> > of
> >> > > > document in our index :
>

Re: "Natural sorting" of documents in a Lucene index - possible?

2010-08-18 Thread Ian Lea
> But - to come back to my original question... is there any way to have a
> "natural order" of documents other than the DocId in Lucene?

No.


--
Ian.



Re: "Natural sorting" of documents in a Lucene index - possible?

2010-08-18 Thread Michel Nadeau
Thanks !

- Mike
aka...@gmail.com


On Wed, Aug 18, 2010 at 10:37 AM, Ian Lea  wrote:

> > But - to come back to my original question... is there any way to have a
> > "natural order" of documents other that the DocId In Lucene?
>
> No.
>
>
> --
> Ian.
>
>

Re: "Natural sorting" of documents in a Lucene index - possible?

2010-08-18 Thread Michel Nadeau
Can you guys tell me more about "warm-up query" strategies?

I know that once you have made one query, the second time is super quick because
it's in the cache - but how can you do warm-up queries when you don't know what
users are going to search for?

- Mike
aka...@gmail.com


On Wed, Aug 18, 2010 at 11:26 AM, Michel Nadeau  wrote:

> Thanks !
>
> - Mike
> aka...@gmail.com
>
>

Re: "Natural sorting" of documents in a Lucene index - possible?

2010-08-18 Thread Ian Lea
> Can you guys tell me more about "warm up queries" strategies ?
>
> I know that once you made one query, the second time is super quick because
> it's in cache - but how can you do warm up queries when you don't know what
> users are going to search ?

It's not so much that the hits or queries are cached (they aren't), but
that some Lucene internal structures are loaded from disk.  And if you
are sorting, the field cache is also loaded.  So it usually doesn't
matter exactly what queries you use.  One strategy is to keep a list
of the last n queries and execute them.  Another is just to pick some
arbitrary queries that you know are representative of real queries,
e.g. if people search on title and sort on date then use a query that
does that.
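
As a sketch of the "replay recent queries" idea (against the Lucene 2.9/3.0
API; the sort field and the source of the query list are placeholders):

  import java.io.IOException;
  import java.util.List;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;

  class SearcherWarmer {
    // Run a few representative queries against a freshly opened searcher so the
    // field caches used for sorting are loaded before real traffic hits it.
    static void warm(IndexSearcher newSearcher, List<Query> recentQueries) throws IOException {
      Sort typicalSort = new Sort(new SortField("date", SortField.LONG));
      for (Query q : recentQueries) {
        newSearcher.search(q, null, 10, typicalSort);
      }
    }
  }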


--
Ian.




Problems with Lucene 3.0.2 and Java 1.6.0_12

2010-08-18 Thread Nader, John P

This is a follow-up to my original post about term browsing performance 
problems after our upgrade to Lucene 3.0.2.  The suggestions were helpful and 
did give us a performance increase.  However, in a full-scale environment under 
load, our performance issue remained a problem.

Our investigation led us to an issue with our patch level of Java.  This is 
not too surprising considering we were on revision 12, and the current is 21.  
The behavior we saw was that some JVMs would come up in a state where browsing 
ran very slowly, while others would run as expected.  The JVM would stay in 
that state until it was restarted.

The only change we made was to upgrade our JVM to 1.6.0_21.  At that point, all 
JVMs performed consistently with no issues.  I wanted to make sure I share this 
info with the forum as others may encounter similar problems.

I have no specific info about what changed between 2.4.0 and 3.0.2 that would 
cause an issue with Java 1.6.0_12.  Nor do any Java release notes indicate what 
might have been fixed to address this issue.  I strongly suspect that it is JIT 
compiler related and would be glad to share thoughts on this with anyone who 
is interested.

-John


-Original Message-
From: Nader, John P [mailto:john.na...@cengage.com] 
Sent: Friday, July 30, 2010 3:17 PM
To: java-user@lucene.apache.org
Subject: RE: Term browsing much slower in Lucene 3.x.x

Mike,

We took your suggestion and refactored like this:

 TermEnum termEnum = indexReader.terms(new Term(field, "0"));
 TermDocs allTermDocs = indexReader.termDocs();   // one reusable TermDocs instead of one per term

 while (termEnum.next() && termEnum.term().field().equals(field)) {
   allTermDocs.seek(termEnum);                    // reposition on the current term
   while (allTermDocs.next()) {
     // ...do something with each doc...
   }
 }

The results were much better than creating a new TermDocs for each term.  We 
were about 6x faster than the old algorithm in Lucene 3.0.2, and 3x faster than 
the old algorithm in Lucene 2.4.0.

Thanks much for your help.

With respect to a 3.0.2 enhancement that would yield the same performance 
without using different APIs, I'm not sure what the impact would be.  We 
definitely have proven the synchronization had a dramatic impact in our 
environment.  But the synchronization in the constructor looks like it is 
necessary in other API calls.

BTW, that environment is Java 1.6.0_12 on 64-bit SUSE Linux with 32G of RAM and 
using MMapDirectory.

Thanks.

-John


-Original Message-
From: Nader, John P [mailto:john.na...@cengage.com] 
Sent: Thursday, July 29, 2010 5:49 PM
To: java-user@lucene.apache.org
Subject: RE: Term browsing much slower in Lucene 3.x.x

Thanks much for your response.  Yes, our terms are sorted in index-sort order.  
I think you have a good suggestion, which is to get the term docs once and then 
seek to each term.  I will try that approach and report back to the forum on 
the results.

Like you, I am surprised by the overhead of the added synchronization.  I don't 
think it is waiting on locks, but rather the memory flushing and loading that goes on.

-John

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, July 29, 2010 5:55 AM
To: java-user@lucene.apache.org
Subject: Re: Term browsing much slower in Lucene 3.x.x

On Wed, Jul 28, 2010 at 2:39 PM, Nader, John P  wrote:
> We recently upgraded from lucene 2.4.0 to lucene 3.0.2.  Our load testing 
> revealed a serious performance drop specific to traversing the list of terms 
> and their associated documents for a given indexed field.  Our code looks 
> something like this:
>
> for(Term term : terms) {
> TermDocs termDocs = indexReader.termDocs(term);
> while(termDocs.next()) {   //  much slower here
>    int doc = termDocs.doc();
>    ...do something with each doc...
>    }
> }

Is that IndexReader reading multiple segments or single segment?

> The slowness is all on the first call to TermDocs.next() for each term.  
> Further investigation comparing 2.4.0 and 3.0.2 revealed that there is some 
> new synchronization on the SegmentTermDocs constructor and the 
> SegmentReader.getTermsReader().  The first call to next() hits this 
> synchronization, causing a 4x slowdown on an 8 CPU machine.

There was some added sync, however, the code within those sync blocks
is minuscule (looking up a field).  It's weird that you're seeing a 4X
hit because of this.  We could conceivably optimize this code to avoid
the sync blocks if the reader is readOnly.

> My first question is should we be using a different approach to process each 
> term's doc list that would be more efficient?  The synchronization appears to 
> be on aspects of these classes that the next() operation is not concerned 
> with.

Are you sorting your terms in index-sort order (UTF16, ie
String.compareTo)?  This can be an important gain especially if you
have many terms.

Also, if you are working with your top reader, you should see some
perf gain by instead working w/ the sub readers directly, ie:

  for(Inde
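
(The quoted code is cut off above. For reference - not the original text - a
rough sketch of iterating the sub-readers directly in 2.9/3.0 looks like this:)

  import org.apache.lucene.index.IndexReader;

  // Work per segment instead of through the top-level reader; note this
  // may return null if the reader is already a single segment.
  IndexReader[] subReaders = topReader.getSequentialSubReaders();
  for (IndexReader sub : subReaders) {
    // run the TermEnum/TermDocs loop against "sub" instead of "topReader"
  }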

Re: Solr SynonymFilter in Lucene analyzer

2010-08-18 Thread Arun Rangarajan
I think the Lucene WhitespaceTokenizer I am using inside Solr's SynonymFilter
is what prevents multi-word synonyms like "New York" from getting
mapped to the generic synonym name like CONCEPTcity. It appears to me that
an analyzer which recognizes that there is a whitespace inside a synonym like
"New York" will be required. Do I need to implement one like this, or is
there already an analyzer I can use? It looks like I am missing something here,
since Solr's SynonymFilter is supposed to handle this. Can someone tell me
what the correct way is to integrate Solr's SynonymFilter within a custom
Lucene analyzer? Thanks.


On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
wrote:

> I am trying to have multi-word synonyms work in lucene using Solr's *
> SynonymFilter*.
>
> I need to match synonyms at index time, since many of the synonym lists are
> huge. Actually they are really not synonyms, but are words that belong to a
> concept. For example, I would like to map {"New York", "Los Angeles", "New
> Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
> "city". While searching, the user query for the concept "city" will be
> translated to a keyword like, say "CONCEPTcity", which is the synonym for
> any city name.
>
> Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
> all I could match for "CONCEPTcity" is single word city names like
> "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
> names like "New York", "Los Angeles", etc.,
>
> I tried using Solr's SynonymFilter in tokenStream method in a custom
> Analyzer (that extends org.apache.lucene.analysis.
> Analyzer - lucene ver. 2.9.3) using:
>
> *public TokenStream tokenStream(String fieldName, Reader reader) {
> TokenStream result = new SynonymFilter(
> new WhitespaceTokenizer(reader),
> synonymMap);
> return result;
> }
> *
> where *synonymMap* is loaded with synonyms using
>
> *synonymMap.add(conceptTerms, listOfTokens, true, true);*
>
> where *conceptTerms* is of type *ArrayList<String>* and holds all the terms in a
> concept and *listofTokens* is of type *List<Token>* and contains only the
> generic synonym identifier like *CONCEPTcity*.
>
> When I print synonymMap using synonymMap.toString(), I get the output like
>
> <{New York=<{Chicago=<{Seattle=<{New
> Orleans=<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>}>
>
> so it looks like all the synonyms are loaded. But if I search for
> "CONCEPTcity" then it says no matches found. I am not sure whether I have
> loaded the synonyms correctly in the synonymMap.
>
> Any help will be deeply appreciated. Thanks!
>


Re: Solr SynonymFilter in Lucene analyzer

2010-08-18 Thread Lance Norskog
Yes, you need an analyzer that leaves successive words together as one
long term. This might be easier to do with the new CharFilter tool,
which processes text before it goes to the tokenizer.

What you are doing here is similar to Parts-Of-Speech analysis, where
text analysis software parses a sentence and labels words 'Noun',
'Verb', etc. One suite stores these labels as payloads on the terms.
This might be a better way to store your categories, rather than using
the synonym filter.
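
One quick way to check whether the SynonymFilter is actually emitting
CONCEPTcity for the multi-word names is to dump the custom analyzer's token
stream (a sketch against the Lucene 2.9 attribute API; MyConceptAnalyzer and
the sample text are placeholders for your own code):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

  // Print every token the analyzer produces; if no SYNONYM-typed CONCEPTcity
  // token shows up near "new"/"york", the mapping is not applied at index time.
  Analyzer analyzer = new MyConceptAnalyzer();   // placeholder for the custom analyzer
  TokenStream ts = analyzer.tokenStream("body", new StringReader("flights to New York"));
  TermAttribute term = ts.addAttribute(TermAttribute.class);
  TypeAttribute type = ts.addAttribute(TypeAttribute.class);
  while (ts.incrementToken()) {
    System.out.println(term.term() + " [" + type.type() + "]");
  }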

On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan
 wrote:
> I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter
> is the one that prevents multi-word synonyms like "New York" from getting
> mapped to the generic synonym name like CONCEPTYcity. It appears to me that
> an analyzer which recognizes that a white-space is inside a synonym like
> "New York" will be required. Do I need to implement one like this or is
> there already an analyzer I can use? Looks like I am missing something here,
> since Solr's SynonymFilter is supposed to handle this. Can someone tell me
> what is the correct way to integrate Solr's SynonymFilter within a custom
> lucene analyzer? Thanks.
>
>
> On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
> wrote:
>
>> I am trying to have multi-word synonyms work in lucene using Solr's *
>> SynonymFilter*.
>>
>> I need to match synonyms at index time, since many of the synonym lists are
>> huge. Actually they are really not synonyms, but are words that belong to a
>> concept. For example, I would like to map {"New York", "Los Angeles", "New
>> Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
>> "city". While searching, the user query for the concept "city" will be
>> translated to a keyword like, say "CONCEPTcity", which is the synonym for
>> any city name.
>>
>> Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
>> all I could match for "CONCEPTcity" is single word city names like
>> "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
>> names like "New York", "Los Angeles", etc.,
>>
>> I tried using Solr's SynonymFilter in tokenStream method in a custom
>> Analyzer (that extends org.apache.lucene.analysis.
>> Analyzer - lucene ver. 2.9.3) using:
>>
>> *    public TokenStream tokenStream(String fieldName, Reader reader) {
>>         TokenStream result = new SynonymFilter(
>>                 new WhitespaceTokenizer(reader),
>>                 synonymMap);
>>         return result;
>>     }
>> *
>> where *synonymMap* is loaded with synonyms using
>>
>> *synonymMap.add(conceptTerms, listOfTokens, true, true);*
>>
>> where *conceptTerms* is of type *ArrayList<String>* and holds all the terms in a
>> concept and *listofTokens* is of type *List<Token>* and contains only the
>> generic synonym identifier like *CONCEPTcity*.
>>
>> When I print synonymMap using synonymMap.toString(), I get the output like
>>
>> <{New York=<{Chicago=<{Seattle=<{New
>> Orleans=<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>}>
>>
>> so it looks like all the synonyms are loaded. But if I search for
>> "CONCEPTcity" then it says no matches found. I am not sure whether I have
>> loaded the synonyms correctly in the synonymMap.
>>
>> Any help will be deeply appreciated. Thanks!
>>
>



-- 
Lance Norskog
goks...@gmail.com




RE: Sorting a Lucene index

2010-08-18 Thread Shelly_Singh
Hi Anshum,

I require sorted results for all my queries and the field on which I need 
sorting is fixed; this led me to the idea of storing the documents in sorted 
order to avoid the sorting cost on every query.
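
If an offline rebuild is acceptable, one way to get that effect is to copy the
documents into a fresh index in ascending order of the numeric field, so that
doc-id order matches the field order (a rough sketch only, against the 2.9/3.0
API; it assumes all fields are stored and the sort values are unique, and the
paths and the "num" field name are placeholders):

  import java.io.File;
  import java.util.Map;
  import java.util.TreeMap;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  // Collect doc ids keyed by the numeric field value, then re-add in that order.
  IndexReader reader = IndexReader.open(FSDirectory.open(new File("/old/index")), true);
  Map<Long, Integer> order = new TreeMap<Long, Integer>();
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (!reader.isDeleted(i)) {
      order.put(Long.parseLong(reader.document(i).get("num")), i);
    }
  }
  IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/new/index")),
      new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
  for (int docId : order.values()) {
    writer.addDocument(reader.document(docId));   // re-adds only stored field values
  }
  writer.close();
  reader.close();

Whether that actually removes the need for a Sort at query time depends on the
queries, so it is probably worth measuring the plain run-time sort first, as
Anshum suggests.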

Thanks and Regards,

Shelly Singh
Center For KNowledge Driven Information Systems, Infosys
Email: shelly_si...@infosys.com
Phone: (M) 91 992 369 7200, (VoIP)2022978622

-Original Message-
From: Anshum [mailto:ansh...@gmail.com] 
Sent: Wednesday, August 18, 2010 5:21 PM
To: java-user@lucene.apache.org
Subject: Re: Sorting a Lucene index

Hi Shelly,
The search results so returned are sorted either by relevance, index order,
stored field, or custom order.
As you are saying that you would not be able to maintain the index order,
 you would have to do the sort at run time.
Sorting on a stored field is not costly and you may use it comfortably. btw,
are you facing any issues in sort time or is it a presumption?

--
Anshum Gupta
http://ai-cafe.blogspot.com


On Wed, Aug 18, 2010 at 5:12 PM, Shelly_Singh wrote:

> Hi,
>
> I have a Lucene index that contains a numeric field along with certain
> other fields. The order of incoming documents is random and un-predictable.
> As a result, while creating an index, I end up adding docs in random order
> with respect to the numeric field value.
>
> For example, documents may be added in following order:
> 12,y,d
> 100,o,p
> 1,x,y
> 23,u,i
> 31,v,m
> 22,b,m
> 109,k,l
>
> My requirement is that at search time, I want the documents in order of the
> numeric field.
> One, option is to do a score/sort on the numeric field.
> But, this may be a costly operation.
>
> Hence, I am trying to find if there is some way, such that, my stored index
> is sorted by itself.
>
> Please help.
>
> Thanks and Regards,
>
> Shelly Singh
> Center For KNowledge Driven Information Systems, Infosys
> Email: shelly_si...@infosys.com
> Phone: (M) 91 992 369 7200, (VoIP)2022978622
>
>
>
>




Re: cluster documents based on fields' values

2010-08-18 Thread Stanislaw Osinski
>
> A colleague of mine also discovered solr's clustering component -
> http://wiki.apache.org/solr/ClusteringComponent. It's still labeled as
> experimental - does anybody have experience with it?
>

The clustering component is based on the Carrot2 project (
project.carrot2.org). Carrot2 has been around for many years and is a mature
piece of software. However, the scope of the clustering component is
currently limited to post-retrieval clustering, ie. clustering of search
results, not the whole index. If you're looking for large-scale clustering,
Mahout would be the way to go.

Cheers,

Stanislaw


asking about incremental update

2010-08-18 Thread Yakob
hello all,
you may remember me as the one who ask about how to understand lucene
in the previous email,but I have now been able to create a sample
application of lucene. I read the book and able to test it. which to
me is very great, as I am a new learner.

here is my proof.

http://jacobian.web.id/2010/08/09/how-to-use-lucene-part-1/

But now I am taking Lucene to a higher level: I have been tasked to create
an index that can update itself - a so-called incremental update.
Basically, Lucene will index the text file periodically, store the index
in memory first, and then after a while store it on the hard disk.

Can anyone give me an idea of how these things can be done? Maybe there is
a sample application out there that I might have missed but that would be
of great help for me to learn about incremental updates.
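
For reference, the usual shape of an incrementally updated index is roughly
this (an untested sketch against the Lucene 3.0 API; the path and the field
name are placeholders):

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  // Open the existing on-disk index for appending (create=false keeps what is there).
  IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")),
      new StandardAnalyzer(Version.LUCENE_30), false, IndexWriter.MaxFieldLength.UNLIMITED);
  // New documents are buffered in RAM until roughly this many MB, then flushed to disk.
  writer.setRAMBufferSizeMB(32);

  // Each time the text file is re-read, add (or updateDocument) the new content.
  Document doc = new Document();
  doc.add(new Field("contents", "newly read text", Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);

  // Periodically make the buffered documents durable and visible to new readers.
  writer.commit();

After each commit, reopening the IndexReader (IndexReader.reopen()) is what
makes the newly added documents searchable.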

any help would be greatly appreciated.

-- 
http://jacobian.web.id
