Re: Sort Performance Problems across large dataset

2005-01-24 Thread Stefan Groschupf
Hi,
Do you optimize the index?
Have you tried implementing your own hit collector?
Stefan
On 25.01.2005 at 01:01, Peter Hollas wrote:
I am working on a publicly accessible, Struts-based species database
project where the number of species names is currently at 2.3 million,
and in the near future will be somewhere nearer 4 million (probably 
the largest there is). The species names are typically 1 to 7 words in 
length, and the broad requirement is to be able to do a fulltext 
search across them. It is also necessary to sort the results into 
alphabetical order by species name.

Currently we can issue a simple search query and expect a response 
back in about 0.2 seconds (~3,000 results) with the Lucene index that 
we have built. Lucene gives a much more predictable and faster average
query time than using standard fulltext indexing with MySQL. However,
it returns results in score order, not alphabetically.

To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst
querying. This solution works fine, but is unacceptable since a query 
that returns thousands of results can take upwards of 30 seconds to 
sort them.

My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort 
will be to perform a monthly index rebuild, and return results by 
index order (about a day to re-index!). But ideally there might be a 
way to modify the Lucene API to incorporate a scoring system in a way 
that scores by lexical order.

Any ideas are appreciated!
Many thanks, Peter.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---
company:http://www.media-style.com
forum:  http://www.text-mining.org
blog:   http://www.find23.net


Re: Sort Performance Problems across large dataset

2005-01-24 Thread Xiaohong Yang (Sharon)
Hi Peter,
I just got on the list a few hours ago.  I am still reading the source code.  I 
am not going to send this to the list.
 
I would like to know about the "0.2 sec" query time for 2 million fields: does
it display only the first page (100 or so), not the whole 3,000 found?  It is
very fast, I agree.

If the alphabetic index displays only a link, not the content, then it should
not be very slow, since you only need to sort the part that a user needs.  Maybe
display only the first "A" page, as is done with the regular scored results.
Just my thought; it might not work for you.
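
One way to read the "sort only the part a user needs" suggestion is a bounded
selection of the first page rather than a full sort. The sketch below is plain
Java, not Lucene code; the class name FirstPage and the method firstPage are
hypothetical, and it assumes the candidate names are already available as
strings.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class FirstPage {
    /** Returns the pageSize alphabetically smallest names, in sorted order. */
    static List<String> firstPage(Collection<String> names, int pageSize) {
        // Max-heap: the head is always the largest name kept so far.
        PriorityQueue<String> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (String name : names) {
            heap.offer(name);
            if (heap.size() > pageSize) {
                heap.poll(); // discard the largest, keeping only pageSize names
            }
        }
        List<String> page = new ArrayList<>(heap);
        Collections.sort(page); // only pageSize elements are fully sorted
        return page;
    }
}
```

With about 3,000 hits and a page of 100, the heap never holds more than 101
names, so the full sort touches only the page itself. Later pages would need a
larger bound or a different strategy.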
 
Do you store the Lucene index in the database or in a text file?
 
Best,
Sharon
LangPower Computing, Inc.
http://www.indexingonline.com




Re: Sort Performance Problems across large dataset

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:01 PM, Peter Hollas wrote:
> I am working on a publicly accessible Struts-based

Well there's the problem right there :))
(just kidding)

> To sort the resultset into alphabetical order, we added the species
> names as a separate keyword field, and sorted using it whilst
> querying. This solution works fine, but is unacceptable since a query
> that returns thousands of results can take upwards of 30 seconds to
> sort them.

30 seconds... wow.

> My question is whether it is possible to somehow return the names in
> alphabetical order without using a String SortField. My last resort
> will be to perform a monthly index rebuild, and return results by
> index order (about a day to re-index!). But ideally there might be a
> way to modify the Lucene API to incorporate a scoring system in a way
> that scores by lexical order.

What about assigning a numeric value field for each document with the
number indicating the alphabetical ordering?  Off the top of my head,
I'm not sure how this could be done, but perhaps some clever hashing
algorithm could do this?  Or consider each character position one digit
in base 26 (or 27 to include a space) and construct a number from
that?  (Though that would be an enormous number, and probably too large.)
Sorry, my off-the-cuff estimating skills are not what they should be.

Certainly sorting by a numeric value is far less resource intensive 
than by String - so perhaps that is worth a try?  At the very least, 
give each document a random number and try sorting by that field (the 
value of the field can be Integer.toString()) to see how it compares 
performance-wise.
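
A minimal sketch of the base-27 idea, in plain Java rather than Lucene code.
The class name NameKey, the constant PREFIX_LEN, and the digit mapping (space
and any non-letter = 0, a..z = 1..26, case ignored) are assumptions made for
illustration. Because 27^13 is below Long.MAX_VALUE but 27^14 is not, only the
first 13 characters fit into a long, so the key orders names by prefix only.

```java
import java.util.Locale;

final class NameKey {
    // 27^13 < Long.MAX_VALUE < 27^14, so 13 base-27 digits fit in a long.
    static final int PREFIX_LEN = 13;

    /** Encodes the first PREFIX_LEN characters of a name as a base-27 number. */
    static long encode(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        long key = 0;
        for (int i = 0; i < PREFIX_LEN; i++) {
            int digit = 0; // space, padding, or any character outside a..z
            if (i < s.length()) {
                char c = s.charAt(i);
                if (c >= 'a' && c <= 'z') {
                    digit = c - 'a' + 1;
                }
            }
            key = key * 27 + digit;
        }
        return key;
    }
}
```

Names that share their first 13 characters get the same key, so a tie-breaker
(for example the full string) would still be needed. If only a 32-bit int were
available as the sort type, the same scheme would hold just 6 characters, since
27^6 fits in an int and 27^7 does not.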

Erik


Re: Sort Performance Problems across large dataset

2005-01-24 Thread Matt Quail
Peter,

> Currently we can issue a simple search query and expect a response back
> in about 0.2 seconds (~3,000 results)

You may want to try something like the following (I do this in FishEye; it
seems to be performant for moderately large field-spaces).

Use a custom HitCollector, and store all the matching doc-ids in a
java.util.BitSet. This will still give you your 0.2-second performance.

Then, use a TermDocs iterator to visit each term in your "species name" 
field, "printing out" (or whatever) each species name if it contains a 
docid in your bitset. Something like this pseudocode:

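// Needs java.util.BitSet and org.apache.lucene.index.{Term, TermEnum, TermDocs}.
// doSearch() is a placeholder: it runs the query and collects hits into a BitSet.
BitSet docs = doSearch(query); // 0.2 seconds
TermEnum te = reader.terms(new Term("species-name", "")); // first term of the field
TermDocs td = reader.termDocs();
try {
  Term t = te.term();
  // Terms come back in sorted order, so matches are reported alphabetically.
  while (t != null && t.field().equals("species-name")) {
    td.seek(te);
    while (td.next()) {
      int docid = td.doc();
      if (docs.get(docid)) {
        System.out.println("match: " + t.text() + " (doc " + docid + ")");
        break; // try next term
      }
    }
    if (!te.next()) {
      break;
    }
    t = te.term();
  }
} finally {
  te.close();
  td.close();
}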
Now, with 2.3 million (or 4 million!) species names, I'm not sure how 
fast it will be to iterate through all the "species-name" termdocs. But 
I would be interested to find out; if you give this code a try, could
you report back your results?
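
The same walk can be tried without an index. The sketch below is a stand-in
built only on java.util, not Lucene code: a SortedMap plays the role of the
sorted term dictionary, an int[] per term plays the role of its TermDocs, and
the class and method names (TermWalk, namesInOrder) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

final class TermWalk {
    /**
     * Visits terms in sorted order and keeps each term that has at least
     * one document in the hit set, so the result is already alphabetical.
     */
    static List<String> namesInOrder(SortedMap<String, int[]> termToDocs, BitSet hits) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : termToDocs.entrySet()) {
            for (int docId : entry.getValue()) {
                if (hits.get(docId)) {
                    matches.add(entry.getKey());
                    break; // one hit is enough; move on to the next term
                }
            }
        }
        return matches;
    }
}
```

The cost is proportional to the number of terms in the field, not to the
number of hits, which is the open question raised above for 2.3 to 4 million
names.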

=Matt


RE: Sort Performance Problems across large dataset

2005-01-25 Thread Peter Hollas
Sharon,

The system that we are working on does recordset paging, so only the first
100 are returned on the first page. But then you are able to page through
the rest of the results.

The Lucene index is stored as a standard FSDirectory on a RAID filesystem,
and it is used only for the initial name lookup, to discover the SpeciesID.
Further requests are handled by the MySQL backend database.

The way that Lucene returns results by default is based on a document
scoring algorithm, and so results are most likely out of alphabetical order.
I think that we will likely have to use a custom HitCollector object to do
the sorting.

Many thanks, Peter.



RE: Sort Performance Problems across large dataset

2005-01-25 Thread Xiaohong Yang (Sharon)
Peter,

Let me continue to post to the list since I am here already.

Either a dynamic on-the-fly sort (hit collector) or a presorted, stored index
would give you the alphabetically sorted species names.  A pre-stored
alphabetical index would be faster.  Even for 4 million names, the index should
not take up too much room if you use an integer to represent each sorted string.
(The integer order represents the alphabetical order of the species names.)

Using the pre-stored index and an array holding the score-sorted document hit
list, it would be reasonably fast to compile the alphabetical list for display.
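
A minimal sketch of such a pre-stored integer ordering, in plain Java rather
than Lucene code. The class RankIndex and its methods are hypothetical, and it
assumes all names can be listed by document id once, at index-build time.

```java
import java.util.Arrays;
import java.util.Comparator;

final class RankIndex {
    /** Built once: ranks[docId] is the alphabetical position of that doc's name. */
    static int[] buildRanks(String[] namesByDocId) {
        Integer[] order = new Integer[namesByDocId.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort document ids by their names; this is the expensive step.
        Arrays.sort(order, Comparator.comparing((Integer id) -> namesByDocId[id]));
        int[] ranks = new int[order.length];
        for (int pos = 0; pos < order.length; pos++) {
            ranks[order[pos]] = pos;
        }
        return ranks;
    }

    /** Per query: reorder the hit list using integer comparisons only. */
    static int[] sortHits(int[] hitDocIds, int[] ranks) {
        return Arrays.stream(hitDocIds).boxed()
                .sorted(Comparator.comparingInt((Integer d) -> ranks[d]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

At 4 bytes per document, a rank array for 4 million names takes about 16 MB.
The ranks have to be rebuilt whenever names are added or removed.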

 

This is just how I would design it; it might not work for you.

What type of database is this?  Is any example of its use available on the web?

Best,
Sharon





Re: Sort Performance Problems across large dataset

2005-01-27 Thread Doug Cutting
Peter Hollas wrote:
> Currently we can issue a simple search query and expect a response back
> in about 0.2 seconds (~3,000 results) with the Lucene index that we have
> built. Lucene gives a much more predictable and faster average query
> time than using standard fulltext indexing with MySQL. This however
> returns results in score order, and not alphabetically.
>
> To sort the resultset into alphabetical order, we added the species
> names as a separate keyword field, and sorted using it whilst querying.
> This solution works fine, but is unacceptable since a query that returns
> thousands of results can take upwards of 30 seconds to sort them.

Are you using a Lucene Sort?  If you reuse the same IndexReader (or 
IndexSearcher) then perhaps the first query specifying a Sort will take 
30 seconds (although that's much slower than I'd expect), but subsequent 
searches that sort on the same field should be nearly as fast as results 
sorted by score.
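
The effect described here, where the first sorted search is slow and later ones
are fast as long as the same reader is reused, can be modelled in plain Java.
This is not Lucene code and makes no claim about Lucene's internals: the
Reader class, the loads counter, and the per-reader cache are all hypothetical
stand-ins chosen for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.WeakHashMap;

final class SortCacheDemo {
    /** Hypothetical stand-in for an open index reader: names by doc id. */
    static final class Reader {
        final String[] names;
        Reader(String[] names) { this.names = names; }
    }

    // One cached array of sort values per reader instance.
    private static final Map<Reader, String[]> CACHE = new WeakHashMap<>();
    static int loads = 0; // how many times the expensive load has run

    /** Expensive step: read every sort value for the field from the reader. */
    private static String[] loadNames(Reader reader) {
        loads++;
        return reader.names.clone();
    }

    /** Sorts hit doc ids by name, loading the sort values only once per reader. */
    static Integer[] sortHits(Reader reader, Integer[] hits) {
        String[] names = CACHE.computeIfAbsent(reader, SortCacheDemo::loadNames);
        Integer[] sorted = hits.clone();
        Arrays.sort(sorted, Comparator.comparing((Integer d) -> names[d]));
        return sorted;
    }
}
```

Calling sortHits twice with the same Reader runs loadNames once; creating a new
Reader for every query runs it every time, which is the pattern to check for if
every sorted search takes 30 seconds.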

Doug