Maybe a bug in Lucene 1.9

2006-05-29 Thread hu andy

I indexed a collection of Chinese documents. I use a special segmentation
API to do the analysis, because Chinese word segmentation is different
from English tokenization.

A strange thing happened. With Lucene 1.4 or Lucene 2.0, I can correctly
retrieve the corresponding documents for terms that exist in the index's
*.tis file (I wrote a program to pick terms out of the .tis file and
search for them). But with 1.9, for some terms that exist in the index, I
couldn't retrieve the corresponding documents.
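
(For reference, a minimal sketch of such a term-checking program against
the 1.9-era API -- not the poster's actual code, and the index path is
made up:)

   // enumerate every indexed term and verify each yields at least one hit
   IndexReader reader = IndexReader.open("/path/to/index");
   IndexSearcher searcher = new IndexSearcher(reader);
   TermEnum terms = reader.terms();
   while (terms.next()) {
       Term t = terms.term();
       Hits hits = searcher.search(new TermQuery(t));
       if (hits.length() == 0) {
           System.out.println("indexed but not found: " + t);
       }
   }
   terms.close();
   searcher.close();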

Can anybody give me some advice about this? Thank you in advance.


fastest way to get raw hit count

2006-05-29 Thread zzzzz shalev
hi all,

is there a faster way to retrieve ONLY the count of results for a query?

Lucene ranks (scores) the first batch of docs and sorts them by rank; this
is functionality I don't need for certain queries, and I assume that
skipping it could return the count faster than hits.length().

Any ideas?

Thanks



Re: fastest way to get raw hit count

2006-05-29 Thread Paul Elschot
On Monday 29 May 2006 15:54, z shalev wrote:
> hi all,
>
> is there a faster way to retrieve ONLY the count of results for a query?
>
> Lucene ranks (scores) the first batch of docs and sorts them by rank; this
> is functionality I don't need for certain queries, and I assume that
> skipping it could return the count faster than hits.length().

Untested:

// get a Scorer for the query and walk every matching doc,
// counting instead of building a ranked Hits object
Scorer scorer =
    query.weight(indexSearcher).scorer(indexSearcher.getIndexReader());

int docCount = 0;
while (scorer.next()) docCount++;


Regards,
Paul Elschot




Re: fastest way to get raw hit count

2006-05-29 Thread Chris Hostetter

: Scorer scorer =
:  query.weight(indexSearcher).scorer(indexSearcher.getIndexReader());

You'd need to rewrite the query first to be safe.
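
(A sketch of that rewrite step, reusing the variable names from the
previous message:)

   // expand PrefixQuery, WildcardQuery, etc. into primitive queries first
   query = query.rewrite(indexSearcher.getIndexReader());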

A slightly higher level API approach would be a HitCollector that just
counts the hits...

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html

   Searcher searcher = new IndexSearcher(indexReader);
   // one-element array as a mutable container, since an anonymous
   // class can only reference final local variables
   final int[] count = new int[1];
   searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
           count[0]++;
       }
   });
   System.out.println("count of matches: " + count[0]);



-Hoss





Re: search performance degrades by order of magnitude when using SortField.

2006-05-29 Thread Chris Hostetter
: default Sort.RELEVANCE, query response time is ~6ms.  However, when I
: specify a sort, e.g. Searcher.search( query, new Sort( "mydatefield" )
:  ), the query response time gets multiplied by a factor of 10 or 20.
...
: do a top-K ranking over the same number of raw hits.   The performance
: gets disproportionately worse as I increase the number of parallel
: threads that query the same Searcher object.

How many simultaneous queries are you running against the same Searcher
instance? ... the performance drop you are seeing may be a result of each
of those threads trying to build the same FieldCache on your sort field in
parallel.

being 10x or 20x slower sounds like a lot .. but 10x 6ms is still only
60ms :) .. have you timed how long it takes just to build a FieldCache on
that field?
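
(A quick way to time that, sketched against the FieldCache API; the field
name "mydatefield" comes from the thread, and the assumption that it holds
Strings is mine:)

   // Build the sort cache by hand and time it. In the 1.9-era API,
   // getStringIndex is what the String sort comparator populates.
   // Running this once at startup also doubles as a warm-up, so
   // parallel query threads find the cache already built.
   long start = System.currentTimeMillis();
   FieldCache.DEFAULT.getStringIndex(searcher.getIndexReader(), "mydatefield");
   System.out.println("FieldCache build: "
       + (System.currentTimeMillis() - start) + "ms");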

: Also, in my previous experience with sorting by a field in Lucene, I
: seem to remember there being a preload time when you first search with
: a sort by field, sometimes taking 30 seconds or so to load all of the
: field's values into the in-memory cache associated with the Searcher
: object.  This initial preload time doesn't seem to be happening in my
: case -- does that mean that for some reason Lucene is not caching the
: field values?

that's the FieldCache initialization I was referring to -- it's tied to
the specific instance of IndexReader (or IndexSearcher); as long as you
are using the same instance over and over you'll reuse the FieldCache and
only pay that cost once (or maybe N times if you have N parallel query
threads and they all try to hit the FieldCache immediately).

30 seconds sounds extremely long though ... you may be remembering
incorrectly how significant the penalty was.

: I have an index of 1 million documents, taking up about 1.7G of
: diskspace.  I specify -Xmx2000m when running my java search
: application.

The big issue when sorting on a field is what type of data is in that
field: is it an int? a long? a String? .. if it is a String, how often
does the same String value appear across multiple documents? .. these all
affect how much RAM the FieldCache takes up.  (As a rough example: an int
cache for 1 million docs is a single int array of about 4MB, while a
String sort cache needs a similar int array plus one String object per
distinct value.)  You mentioned sorting by date: did you store the date as
a String? in what format? with what precision?




-Hoss





Re: Aggregating category hits

2006-05-29 Thread zzzzz shalev
I know I'm a little late replying to this thread, but in my humble opinion
the best way to aggregate values (not necessarily terms, but whole field
values) is as follows:

Startup stage:

For each field you would like to aggregate, create a HashMap.

Open an IndexReader and run through all the docs, getting the values to be
aggregated from the fields of each doc.

Create a hashcode for each value collected from each field; the hashcode
should have some sort of prefix indicating which field it's from (for
example: 1 = author, 2 = ...) and hence which hash it is stored in (at
retrieval time, this prefix can be used to easily retrieve the value from
the correct hash). Place the hashcode/value pair in the appropriate hash.

Create an ArrayList. At index X in the ArrayList, place an int array of
all the hashcodes associated with doc id X.

So, for example: if doc id 0 contains the value "william shakespeare" and
the value "1797", the ArrayList at index 0 will have an int array
containing 2 values (the hashcodes of shakespeare and 1797).

Run time:

At run time, receive the hits and iterate through the doc ids, aggregating
the values with direct access into the ArrayList (for doc id 10, go to
index 10 in the ArrayList to retrieve the array of hashcodes) and lookups
into the HashMaps.

I tested this today on a small index, approx 400,000 docs (1GB of data),
but I ran queries returning over 100,000 results. My response time was
about 550 milliseconds on those large (over 100,000) result sets.

Another point: this method should scale to much larger indexes as well,
since its cost is linear in the result set size rather than the index size
(which is a HUGE bonus).

If anyone wants the code, let me know; a rough sketch follows below.
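
(A minimal, hypothetical sketch of the scheme above -- not the poster's
actual code -- in the pre-generics Java of the day. The field names
"author" and "year", the index path, and the prefix arithmetic are all
assumptions; the per-field hashes are collapsed into one map keyed by
prefixed code; hash collisions are ignored for brevity; and "hits" is
assumed to come from a prior search:)

   // Startup: build a code -> value map and a doc id -> codes table.
   Map valuesByCode = new HashMap();       // Integer code -> String value
   List codesByDoc = new ArrayList();      // index = doc id -> int[]
   String[] fields = { "author", "year" }; // assumed stored fields

   IndexReader reader = IndexReader.open("/path/to/index");
   for (int doc = 0; doc < reader.maxDoc(); doc++) {
       if (reader.isDeleted(doc)) { codesByDoc.add(null); continue; }
       Document d = reader.document(doc);
       int[] codes = new int[fields.length];
       for (int f = 0; f < fields.length; f++) {
           String value = d.get(fields[f]); // assumes every doc has both fields
           // prefix the hashcode with the field's ordinal so retrieval
           // can tell which field a code belongs to
           int code = (f + 1) * 100000000
                    + Math.abs(value.hashCode()) % 100000000;
           valuesByCode.put(new Integer(code), value);
           codes[f] = code;
       }
       codesByDoc.add(codes);
   }

   // Run time: count value occurrences over the result set only,
   // so the cost is linear in the number of hits, not the index size.
   Map counts = new HashMap();             // Integer code -> Integer count
   for (int i = 0; i < hits.length(); i++) {
       int[] codes = (int[]) codesByDoc.get(hits.id(i));
       for (int c = 0; c < codes.length; c++) {
           Integer key = new Integer(codes[c]);
           Integer n = (Integer) counts.get(key);
           counts.put(key, new Integer(n == null ? 1 : n.intValue() + 1));
       }
   }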
   
   
  

Marvin Humphrey <[EMAIL PROTECTED]> wrote:
  
Thanks, all.

The field cache and the bitsets both seem like good options until the 
collection grows too large, provided that the index does not need to 
be updated very frequently. Then for large collections, there's 
statistical sampling. Any of those options seems preferable to 
retrieving all docs all the time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





