Maybe a bug in Lucene 1.9
I indexed a collection of Chinese documents. I use a special segmentation API to do the analysis, because Chinese segmentation works differently from English. A strange thing happened: with Lucene 1.4 or Lucene 2.0, I can retrieve the corresponding documents given terms that exist in the index's .tis file (I wrote a program to pick terms out of the .tis file and search for them). But with 1.9, for some terms that exist in the index, I couldn't retrieve the corresponding documents. Can anybody give me some advice about this? Thank you in advance.
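A minimal sketch of the kind of term-verification program described above, assuming the Lucene 1.x/2.0-era TermEnum API; the index path is a placeholder:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class TermCheck {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder path
        IndexSearcher searcher = new IndexSearcher(reader);
        TermEnum terms = reader.terms(); // enumerate every term in the index
        while (terms.next()) {
          Term t = terms.term();
          Hits hits = searcher.search(new TermQuery(t));
          if (hits.length() == 0) {
            // Every enumerated term came from the index, so this should never print.
            System.out.println("indexed term with no hits: " + t);
          }
        }
        terms.close();
        searcher.close();
        reader.close();
      }
    }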
fastest way to get raw hit count
hi all, is there a faster way to retrieve ONLY the count of results for a query? Lucene ranks (scores) the first batch of docs and sorts them by rank; I don't need this functionality in certain queries, and I assume that skipping it could return the count faster than hits.length(). Any ideas? Thanks.
Re: fastest way to get raw hit count
On Monday 29 May 2006 15:54, z shalev wrote:
> hi all, is there a faster way to retrieve ONLY the count of results for a query?
>
> Lucene ranks (scores) the first batch of docs and sorts them by rank; I don't
> need this functionality in certain queries, and I assume that skipping it
> could return the count faster than hits.length().

Untested:

    Scorer scorer =
        query.weight(indexSearcher).scorer(indexSearcher.getIndexReader());
    int docCount = 0;
    while (scorer.next()) docCount++;

Regards,
Paul Elschot
Re: fastest way to get raw hit count
: Scorer scorer =
:     query.weight(indexSearcher).scorer(indexSearcher.getIndexReader());

You'd need to rewrite the query first to be safe.

A slightly higher level API approach would be a HitCollector that just counts the hits...

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html

    Searcher searcher = new IndexSearcher(indexReader);
    final int[] count = new int[1]; // use an array container since the variable must be final
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        count[0]++;
      }
    });
    System.out.println("count of matches: " + count[0]);

-Hoss
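For reference, the collector approach wrapped into a self-contained program; the index path, field name, and query term below are placeholder assumptions:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class HitCounter {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");  // placeholder
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new TermQuery(new Term("body", "lucene"));  // placeholder

        final int[] count = new int[1]; // array container since the variable must be final
        // The HitCollector path skips Hits' score sorting and caching entirely,
        // so only the matching doc ids are counted.
        searcher.search(query, new HitCollector() {
          public void collect(int doc, float score) {
            count[0]++;
          }
        });
        System.out.println("count of matches: " + count[0]);
        searcher.close();
        reader.close();
      }
    }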
Re: search performance degrades by order of magnitude when using SortField.
: default Sort.RELEVANCE, query response time is ~6ms. However, when I
: specify a sort, e.g. Searcher.search( query, new Sort( "mydatefield" ) ),
: the query response time gets multiplied by a factor of 10 or 20.
...
: do a top-K ranking over the same number of raw hits. The performance
: gets disproportionately worse as I increase the number of parallel
: threads that query the same Searcher object.

How many sequential queries are you running against the same Searcher instance? ... the performance drop you are seeing may be a result of each of those threads trying to build the same FieldCache on your sort field in parallel. Being 10x or 20x slower sounds like a lot .. but 10x 6ms is still only 60ms :) .. have you timed how long it takes just to build a FieldCache on that field?

: Also, in my previous experience with sorting by a field in Lucene, I
: seem to remember there being a preload time when you first search with
: a sort by field, sometimes taking 30 seconds or so to load all of the
: field's values into the in-memory cache associated with the Searcher
: object. This initial preload time doesn't seem to be happening in my
: case -- does that mean that for some reason Lucene is not caching the
: field values?

That's the FieldCache initialization I was referring to -- it's keyed on reusing the same instance of IndexReader (or IndexSearcher); as long as you are using the same instance over and over, you'll reuse the FieldCache and only pay that cost once (or maybe N times if you have N parallel query threads and they all try to hit the FieldCache immediately). 30 seconds sounds extremely long though ... you may be remembering incorrectly how significant the penalty was.

: I have an index of 1 million documents, taking up about 1.7G of
: diskspace. I specify -Xmx2000m when running my java search
: application.

The big issue when sorting on a field is what type of data is in that field: is it an int? a long? a String? .. if it is a String, how often does the same String value appear for multiple documents? .. these all affect how much RAM the FieldCache takes up. You mentioned sorting by date -- did you store the date as a String? In what format? With what precision?

-Hoss
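A minimal way to time the FieldCache build Hoss asks about, assuming the sort field holds String values; the index path and field name are placeholders:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class FieldCacheTimer {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder path
        long start = System.currentTimeMillis();
        // The first access populates the cache; later calls on the same reader are free.
        String[] values = FieldCache.DEFAULT.getStrings(reader, "mydatefield");
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("FieldCache build over " + values.length
            + " docs took " + elapsed + " ms");
        reader.close();
      }
    }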
Re: Aggregating category hits
I know I'm a little late replying to this thread, but in my humble opinion the best way to aggregate values (not necessarily terms, but whole values in fields) is as follows:

Startup stage:
- For each field you would like to aggregate, create a hashmap.
- Open an index reader and run through all the docs, getting the values to be aggregated from the fields of each doc.
- Create a hashcode for each value collected from each field. The hashcode should have some sort of prefix indicating which field it came from (for example: 1 = author, 2 = year) and hence which hash it is stored in (at retrieval time, this prefix can be used to easily retrieve the value from the correct hash). Place the hashcode/value pair in the appropriate hash.
- Create an arraylist; at index X in the arraylist, place an int array of all the hashcodes associated with doc id X. So, for example, if doc id 0 contains the values "william shakespeare" and "1797", the arraylist at index 0 will hold an int array containing 2 values (the 2 hashcodes of shakespeare and 1797).

Run time:
- Receive the hits and iterate through the doc ids, aggregating the values with direct access into the arraylist (for doc id 10, go to index 10 in the arraylist to retrieve the array of hashcodes) and lookups into the hashmaps.

I tested this today on a small index of approx 400,000 docs (1GB of data), running queries that return over 100,000 results; my response time was about 550 milliseconds on those large result sets. Another point: this method should be scalable for much larger indexes as well, as it is linear in the result set size and not the index size (which is a HUGE bonus). If anyone wants the code, let me know (a rough sketch of the idea follows below).

Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> Thanks, all. The field cache and the bitsets both seem like good options
> until the collection grows too large, provided that the index does not
> need to be updated very frequently. Then for large collections, there's
> statistical sampling. Any of those options seems preferable to retrieving
> all docs all the time.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
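A rough sketch of the structure described above, assuming the Lucene 1.9/2.0-era API; the field names, index path, and 4-bit prefix scheme are illustrative choices, and hash collisions within a field are ignored here:

    import java.util.ArrayList;
    import java.util.HashMap;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;

    public class ValueAggregator {
      private final String[] fields = { "author", "year" }; // illustrative field names
      private final HashMap[] valueMaps = new HashMap[fields.length]; // code -> value, one map per field
      private final ArrayList codesByDoc = new ArrayList(); // index = doc id, element = int[] of codes

      // Startup stage: one pass over the whole index.
      public ValueAggregator(IndexReader reader) throws Exception {
        for (int f = 0; f < fields.length; f++) valueMaps[f] = new HashMap();
        for (int id = 0; id < reader.maxDoc(); id++) {
          if (reader.isDeleted(id)) { codesByDoc.add(new int[0]); continue; }
          Document doc = reader.document(id);
          int[] codes = new int[fields.length];
          for (int f = 0; f < fields.length; f++) {
            String value = doc.get(fields[f]);
            if (value == null) continue; // codes[f] stays 0, meaning "no value"
            // Top 4 bits carry the field index, so each code identifies its map.
            codes[f] = ((f + 1) << 28) | (value.hashCode() & 0x0FFFFFFF);
            valueMaps[f].put(new Integer(codes[f]), value);
          }
          codesByDoc.add(codes);
        }
      }

      // Run time: tally values over a result set; linear in the number of hits.
      public HashMap aggregate(Hits hits) throws Exception {
        HashMap counts = new HashMap(); // field value -> Integer count
        for (int i = 0; i < hits.length(); i++) {
          int[] codes = (int[]) codesByDoc.get(hits.id(i));
          for (int c = 0; c < codes.length; c++) {
            if (codes[c] == 0) continue;
            int f = (codes[c] >>> 28) - 1; // recover which field/map the code belongs to
            String value = (String) valueMaps[f].get(new Integer(codes[c]));
            Integer n = (Integer) counts.get(value);
            counts.put(value, new Integer(n == null ? 1 : n.intValue() + 1));
          }
        }
        return counts;
      }
    }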