lucene docs in bulk read?
Hey folks, thanks in advance to any who respond.

I do a good deal of post-search processing, and the file I/O to read the fields I need becomes horribly costly and is definitely a problem. Is there any way to retrieve either (1) the entire doc (all fields that can be retrieved) and/or (2) a group of docs, specified by, say, an array of doc ids?

I've optimized to retrieve the entire list of fields at once instead of one by one, and I also retrieve only the minimal number of fields that I can, but my profilers still show that the Lucene I/O to read the doc fields is where I spend 95% of my time. Of course this is obvious given the nature of how it all works, but can anyone think of a better way to go about retrieving docs in bulk? Are some types of fields quicker or slower than others to retrieve from the index?

--
Chris Fraschetti
e [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: lucene docs in bulk read?
Hi Chris, are your fields string or reader? How large do your fields get?

Kelvin

On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote:
> Hey folks.. thanks in advance to any who respond...
>
> I do a good deal of post-search processing and the file io to read
> the fields I need becomes horribly costly and is definitely a
> problem. Is there any way to either retrieve 1. the entire doc (all
> fields that can be retrieved) and/or 2. a group of docs.. specified
> by say an array of doc ids?
>
> I've optimized to retrieve the entire list of fields instead of 1
> by 1.. and also retrieve only the minimal number of fields that I
> can.. but still my profilers show me that the lucene io to read the
> doc fields is where I spend 95% of my time. Of course this is
> obvious given the nature of how it all works.. but can anyone think
> of a better way to go about retrieving docs in bulk? Are the
> different types of fields quicker/slower than others when
> retrieving them from the index?
Re: lucene docs in bulk read?
Well, all my fields are strings when I index them. They're all very short strings: dates, hashes, etc. The largest field has a cap of 256 chars and there is only one of them; the rest are all fairly small.

Can you explain what you meant by 'string or reader'?

Thanks,
Chris

On Tue, 1 Feb 2005 15:11:18 +0100, Kelvin Tan <[EMAIL PROTECTED]> wrote:
> Hi Chris, are your fields string or reader? How large do your fields get?
>
> Kelvin
Re: lucene docs in bulk read?
Please see inline.

On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
> Well all my fields are strings when I index them. They're all very
> short strings, dates, hashes, etc. The largest field has a cap of
> 256 chars and there is only one of them, the rest are all fairly
> small.
>
> Can you explain what you meant by 'string or reader' ?

Sorry, I meant to ask if you're using String fields (field.stringValue()) or reader fields (field.readerValue()).

Can you elaborate on the post-processing you need to do? Have you thought about concatenating the fields you require into a single non-indexed field (Field.UnIndexed) for simple retrieval? It'll increase the size of your index, but it should be faster to retrieve them all in one go.

Kelvin
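To make the concatenation idea concrete, here is a minimal sketch in plain Java of packing a few short values into one string and splitting them apart again after retrieval. The class name PackedFields, the separator choice, and the sample values are all assumptions made for illustration; none of this is Lucene API. The packed string is what would be stored in the single Field.UnIndexed field mentioned above and read back as one stored value per hit.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: packs several short stored values
// into one string so that a single stored field carries everything needed
// for post-search processing.
final class PackedFields {
    // Assumption: the ASCII unit separator (U+001F) never occurs in the
    // URLs, dates, or hashes being packed.
    private static final char SEP = '\u001F';

    // Joins the values with the separator, rejecting any value that contains it.
    static String pack(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (values[i].indexOf(SEP) >= 0) {
                throw new IllegalArgumentException("value contains the separator");
            }
            if (i > 0) {
                sb.append(SEP);
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }

    // Splits a packed string back into its values; limit -1 keeps trailing
    // empty values instead of silently dropping them.
    static String[] unpack(String packed) {
        return packed.split(String.valueOf(SEP), -1);
    }

    public static void main(String[] args) {
        String packed = pack("http://example.com/page", "20050201", "a1b2c3");
        System.out.println(Arrays.toString(unpack(packed)));
    }
}
```

The trade-off is the one already noted: the index grows because the values are stored twice if the individual fields are kept for searching, but each hit then needs only one stored value to be read and split in memory.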
Re: lucene docs in bulk read?
The single-field idea is definitely a good one; it could save a good amount of time. I'm using .stringValue. In reality, I hadn't ever even considered readerValue. Is there a strong performance difference between the two, or is the difference purely functional?

The basic post-processing is a grouping of results. Because of the time and space constraints of my indexing process, I am unable to efficiently go back and reindex a document when I find a duplicate (my search engine deals with multiple versions of documents over time). So my post-processing groups results in the top 5000 hits that are the same except for their dates. I need to grab the minimal data to do this (the URL of the original page, the date of the doc, etc.) so that I can keep only one doc, and if I find a duplicate, I can simply add the new date to the already existing doc. I am only reading a few fields, but at the scale of many documents, it hurts my timing quite a bit.

-Chris

On Tue, 1 Feb 2005 21:33:13 +0100, Kelvin Tan <[EMAIL PROTECTED]> wrote:
> Please see inline.
>
> On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
> > Well all my fields are strings when I index them. They're all very
> > short strings, dates, hashes, etc. The largest field has a cap of
> > 256 chars and there is only one of them, the rest are all fairly
> > small.
> >
> > Can you explain what you meant by 'string or reader' ?
>
> Sorry, I meant to ask if you're using String fields (field.stringValue()) or
> reader fields (field.readerValue()).
>
> Can you elaborate on the post-processing you need to do? Have you thought
> about concatenating the fields you require into a single non-indexed field
> (Field.UnIndexed) for simple retrieval? It'll increase the size of your
> index, but should be faster to retrieve them all at one go.
>
> Kelvin
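The grouping step described above could look roughly like the following sketch. It assumes that two hits count as the same page when their URLs match, which may be simpler than the real duplicate test; the names HitGrouper, Hit, and groupDatesByUrl are hypothetical, and Hit merely stands in for the few values read from each Lucene document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the grouping step; nothing here is Lucene API.
final class HitGrouper {
    // Stand-in for the minimal data read from each hit.
    static final class Hit {
        final String url;
        final String date;

        Hit(String url, String date) {
            this.url = url;
            this.date = date;
        }
    }

    // Groups hits by URL (assumed duplicate key). The LinkedHashMap keeps
    // groups in the order their first hit appeared, so ranking order survives.
    // Each date is recorded once per URL.
    static Map<String, List<String>> groupDatesByUrl(List<Hit> hits) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Hit hit : hits) {
            List<String> dates = grouped.computeIfAbsent(hit.url, k -> new ArrayList<>());
            if (!dates.contains(hit.date)) {
                dates.add(hit.date);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("http://example.com/a", "20050101"));
        hits.add(new Hit("http://example.com/b", "20050102"));
        hits.add(new Hit("http://example.com/a", "20050201"));
        System.out.println(groupDatesByUrl(hits));
    }
}
```

Grouping this way is cheap once the values are in memory, which is consistent with the profiler result: the cost sits in reading the stored fields for each of the 5000 hits, not in the grouping itself.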
Re: lucene docs in bulk read?
On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote:
> Definitely a good idea on the one line idea... that could possibly
> save a good amount of time. I'm using .stringValue ... in reality,
> I hadn't ever even considered readerValue ... is there a strong
> performance difference between the two? or is it simply on the
> functionality side?

Not that I'm aware of (performance). Reader fields are useful when reading in bulky data that doesn't make sense to load into memory as a String.

K