lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Hey folks.. thanks in advance to any who respond...

I do a good deal of post-search processing, and the file I/O to read the
fields I need is horribly costly... definitely a problem. Is there any way
to retrieve either (1) the entire doc (all fields that can be retrieved)
and/or (2) a group of docs, specified by, say, an array of doc IDs?

I've optimized to retrieve the entire list of fields instead of one by
one, and also to retrieve only the minimal number of fields that I can...
but my profilers still show that the Lucene I/O to read the doc fields is
where I spend 95% of my time. Of course that's to be expected given how it
all works... but can anyone think of a better way to go about retrieving
docs in bulk? Are some field types quicker or slower than others to
retrieve from the index?
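For reference, my retrieval loop looks roughly like this (a sketch against
the Lucene 1.4-era API; the index path, query, and field names are
simplified stand-ins):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Query query = new TermQuery(new Term("contents", "foo")); // stand-in query
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);   // the costly stored-field read, once per hit
        String url  = doc.get("url");
        String date = doc.get("date");
        // ... post-search processing on url/date ...
    }
    searcher.close();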




Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Hi Chris, are your fields string or reader? How large do your fields get?

Kelvin




Re: lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Well, all my fields are strings when I index them, and they're all very
short: dates, hashes, etc. The largest field has a cap of 256 chars, and
there's only one of those; the rest are all fairly small.

Can you explain what you meant by 'string or reader'?

Thanks,
Chris





Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Please see inline.

On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
> Can you explain what you meant by 'string or reader'?

Sorry, I meant to ask whether you're using String fields 
(field.stringValue()) or reader fields (field.readerValue()).

Can you elaborate on the post-processing you need to do? Have you thought about 
concatenating the fields you require into a single stored, non-indexed field 
(Field.UnIndexed) for simple retrieval? It'll increase the size of your index, 
but it should be faster to retrieve them all in one go.
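Something like this, as a rough untested sketch (the delimiter and the 
"retrieve" field name are arbitrary choices of mine, not anything Lucene 
prescribes):

    // Index time: pack the retrieval-only values into one stored,
    // unindexed field. "\u0001" is just an unlikely delimiter.
    doc.add(Field.UnIndexed("retrieve",
        url + "\u0001" + date + "\u0001" + hash));

    // Search time: a single stored-field read per hit, split in memory.
    String[] parts = hits.doc(i).get("retrieve").split("\u0001");
    String url = parts[0], date = parts[1], hash = parts[2];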

Kelvin





Re: lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
The single-field idea is definitely a good one... that could save a good
amount of time. I'm using .stringValue(); in truth, I hadn't even
considered readerValue(). Is there a strong performance difference
between the two, or is it simply a functionality difference?

The basic post-processing is a grouping of results. Because of the time
and space constraints of my indexing process, I can't efficiently go back
and reindex a document when I find a duplicate (my search engine deals
with multiple versions of documents over time). So my post-processing
groups results within the top 5000 hits that are identical except for
their dates. To do that I need to grab a minimal amount of data (the URL
of the original page, the date of the doc, etc.) so that I can keep just
one doc and, when I find a duplicate, simply add the new date to the
existing one. I'm only reading a few fields, but across many documents it
hurts my timing quite a bit. The grouping pass looks roughly like the
sketch below.
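In sketch form (raw 1.4-style collections; the field names are as above,
and the map shape is just illustrative):

    import java.util.*;
    import org.apache.lucene.document.Document;

    // Group the top hits by URL, collecting the dates of duplicates.
    Map byUrl = new HashMap();  // url -> List of dates
    int top = Math.min(hits.length(), 5000);
    for (int i = 0; i < top; i++) {
        Document d = hits.doc(i);
        String url = d.get("url");
        List dates = (List) byUrl.get(url);
        if (dates == null) {
            dates = new ArrayList();
            byUrl.put(url, dates);
        }
        dates.add(d.get("date"));  // duplicate page: just record its date
    }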

-Chris





Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan


On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote:
> Is there a strong performance difference between the two, or is it
> simply a functionality difference?

Not that I'm aware of, performance-wise. Reader fields are useful for
feeding in bulky data that doesn't make sense to load into memory as a
single String.
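For example (Lucene 1.4; "contents" and largeFile are stand-ins, and note
that a reader field is indexed but not stored, so you can't get it back
from search results):

    import java.io.*;
    import org.apache.lucene.document.Field;

    // Stream a large file into an indexed, tokenized, unstored field
    // rather than reading it all into one String first.
    Reader body = new BufferedReader(new FileReader(largeFile));
    doc.add(Field.Text("contents", body));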

K


