On 10/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

Thanks for tackling this Mike ... I've been dreading the whole issue of
Lazy Loading but your patch gives me hope.  I haven't had a chance to try
it out, but reading through it, it seems a lot more straightforward than
I'd feared.

A couple of concerns jump out at me though, starting with the biggie...

: Now, there is a concern with the doc cache of SolrIndexSearcher, which
: assumes it has the whole document in the cache.  To maintain this
: invariant, it is still the case that all the fields in a document are
: loaded in a searcher.doc(i) call.  However, if a field set is given to
: the method, only the given fields are loaded directly, while the rest
: are loaded lazily.

it doesn't look like this would actually be an invariant with your patch.
Consider the case of two request handlers: X which expects/uses
all fields, and Y which only uses a few fields consistently so the
rest are left to lazy loading.

Good point--I hadn't considered this case.

If Y hits doc N first it puts N in
cache with only partial fields, isn't X forced to use lazy loading every
time after that?

Yes, but as you guess below...

Or does lazy loading not work that way? ... would the lazy fields still
only be loaded once (the first time X asked for them) but then stored in
the Document (which is still in the cache) ?

... lazy fields read their value once the first time it is requested,
and operate as a Field thereafter.
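
In Lucene API terms, that looks roughly like the sketch below.  This is
not the patch's actual code, just an illustration -- the "eager" set and
the class/method names are made up:

  import java.io.IOException;
  import java.util.Set;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.FieldSelector;
  import org.apache.lucene.document.FieldSelectorResult;
  import org.apache.lucene.index.IndexReader;

  public class LazyLoadSketch {
    /** Load the fields named in 'eager' up front; everything else is
     *  marked LAZY_LOAD and is only read from the store on first access. */
    public static Document loadDoc(IndexReader reader, int docId,
                                   final Set<String> eager) throws IOException {
      return reader.document(docId, new FieldSelector() {
        public FieldSelectorResult accept(String fieldName) {
          return eager.contains(fieldName)
              ? FieldSelectorResult.LOAD
              : FieldSelectorResult.LAZY_LOAD;
        }
      });
    }
  }

  // usage:
  //   Fieldable f = doc.getFieldable("body"); // lazy if "body" wasn't in 'eager'
  //   String v1 = f.stringValue();            // first call reads the value from the index
  //   String v2 = f.stringValue();            // later calls return the already-loaded value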

... even if that is the case,
would the performance hit of X loading each field lazily one at a time be
more expensive than refetching completely?

I wouldn't expect there to be much of a difference.  Lazy fields hold
on to a stream and an offset, and operate by seek()'ing to the right
position and loading the fields as normal.  Now, if the lazy fields
were loaded in exactly the right order, the seeks would be no-ops.
In practice, this will not happen, but we can expect that all seeks
(after the first one) will be in the same disk block, which will be
buffered, and so the seeks will be just pointer arithmetic.

But that isn't quite the right comparison, as lazy fields have to
"skip" the field in the first place, which is cheap but not free (it
is cheaper for binary and compressed fields than string fields).

The performance gain/loss will depend heavily on which request handler
gets called more often and in what order.  I don't think the cost
would be too great, but I'm loathe to guess.

Depending on what would happen in the situation i described, perhaps
the Solr document cache should only be used by the single arg version of
SolrIndexSearcher.doc? (if we do this, the two arg version of readDocs needs
to be changed to use it directly)

perhaps, but that has other problems (in the situation you describe
above, every hit on the "few fields" handler would reload every
document).

An alternate idea: the single arg version could check if an item found in
the cache contains lazy fields and if so re-fetch and recache the full
Document?

That could work, though I wonder if the O(num fields) cost per document
access is worth it.  Perhaps the document could be stored with a
"lazy" flag in the cache, to make this check O(1).

FWIW: This isn't an obscure situation.  The main index I deal with has a
lot of metadata documents (one per category) which are *HUGE*, and there
are two custom request handlers which use those documents -- one uses
every field in them, and the other uses only a single field.  That second
handler would be an ideal candidate to start using your new
SolrIndexSearcher.doc(int,Set<String>) method, but it would suck if doing
that meant that the other handler's performance suffered because it started
taking longer to fetch all of the fields.

Perhaps instead of a "lazy" flag, the number of real fields could be
stored in the cache.  On the next document request, if there are more
than 1-2 more fields requested than are "real" in the cache, the full
document is returned.
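
In other words, cache a small wrapper rather than the bare Document --
something like the sketch below (names invented; uses java.util.Set and
org.apache.lucene.document.Document):

  /** Hypothetical cache value: the (possibly partially loaded) Document
   *  plus how many of its fields were loaded eagerly when it was cached. */
  static class CacheEntry {
    final Document doc;
    final int numEagerFields;
    CacheEntry(Document doc, int numEagerFields) {
      this.doc = doc;
      this.numEagerFields = numEagerFields;
    }
  }

  /** On a cache hit for doc(id, fields): re-fetch the full document only if
   *  the request asks for more than the 1-2 extra fields mentioned above. */
  static boolean shouldRefetch(CacheEntry entry, Set<String> requestedFields) {
    return requestedFields.size() > entry.numEagerFields + 2;
  }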

How much of a problem this is also depends on how often documents are
hit once they are in the cache.  If it is more than a few times, the
load-once behaviour of lazy fields should amortize out the extra cost.

Incidentally, I think one of the major benefits of lazy field loading
isn't necessarily the retrieval cost, but the memory savings.  In your
example, presumably many documents are returned by the 1-field
handler which the many-field handler never touches.  These documents
will occupy much less memory, and consequently the size of the
document cache can be increased.

Other smaller issues...

1) it's not clear to me ... will optimizePreFetchDocs trigger an NPE if
there are highlight fields but no uniqueKey for the schema?
(same potential in HighlightUtils which has the same line)

Good point.  I don't think the test suite covers this case--and I've
been bitten by it before.
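
The fix is presumably just a null check before the key field is added to
the set of fields to fetch -- a sketch, not the actual patch code
("fieldFilter" stands in for whatever set optimizePreFetchDocs builds up):

  // guard against a schema that has no <uniqueKey> defined
  SchemaField keyField = searcher.getSchema().getUniqueKeyField();
  if (keyField != null) {
    fieldFilter.add(keyField.getName());
  }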

2) why doesn't optimizePreFetchDocs use SolrIndexSearcher.readDocs? (looks
like a cut/paste of the method body)

It avoids allocating the Document[] array, and it's three lines (vs. the
two lines needed to allocate the array and call readDocs).

3) should we be concerned about letting people specify prefixes/suffixes
of the fields they want to forcibly load for dynamicFields instead of just
a Set<String> of names? ... or should we cross that bridge when we come to
it?  (I ask because we have no cache aware method that takes in a
FieldSelector, just the one that takes in the Set<String>)

It would be very easy to add a parallel method which takes a
FieldSelector.  My only concern with that is that it might make it
hard to do cache flushing heuristics like you suggested above.
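
Something along these lines, for the sake of discussion (sketch only;
'reader' is the searcher's underlying IndexReader, and the doc cache is
ignored entirely -- which is exactly the open question):

  /** Hypothetical overload: callers hand in an arbitrary FieldSelector,
   *  e.g. one matching dynamicField prefixes/suffixes.  With no Set<String>
   *  there is no easy way to count how many "real" fields were requested,
   *  so the flushing heuristic above gets harder. */
  public Document doc(int i, FieldSelector selector) throws IOException {
    return reader.document(i, selector);  // bypasses the documentCache
  }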

And a few minor nits...

* lists in javadoc comments should use HTML <li> tags so they can be read
  cleanly in generated docs (see SolrPluginUtils.optimizePreFetchDocs)
* there are some double imports of Fieldable in several classes (it
  looks like maybe they were already importing it once for no reason)

Thanks for the comments!
-Mike
