Yonik Seeley wrote:
On 6/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
Ryan: independent of the javadoc comment on loadStoredFields about it
possibly being refactored somewhere else, the build method doesn't really
match the semantics of the DocumentBuilder class.
I think I commented in SOLR-193 that it didn't belong in DocumentBuilder,
The original DocumentBuilder was very efficient at least... it
directly built the Lucene document with no intermediate state (but
then we added constraint checking, like multiValued, etc., and we had
to build a hash anyway).
Looking at it now, (with Yonik pointing out the Map) I see what you
mean. I put it there so that the Document field validation/conversion
is consistent: it calls the same addField() using the same error
checking, etc.
This is kinda silly because SolrDocument already knows if there are
multiple values for a field or not. It could easily do the SolrDocument
-> Document conversion in a single loop without building any more
intermediate state or calling those functions.
Perhaps:
class DocumentUtils {
  static Document toDocument( SolrInputDocument, Schema );
  static SolrDocument loadFields( SolrDocument, Document, Schema,
                                  boolean skipCopyFields );
}
thoughts?
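The single-loop conversion described above can be sketched with plain collections standing in for the real SolrInputDocument, Lucene Document, and Schema types (the class, method, and parameter names here are hypothetical, not the Solr API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: a Map of field name -> values stands in for
// SolrInputDocument, a list of name/value pairs stands in for the Lucene
// Document, and a set of multiValued field names stands in for the schema.
public class DocumentUtilsSketch {

  public static List<String[]> toDocument(Map<String, List<String>> solrDoc,
                                          Set<String> multiValuedFields) {
    List<String[]> luceneDoc = new ArrayList<>();
    // One pass: validation and conversion happen in the same loop,
    // with no extra intermediate Map built along the way.
    for (Map.Entry<String, List<String>> field : solrDoc.entrySet()) {
      String name = field.getKey();
      List<String> values = field.getValue();
      // The input document already knows whether a field has multiple
      // values, so the multiValued check needs no parallel bookkeeping.
      if (values.size() > 1 && !multiValuedFields.contains(name)) {
        throw new IllegalArgumentException(
            "field '" + name + "' is not multiValued but has "
            + values.size() + " values");
      }
      for (String v : values) {
        // stands in for doc.add(new Field(name, v, ...))
        luceneDoc.add(new String[] { name, v });
      }
    }
    return luceneDoc;
  }
}
```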
Some other comments... sorry if these have already been discussed, but
I'm finding less time to keep up with JIRA patches (until after they
are committed sometimes).
No worries - i'll keep you on your toes ;)
SolrDocument :
- should it be Iterable at least?
that sounds fine.
I was only against the Map interface because it makes JSTL work strangely.
- I'm not crazy about an ArrayList per field, considering that most
fields aren't multiValued, but I guess it's not too much of an issue.
The alternative would be to have getField() return Object instead of
Collection<Object>
If Object[] is substantially better that could work. I don't have a
real sense of the performance hit for making an ArrayList. It could be
initialized as new ArrayList(1)?
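For comparison, the alternative mentioned above (getField() returning Object instead of Collection&lt;Object&gt;) could look something like this sketch, which stores a bare Object for the common single-valued case and only allocates a list when a second value arrives. The class and method names are made up for illustration, not the Solr API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: avoid an ArrayList per field by keeping a bare
// Object for single-valued fields and promoting to a List lazily.
public class LazyFieldMap {
  private final Map<String, Object> fields = new LinkedHashMap<>();

  @SuppressWarnings("unchecked")
  public void addField(String name, Object value) {
    Object existing = fields.get(name);
    if (existing == null) {
      fields.put(name, value);            // common case: no list allocated
    } else if (existing instanceof List) {
      ((List<Object>) existing).add(value);
    } else {
      // a second value arrives: promote, like "new ArrayList(1)" but lazy
      List<Object> values = new ArrayList<>(2);
      values.add(existing);
      values.add(value);
      fields.put(name, values);
    }
  }

  @SuppressWarnings("unchecked")
  public Collection<Object> getFieldValues(String name) {
    Object v = fields.get(name);
    if (v == null) return Collections.emptyList();
    return (v instanceof List) ? (List<Object>) v
                               : Collections.singletonList(v);
  }
}
```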
- I have to wonder why we have a SolrDocument at all though, compared
to a Map<String,Collection<Object>>
setField('name',value);
addField('name',value);
getFieldValue('name');
getFieldValues('name');
boosts...
Mostly, I am doing a lot of work with transforming documents and it is
really nice to have an intermediate state for a document. On the
client side, I needed somewhere to put the parsed document results.
SolrInputDocument :
- it seems like it should convey extra state (currently field and
document boosts), but not behavior. The keepDuplicates logic seems like
extra overhead that will rarely be useful,
Maybe I'm ahead of myself, but this functionality is essential for:
SOLR-139 and SOLR-103. For modifiable docs, it is MUCH easier for
clients to add tags if the server can worry about duplicates. For some
database layouts, it is only possible to get all fields for a document
represented as columns/rows if you repeat many fields.
and if it is useful, the
logic should probably be executed when building the Lucene Document
(when the schema is available for more info).
Perhaps. But if the base container is a Set, then it never needs to
worry about duplicates again - otherwise you build a List, then to make
sure it's distinct, build a parallel Set at index time?
From the client side (SOLR-20) this is also a nice feature - and you
don't have (or need) access to the schema.
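The tradeoff described above can be made concrete with a small sketch (class and method names are hypothetical, stdlib collections only): with a Set as the base container duplicates never get in, while a List forces a parallel Set to be built at index time anyway.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSketch {
  // Option A: base container is a Set - duplicates are dropped on add,
  // and LinkedHashSet preserves insertion order.
  public static Collection<String> addWithSet(List<String> incoming) {
    return new LinkedHashSet<>(incoming);
  }

  // Option B: base container is a List - dedup deferred to index time,
  // which means building a parallel Set there after all.
  public static List<String> dedupAtIndexTime(List<String> stored) {
    return new ArrayList<>(new LinkedHashSet<>(stored));
  }
}
```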
- keeping boosts in an extra map : uggg... going from a simple float
to boxing + hashing doesn't seem great performance-wise,
FWIW, it does not make the map unless you are using boosts... any
suggestions for a better intermediate state?
and it also doesn't match the current Lucene Document interface which allows a
boost per field value (although lucene indexing currently lacks the
ability to boost them separately).
this was totally by design. The fact you *can* set boosts on field
values in a Document that aren't used is really strange. What happens
if I set: addField( 'f1', 'v1', 10 ), addField( 'f1', 'v2', 1 )?
any answer involves something like: "Yea yea, we know it's bad, but..."
When the Lucene index supports boosting field values separately, it
makes sense to add that to the SolrDocument API. At that point the
semantics of setBoost(
'fieldname', 10 ) still make perfect sense -- apply the boost to every
field.
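The per-field semantics argued for above can be sketched as follows; the class is a made-up stand-in, and (as mentioned earlier in the thread) the boost map is only allocated when a boost is actually set:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one boost per field name, applied to every value
// of that field - no ambiguous per-value boosts the index can't honor.
public class FieldBoosts {
  private Map<String, Float> boosts;  // stays null until first setBoost

  public void setBoost(String field, float boost) {
    if (boosts == null) boosts = new HashMap<>();
    boosts.put(field, boost);
  }

  public float getBoost(String field) {
    if (boosts == null) return 1.0f;    // no map built if boosts unused
    return boosts.getOrDefault(field, 1.0f);
  }
}
```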
- - - - -
Right now, /trunk is in a good state to test performance differences
http://localhost:8983/solr/update
uses the well-tested XPP XmlUpdateHandler - writing documents directly
to the document builder - never touching SolrDocument
http://localhost:8983/solr/update/stax
uses a new StaxUpdateHandler that loads documents into an intermediate
state, passes this to a "RequestProcessor" that then converts that to a
Lucene Document and then calls add.
The flexibility of the second approach is great - writing a JSON updater
is trivial; adding custom authentication/transformation is easy; using
the same authentication/transformation for JSON and XML is built in. If
it is a small performance hit, I think it is worth it. If it is a
serious hit, then we should probably make one RequestHandler that writes
to the index as fast as possible (no SolrDocument) and another that
handles an intermediate state. Maybe the 'fast' update handler could
(optionally)
assume the input is good - skipping the validation that requires it to
build an extra Map?
ryan