Yonik Seeley wrote:
On 6/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
Ryan: independent of the javadoc comment on loadStoredFields about it
possibly being refactored somewhere else, the build method doesn't really
match the semantics of the DocumentBuilder class.
I think I commented in SOLR-193 that it didn't belong in DocumentBuilder,
The original DocumentBuilder was very efficient at least... it
directly built the Lucene document with no intermediate state (but
then we added constraint checking, like multiValued, etc., and we had
to build a hash anyway).
Looking at it now, (with Yonik pointing out the Map) I see what you
mean. I put it there so that the Document field validation/conversion
is consistent: it calls the same addField() using the same error
checking, etc.
This is kinda silly because SolrDocument already knows if there are
multiple values for a field or not. It could easily do the SolrDocument
-> Document conversion in a single loop without building any more
intermediate state or calling those functions.
Perhaps:
class DocumentUtils {
  static Document toDocument( SolrInputDocument, Schema );
  static SolrDocument loadFields( SolrDocument, Document, Schema,
                                  boolean skipCopyFields );
}
thoughts?
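The single-loop conversion described above can be sketched with plain collections standing in for the real SolrInputDocument, Lucene Document, and Schema types (the class, method, and parameter names here are hypothetical, not the Solr API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: a Map of field name -> values stands in for
// SolrInputDocument, a list of name/value pairs stands in for the Lucene
// Document, and a set of multiValued field names stands in for the schema.
public class DocumentUtilsSketch {

  public static List<String[]> toDocument(Map<String, List<String>> solrDoc,
                                          Set<String> multiValuedFields) {
    List<String[]> luceneDoc = new ArrayList<>();
    // One pass: validation and conversion happen in the same loop,
    // with no extra intermediate Map built along the way.
    for (Map.Entry<String, List<String>> field : solrDoc.entrySet()) {
      String name = field.getKey();
      List<String> values = field.getValue();
      // The input document already knows whether a field has multiple
      // values, so the multiValued check needs no parallel bookkeeping.
      if (values.size() > 1 && !multiValuedFields.contains(name)) {
        throw new IllegalArgumentException(
            "field '" + name + "' is not multiValued but has "
            + values.size() + " values");
      }
      for (String v : values) {
        // stands in for doc.add(new Field(name, v, ...))
        luceneDoc.add(new String[] { name, v });
      }
    }
    return luceneDoc;
  }
}
```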
Some other comments... sorry if these have already been discussed, but
I'm finding less time to keep up with JIRA patches (until after they
are committed sometimes).
No worries - i'll keep you on your toes ;)
SolrDocument :
- should it be Iterable at least?
that sounds fine.
I was only against the Map interface because it makes JSTL work strangely.
- I'm not crazy about an ArrayList per field, considering that most
fields aren't multiValued, but I guess it's not too much of an issue.
The alternative would be to have getField() return Object instead of
Collection<Object>
If Object[] is substantially better that could work. I don't have a
real sense of the performance hit for making an ArrayList. It could be
initialized as new ArrayList(1)?
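For comparison, the alternative mentioned above (getField() returning Object instead of Collection&lt;Object&gt;) could look something like this sketch, which stores a bare Object for the common single-valued case and only allocates a list when a second value arrives. The class and method names are made up for illustration, not the Solr API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: avoid an ArrayList per field by keeping a bare
// Object for single-valued fields and promoting to a List lazily.
public class LazyFieldMap {
  private final Map<String, Object> fields = new LinkedHashMap<>();

  @SuppressWarnings("unchecked")
  public void addField(String name, Object value) {
    Object existing = fields.get(name);
    if (existing == null) {
      fields.put(name, value);            // common case: no list allocated
    } else if (existing instanceof List) {
      ((List<Object>) existing).add(value);
    } else {
      // a second value arrives: promote, like "new ArrayList(1)" but lazy
      List<Object> values = new ArrayList<>(2);
      values.add(existing);
      values.add(value);
      fields.put(name, values);
    }
  }

  @SuppressWarnings("unchecked")
  public Collection<Object> getFieldValues(String name) {
    Object v = fields.get(name);
    if (v == null) return Collections.emptyList();
    return (v instanceof List) ? (List<Object>) v
                               : Collections.singletonList(v);
  }
}
```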
- I have to wonder why we have a SolrDocument at all though, compared
to a Map<String,Collection<Object>>
setField('name',value);
addField('name',value);
getFieldValue('name');
getFieldValues('name');
boosts...
Mostly, I am doing a lot of work with transforming documents and it is
really nice to have an intermediate state for a document. On the
client side, I needed somewhere to put the parsed document results.
SolrInputDocument :
- it seems like it should convey extra state (currently field and
document boosts), but not behavior. The keepDuplicates logic seems like
extra overhead that will rarely be useful,
Maybe I'm ahead of myself, but this functionality is essential for:
SOLR-139 and SOLR-103. For modifiable docs, it is MUCH easier for
clients to add tags if the server can worry about duplicates. For some
database layouts, it is only possible to get all fields for a document
represented as columns/rows if you repeat many fields.
and if it is useful, the
logic should probably be executed when building the Lucene Document
(when the schema is available for more info).
Perhaps. But if the base container is a Set, then it never needs to
worry about duplicates again - otherwise you build a List, then to make
sure it's distinct, build a parallel Set at index time?
From the client side (SOLR-20) this is also a nice feature - and you
don't have (or need) access to the schema.
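The tradeoff described above can be made concrete with a small sketch (class and method names are hypothetical, stdlib collections only): with a Set as the base container duplicates never get in, while a List forces a parallel Set to be built at index time anyway.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSketch {
  // Option A: base container is a Set - duplicates are dropped on add,
  // and LinkedHashSet preserves insertion order.
  public static Collection<String> addWithSet(List<String> incoming) {
    return new LinkedHashSet<>(incoming);
  }

  // Option B: base container is a List - dedup deferred to index time,
  // which means building a parallel Set there after all.
  public static List<String> dedupAtIndexTime(List<String> stored) {
    return new ArrayList<>(new LinkedHashSet<>(stored));
  }
}
```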
- keeping boosts in an extra map : uggg... going from a simple float
to boxing + hashing doesn't seem great performance-wise,
FWIW, it does not make the map unless you are using boosts... any
suggestions for a better intermediate state?
and it also doesn't match the current Lucene Document interface which allows a
boost per field value (although lucene indexing currently lacks the
ability to boost them separately).
this was totally by design. The fact you *can* set boosts on field
values in a Document that aren't used is really strange. What happens
if I set: addField( 'f1', 'v1', 10 ), addField( 'f1', 'v2', 1 )?
any answer involves something like: "Yea yea, we know it's bad, but..."
When the Lucene index supports boosting field values separately, it
makes sense to add that to the SolrDocument API. At that point the
semantics of setBoost(
'fieldname', 10 ) still make perfect sense -- apply the boost to every
field.
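The per-field semantics argued for above can be sketched as follows; the class is a made-up stand-in, and (as mentioned earlier in the thread) the boost map is only allocated when a boost is actually set:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one boost per field name, applied to every value
// of that field - no ambiguous per-value boosts the index can't honor.
public class FieldBoosts {
  private Map<String, Float> boosts;  // stays null until first setBoost

  public void setBoost(String field, float boost) {
    if (boosts == null) boosts = new HashMap<>();
    boosts.put(field, boost);
  }

  public float getBoost(String field) {
    if (boosts == null) return 1.0f;    // no map built if boosts unused
    return boosts.getOrDefault(field, 1.0f);
  }
}
```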
- - - - -
Right now, /trunk is in a good state to test performance differences
http://localhost:8983/solr/update
uses the well-tested XPP XmlUpdateHandler - writing documents directly
to the document builder - never touching SolrDocument
http://localhost:8983/solr/update/stax
uses a new StaxUpdateHandler that loads documents into an intermediate
state, passes this to a "RequestProcessor" that then converts that to a
Lucene Document and then calls add.
The flexibility of the second approach is great - writing a JSON updater
is trivial; adding custom authentication/transformation is easy; using
the same authentication/transformation for JSON and XML is built in. If
it is a small performance hit, I think it is worth it. If it is a
serious hit, then we should probably make one RequestHandler that writes
to the index as fast as possible (no SolrDocument) and another that
handles an intermediate state. Maybe the 'fast' update handler could
(optionally)
assume the input is good - skipping the validation that requires it to
build an extra Map?
ryan