Perhaps:
class DocumentUtils {
   static Document toDocument( SolrInputDocument, Schema );
   static SolrDocument loadFields( SolrDocument, Document, Schema, boolean skipCopyFields );
}

thoughts?

Would request handlers (that are update handlers) call these directly
before passing a Lucene Document to the UpdateHandler, or should
UpdateHandler take a SolrDocument?


In SOLR-139 (updateable/modifiable) documents, I add a new command:

public class IndexDocumentCommand
{
  public enum MODE {
    OVERWRITE, // overwrite existing values with the new one (the default behavior)
    APPEND,    // add the new value to existing value
    DISTINCT,  // same as APPEND, but make sure each value is distinct
    INCREMENT  // increment existing value.  Must be a number!
  };

  public boolean overwrite = true;
  public SolrInputDocument doc;
  public Map<String,MODE> mode; // What to do for each field.
  public int commitMaxTime = -1; // make sure the document is committed within this much time
}

The MODE enum should be in common (so solrj can link to it). For the default behavior (overwrite the whole document) the mode Map is not created.
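
To make the four modes concrete, here is a rough sketch of how each one could combine a field's existing values with the incoming ones. This is not the SOLR-139 patch: the class FieldModeSketch and the method apply are hypothetical names for illustration, and the choice to do INCREMENT with long arithmetic on a single value is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper (not part of Solr): shows one way each mode could
// merge a field's existing values with the values from the new document.
class FieldModeSketch {
    enum Mode { OVERWRITE, APPEND, DISTINCT, INCREMENT }

    static List<Object> apply(Mode mode, List<Object> existing, List<Object> incoming) {
        switch (mode) {
            case OVERWRITE:
                // Drop the existing values and keep only the new ones.
                return new ArrayList<>(incoming);
            case APPEND: {
                List<Object> out = new ArrayList<>(existing);
                out.addAll(incoming);
                return out;
            }
            case DISTINCT: {
                // Same as APPEND, but each value is kept once, in first-seen order.
                LinkedHashSet<Object> seen = new LinkedHashSet<>(existing);
                seen.addAll(incoming);
                return new ArrayList<>(seen);
            }
            case INCREMENT: {
                // Assumption: exactly one numeric value on each side, summed as long.
                if (existing.size() != 1 || incoming.size() != 1
                        || !(existing.get(0) instanceof Number)
                        || !(incoming.get(0) instanceof Number)) {
                    throw new IllegalArgumentException("INCREMENT needs a single number");
                }
                long sum = ((Number) existing.get(0)).longValue()
                         + ((Number) incoming.get(0)).longValue();
                List<Object> out = new ArrayList<>();
                out.add(sum);
                return out;
            }
            default:
                throw new AssertionError(mode);
        }
    }
}
```

In this sketch a field with no entry in the mode Map would simply be treated as OVERWRITE, which matches the default described above.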

Perhaps it makes sense to add SOLR-139 so you all can easily see why I made some of the DocumentBuilder choices. Since XmlUpdateHandler and StaxUpdateHandler are living in parallel now, it would not affect anyone who does not use the StaxUpdateHandler...



If Object[] is substantially better that could work.  I don't have a
real sense of the performance hit for making an ArrayList.  It could be
initialized as new ArrayList(1)?

ArrayList(1) would help with GC issues at least.


I'll do that.


> SolrInputDocument :
> - it seems like it should convey extra state (currently field and
> document boosts), but not behavior.  keepDuplicates logic seems like
> extra overhead that will rarely be useful,

Maybe I'm ahead of myself, but this functionality is essential for:
SOLR-139 and SOLR-103.  For modifiable docs, it is MUCH easier for
clients to add tags if the server can worry about duplicates.  For some
database layouts, it is only possible to get all fields for a document
represented as columns/rows if you repeat many fields.

Should keepDuplicates logic be reversed to removeDuplicates?
This code seems to be a problem since if you set a value for keepDuplicates for
any field, it switches the default of all others, right?
   if( _keepDuplicates == null || Boolean.TRUE == _keepDuplicates.get( name )) {


removeDuplicates sounds better. I'll change it and make sure the logic is OK (with tests that touch both cases).
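
For reference, the default flip in the quoted condition can be reproduced in isolation. The sketch below is a stand-in, not the real SolrInputDocument: the class DuplicateFlagSketch and its method names are made up, and the removeDuplicates variant is only one possible shape for the reversed flag.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in (not SolrInputDocument): isolates the per-field
// flag lookup so the default for unset fields can be checked directly.
class DuplicateFlagSketch {
    private Map<String, Boolean> keepDuplicates;   // null until any field is set
    private Map<String, Boolean> removeDuplicates; // null until any field is set

    void setKeepDuplicates(String name, boolean value) {
        if (keepDuplicates == null) keepDuplicates = new HashMap<>();
        keepDuplicates.put(name, value);
    }

    void setRemoveDuplicates(String name, boolean value) {
        if (removeDuplicates == null) removeDuplicates = new HashMap<>();
        removeDuplicates.put(name, value);
    }

    // The condition as quoted: the default for an unset field is "keep"
    // only while the map is null, and becomes "don't keep" once any
    // field has been configured.
    boolean keepsDuplicatesAsQuoted(String name) {
        return keepDuplicates == null || Boolean.TRUE == keepDuplicates.get(name);
    }

    // Reversed flag: an unset field never removes duplicates, whether or
    // not other fields have been configured.
    boolean removesDuplicates(String name) {
        return removeDuplicates != null
                && Boolean.TRUE.equals(removeDuplicates.get(name));
    }
}
```

With the quoted condition, calling setKeepDuplicates("a", true) changes the answer for an untouched field "b" from true to false. With the reversed flag, "b" stays at false either way, so tests for both cases can assert on a stable default.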



This was totally by design.  The fact you *can* set boosts on field
values in a Document that aren't used is really strange.  What happens
if I set: addField( 'f1', 'v1', 10 ), addField( 'f1', 'v2', 1 )?

The way the Lucene code is written, the boost for a given field is the
product of the document boost and *all* boosts on values of that
field.
See DocumentWriter:
         fieldBoosts[fieldNumber] *= field.getBoost();
Not that that makes the most sense though...

So what should we do when someone sends in XML with more than one
field value with a boost on it?


Sounds like we should keep the accumulated product of multi-valued field boosts and set that on the first value in the multi-value field. If I understand correctly, this would be consistent with existing behavior.
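
As a rough illustration of that idea, the sketch below multiplies the per-value boosts of one multi-valued field and puts the product on the first value, leaving 1.0 on the rest. The class BoostSketch and both method names are hypothetical. It works on plain floats rather than Lucene Field objects, so it only shows the arithmetic.

```java
// Hypothetical helper (not Solr or Lucene code): models the boosts of one
// multi-valued field as a float[] with one entry per value.
class BoostSketch {
    // Product of all value boosts, which is what the quoted DocumentWriter
    // line accumulates for a field.
    static float accumulatedBoost(float[] valueBoosts) {
        float product = 1.0f;
        for (float b : valueBoosts) {
            product *= b;
        }
        return product;
    }

    // Put the accumulated product on the first value and 1.0 on the others,
    // so the overall product for the field is unchanged.
    static float[] collapseOntoFirst(float[] valueBoosts) {
        float[] out = new float[valueBoosts.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = 1.0f;
        }
        if (out.length > 0) {
            out[0] = accumulatedBoost(valueBoosts);
        }
        return out;
    }
}
```

For the addField( 'f1', 'v1', 10 ), addField( 'f1', 'v2', 1 ) case above, this gives 10 on v1 and 1 on v2. Boosts of 2 and 3 would become 6 and 1, which multiply out to the same field boost as before.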

ryan
