On 6/16/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
This is kinda silly because SolrDocument already knows if there are
multiple values for a field or not.

Yep.

It could easily do the SolrDocument
-> Document in a single loop without building any more intermediate
states and calling the functions.

Perhaps:
class DocumentUtils {
   static Document toDocument( SolrInputDocument, Schema );
   static SolrDocument loadFields( SolrDocument, Document, Schema, boolean skipCopyFields );
}

thoughts?

Would request handlers (that are update handlers) call these directly
before passing a Lucene Document to the UpdateHandler, or should
UpdateHandler take a SolrDocument?

> SolrDocument :
> - should it be Iterable at least?

that sounds fine.

I was only against the Map interface because it makes JSTL work strangely.

> - I'm not crazy about an ArrayList per field, considering that most
> fields aren't multiValued, but I guess it's not too much of an issue.
> The alternative would be to have getField() return Object instead of
> Collection<Object>

If Object[] is substantially better that could work.  I don't have a
real sense of the performance hit for making an ArrayList.  It could be
initialized as new ArrayList(1)?

ArrayList(1) would help with GC issues at least.
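
As a rough illustration of the Object-returning alternative mentioned above, here is a minimal sketch that stores a single value directly and only allocates a list when a second value arrives. LazyFieldMap and its methods are hypothetical stand-ins, not the actual SolrDocument API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document class; not the real SolrDocument.
class LazyFieldMap {
    private final Map<String, Object> fields = new HashMap<>();

    // Assumes null is never a value and that a value is never itself a List;
    // a real implementation would need to handle both cases.
    @SuppressWarnings("unchecked")
    void addField(String name, Object value) {
        Object existing = fields.get(name);
        if (existing == null) {
            // Common case: single-valued field, no list allocated.
            fields.put(name, value);
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(value);
        } else {
            // Second value seen: promote the single value to a small list.
            List<Object> values = new ArrayList<>(2);
            values.add(existing);
            values.add(value);
            fields.put(name, values);
        }
    }

    // Returns the bare value for single-valued fields, a List otherwise.
    Object getField(String name) {
        return fields.get(name);
    }
}
```

The trade-off is that callers have to check whether they got a single value or a List, which is the cost of avoiding the per-field ArrayList.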

> SolrInputDocument :
> - it seems like it should convey extra state (currently field and
> document boosts), but not behavior.  keepDuplicates logic seems like
> extra overhead that will rarely be useful,

Maybe I'm ahead of myself, but this functionality is essential for:
SOLR-139 and SOLR-103.  For modifiable docs, it is MUCH easier for
clients to add tags if the server can worry about duplicates.  For some
database layouts, it is only possible to get all fields for a document
represented as columns/rows if you repeat many fields.

Should keepDuplicates logic be reversed to removeDuplicates?
This code seems to be a problem since if you set a value for keepDuplicates for
any field, it switches the default of all others, right?
   if( _keepDuplicates == null || Boolean.TRUE == _keepDuplicates.get( name )) {
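
To make the concern concrete, here is a small sketch that mirrors the condition quoted above. KeepDuplicatesDemo, setKeepDuplicates, and keepsWithDefault are made-up names for illustration; only the condition in keepsAsQuoted comes from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only the condition in keepsAsQuoted() is from the patch.
class KeepDuplicatesDemo {
    private Map<String, Boolean> keepDuplicates; // stays null until first use

    void setKeepDuplicates(String name, boolean keep) {
        if (keepDuplicates == null) {
            keepDuplicates = new HashMap<>();
        }
        keepDuplicates.put(name, keep);
    }

    // Same test as the quoted line: once the map exists, any field that is
    // not listed in it yields null from get(), so it stops keeping duplicates.
    boolean keepsAsQuoted(String name) {
        return keepDuplicates == null || Boolean.TRUE == keepDuplicates.get(name);
    }

    // One possible alternative (an assumption, not the patch): an explicit
    // default that applies to every field without its own setting.
    boolean keepsWithDefault(String name, boolean defaultKeep) {
        Boolean setting = (keepDuplicates == null) ? null : keepDuplicates.get(name);
        return setting == null ? defaultKeep : setting;
    }
}
```

With this sketch, keepsAsQuoted("title") is true on a fresh instance and becomes false after setKeepDuplicates("tags", false), even though "title" was never touched.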

> and if it is useful, the
> logic should probably be executed when building the Lucene Document
> (when the schema is available for more info).

Perhaps.  but if the base container is a Set, then it never needs to
worry about it again - otherwise you build a List, then to make sure it's
distinct build a parallel Set at index time?
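
A minimal sketch of the Set-as-base-container idea, assuming the container is chosen once per field when the first value is added. FieldValues and newContainer are hypothetical names; LinkedHashSet is used here so that insertion order is preserved while duplicates are dropped.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;

// Hypothetical helper, not part of the patch under discussion.
class FieldValues {
    // Picks the backing collection for one field's values.
    static Collection<Object> newContainer(boolean keepDuplicates) {
        if (keepDuplicates) {
            return new ArrayList<Object>(1);   // keeps every value, in order
        }
        return new LinkedHashSet<Object>();    // drops repeats, keeps first-seen order
    }
}
```

Adding "solr", "lucene", "solr" to the Set-backed container leaves two values, with no second pass needed at index time.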

From the client side (SOLR-20) this is also a nice feature - and you
don't have (or need) access to the schema.


> - keeping boosts in an extra map : uggg... going from a simple float
> to boxing + hashing doesn't seem great performance-wise,

FWIW, it does not make the map unless you are using boosts...  any
suggestions for a better intermediate state?

Not really... and the performance impact will be negligible.  I just
bring it up because .1% here and .1% there start to add up after a
while.  It's more of a psychological thing I guess ;-)

> and it also doesn't match the current Lucene Document interface which allows a
> boost per field value (although lucene indexing currently lacks the
> ability to boost them separately).
>

this was totally by design.  The fact you *can* set boosts on field
values in a Document that aren't used is really strange.  What happens
if I set: addField( 'f1', 'v1', 10 ), addField( 'f1', 'v2', 1 )?

The way the Lucene code is written, the boost for a given field is the
product of the document boost and *all* boosts on values of that
field.
See DocumentWriter:
         fieldBoosts[fieldNumber] *= field.getBoost();
Not that that makes the most sense though...
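
The arithmetic can be sketched as follows. This is a simplified model of the multiplication described above, not the DocumentWriter code itself; BoostProduct and effectiveFieldBoost are made-up names.

```java
// Simplified model of the product described above; not Lucene code.
class BoostProduct {
    // Starts from the document boost and multiplies in the boost of every
    // value added under the same field name.
    static float effectiveFieldBoost(float docBoost, float... valueBoosts) {
        float boost = docBoost;
        for (float valueBoost : valueBoosts) {
            boost *= valueBoost;
        }
        return boost;
    }
}
```

For the earlier example, addField( 'f1', 'v1', 10 ) followed by addField( 'f1', 'v2', 1 ) with a document boost of 1 gives 1 * 10 * 1 = 10 for the whole field, so both values end up carrying the same boost.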

So what should we do when someone sends in XML with more than one
field value with a boost on it?

any answer involves something like: "Yea yea, we know it's bad, but..."

When the lucene index supports boosting it makes sense to add that to
the SolrDocument API.  At that point the semantics of setBoost(
'fieldname', 10 ) still make perfect sense -- apply the boost to every
field.

What you probably want (lucene scoring-wise) is to just boost a single
value of the multi-valued field.

-Yonik
