On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel <[email protected]> wrote:

> Yeah thought of that one too but it still requires each be ordered by Key,
> in which case simultaneous iteration works in one pass I think.
>

Multipass does not require ordering by key.  Solr documents can be updated
in any order.


> If the DRMs are always sorted by Key you can iterate through each at the
> same time, writing only when you have both fields or know there is a field
> missing from one DRM. If you get the same key you write a combined doc, if
> you have different ones, write out one sided until it catches up to the
> other.
>

Yes.  Merge will work when files are ordered and split consistently.  I
don't think we should be making that assumption.
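
For reference, the one-pass merge being discussed could be sketched roughly
as below. This is only an illustration, and it assumes both inputs are
iterables of (key, value) pairs, sorted ascending by key, with each key
appearing at most once per input. The name merge_sorted is made up for this
sketch and is not anything in Mahout.

```python
def merge_sorted(left, right):
    """Yield (key, left_value, right_value) from two key-sorted iterables.

    Assumption: keys are unique within each input and both inputs are
    sorted ascending by key. A side with no entry for a key yields None.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Left key is behind (or right is exhausted): one-sided record.
            yield l[0], l[1], None
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            # Right key is behind (or left is exhausted): one-sided record.
            yield r[0], None, r[1]
            r = next(right_it, None)
        else:
            # Same key on both sides: combined record.
            yield l[0], l[1], r[1]
            l = next(left_it, None)
            r = next(right_it, None)
```

If either input is not actually sorted, this produces duplicate or
one-sided records for the same key without raising any error, which is the
risk with relying on the ordering assumption.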


> Every DRM I've examined seems to be ordered by key and I assume that is
> not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part
> file splits aren't a problem.
>

But with the co- and cross- occurrence stuff, file splits could be a
problem.


> A m/r join is pretty simple too but I'll go with non-m/r unless there is a
> problem above.
>

The simplest join is to use Solr updates.  This would require a minimal
amount of programming, certainly less than writing a merge program.
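
A rough sketch of what each pass could send, assuming Solr 4.x atomic
updates (the {"set": ...} form, which as far as I know needs the update log
enabled and the fields stored). The helper name atomic_update_doc is
hypothetical, and actually posting the payload to the update handler is
left out.

```python
import json

def atomic_update_doc(item_id, field, value):
    """Build one atomic-update document that sets a single field.

    Assumption: the target collection supports Solr atomic updates, so
    fields not mentioned here are left untouched on the existing document.
    """
    return {"id": item_id, field: {"set": value}}

def build_update_payload(rows, field):
    """Serialize (item_id, value) pairs from one DRM pass as a JSON array."""
    return json.dumps([atomic_update_doc(i, field, v) for i, v in rows])

# One pass per DRM; the passes can run in either order.
pass_one = build_update_payload([("item1", "itemX itemY")], "b_b_links")
pass_two = build_update_payload([("item1", "itemZ")], "b_a_links")
```

Each pass only names the field it knows about, so neither the row order
within a pass nor the order of the passes should matter.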


> BTW the schema for the Solr csv is:
> id,b_b_links,b_a_links
> item1,"itemX itemY","itemZ"
>
> am I missing some "normal metadata"?
>

An item description is nice.
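
Something like the following would write that layout with one extra column.
This is just a sketch: the column name "description" and the sample text
are placeholders I made up, and the link fields are assumed to be
space-separated item ids as in your example.

```python
import csv
import io

def write_solr_csv(rows, out):
    """Write (id, b_b_links, b_a_links, description) rows as CSV.

    Assumption: each links value is a list of item ids, joined with
    spaces into a single multi-token field.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "b_b_links", "b_a_links", "description"])
    for item_id, b_b, b_a, description in rows:
        writer.writerow([item_id, " ".join(b_b), " ".join(b_a), description])

buf = io.StringIO()
write_solr_csv([("item1", ["itemX", "itemY"], ["itemZ"], "A sample item")], buf)
```

The csv module only quotes fields that need it, so the link fields come out
unquoted here; that is still equivalent CSV.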
