Yeah, I thought of that one too, but it still requires each DRM to be ordered by key, in which case simultaneous iteration works in one pass, I think.
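
Roughly what I have in mind for the one-pass iteration, as a minimal sketch in Python rather than the actual Mahout Java code. It assumes each DRM has already been turned into an iterator of (key, space-delimited links) pairs sorted by key, and that keys compare the same way they are sorted. The names merge_sorted and to_solr_csv are made up for illustration; only the column names id, b_b_links and b_a_links come from the real CSV schema.

```python
import csv
import io


def merge_sorted(bb_rows, ba_rows):
    """Merge two (key, value) iterators that are each sorted by key.

    Yields (key, bb_value, ba_value). When a key appears on only one side,
    the other value is an empty string.
    """
    bb_iter, ba_iter = iter(bb_rows), iter(ba_rows)
    bb, ba = next(bb_iter, None), next(ba_iter, None)
    while bb is not None or ba is not None:
        if ba is None or (bb is not None and bb[0] < ba[0]):
            # B'B is behind (or B'A is exhausted): write a one-sided row.
            yield bb[0], bb[1], ""
            bb = next(bb_iter, None)
        elif bb is None or ba[0] < bb[0]:
            # B'A is behind (or B'B is exhausted): write a one-sided row.
            yield ba[0], "", ba[1]
            ba = next(ba_iter, None)
        else:
            # Same key on both sides: write the combined row.
            yield bb[0], bb[1], ba[1]
            bb, ba = next(bb_iter, None), next(ba_iter, None)


def to_solr_csv(bb_rows, ba_rows):
    """Return CSV text with the header id,b_b_links,b_a_links."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "b_b_links", "b_a_links"])
    writer.writerows(merge_sorted(bb_rows, ba_rows))
    return out.getvalue()
```

Each input is read exactly once and nothing is buffered beyond the current row from each side, which is why the sort order matters: if either input is out of order, the same key can come out as two one-sided rows instead of one combined row.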
If the DRMs are always sorted by key, you can iterate through both at the same time, writing a row only when you have both fields or know that a field is missing from one DRM. If both iterators are on the same key, you write a combined doc; if the keys differ, you write one-sided docs from the DRM that is behind until it catches up to the other. Every DRM I've examined seems to be ordered by key, and I assume that is not an artifact of seqdumper. I'm using SequenceFileDirIterator, so the part-file splits aren't a problem.

A m/r join is pretty simple too, but I'll go with non-m/r unless there is a problem with the above.

BTW the schema for the Solr CSV is:

id,b_b_links,b_a_links
item1,"itemX itemY","itemZ"

Am I missing some "normal metadata"?

> On Aug 5, 2013, at 11:05 AM, Ted Dunning <[email protected]> wrote:
>
> What about just updating the document with the fields? Have three passes.
> Pass 1 puts the normal meta-data for the item in place. Pass 2 updates
> with data from B'B. Pass 3 updates with data from B'A.
>
> This will cause the entire index to be rewritten more than necessary, but
> it should be fast enough to be a non-issue.
