On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel <[email protected]> wrote:
> Yeah thought of that one too but it still requires each be ordered by Key,
> in which case simultaneous iteration works in one pass I think.
>

Multipass does not require ordering by key. Solr documents can be updated
in any order.

> If the DRMs are always sorted by Key you can iterate through each at the
> same time, writing only when you have both fields or know there is a field
> missing from one DRM. If you get the same key you write a combined doc, if
> you have different ones, write out one sided until it catches up to the
> other.
>

Yes. Merge will work when files are ordered and split consistently. I
don't think we should be making that assumption.

> Every DRM I've examined seems to be ordered by key and I assume that is
> not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part
> file splits aren't a problem.
>

But with the co- and cross-occurrence stuff, file splits could be a problem.

> A m/r join is pretty simple too but I'll go with non-m/r unless there is a
> problem above.
>

The simplest join is to use Solr updates. This would require only a minimal
amount of programming, certainly less than writing a merge program.

> BTW the schema for the Solr csv is:
> id,b_b_links,b_a_links
> item1,"itemX itemY","itemZ"
>
> am I missing some "normal metadata"?
>

An item description is nice.
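
For reference, here is a rough sketch of the one-pass merge described in the quoted text. It works only under the assumption questioned above, namely that both inputs arrive sorted by key. It is plain Python over in-memory (key, links) pairs rather than anything reading real DRM sequence files. The function names merge_sorted and to_solr_csv are made up for illustration, and only the column names come from the schema in this thread.

```python
import csv
import io


def merge_sorted(bb_rows, ba_rows):
    """Full outer merge of two iterables of (key, links) pairs.

    Assumes each input is already sorted by key with no duplicate keys.
    Yields (key, b_b_links, b_a_links); a missing side is an empty list.
    """
    bb_iter, ba_iter = iter(bb_rows), iter(ba_rows)
    bb, ba = next(bb_iter, None), next(ba_iter, None)
    while bb is not None or ba is not None:
        if ba is None or (bb is not None and bb[0] < ba[0]):
            # Key only on the b_b side so far: write it one-sided.
            yield bb[0], bb[1], []
            bb = next(bb_iter, None)
        elif bb is None or ba[0] < bb[0]:
            # Key only on the b_a side so far: write it one-sided.
            yield ba[0], [], ba[1]
            ba = next(ba_iter, None)
        else:
            # Same key on both sides: write a combined row.
            yield bb[0], bb[1], ba[1]
            bb, ba = next(bb_iter, None), next(ba_iter, None)


def to_solr_csv(rows):
    """Render merged rows as CSV text with space-separated link fields."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "b_b_links", "b_a_links"])
    for key, bb_links, ba_links in rows:
        writer.writerow([key, " ".join(bb_links), " ".join(ba_links)])
    return out.getvalue()


if __name__ == "__main__":
    b_b = [("item1", ["itemX", "itemY"]), ("item3", ["itemQ"])]
    b_a = [("item1", ["itemZ"]), ("item2", ["itemW"])]
    print(to_solr_csv(merge_sorted(b_b, b_a)))
```

If either input is not sorted, or the splits are not consistent, this silently produces duplicate ids instead of combined rows, which is the argument for the update-based join. Note also that Python's csv module quotes a field only when it needs to, so the link fields come out unquoted here.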
