Joining Nutch files

Hans Brende Fri, 23 Mar 2018 07:36:00 -0700

Question: I need the outer join of "crawl_fetch" and "content" as input to
a map-reduce job I'm writing, in order to access the *fetch time* and
*fetch status* alongside the fetched content. I'd like to use the
`org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat` for this task,
but it states in the documentation that it is "capable of performing joins
over a set of data sources sorted and partitioned the same way". I know
that "crawl_fetch" and "content" use the same key (the URL), but do I have
any sort of guarantee that they are "sorted and partitioned the same way"?
I'm using the latest Nutch version (1.15) from GitHub.
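
To make the concern concrete, here is a minimal plain-Java sketch of the kind of merge-style outer join I have in mind. It is not Hadoop's or Nutch's actual code: the class `SortedOuterJoin`, the `Joined` row type, and the `entry` helper are hypothetical names, the values are plain strings standing in for the real "crawl_fetch" and "content" records, and `String.compareTo` stands in for the real key comparator. It also assumes each URL appears at most once per input. The point is that a single forward pass like this only pairs records correctly if both inputs are sorted by the same key order, which is why the guarantee matters.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class SortedOuterJoin {

    /** One joined row; a null slot means that input had no record for the URL. */
    static final class Joined {
        final String url;
        final String fetchDatum; // stand-in for the crawl_fetch value
        final String content;    // stand-in for the content value

        Joined(String url, String fetchDatum, String content) {
            this.url = url;
            this.fetchDatum = fetchDatum;
            this.content = content;
        }
    }

    /** Hypothetical helper to build a (url, value) pair. */
    static Map.Entry<String, String> entry(String url, String value) {
        return new AbstractMap.SimpleEntry<>(url, value);
    }

    /**
     * Full outer join of two lists in one forward pass.
     * Assumes both lists are sorted ascending by key with no duplicate keys;
     * if either list is unsorted, matching keys can be missed.
     */
    static List<Joined> outerJoin(List<Map.Entry<String, String>> fetch,
                                  List<Map.Entry<String, String>> content) {
        List<Joined> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < fetch.size() || j < content.size()) {
            int cmp;
            if (i >= fetch.size()) {
                cmp = 1;   // only content records remain
            } else if (j >= content.size()) {
                cmp = -1;  // only fetch records remain
            } else {
                cmp = fetch.get(i).getKey().compareTo(content.get(j).getKey());
            }
            if (cmp == 0) {
                // Same URL on both sides: emit a fully populated row.
                out.add(new Joined(fetch.get(i).getKey(),
                        fetch.get(i).getValue(), content.get(j).getValue()));
                i++;
                j++;
            } else if (cmp < 0) {
                // URL present only in the fetch input.
                out.add(new Joined(fetch.get(i).getKey(),
                        fetch.get(i).getValue(), null));
                i++;
            } else {
                // URL present only in the content input.
                out.add(new Joined(content.get(j).getKey(),
                        null, content.get(j).getValue()));
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Joined> rows = outerJoin(
                Arrays.asList(entry("a", "fetch-a"), entry("b", "fetch-b"), entry("d", "fetch-d")),
                Arrays.asList(entry("b", "content-b"), entry("c", "content-c")));
        for (Joined row : rows) {
            System.out.println(row.url + "\t" + row.fetchDatum + "\t" + row.content);
        }
    }
}
```

In a real job the same reasoning would apply per partition: part N of "crawl_fetch" would have to hold exactly the same key range as part N of "content" for the pairing to work.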
