As a reference for other readers: https://issues.apache.org/jira/browse/NUTCH-939
On Friday 26 November 2010 11:59:26 Claudio Martella wrote:
> Hello list,
>
> I'm porting recrawl script to use hadoop (on an already existing hadoop
> cluster). I attach my version.
>
> What i found out is that Indexer and SolrIndexer want a list of
> segments. It's difficult to obtain the content of a directory through
> hdfs (/craw/segments/* will be expanded by bash and hadoop dfs -ls will
> return the content with details such as permissions, owners and dates),
> so I wrote these little patches to add the -dir option like
> SegmentMerger and LinkDB. They are attached too.
>
> They might be of interest for somebody else.
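
For anyone who cannot apply the patches, a possible workaround is to strip the extra columns from the `hadoop dfs -ls` output in the script itself. The following is only a rough sketch, not taken from the attached script: `extract_paths` is a hypothetical helper name, the sample listing is made up, and it assumes that each entry line starts with a permission string, ends with the path, and that paths contain no whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the path column of `hadoop dfs -ls` style output.
# Assumption: entry lines start with a permission string (d or -) and end with
# the path; header lines such as "Found N items" are skipped.
# Assumption: paths contain no whitespace.
extract_paths() {
  awk '$1 ~ /^[d-][rwx-]+/ && NF >= 8 { print $NF }'
}

# Made-up sample listing standing in for real command output.
sample='Found 2 items
drwxr-xr-x   - nutch supergroup          0 2010-11-26 11:59 /crawl/segments/20101126115926
drwxr-xr-x   - nutch supergroup          0 2010-11-27 08:00 /crawl/segments/20101127080000'

# Prints one segment path per line.
printf '%s\n' "$sample" | extract_paths
```

On a real cluster the sample would be replaced by the listing command itself, piped into `extract_paths`, and the resulting paths handed to the indexer as its segment list. The exact column layout of the listing may differ between Hadoop versions, so the filter should be checked against actual output first.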

