Hi all, I'd like to run a MapReduce job over the (latest) crawled data.
Should the input path be crawldb/current/? InputFormatClass = SequenceFileInputFormat.class? KV pair = <Text, CrawlDatum>, where Text represents the URL? Thanks.