On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel <[email protected]> wrote:
> Hi Chris,
>
>> I started off the crawler, using the runbot.sh script
>
> Which Nutch version and what script is used?

Nutch 1.6. Sorry, it's the newer "crawl" script (I just have a runbot.sh that calls it and writes the output to a file).

>> I'm up to 147813 URLs in Solr.
>
> Because there are also redirects, robots=noindex, and
> other URLs fetched but not indexed, the crawled content
> is somewhat larger. But it should be possible to get
> this amount crawled on a single node.

OK.

>> The LinkDB mapreduce calls were taking over 3 hours to run.
>
> That's probably because filtering and normalization is on:
> in this case all existing Links are normalized and filtered.
> If Outlinks are normalized and filtered during parse,
> this can be avoided and inverting links should get faster.
> Hadoop (but not in local mode) will also speed up the
> job: normalization and filtering is done in the mapper
> and as such is ideal for parallelization.

How do I normalize & filter outlinks? Do I just need to add some arguments to the crawl script, or are they configuration settings?

Thanks!
--
Chris
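[For the archives: a sketch of what parse-time normalization/filtering looks like in Nutch 1.x. The property names below are taken from nutch-default.xml as I understand it; check them against the defaults shipped with your Nutch version, since I haven't verified them against 1.6 specifically. Paths and directory names are placeholders.]

```xml
<!-- nutch-site.xml (sketch): normalize and filter outlink URLs at
     parse time, so the LinkDb inversion does not have to repeat the
     work on every existing link. Verify these property names against
     your version's nutch-default.xml before relying on them. -->
<property>
  <name>parse.normalize.urls</name>
  <value>true</value>
</property>
<property>
  <name>parse.filter.urls</name>
  <value>true</value>
</property>
```

With outlinks already cleaned at parse time, the LinkDb step can then skip normalization and filtering, e.g. `bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter` (the `-noNormalize`/`-noFilter` flags belong to the `invertlinks` command; whether your copy of the `crawl` script already passes them, or needs a small edit to do so, depends on the script version).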

