Hi Chris,

> How do I normalize & filter outlinks? Do I just need to add some arguments
> to the crawl script, or are they configuration settings?
Well, for historical reasons there are three possible ways to
enable/disable URL normalizers and filters:

* configuration properties
* command-line arguments for tools (updatedb, index, etc.)
* inverse command-line arguments (-noFilter/-noNormalize) for
  other tools (parse, invertlinks)

Command-line arguments always override configuration properties.

Sebastian

On 07/25/2014 01:44 PM, Christopher Gross wrote:
> On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel <[email protected]>
> wrote:
>
>> Hi Chris,
>>
>>> I started off the crawler, using the runbot.sh script
>> Which Nutch version and what script is used?
>>
> Nutch 1.6
> Sorry, it's the newer "crawl" script (I just have a runbot.sh that calls it
> and writes the output to a file).
>
>>> I'm up to 147813 URLs in Solr.
>> Because there are also redirects, robots=noindex, and
>> other URLs fetched but not indexed, the crawled content
>> is somewhat larger. But it should be possible to get
>> this amount crawled on a single node.
>>
> OK.
>
>>> The LinkDB mapreduce calls were taking over 3 hours to run.
>> That's probably because filtering and normalization are on:
>> in this case all existing links are normalized and filtered.
>> If outlinks are normalized and filtered during parse,
>> this can be avoided and inverting links gets faster.
>> Hadoop (but not in local mode) will also speed up the
>> job: normalization and filtering are done in the mapper
>> and thus are ideal for parallelization.
>>
> How do I normalize & filter outlinks? Do I just need to add some arguments
> to the crawl script, or are they configuration settings?
>
> Thanks!
>
> -- Chris
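As a minimal sketch of the first mechanism (configuration properties), parse-time filtering and normalization of outlinks can be switched on in conf/nutch-site.xml. The property names parse.filter.urls and parse.normalize.urls are an assumption here and should be checked against the nutch-default.xml shipped with your Nutch version:

```xml
<!-- conf/nutch-site.xml (sketch): filter and normalize outlinks at
     parse time, so that invertlinks/updatedb can skip that work later.
     Verify these property names exist in your version's nutch-default.xml. -->
<property>
  <name>parse.filter.urls</name>
  <value>true</value>
</property>
<property>
  <name>parse.normalize.urls</name>
  <value>true</value>
</property>
```

With outlinks already cleaned at parse time, the inverse command-line flags can then be used to skip the work in link inversion, e.g. `bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter` (paths are illustrative), since command-line arguments always override the configuration properties.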

