Sebastian, can you provide more information?
I'd be fine with using command line arguments, unless you have a case
against it. Are there any of the filter files that I'd need to work with?
Is there a page in the wiki that covers this a bit more?

Thanks.

-- Chris

On Mon, Jul 28, 2014 at 5:29 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Chris,
>
> > How do I normalize & filter outlinks? Do I just need to add some
> > arguments to the crawl script, or are they configuration settings?
>
> Well, for historical reasons there are 3 possible ways to enable/disable
> URL normalizers and filters:
> * configuration properties
> * command-line arguments for tools (updatedb, index, etc.)
> * inverse command-line arguments (-noFilter/-noNormalize)
>   for other tools (parse, invertlinks)
> Command-line arguments always override configuration properties.
>
> Sebastian
>
> On 07/25/2014 01:44 PM, Christopher Gross wrote:
> > On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel
> > <[email protected]> wrote:
> >
> >> Hi Chris,
> >>
> >>> I started off the crawler, using the runbot.sh script
> >> Which Nutch version and what script is used?
> >>
> > Nutch 1.6
> > Sorry, it's the newer "crawl" script (I just have a runbot.sh that
> > calls it and writes the output to a file).
> >
> >>> I'm up to 147813 URLs in Solr.
> >> Because there are also redirects, robots=noindex, and
> >> other URLs fetched but not indexed, the crawled content
> >> is somewhat larger. But it should be possible to get
> >> this amount crawled on a single node.
> >>
> > OK.
> >
> >>> The LinkDB mapreduce calls were taking over 3 hours to run.
> >> That's probably because filtering and normalization is on:
> >> in this case all existing links are normalized and filtered.
> >> If outlinks are normalized and filtered during parse
> >> this can be avoided and inverting links should get faster.
> >> Hadoop (but not in local mode) will also speed up the
> >> job: normalization and filtering is done in the mapper
> >> and as such is ideal for parallelization.
> >>
> > How do I normalize & filter outlinks? Do I just need to add some
> > arguments to the crawl script, or are they configuration settings?
> >
> > Thanks!
> >
> > -- Chris
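[Editor's note: as a minimal sketch of the "configuration properties" route Sebastian describes, parse-time normalization and filtering can be enabled via overrides in conf/nutch-site.xml. The property names parse.normalize.urls and parse.filter.urls are taken from the Nutch 1.x nutch-default.xml and should be verified against the nutch-default.xml shipped with your release.]

```xml
<!-- conf/nutch-site.xml (sketch only; confirm property names against
     the nutch-default.xml of your Nutch release) -->
<configuration>
  <!-- Normalize outlink URLs during parsing, so the LinkDb
       inversion job does not have to repeat the work. -->
  <property>
    <name>parse.normalize.urls</name>
    <value>true</value>
  </property>
  <!-- Apply the URL filters (e.g. regex-urlfilter.txt) to
       outlinks during parsing as well. -->
  <property>
    <name>parse.filter.urls</name>
    <value>true</value>
  </property>
</configuration>
```

[With outlinks already normalized and filtered at parse time, invertlinks could then be run with the -noNormalize/-noFilter flags Sebastian mentions, so the LinkDb job skips the duplicate pass.]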

