Example crawl script Nutch 2.1

2013-04-30 Thread James Ford
Hello! Is there an example crawl script for Nutch 2.1 that covers the Inject/Generate/Fetch/Parse/Update/Index phases? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Example-crawl-script-Nutch-2-1-tp4059960.html Sent from the Nutch - User

Re: Example crawl script Nutch 2.1

2013-04-30 Thread James Ford
Thanks for your answer! I think I will create my own modified crawl script then. But I am pretty confused about how to get the generated batchId. Should I just parse the id from the output: GeneratorJob: generated batch id: 1367327604-149897259 Or should I get the newly generated batchId from the
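The first option in the question above, parsing the id out of the generator's output, can be sketched as follows. This is only a minimal illustration, assuming the output contains a line worded exactly like the one quoted in the message; the function name extract_batch_id and the regular expression are hypothetical helpers, not part of Nutch, and the log wording may differ between versions.

```python
import re
from typing import Optional

# Assumption: the generate step logs a line shaped like the one quoted in the
# message, e.g. "GeneratorJob: generated batch id: 1367327604-149897259".
BATCH_ID_RE = re.compile(r"GeneratorJob: generated batch id: (\S+)")


def extract_batch_id(generate_output: str) -> Optional[str]:
    """Return the last batch id found in the captured output, or None."""
    matches = BATCH_ID_RE.findall(generate_output)
    # If the line appears more than once, the most recent one is used.
    return matches[-1] if matches else None


if __name__ == "__main__":
    sample = "GeneratorJob: generated batch id: 1367327604-149897259\n"
    print(extract_batch_id(sample))
```

A wrapper script could capture the output of the generate step, call this helper, and hand the returned id to the later phases. If the helper returns None, the script should stop rather than continue with an unknown batch.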

Parsing of document types

2012-12-12 Thread James Ford
Hello, Which document types can Nutch parse? I know that it works with PDF, but can it also parse MS Office documents and such? Thanks, James Ford -- View this message in context: http://lucene.472066.n3.nabble.com/Parsing-of-document-types-tp4026372.html Sent from the Nutch - User mailing

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread James Ford
long? Surely it can't take that much time to select X URLs from a database of about 10 million URLs? Thanks, James Ford -- View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html Sent from the Nutch - User mailing list archive

Make Nutch to crawl internal urls only

2012-05-09 Thread James Ford
will have topN at about 20k, and I want db_unfetched to be around 20k for each iteration. What should I set db.max.outlinks.per.page to? I was wondering about setting it to 4, to get 4*5k=20k for the first iteration. Can anyone help me? Thanks, James Ford

Bottleneck of my crawls: NativeCodeLoader

2012-03-26 Thread James Ford
platform... using builtin-java classes where applicable. This step takes about 15 minutes, compared to all the other steps, which take about 25 minutes in total. How can I make this step faster? Thanks, James Ford

Re: Generator taking time

2012-03-22 Thread James Ford
Thanks for the answer, Markus, but I don't think I follow you. I am new to Nutch. How could I make Nutch use the normalizer only when I have to? I tried removing the order of the normalizers in the config, but nothing happened.

Re: Only fetching initial seedlist

2012-03-02 Thread James Ford
Eh, can't you guys be a little more specific? I have searched the archives and found nothing of value. -- View this message in context: http://lucene.472066.n3.nabble.com/Only-fetching-initial-seedlist-tp3791957p3793253.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Only fetching initial seedlist

2012-03-02 Thread James Ford
I am 100% sure that the regex-urlfilters is not the problem. I know regex patterns from before. But it seems that the solution to my problem is to set db.max.outlinks.per.page to 0?

Only fetching initial seedlist

2012-03-01 Thread James Ford
manually. Thanks, James Ford -- View this message in context: http://lucene.472066.n3.nabble.com/Only-fetching-initial-seedlist-tp3791957p3791957.html Sent from the Nutch - User mailing list archive at Nabble.com.