Hello!
Is there an example crawl script for Nutch 2.1? It should cover the
Inject/Generate/Fetch/Parse/Update/Index phases.
Thanks
--
View this message in context:
http://lucene.472066.n3.nabble.com/Example-crawl-script-Nutch-2-1-tp4059960.html
Sent from the Nutch - User
Thanks for your answer!
I think I will create my own modified crawl script then. But I am pretty
confused about how to get the generated batchId. Should I just parse the id from
the output:
GeneratorJob: generated batch id: 1367327604-149897259
Or should I get the newly generated batchId from the
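On parsing the id from the output: a minimal sketch in Python, assuming the generate step prints a line of exactly the form quoted above. The function name `extract_batch_id` and the pattern are illustrative only and are not part of Nutch.

```python
import re

# Matches the log line quoted above, e.g.
# "GeneratorJob: generated batch id: 1367327604-149897259"
# Assumption: the id itself contains no whitespace.
BATCH_ID_PATTERN = re.compile(r"GeneratorJob: generated batch id: (\S+)")


def extract_batch_id(output):
    """Return the batch id found in the captured output, or None if absent."""
    match = BATCH_ID_PATTERN.search(output)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = "GeneratorJob: generated batch id: 1367327604-149897259"
    print(extract_batch_id(line))
```

A wrapper script could capture the generate step's output, pass it to this function, and hand the result to the later phases. If the log format differs between versions, the pattern needs adjusting.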
Hello,
Which document types can Nutch parse? I know that it works with PDF, but can
it also parse MS Office documents and such?
Thanks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Parsing-of-document-types-tp4026372.html
long? It can't take that
much time to select X URLs from a database of about 10 million URLs?
Thanks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
will have topN at about 20k, and I want the db_unfetched to be around 20k
for each iteration?
What should I set db.max.outlinks.per.page to? I was thinking of
setting it to 4, to get 4*5k=20k for the first iteration.
Can anyone help me?
Thanks,
James Ford
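The 4*5k=20k estimate can be written out as a small Python sketch. It assumes that 5k is the number of pages fetched in the first iteration. The result is only an upper bound, since duplicate links and URL filters reduce the number of URLs that are actually new. The function name `max_new_urls` is made up for illustration.

```python
def max_new_urls(pages_fetched, max_outlinks_per_page):
    """Upper bound on the URLs one iteration can add to the db.

    Each fetched page contributes at most max_outlinks_per_page outlinks;
    duplicates and filtered URLs make the real number smaller.
    """
    return pages_fetched * max_outlinks_per_page


if __name__ == "__main__":
    print(max_new_urls(5_000, 4))
```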
--
View this message in context:
http
platform... using builtin-java classes where
applicable
This step takes about 15 minutes, while all the other steps together take
about 25 minutes. How can I make this step faster?
Thanks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Bottleneck-of-my
Thanks for the answer, Markus,
But I don't think I follow you. I am new to Nutch. How could I make Nutch
use the normalizer only when I have to? I tried removing the order of the
normalizers in the config, but nothing happened.
--
Eh,
Can't you guys be a little more specific? I have searched the archives and
found nothing of value.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Only-fetching-initial-seedlist-tp3791957p3793253.html
I am 100% sure that the regex URL filter is not the problem. I have worked
with regex patterns before. But it seems that the solution to my problem is to set
db.max.outlinks.per.page to 0?
--
manually.
Thanks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Only-fetching-initial-seedlist-tp3791957p3791957.html