Hi,
I am using Apache Nutch 1.4 and it is crawling perfectly, but I have run into issues crawling some sites.
For testing my crawling, I took http://www.jabong.com
I found that it is able to crawl the categories but could not crawl the pages.
For example, look at this:
Can somebody please help? Why are some sites not being crawled?
E.g., Nutch failed to crawl:
http://www.myntra.com
http://www.jabong.com
http://www.youtube.com
It is successfully crawling some other sites.
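A quick way to check this yourself (a sketch, assuming a standard Nutch 1.4 checkout run from the Nutch home directory, and the stock URLFilterChecker tool that ships with 1.x) is to look at the site's robots.txt and at what your URL filters do with the seed:

  # see whether robots.txt forbids crawling; YouTube's, for instance, disallows most paths
  curl -s http://www.youtube.com/robots.txt | head -20

  # see whether the seed survives the configured URL filter chain
  # (URLFilterChecker reads URLs from stdin and prints + or - for each)
  echo "http://www.jabong.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined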
Hello,
I am trying to optimize my crawls as much as possible. The current
bottleneck is the step after adding segments to the linkdb, where Nutch is
trying to load the native-hadoop library:
2012-03-26 13:20:59,089 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
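As far as I know this is only a one-time WARN: when the native library is missing, Hadoop falls back to its pure-Java codecs, so the load attempt itself should not be a real bottleneck. If you want the native libs anyway, here is a sketch (the paths are assumptions, adjust to your Hadoop build and platform; the stock bin/nutch script picks up lib/native/<platform>):

  # copy the native Hadoop libraries into the directory bin/nutch puts on java.library.path
  mkdir -p $NUTCH_HOME/lib/native/Linux-amd64-64
  cp $HADOOP_HOME/lib/native/Linux-amd64-64/libhadoop.* \
     $NUTCH_HOME/lib/native/Linux-amd64-64/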
I guess I STILL don't understand the topN setting. Here is what I thought it
would do:
Seed: file:myfileserver.com/share1
share1 Dir listing:
file1.pdf ... file300.pdf, dir1 ... dir20
Running the following in a never-ending shell script:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
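For what it's worth, -topN does not cap the crawl as a whole; it caps how many of the highest-scoring due URLs go into the one segment that each generate call produces. A minimal sketch of the usual loop (stock Nutch 1.4 commands; the line that picks the newest segment is an assumption about the directory layout):

  while true; do
    # put at most the 1000 best-scoring due URLs into a new segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    # feed newly discovered links back so the next generate can pick them
    bin/nutch updatedb crawl/crawldb "$SEGMENT"
  done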
I believe it is complaining about this:
public void addIndexBackendOptions(Configuration conf) {
  LuceneWriter.addFieldOptions(MP3_TRACK_TITLE,
      LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf);
  LuceneWriter.addFieldOptions(MP3_ALBUM, LuceneWriter.STORE.YES,
      LuceneWriter.INDEX.TOKENIZED, conf);
}
Hey there,
currently I am trying to debug the dedup results from Nutch. There is a page
that is exactly the same (I compared the HTML with a diff tool) as one on a
different domain, but dedup does not delete this entry.
Is this caused by the different domain? If so, is there a way to
configure that?
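For what it's worth, dedup compares the stored content digest (MD5 of the fetched content by default), not the domain, so two pages that look identical in a diff tool can still differ byte-for-byte (whitespace, session IDs, timestamps). A quick sketch to compare what the two servers actually return (the hostnames are placeholders):

  curl -s http://site-a.example/page.html | md5sum
  curl -s http://site-b.example/page.html | md5sum
  # same digest: dedup should see them as duplicates
  # different digest: some bytes differ, which would explain why both entries survive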
Thank you very much, I got it working now!
On 26 March 2012 at 15:26, webdev1977 webdev1...@gmail.com wrote:
I think I may have figured it out... but I don't know how to fix it :-(
I have many PDFs and HTML files that have relative links in them. They are
not from the originally hosting site, but are re-hosted. Nutch/Tika is
trying to prepend the relative URLs it encounters with the URL that
contained them.
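That prepending is ordinary relative-URL resolution: an outlink such as images/a.png is resolved against the URL the document was fetched from, i.e. the re-hosting URL, not the original site. A sketch to inspect what Nutch/Tika actually extracts for one document (the URL is a made-up example of a re-hosted file):

  # parsechecker prints the parse data, including the fully resolved outlinks
  bin/nutch parsechecker http://mirror.example.com/rehosted/report.pdf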
Hello, I have some questions; sorry if I'm such a noob.
Is there a way to divide the fetch process between two or
more computers using distinct internet connections? Maybe
divide the load from the crawldb into segments and afterwards do
a merge process with them? Is Hadoop only for storage sharing?
I hope you can help.
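Hadoop is not only for storage sharing: in deploy mode the fetch itself runs as a MapReduce job, and generate can partition the fetch list so several nodes fetch in parallel, each over its own connection. A sketch, assuming a two-node Hadoop cluster running the Nutch job jar (the segment name is a placeholder):

  # split the generated fetch list into 2 parts, one per fetcher node
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -numFetchers 2
  # fetch then runs as a MapReduce job, one map task per fetch-list part
  bin/nutch fetch crawl/segments/<segment>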