Nutch not crawling jabong

2012-03-26 Thread blunderboy
Hi, I am using apache-nutch 1.4 and it is crawling perfectly, but I have run into issues crawling some sites. To test my crawling I took http://www.jabong.com and found that it is able to crawl the category pages but not the product pages. For example, look at this:
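
[Editor's note: a common first check for this symptom, not stated in the thread, is the URL filter chain: the deeper page URLs may simply be rejected before fetching. A minimal sketch, assuming a stock Nutch 1.4 conf/ layout; the product URL is illustrative:

    # List the active filter rules (comment lines stripped).
    grep -v '^#' conf/regex-urlfilter.txt
    # Feed a candidate URL through all configured filters; a leading '-' in
    # the output means the URL is being rejected.
    echo 'http://www.jabong.com/some-product' | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
]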

Re: Nutch not crawling jabong

2012-03-26 Thread blunderboy
Can somebody please help? Why are some sites not being crawled? E.g. Nutch failed to crawl http://www.myntra.com, http://www.jabong.com, and http://www.youtube.com, while it successfully crawls some other sites.
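
[Editor's note: another cause worth ruling out (an assumption, not from the thread) is robots.txt, which Nutch honors; large sites often disallow unknown crawlers. A quick check from the shell:

    # Inspect the rules each site serves; a broad "Disallow: /" under
    # "User-agent: *" would explain why Nutch fetches nothing there.
    for site in www.myntra.com www.jabong.com www.youtube.com; do
      echo "== $site =="
      curl -s "http://$site/robots.txt" | head -n 20
    done
]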

Bottleneck of my crawls: NativeCodeLoader

2012-03-26 Thread James Ford
Hello, I am trying to optimize my crawls as much as possible. The current bottleneck is the step after adding segments to the linkdb, where Nutch tries to load the native-hadoop library: 2012-03-26 13:20:59,089 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
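
[Editor's note: this warning itself is harmless; Hadoop falls back to its pure-Java implementations, which can be slower but work. If the native library is installed, pointing java.library.path at it removes the warning. A hedged sketch; the library path is an assumption for a 64-bit Linux Hadoop install:

    # bin/nutch honors NUTCH_OPTS, so the property can be passed this way.
    export NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=/usr/local/hadoop/lib/native/Linux-amd64-64"
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
]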

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-26 Thread webdev1977
I guess I STILL don't understand the topN setting. Here is what I thought it would do: Seed: file:myfileserver.com/share1. The share1 dir listing: file1.pdf ... file300.pdf, dir1 ... dir20. Running the following in a never-ending shell script: {generate crawl/crawldb crawl/segments -topN 1000
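
[Editor's note: -topN does not mean N URLs per directory; each generate round selects at most the N highest-scoring eligible URLs from the entire crawldb. A sketch of the full loop as described, with the remaining steps filled in as assumptions:

    while true; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
      bin/nutch fetch "$segment"
      bin/nutch parse "$segment"
      bin/nutch updatedb crawl/crawldb "$segment"
    done
]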

Re: Older plugin in Nutch 1.4

2012-03-26 Thread webdev1977
I believe it is complaining about this:

    public void addIndexBackendOptions(Configuration conf) {
      LuceneWriter.addFieldOptions(MP3_TRACK_TITLE, LuceneWriter.STORE.YES,
          LuceneWriter.INDEX.TOKENIZED, conf);
      LuceneWriter.addFieldOptions(MP3_ALBUM, LuceneWriter.STORE.YES, ...
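
[Editor's note: the likely root cause is that recent Nutch releases dropped the Lucene indexing backend, and org.apache.nutch.indexer.lucene.LuceneWriter with it, so plugins written against that API fail under 1.4. A hedged way to confirm, assuming source trees of an old and the new release sit side by side:

    # The class ships in the 1.2 tree but is absent from 1.4.
    grep -rl "class LuceneWriter" apache-nutch-1.2/src/java
    grep -rl "class LuceneWriter" apache-nutch-1.4/src/java || echo "gone in 1.4"
]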

Pages that do not dedup

2012-03-26 Thread Jan Riewe
Hey there, currently I am trying to debug the dedup results from Nutch. There is a page which is exactly the same (I compared the HTML with a diff tool) as one on a different domain, but dedup does not delete this entry. Is this caused by the different domain? If so, is there a way to configure that?
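
[Editor's note: one detail that may help, offered as an assumption about the cause: dedup keys on the content signature rather than the domain, so two byte-identical pages on different domains should collide. If the pages differ in invisible ways (whitespace, headers), the default MD5Signature treats them as distinct; TextProfileSignature is tolerant of such noise. A sketch:

    # Paste this property inside the <configuration> element of
    # conf/nutch-site.xml, then recrawl so new signatures are computed:
    #
    #   <property>
    #     <name>db.signature.class</name>
    #     <value>org.apache.nutch.crawl.TextProfileSignature</value>
    #   </property>
    #
    # Then remove duplicates from the Solr index (URL is an assumption):
    bin/nutch solrdedup http://localhost:8983/solr/
]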

Re: Older plugin in Nutch 1.4

2012-03-26 Thread Vicente Canhoto
Thank you very much, I got it working now! On 26 March 2012 at 15:26, webdev1977 webdev1...@gmail.com wrote: I believe it is complaining about this: public void addIndexBackendOptions(Configuration conf) { LuceneWriter.addFieldOptions(MP3_TRACK_TITLE,

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-26 Thread webdev1977
I think I may have figured it out... but I don't know how to fix it :-( I have many PDFs and HTML files that have relative links in them. They are not from the originally hosted site, but are re-hosted. Nutch/Tika is trying to prepend the relative URLs it encounters with the URL that contained
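
[Editor's note: a hedged way to confirm that diagnosis before fixing anything: dump the crawldb and look at the unfetched entries, which should show the malformed URLs produced by resolving relative links against the re-hosting page. Paths follow the earlier posts:

    bin/nutch readdb crawl/crawldb -stats         # quick status counts
    bin/nutch readdb crawl/crawldb -dump dbdump   # full plain-text dump
    grep -B1 "db_unfetched" dbdump/part-00000 | head -n 40
]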

divide fetch process?

2012-03-26 Thread pepe3059
Hello, I have some questions, sorry if I'm such a noob. Is there a way to divide the fetch process between two or more computers using distinct internet connections? Maybe divide the load from the crawldb into segments and afterwards do a merge process with them? Is Hadoop only for storage sharing? I hope you
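
[Editor's note: the short answer, standard Nutch practice rather than a reply from this digest: Hadoop is a compute framework as well as storage, and running the crawl as a MapReduce job spreads fetching across the cluster. The generator can also partition one cycle into several fetch lists, one per machine. A sketch for two fetch nodes:

    # -numFetchers splits the generated segment into 2 fetch lists,
    # partitioned by host, so each machine fetches a disjoint set of sites.
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -numFetchers 2
    segment=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch "$segment"   # on a Hadoop cluster: one map task per fetch list
]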