Re: Re: using solr indexing exception

2014-05-16 Thread feng lu
It seems that you didn't set the batch id correctly. I see the Crawler class is not used in the launch script, so you can try the bin/nutch or bin/crawl command to run Nutch again. On Thu, May 15, 2014 at 9:10 AM, 基勇 252637...@qq.com wrote: Can anyone help solve this problem? Thanks
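
For reference, a minimal sketch of running the bundled script, assuming a Nutch 2.x checkout; the seed directory, crawl id, Solr URL and round count below are placeholders, and the exact argument order can be checked by running bin/crawl with no arguments:

    # Placeholder arguments: <seedDir> <crawlId> <solrUrl> <numberOfRounds>.
    # The script drives inject/generate/fetch/parse/updatedb itself and passes
    # the batch id between the steps, so it never has to be set by hand.
    bin/crawl urls/ myCrawl http://localhost:8983/solr/ 2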

Nutch can't crawl particular website

2014-05-16 Thread irfan romadona
Hi, I'm new to Nutch. I have crawled several sites using Nutch and it works, with several websites as exceptions. I've looked in hadoop.log but can't find any suspected errors for the failed crawl sites. No documents are added on the console, unlike other successful crawls, e.g.: 2014-05-15

Combining Document Parse Data

2014-05-16 Thread Iain Lopata
I have a situation in which, ideally, I would like to combine data parsed from two separate web pages into a single document, which would then be indexed into Solr. I have looked at the options for passing two separate documents to Solr and combining the data at query time, but none of the

Crawl Email Server with IMAPS or POP3

2014-05-16 Thread Lewis John Mcgibbney
Hi Folks, Has anyone done this before? Is email archiving something we can do or not? I've been playing around with Geronimo's Javamail library and wondered if we could use it as protocol extensions for the above protocols. Any thoughts? -- Lewis
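
Nothing is settled in the thread, but purely as an illustration of what such a protocol plugin would wrap, a minimal JavaMail fetch over IMAPS might look like the sketch below; the host, user and password are placeholders, and turning each message into a Nutch document is left out:

    import java.util.Properties;
    import javax.mail.Folder;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Store;

    public class ImapsFetchSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; a protocol plugin would take these
            // from the URL being fetched and from the Nutch configuration.
            Properties props = new Properties();
            props.put("mail.store.protocol", "imaps");
            Session session = Session.getInstance(props);
            Store store = session.getStore("imaps");
            store.connect("mail.example.com", "user", "password");
            Folder inbox = store.getFolder("INBOX");
            inbox.open(Folder.READ_ONLY);
            // Each message could become one fetched document for parsing/indexing.
            for (Message msg : inbox.getMessages()) {
                System.out.println(msg.getSubject());
            }
            inbox.close(false);
            store.close();
        }
    }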

Re: Solr 4.7 Schema?

2014-05-16 Thread BlackIce
The title field needs to be set to multiValued - this is a Tika issue; Tika may return multiple values for title on PDFs. On Thu, May 8, 2014 at 1:37 AM, BlackIce blackice...@gmail.com wrote: Thanks On Wed, May 7, 2014 at 4:07 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi BlackIce, On
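
In schema.xml that amounts to setting multiValued on the title field, roughly as below; the field type shown is the stock Solr 4.x text_general rather than whatever type the existing schema already uses:

    <!-- Lets Tika contribute more than one title value for a PDF -->
    <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>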

Re: nutch dedup on 1.8

2014-05-16 Thread Julien Nioche
Hi The dedup is now independent of any specific backend, as you can see by typing './nutch dedup': *Usage: DeduplicationJob crawldb* What it does is mark the duplicates within the crawldb; this is then used by the indexer to delete the corresponding entries. I have updated the
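
A minimal sketch of that sequence, with placeholder paths:

    # Mark duplicates in the crawldb (Usage: DeduplicationJob <crawldb>)
    bin/nutch dedup crawl/crawldb

    # Then re-run your usual indexing step (or, depending on the setup, the
    # cleaning job, e.g. "bin/nutch clean crawl/crawldb") so that the entries
    # marked as duplicates are deleted from the index.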

Re: Nutch fetching on only one node

2014-05-16 Thread Julien Nioche
Hi, Usage: Generator crawldb segments_dir [-force] [-topN N] *[-numFetchers numFetchers]* [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] Set -numFetchers 10 to use all your slaves. Of course, if all your URLs belong to the same host, they'll end up being processed by a single
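
For example, matching the usage string above (paths and counts are placeholders):

    # Generate fetch lists split across 10 fetcher (map) tasks so the fetch
    # phase can run on all of the slave nodes.
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 10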

RE: Fetcher-Parser Nutch 2.2.1

2014-05-16 Thread Vangelis karv
I think patch-1651 https://issues.apache.org/jira/browse/NUTCH-1651 solved my problem. From: karvouni...@hotmail.com To: user@nutch.apache.org Subject: RE: Fetcher-Parser Nutch 2.2.1 Date: Mon, 12 May 2014 12:20:52 +0300 Thank you Talat in advance for helping me so much! How can I get rid