Is this a bug?

On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:

> Putting -filter between crawldb and segments, I still got the same thing:
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>
> On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <[email protected]> wrote:
>
>> These are roughly the available parameters:
>>
>> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb
>> <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
>> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
>> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>>
>> Having -filter at the end should work fine. If for some reason it
>> doesn't, put it after the crawldb and before the segment, and file an
>> issue in Jira. It works here if I have -filter at the end.
>>
>> Cheers
>>
>> -----Original message-----
>> > From: Joe Zhang <[email protected]>
>> > Sent: Thu 22-Nov-2012 23:05
>> > To: Markus Jelsma <[email protected]>; user <[email protected]>
>> > Subject: Re: Indexing-time URL filtering again
>> >
>> > Yes, I forgot to do that. But still, what exactly should the command
>> > look like?
>> >
>> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>> >
>> > This command would cause nutch to interpret "-filter" as a path.
>> >
>> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <[email protected]> wrote:
>> > Hi,
>> >
>> > I just tested a small index job that usually writes 1200 records to Solr.
>> > It works fine if I specify -. in a filter (index nothing) and point
>> > to it with -Durlfilter.regex.file=path like you do. I assume that by
>> > `it doesn't work` you mean that it filters nothing and indexes all
>> > records from the segment. Did you forget the -filter parameter?
>> >
>> > Cheers
>> >
>> > -----Original message-----
>> > > From: Joe Zhang <[email protected]>
>> > > Sent: Thu 22-Nov-2012 07:29
>> > > To: user <[email protected]>
>> > > Subject: Indexing-time URL filtering again
>> > >
>> > > Dear List:
>> > >
>> > > I asked a similar question before, but I haven't solved the problem.
>> > > Therefore I am re-asking the question more clearly and seeking advice.
>> > >
>> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
>> > > rudimentary level.
>> > >
>> > > The basic problem I face in crawling/indexing is that I need to control
>> > > which pages the crawlers should VISIT (so far through
>> > > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr.
>> > > The latter are only a SUBSET of the former, and they are giving me a
>> > > headache.
>> > >
>> > > A real-life example would be: when we crawl CNN.com, we only want to
>> > > index "real content" pages such as
>> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1 .
>> > > When we start the crawling from the root, we can't specify tight
>> > > patterns (e.g.,
>> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..* ) in
>> > > nutch/conf/regex-urlfilter.txt, because the pages on the path between
>> > > root and content pages do not satisfy such patterns. Putting such
>> > > patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize
>> > > the coverage of the crawl.
>> > >
>> > > The closest solution I've got so far (courtesy of Markus) was this:
>> > >
>> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>> > >
>> > > but unfortunately I haven't been able to make it work for me. The
>> > > content of the urlfilter.regex.file is what I thought was "correct",
>> > > something like the following:
>> > >
>> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
>> > > -.
>> > >
>> > > Everything seems quite straightforward. Am I doing anything wrong
>> > > here? Can anyone advise? I'd greatly appreciate it.
>> > >
>> > > Joe
-- Lewis
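
For anyone following the thread, here is a minimal Python sketch of how a rule file like the one Joe posted would be evaluated. It assumes the usual regex-urlfilter semantics: rules are tried top to bottom, the first pattern found anywhere in the URL decides (+ accepts, - rejects), and a URL that matches no rule is rejected. The names load_rules, accepts and RULES are made up for illustration and are not part of Nutch. Python's re module also differs from Java regex in corner cases, so treat this as a way to sanity-check patterns, not as the real filter.

```python
import re


def load_rules(text):
    """Parse rule lines of the form '+regex' or '-regex'.

    Blank lines and lines starting with '#' are skipped.
    Returns a list of (allow, compiled_pattern) tuples in file order.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: a pattern matches if it is found anywhere in the URL,
    and a URL that matches no rule at all is rejected.
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# The two rules quoted in the original post.
RULES = load_rules(r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
""")

if __name__ == "__main__":
    for url in (
        "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1",
        "http://www.cnn.com/",
    ):
        print(url, "->", "index" if accepts(RULES, url) else "skip")
```

With these rules the article URL from the original post is accepted, while http://www.cnn.com/ falls through to "-." and is rejected. Under the assumptions above, the rule file itself looks reasonable, which points back at how -filter is being passed on the command line.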

