I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012.
The following command:

bin/nutch solrindex -Durlfilter.regex.file=.....UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter

produced the following output:

SolrIndexer: starting at 2012-11-25 16:19:29
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: true
SolrIndexer: URL normalizing: false
java.io.IOException: Job failed!

Can anybody help?

On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang <[email protected]> wrote:

> How exactly do I get to trunk?
>
> I did download NUTCH-1300-1.5-1.patch, ran the patch command
> correctly, and re-built nutch. But the problem still persists...
>
> On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <[email protected]> wrote:
>
>> No, this is no bug. As i said, you need either to patch your Nutch or get
>> the sources from trunk. The -filter parameter is not in your version. Check
>> the patch manual if you don't know how it works.
>>
>> $ cd trunk ; patch -p0 < file.patch
>>
>> -----Original message-----
>> > From: Joe Zhang <[email protected]>
>> > Sent: Sun 25-Nov-2012 08:42
>> > To: Markus Jelsma <[email protected]>; user <[email protected]>
>> > Subject: Re: Indexing-time URL filtering again
>> >
>> > This does seem a bug. Can anybody help?
>> >
>> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <[email protected]> wrote:
>> >
>> > > Markus, could you advise? Thanks a lot!
>> > >
>> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <[email protected]> wrote:
>> > >
>> > >> I followed your instruction and applied the patch, Markus, but the
>> > >> problem still persists --- "-filter" is interpreted as a path by solrindex.
>> > >>
>> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <[email protected]> wrote:
>> > >>
>> > >>> Ah, i get it now. Please use trunk or patch your version with:
>> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
>> > >>>
>> > >>> -----Original message-----
>> > >>> > From: Joe Zhang <[email protected]>
>> > >>> > Sent: Fri 23-Nov-2012 03:08
>> > >>> > To: [email protected]
>> > >>> > Subject: Re: Indexing-time URL filtering again
>> > >>> >
>> > >>> > But Markus said it worked for him. I was really hoping he could send his
>> > >>> > command line.
>> > >>> >
>> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <[email protected]> wrote:
>> > >>> >
>> > >>> > > Is this a bug?
>> > >>> > >
>> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:
>> > >>> > > > Putting -filter between crawldb and segments, I still got the same thing:
>> > >>> > > >
>> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
>> > >>> > > > Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
>> > >>> > > > Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
>> > >>> > > > Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>> > >>> > > >
>> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <[email protected]> wrote:
>> > >>> > > >
>> > >>> > > >> These are roughly the available parameters:
>> > >>> > > >>
>> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
>> > >>> > > >> [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
>> > >>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
>> > >>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>> > >>> > > >>
>> > >>> > > >> Having -filter at the end should work fine. If, for some reason, it
>> > >>> > > >> doesn't work, put it before the segment and after the crawldb and file an
>> > >>> > > >> issue in jira. It works here if i have -filter at the end.
>> > >>> > > >>
>> > >>> > > >> Cheers
>> > >>> > > >>
>> > >>> > > >> -----Original message-----
>> > >>> > > >> > From: Joe Zhang <[email protected]>
>> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
>> > >>> > > >> > To: Markus Jelsma <[email protected]>; user <[email protected]>
>> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
>> > >>> > > >> >
>> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should the command
>> > >>> > > >> > look like?
>> > >>> > > >> >
>> > >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt
>> > >>> > > >> > http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>> > >>> > > >> >
>> > >>> > > >> > this command would cause nutch to interpret "-filter" as a path.
>> > >>> > > >> >
>> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <[email protected]> wrote:
>> > >>> > > >> > Hi,
>> > >>> > > >> >
>> > >>> > > >> > I just tested a small index job that usually writes 1200 records to
>> > >>> > > >> > Solr. It works fine if i specify -. in a filter (index nothing) and point
>> > >>> > > >> > to it with -Durlfilter.regex.file=path like you do. I assume you mean by
>> > >>> > > >> > `it doesn't work` that it filters nothing and indexes all records from the
>> > >>> > > >> > segment. Did you forget the -filter parameter?
>> > >>> > > >> >
>> > >>> > > >> > Cheers
>> > >>> > > >> >
>> > >>> > > >> > -----Original message-----
>> > >>> > > >> > > From: Joe Zhang <[email protected]>
>> > >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
>> > >>> > > >> > > To: user <[email protected]>
>> > >>> > > >> > > Subject: Indexing-time URL filtering again
>> > >>> > > >> > >
>> > >>> > > >> > > Dear List:
>> > >>> > > >> > >
>> > >>> > > >> > > I asked a similar question before, but I haven't solved the problem.
>> > >>> > > >> > > Therefore I try to re-ask the question more clearly and seek advice.
>> > >>> > > >> > >
>> > >>> > > >> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
>> > >>> > > >> > > rudimentary level.
>> > >>> > > >> > >
>> > >>> > > >> > > The basic problem I face in crawling/indexing is that I need to control
>> > >>> > > >> > > which pages the crawlers should VISIT (so far through
>> > >>> > > >> > > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr.
>> > >>> > > >> > > The latter are only a SUBSET of the former, and they are giving me a headache.
>> > >>> > > >> > >
>> > >>> > > >> > > A real-life example would be: when we crawl CNN.com, we only want to index
>> > >>> > > >> > > "real content" pages such as
>> > >>> > > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
>> > >>> > > >> > > When we start the crawling from the root, we can't specify tight
>> > >>> > > >> > > patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
>> > >>> > > >> > > in nutch/conf/regex-urlfilter.txt, because the pages on the path between
>> > >>> > > >> > > root and content pages do not satisfy such patterns. Putting such patterns
>> > >>> > > >> > > in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage
>> > >>> > > >> > > of the crawl.
>> > >>> > > >> > >
>> > >>> > > >> > > The closest solution I've got so far (courtesy of Markus) was this:
>> > >>> > > >> > >
>> > >>> > > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>> > >>> > > >> > >
>> > >>> > > >> > > but unfortunately I haven't been able to make it work for me. The content
>> > >>> > > >> > > of the urlfilter.regex.file is what I thought "correct" --- something like
>> > >>> > > >> > > the following:
>> > >>> > > >> > >
>> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
>> > >>> > > >> > > -.
>> > >>> > > >> > >
>> > >>> > > >> > > Everything seems quite straightforward. Am I doing anything wrong here?
>> > >>> > > >> > > Can anyone advise? I'd greatly appreciate it.
>> > >>> > > >> > >
>> > >>> > > >> > > Joe
>> > >>> > >
>> > >>> > > --
>> > >>> > > Lewis
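[Editorial note appended to the archived thread.] The include/exclude rules in the urlfilter.regex.file quoted above can be sanity-checked outside Nutch before an index run. A minimal sketch using `grep -E` (an approximation only: Nutch's RegexURLFilter uses Java regexes and its own ordered +/- rule semantics, and the second sample URL is invented here):

```shell
# Pattern from the thread: accept CNN date-style article URLs.
# grep's match/no-match stands in for the "+pattern" rule followed by
# the catch-all "-." reject rule.
pattern='^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*'

check() {
  if echo "$1" | grep -Eq "$pattern"; then
    echo "+ $1"   # would be indexed
  else
    echo "- $1"   # filtered out by the trailing "-." rule
  fi
}

check "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"
check "http://www.cnn.com/US/"
```

The first URL matches the date pattern and prints with `+`; the second (a hub page) falls through to `-`, which is the VISIT-but-don't-INDEX split the original post asks for.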
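[Editorial note appended to the archived thread.] Markus's `cd trunk ; patch -p0 < file.patch` step can be rehearsed on a throwaway file first. A self-contained sketch of the `patch -p0` mechanics (the file names here are invented; this is not the actual NUTCH-1300 patch):

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d) && cd "$tmp"

printf 'old line\n' > demo.txt
printf 'new line\n' > demo.new
diff -u demo.txt demo.new > demo.patch || true  # diff exits non-zero when files differ

# -p0 keeps the paths from the patch header unstripped, which is why a
# real patch like NUTCH-1300-1.5-1.patch must be applied from the source root.
patch -p0 demo.txt < demo.patch
cat demo.txt   # now reads "new line"
```

After applying a real patch to the Nutch sources, the job still has to be rebuilt (`ant`) before the new `-filter` option exists in the deployed scripts.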
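[Editorial note appended to the archived thread.] The `InvalidInputException` on `.../-filter/crawl_fetch` reported above is consistent with a positional-greedy argument scan: a SolrIndexer build that does not know the `-filter` flag treats every trailing argument as a segment directory. A toy shell model of that failure mode (not Nutch's actual code):

```shell
# Toy parser with no -filter option: after consuming the Solr URL and
# crawldb, everything left is assumed to be a segment path.
parse_args() {
  solr_url=$1
  crawldb=$2
  shift 2
  segments="$*"
}

parse_args http://localhost:8983/solr/ crawl/crawldb crawl/segments/20121125 -filter
echo "$segments"
# "-filter" lands in the segment list, so the job later looks for the
# nonexistent input path "-filter/crawl_fetch" and fails.
```

This is why repositioning the flag cannot help on Nutch 1.5.1: the fix in the thread is to run a version whose parser actually recognizes `-filter` (trunk, or 1.5.1 patched with NUTCH-1300).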

