Ah, I get it now. Please use trunk or patch your version with:
https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
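
For readers following along, the rule-file semantics at issue in this thread can be sketched as follows. This is a rough Python model of how a Nutch regex-urlfilter file is evaluated, not the actual Java implementation: each line starts with '+' (accept) or '-' (reject), the first pattern that matches a URL decides, and a URL matching no rule is rejected. The two rules are the ones from the filter file quoted below.

```python
import re

# Sketch of regex-urlfilter evaluation (assumed semantics, not Nutch's
# Java code): first matching rule wins; '+' accepts, '-' rejects.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*")),
    ("-", re.compile(r".")),
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject

# The article page passes the '+' rule; the homepage falls through to '-.'
print(accept("http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"))  # True
print(accept("http://www.cnn.com/"))  # False
```

With a rule file like this, everything except the dated article pages is dropped at indexing time, which is exactly the intended subset behavior.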
-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Fri 23-Nov-2012 03:08
> To: [email protected]
> Subject: Re: Indexing-time URL filtering again
>
> But Markus said it worked for him. I was really hoping he could send his
> command line.
>
> On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Is this a bug?
> >
> > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:
> > > Putting -filter between crawldb and segments, I still got the same thing:
> > >
> > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> > > Input path does not exist:
> > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> > > Input path does not exist:
> > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> > > Input path does not exist:
> > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> > >
> > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
> > > <[email protected]>wrote:
> > >
> > >> These are roughly the available parameters:
> > >>
> > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb
> > >> <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
> > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
> > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> > >>
> > >> Having -filter at the end should work fine; it works here when I have
> > >> -filter at the end. If, for some reason, it doesn't, put it before the
> > >> segment and after the crawldb, and file an issue in Jira.
> > >>
> > >> Cheers
> > >>
> > >> -----Original message-----
> > >> > From:Joe Zhang <[email protected]>
> > >> > Sent: Thu 22-Nov-2012 23:05
> > >> > To: Markus Jelsma <[email protected]>; user <
> > >> [email protected]>
> > >> > Subject: Re: Indexing-time URL filtering again
> > >> >
> > >> > Yes, I forgot to do that. But still, what exactly should the command
> > >> > look like?
> > >> >
> > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt
> > >> > http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
> > >> > This command would cause Nutch to interpret "-filter" as a path.
> > >> >
> > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <
> > >> > [email protected]> wrote:
> > >> > Hi,
> > >> >
> > >> > I just tested a small index job that usually writes 1200 records to
> > >> > Solr. It works fine if I specify -. in a filter (index nothing) and
> > >> > point to it with -Durlfilter.regex.file=path like you do. I assume you
> > >> > mean by `it doesn't work` that it filters nothing and indexes all
> > >> > records from the segment. Did you forget the -filter parameter?
> > >> >
> > >> > Cheers
> > >> >
> > >> > -----Original message-----
> > >> > > From: Joe Zhang <[email protected]>
> > >> > > Sent: Thu 22-Nov-2012 07:29
> > >> > > To: user <[email protected]>
> > >> > > Subject: Indexing-time URL filtering again
> > >> > >
> > >> > > Dear List:
> > >> > >
> > >> > > I asked a similar question before, but I haven't solved the problem.
> > >> > > Therefore I try to re-ask the question more clearly and seek advice.
> > >> > >
> > >> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at
> > >> > > the rudimentary level.
> > >> > >
> > >> > > The basic problem I face in crawling/indexing is that I need to
> > >> > > control which pages the crawlers should VISIT (so far through
> > >> > > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr.
> > >> > > The latter are only a SUBSET of the former, and they are giving me a
> > >> > > headache.
> > >> > >
> > >> > > A real-life example would be: when we crawl CNN.com, we only want to
> > >> > > index "real content" pages such as
> > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> > >> > > When we start the crawling from the root, we can't specify tight
> > >> > > patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
> > >> > > in nutch/conf/regex-urlfilter.txt, because the pages on the path
> > >> > > between root and content pages do not satisfy such patterns. Putting
> > >> > > such patterns in nutch/conf/regex-urlfilter.txt would severely
> > >> > > jeopardize the coverage of the crawl.
> > >> > >
> > >> > > The closest solution I've got so far (courtesy of Markus) was this:
> > >> > >
> > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> > >> > >
> > >> > > but unfortunately I haven't been able to make it work for me. The
> > >> > > content of the urlfilter.regex.file is what I thought "correct" ---
> > >> > > something like the following:
> > >> > >
> > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> > >> > > -.
> > >> > >
> > >> > > Everything seems quite straightforward. Am I doing anything wrong
> > >> > > here? Can anyone advise? I'd greatly appreciate it.
> > >> > >
> > >> > > Joe
> > >> > >
> > >> >
> > >> >
> > >>
> >
> >
> >
> > --
> > Lewis
> >
>