Re: Indexing-time URL filtering again

Lewis John Mcgibbney Thu, 22 Nov 2012 17:29:26 -0800

Is this a bug?

On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:
> Putting -filter between crawldb and segments, I still got the same thing:
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> Input path does not exist:
> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>
> On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
> <[email protected]> wrote:
>
>> These are roughly the available parameters:
>>
>> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb
>> <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
>> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
>> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>>
>> Having -filter at the end should work fine. If, for some reason, it
>> doesn't work, put it before the segment and after the crawldb and file an
>> issue in Jira. It works here if I have -filter at the end.
>>
>> Cheers
>>
>> -----Original message-----
>> > From:Joe Zhang <[email protected]>
>> > Sent: Thu 22-Nov-2012 23:05
>> > To: Markus Jelsma <[email protected]>; user <[email protected]>
>> > Subject: Re: Indexing-time URL filtering again
>> >
>> > Yes, I forgot to do that. But still, what exactly should the command
>> > look like?
>> >
>> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>> > this command would cause nutch to interpret "-filter" as a path.
>> >
>> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <[email protected]> wrote:
>> > Hi,
>> >
>> > I just tested a small index job that usually writes 1200 records to
>> > Solr. It works fine if I specify -. in a filter (index nothing) and point
>> > to it with -Durlfilter.regex.file=path like you do. I assume that by
>> > `it doesn't work` you mean that it filters nothing and indexes all records
>> > from the segment. Did you forget the -filter parameter?
>> >
>> > Cheers
>> >
>> > -----Original message-----
>> > > From:Joe Zhang <[email protected]>
>> > > Sent: Thu 22-Nov-2012 07:29
>> > > To: user <[email protected]>
>> > > Subject: Indexing-time URL filtering again
>> > >
>> > > Dear List:
>> > >
>> > > I asked a similar question before, but I haven't solved the problem.
>> > > Therefore I try to re-ask the question more clearly and seek advice.
>> > >
>> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
>> > > rudimentary level.
>> > >
>> > > The basic problem I face in crawling/indexing is that I need to control
>> > > which pages the crawlers should VISIT (so far through
>> > > nutch/conf/regex-urlfilter.txt)
>> > > and which pages are INDEXED by Solr. The latter are only a SUBSET of the
>> > > former, and they are giving me a headache.
>> > >
>> > > A real-life example would be: when we crawl CNN.com, we only want to index
>> > > "real content" pages such as
>> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1 .
>> > > When we start the crawling from the root, we can't specify tight
>> > > patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..* )
>> > > in nutch/conf/regex-urlfilter.txt, because the pages on the path between
>> > > root and content pages do not satisfy such patterns. Putting such patterns
>> > > in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage
>> > > of the crawl.
>> > >
>> > > The closest solution I've got so far (courtesy of Markus) was this:
>> > >
>> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>> > >
>> > > but unfortunately I haven't been able to make it work for me. The content
>> > > of the urlfilter.regex.file is what I thought "correct" --- something like
>> > > the following:
>> > >
>> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
>> > > -.
>> > >
>> > > Everything seems quite straightforward. Am I doing anything wrong here? Can
>> > > anyone advise? I'd greatly appreciate it.
>> > >
>> > > Joe
>> > >
>> >
>> >
>>




-- 
Lewis
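
One way to sanity-check the two-line filter file quoted in this thread, independently of the -filter argument problem, is to run the rules against sample URLs outside Nutch. The following is a rough Python sketch, not Nutch code. It assumes the regex URL filter behaves as its stock configuration file describes: the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected. The names parse_rules and accepts are made up for this illustration, and Python's re module is not identical to Java's regex engine, although the patterns here use only syntax common to both.

```python
import re

# Rules in the format of the filter file quoted in this thread: a leading
# '+' accepts, a leading '-' rejects, and the rest of the line is a regex.
RULES_TEXT = r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs (hypothetical helper)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match rejects."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    samples = [
        "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1",
        "http://www.cnn.com/",
    ]
    for url in samples:
        print("index" if accepts(RULES, url) else "skip", url)
```

Under these assumptions the dated article URL is kept and the site root is skipped, which suggests the rule file itself expresses the intended subset and that the remaining problem lies in how the indexing job is invoked.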
