Re: Indexing-time URL filtering again

Joe Zhang Sat, 24 Nov 2012 23:36:57 -0800

This does seem to be a bug. Can anybody help?

On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <[email protected]> wrote:


> Markus, could you advise? Thanks a lot!
>
>
> On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <[email protected]> wrote:
>
>> I followed your instructions and applied the patch, Markus, but the
>> problem still persists --- "-filter" is interpreted as a path by solrindex.
>>
>> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <[email protected]> wrote:
>>
>>> Ah, I get it now. Please use trunk or patch your version with
>>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
>>>
>>> -----Original message-----
>>> > From:Joe Zhang <[email protected]>
>>> > Sent: Fri 23-Nov-2012 03:08
>>> > To: [email protected]
>>> > Subject: Re: Indexing-time URL filtering again
>>> >
>>> > But Markus said it worked for him. I was really hoping he could send
>>> > his command line.
>>> >
>>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>> >
>>> > > Is this a bug?
>>> > >
>>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:
>>> > > > Putting -filter between crawldb and segments, I still got the same thing:
>>> > > >
>>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
>>> > > > Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
>>> > > > Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
>>> > > > Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>>> > > >
>>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <[email protected]> wrote:
>>> > > >
>>> > > >> These are roughly the available parameters:
>>> > > >>
>>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
>>> > > >> [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
>>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
>>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>>> > > >>
>>> > > >> Having -filter at the end should work fine; it works here that way.
>>> > > >> If for some reason it doesn't, put it after the crawldb and before
>>> > > >> the segment, and file an issue in Jira.
>>> > > >>
>>> > > >> Cheers
>>> > > >>
>>> > > >> -----Original message-----
>>> > > >> > From:Joe Zhang <[email protected]>
>>> > > >> > Sent: Thu 22-Nov-2012 23:05
>>> > > >> > To: Markus Jelsma <[email protected]>; user <[email protected]>
>>> > > >> > Subject: Re: Indexing-time URL filtering again
>>> > > >> >
>>> > > >> > Yes, I forgot to do that. But still, what exactly should the command
>>> > > >> > look like?
>>> > > >> >
>>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>>> > > >> >
>>> > > >> > This command would cause Nutch to interpret "-filter" as a path.
>>> > > >> >
>>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <[email protected]> wrote:
>>> > > >> > Hi,
>>> > > >> >
>>> > > >> > I just tested a small index job that usually writes 1200 records to
>>> > > >> > Solr. It works fine if I specify -. in a filter file (index nothing)
>>> > > >> > and point to it with -Durlfilter.regex.file=path like you do. I assume
>>> > > >> > that by `it doesn't work` you mean that it filters nothing and indexes
>>> > > >> > all records from the segment. Did you forget the -filter parameter?
>>> > > >> >
>>> > > >> > Cheers
>>> > > >> >
>>> > > >> > -----Original message-----
>>> > > >> > > From:Joe Zhang <[email protected]>
>>> > > >> > > Sent: Thu 22-Nov-2012 07:29
>>> > > >> > > To: user <[email protected]>
>>> > > >> > > Subject: Indexing-time URL filtering again
>>> > > >> > >
>>> > > >> > > Dear List:
>>> > > >> > >
>>> > > >> > > I asked a similar question before, but I haven't solved the problem,
>>> > > >> > > so I am asking again more clearly and seeking advice.
>>> > > >> > >
>>> > > >> > > I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at a
>>> > > >> > > rudimentary level.
>>> > > >> > >
>>> > > >> > > The basic problem I face in crawling/indexing is that I need to control
>>> > > >> > > which pages the crawlers should VISIT (so far through
>>> > > >> > > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr.
>>> > > >> > > The latter are only a SUBSET of the former, and they are giving me a
>>> > > >> > > headache.
>>> > > >> > >
>>> > > >> > > A real-life example would be: when we crawl CNN.com, we only want to
>>> > > >> > > index "real content" pages such as
>>> > > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1 .
>>> > > >> > > When we start the crawling from the root, we can't specify tight
>>> > > >> > > patterns (e.g.,
>>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..* )
>>> > > >> > > in nutch/conf/regex-urlfilter.txt, because the pages on the path
>>> > > >> > > between root and content pages do not satisfy such patterns. Putting
>>> > > >> > > such patterns in nutch/conf/regex-urlfilter.txt would severely
>>> > > >> > > jeopardize the coverage of the crawl.
>>> > > >> > >
>>> > > >> > > The closest solution I've got so far (courtesy of Markus) was this:
>>> > > >> > >
>>> > > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>>> > > >> > >
>>> > > >> > > but unfortunately I haven't been able to make it work for me. The
>>> > > >> > > content of the urlfilter.regex.file is what I thought "correct" ---
>>> > > >> > > something like the following:
>>> > > >> > >
>>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
>>> > > >> > > -.
>>> > > >> > >
>>> > > >> > > Everything seems quite straightforward. Am I doing anything wrong
>>> > > >> > > here? Can anyone advise? I'd greatly appreciate it.
>>> > > >> > >
>>> > > >> > > Joe
>>> > > >> > >
>>> > > >> >
>>> > > >> >
>>> > > >>
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Lewis
>>> > >
>>> >
>>>
>>
>>
>
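To sanity-check the filter file itself, separately from the -filter flag problem, here is a minimal Python sketch of first-match-wins rule evaluation. That is how the comments in the stock regex-urlfilter.txt describe the rules: the first pattern that matches decides, and a URL that matches nothing is ignored. The sketch uses Python's re.search as a stand-in for Nutch's Java regex matching, so treat it as an approximation. The function names parse_rules and accepts are made up for this example, and the second test URL is just an illustrative non-article page.

```python
import re

# The two rules quoted in the original message.
FILTER_TEXT = r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


RULES = parse_rules(FILTER_TEXT)
ARTICLE = "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"
FRONT_PAGE = "http://www.cnn.com/"  # illustrative non-article URL

print(accepts(RULES, ARTICLE))
print(accepts(RULES, FRONT_PAGE))
```

Under these assumptions the dated article URL is kept and the front page is dropped, which suggests the rules themselves are fine and the trouble lies in how the filter is switched on.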

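The error paths quoted above (-filter/crawl_fetch and so on) are what one would expect if the indexer's argument loop treats every token it does not recognize as a segment directory. The following Python sketch illustrates that failure mode only. It is not Nutch's actual code: the flag sets, the function names parse_args and input_paths, and the crawldb and segment paths in the example call are all hypothetical, and options that take a value (such as -linkdb) are left out for brevity.

```python
from pathlib import PurePosixPath

# Subdirectories named in the InvalidInputException quoted in the thread.
SEGMENT_SUBDIRS = ("crawl_fetch", "crawl_parse", "parse_data", "parse_text")

# Hypothetical flag sets: a build without -filter support and one with it.
FLAGS_WITHOUT_FILTER = {"-noCommit", "-deleteGone"}
FLAGS_WITH_FILTER = FLAGS_WITHOUT_FILTER | {"-filter", "-normalize"}


def parse_args(args, known_flags):
    """Split args after <solr url> <crawldb> into flags and segment paths.

    Any token that is not a known flag falls through to the segment list,
    which is the assumed cause of the symptom reported in the thread.
    """
    flags, segments = set(), []
    for token in args[2:]:
        if token in known_flags:
            flags.add(token)
        else:
            segments.append(PurePosixPath(token))
    return flags, segments


def input_paths(segments):
    """List the per-segment input directories the job would try to read."""
    return [str(seg / sub) for seg in segments for sub in SEGMENT_SUBDIRS]


# Hypothetical command line; the real paths are elided in the thread.
ARGS = ["http://localhost:8983/solr/", "crawl/crawldb",
        "-filter", "crawl/segments/20121122000000"]

old_flags, old_segments = parse_args(ARGS, FLAGS_WITHOUT_FILTER)
new_flags, new_segments = parse_args(ARGS, FLAGS_WITH_FILTER)

print(input_paths(old_segments)[:4])
print(sorted(new_flags), [str(s) for s in new_segments])
```

If the real loop works this way, moving -filter to a different position cannot help; only a build that recognizes the flag would stop it from being read as a path.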