Ah, I understand now.

The indexer tool in 1.5.1 can filter as well; if you enable the regex URL filter
and point it to a different regex configuration file when indexing than when
crawling, you should be good to go.

You can override the default configuration file by setting urlfilter.regex.file
and pointing it to the regex file you want to use for indexing. You can set it via
nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
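To make the two-filter idea concrete: Nutch's regex URL filter reads one rule per line, `+` accepts and `-` rejects, and the first rule whose pattern is found in the URL decides. Below is a minimal Python sketch (not Nutch code) mimicking those semantics, with hypothetical crawl-time and index-time rule sets based on the mysite.com patterns discussed in this thread; the leaf-page regex is an assumption for illustration only.

```python
import re

def load_rules(lines):
    """Parse Nutch-style regex-urlfilter rules: '+' accepts, '-' rejects,
    '#' starts a comment; the first matching rule wins."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the decision of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: URL is filtered out

# Hypothetical crawl-time filter: visit anything on mysite.com.
crawl_rules = load_rules([
    r"+^http://([a-z0-9]*\.)*mysite\.com",
    r"-.",
])

# Hypothetical index-time filter: keep only the leaf pages, reject the rest.
index_rules = load_rules([
    r"+^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$",
    r"-.",
])

url_hub = "http://www.mysite.com/level1pattern/"
url_leaf = "http://www.mysite.com/level1pattern/level2pattern/page1.html"

print(accepts(crawl_rules, url_hub), accepts(index_rules, url_hub))    # True False
print(accepts(crawl_rules, url_leaf), accepts(index_rules, url_leaf))  # True True
```

With a setup like this, the crawler still traverses the hub pages it needs to reach the leaves, while the index-time filter file (passed via -Durlfilter.regex.file) keeps only the leaf pages out of Solr.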

Cheers
 
-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Fri 02-Nov-2012 17:55
> To: [email protected]
> Subject: Re: URL filtering: crawling time vs. indexing time
> 
> I'm not sure I get it. Again, my problem is a very generic one:
> 
> - The patterns in regex-urlfilter.txt, however exotic they are, control
> ***which URLs to visit***.
> - Generally speaking, the set of URLs to be indexed into Solr is only a
> ***subset*** of the above.
> 
> We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs.
> an indexing filter, I think.
> 
> On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <[email protected]> wrote:
> 
> > You still have several possibilities here:
> > 1) find a way to seed the crawl with the URLs containing the links to the
> > leaf pages (sometimes this is possible with a simple loop)
> > 2) create a regex for each step of the scenario leading to the leaf page,
> > in order to limit the crawl to the necessary pages only. Use the $ sign at
> > the end of your regexp to limit the match of regexps like
> > http://([a-z0-9]*\.)*mysite.com.
> >
> >
> > On 2 Nov 2012 at 17:22, Joe Zhang <[email protected]> wrote:
> >
> > > The problem is that,
> > >
> > > - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll
> > > end up indexing all the pages along the way, not just the leaf pages.
> > > - if you write a specific regex for
> > > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and
> > > you start crawling at mysite.com, you'll get zero results, as there is
> > > no match.
> > >
> > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <[email protected]> wrote:
> > >
> > >> -----Original message-----
> > >>> From:Joe Zhang <[email protected]>
> > >>> Sent: Fri 02-Nov-2012 10:04
> > >>> To: [email protected]
> > >>> Subject: URL filtering: crawling time vs. indexing time
> > >>>
> > >>> I feel like this is a trivial question, but I just can't get my head
> > >>> around it.
> > >>>
> > >>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > >>> rudimentary level.
> > >>>
> > >>> If my understanding is correct, the regexes in
> > >>> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e.,
> > >>> which URLs to visit or not during the crawling process.
> > >>
> > >> Yes.
> > >>
> > >>>
> > >>> On the other hand, it doesn't seem artificial for us to want only
> > >>> certain pages to be indexed. I was hoping to write some regular
> > >>> expressions in some config file as well, but I just can't find the
> > >>> right place. My hunch tells me that such things should not require
> > >>> into-the-box coding. Can anybody help?
> > >>
> > >> What exactly do you want? To add your custom regular expressions?
> > >> regex-urlfilter.txt is the place to write them.
> > >>
> > >>>
> > >>> Again, the scenario is really rather generic. Let's say we want to
> > >>> crawl http://www.mysite.com. We can use regex-urlfilter.txt to skip
> > >>> loops and unnecessary file types etc., but we only expect to index
> > >>> pages with URLs like:
> > >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> > >>
> > >> To do this you must simply make sure your regular expressions can do
> > >> this.
> > >>
> > >>>
> > >>> Am I too naive to expect zero Java coding in this case?
> > >>
> > >> No, you can achieve almost all kinds of exotic filtering with just the
> > >> URL filters and the regular expressions.
> > >>
> > >> Cheers
> > >>>
> > >>
> >
> >
> 
