Markus, I don't see "-D" as a valid command parameter for solrindex.

On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma
<[email protected]>wrote:

> Ah, i understand now.
>
> The indexer tool can filter as well in 1.5.1. If you enable the regex
> URL filter and set a different regex configuration file for indexing
> than for crawling, you should be good to go.
>
> You can override the default configuration file by setting
> urlfilter.regex.file and pointing it to the regex file you want to use
> for indexing. You can set it via nutch solrindex
> -Durlfilter.regex.file=/path http://solrurl/ ...
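Spelled out, that might look like the sketch below. The file path, the accept/reject patterns, and the commented-out command-line arguments are all illustrative, not taken from this thread:

```shell
# Sketch, assuming Nutch 1.5.1: keep a second, stricter regex file that is
# applied only at indexing time. Name and patterns here are hypothetical.
cat > /tmp/regex-urlfilter-index.txt <<'EOF'
# accept only the leaf pages we want in Solr
+^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$
# reject everything else
-.
EOF
wc -l < /tmp/regex-urlfilter-index.txt

# The indexing run would then point urlfilter.regex.file at it, e.g.:
#   bin/nutch solrindex -Durlfilter.regex.file=/tmp/regex-urlfilter-index.txt \
#     http://solrurl/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
```

The crawl keeps using the default regex-urlfilter.txt, so the broad crawling behavior is untouched; only the set of documents sent to Solr shrinks.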
>
> Cheers
>
> -----Original message-----
> > From:Joe Zhang <[email protected]>
> > Sent: Fri 02-Nov-2012 17:55
> > To: [email protected]
> > Subject: Re: URL filtering: crawling time vs. indexing time
> >
> > I'm not sure I get it. Again, my problem is a very generic one:
> >
> > - The patterns in regex-urlfilter.txt, however exotic they are,
> > control ***which URLs to visit***.
> > - Generally speaking, the set of URLs to be indexed into Solr is only
> > a ***subset*** of the above.
> >
> > We need a way to specify a crawling filter (which is
> > regex-urlfilter.txt) vs. an indexing filter, I think.
> >
> > On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <[email protected]> wrote:
> >
> > > You still have several possibilities here:
> > > 1) find a way to seed the crawl with the URLs containing the links
> > > to the leaf pages (sometimes it is possible with a simple loop)
> > > 2) create a regex for each step of the scenario leading to the leaf
> > > page, in order to limit the crawl to the necessary pages only. Use
> > > the $ sign at the end of your regexp to limit what a pattern like
> > > http://([a-z0-9]*\.)*mysite.com matches.
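The effect of anchoring can be checked with plain grep -E (POSIX extended regexes, close enough to the urlfilter's flavor for this point); the hostnames below are made up:

```shell
# Without a terminator after the host part, the pattern also matches
# look-alike hosts; anchoring with (/|$) rejects them.
pat='^http://([a-z0-9]*\.)*mysite\.com(/|$)'
for u in 'http://sub1.mysite.com/' 'http://mysite.com.evil.org/'; do
  if printf '%s\n' "$u" | grep -Eq "$pat"; then
    echo "accept $u"
  else
    echo "reject $u"
  fi
done
```

This prints `accept` for the genuine subdomain and `reject` for the host that merely starts with `mysite.com`.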
> > >
> > >
> > > Le 2 nov. 2012 à 17:22, Joe Zhang <[email protected]> a écrit :
> > >
> > > > The problem is that,
> > > >
> > > > - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com,
> > > > you'll end up indexing all the pages on the way, not just the
> > > > leaf pages.
> > > > - if you write a specific regex for
> > > > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html,
> > > > and you start crawling at mysite.com, you'll get zero results, as
> > > > there is no match.
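Both failure modes can be reproduced with grep -E. The patterns and URLs below are hypothetical; the point is that the broad crawl pattern also accepts every intermediate page, while the strict leaf pattern on its own would reject the seed and the hub pages the crawler must pass through:

```shell
# Broad pattern (fine for crawling, too loose for indexing) vs. a strict
# leaf-page pattern; both are illustrative, not from any real config.
crawl_pat='^http://([a-z0-9]*\.)*mysite\.com'
index_pat='^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$'

for u in 'http://www.mysite.com/' \
         'http://www.mysite.com/level1pattern/level2pattern/page1.html'; do
  printf '%s\n' "$u" | grep -Eq "$crawl_pat" && echo "crawl accepts: $u"
  printf '%s\n' "$u" | grep -Eq "$index_pat" && echo "index accepts: $u"
done
```

The site root passes only the crawl pattern; the leaf page passes both. That is exactly why one file cannot serve both purposes, and why a separate indexing-time filter is needed.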
> > > >
> > > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <
> > > [email protected]>wrote:
> > > >
> > > >> -----Original message-----
> > > >>> From:Joe Zhang <[email protected]>
> > > >>> Sent: Fri 02-Nov-2012 10:04
> > > >>> To: [email protected]
> > > >>> Subject: URL filtering: crawling time vs. indexing time
> > > >>>
> > > >>> I feel like this is a trivial question, but I just can't get my
> > > >>> head around it.
> > > >>>
> > > >>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at
> the
> > > >>> rudimentary level.
> > > >>>
> > > >>> If my understanding is correct, the regexes in
> > > >>> nutch/conf/regex-urlfilter.txt control the crawling behavior,
> > > >>> i.e., which URLs to visit or not in the crawling process.
> > > >>
> > > >> Yes.
> > > >>
> > > >>>
> > > >>> On the other hand, it seems natural to want only certain pages
> > > >>> to be indexed. I was hoping to write some regular expressions
> > > >>> for that as well in some config file, but I just can't find the
> > > >>> right place. My hunch tells me that such things should not
> > > >>> require digging into the code. Can anybody help?
> > > >>
> > > >> What exactly do you want? Add your custom regular expressions? The
> > > >> regex-urlfilter.txt is the place to write them to.
> > > >>
> > > >>>
> > > >>> Again, the scenario is really rather generic. Let's say we want
> > > >>> to crawl http://www.mysite.com. We can use the
> > > >>> regex-urlfilter.txt to skip loops and unnecessary file types
> > > >>> etc., but only expect to index pages with URLs like:
> > > >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> > > >>
> > > >> To do this you must simply make sure your regular expressions
> > > >> can do this.
> > > >>
> > > >>>
> > > >>> Am I too naive to expect zero Java coding in this case?
> > > >>
> > > >> No, you can achieve almost all kinds of exotic filtering with
> > > >> just the URL filters and the regular expressions.
> > > >>
> > > >> Cheers
> > > >>>
> > > >>
> > >
> > >
> >
>
