Re: URL filtering: crawling time vs. indexing time

Joe Zhang Fri, 02 Nov 2012 09:51:03 -0700

I'm not sure I get it. Again, my problem is a very generic one:

- The patterns in regex-urlfilter.txt, however exotic they are,
control ***which URLs to visit***.
- Generally speaking, the set of URLs to be indexed into Solr is only a
***subset*** of the above.


We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs.
an indexing filter, I think.
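
To make the distinction concrete, here is a minimal Python sketch of the two
stages. It is not Nutch code. It only imitates the +/- rule style of
regex-urlfilter.txt, assuming the first matching rule decides and a URL that
matches no rule is dropped. The rule lists, the leaf-page pattern, and the
function names are all hypothetical, made up for illustration.

```python
import re

# Hypothetical crawl-time rules in the style of regex-urlfilter.txt:
# '+' accepts, '-' rejects, and the first rule that matches decides.
CRAWL_RULES = [
    ("-", r"\.(gif|jpg|png|css|js)$"),
    ("+", r"^http://([a-z0-9]*\.)*mysite\.com"),
]

# Hypothetical index-time pattern: only leaf pages three levels deep.
INDEX_PATTERN = re.compile(
    r"^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$"
)


def crawl_allowed(url):
    """Return True if the crawler may visit this URL."""
    for sign, pattern in CRAWL_RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is dropped


def index_allowed(url):
    """Return True if a visited URL should also be sent to the index."""
    return crawl_allowed(url) and INDEX_PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://www.mysite.com/",
        "http://www.mysite.com/level1/level2/page.html",
    ):
        print(url, crawl_allowed(url), index_allowed(url))
```

The home page passes the crawl rules but not the index pattern, so it would be
visited for its links without being indexed. That is the behavior I am after.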

On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <[email protected]> wrote:

> You still have several possibilities here:
> 1) find a way to seed the crawl with the URLs containing the links to the
> leaf pages (sometimes it is possible with a simple loop)
> 2) create a regex for each step of the path leading to the leaf page, in
> order to limit the crawl to the necessary pages only. Use the $ sign at the
> end of your regexp to limit the match of a regexp like
> http://([a-z0-9]*\.)*mysite.com.
>
>
> On 2 Nov 2012, at 17:22, Joe Zhang <[email protected]> wrote:
>
> > The problem is that,
> >
> > - if you write a regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll
> > end up indexing all the pages on the way, not just the leaf pages.
> > - if you write a specific regex for
> > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and
> > you start crawling at mysite.com, you'll get zero results, as there is no
> > match.
> >
> > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <[email protected]> wrote:
> >
> >> -----Original message-----
> >>> From:Joe Zhang <[email protected]>
> >>> Sent: Fri 02-Nov-2012 10:04
> >>> To: [email protected]
> >>> Subject: URL filtering: crawling time vs. indexing time
> >>>
> >>> I feel like this is a trivial question, but I just can't get my head
> >>> around it.
> >>>
> >>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
> >>> rudimentary level.
> >>>
> >>> If my understanding is correct, the regexes in
> >>> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> >>> URLs to visit or not in the crawling process.
> >>
> >> Yes.
> >>
> >>>
> >>> On the other hand, it doesn't seem artificial for us to only want certain
> >>> pages to be indexed. I was hoping to write some regular expressions as well
> >>> in some config file, but I just can't find the right place. My hunch tells
> >>> me that such things should not require into-the-box coding. Can anybody
> >>> help?
> >>
> >> What exactly do you want? Add your custom regular expressions? The
> >> regex-urlfilter.txt is the place to write them to.
> >>
> >>>
> >>> Again, the scenario is really rather generic. Let's say we want to crawl
> >>> http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
> >>> unnecessary file types etc., but only expect to index pages with URLs like:
> >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> >>
> >> To do this you must simply make sure your regular expressions can do this.
> >>
> >>>
> >>> Am I too naive to expect zero Java coding in this case?
> >>
> >> No, you can achieve almost all kinds of exotic filtering with just the URL
> >> filters and the regular expressions.
> >>
> >> Cheers
> >>>
> >>
>
>
