-----Original message-----
> From: Joe Zhang <[email protected]>
> Sent: Fri 02-Nov-2012 10:04
> To: [email protected]
> Subject: URL filtering: crawling time vs. indexing time
>
> I feel like this is a trivial question, but I just can't get my head
> around it.
>
> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
> rudimentary level.
>
> If my understanding is correct, the regexes in
> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> URLs to visit or skip during the crawl.
Yes.

> On the other hand, it doesn't seem unreasonable to want only certain
> pages to be indexed. I was hoping to write some regular expressions for
> this as well in some config file, but I just can't find the right place.
> My hunch tells me that such things should not require into-the-box
> coding. Can anybody help?

What exactly do you want — to add your own custom regular expressions? The
regex-urlfilter.txt file is the place to write them.

> Again, the scenario is really rather generic. Let's say we want to crawl
> http://www.mysite.com. We can use regex-urlfilter.txt to skip loops and
> unnecessary file types etc., but we only expect to index pages with URLs
> like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.

To do this you simply need to make sure your regular expressions accept
only those URLs.

> Am I too naive to expect zero Java coding in this case?

No, you can achieve almost all kinds of exotic filtering with just the
URL filters and regular expressions.

Cheers
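For the scenario described above, a regex-urlfilter.txt along the following lines could work. This is only a sketch: the `level1pattern/level2pattern/pagepattern` names are the placeholders from the original question, not real paths, and the skip rules are the kind of defaults that ship with Nutch rather than anything site-specific. One caveat worth keeping in mind, since the thread is about crawl time vs. index time: these filters apply while crawling, so a URL they reject is never fetched at all — if the target pages are only reachable through intermediate pages, those intermediate URLs must also be accepted or the crawler will never discover the leaves.

```
# Rules are tried top to bottom; the first match wins.
# "-" rejects a URL, "+" accepts it; a URL matching no rule is rejected.

# skip URLs with characters that often indicate crawler traps / loops
-[?*!@=]

# skip file types we don't want to fetch at all
-\.(gif|jpg|png|css|js|zip|exe)$

# accept intermediate directory pages so the crawler can follow links down
+^http://www\.mysite\.com/[^?]*/$

# accept the pages we actually want indexed
+^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$

# no trailing "+." catch-all: everything else is rejected
```

Note that there is deliberately no final `+.` line (the catch-all that the stock file ends with), so anything not explicitly accepted is dropped.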

