-----Original message-----
> From: Joe Zhang <[email protected]>
> Sent: Fri 02-Nov-2012 10:04
> To: [email protected]
> Subject: URL filtering: crawling time vs. indexing time
>
> I feel like this is a trivial question, but I just can't get my head
> around it.
>
> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
> rudimentary level.
>
> If my understanding is correct, the regexes in
> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> URLs to visit or skip during the crawl.
Yes.

> On the other hand, it doesn't seem unreasonable to want only certain
> pages to be indexed. I was hoping to write some regular expressions for
> this as well in some config file, but I just can't find the right place.
> My hunch tells me that such things should not require into-the-box
> coding. Can anybody help?

What exactly do you want — to add your own custom regular expressions? The
regex-urlfilter.txt file is the place to write them.

> Again, the scenario is really rather generic. Let's say we want to crawl
> http://www.mysite.com. We can use regex-urlfilter.txt to skip loops and
> unnecessary file types etc., but we only expect to index pages with URLs
> like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.

To do this you simply need to make sure your regular expressions accept
only those URLs.

> Am I too naive to expect zero Java coding in this case?

No, you can achieve almost all kinds of exotic filtering with just the
URL filters and regular expressions.

Cheers
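For the scenario described above, a regex-urlfilter.txt along the following lines could work. This is only a sketch: the `level1pattern/level2pattern/pagepattern` names are the placeholders from the original question, not real paths, and the skip rules are the kind of defaults that ship with Nutch rather than anything site-specific. One caveat worth keeping in mind, since the thread is about crawl time vs. index time: these filters apply while crawling, so a URL they reject is never fetched at all — if the target pages are only reachable through intermediate pages, those intermediate URLs must also be accepted or the crawler will never discover the leaves.

```
# Rules are tried top to bottom; the first match wins.
# "-" rejects a URL, "+" accepts it; a URL matching no rule is rejected.

# skip URLs with characters that often indicate crawler traps / loops
-[?*!@=]

# skip file types we don't want to fetch at all
-\.(gif|jpg|png|css|js|zip|exe)$

# accept intermediate directory pages so the crawler can follow links down
+^http://www\.mysite\.com/[^?]*/$

# accept the pages we actually want indexed
+^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$

# no trailing "+." catch-all: everything else is rejected
```

Note that there is deliberately no final `+.` line (the catch-all that the stock file ends with), so anything not explicitly accepted is dropped.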

