I'm not sure I get it. Again, my problem is a very generic one:

- The patterns in regex-urlfilter.txt, however exotic they are, control ***which URLs to visit***.
- Generally speaking, the set of URLs to be indexed into Solr is only a ***subset*** of the above.
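To make the distinction concrete, here is a minimal Python sketch of the two rule sets. It is not Nutch code. It assumes the behavior described in the comments of the stock regex-urlfilter.txt: each rule is a `+` or `-` followed by a regex, the first matching rule decides, and a URL that matches no rule is rejected. The names `parse_rules`, `accepts`, `CRAWL_RULES` and `INDEX_RULES` are made up for illustration, the `[^/]+` segments are stand-ins for level1pattern/level2pattern/pagepattern, and `INDEX_RULES` in particular is hypothetical: it does not correspond to any existing Nutch config file.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Crawl-time rules: broad, so the crawler can walk down to the leaf pages.
CRAWL_RULES = parse_rules(r"""
# skip some unnecessary file types
-\.(gif|jpg|png|css|js)$
+^http://([a-z0-9]*\.)*mysite\.com
""")

# Hypothetical index-time rules: leaf pages only (three path segments, .html).
INDEX_RULES = parse_rules(r"""
+^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$
""")

if __name__ == "__main__":
    urls = [
        "http://www.mysite.com/",
        "http://www.mysite.com/a/",
        "http://www.mysite.com/a/b/page.html",
        "http://www.mysite.com/logo.png",
    ]
    for url in urls:
        print(url, "crawl:", accepts(CRAWL_RULES, url),
              "index:", accepts(INDEX_RULES, url))
```

With a single rule set, the intermediate pages have to be accepted so the crawler can reach the leaf page, and they then end up in the index as well. With two rule sets, only the URLs accepted by both would be indexed.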
We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs. an indexing filter, I think.

On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <[email protected]> wrote:

> You still have several possibilities here:
> 1) find a way to seed the crawl with the URLs containing the links to the
> leaf pages (sometimes it is possible with a simple loop)
> 2) create a regex for each step of the scenario going to the leaf page, in
> order to limit the crawl to necessary pages only. Use the $ sign at the end
> of your regexp to limit the match of a regexp like
> http://([a-z0-9]*\.)*mysite.com.
>
>
> On 2 Nov 2012, at 17:22, Joe Zhang <[email protected]> wrote:
>
> > The problem is that,
> >
> > - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end
> > up indexing all the pages on the way, not just the leaf pages.
> > - if you write a specific regex for
> > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you
> > start crawling at mysite.com, you'll get zero results, as there is no match.
> >
> > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <[email protected]> wrote:
> >
> >> -----Original message-----
> >>> From: Joe Zhang <[email protected]>
> >>> Sent: Fri 02-Nov-2012 10:04
> >>> To: [email protected]
> >>> Subject: URL filtering: crawling time vs. indexing time
> >>>
> >>> I feel like this is a trivial question, but I just can't get my head
> >>> around it.
> >>>
> >>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
> >>> rudimentary level.
> >>>
> >>> If my understanding is correct, the regexes in
> >>> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> >>> URLs to visit or not in the crawling process.
> >>
> >> Yes.
> >>
> >>>
> >>> On the other hand, it doesn't seem artificial for us to only want certain
> >>> pages to be indexed. I was hoping to write some regular expressions as well
> >>> in some config file, but I just can't find the right place. My hunch tells
> >>> me that such things should not require into-the-box coding. Can anybody
> >>> help?
> >>
> >> What exactly do you want? Add your custom regular expressions? The
> >> regex-urlfilter.txt is the place to write them to.
> >>
> >>>
> >>> Again, the scenario is really rather generic. Let's say we want to crawl
> >>> http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
> >>> unnecessary file types etc., but only expect to index pages with URLs like:
> >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> >>
> >> To do this you must simply make sure your regular expressions can do this.
> >>
> >>>
> >>> Am I too naive to expect zero Java coding in this case?
> >>
> >> No, you can achieve almost all kinds of exotic filtering with just the URL
> >> filters and the regular expressions.
> >>
> >> Cheers

