I need to fetch news from a set of domains, but most of my domains have news links like this:

www.mydomain.com/article/werwer-wefewf-wfefef-fregd/

and the page www.mydomain.com/article/ does not exist. So I am forced to give the site URLs in the seed, but this crawls unnecessary pages that are not really articles. I hope you got it. Please help.

On Wed, Jun 6, 2012 at 4:36 PM, Markus Jelsma <[email protected]> wrote:

> What's the problem with having the seed page? Can you not inject only the
> /news pages? Anyway, you can always filter it away later after the first
> fetch cycle.
>
> -----Original message-----
> > From: Shameema Umer <[email protected]>
> > Sent: Wed 06-Jun-2012 13:02
> > To: [email protected]
> > Subject: How to write complex rules on regex-urlfilter
> >
> > How can I write complex rules on regex-urlfilter:
> > I need to http://www.bullionstreet.com
> >
> > so i entered http://www.mydomain.com on the seed.text
> > but i need to fetch only http://www.mydomain.com/news/ matching urls from
> > the seed page and go on.
> >
> > My problem is +^http://([a-z0-9]*\.)*www.mydomain.com/news/ could not be
> > entered on regex-urlfilter as it will not even fetch the seed url itself.
> >
> > Please help. Thanks
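
For reference, here is a minimal sketch of how an ordered rule list in the regex-urlfilter style might let the seed URL through while restricting everything else to /news/. It assumes first-match-wins evaluation with unmatched URLs rejected, as described in the comments of Nutch's default regex-urlfilter.txt. The rule lists, the `accepts` helper and the sample URLs are illustrative only, and Python's `re.search` merely stands in for the Java regex matching that Nutch performs.

```python
import re

# Hypothetical rules, written the way they would appear in regex-urlfilter.txt.
# Order matters: the first rule whose pattern matches decides the outcome.
RULES = [
    r"+^http://www\.mydomain\.com/?$",     # accept the seed page itself
    r"+^http://www\.mydomain\.com/news/",  # accept anything under /news/
    r"-.",                                 # reject everything else
]

# The single rule from the original question, for comparison.
ORIGINAL_RULE = [r"+^http://([a-z0-9]*\.)*www.mydomain.com/news/"]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treated as rejected


if __name__ == "__main__":
    for url in (
        "http://www.mydomain.com/",
        "http://www.mydomain.com/news/some-story/",
        "http://www.mydomain.com/about/",
    ):
        print(url, accepts(url), accepts(url, ORIGINAL_RULE))
```

With the single original rule the seed URL is rejected, which matches the behaviour described in the thread; adding an explicit accept rule for the seed is one possible way around it. For sites whose stories live under /article/, the /news/ line could be swapped for the corresponding prefix.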

