I really need to fetch news from a set of domains.
But most of my domains have news links like this:

www.mydomain.com/article/werwer-wefewf-wfefef-fregd/

and the page www.mydomain.com/article/ does not exist. So I am forced to
give the site URLs as seeds, but this crawls unnecessary pages that are
not really articles. I hope that makes sense.

Please help.
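
To make the question concrete, here is a minimal Python sketch of the rule set I think I am after. It assumes regex-urlfilter rules are tried top to bottom, that the first rule whose pattern is found in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The patterns and the accepts() helper are hypothetical, written only for illustration; they are not taken from a working Nutch configuration.

```python
import re

# Hypothetical rules in regex-urlfilter style: '+' accepts, '-' rejects.
RULES = [
    r"+^http://www\.mydomain\.com/?$",          # the seed page itself
    r"+^http://www\.mydomain\.com/article/.+",  # individual article pages
    r"-.",                                      # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    for url in (
        "http://www.mydomain.com/",
        "http://www.mydomain.com/article/werwer-wefewf-wfefef-fregd/",
        "http://www.mydomain.com/about/",
    ):
        print(url, accepts(url))
```

The idea is that the seed stays fetchable because it has its own explicit rule, while only its article outlinks pass the filter. Pages that are reachable only through other section pages would be missed with rules like these.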


On Wed, Jun 6, 2012 at 4:36 PM, Markus Jelsma <[email protected]> wrote:

> What's the problem with having the seed page? Can you not only inject the
> /news pages? Anyway, you can always filter it away later after the first
> fetch cycle.
>
>
>
> -----Original message-----
> > From:Shameema Umer <[email protected]>
> > Sent: Wed 06-Jun-2012 13:02
> > To: [email protected]
> > Subject: How to write complex rules on regex-urlfilter
> >
> > How can I write complex rules in regex-urlfilter?
> > I need to crawl http://www.bullionstreet.com
> >
> > So I entered http://www.mydomain.com in seed.text,
> > but I need to fetch only the URLs matching http://www.mydomain.com/news/
> > from the seed page and go on from there.
> >
> > My problem is that +^http://([a-z0-9]*\.)*www.mydomain.com/news/ cannot be
> > used in regex-urlfilter, as it will not even fetch the seed URL itself.
> >
> > Please help. Thanks
> >
>
