Jamshaid,

I think your site's URLs contain query parameters and your
regex-urlfilter.txt is filtering them out.
Open regex-urlfilter.txt and replace '-[?*!@=]' with '-[*!@]'. I hope this
resolves your problem.
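
To illustrate why that one rule matters, here is a rough Python sketch of
first-match filtering as I understand regex-urlfilter.txt to work: each rule
is a '+' or '-' followed by a regex, the first rule found in the URL decides,
and a URL matching no rule is dropped. The two-rule lists and the example.com
URL are simplified assumptions for illustration, not your actual config.

```python
import re


def first_match_decision(rules, url):
    """Return True (accept) or False (reject) from the first rule whose
    regex is found anywhere in the URL; reject if no rule matches."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


# Simplified rule lists (assumption: a real regex-urlfilter.txt has more rules).
default_rules = [("-", r"[?*!@=]"), ("+", r".")]
relaxed_rules = [("-", r"[*!@]"), ("+", r".")]

# Hypothetical URL with a query string, standing in for links on your site.
url = "http://www.example.com/search.page?node_id=42"

print(first_match_decision(default_rules, url))  # rejected: contains '?' and '='
print(first_match_decision(relaxed_rules, url))  # accepted: no '*', '!' or '@'
```

If most outlinks on a site carry '?' or '=', the default rule would drop them
all, which could leave later iterations with nothing to fetch.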

Tony.


On Mon, Jul 1, 2013 at 1:24 PM, Jamshaid Ashraf <[email protected]> wrote:

> Hi,
>
> I'm still facing the same issue. Please help me out in this regard.
>
> Regards,
> Jamshaid
>
>
> On Fri, Jun 28, 2013 at 4:32 PM, Jamshaid Ashraf <[email protected]> wrote:
>
> >
> > Hi,
> >
> > I followed the given link and set 'db.max.outlinks.per.page' to -1 in the
> > 'nutch-default' file, but I am still facing the same issue while crawling
> > 'http://www.halliburton.com/en-US/default.page' and 'cnn.com'. Below is the
> > last line of the fetcher job, which shows 0 pages found on the 3rd or 4th
> > iteration.
> >
> > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> > -activeThreads=0
> > FetcherJob: done
> >
> > Please note that when I crawl Amazon and other sites it works fine. Do you
> > think this is because of some restriction on Halliburton's side
> > (robots.txt) or some misconfiguration at my end?
> >
> > Regards,
> > Jamshaid
> >
> >
> > On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Hi,
> >> Can you please try this
> >> http://s.apache.org/wIC
> >> Thanks
> >> Lewis
> >>
> >>
> >> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <[email protected]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm using Nutch 2.x with HBase and tried to crawl the
> >> > "http://www.halliburton.com/en-US/default.page" site to depth level 5.
> >> >
> >> > Following is the command:
> >> >
> >> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> >> >
> >> >
> >> > It worked well through the 3rd iteration, but nothing was fetched in
> >> > the 4th and 5th (the same happened with cnn.com). If I crawl other
> >> > sites like Amazon with depth level 5, it works.
> >> >
> >> > Could you please advise what could cause the 4th and 5th iterations
> >> > to fail?
> >> >
> >> >
> >> > Regards,
> >> > Jamshaid
> >> >
> >>
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
> >
>
