Jamshaid, regex-normalize.xml can also help here: it can normalize several URL variants down to a single URL, and it can clean up some query params as well.
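For example, a rule along these lines in conf/regex-normalize.xml would collapse every query-string variant of the Halliburton default.page URL into one URL. This is an untested sketch keyed to the URL from this thread; in practice you would scope the pattern to whichever pages or params you actually want to collapse:

  <!-- conf/regex-normalize.xml (used by the urlnormalizer-regex plugin) -->
  <regex-normalize>
    <!-- example only: drop the whole query string on default.page so all
         of its variants normalize to a single URL -->
    <regex>
      <pattern>(/en-US/default\.page)\?.*</pattern>
      <substitution>$1</substitution>
    </regex>
  </regex-normalize>

As far as I remember the stock file already ships with rules for things like session IDs and interpage anchors, so a rule like this just sits alongside them, and it only takes effect if urlnormalizer-regex is included in your plugin.includes.
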
On Mon, Jul 1, 2013 at 8:40 PM, Jamshaid Ashraf <[email protected]> wrote:

> Thanks Tony!
>
> The issue with the Halliburton site is resolved by changing the
> 'regex-urlfilter' file, but I'm still facing the same issue with
> cnn.com.
>
> Regards,
> Jamshaid
>
>
> On Mon, Jul 1, 2013 at 3:20 PM, Tony Mullins <[email protected]> wrote:
>
> > Jamshaid,
> >
> > I think your site's URLs contain query params and your
> > regex-urlfilter.txt is filtering them out.
> > Go to your regex-urlfilter.txt and replace '-[?*!@=]' with '-[*!@]';
> > I hope this resolves your problem.
> >
> > Tony.
> >
> >
> > On Mon, Jul 1, 2013 at 1:24 PM, Jamshaid Ashraf <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm still facing the same issue; please help me out in this regard.
> > >
> > > Regards,
> > > Jamshaid
> > >
> > >
> > > On Fri, Jun 28, 2013 at 4:32 PM, Jamshaid Ashraf <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have followed the given link and updated 'db.max.outlinks.per.page'
> > > > to -1 in the 'nutch-default' file.
> > > >
> > > > But I'm still facing the same issue while crawling
> > > > http://www.halliburton.com/en-US/default.page and cnn.com. Below is
> > > > the last line of the fetcher job output, which shows 0 pages fetched
> > > > on the 3rd or 4th iteration:
> > > >
> > > > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> > > > in 0 queues
> > > > -activeThreads=0
> > > > FetcherJob: done
> > > >
> > > > Please note that when I crawl Amazon and other sites it works fine.
> > > > Do you think it is because of some restriction on Halliburton's side
> > > > (robots.txt) or some misconfiguration at my end?
> > > >
> > > > Regards,
> > > > Jamshaid
> > > >
> > > >
> > > > On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <[email protected]> wrote:
> > > >
> > > >> Hi,
> > > >> Can you please try this
> > > >> http://s.apache.org/wIC
> > > >> Thanks
> > > >> Lewis
> > > >>
> > > >>
> > > >> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <[email protected]> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > I'm using Nutch 2.x with HBase and tried to crawl
> > > >> > http://www.halliburton.com/en-US/default.page to depth level 5.
> > > >> >
> > > >> > Following is the command:
> > > >> >
> > > >> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> > > >> >
> > > >> > It worked well up to the 3rd iteration, but in the remaining 4th
> > > >> > and 5th iterations nothing was fetched (the same thing happened
> > > >> > with cnn.com). If I crawl other sites such as Amazon with depth
> > > >> > level 5, it works.
> > > >> >
> > > >> > Could you please advise what the reasons could be for the 4th and
> > > >> > 5th iterations failing?
> > > >> >
> > > >> > Regards,
> > > >> > Jamshaid
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> *Lewis*
> > > >>
> > > >
> > > >
> > >
> >
>
> --
> Don't Grow Old, Grow Up... :-)
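
P.S. For reference, a rough sketch of the two config changes discussed in the quoted thread above. Paths assume a stock Nutch 2.x conf/ directory, and the values are just the ones suggested in this thread, not something I have re-tested:

  # conf/regex-urlfilter.txt
  # The stock rule skips any URL containing these characters, which throws
  # away every URL with query params:
  #   -[?*!@=]
  # Tony's relaxed rule keeps '?' and '=' so query-param URLs get fetched:
  -[*!@]

  <!-- conf/nutch-site.xml (overrides normally go here rather than editing
       nutch-default.xml directly) -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 means no limit on the number of outlinks taken from a page -->
    <value>-1</value>
  </property>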

