Hi, my nutch-site.xml is rather minimalistic, see below.
You could also check whether fetching of "raw" HTML content succeeds:

% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http \
    -verbose "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

Sebastian

>>> nutch-site.xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>sn-test-crawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>sn-test-crawler,*</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value></value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>wastldotnagelatgooglemaildotcom</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|file)|urlfilter-(regex|suffix)|parse-(html|tika|zip)|index-(basic|anchor|more)|scoring-opic</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
<<< nutch-site.xml

On 10/18/2013 05:31 AM, S.L wrote:
> Now I ran the clean trunk checkout as well. Unfortunately, on a clean trunk
> checkout (with name and plugin.folder value added to nutch-default.xml)
> I see the same behavior as with the clean 1.7 tag checkout. Obviously I am
> doing something wrong, and it has to do with the config files, because I am
> not modifying the source in any way.
>
> Would it be possible for you to share the config files you have used in the
> clean trunk checkout with me, please?
>
> The following is the output from the trunk checkout execution.
>
> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> contentType: text/html
> signature: 7026e09a97ff6df53f85d668bd86bcba
> ---------
> Url
> ---------------
>
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: All Categories
> Outlinks: 8
>   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
>   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,GHCOLL,UX&h=24864 anchor:
>   outlink: toUrl: http://ir.ebaystatic.com/z/ic/1hsgocfebuyd3pnukphb3cmqz.css anchor:
>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
>   outlink: toUrl: http://www.ebay.com anchor: eBay
>   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
>   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
> Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141c972d0a9-0xeb Date=Fri, 18 Oct 2013 02:44:06 GMT Content-Encoding=gzip Set-Cookie=lucky9=5427514;Domain=.ebay.com;Expires=Wed, 17-Oct-2018 02:44:06 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
>
> All Categories Skip to main content eBay Shop by category Enter your search keyword All Categories Advanced
>
>
> On Wed, Oct 16, 2013 at 5:00 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi,
>>
>> when I run parsechecker with current trunk of 1.x there are 653 outlinks
>> (including many "internal" ones):
>>
>> % nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>> ...
>> Title: All Categories
>> Outlinks: 653
>> ...
>>
>>
>> Which Nutch version is used?
>> Can you try to reproduce the problem with a "clean" Nutch (either 1.7 or 2.2.1)
>> without any custom extensions (parse filters, etc.)?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 10/16/2013 01:32 AM, S.L wrote:
>>> Sebastian,
>>>
>>> Thank you for the lead. After I use the ParseChecker, I get the following
>>> output. I can see that only two URLs are being parsed out of the page. *I
>>> see a pattern that* in this page almost all the URLs are enclosed in
>>> *<li></li>* tags and those are *not* getting picked up; the two URLs that
>>> are being picked up by the parser are *not* enclosed in a <li> tag.
>>>
>>> I have also attached the regex-urlfilter.txt along with the nutch-site.xml
>>> for your review.
>>>
>>> Please see the ParseChecker output below.
>>>
>>> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
>>> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
>>> contentType: text/html
>>> signature: cb07f28617927cc0accb150b22f84649
>>> ---------
>>> Url
>>> ---------------
>>>
>>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
>>> ---------
>>> ParseData
>>> ---------
>>>
>>> Version: 5
>>> Status: success(1,0)
>>> Title: All Categories
>>> Outlinks: 12
>>>   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
>>>   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
>>>   outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
>>>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
>>>   outlink: toUrl: http://www.ebay.com anchor: eBay
>>>   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
>>>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
>>>   outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
>>>   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
>>>   outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
>>>   outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
>>>   outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
>>> Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
>>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>> ---------
>>> ParseText
>>> ---------
>>>
>>> All Categories Skip to main content eBay Shop by category Enter your search keyword All Categories Advanced
>>>
>>>
>>> On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>> I am only interested in the internal links.
>>>> Then
>>>>   db.ignore.external.links = false
>>>> is correct.
>>>>
>>>> It is impossible to decide what's going wrong.
>>>> At first glance, all seems ok except one thing:
>>>>   plugin.includes contains "scoring-optic".
>>>> Should be "scoring-opic". I don't know, but
>>>> that's hardly the reason.
>>>>
>>>> For a finer analysis, more details are required:
>>>> - URL filter and normalizers:
>>>>   are the desired URLs accepted
>>>> - CustomFetchSchedule.java:
>>>>   shouldFetch() may play a role
>>>>
>>>> You can try to find the reason by:
>>>>
>>>> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>>>> Are all desired outlinks extracted by parser?
>>>>
>>>> (after fetch of start url)
>>>> % bin/nutch readdb .../crawldb -dump crawldb_dump
>>>> % less crawldb_dump/part-*
>>>> Are they in CrawlDb?
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>> On 10/13/2013 04:18 AM, S.L wrote:
>>>>> Hello All,
>>>>>
>>>>> I am facing this problem with the URL
>>>>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL has
>>>>> many internal links present in the page and also has many external links
>>>>> to other domains; I am only interested in the internal links.
>>>>>
>>>>> However, when this page is crawled, the internal links in it are not added
>>>>> for fetching in the next round of fetching (I have given a depth of 100).
>>>>> I have already set db.ignore.internal.links to false, but for some
>>>>> reason the internal links are not getting added to the next round's
>>>>> fetch list.
>>>>>
>>>>>
>>>>> On the other hand, if I set db.ignore.external.links to false, it correctly
>>>>> picks up all the external links from the page.
>>>>>
>>>>> This problem is not present in any other domains. Can someone tell me what
>>>>> it is with this particular page?
>>>>>
>>>>> I have also attached the nutch-site.xml that I am using for your review,
>>>>> please advise.
>>>>>
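
A quick way to compare the parsechecker runs in this thread (8 or 12 outlinks versus 653) is to count the "outlink: toUrl:" lines in the output. The following is only a sketch: the helper name count_outlinks is hypothetical, and it assumes each outlink is printed on its own line.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: counts the lines on stdin
# that contain an "outlink: toUrl:" entry.
count_outlinks() {
    grep -c 'outlink: toUrl:'
}

# Possible usage, assuming it is run from the Nutch runtime directory:
#   bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1" | count_outlinks
```

The number printed should agree with the "Outlinks:" line of the same run, so it mainly helps when the output has been saved to a file and two runs need to be compared.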

