Hi,

when I run parsechecker with the current trunk of 1.x, there are 653 outlinks
(including many "internal" ones):

% nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
...
Title: All Categories
Outlinks: 653
...

Which Nutch version is used? Can you try to reproduce the problem with a
"clean" Nutch (either 1.7 or 2.2.1) without any custom extensions
(parse filters, etc.)?

Thanks,
Sebastian

On 10/16/2013 01:32 AM, S.L wrote:
> Sebastian,
>
> Thank you for the lead. After using ParseChecker I get the following
> output. I can see that only two URLs are being parsed out of the page.
> *I see a pattern*: in this page almost all the URLs are enclosed in
> *<li></li>* tags and those are *not* getting picked up; the two URLs that
> are being picked up by the parser are *not* enclosed in a <li> tag.
>
> I have also attached the regex-urlfilter.txt along with the nutch-site.xml
> for your review.
>
> Please see the ParseChecker output below.
>
> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> contentType: text/html
> signature: cb07f28617927cc0accb150b22f84649
> ---------
> Url
> ---------------
>
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: All Categories
> Outlinks: 12
> outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
> outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
> outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
> outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
> outlink: toUrl: http://www.ebay.com anchor: eBay
> outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
> outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
> outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
> outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
> outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
> outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
> outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
> Content Metadata: Content-Language=en-US
> RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb
> Date=Tue, 15 Oct 2013 23:12:51 GMT Content-Encoding=gzip
> Set-Cookie=lucky9=1113957;Domain=.ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/
> Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server
> Cache-Control=private Pragma=no-cache
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
>
> All Categories Skip to main content eBay Shop by category Enter your search
> keyword All Categories Advanced
>
> On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi,
>>
>>> I am only interested in the internal links.
>> Then
>>   db.ignore.external.links = false
>> is correct.
>>
>> It is impossible to decide what's going wrong.
>> At first glance, all seems ok except one thing:
>> plugin.includes contains "scoring-optic".
>> It should be "scoring-opic". I don't know, but
>> that's hardly the reason.
>>
>> For a finer analysis, more details are required:
>> - URL filter and normalizers:
>>   are the desired URLs accepted?
>> - CustomFetchSchedule.java:
>>   shouldFetch() may play a role
>>
>> You can try to find the reason by:
>>
>> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>> Are all desired outlinks extracted by parser?
>>
>> (after fetch of start url)
>> % bin/nutch readdb .../crawldb -dump crawldb_dump
>> % less crawldb_dump/part-*
>> Are they in CrawlDb?
>>
>> Cheers,
>> Sebastian
>>
>> On 10/13/2013 04:18 AM, S.L wrote:
>>> Hello All,
>>>
>>> I am facing this problem with the URL
>>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL
>>> has many internal links present in the page and also has many external
>>> links to other domains; I am only interested in the internal links.
>>>
>>> However, when this page is crawled, the internal links in it are not added
>>> for fetching in the next round of fetching (I have given a depth of 100).
>>> I have already set db.ignore.internal.links to false, but for some
>>> reason the internal links are not getting added to the next round's fetch
>>> list.
>>>
>>> On the other hand, if I set db.ignore.external.links to false, it
>>> correctly picks up all the external links from the page.
>>>
>>> This problem is not present in any other domains. Can someone tell me
>>> what it is with this particular page?
>>>
>>> I have also attached the nutch-site.xml that I am using for your review,
>>> please advise.
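
For anyone comparing the two parsechecker runs in this thread (653 outlinks vs. 12), a small script can split the reported outlinks into same-host and other-host links, which makes it easier to see whether the internal links are missing at the parse step. The following is only a minimal sketch and not part of Nutch: the function name split_outlinks and the script name split_outlinks.py are made up for illustration, "internal" is assumed to mean "same host as the page URL", and the input is assumed to be unwrapped parsechecker output with one "outlink: toUrl: ... anchor: ..." entry per line.

```python
import re
import sys
from urllib.parse import urlparse

# Matches one outlink entry in the form printed by parsechecker:
#   outlink: toUrl: <url> anchor: <anchor text>
OUTLINK_RE = re.compile(r"outlink: toUrl:\s*(\S+?)\s+anchor:")


def split_outlinks(page_url, parsechecker_output):
    """Return (internal, external) lists of outlink URLs.

    Assumption: a link counts as internal when its host equals
    the host of page_url.
    """
    page_host = urlparse(page_url).hostname
    internal, external = [], []
    for url in OUTLINK_RE.findall(parsechecker_output):
        if urlparse(url).hostname == page_host:
            internal.append(url)
        else:
            external.append(url)
    return internal, external


if __name__ == "__main__":
    # Hypothetical usage: the page URL as first argument, output on stdin.
    internal, external = split_outlinks(sys.argv[1], sys.stdin.read())
    print("internal:", len(internal))
    print("external:", len(external))
```

It could be used by piping the parsechecker output into the script, for example: bin/nutch parsechecker "<url>" | python split_outlinks.py "<url>". A low internal count here would point at the parser or its configuration rather than at the URL filters or the CrawlDb update.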

