Re: Internal links not getting added to fetch list.

Sebastian Nagel Wed, 16 Oct 2013 14:02:14 -0700

Hi,

when I run parsechecker with the current trunk of 1.x, there are 653 outlinks
(including many "internal" ones):


% nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
...
Title: All Categories
Outlinks: 653
...


Which Nutch version is used?
Can you try to reproduce the problem with a "clean" Nutch (either 1.7 or 2.2.1)
without any custom extensions (parse filters, etc.)?
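
To compare the two runs, it can help to count the extracted outlinks per host.
The following is only a minimal sketch: it assumes the parsechecker output has
been saved as text with each outlink on a single line in the form
"outlink: toUrl: <url> anchor: <text>", as in the output quoted below. The
function name outlink_hosts and the sample text are illustrative and not part
of Nutch.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches lines such as "  outlink: toUrl: http://www.ebay.com anchor: eBay".
# The pattern also tolerates a missing space before "anchor:".
OUTLINK_RE = re.compile(r"outlink:\s*toUrl:\s*(\S+?)\s*anchor:")


def outlink_hosts(parsechecker_output):
    """Count extracted outlinks per host (hypothetical helper, not a Nutch API)."""
    hosts = Counter()
    for url in OUTLINK_RE.findall(parsechecker_output):
        hosts[urlsplit(url).hostname] += 1
    return hosts


# Sample lines copied from the quoted parsechecker output.
sample = """\
Outlinks: 3
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
"""

counts = outlink_hosts(sample)
for host, n in counts.most_common():
    print(host, n)
```

If the count for www.ebay.com stays small on a clean installation as well, the
problem is more likely in fetching or parsing than in the URL filters.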

Thanks,
Sebastian




On 10/16/2013 01:32 AM, S.L wrote:
> Sebastian,
> 
> Thank you for the lead. After running ParseChecker I get the output below.
> Only two URLs are being parsed out of the page. I see a pattern: in this
> page almost all the URLs are enclosed in <li></li> tags, and those are *not*
> getting picked up. The two URLs that are picked up by the parser are *not*
> enclosed in an <li> tag.
> 
> I have also attached the regex-urlfilter.txt along with the nutch-site.xml
> for your review.
> 
> Please see the ParseChecker output below.
> 
> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> contentType: text/html
> signature: cb07f28617927cc0accb150b22f84649
> ---------
> Url
> ---------------
> 
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> ---------
> ParseData
> ---------
> 
> Version: 5
> Status: success(1,0)
> Title: All Categories
> Outlinks: 12
>   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
>   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
>   outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
>   outlink: toUrl: http://www.ebay.com anchor: eBay
>   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
>   outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
>   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
>   outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
>   outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
>   outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
> Content Metadata: Content-Language=en-US
> RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct
> 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.
> ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close
> Content-Type=text/html;charset=utf-8 Server=eBay Server
> Cache-Control=private Pragma=no-cache
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
> 
> All Categories Skip to main content eBay Shop by category Enter your search
> keyword All Categories Advanced
> 
> 
> 
> 
> On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <[email protected]> wrote:
> 
>> Hi,
>>
>>> I am only interested in the internal links.
>> Then
>>   db.ignore.external.links = false
>> is correct.
>>
>> It is impossible to decide what's going wrong from this alone.
>> At first glance, all seems OK except for one thing:
>> plugin.includes contains "scoring-optic".
>> It should be "scoring-opic". I don't know for sure, but
>> that's hardly the reason.
>>
>> For a finer analysis, more details are required:
>> - URL filter and normalizers:
>>   are the desired URLs accepted
>> - CustomFetchSchedule.java:
>>   shouldFetch() may play a role
>>
>> You can try to find the reason by:
>>
>> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>> Are all desired outlinks extracted by parser?
>>
>> (after fetch of start url)
>> % bin/nutch readdb .../crawldb -dump crawldb_dump
>> % less crawldb_dump/part-*
>> Are they in CrawlDb?
>>
>> Cheers,
>> Sebastian
>>
>> On 10/13/2013 04:18 AM, S.L wrote:
>>> Hello All,
>>>
>>> I am facing a problem with the URL
>>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL has
>>> many internal links in the page and also many external links
>>> to other domains. I am only interested in the internal links.
>>>
>>> However, when this page is crawled, the internal links in it are not added
>>> for fetching in the next round (I have given a depth of 100).
>>> I have already set db.ignore.internal.links to false, but for some
>>> reason the internal links are not getting added to the next round's fetch
>>> list.
>>>
>>>
>>> On the other hand, if I set db.ignore.external.links to false, it correctly
>>> picks up all the external links from the page.
>>>
>>> This problem is not present on any other domain. Can someone tell me what
>>> is wrong with this particular page?
>>>
>>> I have also attached the nutch-site.xml that I am using for your review.
>>> Please advise.
>>>
>>
>>
>
