Re: Internal links not getting added to fetch list.

Sebastian Nagel Fri, 18 Oct 2013 12:49:36 -0700

Hi,

My nutch-site.xml is rather minimal; see below.


You could also check whether fetching of "raw" HTML content succeeds:

% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http \
  -verbose "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

Sebastian

>>> nutch-site.xml
<configuration>

<property>
  <name>http.agent.name</name>
  <value>sn-test-crawler</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>sn-test-crawler,*</value>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
</property>

<property>
  <name>http.agent.email</name>
  <value>wastldotnagelatgooglemaildotcom</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|file)|urlfilter-(regex|suffix)|parse-(html|tika|zip)|index-(basic|anchor|more)|scoring-opic</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

</configuration>
<<< nutch-site.xml
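
To compare parsechecker runs such as the ones quoted below, it may help to split the reported outlinks into internal and external ones. The following is only a rough sketch, not part of Nutch: the function name split_outlinks is made up for illustration, it assumes each outlink is printed on one unwrapped line of the form "outlink: toUrl: <url> anchor: <text>", and it treats "internal" as "same host as the page", which is an assumption about what is wanted here.

```python
import re
from urllib.parse import urlparse

# Assumed line format of parsechecker output (unwrapped):
#   outlink: toUrl: <url> anchor: <anchor text>
OUTLINK_RE = re.compile(r"^\s*outlink: toUrl: (\S+) anchor:")


def split_outlinks(parsechecker_output, page_url):
    """Hypothetical helper: return (internal, external) outlink URL lists.

    A link counts as internal when its host equals the host of page_url.
    """
    page_host = urlparse(page_url).hostname
    internal, external = [], []
    for line in parsechecker_output.splitlines():
        match = OUTLINK_RE.match(line)
        if not match:
            continue  # skip everything that is not an outlink line
        url = match.group(1)
        if urlparse(url).hostname == page_host:
            internal.append(url)
        else:
            external.append(url)
    return internal, external
```

Saving the parsechecker output to a file and passing its contents together with the page URL gives two lists whose lengths can be compared between a clean checkout and the customized setup.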

On 10/18/2013 05:31 AM, S.L wrote:
> Now I ran the clean trunk checkout as well. Unfortunately, on a clean trunk
> checkout (with name and plugin.folder value added to nutch-default.xml)
> I see the same behavior as with the clean 1.7 tag checkout. Obviously I am
> doing something wrong, and it has to do with the config files because I am
> not modifying the source in any way.
> 
> Would it be possible for you to share the config files you have used in the
> clean trunk checkout with me, please?
> 
> The following is the output from the trunk checkout execution.
> 
> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> contentType: text/html
> signature: 7026e09a97ff6df53f85d668bd86bcba
> ---------
> Url
> ---------------
> 
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> ---------
> ParseData
> ---------
> 
> Version: 5
> Status: success(1,0)
> Title: All Categories
> Outlinks: 8
>   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
>   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,GHCOLL,UX&h=24864 anchor:
>   outlink: toUrl: http://ir.ebaystatic.com/z/ic/1hsgocfebuyd3pnukphb3cmqz.css anchor:
>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
>   outlink: toUrl: http://www.ebay.com anchor: eBay
>   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
>   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
> Content Metadata: Content-Language=en-US
> RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141c972d0a9-0xeb Date=Fri, 18 Oct
> 2013 02:44:06 GMT Content-Encoding=gzip Set-Cookie=lucky9=5427514;Domain=.
> ebay.com;Expires=Wed, 17-Oct-2018 02:44:06 GMT;Path=/ Connection=close
> Content-Type=text/html;charset=utf-8 Server=eBay Server
> Cache-Control=private Pragma=no-cache
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
> 
> All Categories Skip to main content eBay Shop by category Enter your search
> keyword All Categories Advanced
> 
> 
> On Wed, Oct 16, 2013 at 5:00 PM, Sebastian Nagel <[email protected]
>> wrote:
> 
>> Hi,
>>
>> when I run parsechecker with the current trunk of 1.x, there are 653 outlinks
>> (including many "internal" ones):
>>
>> % nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>> ...
>> Title: All Categories
>> Outlinks: 653
>> ...
>>
>>
>> Which Nutch version is used?
>> Can you try to reproduce the problem with a "clean" Nutch (either 1.7 or
>> 2.2.1)
>> without any custom extensions (parse filters, etc.)?
>>
>> Thanks,
>> Sebastian
>>
>>
>>
>>
>> On 10/16/2013 01:32 AM, S.L wrote:
>>> Sebastian,
>>>
>>> Thank you for the lead. After using the ParseChecker, I get the following
>>> output. I can see that only two URLs are being parsed out of the page. I
>>> see a pattern: in this page almost all the URLs are enclosed in
>>> <li></li> tags and those are *not* getting picked up; the two URLs that
>>> are being picked up by the parser are *not* enclosed in a <li> tag.
>>>
>>> I have also attached the regex-urlfilter.txt along with the nutch-site.xml
>>> for your review.
>>>
>>> Please see the ParseChecker output below.
>>>
>>> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
>>> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
>>> contentType: text/html
>>> signature: cb07f28617927cc0accb150b22f84649
>>> ---------
>>> Url
>>> ---------------
>>>
>>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
>>> ---------
>>> ParseData
>>> ---------
>>>
>>> Version: 5
>>> Status: success(1,0)
>>> Title: All Categories
>>> Outlinks: 12
>>>   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
>>>   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
>>>   outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
>>>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
>>>   outlink: toUrl: http://www.ebay.com anchor: eBay
>>>   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
>>>   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
>>>   outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
>>>   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
>>>   outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
>>>   outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
>>>   outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
>>> Content Metadata: Content-Language=en-US
>>> RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct
>>> 2013 23:12:51 GMT Content-Encoding=gzip
>> Set-Cookie=lucky9=1113957;Domain=.
>>> ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close
>>> Content-Type=text/html;charset=utf-8 Server=eBay Server
>>> Cache-Control=private Pragma=no-cache
>>> Parse Metadata: CharEncodingForConversion=utf-8
>> OriginalCharEncoding=utf-8
>>> ---------
>>> ParseText
>>> ---------
>>>
>>> All Categories Skip to main content eBay Shop by category Enter your
>> search
>>> keyword All Categories Advanced
>>>
>>>
>>>
>>>
>>> On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <
>> [email protected]
>>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> I am only interested in the internal links.
>>>> Then
>>>>   db.ignore.external.links = false
>>>> is correct.
>>>>
>>>> It is impossible to decide what's going wrong.
>>>> At first glance, everything seems OK except one thing:
>>>> plugin.includes contains "scoring-optic".
>>>> It should be "scoring-opic". I don't know for sure, but
>>>> that is hardly the reason.
>>>>
>>>> For a finer analysis, more details are required:
>>>> - URL filter and normalizers:
>>>>   are the desired URLs accepted
>>>> - CustomFetchSchedule.java:
>>>>   shouldFetch() may play a role
>>>>
>>>> You can try to find the reason by:
>>>>
>>>> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>>>> Are all desired outlinks extracted by the parser?
>>>>
>>>> (after fetch of start url)
>>>> % bin/nutch readdb .../crawldb -dump crawldb_dump
>>>> % less crawldb_dump/part-*
>>>> Are they in CrawlDb?
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>> On 10/13/2013 04:18 AM, S.L wrote:
>>>>> Hello All,
>>>>>
>>>>> I am facing a problem with the URL
>>>>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL
>>>>> has many internal links present in the page and also has many external
>>>>> links to other domains; I am only interested in the internal links.
>>>>>
>>>>> However, when this page is crawled, the internal links in it are not added
>>>>> for fetching in the next round of fetching (I have given a depth of 100).
>>>>> I have already set db.ignore.internal.links to false, but for some
>>>>> reason the internal links are not getting added to the next round's fetch
>>>>> list.
>>>>>
>>>>> On the other hand, if I set db.ignore.external.links to false, it
>>>>> correctly picks up all the external links from the page.
>>>>>
>>>>> This problem is not present for any other domain. Can someone tell me what
>>>>> is wrong with this particular page?
>>>>>
>>>>> I have also attached the nutch-site.xml that I am using for your review;
>>>>> please advise.
>>>>>
>>>>
>>>>
>>>
>>
>>
>
