Hi Yulio,
In this case Nutch behaves correctly ("politely"):
When I run parsechecker I get:
Parse Metadata: robots=noindex,nofollow ...
because the page contains this meta tag:
<meta name="robots" content="noindex,nofollow" />
Because of this robots directive, Nutch empties the content, title,
and outlinks of this page.
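
If you want to check other pages for the same directive before crawling
them, something like the following rough sketch may help. It is not Nutch
code, only a stand-alone illustration using the Python standard library;
the names RobotsMetaParser and robots_directives are made up for this
example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here, because
        # the default handle_startendtag() delegates to handle_starttag().
        if tag != "meta":
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical minimal page carrying the same tag as the page in question.
page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)
```

A page for which this reports noindex is one where an empty title in the
parsechecker output is expected rather than a bug.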
Best,
Sebastian
On 04/23/2015 07:40 PM, Ing. Yulio Aleman Jimenez wrote:
> Hi. I am new to Nutch 1.9 (local mode) and Solr 4.10, and I have a problem
> when the spider tries to identify the title of resources. In many cases
> Nutch doesn't identify the title of a web page, even though the page has a
> title. I ran parsechecker on this web page and Nutch didn't detect any
> title.
>
> This URL is an example. This page has a title and Nutch doesn't detect it:
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>
>
> For this URL, this is my output of the parsechecker:
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> [root@cidicubanutch2 generales]# bin/nutch parsechecker
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>
> fetching:
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>
> parsing:
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>
> contentType: text/html
> signature: 32541e28e020f7c290735bfe2cc4c7b3
> ---------
> Url
> ---------------
>
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: Content-Language=es Content-Length=9504 Expires=Sat, 23 May
> 2015 17:35:15 GMT Connection=close X-Cache-Lookup=MISS from www.ecured.cu:80
> Server=Apache X-Cache=MISS from www.ecured.cu X-Content-Type-Options=nosniff
> Cache-Control=s-maxage=10, must-revalidate, max-age=0, max-age=2592000
> X-Frame-Options=DENY Date=Thu, 23 Apr 2015 17:35:15 GMT
> Vary=Accept-Encoding,Cookie,User-Agent nutch.crawl.score=0.0
> Content-Encoding=gzip Via=1.0 www.ecured.cu (squid/3.1.10)
> Content-Type=text/html; charset=UTF-8
> Parse Metadata: Custom-Tag=h1-
> Cambios relacionados con «EcuRed:Enciclopedia cubana»
>
> Custom-Tag=strong-(+1940) Custom-Tag=strong-(+4392) Custom-Tag=strong-50
> Custom-Tag=strong-7 CharEncodingForConversion=utf-8
> OriginalCharEncoding=utf-8 language=lt
> [root@cidicubanutch2 generales]#
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> I have the field "title" set to required="true" in the schema.xml of Nutch
> and Solr to prevent indexing resources without a title.
>
> I hope somebody can help me.
>
>
>