Re: [MASSMAIL]Re: Help about parsing the title of resources with Nutch 1.9

Ing. Yulio Aleman Jimenez Thu, 23 Apr 2015 13:39:18 -0700

Thanks, I will review the configuration for this. 

----- Original message -----


From: "Sebastian Nagel" <[email protected]> 
To: [email protected] 
Sent: Thursday, 23 April 2015 12:48:13 
Subject: [MASSMAIL]Re: Help about parsing the title of resources with Nutch 1.9 

Hi Yulio, 

in this case Nutch behaves correctly ("politely"): 
When I run parsechecker I get: 
Parse Metadata: robots=noindex,nofollow ... 
because of the meta tags: 
<meta name="robots" content="noindex,nofollow" /> 

Because of this robots directive, Nutch empties the content, title, 
and outlinks of this page. 
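
To check other pages for the same directive, something like the following 
could help. It is only a minimal sketch in Python using the standard library, 
not how Nutch itself implements the check; the names RobotsMetaParser and 
robots_directives are made up for this illustration. 

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = (
        "<html><head><title>Example</title>"
        '<meta name="robots" content="noindex,nofollow" />'
        "</head><body></body></html>"
    )
    print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

A page for which this returns "noindex" or "nofollow" is one where an empty 
title or zero outlinks in the parsechecker output is to be expected. 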

Best, 
Sebastian 

On 04/23/2015 07:40 PM, Ing. Yulio Aleman Jimenez wrote: 
> Hi. I am new to Nutch 1.9 (local mode) and Solr 4.10, and I have a problem 
> when the crawler tries to identify the title of resources. In many cases 
> Nutch does not identify the title of a web page even though the page has 
> one. I ran parsechecker on such a page and Nutch did not detect any 
> title. 
> 
> This URL is an example. The page has a title, but Nutch does not detect it: 
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>  
> 
> For this URL, this is my output of the parsechecker: 
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  
> [root@cidicubanutch2 generales]# bin/nutch parsechecker 
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>  
> fetching: 
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>  
> parsing: 
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>  
> contentType: text/html 
> signature: 32541e28e020f7c290735bfe2cc4c7b3 
> --------- 
> Url 
> --------------- 
> 
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>  
> --------- 
> ParseData 
> --------- 
> 
> Version: 5 
> Status: success(1,0) 
> Title: 
> Outlinks: 0 
> Content Metadata: Content-Language=es Content-Length=9504 Expires=Sat, 23 May 
> 2015 17:35:15 GMT Connection=close X-Cache-Lookup=MISS from www.ecured.cu:80 
> Server=Apache X-Cache=MISS from www.ecured.cu X-Content-Type-Options=nosniff 
> Cache-Control=s-maxage=10, must-revalidate, max-age=0, max-age=2592000 
> X-Frame-Options=DENY Date=Thu, 23 Apr 2015 17:35:15 GMT 
> Vary=Accept-Encoding,Cookie,User-Agent nutch.crawl.score=0.0 
> Content-Encoding=gzip Via=1.0 www.ecured.cu (squid/3.1.10) 
> Content-Type=text/html; charset=UTF-8 
> Parse Metadata: Custom-Tag=h1- 
> Cambios relacionados con «EcuRed:Enciclopedia cubana» 
> 
> Custom-Tag=strong-(+1940) Custom-Tag=strong-(+4392) Custom-Tag=strong-50 
> Custom-Tag=strong-7 CharEncodingForConversion=utf-8 
> OriginalCharEncoding=utf-8 language=lt 
> [root@cidicubanutch2 generales]# 
> 
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  
> 
> I have set the field "title" to required="true" in the schema.xml used by 
> Nutch and Solr, to prevent indexing resources without a title. 
> 
> I hope somebody can help me. 
> 
> 
> 




-- 
Ing. Yulio Aleman Jimenez 
Dpto. Soluciones Informáticas para Internet. Centro de Ideoinformática (CIDI) 
Universidad de las Ciencias Informáticas (UCI) 
-----------------------------------------------------------------------------------------------------------------------------------
 
"Men may die, BUT NEVER THEIR IDEAS"
