Thanks, I will review my configuration with that in mind.

----- Original message -----
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Sent: Thursday, April 23, 2015 12:48:13
Subject: [MASSMAIL]Re: Help about parsing the title of resources with Nutch 1.9

Hi Yulio,

in this case Nutch behaves correctly ("politely"). When I run parsechecker I get:

  Parse Metadata: robots=noindex,nofollow ...

because of the meta tag:

  <meta name="robots" content="noindex,nofollow" />

Because of this robots directive, Nutch empties the content, title and outlinks of this page.

Best,
Sebastian

On 04/23/2015 07:40 PM, Ing. Yulio Aleman Jimenez wrote:
> Hi. I am new to Nutch 1.9 (local mode) and Solr 4.10, and I have a problem
> when the spider tries to identify the title of resources. In many cases
> Nutch does not identify the title of a web page, even though the page has
> one. I ran parsechecker on such a page and Nutch did not detect any title.
>
> This URL is an example. The page has a title and Nutch does not detect it:
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
>
> For this URL, this is the output of parsechecker:
> ------------------------------------------------------------
> [root@cidicubanutch2 generales]# bin/nutch parsechecker http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
> fetching: http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
> parsing: http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
> contentType: text/html
> signature: 32541e28e020f7c290735bfe2cc4c7b3
> ---------
> Url
> ---------------
> http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: Content-Language=es Content-Length=9504
>   Expires=Sat, 23 May 2015 17:35:15 GMT Connection=close
>   X-Cache-Lookup=MISS from www.ecured.cu:80 Server=Apache
>   X-Cache=MISS from www.ecured.cu X-Content-Type-Options=nosniff
>   Cache-Control=s-maxage=10, must-revalidate, max-age=0, max-age=2592000
>   X-Frame-Options=DENY Date=Thu, 23 Apr 2015 17:35:15 GMT
>   Vary=Accept-Encoding,Cookie,User-Agent nutch.crawl.score=0.0
>   Content-Encoding=gzip Via=1.0 www.ecured.cu (squid/3.1.10)
>   Content-Type=text/html; charset=UTF-8
> Parse Metadata: Custom-Tag=h1- Cambios relacionados con «EcuRed:Enciclopedia cubana»
>   Custom-Tag=strong-(+1940) Custom-Tag=strong-(+4392) Custom-Tag=strong-50
>   Custom-Tag=strong-7 CharEncodingForConversion=utf-8
>   OriginalCharEncoding=utf-8 language=lt
> [root@cidicubanutch2 generales]#
> ------------------------------------------------------------
>
> I have set the "title" field to required="true" in the schema.xml of both
> Nutch and Solr to prevent indexing resources without a title.
>
> I hope somebody can help me.
>

-- 
Ing. Yulio Aleman Jimenez
Dpto. Soluciones Informáticas para Internet.
Centro de Ideoinformática (CIDI)
Universidad de las Ciencias Informáticas (UCI)
------------------------------------------------------------
"Men may die, BUT NEVER THEIR IDEAS"
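
For anyone who finds this thread later: the check Sebastian describes can be reproduced outside Nutch with a few lines of Python. This is only a rough sketch using the standard library, not Nutch's actual code. The names RobotsMetaParser, robots_directives and apply_robots_policy are made up for illustration, and the mapping (noindex clears the title, nofollow clears the outlinks) is an assumption based on the behaviour reported above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def apply_robots_policy(html, title, outlinks):
    """Hypothetical helper that mimics the effect described in the thread.

    Assumption: "noindex" (or "none") blanks the title, and
    "nofollow" (or "none") drops the outlinks.
    """
    directives = robots_directives(html)
    if "noindex" in directives or "none" in directives:
        title = ""
    if "nofollow" in directives or "none" in directives:
        outlinks = []
    return title, outlinks


if __name__ == "__main__":
    page = (
        '<html><head><meta name="robots" content="noindex,nofollow" />'
        "<title>Some title</title></head><body></body></html>"
    )
    print(sorted(robots_directives(page)))
    print(apply_robots_policy(page, "Some title", ["http://example.org/"]))
```

Feeding it the HTML of a page that has a <title> but still comes back from parsechecker with an empty Title shows whether a robots meta tag is the reason.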

