Hi. I am new to Nutch 1.9 (local mode) and Solr 4.10, and I have a problem 
when the crawler tries to identify the titles of resources. In many cases 
Nutch does not detect the title of a web page even though the page has one. 
I ran parsechecker on such a page and Nutch did not detect any title. 

This URL is an example. The page has a title, but Nutch does not detect it: 
http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
 

This is the parsechecker output for that URL: 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
[root@cidicubanutch2 generales]# bin/nutch parsechecker http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
fetching: http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
parsing: http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
 
contentType: text/html 
signature: 32541e28e020f7c290735bfe2cc4c7b3 
--------- 
Url 
--------------- 

http://www.ecured.cu/index.php/Especial:CambiosEnEnlazadas/EcuRed:Enciclopedia_cubana
 
--------- 
ParseData 
--------- 

Version: 5 
Status: success(1,0) 
Title: 
Outlinks: 0 
Content Metadata: Content-Language=es Content-Length=9504 Expires=Sat, 23 May 
2015 17:35:15 GMT Connection=close X-Cache-Lookup=MISS from www.ecured.cu:80 
Server=Apache X-Cache=MISS from www.ecured.cu X-Content-Type-Options=nosniff 
Cache-Control=s-maxage=10, must-revalidate, max-age=0, max-age=2592000 
X-Frame-Options=DENY Date=Thu, 23 Apr 2015 17:35:15 GMT 
Vary=Accept-Encoding,Cookie,User-Agent nutch.crawl.score=0.0 
Content-Encoding=gzip Via=1.0 www.ecured.cu (squid/3.1.10) 
Content-Type=text/html; charset=UTF-8 
Parse Metadata: Custom-Tag=h1- 
Cambios relacionados con «EcuRed:Enciclopedia cubana» 

Custom-Tag=strong-(+1940) Custom-Tag=strong-(+4392) Custom-Tag=strong-50 
Custom-Tag=strong-7 CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 
language=lt 
[root@cidicubanutch2 generales]# 

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
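
To check many pages for the same symptom, saved parsechecker output can be scanned for an empty "Title:" line. The following is only a minimal Python sketch and not part of Nutch; the function name extract_title and the sample text are assumptions made for illustration, and it relies only on the "Title:" line format visible in the output above.

```python
import re


def extract_title(parsechecker_output: str) -> str:
    """Return the text after 'Title:' on its line, or '' if empty or absent.

    Hypothetical helper: it only looks for the 'Title:' line as it appears
    in the ParseData block shown above.
    """
    # [ \t]* instead of \s* so the match never spills onto the next line.
    match = re.search(r"^Title:[ \t]*(.*)$", parsechecker_output, flags=re.MULTILINE)
    return match.group(1).strip() if match else ""


# Sample text copied from the ParseData block above (illustration only).
sample = "Version: 5\nStatus: success(1,0)\nTitle: \nOutlinks: 0\n"

title = extract_title(sample)
if title:
    print("title detected:", title)
else:
    print("no title detected")
```

For the output above this reports that no title was detected, which matches what I see by eye.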
 

I have set the "title" field to required="true" in the schema.xml of both 
Nutch and Solr to prevent resources without a title from being indexed. 

I hope somebody can help me. 

