How to insert outlinks from RSS into the crawldb?

Eyeris Rodriguez Rueda Fri, 04 Nov 2016 09:17:29 -0700

Hi.
I am using Nutch 1.12 and Solr 4.10.3.

I know that RSS feeds are parsed by Tika by default, and that the feed 
plugin can also be added to parse feed URLs.
RSS is a significant way to discover new URLs, and it is very important for me.
In my case I have activated only the Tika parser, because when I use both 
(Tika and feed) the content and outlinks fields are empty in Solr.
For some reason the outlinks are not extracted correctly with both enabled, 
but with Tika alone they are extracted very well.
My problem is that the outlinks detected inside a feed are indexed correctly 
in the outlinks field of the feed URL, but they are not added to the crawldb 
as URLs. See below.


http://www.cubadebate.cu/feed/

        "url": "http://www.cubadebate.cu/feed/",
        "content": "..........",
        "tstamp": "2016-11-04T14:33:21.561Z",
        "segment": "20161104100311",
        "domain": "cubadebate.cu",
        "digest": "86af35325f0dcb671d24587ccda4ab64",
        "host": "www.cubadebate.cu",
        "boost": 1,
        "contentLength": 4085,
        "outlinks": [
          "http://www.cubadebate.cu/noticias/2016/11/04/fifa-cristiano-griezmann-y-messi-entre-los-23-nominados-al-trofeo-the-best/",
          "http://www.cubadebate.cu/noticias/2016/11/04/unicef-600-000-ninos-en-haiti-afectado-por-huracan-necesitan-ayuda/",
          "http://www.cubadebate.cu/noticias/2016/11/04/juan-pablo-escobar-el-dinero-de-la-droga-nunca-abandona-estados-unidos-video/",
          "http://www.cubadebate.cu/noticias/2016/11/04/que-trae-la-prensa-cubana-viernes-4-de-noviembre-de-2016/",
          "http://www.cubadebate.cu/noticias/2016/11/04/la-participacion-a-las-elecciones-de-eeuu-es-una-de-las-mas-bajas-del-mundo-desde-1980/",
          "http://www.cubadebate.cu/noticias/2016/11/04/acuerdo-de-paris-sobre-cambio-climatico-entra-en-vigor-este-viernes/",
          "http://www.cubadebate.cu/opinion/2016/11/04/brasil-detras-del-show-la-despolitizacion/",
          "http://www.cubadebate.cu/noticias/2016/11/03/inicia-jetblue-vuelos-regulares-a-camaguey/",
          "http://www.cubadebate.cu/noticias/2016/11/03/beisbol-tarde-de-lechadas-y-otra-noche-de-saavedra/",
          "http://www.cubadebate.cu/noticias/2016/11/03/que-traen-las-empresas-cubanas-a-fihav-2016/"
        ],
        "id": "http://www.cubadebate.cu/feed/",


After the crawl process finishes, only 1 URL is in the crawldb:
bin/nutch readdb crawl/crawldb/ -stats

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     1
retry 0:        1
min score:      1.0
avg score:      1.0
max score:      1.0
status 2 (db_fetched):  1
CrawlDb statistics: done
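
To make the mismatch concrete, here is a minimal Python sketch (not part of Nutch) that reads the "TOTAL urls" line from the stats output above and compares it with the 10 outlinks indexed in Solr. The helper names total_urls and missing_outlinks are hypothetical, and the sketch assumes the seed and the outlinks are all distinct URLs that would pass the URL filters.

```python
import re

# Output of `bin/nutch readdb crawl/crawldb/ -stats`, copied from this message.
STATS_OUTPUT = """\
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     1
retry 0:        1
min score:      1.0
avg score:      1.0
max score:      1.0
status 2 (db_fetched):  1
CrawlDb statistics: done
"""


def total_urls(stats_text):
    """Return the integer after 'TOTAL urls:', or None if that line is missing."""
    match = re.search(r"^TOTAL urls:\s*(\d+)\s*$", stats_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def missing_outlinks(stats_text, seed_count, outlink_count):
    """Upper bound on how many indexed outlinks are absent from the crawldb.

    Assumption (hypothetical): seeds and outlinks are distinct URLs and none
    of them is rejected by a URL filter or normalizer.
    """
    expected = seed_count + outlink_count
    return expected - total_urls(stats_text)


if __name__ == "__main__":
    # One seed (the feed URL) and the 10 outlinks listed in the Solr document.
    print(total_urls(STATS_OUTPUT))
    print(missing_outlinks(STATS_OUTPUT, seed_count=1, outlink_count=10))
```

With one seed and 10 indexed outlinks, the crawldb would hold 11 URLs if every outlink had been added; it holds only the seed.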

I have read the CrawlDb, LinkDb, and LinkDbMerger classes, but I can't find 
how to insert outlinks from a feed into the crawldb.
Can anybody please help me or point me in the right direction to insert the 
outlinks from a feed into the crawldb, so that they are visited in the next round?
