Hi,
I'm using the most recent Nutch 2.x to crawl a single site, storing
the results in HBase and then indexing for search with ElasticSearch.
My crawl and indexing complete as expected. Looking in HBase, I see
the metadata I would expect for a record, with fields like:
f:typ
  timestamp=1346408694547, value=text/html
h:Cache-Control
  timestamp=1346408694547, value=private
h:Connection
  timestamp=1346408694547, value=close
h:Content-Length
  timestamp=1346408694547, value=47166
h:Content-Type
  timestamp=1346408694547, value=text/html; charset=utf-8
h:Date
  timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
h:Server
  timestamp=1346408694547, value=Microsoft-IIS/6.0
h:Set-Cookie
  timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
h:X-AspNet-Version
  timestamp=1346408694547, value=2.0.50727
h:X-Powered-By
  timestamp=1346408694547, value=ASP.NET
h:p3p
  timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
  timestamp=1346408808930, value=Printable Version
il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
  timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
But after indexing with bin/nutch elasticindex, the same record in
ElasticSearch contains only a subset of the fields that I see in
HBase:
{
  id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
  site: "www.ci.watertown.ma.us",
  content: "Watertown, MA - ...",
  title: "Watertown, MA - Official Website",
  host: "www.ci.watertown.ma.us",
  digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
  boost: "0.0",
  tstamp: "2013-06-27T10:24:37.846Z",
  url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
  anchor: [
    "5.18.10 Board of Health May Minutes",
    "Printable Version"
  ]
}
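To make the gap concrete, here is a small Python sketch that lists the
HBase header columns with no matching field in the indexed document.
It is illustrative only: both sets are hand-copied from the output
above rather than read from HBase or ElasticSearch, the helper name
missing_from_index is made up, and matching by identical
(case-insensitive) name is my assumption about how a header would be
mapped to an index field.

```python
# Header qualifiers (the "h:" columns) hand-copied from the HBase row above.
hbase_headers = {
    "Cache-Control", "Connection", "Content-Length", "Content-Type",
    "Date", "Server", "Set-Cookie", "X-AspNet-Version", "X-Powered-By", "p3p",
}

# Field names hand-copied from the ElasticSearch document above.
es_fields = {
    "id", "site", "content", "title", "host",
    "digest", "boost", "tstamp", "url", "anchor",
}


def missing_from_index(source_columns, indexed_fields):
    """Return source columns that have no same-named indexed field.

    Names are compared case-insensitively. This is an assumption for
    illustration; the real indexer may rename fields.
    """
    indexed = {name.lower() for name in indexed_fields}
    return sorted(c for c in source_columns if c.lower() not in indexed)


missing = missing_from_index(hbase_headers, es_fields)
print(missing)
```

Under that naming assumption, none of the h: columns appear in the
index, including Content-Type.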
I need to be able to search/query against fields like Content-Type.
Am I missing a configuration setting that stores those fields in the
search index, or is something else preventing the fields that I see
in HBase from showing up in ElasticSearch?
I'm very new to the Nutch codebase, but I've looked in
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
and didn't notice anything that would prevent all the fields from
getting into ElasticSearch.
Thanks,
Matt