Hi,

I'm using the most recent Nutch 2.x to crawl a single site, storing
the results in HBase and then indexing for search with ElasticSearch.
My crawl and indexing complete as expected. Looking in HBase, I see
the metadata that I would expect for a record, with fields like:

 f:typ
                     timestamp=1346408694547, value=text/html
 h:Cache-Control
                     timestamp=1346408694547, value=private
 h:Connection
                     timestamp=1346408694547, value=close
 h:Content-Length
                     timestamp=1346408694547, value=47166
 h:Content-Type
                     timestamp=1346408694547, value=text/html; charset=utf-8
 h:Date
                     timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
 h:Server
                     timestamp=1346408694547, value=Microsoft-IIS/6.0
 h:Set-Cookie
                     timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
 h:X-AspNet-Version
                     timestamp=1346408694547, value=2.0.50727
 h:X-Powered-By
                     timestamp=1346408694547, value=ASP.NET
 h:p3p
                     timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
 il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
                     timestamp=1346408808930, value=Printable Version
 il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
                     timestamp=1346408662165, value=5.18.10 Board of Health May Minutes

But after indexing with bin/nutch elasticindex and looking at the same
record in ElasticSearch I'm only seeing a subset of the fields that I
see in HBase.

{
  id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
  site: "www.ci.watertown.ma.us",
  content: "Watertown, MA - ...",
  title: "Watertown, MA - Official Website",
  host: "www.ci.watertown.ma.us",
  digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
  boost: "0.0",
  tstamp: "2013-06-27T10:24:37.846Z",
  url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
  anchor: [
    "5.18.10 Board of Health May Minutes",
    "Printable Version"
  ]
}
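
To make the gap concrete, here is a small Python sketch that lists
which of the h: header columns have no same-named field in the indexed
document. The two lists are copied by hand from the dumps above, and
the helper name missing_headers is only for illustration; nothing here
talks to HBase or ElasticSearch.

```python
# Column qualifiers copied by hand from the HBase dump (h: family only).
hbase_headers = [
    "Cache-Control", "Connection", "Content-Length", "Content-Type",
    "Date", "Server", "Set-Cookie", "X-AspNet-Version", "X-Powered-By", "p3p",
]

# Top-level field names copied by hand from the ElasticSearch document.
es_fields = [
    "id", "site", "content", "title", "host",
    "digest", "boost", "tstamp", "url", "anchor",
]


def missing_headers(headers, fields):
    """Return the headers that have no same-named field (case-insensitive)."""
    indexed = {name.lower() for name in fields}
    return [h for h in headers if h.lower() not in indexed]


# Hypothetical helper, applied to the two hand-copied lists.
missing = missing_headers(hbase_headers, es_fields)
print(missing)
```

With the data above, every h: header comes back as missing; none of
them has a counterpart in the indexed document.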

I need to be able to search/query against fields like Content-Type,
so I'm wondering whether I'm missing a configuration setting that
stores those fields in the search index, or what else might be
preventing the fields that I see in HBase from showing up in
ElasticSearch.
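
For illustration, the kind of query I'd like to be able to run looks
roughly like the sketch below. It only builds the JSON body of a
standard match query; the field name "Content-Type" is an assumption,
since no such field exists in my index yet, and header_query is just
a made-up helper name.

```python
import json


def header_query(field, value):
    """Build the body of an Elasticsearch match query for one field."""
    return {"query": {"match": {field: value}}}


# "Content-Type" is a hypothetical field name: it is not in the index today.
body = header_query("Content-Type", "text/html")
print(json.dumps(body, indent=2))
```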

I'm very new to the Nutch codebase but I've looked in
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
and didn't notice anything that would prevent all the fields from
getting into ElasticSearch.

Thanks,
Matt
