Hi,

Do any of the fields that are missing from the index contain special characters, such as a hyphen? I can imagine that those are not supported (I have not tested this).
Ferdy.

On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
> Hi,
>
> I'm using the most recent Nutch 2.x to crawl a single site, storing
> the results in HBase and then indexing for search with ElasticSearch.
> My crawl and indexing complete as expected. Looking in HBase I see
> metadata that I would expect for a record. Fields like:
>
> f:typ
>     timestamp=1346408694547, value=text/html
> h:Cache-Control
>     timestamp=1346408694547, value=private
> h:Connection
>     timestamp=1346408694547, value=close
> h:Content-Length
>     timestamp=1346408694547, value=47166
> h:Content-Type
>     timestamp=1346408694547, value=text/html; charset=utf-8
> h:Date
>     timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
> h:Server
>     timestamp=1346408694547, value=Microsoft-IIS/6.0
> h:Set-Cookie
>     timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
> h:X-AspNet-Version
>     timestamp=1346408694547, value=2.0.50727
> h:X-Powered-By
>     timestamp=1346408694547, value=ASP.NET
> h:p3p
>     timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
> il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
>     timestamp=1346408808930, value=Printable Version
> il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
>     timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
>
> But after indexing with bin/nutch elasticindex and looking at the same
> record in ElasticSearch I'm only seeing a subset of the fields that I
> see in HBase.
>
> {
>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
>   site: "www.ci.watertown.ma.us",
>   content: "Watertown, MA - ...",
>   title: "Watertown, MA - Official Website",
>   host: "www.ci.watertown.ma.us",
>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
>   boost: "0.0",
>   tstamp: "2013-06-27T10:24:37.846Z",
>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
>   anchor: [
>     "5.18.10 Board of Health May Minutes",
>     "Printable Version"
>   ]
> }
>
> I will need to be able to search/query against fields like
> Content-Type so I'm wondering if I'm missing a configuration setting
> to store those fields in the search index or what else might be going
> on that is preventing the fields that I'm seeing in HBase from showing
> up in ElasticSearch.
>
> I'm very new to the Nutch codebase but I've looked in
>
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
>
> and didn't notice anything that would prevent all the fields from
> getting into ElasticSearch.
>
> Thanks,
> Matt
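
A quick way to check the hyphen idea against the data quoted in this thread is to compare the header names from the HBase dump with the keys of the ElasticSearch document and split the missing ones by whether they contain a hyphen. The sketch below is only an illustration: the two lists are copied by hand from the message above, and `missing_fields` is a hypothetical helper, not part of Nutch, HBase, or ElasticSearch.

```python
# Header names copied by hand from the "h:" columns in the HBase dump above.
hbase_headers = [
    "Cache-Control", "Connection", "Content-Length", "Content-Type",
    "Date", "Server", "Set-Cookie", "X-AspNet-Version", "X-Powered-By", "p3p",
]

# Top-level keys of the ElasticSearch document quoted above.
es_fields = {
    "id", "site", "content", "title", "host",
    "digest", "boost", "tstamp", "url", "anchor",
}


def missing_fields(source_names, indexed_names):
    """Return the source names that have no case-insensitive match in the index."""
    indexed_lower = {name.lower() for name in indexed_names}
    return [name for name in source_names if name.lower() not in indexed_lower]


missing = missing_fields(hbase_headers, es_fields)
with_hyphen = [name for name in missing if "-" in name]
without_hyphen = [name for name in missing if "-" not in name]

print("missing, hyphenated:", with_hyphen)
print("missing, no hyphen: ", without_hyphen)
```

With the values quoted above, headers such as Connection, Date and Server contain no hyphen and are still absent from the indexed document, so hyphens alone would not account for every missing field.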

