Hi Ferdy,

It's likely that I'm confused about what to expect in the ElasticSearch index. Reviewing both ElasticWriter.java and NutchDocument.java I see that there are two properties that store data about the document:

private Map<String, List<String>> fields;
private Metadata documentMeta;

Looking at Metadata.java it's likely that the fields that I was expecting to show up in ElasticSearch (HTTP Headers like Content-Type, Last-Modified, etc.) would be contained in the documentMeta property. Is there a reason that the write(NutchDocument) method in ElasticWriter shouldn't also store documentMeta in ElasticSearch?

Thanks,
Matt

On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
> Hi,
>
> Do some of the fields that are missing in the index have any special
> characters, such as hyphen? I can imagine that those are not supported. (I
> have not tested this).
>
> Ferdy.
>
> On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
>
>> Hi,
>>
>> I'm using the most recent Nutch 2.x to crawl a single site, storing
>> the results in HBase and then indexing for search with ElasticSearch.
>> My crawl and indexing complete as expected. Looking in HBase I see
>> metadata that I would expect for a record.
>> Fields like:
>>
>>  f:typ
>>      timestamp=1346408694547, value=text/html
>>  h:Cache-Control
>>      timestamp=1346408694547, value=private
>>  h:Connection
>>      timestamp=1346408694547, value=close
>>  h:Content-Length
>>      timestamp=1346408694547, value=47166
>>  h:Content-Type
>>      timestamp=1346408694547, value=text/html;
>>      charset=utf-8
>>  h:Date
>>      timestamp=1346408694547, value=Fri, 31 Aug 2012
>>      10:24:37 GMT
>>  h:Server
>>      timestamp=1346408694547, value=Microsoft-IIS/6.0
>>  h:Set-Cookie
>>      timestamp=1346408694547,
>>      value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
>>  h:X-AspNet-Version
>>      timestamp=1346408694547, value=2.0.50727
>>  h:X-Powered-By
>>      timestamp=1346408694547, value=ASP.NET
>>  h:p3p
>>      timestamp=1346408694547, value=CP="IDC DSP COR
>>      ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
>>  il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
>>      timestamp=1346408808930, value=Printable Version
>>  il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
>>      timestamp=1346408662165, value=5.18.10 Board of
>>      Health May Minutes
>>
>> But after indexing with bin/nutch elasticindex and looking at the same
>> record in ElasticSearch I'm only seeing a subset of the fields that I
>> see in HBase.
>>
>> {
>>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
>>   site: "www.ci.watertown.ma.us",
>>   content: "Watertown, MA - ...",
>>   title: "Watertown, MA - Official Website",
>>   host: "www.ci.watertown.ma.us",
>>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
>>   boost: "0.0",
>>   tstamp: "2013-06-27T10:24:37.846Z",
>>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
>>   anchor: [
>>     "5.18.10 Board of Health May Minutes",
>>     "Printable Version"
>>   ]
>> }
>>
>> I will need to be able to search/query against fields like
>> Content-Type so I'm wondering if I'm missing a configuration setting
>> to store those fields in the search index or what else might be going
>> on that is preventing the fields that I'm seeing in HBase from showing
>> up in ElasticSearch.
>>
>> I'm very new to the Nutch codebase but I've looked in
>>
>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
>> and didn't notice anything that would prevent all the fields from
>> getting into ElasticSearch.
>>
>> Thanks,
>> Matt
>>
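
For illustration only, here is a minimal sketch of the kind of merge the reply at the top of this thread asks about: building one flat document source from both the indexed fields and the document metadata. It does not use the Nutch or ElasticSearch APIs. The class name MetadataMergeSketch, the method toSource, the "meta_" key prefix, and the use of plain Map<String, List<String>> as a stand-in for both NutchDocument's fields and its Metadata object are all assumptions made for the example, not what ElasticWriter actually does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of Nutch. Plain maps stand in for
// NutchDocument.fields and NutchDocument.documentMeta.
public class MetadataMergeSketch {

    // Assumed prefix so metadata keys cannot overwrite regular index fields.
    static final String META_PREFIX = "meta_";

    // Builds a single flat source map from the fields and the metadata.
    static Map<String, Object> toSource(Map<String, List<String>> fields,
                                        Map<String, List<String>> documentMeta) {
        Map<String, Object> source = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            put(source, e.getKey(), e.getValue());
        }
        for (Map.Entry<String, List<String>> e : documentMeta.entrySet()) {
            put(source, META_PREFIX + e.getKey(), e.getValue());
        }
        return source;
    }

    // Stores a single value as a String and multiple values as a List;
    // null or empty value lists are skipped.
    private static void put(Map<String, Object> source, String key, List<String> values) {
        if (values == null || values.isEmpty()) {
            return;
        }
        source.put(key, values.size() == 1 ? values.get(0) : new ArrayList<>(values));
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title", List.of("Watertown, MA - Official Website"));
        fields.put("anchor", List.of("5.18.10 Board of Health May Minutes", "Printable Version"));

        Map<String, List<String>> meta = new LinkedHashMap<>();
        meta.put("Content-Type", List.of("text/html; charset=utf-8"));

        System.out.println(toSource(fields, meta));
    }
}
```

Whether a change like this belongs in ElasticWriter itself or in an indexing filter that copies selected headers into the fields map is a separate design question; the sketch only shows the shape of the merged document.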

