Hi Ferdy,

Thanks for the additional information. I found out what was missing in my configuration. I updated the plugin.includes property in nutch-site.xml to use index-(basic|anchor|more|urlmeta), and I'm now seeing the fields I was expecting in the ElasticSearch index. These are the relevant URLs I came across that supplied the information I was looking for:
* http://wiki.apache.org/nutch/IndexStructure
* https://issues.apache.org/jira/browse/NUTCH-940

Thanks again for such a prompt reply and the help.

Thanks,
Matt

On Mon, Sep 3, 2012 at 9:11 AM, Ferdy Galema <[email protected]> wrote:

> I'm not sure what the original purpose of the documentMeta is, but seeing
> as there already is a clearly defined 'fields' container for all fields that
> should be indexed, I guess it is just a place for storing some extra data
> about the fields or document that should be indexed. The ElasticWriter uses
> it only for the type, the SolrWriter does not use it at all. It looks like
> Nutch trunk does not use it either.
>
> In short, for now I would just use the 'fields' and ignore documentMeta.
>
> On Mon, Sep 3, 2012 at 2:38 PM, Matt MacDonald <[email protected]> wrote:
>
>> Hi Ferdy,
>>
>> It's likely that I'm confused about what to expect in the
>> ElasticSearch index. Reviewing both ElasticWriter.java and
>> NutchDocument.java I see that there are two properties that store data
>> about the document:
>>
>> private Map<String, List<String>> fields;
>> private Metadata documentMeta;
>>
>> Looking at Metadata.java it's likely that the fields that I was
>> expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
>> Last-Modified, etc.) would be contained in the documentMeta property.
>> Is there a reason that the write(NutchDocument) method in
>> ElasticWriter shouldn't also store documentMeta in ElasticSearch?
>>
>> Thanks,
>> Matt
>>
>> On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
>> > Hi,
>> >
>> > Do some of the fields that are missing in the index have any special
>> > characters, such as hyphen? I can imagine that those are not supported. (I
>> > have not tested this).
>> >
>> > Ferdy.
>> >
>> > On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm using the most recent Nutch 2.x to crawl a single site, storing
>> >> the results in HBase and then indexing for search with ElasticSearch.
>> >> My crawl and indexing complete as expected. Looking in HBase I see
>> >> metadata that I would expect for a record. Fields like:
>> >>
>> >> f:typ
>> >>     timestamp=1346408694547, value=text/html
>> >> h:Cache-Control
>> >>     timestamp=1346408694547, value=private
>> >> h:Connection
>> >>     timestamp=1346408694547, value=close
>> >> h:Content-Length
>> >>     timestamp=1346408694547, value=47166
>> >> h:Content-Type
>> >>     timestamp=1346408694547, value=text/html; charset=utf-8
>> >> h:Date
>> >>     timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
>> >> h:Server
>> >>     timestamp=1346408694547, value=Microsoft-IIS/6.0
>> >> h:Set-Cookie
>> >>     timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
>> >> h:X-AspNet-Version
>> >>     timestamp=1346408694547, value=2.0.50727
>> >> h:X-Powered-By
>> >>     timestamp=1346408694547, value=ASP.NET
>> >> h:p3p
>> >>     timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
>> >> il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
>> >>     timestamp=1346408808930, value=Printable Version
>> >> il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
>> >>     timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
>> >>
>> >> But after indexing with bin/nutch elasticindex and looking at the same
>> >> record in ElasticSearch I'm only seeing a subset of the fields that I
>> >> see in HBase.
>> >>
>> >> {
>> >>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
>> >>   site: "www.ci.watertown.ma.us",
>> >>   content: "Watertown, MA - ...",
>> >>   title: "Watertown, MA - Official Website",
>> >>   host: "www.ci.watertown.ma.us",
>> >>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
>> >>   boost: "0.0",
>> >>   tstamp: "2013-06-27T10:24:37.846Z",
>> >>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
>> >>   anchor: [
>> >>     "5.18.10 Board of Health May Minutes",
>> >>     "Printable Version"
>> >>   ]
>> >> }
>> >>
>> >> I will need to be able to search/query against fields like
>> >> Content-Type so I'm wondering if I'm missing a configuration setting
>> >> to store those fields in the search index or what else might be going
>> >> on that is preventing the fields that I'm seeing in HBase from showing
>> >> up in ElasticSearch.
>> >>
>> >> I'm very new to the Nutch codebase but I've looked in
>> >>
>> >> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
>> >> and didn't notice anything that would prevent all the fields from
>> >> getting into ElasticSearch.
>> >>
>> >> Thanks,
>> >> Matt
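
For anyone landing on this thread later, the fix described in the most recent message (at the top) comes down to a single property in nutch-site.xml. The fragment below is only a sketch of what that override might look like. The index-(basic|anchor|more|urlmeta) group is taken from that message; every other plugin named in the value is an assumption modelled on a typical Nutch 2.x default, so keep whatever your existing plugin.includes already lists and change only the index-(...) group.

```xml
<!-- nutch-site.xml: sketch of the plugin.includes override.
     Only the index-(basic|anchor|more|urlmeta) group comes from this thread;
     the remaining plugins are assumed and should match your own setup. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more|urlmeta)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

The value is a regular expression matched against plugin ids, which is why the indexing plugins can be grouped as index-(...). The documents presumably need to be indexed again (bin/nutch elasticindex, as used earlier in this thread) before the additional fields show up in ElasticSearch.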

