Thanks for updating the list.

On Tue, Sep 4, 2012 at 2:52 PM, Matt MacDonald <[email protected]> wrote:
> Hi Ferdy,
>
> Thanks for the additional information. I found out what was missing in
> my configuration. I updated the nutch-site.xml plugin.includes section
> to use index-(basic|anchor|more|urlmeta) and I'm now seeing the fields
> that I was anticipating in the ElasticSearch index. These were the
> relevant urls that I came across that supplied the information that I
> was looking for:
>
> * http://wiki.apache.org/nutch/IndexStructure
> * https://issues.apache.org/jira/browse/NUTCH-940
>
> Thanks again for such a prompt reply and the help.
>
> Thanks,
> Matt
>
> On Mon, Sep 3, 2012 at 9:11 AM, Ferdy Galema <[email protected]> wrote:
> > I'm not sure what the original purpose of the documentMeta is, but seeing
> > as there already is clearly defined 'fields' container for all fields that
> > should be indexed, I guess it is just a place for storing some extra data
> > about the fields or document that should be indexed. The Elasticwriter uses
> > it only for the type, the Solrwriter does not use it at all. It looks like
> > Nutch trunk does not use it either.
> >
> > In short, for now I would just use the 'fields' and ignore documentMeta.
> >
> > On Mon, Sep 3, 2012 at 2:38 PM, Matt MacDonald <[email protected]> wrote:
> >
> >> Hi Ferdy,
> >>
> >> It's likely that I'm confused about what to expect in the
> >> ElasticSearch index. Reviewing both ElasticWriter.java and
> >> NutchDocument.java I see that there are two properties that store data
> >> about the document:
> >>
> >> private Map<String, List<String>> fields;
> >> private Metadata documentMeta;
> >>
> >> Looking at Metadata.java it's likely that the fields that I was
> >> expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
> >> Last-Modified, etc.) would be contained in the documentMeta property.
> >> Is there a reason that the write(NutchDocument) method in
> >> ElasticWriter shouldn't also store documentMeta in ElasticSearch?
> >>
> >> Thanks,
> >> Matt
> >>
> >> On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > Do some of the fields that are missing in the index have any special
> >> > characters, such as hyphen? I can imagine that those are not supported.
> >> > (I have not tested this).
> >> >
> >> > Ferdy.
> >> >
> >> > On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm using the most recent Nutch 2.x to crawl a single site, storing
> >> >> the results in HBase and then indexing for search with ElasticSearch.
> >> >> My crawl and indexing complete as expected. Looking in HBase I see
> >> >> metadata that I would expect for a record. Fields like:
> >> >>
> >> >> f:typ
> >> >>     timestamp=1346408694547, value=text/html
> >> >> h:Cache-Control
> >> >>     timestamp=1346408694547, value=private
> >> >> h:Connection
> >> >>     timestamp=1346408694547, value=close
> >> >> h:Content-Length
> >> >>     timestamp=1346408694547, value=47166
> >> >> h:Content-Type
> >> >>     timestamp=1346408694547, value=text/html; charset=utf-8
> >> >> h:Date
> >> >>     timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
> >> >> h:Server
> >> >>     timestamp=1346408694547, value=Microsoft-IIS/6.0
> >> >> h:Set-Cookie
> >> >>     timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
> >> >> h:X-AspNet-Version
> >> >>     timestamp=1346408694547, value=2.0.50727
> >> >> h:X-Powered-By
> >> >>     timestamp=1346408694547, value=ASP.NET
> >> >> h:p3p
> >> >>     timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
> >> >> il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
> >> >>     timestamp=1346408808930, value=Printable Version
> >> >> il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
> >> >>     timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
> >> >>
> >> >> But after indexing with bin/nutch elasticindex and looking at the same
> >> >> record in ElasticSearch I'm only seeing a subset of the fields that I
> >> >> see in HBase.
> >> >>
> >> >> {
> >> >>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
> >> >>   site: "www.ci.watertown.ma.us",
> >> >>   content: "Watertown, MA - ...",
> >> >>   title: "Watertown, MA - Official Website",
> >> >>   host: "www.ci.watertown.ma.us",
> >> >>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
> >> >>   boost: "0.0",
> >> >>   tstamp: "2013-06-27T10:24:37.846Z",
> >> >>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
> >> >>   anchor: [
> >> >>     "5.18.10 Board of Health May Minutes",
> >> >>     "Printable Version"
> >> >>   ]
> >> >> }
> >> >>
> >> >> I will need to be able to search/query against fields like
> >> >> Content-Type so I'm wondering if I'm missing a configuration setting
> >> >> to store those fields in the search index or what else might be going
> >> >> on that is preventing the fields that I'm seeing in HBase from showing
> >> >> up in ElasticSearch.
> >> >>
> >> >> I'm very new to the Nutch codebase but I've looked in
> >> >>
> >> >> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
> >> >>
> >> >> and didn't notice anything that would prevent all the fields from
> >> >> getting into ElasticSearch.
> >> >>
> >> >> Thanks,
> >> >> Matt
> >> >>
> >> >
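
For anyone finding this thread in the archive: the plugin.includes change Matt describes at the top might look roughly like the fragment below in nutch-site.xml. This is only a sketch. The index-(basic|anchor|more|urlmeta) group is the part taken from the thread; every other plugin name in the value is an assumption based on a typical Nutch default and should be copied from your own nutch-default.xml rather than from here.

```xml
<!-- Sketch only: place inside the <configuration> element of nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Only index-(basic|anchor|more|urlmeta) comes from the thread.
       The remaining plugin names are assumed defaults; keep whatever
       your existing plugin.includes value already lists. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more|urlmeta)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

Documents that were indexed before the change will most likely need to be indexed again (bin/nutch elasticindex in this thread) before the additional fields show up in ElasticSearch.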

