I'm not sure what the original purpose of documentMeta is, but since there
is already a clearly defined 'fields' container for all fields that should
be indexed, I guess it is just a place for storing some extra data about
the fields or the document being indexed. The ElasticWriter uses it only
for the type; the SolrWriter does not use it at all. It looks like Nutch
trunk does not use it either.

In short, for now I would just use the 'fields' and ignore documentMeta.
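
To illustrate the idea of putting the extra values into 'fields' yourself, here is a minimal sketch. It is not Nutch code: plain java.util maps stand in for the 'fields' container and for the HTTP headers stored with a page, and the class name, method names and the "header_" prefix are all hypothetical. Replacing hyphens with underscores is only a precaution based on my untested guess that hyphens in field names might cause trouble.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: plain maps stand in for
// NutchDocument's 'fields' and for the HTTP headers stored with a page.
final class HeaderFields {

  // Turns "Content-Type" into "header_content_type". The prefix and the
  // underscore replacement are assumptions made for this sketch.
  static String fieldName(String headerName) {
    return "header_"
        + headerName.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
  }

  // Adds every header as an extra value in 'fields', keeping any values
  // that are already stored under the same field name.
  static void addHeaders(Map<String, String> headers,
                         Map<String, List<String>> fields) {
    for (Map.Entry<String, String> e : headers.entrySet()) {
      fields.computeIfAbsent(fieldName(e.getKey()), k -> new ArrayList<>())
          .add(e.getValue());
    }
  }
}
```

In a real setup something like this would more likely live in an indexing filter that adds the values to the document before the writer sees it, but where exactly to hook it in is something you would have to check against the 2.x code.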

On Mon, Sep 3, 2012 at 2:38 PM, Matt MacDonald <[email protected]> wrote:

> Hi Ferdy,
>
> It's likely that I'm confused about what to expect in the
> ElasticSearch index. Reviewing both ElasticWriter.java and
> NutchDocument.java I see that there are two properties that store data
> about the document:
>
> private Map<String, List<String>> fields;
> private Metadata documentMeta;
>
> Looking at Metadata.java it's likely that the fields that I was
> expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
> Last-Modified, etc.) would be contained in the documentMeta property.
> Is there a reason that the write(NutchDocument) method in
> ElasticWriter shouldn't also store documentMeta in ElasticSearch?
>
> Thanks,
> Matt
>
> On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
> > Hi,
> >
> > Do some of the fields that are missing in the index have any special
> > characters, such as hyphen? I can imagine that those are not supported.
> > (I have not tested this).
> >
> > Ferdy.
> >
> > On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm using the most recent Nutch 2.x to crawl a single site, storing
> >> the results in HBase and then indexing for search with ElasticSearch.
> >> My crawl and indexing complete as expected. Looking in HBase I see
> >> metadata that I would expect for a record. Fields like:
> >>
> >>  f:typ
> >>                      timestamp=1346408694547, value=text/html
> >>  h:Cache-Control
> >>                      timestamp=1346408694547, value=private
> >>  h:Connection
> >>                      timestamp=1346408694547, value=close
> >>  h:Content-Length
> >>                      timestamp=1346408694547, value=47166
> >>  h:Content-Type
> >>                      timestamp=1346408694547, value=text/html;
> >> charset=utf-8
> >>  h:Date
> >>                      timestamp=1346408694547, value=Fri, 31 Aug 2012
> >> 10:24:37 GMT
> >>  h:Server
> >>                      timestamp=1346408694547, value=Microsoft-IIS/6.0
> >>  h:Set-Cookie
> >>                      timestamp=1346408694547,
> >> value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
> >>  h:X-AspNet-Version
> >>                      timestamp=1346408694547, value=2.0.50727
> >>  h:X-Powered-By
> >>                      timestamp=1346408694547, value=ASP.NET
> >>  h:p3p
> >>                      timestamp=1346408694547, value=CP="IDC DSP COR
> >> ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
> >>  il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
> >>                      timestamp=1346408808930, value=Printable Version
> >>  il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
> >>                      timestamp=1346408662165, value=5.18.10 Board of
> >> Health May Minutes
> >>
> >> But after indexing with bin/nutch elasticindex and looking at the same
> >> record in ElasticSearch I'm only seeing a subset of the fields that I
> >> see in HBase.
> >>
> >> {
> >>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
> >>   site: "www.ci.watertown.ma.us",
> >>   content: "Watertown, MA - ...",
> >>   title: "Watertown, MA - Official Website",
> >>   host: "www.ci.watertown.ma.us",
> >>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
> >>   boost: "0.0",
> >>   tstamp: "2013-06-27T10:24:37.846Z",
> >>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
> >>   anchor: [
> >>     "5.18.10 Board of Health May Minutes",
> >>     "Printable Version"
> >>   ]
> >> }
> >>
> >> I will need to be able to search/query against fields like
> >> Content-Type so I'm wondering if I'm missing a configuration setting
> >> to store those fields in the search index or what else might be going
> >> on that is preventing the fields that I'm seeing in HBase from showing
> >> up in ElasticSearch.
> >>
> >> I'm very new to the Nutch codebase but I've looked in
> >> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
> >> and didn't notice anything that would prevent all the fields from
> >> getting into ElasticSearch.
> >>
> >> Thanks,
> >> Matt
> >>
>
