Thanks for updating the list.

On Tue, Sep 4, 2012 at 2:52 PM, Matt MacDonald <[email protected]> wrote:

> Hi Ferdy,
>
> Thanks for the additional information. I found out what was missing in
> my configuration. I updated the nutch-site.xml plugin.includes section
> to use index-(basic|anchor|more|urlmeta) and I'm now seeing the fields
> that I was anticipating in the ElasticSearch index. These were the
> relevant urls that I came across that supplied the information that I
> was looking for:
>
> * http://wiki.apache.org/nutch/IndexStructure
> * https://issues.apache.org/jira/browse/NUTCH-940
>
> Thanks again for such a prompt reply and the help.
>
> Thanks,
> Matt
>
> On Mon, Sep 3, 2012 at 9:11 AM, Ferdy Galema <[email protected]>
> wrote:
> > I'm not sure what the original purpose of documentMeta is, but seeing
> > as there is already a clearly defined 'fields' container for all fields
> > that should be indexed, I guess it is just a place for storing some
> > extra data about the fields or document that should be indexed. The
> > ElasticWriter uses it only for the type; the SolrWriter does not use it
> > at all. It looks like Nutch trunk does not use it either.
> >
> > In short, for now I would just use the 'fields' and ignore documentMeta.
> >
> > On Mon, Sep 3, 2012 at 2:38 PM, Matt MacDonald <[email protected]> wrote:
> >
> >> Hi Ferdy,
> >>
> >> It's likely that I'm confused about what to expect in the
> >> ElasticSearch index. Reviewing both ElasticWriter.java and
> >> NutchDocument.java I see that there are two properties that store data
> >> about the document:
> >>
> >> private Map<String, List<String>> fields;
> >> private Metadata documentMeta;
> >>
> >> Looking at Metadata.java it's likely that the fields that I was
> >> expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
> >> Last-Modified, etc.) would be contained in the documentMeta property.
> >> Is there a reason that the write(NutchDocument) method in
> >> ElasticWriter shouldn't also store documentMeta in ElasticSearch?
> >>
> >> Thanks,
> >> Matt
> >>
> >> On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > Do some of the fields that are missing in the index have any special
> >> > characters, such as a hyphen? I can imagine that those are not
> >> > supported. (I have not tested this.)
> >> >
> >> > Ferdy.
> >> >
> >> > On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm using the most recent Nutch 2.x to crawl a single site, storing
> >> >> the results in HBase and then indexing for search with ElasticSearch.
> >> >> My crawl and indexing complete as expected. Looking in HBase I see
> >> >> metadata that I would expect for a record. Fields like:
> >> >>
> >> >>  f:typ
> >> >>                      timestamp=1346408694547, value=text/html
> >> >>  h:Cache-Control
> >> >>                      timestamp=1346408694547, value=private
> >> >>  h:Connection
> >> >>                      timestamp=1346408694547, value=close
> >> >>  h:Content-Length
> >> >>                      timestamp=1346408694547, value=47166
> >> >>  h:Content-Type
> >> >>                      timestamp=1346408694547, value=text/html; charset=utf-8
> >> >>  h:Date
> >> >>                      timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
> >> >>  h:Server
> >> >>                      timestamp=1346408694547, value=Microsoft-IIS/6.0
> >> >>  h:Set-Cookie
> >> >>                      timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
> >> >>  h:X-AspNet-Version
> >> >>                      timestamp=1346408694547, value=2.0.50727
> >> >>  h:X-Powered-By
> >> >>                      timestamp=1346408694547, value=ASP.NET
> >> >>  h:p3p
> >> >>                      timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
> >> >>  il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
> >> >>                      timestamp=1346408808930, value=Printable Version
> >> >>  il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
> >> >>                      timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
> >> >>
> >> >> But after indexing with bin/nutch elasticindex and looking at the
> >> >> same record in ElasticSearch I'm only seeing a subset of the fields
> >> >> that I see in HBase.
> >> >>
> >> >> {
> >> >>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
> >> >>   site: "www.ci.watertown.ma.us",
> >> >>   content: "Watertown, MA - ...",
> >> >>   title: "Watertown, MA - Official Website",
> >> >>   host: "www.ci.watertown.ma.us",
> >> >>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
> >> >>   boost: "0.0",
> >> >>   tstamp: "2013-06-27T10:24:37.846Z",
> >> >>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
> >> >>   anchor: [
> >> >>     "5.18.10 Board of Health May Minutes",
> >> >>     "Printable Version"
> >> >>   ]
> >> >> }
> >> >>
> >> >> I will need to be able to search/query against fields like
> >> >> Content-Type so I'm wondering if I'm missing a configuration setting
> >> >> to store those fields in the search index or what else might be going
> >> >> on that is preventing the fields that I'm seeing in HBase from
> >> >> showing up in ElasticSearch.
> >> >>
> >> >> I'm very new to the Nutch codebase but I've looked in
> >> >> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
> >> >> and didn't notice anything that would prevent all the fields from
> >> >> getting into ElasticSearch.
> >> >>
> >> >> Thanks,
> >> >> Matt
> >> >>
> >>
>
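
A note on the fix reported at the top of this thread: the value
index-(basic|anchor|more|urlmeta) reads as a regular-expression fragment.
The sketch below only illustrates which plugin ids such a fragment would
match. It assumes that Nutch treats plugin.includes as a regex that must
match a whole plugin id; that behaviour is not shown in the thread. The
class name PluginIncludesSketch, the helper isIncluded, and the sample
ids are made up for illustration, and a real plugin.includes value would
normally list other plugin families (protocol, parse, and so on) as well.

```java
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // Only the indexing fragment quoted in this thread, not a complete
    // plugin.includes value.
    static final Pattern INCLUDES =
            Pattern.compile("index-(basic|anchor|more|urlmeta)");

    // Assumption: an id is included only when the whole id matches.
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample ids chosen for illustration only.
        for (String id : List.of("index-basic", "index-more",
                                 "index-static", "parse-html")) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption, an indexing filter whose id falls outside the
alternation is never loaded, so the fields it would contribute never
reach the 'fields' container that the index writer sends on.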
