Hi Ferdy,

Thanks for the additional information. I found out what was missing in
my configuration. I updated the nutch-site.xml plugin.includes section
to use index-(basic|anchor|more|urlmeta) and I'm now seeing the fields
that I was anticipating in the ElasticSearch index. These URLs
supplied the information I was looking for:

* http://wiki.apache.org/nutch/IndexStructure
* https://issues.apache.org/jira/browse/NUTCH-940
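
For reference, the change amounts to something like the following in
nutch-site.xml. This is only a sketch: the index-(basic|anchor|more|urlmeta)
part is what I actually changed, while the other plugin names are an
assumption standing in for whatever your plugin.includes already lists.

```xml
<!-- nutch-site.xml (sketch). Only the index-(...) group reflects the change
     described above; the remaining plugin names are assumed and should be
     kept as they already appear in your own configuration. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more|urlmeta)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

After changing the property, the records need to be re-indexed before the
extra fields show up in ElasticSearch.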

Thanks again for such a prompt reply and the help.

Thanks,
Matt

On Mon, Sep 3, 2012 at 9:11 AM, Ferdy Galema <[email protected]> wrote:
> I'm not sure what the original purpose of documentMeta is, but seeing
> as there is already a clearly defined 'fields' container for all fields that
> should be indexed, I guess it is just a place for storing some extra data
> about the fields or document that should be indexed. The ElasticWriter uses
> it only for the type; the SolrWriter does not use it at all. It looks like
> Nutch trunk does not use it either.
>
> In short, for now I would just use the 'fields' and ignore documentMeta.
>
> On Mon, Sep 3, 2012 at 2:38 PM, Matt MacDonald <[email protected]> wrote:
>
>> Hi Ferdy,
>>
>> It's likely that I'm confused about what to expect in the
>> ElasticSearch index. Reviewing both ElasticWriter.java and
>> NutchDocument.java I see that there are two properties that store data
>> about the document:
>>
>> private Map<String, List<String>> fields;
>> private Metadata documentMeta;
>>
>> Looking at Metadata.java it's likely that the fields that I was
>> expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
>> Last-Modified, etc.) would be contained in the documentMeta property.
>> Is there a reason that the write(NutchDocument) method in
>> ElasticWriter shouldn't also store documentMeta in ElasticSearch?
>>
>> Thanks,
>> Matt
>>
>> On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
>> > Hi,
>> >
>> > Do some of the fields that are missing in the index have any special
>> > characters, such as a hyphen? I can imagine that those are not supported.
>> > (I have not tested this.)
>> >
>> > Ferdy.
>> >
>> > On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm using the most recent Nutch 2.x to crawl a single site, storing
>> >> the results in HBase and then indexing for search with ElasticSearch.
>> >> My crawl and indexing complete as expected. Looking in HBase I see
>> >> metadata that I would expect for a record. Fields like:
>> >>
>> >>  f:typ
>> >>                      timestamp=1346408694547, value=text/html
>> >>  h:Cache-Control
>> >>                      timestamp=1346408694547, value=private
>> >>  h:Connection
>> >>                      timestamp=1346408694547, value=close
>> >>  h:Content-Length
>> >>                      timestamp=1346408694547, value=47166
>> >>  h:Content-Type
>> >>                      timestamp=1346408694547, value=text/html; charset=utf-8
>> >>  h:Date
>> >>                      timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
>> >>  h:Server
>> >>                      timestamp=1346408694547, value=Microsoft-IIS/6.0
>> >>  h:Set-Cookie
>> >>                      timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
>> >>  h:X-AspNet-Version
>> >>                      timestamp=1346408694547, value=2.0.50727
>> >>  h:X-Powered-By
>> >>                      timestamp=1346408694547, value=ASP.NET
>> >>  h:p3p
>> >>                      timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
>> >>  il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
>> >>                      timestamp=1346408808930, value=Printable Version
>> >>  il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
>> >>                      timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
>> >>
>> >> But after indexing with bin/nutch elasticindex and looking at the same
>> >> record in ElasticSearch I'm only seeing a subset of the fields that I
>> >> see in HBase.
>> >>
>> >> {
>> >>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
>> >>   site: "www.ci.watertown.ma.us",
>> >>   content: "Watertown, MA - ...",
>> >>   title: "Watertown, MA - Official Website",
>> >>   host: "www.ci.watertown.ma.us",
>> >>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
>> >>   boost: "0.0",
>> >>   tstamp: "2013-06-27T10:24:37.846Z",
>> >>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
>> >>   anchor: [
>> >>     "5.18.10 Board of Health May Minutes",
>> >>     "Printable Version"
>> >>   ]
>> >> }
>> >>
>> >> I will need to be able to search/query against fields like
>> >> Content-Type so I'm wondering if I'm missing a configuration setting
>> >> to store those fields in the search index or what else might be going
>> >> on that is preventing the fields that I'm seeing in HBase from showing
>> >> up in ElasticSearch.
>> >>
>> >> I'm very new to the Nutch codebase but I've looked in
>> >> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
>> >> and didn't notice anything that would prevent all the fields from
>> >> getting into ElasticSearch.
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>>
