Hi Ferdy,

It's likely that I'm confused about what to expect in the
ElasticSearch index. Reviewing both ElasticWriter.java and
NutchDocument.java, I see that NutchDocument has two properties that
store data about the document:

private Map<String, List<String>> fields;
private Metadata documentMeta;

Looking at Metadata.java, it's likely that the fields that I was
expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
Last-Modified, etc.) would be contained in the documentMeta property.
Is there a reason that the write(NutchDocument) method in
ElasticWriter shouldn't also store documentMeta in ElasticSearch?
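
To make the question concrete, here is a rough sketch of the kind of
merge I have in mind. It is not the actual ElasticWriter code: plain
Maps stand in for NutchDocument and Metadata, and the class name
MetaMergeSketch, the method buildSource, and the "meta_" prefix are
all hypothetical. Replacing hyphens with underscores is only a
precaution in case hyphenated field names turn out to be a problem; I
have not verified that they are.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; plain Maps stand in for NutchDocument and Metadata.
class MetaMergeSketch {

    // Builds the map that would be sent to the index as the document source.
    static Map<String, Object> buildSource(Map<String, List<String>> fields,
                                           Map<String, List<String>> documentMeta) {
        Map<String, Object> source = new LinkedHashMap<>();

        // Regular document fields are copied under their own names.
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            source.put(e.getKey(), collapse(e.getValue()));
        }

        // Metadata entries (e.g. HTTP headers) get an assumed "meta_" prefix,
        // lower-cased names, and underscores instead of hyphens.
        for (Map.Entry<String, List<String>> e : documentMeta.entrySet()) {
            String key = "meta_"
                    + e.getKey().toLowerCase(Locale.ROOT).replace('-', '_');
            source.put(key, collapse(e.getValue()));
        }
        return source;
    }

    // A single value is stored as a plain string, anything else as a list.
    static Object collapse(List<String> values) {
        return values.size() == 1 ? values.get(0) : new ArrayList<>(values);
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title", List.of("Watertown, MA - Official Website"));
        fields.put("anchor", List.of("5.18.10 Board of Health May Minutes",
                                     "Printable Version"));

        Map<String, List<String>> meta = new LinkedHashMap<>();
        meta.put("Content-Type", List.of("text/html; charset=utf-8"));

        System.out.println(buildSource(fields, meta));
    }
}
```

With something like this, Content-Type would be indexed as
meta_content_type and could be queried like any other field.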

Thanks,
Matt

On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <[email protected]> wrote:
> Hi,
>
> Do some of the fields that are missing in the index have any special
> characters, such as hyphen? I can imagine that those are not supported. (I
> have not tested this).
>
> Ferdy.
>
> On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:
>
>> Hi,
>>
>> I'm using the most recent Nutch 2.x to crawl a single site, storing
>> the results in HBase and then indexing for search with ElasticSearch.
>> My crawl and indexing complete as expected. Looking in HBase I see
>> metadata that I would expect for a record. Fields like:
>>
>>  f:typ
>>                      timestamp=1346408694547, value=text/html
>>  h:Cache-Control
>>                      timestamp=1346408694547, value=private
>>  h:Connection
>>                      timestamp=1346408694547, value=close
>>  h:Content-Length
>>                      timestamp=1346408694547, value=47166
>>  h:Content-Type
>>                      timestamp=1346408694547, value=text/html;
>> charset=utf-8
>>  h:Date
>>                      timestamp=1346408694547, value=Fri, 31 Aug 2012
>> 10:24:37 GMT
>>  h:Server
>>                      timestamp=1346408694547, value=Microsoft-IIS/6.0
>>  h:Set-Cookie
>>                      timestamp=1346408694547,
>> value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
>>  h:X-AspNet-Version
>>                      timestamp=1346408694547, value=2.0.50727
>>  h:X-Powered-By
>>                      timestamp=1346408694547, value=ASP.NET
>>  h:p3p
>>                      timestamp=1346408694547, value=CP="IDC DSP COR
>> ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
>>  il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
>>                      timestamp=1346408808930, value=Printable Version
>>  il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
>>                      timestamp=1346408662165, value=5.18.10 Board of
>> Health May Minutes
>>
>> But after indexing with bin/nutch elasticindex and looking at the same
>> record in ElasticSearch I'm only seeing a subset of the fields that I
>> see in HBase.
>>
>> {
>>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
>>   site: "www.ci.watertown.ma.us",
>>   content: "Watertown, MA - ...",
>>   title: "Watertown, MA - Official Website",
>>   host: "www.ci.watertown.ma.us",
>>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
>>   boost: "0.0",
>>   tstamp: "2013-06-27T10:24:37.846Z",
>>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
>>   anchor: [
>>     "5.18.10 Board of Health May Minutes",
>>     "Printable Version"
>>   ]
>> }
>>
>> I will need to be able to search/query against fields like
>> Content-Type so I'm wondering if I'm missing a configuration setting
>> to store those fields in the search index or what else might be going
>> on that is preventing the fields that I'm seeing in HBase from showing
>> up in ElasticSearch.
>>
>> I'm very new to the Nutch codebase but I've looked in
>>
>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
>> and didn't notice anything that would prevent all the fields from
>> getting into ElasticSearch.
>>
>> Thanks,
>> Matt
>>
