Hi,

Do any of the fields that are missing from the index contain special
characters, such as hyphens? I can imagine that those are not supported,
although I have not tested this.
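
One quick way to narrow this down is to list which of the header names from
your HBase dump contain such characters, and compare that list with what is
missing from the index. Below is a minimal sketch in plain Python (nothing
Nutch-specific). The names HEADER_FIELDS and fields_with_special_chars are
made up for illustration, and treating "special" as anything other than
letters, digits and underscore is my assumption. It only shows which fields
to look at; it does not prove that ElasticSearch or Nutch rejects them.

```python
import re

# Header names copied from the "h:" columns in the HBase listing quoted below.
HEADER_FIELDS = [
    "Cache-Control", "Connection", "Content-Length", "Content-Type",
    "Date", "Server", "Set-Cookie", "X-AspNet-Version", "X-Powered-By", "p3p",
]

# Assumption: "special" means any character that is not an ASCII letter,
# digit, or underscore. Adjust the pattern if a different rule applies.
SPECIAL = re.compile(r"[^A-Za-z0-9_]")


def fields_with_special_chars(names):
    """Return the names that contain at least one special character."""
    return [name for name in names if SPECIAL.search(name)]


suspects = fields_with_special_chars(HEADER_FIELDS)
print(suspects)
```

If the hyphenated names were the only ones missing, that would point at the
field names. In your example the unhyphenated headers (Connection, Date,
Server) are missing as well, so it may be worth checking whether the headers
are passed to the indexer at all.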

Ferdy.

On Sun, Sep 2, 2012 at 4:16 PM, Matt MacDonald <[email protected]> wrote:

> Hi,
>
> I'm using the most recent Nutch 2.x to crawl a single site, storing
> the results in HBase and then indexing for search with ElasticSearch.
> My crawl and indexing complete as expected. Looking in HBase I see
> metadata that I would expect for a record. Fields like:
>
>  f:typ
>                      timestamp=1346408694547, value=text/html
>  h:Cache-Control
>                      timestamp=1346408694547, value=private
>  h:Connection
>                      timestamp=1346408694547, value=close
>  h:Content-Length
>                      timestamp=1346408694547, value=47166
>  h:Content-Type
>                      timestamp=1346408694547, value=text/html; charset=utf-8
>  h:Date
>                      timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
>  h:Server
>                      timestamp=1346408694547, value=Microsoft-IIS/6.0
>  h:Set-Cookie
>                      timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
>  h:X-AspNet-Version
>                      timestamp=1346408694547, value=2.0.50727
>  h:X-Powered-By
>                      timestamp=1346408694547, value=ASP.NET
>  h:p3p
>                      timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
>  il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
>                      timestamp=1346408808930, value=Printable Version
>  il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
>                      timestamp=1346408662165, value=5.18.10 Board of Health May Minutes
>
> But after indexing with bin/nutch elasticindex and looking at the same
> record in ElasticSearch I'm only seeing a subset of the fields that I
> see in HBase.
>
> {
>   id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
>   site: "www.ci.watertown.ma.us",
>   content: "Watertown, MA - ...",
>   title: "Watertown, MA - Official Website",
>   host: "www.ci.watertown.ma.us",
>   digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
>   boost: "0.0",
>   tstamp: "2013-06-27T10:24:37.846Z",
>   url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
>   anchor: [
>     "5.18.10 Board of Health May Minutes",
>     "Printable Version"
>   ]
> }
>
> I will need to be able to search/query against fields like
> Content-Type so I'm wondering if I'm missing a configuration setting
> to store those fields in the search index or what else might be going
> on that is preventing the fields that I'm seeing in HBase from showing
> up in ElasticSearch.
>
> I'm very new to the Nutch codebase but I've looked in
>
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
> and didn't notice anything that would prevent all the fields from
> getting into ElasticSearch.
>
> Thanks,
> Matt
>
