Hi David,
Answers inline

On Thu, Feb 8, 2018 at 9:19 AM, <[email protected]> wrote:

>
> From: David Ferrero <[email protected]>
> To: [email protected]
> Cc:
> Bcc:
> Date: Thu, 8 Feb 2018 10:19:52 -0700
> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
> Pull request #205 was recently merged into master branch for Nutch 1.x in
> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
>
> I am new to nutch and solr and have just started crawling and indexing a
> few select websites. Using the built in html parsing/indexing, I am getting
> searchable fields like url, content, host, sometimes a title, and a few
> other indexing related fields like digest, boost, segment, and tstamp. That
> said, I realized very quickly that I need better results. While exploring
> the source of the website, I noticed references to schema.org and get
> excited by what I see. That’s how I stumbled upon NUTCH-1129.
>
> I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 parser/indexer.
>

Excellent.


>
> Q: Now what?  How do I gain Any23 microdata parsing / indexing
> capabilities introduced by NUTCH-1129?
> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
> plugin.includes with something like parse-(html | tika |
> any23)|index-(basic | anchor | any23)
>

No, you keep the existing parse and index plugins and just add 'any23' to
the list of plugins within the plugin.includes property of nutch-site.xml.


> Q: How do I expose the discovered microdata structure / items to end-user
> such as Solr? For example, what are the microdata items and do I need to
> map them to Solr in solrindex-mapping.xml?
>

OK, so the current configuration for the Any23 plugin is to store extracted
structured data markup in the Nutch Metadata object under the key
"Any23-Triples". You can locate it using something like the ParserChecker
tool provided via the 'nutch' script. Likewise, you can also locate it, as a
representation of what would be indexed, by using the IndexerChecker
tooling also provided within the 'nutch' script.

As an example, here is how the data is now indexed (after crawling
https://smartive.ch/jobs):


          "structured_data": [
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"IE-edge,chrome=1\"@de",
              "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
              "short_key": "X-UA-Compatible"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
              "key": "<http://vocab.sindice.net/any23#description>",
              "short_key": "description"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width,initial-scale=1\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"ie=edge\"@de",
              "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
              "short_key": "x-ua-compatible"
            }
          ],


Note from the above that the 'key' field, which holds the predicate of each
triple, is very useful for quickly filtering through, for example, hotel
ratings or something similar.
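
To illustrate that kind of filtering, here is a minimal Python sketch. It
assumes you have already fetched the indexed document (for example from
Solr) and hold its "structured_data" entries as a list of dicts shaped like
the example above. The helper names values_by_short_key and filter_by_key
are hypothetical and are not part of Nutch or Any23.

```python
from collections import defaultdict

# Three entries copied from the indexed example above.
structured_data = [
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"width=device-width, initial-scale=1, shrink-to-fit=no"@de',
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"width=device-width,initial-scale=1"@de',
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"ie=edge"@de',
        "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
        "short_key": "x-ua-compatible",
    },
]


def values_by_short_key(entries):
    """Group the 'value' of each entry under its 'short_key' (hypothetical helper)."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["short_key"]].append(entry["value"])
    return dict(grouped)


def filter_by_key(entries, predicate):
    """Return only the entries whose full 'key' equals the given predicate (hypothetical helper)."""
    return [entry for entry in entries if entry["key"] == predicate]


grouped = values_by_short_key(structured_data)
viewport_entries = filter_by_key(
    structured_data, "<http://vocab.sindice.net/any23#viewport>"
)

for value in grouped["viewport"]:
    print(value)
```

Matching on the full 'key' is the safer choice when two vocabularies share
a local name. 'short_key' is more convenient for quick exploration, but
note that it is case sensitive: "X-UA-Compatible" and "x-ua-compatible"
appear as separate keys in the example above.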


>
> I’d also be interested to learn how to point at a specific URL and see how
> nutch sees the microdata (best case), then learn how to leverage this into
> nutch and finally into solr.
>
>
See the tooling for ParserChecker and IndexerChecker as explained above.
If you have any further questions, please let me know.
Lewis
