Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

lewis john mcgibbney Mon, 12 Feb 2018 07:59:18 -0800

Hi David,
The java.lang.NoClassDefFoundError issues can be resolved by including the
correct JAR artifacts.
We will have the issue properly resolved very soon, and I will let you
know when Any23 2.2 is released.
Lewis
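
To see which JAR (if any) actually ships a class named in a
NoClassDefFoundError, something like the following can help. It is only a
minimal sketch: the function name is made up for illustration, and the
directory you point it at (for example, wherever the any23 plugin's jars
live in your build) is an assumption about your local layout. Note that it
only helps with missing classes; a NoSuchMethodError such as the jsoup one
quoted below usually indicates a version mismatch rather than a missing JAR.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(lib_dir, class_name):
    """Return the file names of jars under lib_dir that contain class_name.

    class_name may use the slash form from the stack trace
    (org/apache/commons/rdf/api/IRI) or the dotted form.
    """
    # A class is stored in a jar as <package path>/<ClassName>.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that end in .jar but are not valid archives.
            continue
    return hits
```

Calling find_class_in_jars(some_dir, "org/apache/commons/rdf/api/IRI")
returns an empty list when none of the jars in that directory provide the
class, which is the situation the error message describes.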


On Sat, Feb 10, 2018 at 11:42 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: David Ferrero <david.ferr...@zion.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sat, 10 Feb 2018 12:41:57 -0700
> Subject: Re: NUTCH-1129, Any23, microdata parsing, indexing, and
> extraction?
> Awesome on Any23 2.2 forthcoming release. I look forward to it and
> subsequent bump to Nutch.
>
> In the meantime, I successfully built Any23 from master, copied the
> Any23 jars into Nutch (master), and referenced them in the plugin…
>     <library name="apache-any23-api-2.3-SNAPSHOT.jar"/>
>     <library name="apache-any23-core-2.3-SNAPSHOT.jar"/>
>     <library name="apache-any23-csvutils-2.3-SNAPSHOT.jar"/>
>     <library name="apache-any23-encoding-2.3-SNAPSHOT.jar"/>
>     <library name="apache-any23-mime-2.3-SNAPSHOT.jar"/>
>
> Unfortunately, when I reran the nutch parsechecker, parsing failed.
> A quick look at logs/hadoop.log reveals that the updated Any23
> depends on new classes in other jar files:
> Caused by: java.lang.NoClassDefFoundError: org/apache/commons/rdf/api/IRI
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
> org.semanticweb.owlapi.rio.OWLAPIRDFFormat
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:
> org.jsoup.select.NodeTraversor.traverse(Lorg/jsoup/select/NodeVisitor;Lorg/jsoup/nodes/Node;)V
>
> I guess I would need to rebuild Nutch from master (rather than just copying
> a few jar files) and ensure that Any23’s jar dependencies are also referenced.
>
> > On Feb 9, 2018, at 1:45 PM, Lewis John McGibbney <lewi...@apache.org>
> > wrote:
> >
> > Hi David,
> > We are in the process of releasing Any23 2.2, which will include the fix.
> > We can then come back to Nutch and make the upgrade, and you should be
> > all set.
> > Hopefully this will be achieved within around 72 hours. In the meantime,
> > you can clone, build and deploy Any23 master. This will do the trick.
> > Lewis
> >
> > On 2018/02/09 07:31:10, David Ferrero <david.ferr...@zion.com> wrote:
> >> Thank you for this information. Since this is very much related to
> >> Any23 and microdata parsing, I’m going to ask what I believe is a related
> >> question but keep this same thread so it will be organized in one place:
> >>
> >> I noticed a lot of job boards such as dice.com, monster.com, etc. use
> >> http://schema.org/JobPosting information; however, many seem to use
> >> <script type="application/ld+json">…</script> rather than RDF.
> >> In summer 2017, Google announced structured data guidance for jobs:
> >> https://developers.google.com/search/docs/data-types/job-posting
> >> and a testing tool to validate your HTML:
> >> https://search.google.com/structured-data/testing-tool
> >> I verified a few sample listings from the above-mentioned job boards with
> >> Google’s testing tool and they validate OK.
> >>
> >> So after looking at http://any23.apache.org/getting-started.html for the
> >> supported extractors, I see Any23 mentions it supports JSON-LD input, so I
> >> added this to nutch-site.xml to override the same property in
> >> nutch-default.xml:
> >>
> >> <property>
> >>    <name>any23.extractors</name>
> >>    <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
> >>    <description>Comma-separated list of Any23 extractors (a list of
> >>    extractors is available here:
> >>    http://any23.apache.org/getting-started.html)</description>
> >> </property>
> >>
> >> I expected to see additional information from nutch parsechecker after
> >> adding the jsonld extractors; however, I see NO changes to the
> >> Any23-Triples microdata parsed.
> >>
> >> What might I be doing wrong?
> >>
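
Independently of Nutch, it can be useful to confirm what JSON-LD a page
actually embeds before debugging the extractor configuration. The following
is a minimal sketch using only the Python standard library; the class and
function names are made up for illustration, and it makes no claim about how
Any23's own extractors work internally.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies arrive here as raw text.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                # Ignore blocks that are not valid JSON.
                pass


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD script block."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.items
```

If extract_jsonld returns JobPosting objects for a page but the parsechecker
output shows no corresponding triples, the problem is more likely in the
extractor setup than in the page markup.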
> >>> On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney <lewi...@apache.org>
> >>> wrote:
> >>>
> >>> Hi David,
> >>> Answers inline
> >>>
> >>> On Thu, Feb 8, 2018 at 9:19 AM, <user-digest-h...@nutch.apache.org>
> >>> wrote:
> >>>
> >>>>
> >>>> From: David Ferrero <david.ferr...@zion.com>
> >>>> To: user@nutch.apache.org
> >>>> Cc:
> >>>> Bcc:
> >>>> Date: Thu, 8 Feb 2018 10:19:52 -0700
> >>>> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and
> >>>> extraction?
> >>>> Pull request #205 was recently merged into the master branch for Nutch
> >>>> 1.x in fulfillment of NUTCH-1129 "microdata for Nutch 1.x".
> >>>>
> >>>> I am new to Nutch and Solr and have just started crawling and indexing
> >>>> a few select websites. Using the built-in HTML parsing/indexing, I am
> >>>> getting searchable fields like url, content, host, sometimes a title,
> >>>> and a few other indexing-related fields like digest, boost, segment, and
> >>>> tstamp. That said, I realized very quickly that I need better results.
> >>>> While exploring the source of the website, I noticed references to
> >>>> schema.org and got excited by what I saw. That’s how I stumbled upon
> >>>> NUTCH-1129.
> >>>>
> >>>> I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23
> >>>> parser/indexer.
> >>>>
> >>>
> >>> Excellent.
> >>>
> >>>
> >>>>
> >>>> Q: Now what?  How do I gain Any23 microdata parsing / indexing
> >>>> capabilities introduced by NUTCH-1129?
> >>>> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
> >>>> plugin.includes with something like parse-(html | tika |
> >>>> any23)|index-(basic | anchor | any23)?
> >>>>
> >>>
> >>> No, you just add 'any23' to the list of plugins within the
> >>> plugin.includes property of nutch-site.xml.
> >>>
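
The difference between the two forms can be checked with a short sketch.
It assumes that plugin.includes is a regular expression that must match a
plugin's id in full (the property description in nutch-default.xml calls it
a regular expression naming the plugins to include). Python's re module
stands in for Java's here, the value is the abbreviated one from the
question with its spaces removed, and the list of plugin ids is made up for
illustration.

```python
import re


def included_plugins(plugin_includes, plugin_ids):
    """Return the plugin ids that the plugin.includes expression matches in full."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


# Abbreviated value from the question, with 'any23' added as its own alternative.
value = "parse-(html|tika)|index-(basic|anchor)|any23"

# Hypothetical candidate ids; 'parse-any23' is what the nested form would name.
candidates = ["parse-html", "parse-tika", "index-basic",
              "index-anchor", "any23", "parse-any23"]

selected = included_plugins(value, candidates)
```

Under these assumptions, selected contains 'any23' but not 'parse-any23':
the plugin id is plain 'any23', so it belongs at the top level of the
expression rather than inside the parse-(...) or index-(...) groups.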
> >>>
> >>>> Q: How do I expose the discovered microdata structure / items to an
> >>>> end-user system such as Solr? For example, what are the microdata items
> >>>> and do I need to map them to Solr in solrindex-mapping.xml?
> >>>>
> >>>
> >>> OK, so the current configuration for the Any23 plugin is to store
> >>> extracted structured data markup in the Nutch Metadata object under the
> >>> key "Any23-Triples". You can locate it using something like the
> >>> ParserChecker tool provided via the 'nutch' script. Likewise, you can
> >>> also locate it, as a representation of what would be indexed, by using
> >>> the IndexerChecker tooling also provided within the 'nutch' script.
> >>>
> >>> As an example, the data is indexed as follows (after crawling
> >>> https://smartive.ch/jobs):
> >>>
> >>>
> >>>         "structured_data": [
> >>>           {
> >>>             "node": "<https://smartive.ch/jobs>",
> >>>             "value": "\"IE-edge,chrome=1\"@de",
> >>>             "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
> >>>             "short_key": "X-UA-Compatible"
> >>>           },
> >>>           {
> >>>             "node": "<https://smartive.ch/jobs>",
> >>>             "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
> >>>             "key": "<http://vocab.sindice.net/any23#description>",
> >>>             "short_key": "description"
> >>>           },
> >>>           {
> >>>             "node": "<https://smartive.ch/jobs>",
> >>>             "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
> >>>             "key": "<http://vocab.sindice.net/any23#viewport>",
> >>>             "short_key": "viewport"
> >>>           },
> >>>           {
> >>>             "node": "<https://smartive.ch/jobs>",
> >>>             "value": "\"width=device-width,initial-scale=1\"@de",
> >>>             "key": "<http://vocab.sindice.net/any23#viewport>",
> >>>             "short_key": "viewport"
> >>>           },
> >>>           {
> >>>             "node": "<https://smartive.ch/jobs>",
> >>>             "value": "\"ie=edge\"@de",
> >>>             "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
> >>>             "short_key": "x-ua-compatible"
> >>>           }
> >>>         ],
> >>>
> >>>
> >>> Note from the above that the "key" field (the predicate) is very useful
> >>> for quickly filtering through, for example, hotel ratings or something
> >>> similar.
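
As a rough illustration of that kind of filtering, here is a minimal sketch
that pulls the values for one short_key out of a document shaped like the
structured_data output above. The function name is made up, and the sample
document is a trimmed copy of two entries from that output; how the field
is actually queried in Solr is outside what this sketch covers.

```python
def values_for_short_key(doc, short_key):
    """Return the 'value' of every structured_data entry with the given short_key."""
    return [
        entry["value"]
        for entry in doc.get("structured_data", [])
        if entry.get("short_key") == short_key
    ]


# Trimmed sample taken from the indexed output shown above.
doc = {
    "structured_data": [
        {
            "node": "<https://smartive.ch/jobs>",
            "value": '"width=device-width,initial-scale=1"@de',
            "key": "<http://vocab.sindice.net/any23#viewport>",
            "short_key": "viewport",
        },
        {
            "node": "<https://smartive.ch/jobs>",
            "value": '"ie=edge"@de',
            "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
            "short_key": "x-ua-compatible",
        },
    ]
}

viewport_values = values_for_short_key(doc, "viewport")
```

The same approach works with the full "key" IRI instead of "short_key" when
two vocabularies happen to share a short name.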
> >>>
> >>>
> >>>>
> >>>> I’d also be interested to learn how to point at a specific URL and see
> >>>> how Nutch sees the microdata (best case), then learn how to leverage
> >>>> this in Nutch and finally in Solr.
> >>>>
> >>>>
> >>> See the tooling for ParserChecker and IndexerChecker as explained above.
> >>> If you have any further questions, please let me know.
> >>> Lewis
> >>
> >>
>
>
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
