Awesome news on the forthcoming Any23 2.2 release. I look forward to it and the
subsequent bump in Nutch.
In the meantime, I successfully built Any23 from master, copied the
Any23 jars into Nutch (master), and referenced them in the plugin…
<library name="apache-any23-api-2.3-SNAPSHOT.jar"/>
<library name="apache-any23-core-2.3-SNAPSHOT.jar"/>
<library name="apache-any23-csvutils-2.3-SNAPSHOT.jar"/>
<library name="apache-any23-encoding-2.3-SNAPSHOT.jar"/>
<library name="apache-any23-mime-2.3-SNAPSHOT.jar"/>
Unfortunately, when I reran the nutch parsechecker it no longer parsed. A
quick look at logs/hadoop.log reveals that the updated Any23 depends on
classes in other jar files:
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/rdf/api/IRI
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.semanticweb.owlapi.rio.OWLAPIRDFFormat
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.jsoup.select.NodeTraversor.traverse(Lorg/jsoup/select/NodeVisitor;Lorg/jsoup/nodes/Node;)V
I guess I would need to rebuild Nutch from master (rather than just copying a few
jar files) and ensure that Any23's jar dependencies are also referenced.
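
One way to work out which jars would also need to be referenced is to scan a directory of jars for the class named in each NoClassDefFoundError. The following is only a minimal sketch: find_jars_providing is a hypothetical helper (not part of Nutch or Any23), and the directory you pass in is an assumption, for example wherever your Any23 build placed its dependency jars. It will not help with the jsoup NoSuchMethodError, which generally means the class is present but in a different version than the caller was compiled against.

```python
import zipfile
from pathlib import Path
from typing import List


def find_jars_providing(class_name: str, lib_dir: str) -> List[str]:
    """Return the file names of jars under lib_dir that contain class_name.

    class_name may use the slash form printed by NoClassDefFoundError
    (org/apache/commons/rdf/api/IRI) or the dotted form.
    """
    # Jar entries use slash-separated paths ending in ".class".
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip/jar archives.
            continue
    return matches
```

Calling find_jars_providing("org/apache/commons/rdf/api/IRI", some_dir) returns the names of the jars in some_dir that contain that class, or an empty list if none does.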
> On Feb 9, 2018, at 1:45 PM, Lewis John McGibbney <[email protected]> wrote:
>
> Hi David,
> We are in the process of releasing Any23 2.2, this will include the fix.
> We can then come back to Nutch and make the upgrade and you should be all set.
> Hopefully this will be achieved within around 72hrs. In the meantime, you can
> clone, build and deploy Any23 master. This will do the trick.
> Lewis
>
> On 2018/02/09 07:31:10, David Ferrero <[email protected]> wrote:
>> Thank you for this information. Since this is very much related to Any23 and
>> microdata parsing, I'm going to ask what I believe is a related question
>> but keep this same thread so it will be organized in one place:
>>
>> I noticed that a lot of job boards such as dice.com, monster.com, etc. use
>> http://schema.org/JobPosting information; however, many seem to use
>> <script type="application/ld+json">…</script> rather than RDF.
>> In summer 2017, Google announced structured data guidance for Jobs:
>> https://developers.google.com/search/docs/data-types/job-posting
>> and a testing tool to validate your HTML:
>> https://search.google.com/structured-data/testing-tool
>> I verified a few sample listings from the above-mentioned job boards with
>> Google's testing tool and they validate OK.
>>
>> So after looking at http://any23.apache.org/getting-started.html for the
>> supported extractors, I see Any23 mentions it supports JSON-LD input, so I
>> added this to nutch-site.xml to override the same property in
>> nutch-default.xml:
>>
>> <property>
>> <name>any23.extractors</name>
>> <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
>> <description>Comma-separated list of Any23 extractors (a list of
>> extractors is available here:
>> http://any23.apache.org/getting-started.html)</description>
>> </property>
>>
>> I expected to see additional information from nutch parsechecker after
>> adding the jsonld extractors; however, I see NO changes in the parsed
>> Any23-Triples microdata.
>>
>> What might I be doing wrong?
>>
>>> On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney <[email protected]>
>>> wrote:
>>>
>>> Hi David,
>>> Answers inline
>>>
>>> On Thu, Feb 8, 2018 at 9:19 AM, <[email protected]> wrote:
>>>
>>>>
>>>> From: David Ferrero <[email protected]>
>>>> To: [email protected]
>>>> Cc:
>>>> Bcc:
>>>> Date: Thu, 8 Feb 2018 10:19:52 -0700
>>>> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
>>>> Pull request #205 was recently merged into master branch for Nutch 1.x in
>>>> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
>>>>
>>>> I am new to nutch and solr and have just started crawling and indexing a
>>>> few select websites. Using the built in html parsing/indexing, I am getting
>>>> searchable fields like url, content, host, sometimes a title, and a few
>>>> other indexing related fields like digest, boost, segment, and tstamp. That
>>>> said, I realized very quickly that I need better results. While exploring
>>>> the source of the website, I noticed references to schema.org and got
>>>> excited by what I saw. That's how I stumbled upon NUTCH-1129.
>>>>
>>>> I've built apache-nutch-1.15-SNAPSHOT which includes Any23
>>>> parser/indexer.
>>>>
>>>
>>> Excellent.
>>>
>>>
>>>>
>>>> Q: Now what? How do I gain Any23 microdata parsing / indexing
>>>> capabilities introduced by NUTCH-1129?
>>>> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
>>>> plugin.includes with something like parse-(html | tika |
>>>> any23)|index-(basic | anchor | any23)
>>>>
>>>
>>> No, you just add 'any23' to the list of plugins within the plugin.includes
>>> property of nutch-site.xml
>>>
>>>
>>>> Q: How do I expose the discovered microdata structure / items to end-user
>>>> such as Solr? For example, what are the microdata items and do I need to
>>>> map them to Solr in solrindex-mapping.xml?
>>>>
>>>
>>> OK, so the current configuration for the Any23 plugin is to store extracted
>>> structured data markup in the Nutch Metadata object with the key
>>> "Any23-Triples". You can locate it using something like the ParserChecker
>>> tool provided via the 'nutch' script. Likewise, you can also locate it, as a
>>> representation of what would be indexed, by using the IndexerChecker
>>> tooling also provided within the 'nutch' script.
>>>
>>> As an example, data is now indexed as follows (after crawling
>>> https://smartive.ch/jobs):
>>>
>>>
>>> "structured_data": [
>>>   {
>>>     "node": "<https://smartive.ch/jobs>",
>>>     "value": "\"IE-edge,chrome=1\"@de",
>>>     "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
>>>     "short_key": "X-UA-Compatible"
>>>   },
>>>   {
>>>     "node": "<https://smartive.ch/jobs>",
>>>     "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
>>>     "key": "<http://vocab.sindice.net/any23#description>",
>>>     "short_key": "description"
>>>   },
>>>   {
>>>     "node": "<https://smartive.ch/jobs>",
>>>     "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
>>>     "key": "<http://vocab.sindice.net/any23#viewport>",
>>>     "short_key": "viewport"
>>>   },
>>>   {
>>>     "node": "<https://smartive.ch/jobs>",
>>>     "value": "\"width=device-width,initial-scale=1\"@de",
>>>     "key": "<http://vocab.sindice.net/any23#viewport>",
>>>     "short_key": "viewport"
>>>   },
>>>   {
>>>     "node": "<https://smartive.ch/jobs>",
>>>     "value": "\"ie=edge\"@de",
>>>     "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
>>>     "short_key": "x-ua-compatible"
>>>   }
>>> ],
>>>
>>>
>>> Note from the above that the predicate, stored in the 'key' field, is very
>>> useful for quickly filtering for, for example, Hotel Ratings or something similar.
>>>
>>>
>>>>
>>>> I'd also be interested to learn how to point at a specific URL and see how
>>>> nutch sees the microdata (best case), then learn how to leverage this into
>>>> nutch and finally into solr.
>>>>
>>>>
>>> See the tooling for ParserChecker and IndexerChecker as explained above.
>>> If you have any further questions, please let me know.
>>> Lewis
>>
>>