Hi David,

The java.lang.NoClassDefFoundError issues can be resolved by including the correct JAR artifacts on the plugin classpath. We will have the issue properly resolved very soon, and I will let you know when Any23 2.2 is released.

Lewis

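One way to see which JAR, if any, supplies a class named in a NoClassDefFoundError is to scan a directory of JARs for the matching .class entry. The following is a minimal sketch, not part of Nutch or Any23: the helper name jars_providing and the directory passed to it are hypothetical, and it assumes the JARs are ordinary zip archives.

```python
import zipfile
from pathlib import Path


def jars_providing(class_name, lib_dir):
    """Return the names of the jars in lib_dir that contain class_name.

    class_name may use the slash form printed in the stack trace
    (org/apache/commons/rdf/api/IRI) or the dotted form.
    jars_providing is a hypothetical helper, not a Nutch or Any23 API.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip archives.
            continue
    return hits


# Example call (the directory is an assumption about a local build layout):
# jars_providing("org/apache/commons/rdf/api/IRI", "runtime/local/plugins/any23")
```

An empty result suggests the dependency JAR is missing from that directory. For a NoSuchMethodError the class is usually present, and the same scan can show which JAR (and therefore which version) it is being loaded from.
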
On Sat, Feb 10, 2018 at 11:42 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: David Ferrero <david.ferr...@zion.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sat, 10 Feb 2018 12:41:57 -0700
> Subject: Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
>
> Great news on the forthcoming Any23 2.2 release. I look forward to it and the subsequent bump in Nutch.
>
> In the meantime, I successfully built Any23 from master, copied the any23 jars into Nutch (master), and referenced them in the plugin…
> <library name="apache-any23-api-2.3-SNAPSHOT.jar"/>
> <library name="apache-any23-core-2.3-SNAPSHOT.jar"/>
> <library name="apache-any23-csvutils-2.3-SNAPSHOT.jar"/>
> <library name="apache-any23-encoding-2.3-SNAPSHOT.jar"/>
> <library name="apache-any23-mime-2.3-SNAPSHOT.jar"/>
>
> Unfortunately, when I reran the nutch parsechecker it no longer parsed anything. A quick look at logs/hadoop.log reveals that the updated any23 depends on new classes in other jar files:
> Caused by: java.lang.NoClassDefFoundError: org/apache/commons/rdf/api/IRI
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.semanticweb.owlapi.rio.OWLAPIRDFFormat
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.jsoup.select.NodeTraversor.traverse(Lorg/jsoup/select/NodeVisitor;Lorg/jsoup/nodes/Node;)V
>
> I guess I would need to rebuild nutch from master (rather than just copying a few jar files) and ensure that any23’s jar dependencies are also referenced.
>
> > On Feb 9, 2018, at 1:45 PM, Lewis John McGibbney <lewi...@apache.org> wrote:
> >
> > Hi David,
> > We are in the process of releasing Any23 2.2, which will include the fix.
> > We can then come back to Nutch and make the upgrade, and you should be all set.
> > Hopefully this will be achieved within around 72 hours. In the meantime, you can clone, build and deploy Any23 master. This will do the trick.
> > Lewis
> >
> > On 2018/02/09 07:31:10, David Ferrero <david.ferr...@zion.com> wrote:
> >> Thank you for this information. Since this is very much related to Any23 and microdata parsing, I’m going to ask what I believe is a related question but keep this same thread so it will be organized in one place:
> >>
> >> I noticed a lot of job boards such as dice.com, monster.com, etc. use http://schema.org/JobPosting information; however, many seem to use <script type="application/ld+json">…</script> rather than RDF.
> >> In summer 2017, Google announced structured data guidance for jobs:
> >> https://developers.google.com/search/docs/data-types/job-posting
> >> and a testing tool to validate your HTML: https://search.google.com/structured-data/testing-tool
> >> I verified a few sample listings from the above-mentioned job boards with Google’s testing tool and they validate OK.
> >>
> >> So after looking at http://any23.apache.org/getting-started.html for the supported extractors, I see Any23 mentions it supports JSON-LD input, so I added this to nutch-site.xml to override the same property in nutch-default.xml:
> >>
> >> <property>
> >>   <name>any23.extractors</name>
> >>   <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
> >>   <description>Comma-separated list of Any23 extractors (a list of extractors is available here: http://any23.apache.org/getting-started.html)</description>
> >> </property>
> >>
> >> I expected to see additional information from nutch parsechecker after adding the jsonld extractors; however, I see NO changes to the Any23-Triples microdata parsed.
> >>
> >> What might I be doing wrong?
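
Before looking for a problem in the extractors themselves, it can help to confirm that the override in nutch-site.xml is well-formed and lists the extractors you expect. A minimal sketch, assuming the usual <configuration> root element; the helper name configured_extractors is hypothetical, and the XML is inlined here only so the example is self-contained. It checks the file content, not what Nutch actually loads at runtime.

```python
import xml.etree.ElementTree as ET

# Inlined stand-in for nutch-site.xml; in practice you would read the file.
SITE_XML = """<configuration>
  <property>
    <name>any23.extractors</name>
    <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
  </property>
</configuration>"""


def configured_extractors(xml_text, prop="any23.extractors"):
    """Return the extractor names listed for prop, or [] if it is absent."""
    root = ET.fromstring(xml_text)
    for node in root.iter("property"):
        if node.findtext("name", "").strip() == prop:
            value = node.findtext("value", "")
            # Split the comma-separated list and drop empty items.
            return [v.strip() for v in value.split(",") if v.strip()]
    return []


print(configured_extractors(SITE_XML))
```
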
> >>
> >>> On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney <lewi...@apache.org> wrote:
> >>>
> >>> Hi David,
> >>> Answers inline
> >>>
> >>> On Thu, Feb 8, 2018 at 9:19 AM, <user-digest-h...@nutch.apache.org> wrote:
> >>>
> >>>>
> >>>> From: David Ferrero <david.ferr...@zion.com>
> >>>> To: user@nutch.apache.org
> >>>> Cc:
> >>>> Bcc:
> >>>> Date: Thu, 8 Feb 2018 10:19:52 -0700
> >>>> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
> >>>> Pull request #205 was recently merged into the master branch for Nutch 1.x in fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
> >>>>
> >>>> I am new to nutch and solr and have just started crawling and indexing a few select websites. Using the built-in html parsing/indexing, I am getting searchable fields like url, content, host, sometimes a title, and a few other indexing-related fields like digest, boost, segment, and tstamp. That said, I realized very quickly that I need better results. While exploring the source of the website, I noticed references to schema.org and got excited by what I saw. That’s how I stumbled upon NUTCH-1129.
> >>>>
> >>>> I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23 parser/indexer.
> >>>>
> >>>
> >>> Excellent.
> >>>
> >>>
> >>>>
> >>>> Q: Now what? How do I gain the Any23 microdata parsing / indexing capabilities introduced by NUTCH-1129?
> >>>> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in plugin.includes with something like parse-(html | tika | any23)|index-(basic | anchor | any23)
> >>>>
> >>>
> >>> No, you just add 'any23' to the list of plugins within the plugin.includes property of nutch-site.xml
> >>>
> >>>
> >>>> Q: How do I expose the discovered microdata structure / items to an end user such as Solr? For example, what are the microdata items and do I need to map them to Solr in solrindex-mapping.xml?
> >>>>
> >>>
> >>> OK, so the current configuration for the Any23 plugin is to store extracted structured data markup in the Nutch Metadata object with the key "Any23-Triples". You can locate it using something like the ParserChecker tool provided via the 'nutch' script. Likewise, you can also locate it, as a representation of what would be indexed, by using the IndexerChecker tooling also provided within the 'nutch' script.
> >>>
> >>> For example, data is now indexed as follows (example after crawling https://smartive.ch/jobs):
> >>>
> >>>
> >>> "structured_data": [
> >>>   {
> >>>     "node": "<https://smartive.ch/jobs>",
> >>>     "value": "\"IE-edge,chrome=1\"@de",
> >>>     "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
> >>>     "short_key": "X-UA-Compatible"
> >>>   },
> >>>   {
> >>>     "node": "<https://smartive.ch/jobs>",
> >>>     "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
> >>>     "key": "<http://vocab.sindice.net/any23#description>",
> >>>     "short_key": "description"
> >>>   },
> >>>   {
> >>>     "node": "<https://smartive.ch/jobs>",
> >>>     "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
> >>>     "key": "<http://vocab.sindice.net/any23#viewport>",
> >>>     "short_key": "viewport"
> >>>   },
> >>>   {
> >>>     "node": "<https://smartive.ch/jobs>",
> >>>     "value": "\"width=device-width,initial-scale=1\"@de",
> >>>     "key": "<http://vocab.sindice.net/any23#viewport>",
> >>>     "short_key": "viewport"
> >>>   },
> >>>   {
> >>>     "node": "<https://smartive.ch/jobs>",
> >>>     "value": "\"ie=edge\"@de",
> >>>     "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
> >>>     "short_key": "x-ua-compatible"
> >>>   }
> >>> ],
> >>>
> >>>
> >>> Note from the above that the predicate 'key' field is very useful for quickly filtering through, for example, hotel ratings or something similar.
> >>>
> >>>
> >>>>
> >>>> I’d also be interested to learn how to point at a specific URL and see how nutch sees the microdata (best case), then learn how to leverage this into nutch and finally into solr.
> >>>>
> >>>>
> >>> See the tooling for ParserChecker and IndexerChecker as explained above.
> >>> If you have any further questions, please let me know.
> >>> Lewis
> >>
> >>
> >
>
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
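
The filtering by key mentioned in the quoted reply can be sketched on the client side once the indexed document has been fetched as JSON. This is only an illustration: the helper name values_for is hypothetical, the two records are copied from the sample output in the thread, and the enclosing braces are added so the fragment parses on its own.

```python
import json

# Two records copied from the sample output in the thread; the wrapping
# object is added here so the fragment is valid standalone JSON.
DOC = json.loads(r'''
{
  "structured_data": [
    {
      "node": "<https://smartive.ch/jobs>",
      "value": "\"width=device-width,initial-scale=1\"@de",
      "key": "<http://vocab.sindice.net/any23#viewport>",
      "short_key": "viewport"
    },
    {
      "node": "<https://smartive.ch/jobs>",
      "value": "\"ie=edge\"@de",
      "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
      "short_key": "x-ua-compatible"
    }
  ]
}
''')


def values_for(doc, short_key):
    """Return the values of all triples whose short_key matches."""
    return [
        triple["value"]
        for triple in doc.get("structured_data", [])
        if triple.get("short_key") == short_key
    ]


print(values_for(DOC, "viewport"))
```

Matching on "short_key" keeps the call readable; matching on the full "key" IRI instead would avoid collisions between vocabularies that share a local name.
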