On 10/15/15 12:09 PM, Haag, Jason wrote:
> Hi All,
>
> Rather than wait for someone to tell me the right options for a crawl
> job targeting RDFa, I will post the options I selected. Perhaps someone
> can tell me what I'm missing or did wrong. I'm running Version: 07.20.3214.
>
> Here is what I entered under "Web Application Server > Content Imports":
>
> Target Description: ADL Verbs (RDFa/HTML)
> Target URL: http://xapi.vocab.pub/datasets/adl/verbs/index.html

Use http://xapi.vocab.pub/datasets/adl/verbs/ as the Target URL if you want everything under that path.

 
> Copy to Local DAV collection: DAV/home/dba/rdf_sink
That should be /DAV/home/dba/rdf_sink/ (note the leading slash).

> Number of redirects to follow: 1
> Update interval: 10
> Checked the following:
> X Run Sponger
> X Store Metadata
>
> Cartridges Selected:
> X RDFa

Select HTML (and variants) -- but note that you need to go to the
"Extractor Cartridges" section (under the "Linked Data" menu's "Sponger"
section) to select the HTML cartridge and configure it with the following
options:

add-html-meta=yes
get-feeds=no
preview-length=512
fallback-mode=no
rdfa=yes
reify_html5md=0
reify_rdfa=0
reify_jsonld=0
reify_all_grddl=0
reify_html=0
passthrough_mode=yes
loose=yes
reify_html_misc=no
reify_turtle=no

I know this seems awkward, but this is the best solution we could come
up with, given the problems posed by text/html content-type overloading,
i.e., the same media type covering HTML+Microdata, RDFa, etc.
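
If you prefer to script this rather than click through Conductor, here is a
minimal sketch in Python over ODBC. The table and hook names used below
(DB.DBA.SYS_RDF_MAPPERS, RM_OPTIONS, RM_HOOK, DB.DBA.RDF_LOAD_HTML_RESPONSE)
and the DSN are from memory, so treat this as a sketch and verify them
against your own build before relying on it:

# Sketch: apply the HTML cartridge options above programmatically.
# Assumed names (verify on your installation): DB.DBA.SYS_RDF_MAPPERS,
# RM_OPTIONS, RM_HOOK, and the HTML cartridge hook
# DB.DBA.RDF_LOAD_HTML_RESPONSE.
import pyodbc

options = {
    "add-html-meta": "yes",
    "get-feeds": "no",
    "preview-length": "512",
    "fallback-mode": "no",
    "rdfa": "yes",
    "passthrough_mode": "yes",
    "loose": "yes",
}

# Cartridge options are stored as a flat key/value vector.
pairs = ", ".join("'%s', '%s'" % (k, v) for k, v in options.items())
sql = ("UPDATE DB.DBA.SYS_RDF_MAPPERS "
       "SET RM_OPTIONS = vector (%s) "
       "WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_HTML_RESPONSE'" % pairs)

conn = pyodbc.connect("DSN=Local Virtuoso;UID=dba;PWD=dba")  # adjust DSN/credentials
conn.execute(sql)
conn.commit()
conn.close()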


>
> After I created the crawl job, I went to "import queues" and clicked "run"
>
> I received the following message:
>
> Results for xapi.vocab.pub
> errors while retrieving target. Select "reset" to return initial state
> Total URLs processed : 1
> Download finished
>
> I also checked "retrieved sites" and 0/1 were downloaded.
>
> Where do I find out the error that was encountered while retrieving
> target? Thanks!

Click on the "Edit" button aligned with your crawler job.

Another bit of quirky UI to be fixed.
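
Independently of the UI, it is often quickest to reproduce the crawler's
fetch from the Virtuoso host and see what the target actually returns. Here
is a minimal sketch using Python's requests library, mirroring your job's
Accept type and redirect setting (the URL is the one from your job):

# Sketch: reproduce the crawl job's fetch to diagnose
# "errors while retrieving target". Run it from the machine hosting
# Virtuoso so you exercise the same network path the crawler uses.
import requests

TARGET = "http://xapi.vocab.pub/datasets/adl/verbs/"
HEADERS = {"Accept": "text/html"}  # the RDFa source is served as HTML

try:
    resp = requests.get(TARGET, headers=HEADERS, allow_redirects=True, timeout=30)
except requests.RequestException as exc:
    # DNS failures, refused connections, timeouts, etc. land here.
    raise SystemExit("Retrieval failed before any HTTP response: %s" % exc)

# Your job follows only 1 redirect, so more hops than that would
# explain a failed download.
for hop in resp.history:
    print("redirect     :", hop.status_code, hop.headers.get("Location"))

print("final URL    :", resp.url)
print("status       :", resp.status_code)
print("content-type :", resp.headers.get("Content-Type"))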
>
> Also, I'm not sure if this is a bug, but I noticed when I specify the
> Local DAV collection of DAV/home/dba/rdf_sink/ and then go to edit the
> job, it changes to /DAV/DAV/home/dba/rdf_sink/ by adding a new
> directory 'DAV'

Your initial value should have been: /DAV/home/dba/rdf_sink/.

We are going to look into some of these quirks in due course.

>
> If I use /home/dba/rdf_sink/ as the local path when creating the crawl
> job, it won't let me without adding the /DAV/ path in front of it. So it
> seems it is creating a subdirectory /DAV/ under /DAV/ when it shouldn't
> be.

See my comment above.

Kingsley
>
> Regards,
>
> J Haag
> -------------------------------------------------------
> Advanced Distributed Learning Initiative
> +1.850.266.7100(office)
> +1.850.471.1300 (mobile)
> jhaag75 (skype)
> http://linkedin.com/in/jasonhaag
>
>
> On Thu, Oct 15, 2015 at 10:10 AM, Haag, Jason <jason.haag....@adlnet.gov> 
> wrote:
>> Just touching base on this post... I suspect this one was TL;DR.
>>
>> -------------------------------------------------------
>> Advanced Distributed Learning Initiative
>> +1.850.266.7100(office)
>> +1.850.471.1300 (mobile)
>> jhaag75 (skype)
>> http://linkedin.com/in/jasonhaag
>>
>> On Thu, Oct 8, 2015 at 10:39 AM, Haag, Jason <jason.haag....@adlnet.gov>
>> wrote:
>>> Hi All,
>>>
>>> I'm posting these questions to the users group so that other Virtuoso
>>> users interested in importing RDFa-based content into Virtuoso might
>>> also benefit from the responses. Please let me know if any of these
>>> questions should instead be submitted as issues or feature requests on
>>> GitHub. The current documentation on importing RDFa documents is a
>>> little dated and does not accurately match the Conductor interface for
>>> content imports. The Conductor interface also doesn't explain what the
>>> various fields and options mean. Some of them are not obvious to a new
>>> user like me and might lead to bad assumptions or even cause conflicts
>>> in the system. I've been a little confused by what some of the various
>>> options mean, and I have also been running an older version (7.2.1).
>>> From what I have been told, many of the HTML/RDFa cartridges have been
>>> improved since the older version of VOS. Therefore, I would like to ask
>>> a few questions and determine exactly what these fields will do (or not
>>> do) before I make any mistakes or assumptions. Thank you to Hugh, Tim,
>>> and Kingsley for all of the excellent advice so far. I truly appreciate
>>> your support and patience with all of my questions.
>>>
>>> I'm currently running a new build of Virtuoso Open Source, Version:
>>> 07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015 on Debian+Ubuntu.
>>>
>>> For my use case, we will have several (potentially 50 or more) HTML5 /
>>> RDFa 1.1 (core) pages available on one external server/domain and would like
>>> to regularly "sponge" or "crawl" these URIs (as these datasets expressed in
>>> HTML/RDFa may be updated or even grow from time to time). They will also
>>> become more decentralized and available on multiple external servers so
>>> Virtuoso seems like the perfect solution for being able to automatically
>>> crawl all of these external sources of RDFa controlled vocabulary datasets
>>> (and also perfect for many other future objectives we have for RDF).
>>>
>>> Here are my questions (perhaps some of the answers can be used for
>>> FAQs, etc.):
>>>
>>> 1) Does Virtuoso support crawling external domains and servers for the
>>> Target URL if the target is HTML5/RDFa or must they be imported into DAV
>>> first?
>>> 2) Am I always required to specify a local DAV collection for sponging and
>>> crawling RDFa even if I don't want to store the RDFa/HTML locally?
>>> 3) If yes to #2, when I use dav (or dba) as the owner and the rdf_sink
>>> folder to store the crawled RDFa/HTML, are there any special permissions or
>>> configurations required to be made on the rdf_sink folder? Here are the
>>> default configuration settings for rdf_sink:
>>>
>>> Main Tab:
>>> - Folder Name: (rdf_sink)
>>> - Folder Type: Linked Data Import
>>> - Owner: dav (or dba)
>>> - Permissions: rw-rw----
>>> - Full Text Search: Recursively
>>> - Default Permissions: Off
>>> - Metadata Retrieval: Recursively
>>> - Apply changes to all subfolders and resources: unchecked
>>> - Expiration Date: 0
>>> - WebDAV Properties: No properties
>>>
>>> Sharing Tab:
>>> - ODS users/groups: No Security
>>> - WebID users: No WebID Security
>>>
>>> Linked Data Import Tab:
>>> - Graph name: urn:dav:home:dav:rdf_sink
>>> - Base URI: http://host:8890/DAV/home/dba/rdf_sink/
>>> - Use special graph security (on/off): unchecked
>>> - Sponger (on/off): checked
>>>
>>> 4) When importing content using the crawler + sponger feature I navigate
>>> to "Conductor > Web Application Server > Content Imports" and click the "New
>>> Target" button.
>>>
>>> Which of the following fields should I use to specify an external
>>> HTML5/RDFa 1.1 URL for crawling, and what does each of these fields mean?
>>> Note: for the fields that are obvious (or are adequately addressed in the
>>> VOS documentation) I have already entered values below. I would greatly
>>> appreciate more information on those fields that have an *asterisk* with
>>> a question in (parentheses).
>>>
>>> - Target description: This is obvious. Name of content import / crawl job,
>>> etc.
>>> - Target URL: http://domain/path/html file (* does this URL prefer an XML
>>> sitemap for RDFa or can it explicitly point directly to an HTML file for
>>> RDFa? I also have content negotiation set up on the external server where
>>> the RDFa/HTML is hosted as it also serves JSON-LD, RDF/XML, and Turtle
>>> serializations, but I would prefer to only regularly crawl/update based on
>>> the HTML/RDFa data for now. I might have Virtuoso generate the alternate
>>> serializations in the future*)
>>> - Login name on target: (*if target URL is an external server, does this
>>> need to be blank?*)
>>> - Login password on target:  (*if target URL is an external server, does
>>> this need to be blank?*)
>>> - Copy to Local DAV collection: (*what does this mean? It seems to imply
>>> that it is required to specify a Local Dav collection to create a crawl job,
>>> but another option implies that you don't have to store the data. The two
>>> options are conflicting and confusing. From a user experience perspective,
>>> it seems I would either want to store it or not. If I don't then why do I
>>> have to specify a local DAV collection?*)
>>> - Single page download: (*what does this mean?*)
>>> - Local resources owner: dav
>>> - Download only newer than: 1900-01-01 00-00-00
>>> - Follow links matching (delimited with ;): (*what does this do? what
>>> types of "links" are examined?*)
>>> - Do not follow links matching (delimited with ;): (*what does this
>>> do? what types of "links" are examined?*)
>>> - Custom HTTP headers: (*is this required for RDFa? If so, what is the
>>> expected syntax and delimiters? "Accept: text/html"?*)
>>> - Number of HTTP redirects to follow: (*I currently have a 303 redirect in
>>> place for content negotiation, but what if this is unknown or changes in the
>>> future? Will it break the crawler job?*)
>>> - XPath expression for links extraction: (*is this applicable for
>>> importing RDFa?*)
>>> - Crawling depth limit: unlimited
>>> - Update Interval (minutes): 0
>>> - Number of threads: (*is this applicable for importing RDFa?*)
>>> - Crawl delay (sec): 0.00
>>> - Store Function: (*is this applicable for importing RDFa?*)
>>> - Extract Function: (*is this applicable for importing RDFa?*)
>>> - Semantic Web Crawling: (*what does this do exactly?*)
>>> - If Graph IRI is unassigned use this Data Source URL: (*what is the
>>> purpose of this? The content can't be imported if a Target is not specified,
>>> right?*)
>>> - Follow URLs outside of the target host:  (*what does this do exactly?*)
>>> - Follow HTML meta link: (*is this only for HTML/RDFa that specifies an
>>> alternate serialization via the <link> element in the <head>?*)
>>> - Follow RDF properties (one IRI per row): (*what does this do?*)
>>> - Download images:
>>> - Use WebDAV methods: (*what does this mean?*)
>>> - Delete if remove on remote detected: (*what does this mean?*)
>>> - Store documents locally: (*does this only apply to storing the content
>>> in DAV?*)
>>> - Convert Links: (*is this related to another option/field*?)
>>> - Run Sponger: (*does this force to only use the sponger for reading RDFa
>>> and populate the DB with the triples?*)
>>> - Accept RDF: (*is this option only for slash-based URIs that return
>>> RDF/XML via content negotiation?*)
>>> - Store metadata *: (*what does this mean?*)
>>> - Cartridges: (* I recommend improving the usability on this. At first I
>>> thought perhaps my cartridges were not installed because the content area
>>> below the "Cartridges" tab was empty. I realized the cartridges only appear
>>> when you click/toggle the "Cartridges" tab. I suggest they should all be
>>> listed by default. Turning their visibility off by default may prevent users
>>> from realizing they are there, especially based on the old documentation*)
>>>
>>> 5) What do the following cartridge options do? I only listed the ones that
>>> seem most applicable to running a crawler/sponger import job for an
>>> externally hosted HTML5/RDFa URL.
>>>
>>> - RDF cartridge (*what types of RDF? what does this one do?*)
>>> - RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1 core?
>>> RDFa 1.0? RDFa 1.1 Lite?*)
>>> - WebDAV Metadata
>>> - xHTML
>>>
>>>
>>>


-- 
Regards,

Kingsley Idehen       
Founder & CEO 
OpenLink Software     
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this



