On 10/15/15 12:09 PM, Haag, Jason wrote:
> Hi All,
>
> Rather than wait for someone to tell me the right choices among all of
> the different options available for a crawl job targeting RDFa, I will
> post which options I selected. Perhaps someone can tell me what I'm
> missing or did wrong. I'm running Version: 07.20.3214.
>
> Here is what I entered under "Web Application Server > Content Imports":
>
> Target Description: ADL Verbs (RDFa/HTML)
> Target URL: http://xapi.vocab.pub/datasets/adl/verbs/index.html

Use http://xapi.vocab.pub/datasets/adl/verbs/ -- if you want everything.

> Copy to Local DAV collection: DAV/home/dba/rdf_sink

That should be: /DAV/home/dba/rdf_sink/ (note the leading and trailing
slashes).

> Number of redirects to follow: 1
> Update interval: 10
>
> Checked the following:
> X Run Sponger
> X Store Metadata
>
> Cartridges Selected:
> X RDFa

Select HTML (and variants) -- but note that via the "Linked Data" menu's
"Sponger" section you need to go to the "Extractor Cartridges" section to
select and configure the HTML cartridge with the following options:

    add-html-meta=yes
    get-feeds=no
    preview-length=512
    fallback-mode=no
    rdfa=yes
    reify_html5md=0
    reify_rdfa=0
    reify_jsonld=0
    reify_all_grddl=0
    reify_html=0
    passthrough_mode=yes
    loose=yes
    reify_html_misc=no
    reify_turtle=no

I know this seems awkward, but this is the best solution we could come up
with, due to the problems posed by text/html content-type overloading re.
HTML+Microdata, RDFa, etc.
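If you want to test those cartridge settings outside of the crawler UI,
you can invoke the sponger directly from isql using the get:soft pragma,
which fetches the page, runs the extractor cartridges over it, and loads
the resulting triples into a graph named after the source URL. A minimal
check (the target URL is yours from above):

    SPARQL
    define get:soft "replace"
    SELECT (COUNT(*) AS ?c)
    FROM <http://xapi.vocab.pub/datasets/adl/verbs/index.html>
    WHERE { ?s ?p ?o };

A non-zero count tells you the RDFa extraction itself works, so any
remaining problem lies in the crawl job definition rather than the
cartridges.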
> After I created the crawl job, I went to "Import Queues" and clicked
> "Run".
>
> I received the following message:
>
> Results for xapi.vocab.pub
> Errors while retrieving target. Select "reset" to return to initial state.
> Total URLs processed: 1
> Download finished
>
> I also checked "Retrieved Sites" and 0/1 were downloaded.
>
> Where do I find out the error that was encountered while retrieving the
> target? Thanks!

Click on the "Edit" button aligned with your crawler job. Another bit of
quirky UI to be fixed.
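If the UI still doesn't surface the error, the crawler's state can also be
inspected from SQL. A rough sketch -- the queue table and column names
below (WS.WS.VFS_QUEUE, VQ_HOST, VQ_URL, VQ_STAT) are taken from the VOS
sources and may differ across builds, so treat them as assumptions to
verify:

    -- show the per-URL status for the target host
    -- (names assumed; check your build's schema before relying on this)
    SELECT VQ_URL, VQ_STAT
      FROM WS.WS.VFS_QUEUE
     WHERE VQ_HOST = 'xapi.vocab.pub';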
> Also, I'm not sure if this is a bug, but I noticed that when I specify
> the Local DAV collection of DAV/home/dba/rdf_sink/ and then go to edit
> the job, it changes to /DAV/DAV/home/dba/rdf_sink/ by adding a new
> directory 'DAV'.

Your initial value should have been: /DAV/home/dba/rdf_sink/. We are going
to look into some of these quirks, in due course.

> If I use /home/dba/rdf_sink/ as the local path when creating the crawl
> job, it won't let me without adding the /DAV/ path in front of it. So it
> seems it is creating a subdirectory /DAV/ under /DAV/ when it shouldn't
> be.

See my comment above.

Kingsley

> Regards,
>
> J Haag
> -------------------------------------------------------
> Advanced Distributed Learning Initiative
> +1.850.266.7100 (office)
> +1.850.471.1300 (mobile)
> jhaag75 (skype)
> http://linkedin.com/in/jasonhaag
>
> On Thu, Oct 15, 2015 at 10:10 AM, Haag, Jason <jason.haag....@adlnet.gov>
> wrote:
>> Just touching base on this post... I suspect this one was TL;DR.
>>
>> On Thu, Oct 8, 2015 at 10:39 AM, Haag, Jason <jason.haag....@adlnet.gov>
>> wrote:
>>> Hi All,
>>>
>>> I'm posting these questions to the users group so that other Virtuoso
>>> users interested in importing RDFa-based content into Virtuoso might
>>> also benefit from the responses. Please let me know if any of these
>>> questions should instead be submitted to GitHub as issues or feature
>>> requests. The current documentation on importing RDFa documents is a
>>> little dated and does not accurately match the Conductor interface for
>>> content imports. The Conductor interface also doesn't explain what the
>>> various fields and options mean. Some of them are not obvious to a new
>>> user like myself and might lead to bad assumptions or even cause
>>> conflicts in the system. I've been a little confused by what some of
>>> the various options mean, and I have also been running an older
>>> version (7.2.1). From what I have been told, many of the HTML/RDFa
>>> cartridges have been improved since that older version of VOS.
>>> Therefore, I would like to ask a few questions and determine exactly
>>> what these fields will do (or not do) before I make any mistakes or
>>> assumptions. Thank you to Hugh, Tim, and Kingsley for all of the
>>> excellent advice so far. I truly appreciate your support and patience
>>> with all of my questions.
>>>
>>> I'm currently running a new build of Virtuoso Open Source, Version:
>>> 07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015, on
>>> Debian/Ubuntu.
>>>
>>> For my use case, we will have several (potentially 50 or more) HTML5 /
>>> RDFa 1.1 (Core) pages available on one external server/domain, and we
>>> would like to regularly "sponge" or "crawl" these URIs (as these
>>> datasets expressed in HTML/RDFa may be updated or even grow from time
>>> to time). They will also become more decentralized and available on
>>> multiple external servers, so Virtuoso seems like the perfect solution
>>> for automatically crawling all of these external sources of RDFa
>>> controlled-vocabulary datasets (as well as for many other future
>>> objectives we have for RDF).
>>>
>>> Here are my questions (perhaps some of the answers can be used for
>>> FAQs, etc.):
>>>
>>> 1) Does Virtuoso support crawling external domains and servers for the
>>> Target URL if the target is HTML5/RDFa, or must the documents be
>>> imported into DAV first?
>>>
>>> 2) Am I always required to specify a local DAV collection for sponging
>>> and crawling RDFa, even if I don't want to store the RDFa/HTML
>>> locally?
>>>
>>> 3) If yes to #2: when I use dav (or dba) as the owner and the rdf_sink
>>> folder to store the crawled RDFa/HTML, are there any special
>>> permissions or configurations required on the rdf_sink folder? Here
>>> are the default configuration settings for rdf_sink:
>>>
>>> Main Tab:
>>> - Folder Name: (rdf_sink)
>>> - Folder Type: Linked Data Import
>>> - Owner: dav (or dba)
>>> - Permissions: rw-rw----
>>> - Full Text Search: Recursively
>>> - Default Permissions: Off
>>> - Metadata Retrieval: Recursively
>>> - Apply changes to all subfolders and resources: unchecked
>>> - Expiration Date: 0
>>> - WebDAV Properties: No properties
>>>
>>> Sharing Tab:
>>> - ODS users/groups: No Security
>>> - WebID users: No WebID Security
>>>
>>> Linked Data Import Tab:
>>> - Graph name: urn:dav:home:dav:rdf_sink
>>> - Base URI: http://host:8890/DAV/home/dba/rdf_sink/
>>> - Use special graph security (on/off): unchecked
>>> - Sponger (on/off): checked
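Side note: once an import has run, you can verify what landed in that
graph from isql, using the graph name from the settings above (a minimal
check, assuming the default graph name shown there):

    SPARQL SELECT (COUNT(*) AS ?c)
    FROM <urn:dav:home:dav:rdf_sink>
    WHERE { ?s ?p ?o };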
>>>
>>> 4) When importing content using the crawler + sponger feature, I
>>> navigate to "Conductor > Web Application Server > Content Imports" and
>>> click the "New Target" button.
>>>
>>> Which of the following fields should I use to specify an external
>>> HTML5/RDFa 1.1 URL for crawling, and what does each of these fields
>>> mean? Note: for the fields that are obvious (or are adequately
>>> addressed in the VOS documentation) I have already entered values
>>> below. I would greatly appreciate more information on the fields that
>>> have an *asterisk* with a question in (parentheses).
>>>
>>> - Target description: This is obvious. Name of the content import /
>>> crawl job, etc.
>>> - Target URL: http://domain/path/html file (*does this URL prefer an
>>> XML sitemap for RDFa, or can it explicitly point directly to an HTML
>>> file containing RDFa? I also have content negotiation set up on the
>>> external server where the RDFa/HTML is hosted, as it also serves
>>> JSON-LD, RDF/XML, and Turtle serializations, but I would prefer to
>>> only regularly crawl/update based on the HTML/RDFa data for now. I
>>> might have Virtuoso generate the alternate serializations in the
>>> future.*)
>>> - Login name on target: (*if the target URL is an external server,
>>> does this need to be blank?*)
>>> - Login password on target: (*if the target URL is an external server,
>>> does this need to be blank?*)
>>> - Copy to Local DAV collection: (*what does this mean? It seems to
>>> imply that specifying a local DAV collection is required to create a
>>> crawl job, but another option implies that you don't have to store the
>>> data. The two options are conflicting and confusing. From a user
>>> experience perspective, it seems I would either want to store the data
>>> or not. If I don't, then why do I have to specify a local DAV
>>> collection?*)
>>> - Single page download: (*what does this mean?*)
>>> - Local resources owner: dav
>>> - Download only newer than: 1900-01-01 00-00-00
>>> - Follow links matching (delimited with ;): (*what does this do? What
>>> types of "links" are examined?*)
>>> - Do not follow links matching (delimited with ;): (*what does this
>>> do? What types of "links" are examined?*)
>>> - Custom HTTP headers: (*is this required for RDFa? If so, what are
>>> the expected syntax and delimiters? "Accept: text/html"?*)
>>> - Number of HTTP redirects to follow: (*I currently have a 303
>>> redirect in place for content negotiation, but what if this is unknown
>>> or changes in the future? Will it break the crawler job?*)
>>> - XPath expression for links extraction: (*is this applicable for
>>> importing RDFa?*)
>>> - Crawling depth limit: unlimited
>>> - Update Interval (minutes): 0
>>> - Number of threads: (*is this applicable for importing RDFa?*)
>>> - Crawl delay (sec): 0.00
>>> - Store Function: (*is this applicable for importing RDFa?*)
>>> - Extract Function: (*is this applicable for importing RDFa?*)
>>> - Semantic Web Crawling: (*what does this do exactly?*)
>>> - If Graph IRI is unassigned use this Data Source URL: (*what is the
>>> purpose of this? The content can't be imported if a Target is not
>>> specified, right? See also the graph check after this list.*)
>>> - Follow URLs outside of the target host: (*what does this do
>>> exactly?*)
>>> - Follow HTML meta link: (*is this only for HTML/RDFa that specifies
>>> an alternate serialization via the <link> element in the <head>?*)
>>> - Follow RDF properties (one IRI per row): (*what does this do?*)
>>> - Download images:
>>> - Use WebDAV methods: (*what does this mean?*)
>>> - Delete if remove on remote detected: (*what does this mean?*)
>>> - Store documents locally: (*does this only apply to storing the
>>> content in DAV?*)
>>> - Convert Links: (*is this related to another option/field?*)
>>> - Run Sponger: (*does this force the import to use only the sponger
>>> for reading RDFa and populating the DB with the triples?*)
>>> - Accept RDF: (*is this option only for slash-based URIs that return
>>> RDF/XML via content negotiation?*)
>>> - Store metadata: (*what does this mean?*)
>>> - Cartridges: (*I recommend improving the usability of this. At first
>>> I thought perhaps my cartridges were not installed, because the
>>> content area below the "Cartridges" tab was empty. I realized the
>>> cartridges only appear when you click/toggle the "Cartridges" tab. I
>>> suggest they should all be listed by default. Turning their visibility
>>> off by default may prevent users from realizing they are there,
>>> especially given the old documentation.*)
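Regarding the graph-related fields above: a quick way to see where crawled
triples actually land is to list graphs by triple count from isql. This is
a generic diagnostic, not specific to any one job (and it scans the whole
store, so use it sparingly on large databases):

    SPARQL
    SELECT ?g (COUNT(*) AS ?triples)
    WHERE { GRAPH ?g { ?s ?p ?o } }
    GROUP BY ?g
    ORDER BY DESC(?triples)
    LIMIT 20;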
>>>
>>> 5) What do the following cartridge options do? I have only listed the
>>> ones that seem most applicable to running a crawler/sponger import job
>>> for an externally hosted HTML5/RDFa URL.
>>>
>>> - RDF cartridge (*what types of RDF? What does this one do?*)
>>> - RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1
>>> Core? RDFa 1.0? RDFa 1.1 Lite?*)
>>> - WebDAV Metadata
>>> - xHTML
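The registered cartridges live in the sponger's mapper registry, so you
can also list them from SQL. A sketch -- DB.DBA.SYS_RDF_MAPPERS is the
registry table used when cartridges are installed, but the column names
below are from memory and should be verified against your build:

    -- list extractor cartridges with their match patterns and state
    -- (column names assumed; check the table definition first)
    SELECT RM_PATTERN, RM_DESCRIPTION, RM_ENABLED
      FROM DB.DBA.SYS_RDF_MAPPERS;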
--
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
------------------------------------------------------------------------------
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users