Hi All,

Rather than wait for someone to tell me the right options among all of the different options available for a crawl job targeting RDFa, I will post the options I selected. Perhaps someone can tell me what I'm missing or did wrong. I'm running Version: 07.20.3214.

Here is what I entered under "Web Application Server > Content Imports":

- Target Description: ADL Verbs (RDFa/HTML)
- Target URL: http://xapi.vocab.pub/datasets/adl/verbs/index.html
- Copy to Local DAV collection: DAV/home/dba/rdf_sink
- Number of redirects to follow: 1
- Update interval: 10

Checked the following:
 X Run Sponger
 X Store Metadata

Cartridges selected:
 X RDFa

After I created the crawl job, I went to "Import Queues" and clicked "Run". I received the following message:

  Results for xapi.vocab.pub
  errors while retrieving target. Select "reset" to return initial state
  Total URLs processed : 1
  Download finished

I also checked "Retrieved Sites" and 0/1 were downloaded. Where do I find out which error was encountered while retrieving the target? Thanks!
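In case anyone else needs to debug the same thing, I have also been poking at the crawler state directly from isql. This is only a sketch based on my reading of the system tables: I'm assuming the crawler records per-URL state in WS.WS.VFS_QUEUE and retrieved pages in WS.WS.VFS_URL, and the column names may have changed between versions.

  -- queue entries and their status for the target host
  -- (the VQ_* column names are assumptions; they may differ by version)
  SELECT VQ_HOST, VQ_URL, VQ_STAT
    FROM WS.WS.VFS_QUEUE
   WHERE VQ_HOST = 'xapi.vocab.pub';

  -- was anything actually retrieved?
  SELECT COUNT(*)
    FROM WS.WS.VFS_URL
   WHERE VU_HOST = 'xapi.vocab.pub';

That tells me which URLs errored, though not why; I have also been tailing virtuoso.log, which is where I would expect the underlying HTTP error to be recorded.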
Also, I'm not sure if this is a bug, but I noticed that when I specify the Local DAV collection as DAV/home/dba/rdf_sink/ and then go to edit the job, it changes to /DAV/DAV/home/dba/rdf_sink/, adding an extra 'DAV' directory. If I use /home/dba/rdf_sink/ as the local path when creating the crawl job, it won't accept it without adding the /DAV/ path in front. So it seems to be creating a subdirectory /DAV/ under /DAV/ when it shouldn't be.

Regards,

J Haag
-------------------------------------------------------
Advanced Distributed Learning Initiative
+1.850.266.7100 (office)
+1.850.471.1300 (mobile)
jhaag75 (skype)
http://linkedin.com/in/jasonhaag

On Thu, Oct 15, 2015 at 10:10 AM, Haag, Jason <jason.haag....@adlnet.gov> wrote:
>
> Just touching base on this post... I suspect this one was TL;DR.
>
> -------------------------------------------------------
> Advanced Distributed Learning Initiative
> +1.850.266.7100 (office)
> +1.850.471.1300 (mobile)
> jhaag75 (skype)
> http://linkedin.com/in/jasonhaag
>
> On Thu, Oct 8, 2015 at 10:39 AM, Haag, Jason <jason.haag....@adlnet.gov> wrote:
>>
>> Hi All,
>>
>> I'm posting these questions to the users group so that other Virtuoso users interested in importing RDFa-based content into Virtuoso might also benefit from the responses. Please let me know if any of these questions should instead be submitted as issues or feature requests on GitHub. The current documentation on importing RDFa documents is a little dated and does not accurately match the Conductor interface for content imports. The Conductor interface also doesn't explain what the various fields and options mean. Some of them are not obvious to a new user like myself and might lead to bad assumptions or even cause conflicts in the system. I've been a little confused by what some of the various options mean, and I have also been running an older version (7.2.1). From what I have been told, many of the HTML/RDFa cartridges have been improved since the older version of VOS. Therefore, I would like to ask a few questions and determine exactly what these fields will do (or not do) before I make any mistakes or assumptions. Thank you to Hugh, Tim, and Kingsley for all of the excellent advice so far. I truly appreciate your support and patience with all of my questions.
>>
>> I'm currently running a new build of Virtuoso Open Source, Version: 07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015, on Debian/Ubuntu.
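>>
>> (As an aside, since I have had more than one build around: I confirm which build an instance is actually running from isql. A minimal sketch, assuming sys_stat('st_dbms_ver') is still the right stat name for the version string:
>>
>>   -- returns the server version string, e.g. 07.20.3214
>>   SELECT sys_stat ('st_dbms_ver');
>>
>> )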
>>
>> For my use case, we will have several (potentially 50 or more) HTML5 / RDFa 1.1 (Core) pages available on one external server/domain, and we would like to regularly "sponge" or "crawl" these URIs, since these datasets expressed in HTML/RDFa may be updated or even grow from time to time. They will also become more decentralized and available on multiple external servers, so Virtuoso seems like the perfect solution for automatically crawling all of these external sources of RDFa controlled-vocabulary datasets (as well as for many other future objectives we have for RDF).
>>
>> Here are my questions (perhaps some of the answers can be used for FAQs, etc.):
>>
>> 1) Does Virtuoso support crawling external domains and servers for the Target URL if the target is HTML5/RDFa, or must the documents be imported into DAV first?
>>
>> 2) Am I always required to specify a local DAV collection for sponging and crawling RDFa, even if I don't want to store the RDFa/HTML locally?
>>
>> 3) If yes to #2, when I use dav (or dba) as the owner and the rdf_sink folder to store the crawled RDFa/HTML, are there any special permissions or configurations required on the rdf_sink folder? Here are the default configuration settings for rdf_sink:
>>
>> Main Tab:
>> - Folder Name: (rdf_sink)
>> - Folder Type: Linked Data Import
>> - Owner: dav (or dba)
>> - Permissions: rw-rw----
>> - Full Text Search: Recursively
>> - Default Permissions: Off
>> - Metadata Retrieval: Recursively
>> - Apply changes to all subfolders and resources: unchecked
>> - Expiration Date: 0
>> - WebDAV Properties: No properties
>>
>> Sharing Tab:
>> - ODS users/groups: No Security
>> - WebID users: No WebID Security
>>
>> Linked Data Import Tab:
>> - Graph name: urn:dav:home:dav:rdf_sink
>> - Base URI: http://host:8890/DAV/home/dba/rdf_sink/
>> - Use special graph security (on/off): unchecked
>> - Sponger (on/off): checked
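>>
>> (After each crawl attempt I check whether anything actually landed in the sink graph, using the graph name from the Linked Data Import tab above. A minimal sketch from isql; the only Virtuoso-specific bit is the SPARQL prefix for running SPARQL through SQL:
>>
>>   -- count triples in the rdf_sink graph; 0 means nothing was imported
>>   SPARQL SELECT COUNT(*) FROM <urn:dav:home:dav:rdf_sink> WHERE { ?s ?p ?o };
>>
>> )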
>>
>> 4) When importing content using the crawler + sponger feature, I navigate to "Conductor > Web Application Server > Content Imports" and click the "New Target" button.
>>
>> Which of the following fields should I use to specify an external HTML5/RDFa 1.1 URL for crawling, and what does each of these fields mean? Note: For the fields that are obvious (or are adequately addressed in the VOS documentation) I have already entered values below. I would greatly appreciate more information on those fields that have an *asterisk* with a question in (parentheses).
>>
>> - Target description: This is obvious. Name of the content import / crawl job, etc.
>> - Target URL: http://domain/path/html file (*does this URL prefer an XML sitemap for RDFa, or can it point directly to an HTML file? I also have content negotiation set up on the external server where the RDFa/HTML is hosted, as it also serves JSON-LD, RDF/XML, and Turtle serializations, but I would prefer to only regularly crawl/update based on the HTML/RDFa data for now. I might have Virtuoso generate the alternate serializations in the future.*)
>> - Login name on target: (*if the Target URL is an external server, does this need to be blank?*)
>> - Login password on target: (*if the Target URL is an external server, does this need to be blank?*)
>> - Copy to Local DAV collection: (*what does this mean? It seems to imply that specifying a local DAV collection is required to create a crawl job, but another option implies that you don't have to store the data. The two options seem to conflict, which is confusing. From a user experience perspective, it seems I would either want to store the content or not. If I don't, then why do I have to specify a local DAV collection?*)
>> - Single page download: (*what does this mean?*)
>> - Local resources owner: dav
>> - Download only newer than: 1900-01-01 00-00-00
>> - Follow links matching (delimited with ;): (*what does this do? What types of "links" are examined?*)
>> - Do not follow links matching (delimited with ;): (*what does this do? What types of "links" are examined?*)
>> - Custom HTTP headers: (*is this required for RDFa? If so, what are the expected syntax and delimiters? "Accept: text/html"?*)
>> - Number of HTTP redirects to follow: (*I currently have a 303 redirect in place for content negotiation, but what if this is unknown or changes in the future? Will it break the crawl job?*)
>> - XPath expression for links extraction: (*is this applicable to importing RDFa?*)
>> - Crawling depth limit: unlimited
>> - Update Interval (minutes): 0
>> - Number of threads: (*is this applicable to importing RDFa?*)
>> - Crawl delay (sec): 0.00
>> - Store Function: (*is this applicable to importing RDFa?*)
>> - Extract Function: (*is this applicable to importing RDFa?*)
>> - Semantic Web Crawling: (*what does this do exactly?*)
>> - If Graph IRI is unassigned use this Data Source URL: (*what is the purpose of this? The content can't be imported if a Target is not specified, right? See also the sketch at the end of this message.*)
>> - Follow URLs outside of the target host: (*what does this do exactly?*)
>> - Follow HTML meta link: (*is this only for HTML/RDFa that specifies an alternate serialization via the <link> element in the <head>?*)
>> - Follow RDF properties (one IRI per row): (*what does this do?*)
>> - Download images:
>> - Use WebDAV methods: (*what does this mean?*)
>> - Delete if remove on remote detected: (*what does this mean?*)
>> - Store documents locally: (*does this only apply to storing the content in DAV?*)
>> - Convert Links: (*is this related to another option/field?*)
>> - Run Sponger: (*does this force the import to use only the sponger for reading RDFa and populating the DB with the triples?*)
>> - Accept RDF: (*is this option only for slash-based URIs that return RDF/XML via content negotiation?*)
>> - Store metadata *: (*what does this mean?*)
>> - Cartridges: (*I recommend improving the usability of this. At first I thought my cartridges were not installed, because the content area below the "Cartridges" tab was empty. I then realized the cartridges only appear when you click/toggle the "Cartridges" tab. I suggest they all be listed by default: hiding them by default may prevent users from realizing they are there, especially given the old documentation.*)
>>
>> 5) What do the following cartridge options do? I have only listed the ones that seem most applicable to running a crawler/sponger import job for an externally hosted HTML5/RDFa URL.
>>
>> - RDF cartridge (*what types of RDF? What does this one do?*)
>> - RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1 Core? RDFa 1.0? RDFa 1.1 Lite?*)
>> - WebDAV Metadata
>> - xHTML
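>>
>> (One more sketch, related to the Graph IRI question above: when I am not sure which graph the sponger wrote into, I list candidate graphs by host name from isql. This assumes the sponger names graphs after the source URL, which I have not confirmed, and that this build's SPARQL supports the 1.1 CONTAINS function:
>>
>>   -- list graphs whose IRI mentions the target host
>>   -- (a full scan over all graphs; fine on a small store, slow on a big one)
>>   SPARQL SELECT DISTINCT ?g
>>   WHERE { GRAPH ?g { ?s ?p ?o } FILTER ( CONTAINS ( STR (?g), 'xapi.vocab.pub' ) ) };
>>
>> If the RDFa cartridge fired, I would expect at least one graph whose name contains the target host.)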