Hi All,

Rather than wait for someone to walk me through all of the different
options available for a crawl job targeting RDFa, I'll post the
options I selected. Perhaps someone can tell me what I'm missing or
did wrong. I'm running Version: 07.20.3214.

Here is what I entered under "Web Application Server > Content Imports":

Target Description: ADL Verbs (RDFa/HTML)
Target URL: http://xapi.vocab.pub/datasets/adl/verbs/index.html
Copy to Local DAV collection: DAV/home/dba/rdf_sink
Number of redirects to follow: 1
Update interval: 10
Checked the following:
X Run Sponger
X Store Metadata

Cartridges Selected:
X RDFa
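
(As a sanity check independent of the crawl job: if I'm reading the
sponger docs correctly, the get:soft "replace" pragma makes the SPARQL
processor dereference the FROM graph and run the cartridges over it,
so this one-liner from isql, run as dba, should show whether the RDFa
cartridge can extract any triples from the page at all:

    -- sponges the page into a graph named after the URL and counts
    -- whatever triples the cartridges extracted
    SPARQL
    DEFINE get:soft "replace"
    SELECT COUNT(*)
    FROM <http://xapi.vocab.pub/datasets/adl/verbs/index.html>
    WHERE { ?s ?p ?o };

If that count is 0, the problem would seem to be in the cartridge
rather than in the crawl job settings.)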

After I created the crawl job, I went to "import queues" and clicked "run".

I received the following message:

Results for xapi.vocab.pub
errors while retrieving target. Select "reset" to return initial state
Total URLs processed : 1
Download finished

I also checked "retrieved sites" and 0/1 were downloaded.

Where can I find the error that was encountered while retrieving the
target? Thanks!
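
(In case it helps pin this down: my possibly wrong reading of the docs
is that the crawler keeps its work list in the WS.WS.VFS_QUEUE table,
so something like the following from isql should at least show the
per-URL status. The column names are my assumption:

    -- assumed schema: one row per queued URL, with VQ_STAT holding
    -- the crawl state and VQ_TS the last-touched timestamp
    SELECT VQ_HOST, VQ_URL, VQ_STAT, VQ_TS
      FROM WS.WS.VFS_QUEUE
     WHERE VQ_HOST = 'xapi.vocab.pub';

Even if that shows the state, I'd still like to know where the actual
error text, if any, gets logged.)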

Also, I'm not sure if this is a bug, but I noticed that when I specify
the Local DAV collection DAV/home/dba/rdf_sink/ and then go back to
edit the job, the path changes to /DAV/DAV/home/dba/rdf_sink/, i.e. a
second 'DAV' directory is added.

If I use /home/dba/rdf_sink/ as the local path when creating the crawl
job, the form won't accept it without a /DAV/ path in front of it. So
it seems the editor is creating a subdirectory /DAV/ under /DAV/ when
it shouldn't be.
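
(Again assuming I have the schema right, the saved targets should be
visible in WS.WS.VFS_SITE, which would show exactly what the editor
stored as the local collection:

    -- assumed schema: VS_ROOT is the local DAV collection the job
    -- copies retrieved documents into
    SELECT VS_DESCR, VS_HOST, VS_URL, VS_ROOT
      FROM WS.WS.VFS_SITE
     WHERE VS_HOST = 'xapi.vocab.pub';

If VS_ROOT comes back as /DAV/DAV/home/dba/rdf_sink/, that would
confirm the editor is double-prefixing the path.)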

Regards,

J Haag
-------------------------------------------------------
Advanced Distributed Learning Initiative
+1.850.266.7100(office)
+1.850.471.1300 (mobile)
jhaag75 (skype)
http://linkedin.com/in/jasonhaag


On Thu, Oct 15, 2015 at 10:10 AM, Haag, Jason <jason.haag....@adlnet.gov> wrote:
>
> Just touching base on this post... I suspect this one was TL;DR.
>
> -------------------------------------------------------
> Advanced Distributed Learning Initiative
> +1.850.266.7100(office)
> +1.850.471.1300 (mobile)
> jhaag75 (skype)
> http://linkedin.com/in/jasonhaag
>
> On Thu, Oct 8, 2015 at 10:39 AM, Haag, Jason <jason.haag....@adlnet.gov>
> wrote:
>>
>> Hi All,
>>
>> I'm posting these questions to the users group so that other
>> Virtuoso users interested in importing RDFa-based content might also
>> benefit from the responses. Please let me know if any of these
>> questions should instead be submitted as issues or feature requests
>> on GitHub. The current documentation on importing RDFa documents is
>> a little dated and does not accurately match the Conductor interface
>> for content imports. The Conductor interface also doesn't explain
>> what the various fields and options mean. Some of them are not
>> obvious to a new user like me and might lead to bad assumptions or
>> even cause conflicts in the system. I've been a little confused by
>> what some of the options mean, and I have also been running an older
>> version (7.2.1). From what I have been told, many of the HTML/RDFa
>> cartridges have been improved since that older version of VOS.
>> Therefore, I would like to ask a few questions and determine exactly
>> what these fields will do (or not do) before I make any mistakes or
>> assumptions. Thank you to Hugh, Tim, and Kingsley for all of the
>> excellent advice so far. I truly appreciate your support and
>> patience with all of my questions.
>>
>> I'm currently running a new build of Virtuoso Open Source, Version:
>> 07.20.3214, develop/7 branch on GitHub, Build: Oct 7 2015 on Debian+Ubuntu.
>>
>> For my use case, we will have several (potentially 50 or more)
>> HTML5 / RDFa 1.1 (Core) pages available on one external
>> server/domain, and we would like to regularly "sponge" or "crawl"
>> these URIs, as the datasets expressed in HTML/RDFa may be updated or
>> even grow from time to time. They will also become more
>> decentralized and available on multiple external servers, so
>> Virtuoso seems like the perfect solution for automatically crawling
>> all of these external sources of RDFa controlled-vocabulary datasets
>> (and for many of our other future RDF objectives as well).
>>
>> Here are my questions (perhaps some of the answers can be used for
>> FAQs, etc.):
>>
>> 1) Does Virtuoso support crawling external domains and servers for
>> the Target URL if the target is HTML5/RDFa, or must the documents
>> first be imported into DAV? (A small smoke test I have in mind for
>> this and #2 is sketched after question 3 below.)
>> 2) Am I always required to specify a local DAV collection for
>> sponging and crawling RDFa, even if I don't want to store the
>> RDFa/HTML locally?
>> 3) If yes to #2, when I use dav (or dba) as the owner and the
>> rdf_sink folder to store the crawled RDFa/HTML, are there any
>> special permissions or configurations required on the rdf_sink
>> folder? Here are the default configuration settings for rdf_sink:
>>
>> Main Tab:
>> - Folder Name: (rdf_sink)
>> - Folder Type: Linked Data Import
>> - Owner: dav (or dba)
>> - Permissions: rw-rw----
>> - Full Text Search: Recursively
>> - Default Permissions: Off
>> - Metadata Retrieval: Recursively
>> - Apply changes to all subfolders and resources: unchecked
>> - Expiration Date: 0
>> - WebDAV Properties: No properties
>>
>> Sharing Tab:
>> - ODS users/groups: No Security
>> - WebID users: No WebID Security
>>
>> Linked Data Import Tab:
>> - Graph name: urn:dav:home:dav:rdf_sink
>> - Base URI: http://host:8890/DAV/home/dba/rdf_sink/
>> - Use special graph security (on/off): unchecked
>> - Sponger (on/off): checked
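>>
>> (The smoke test mentioned under #1: point the sponger at one
>> external URL, then list which graphs actually received triples, to
>> see whether anything had to be stored in DAV at all. This part is
>> plain SPARQL run from isql, so hopefully no wrong assumptions here:
>>
>>     SPARQL
>>     SELECT ?g (COUNT(*) AS ?triples)
>>     WHERE { GRAPH ?g { ?s ?p ?o } }
>>     GROUP BY ?g
>>     ORDER BY DESC (?triples)
>>     LIMIT 20;
>>
>> If the page's triples show up in a graph named after its URL while
>> rdf_sink stays empty, that would answer #2 for me.)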
>>
>> 4) When importing content using the crawler + sponger feature, I
>> navigate to "Conductor > Web Application Server > Content Imports"
>> and click the "New Target" button.
>>
>> Which of the following fields should I use to specify an external
>> HTML5/RDFa 1.1 URL for crawling, and what does each of these fields
>> mean? Note: for the fields that are obvious (or adequately addressed
>> in the VOS documentation) I have already entered values below. I
>> would greatly appreciate more information on the fields marked with
>> an *asterisk* and a question in (parentheses).
>>
>> - Target description: This is obvious. Name of content import / crawl job,
>> etc.
>> - Target URL: http://domain/path/html file (*does this URL prefer an
>> XML sitemap for RDFa, or can it explicitly point directly to an HTML
>> file containing RDFa? I also have content negotiation set up on the
>> external server where the RDFa/HTML is hosted, as it also serves
>> JSON-LD, RDF/XML, and Turtle serializations, but I would prefer to
>> only regularly crawl/update based on the HTML/RDFa data for now. I
>> might have Virtuoso generate the alternate serializations in the
>> future.*)
>> - Login name on target: (*if target URL is an external server, does this
>> need to be blank?*)
>> - Login password on target:  (*if target URL is an external server, does
>> this need to be blank?*)
>> - Copy to Local DAV collection: (*what does this mean? It seems to
>> imply that specifying a local DAV collection is required to create a
>> crawl job, but another option implies that you don't have to store
>> the data. The two options are conflicting and confusing. From a user
>> experience perspective, it seems I would either want to store it or
>> not. If I don't, then why do I have to specify a local DAV
>> collection?*)
>> - Single page download: (*what does this mean?*)
>> - Local resources owner: dav
>> - Download only newer than: 1900-01-01 00-00-00
>> - Follow links matching (delimited with ;): (*what does this do?
>> what types of "links" are examined?*)
>> - Do not follow links matching (delimited with ;): (*what does this
>> do? what types of "links" are examined?*)
>> - Custom HTTP headers: (*is this required for RDFa? If so, what is the
>> expected syntax and delimiters? "Accept: text/html"?*)
>> - Number of HTTP redirects to follow: (*I currently have a 303
>> redirect in place for content negotiation, but what if this is
>> unknown or changes in the future? Will it break the crawl job? A
>> sketch for checking the redirect by hand follows this list.*)
>> - XPath expression for links extraction: (*is this applicable for
>> importing RDFa?*)
>> - Crawling depth limit: unlimited
>> - Update Interval (minutes): 0
>> - Number of threads: (*is this applicable for importing RDFa?*)
>> - Crawl delay (sec): 0.00
>> - Store Function: (*is this applicable for importing RDFa?*)
>> - Extract Function: (*is this applicable for importing RDFa?*)
>> - Semantic Web Crawling: (*what does this do exactly?*)
>> - If Graph IRI is unassigned use this Data Source URL: (*what is the
>> purpose of this? The content can't be imported if a Target URL is
>> not specified, right?*)
>> - Follow URLs outside of the target host:  (*what does this do exactly?*)
>> - Follow HTML meta link: (*is this only for HTML/RDFa that specifies an
>> alternate serialization via the <link> element in the <head>?*)
>> - Follow RDF properties (one IRI per row): (*what does this do?*)
>> - Download images:
>> - Use WebDAV methods: (*what does this mean?*)
>> - Delete if remove on remote detected: (*what does this mean?*)
>> - Store documents locally: (*does this only apply to storing the content
>> in DAV?*)
>> - Convert Links: (*is this related to another option/field?*)
>> - Run Sponger: (*does this force the job to use only the sponger for
>> reading RDFa and populating the DB with the triples?*)
>> - Accept RDF: (*is this option only for slash-based URIs that return
>> RDF/XML via content negotiation?*)
>> - Store metadata *: (*what does this mean?*)
>> - Cartridges: (*I recommend improving the usability of this. At
>> first I thought perhaps my cartridges were not installed, because
>> the content area below the "Cartridges" tab was empty. I then
>> realized the cartridges only appear when you click/toggle the
>> "Cartridges" tab. I suggest they all be listed by default; turning
>> their visibility off by default may prevent users from realizing
>> they are there, especially given the old documentation.*)
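>>
>> (The redirect check mentioned under "Number of HTTP redirects to
>> follow": from isql, something like the following should dump the raw
>> response headers, including any 303 status and its Location header.
>> The http_client_ext parameter names are my best reading of the
>> function docs, so treat this as a sketch, and CHECK_CONNEG is just a
>> hypothetical helper name:
>>
>>     -- hypothetical helper: fetch a URL with an HTML Accept header
>>     -- and dump the raw response headers for inspection
>>     create procedure DB.DBA.CHECK_CONNEG (in u varchar)
>>     {
>>       declare hdr any;
>>       http_client_ext (url => u,
>>                        http_headers => 'Accept: text/html',
>>                        headers => hdr);  -- response header lines
>>       dbg_obj_print (hdr);               -- prints to the server log
>>     };
>>
>>     DB.DBA.CHECK_CONNEG ('http://xapi.vocab.pub/datasets/adl/verbs/index.html');
>>
>> That would at least tell me what the crawler sees before any
>> redirect-following kicks in.)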
>>
>> 5) What do the following cartridge options do? I only listed the ones that
>> seem most applicable to running a crawler/sponger import job for an
>> externally hosted HTML5/RDFa URL.
>>
>> - RDF cartridge (*what types of RDF? what does this one do?*)
>> - RDFa cartridge (*which versions of RDFa are supported? RDFa 1.1
>> Core? RDFa 1.0? RDFa 1.1 Lite?*)
>> - WebDAV Metadata
>> - xHTML
>>
>>
>>
>
