Re: [Virtuoso-users] Reification alternative
Hello Aldo,

I'd recommend keeping RDF_QUAD unchanged and using RDF Views to keep n-ary things in separate tables. The reason is that access to RDF_QUAD is heavily optimized; we've never polished any other table to such a degree (and I hope we never will :), and any changes may result in severe penalties in scalability. Triggers should be possible as well, but we haven't tried them, because it is relatively cheap to redirect data manipulations to other tables. Both the file loader and the SPARUL internals are flexible enough that it may be more convenient to change different tables depending on parameters: the loader can call arbitrary callback functions for each parsed triple, and SPARUL manipulations are configurable via the define output:route pragma at the beginning of the query.

In this case there will be no need to write special SQL to triplify data from those wide tables, because RDF Views will do that automatically. Moreover, it is possible to automatically create triggers from RDF Views that will materialize changes in the wide tables into RDF_QUAD (say, if you need inference). So instead of editing RDF_QUAD and letting triggers on RDF_QUAD reproduce the changes in the wide tables, you may edit the wide tables and let triggers reproduce the changes in RDF_QUAD. The second approach is much more flexible, and it promises better performance due to much less activity in triggers. For a cluster, I'd say that the second variant is the only possible one, because fast manipulations with RDF_QUAD are _really_ complicated there.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

On Wed, 2010-10-13 at 12:57 -0300, Aldo Bucchi wrote:

Hi Mirko,

Here's a tip that is a bit software bound, but it may prove useful to keep in mind. Virtuoso's Quad Store is implemented atop an RDF_QUAD table with 4 columns (g, s, p, o). This is very straightforward. It may even seem naive at first glance. ( a table!!? ). Now, the great part is that the architecture is very open.
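A minimal sketch of the second approach described above, assuming a hypothetical wide table NARY_STMT; the trigger body, the fixed graph IRI, and the iri_to_id() conversion calls are illustrative assumptions, not the exact code Virtuoso would generate:

```sql
-- Hypothetical wide table holding n-ary statements plus extra columns.
CREATE TABLE DB.DBA.NARY_STMT (
  S       VARCHAR,   -- subject IRI
  P       VARCHAR,   -- predicate IRI
  O       VARCHAR,   -- object (IRI-valued here, for simplicity)
  AUTHOR  VARCHAR,   -- extra n-ary data
  TS      DATETIME,
  PRIMARY KEY (S, P, O)
);

-- Trigger that reproduces each change in RDF_QUAD, so SPARQL clients
-- see the plain triple while the wide table keeps the extra columns.
CREATE TRIGGER NARY_STMT_I AFTER INSERT ON DB.DBA.NARY_STMT
{
  INSERT SOFT DB.DBA.RDF_QUAD (G, S, P, O)
    VALUES (iri_to_id ('http://example.com/nary'),
            iri_to_id (S), iri_to_id (P), iri_to_id (O));
};
```

A matching delete trigger would remove the corresponding row from RDF_QUAD, keeping the two representations in sync.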
You can actually modify the table directly via SQL statements: insert, delete, update, etc. You can even add columns and triggers to it. Some ideas:

* Keep track of n-ary relations in the same table by using accessory columns ( time, author, etc ).
* Add a trigger and log each add/delete to a separate table where you also store more data.
* When consuming this data, you can use SQL, or you can run a SPARQL CONSTRUCT based on a SQL query, so as to triplify the n-tuple as you wish.

The bottom suggestion here is: take a look at what's possible when you escape SPARQL-only and start working in a hybrid environment ( SQL + SPARQL ). Also note that the self-contained nature of RDF assertions ( facts, statements ) makes it possible to do all sorts of tricks by taking them into 3+ tuple structures.

My coolest experiment so far is a time machine. I log adds and deletes and can recreate the state of the system ( Quad Store ) up to any point in time. Imagine a queue management system where you can replay the state of the system, for example.

Regards,
A
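The time-machine idea above can be sketched as follows; the table and trigger names are hypothetical, and the exact trigger syntax should be checked against the Virtuoso documentation:

```sql
-- Hypothetical audit table: one row per add/delete, with a timestamp.
CREATE TABLE DB.DBA.RDF_QUAD_LOG (
  OP  CHAR,        -- 'I' for insert, 'D' for delete
  TS  DATETIME,
  G   IRI_ID, S IRI_ID, P IRI_ID, O ANY
);

CREATE TRIGGER RDF_QUAD_AUDIT_I AFTER INSERT ON DB.DBA.RDF_QUAD
{
  INSERT INTO DB.DBA.RDF_QUAD_LOG (OP, TS, G, S, P, O)
    VALUES ('I', now (), G, S, P, O);
};

CREATE TRIGGER RDF_QUAD_AUDIT_D AFTER DELETE ON DB.DBA.RDF_QUAD
{
  INSERT INTO DB.DBA.RDF_QUAD_LOG (OP, TS, G, S, P, O)
    VALUES ('D', now (), G, S, P, O);
};
```

Replaying the log rows in TS order up to a chosen instant (applying 'I' rows as inserts and 'D' rows as deletes into an empty table) rebuilds the state of the Quad Store at that point in time.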
Re: [Virtuoso-users] Reification alternative
Aldo,

On Wed, 2010-10-13 at 16:02 -0300, Aldo Bucchi wrote:

From the docs: output:route: works only for SPARUL operators and tells the SPARQL compiler to generate procedure names that differ from the default. As a result, the effect of the operator will depend on the application. That is for tricks. E.g., consider an application that extracts metadata from DAV resources stored in Virtuoso and puts them into the RDF storage to make them visible from outside. When a web application has permissions and credentials to execute a SPARUL query, the changed metadata can be written to the DAV resource (and after that the trigger will update them in the RDF storage), transparently for all other parts of the application.

Where can I find more docs on this feature? ( I don't actually need this, just asking )

Oops, looks like the functions are not yet in the User's Guide. They will appear there soon. To make a custom repository for RDF data usable from SPARUL, one should create two functions: one to deal with inserts or deletes of individually specified triples, and one to manipulate at the graph level, such as for the SPARUL CLEAR GRAPH statement.
If the repository is named NOTARY, then the first function should be named DB.DBA.SPARQL_ROUTE_DICT_CONTENT_NOTARY (due to the types of arguments they get --- triples to insert or delete are passed in DICTionary objects), and the second should be DB.DBA.SPARQL_ROUTE_MDW_NOTARY (MDW stands for mass destruction weapon and warns about the effect that the function under development may produce while not fully debugged). Arguments for both functions are in the same order:

DB.DBA.SPARQL_ROUTE_DICT_CONTENT_NOTARY (
  in graph_to_edit varchar,
  in operation_name varchar,               --- the value passed will be 'INSERT', 'DELETE' or 'MODIFY'
  in storage_name varchar or null,         --- value of define input:storage
  in output_storage_name varchar or null,  --- reserved, now NULL
  in output_format_name varchar or null,   --- value of define output:format
  in dict_of_triples_to_delete,            --- (NULL is passed for INSERT)
  in dict_of_triples_to_insert,            --- (NULL is passed for DELETE)
  NULL,                                    --- reserved
  in uid_and_gs_cbk any,                   --- authentication data (numeric UID or vector of UID and name of application-specific graph security callback function)
  in log_mode integer,
  in report_flag )                         --- 1 if function creates a small result set with human-friendly status report

DB.DBA.SPARQL_ROUTE_MDW_NOTARY (
  in graph_to_edit varchar,
  in operation_name varchar,               --- the value passed will be 'CREATE', 'DROP', or 'CLEAR'
  in storage_name varchar or null,         --- value of define input:storage
  in output_storage_name varchar or null,  --- reserved, now NULL
  in output_format_name varchar or null,   --- value of define output:format
  in aux any,                              --- flags like 'QUIET'
  NULL,                                    --- reserved
  NULL,                                    --- reserved
  in uid_and_gs_cbk any,                   --- authentication data (numeric UID or vector of UID and name of application-specific graph security callback function)
  in log_mode integer,
  in report_flag )                         --- 1 if function creates a small result set with human-friendly status report

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

P.S.
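A skeleton of the first callback, based only on the argument list above; the body is a hypothetical sketch (the branching and status report are assumptions, not Virtuoso's actual generated code):

```sql
CREATE PROCEDURE DB.DBA.SPARQL_ROUTE_DICT_CONTENT_NOTARY (
  in graph_to_edit varchar, in operation_name varchar,
  in storage_name varchar, in output_storage_name varchar,
  in output_format_name varchar,
  in dict_of_triples_to_delete any, in dict_of_triples_to_insert any,
  in reserved any, in uid_and_gs_cbk any,
  in log_mode integer, in report_flag integer )
{
  declare triples any;
  -- Triples arrive packed in DICTionary objects; dict_list_keys() is
  -- the usual way in Virtuoso/PL to turn a dictionary into a vector.
  if (dict_of_triples_to_insert is not null)
    {
      triples := dict_list_keys (dict_of_triples_to_insert, 0);
      -- ... write each triple into the custom NOTARY tables here ...
    }
  if (dict_of_triples_to_delete is not null)
    {
      triples := dict_list_keys (dict_of_triples_to_delete, 0);
      -- ... remove matching rows from the custom NOTARY tables here ...
    }
  if (report_flag)
    result ('Done');  -- small human-friendly status report
};
```

With such functions in place, a SPARUL query would opt in via the pragma at its beginning, roughly `define output:route "NOTARY"`; the exact pragma value syntax should be checked against the documentation.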
As Google shows, WMD is a more popular variant of the abbreviation than MDW, and, ironically, WMD also stands for World Movement for Democracy.
Re: Subjects as Literals
After 7 days of discussion, are there any volunteers to implement this proposal? Or is the idea that you specify the wish and I should implement it (and Kingsley should pay) for an unclear purpose? Sorry, no.

I should remind you one more time: without two scheduled implementations right now, and two complete implementations at CR time, the discussion is just for fun.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
Re: Subjects as Literals
Antoine, all,

On Tue, 2010-07-06 at 20:54 +0100, Antoine Zimmermann wrote:

Not only are there volunteers to implement tools which allow literals as subjects, but there are already implementations out there. As an example, take Ivan Herman's OWL 2 RL reasoner [1]. You can put triples with literals as subject, and it will reason with them. Here at DERI, we also have prototypes processing generalised triples.

It is absolutely not a problem to add such support in, e.g., Virtuoso as well: 1 day for the non-clustered version + 1 more day for the cluster. But it will naturally kill the scalability. Literals in the subject position mean either outlining all literals or switching from bitmap indexes to plain ones, and at the same time it blocks important query rewriting. We have seen triple store benchmark reports where the winner is up to 120 times faster than the loser, and nevertheless all participants are in widespread use. With these reports in mind, I can make two forecasts.

1. RDF is so young that even an epic fail like this feature would not immediately throw an implementation off the market.
2. It will throw it off later.

Other reasoners are dealing with literals as subjects. RIF implementations are also able to parse triples with literals as subjects, as required by the spec. ... Some people mentioned scalability issues when we allow literals as subject. It might be detrimental to the scalability of query engines over big triple stores, but allowing literals as subjects is perfectly scalable when it comes to inference materialisation (see recent work on computing the inference closure of 100 billion triples [2]).

Reasoners should get data from some place and put it into the same or some other place. There are three sorts of inputs: triple stores with real data, dumps of real data, and synthetic benchmarks like LUBM. There are two sorts of outputs: triple stores for real data, and papers with nice numbers.
Without an adequate triple store infrastructure at both ends (or inside), any reasoner is simply unusable. [2] compares a reasoner that cannot answer queries after preparing the result with a store that works longer but is capable of doing something for its multiple clients immediately after completing its work. If this is the best-achieved and most complete result, then volunteers are still required.

Considering this amount of usage and use cases, which is certainly meant to grow in the future, I believe that it is time to standardise generalised RDF.

http://en.wikipedia.org/wiki/Second-system_effect

There were generalised RDFs before simple RDF came on the scene. Minsky --- frames and slots. Winston --- knowledge graphs that are only a bit more complicated than RDF. The fate of these approaches is known: great impact on science, little use in industry.

A possible compromise would be to define RDF 2 as /generalised RDF + named graphs + deprecate stuff/, and have a sublanguage (or profile) RDF# which forbids literals in subject and predicate positions, as well as bnodes in predicate position.

Breaking a small market into two incompatible parts is as bad as asking my mom what she would like to use on her netbook, ALSA or OSS. She doesn't know (nor do I), and she doesn't want to choose which half of her sound applications will crash.

Honestly, it's just about putting a W3C stamp on things that some people are already using and doing.

If people are living in love and happiness without a stamp on a paper, it does not mean they are living in sin ;) Similarly, people may use literals as subjects without asking others and without any stamp.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

[2] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, and Henri Bal. OWL reasoning with WebPIE: calculating the closure of 100 billion triples. In proceedings of ESWC 2010.
Re: RDF and its discontents
The oldest etiquette rule is: a mammoth should be eaten in parts, not as a whole. Can't that be a slogan for the semweb activity? Step by step, no promises of silver bullets, with attention to benchmarks, legacy interfaces/protocols, existing data sources, etc.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
Re: Subjects as Literals, [was Re: The Ordered List Ontology]
On Fri, 2010-07-02 at 08:50 +0100, Graham Klyne wrote:

[cc's trimmed]

I'm with Jeremy here; the problem is economic, not technical. If we could introduce subjects-as-literals in a way that: (a) doesn't invalidate any existing RDF, and (b) doesn't permit the generation of RDF/XML that existing applications cannot parse, then I think there's a possible way forward.

Yes, there is such a way; moreover, it would be just a small subset of a richer language that is widely used already, so it would require very little coding. When there's a need for SPARQL-like extensions, like subjects-as-literals, there's a straightforward way of serializing data as SPARQL fragments. Note that in this way not only subjects-as-literals become available, but even SPARQL 1.1 features like path expressions, or "for each X" expressed as ?X in an appropriate place of an appropriate BGP.

Best Regards,

Ivan Mikhailov, OpenLink Software
http://virtuoso.openlinksw.com
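A sketch of the serialize-as-SPARQL-fragments idea, with hypothetical IRIs: a triple that cannot be written in standard Turtle or RDF/XML (here one with a literal subject) is shipped as a SPARQL 1.1 INSERT DATA fragment, which a processor that accepts generalised triples can consume:

```sparql
# Hypothetical example. A literal subject is illegal in standard
# Turtle/RDF-XML serializations, but fits naturally in a SPARQL-style
# fragment if the receiving processor allows generalised triples.
PREFIX ex: <http://example.com/>
INSERT DATA {
  GRAPH ex:g {
    "42" ex:isAnswerTo ex:theQuestion .
  }
}
```

Existing applications that cannot parse such fragments are simply never handed them, which is what makes this an extension rather than a breaking change.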
Re: Subjects as Literals, [was Re: The Ordered List Ontology]
On Fri, 2010-07-02 at 12:42 +0200, Richard Cyganiak wrote:

Hi Yves,

On 2 Jul 2010, at 11:15, Yves Raimond wrote:

I am not arguing for each vendor to implement that. I am arguing for removing this arbitrary limitation from the RDF spec. Also marked as an issue since 2000: http://www.w3.org/2000/03/rdf-tracking/#rdfms-literalsubjects

The demand that W3C modify the specs to allow literals as subjects should be rejected on a simple principle: those who demand that change, including yourself, have failed to put their money where their mouth is. Where is the alternative specification that documents the syntactic and semantic extension? Where are the proposed RDF/XML++ and RDFa++ that support literals as subjects? Where are the patches to Jena, Sesame, Redland and ARC2 that support these changes?

+1, with a small correction: I'd expect a patch for Virtuoso as well ;)

Actually, the approval of a new spec will require two adequate implementations. I can't imagine that existing vendors will decide to waste their time making their products worse in terms of speed, disk footprint and scalability. The most efficient criticism is sabotage, you know. Some new vendor may of course try to become a strikebreaker, but his benchmark runs will look quite poor, because others will continue to optimize any SPARQL BGP like

  ?s ?p ?o . ?o ?p2 ?s2 .

into the more selective

  ?s ?p ?o . FILTER (isREFERENCE (?o)) . ?o ?p2 ?s2 .

and this sort of rewriting will easily bring them two orders of magnitude of speed on a simple query with fewer than 10 triple patterns. Keeping in mind that Bio2RDF people tend to write queries with 20-30 triple patterns, mostly connected into long chains, the speed difference on real-life queries will be a blocking issue.

The discussion is quite long; I'm sorry I can't continue to track it accurately, as I'm on the critical path of a new Virtuoso release.
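The rewrite mentioned above, written out as complete queries (the prefix is hypothetical; isREFERENCE is the test named in the original message). The inserted FILTER is only a valid rewrite because ?o, appearing in subject position in the second pattern, can never be a literal under current RDF; legalizing literal subjects removes that guarantee and, with it, the optimization:

```sparql
PREFIX ex: <http://example.com/>

# As written by the user:
SELECT ?s2 WHERE { ?s ?p ?o . ?o ?p2 ?s2 . }

# As rewritten by the optimizer: literal bindings of ?o are discarded
# before the (expensive) second join is attempted.
SELECT ?s2 WHERE { ?s ?p ?o . FILTER (isREFERENCE (?o)) . ?o ?p2 ?s2 . }
```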
If somebody is interested in the whole list of reasons why I will not put this feature into the DB core, or the whole list of workarounds for it at the DB application level, or a detailed history of the round Earth and Columbus, then ping me and I'll write a page on the ESW wiki.

Best Regards,

Ivan Mikhailov
OpenLink Virtuoso
http://virtuoso.openlinksw.com
Re: DBpedia hosting burden
Last time I checked (which was quite a while ago, though), loading DBpedia into a normal triple store such as Jena TDB didn't work very well, due to many issues with the DBpedia RDF (e.g., problems with the URIs of external links scraped from Wikipedia).

Agree. Common errors in LOD are:

-- single-quoted and double-quoted strings with newlines;
-- bnode predicates (but the SPARQL processor may ignore them!);
-- variables, but triples with variables are ignored;
-- literal subjects, but triples with them are ignored;
-- '/', '#', '%' and '+' in the local part of a QName (QName with path);
-- invalid symbols between '<' and '>', i.e. in relative IRIs.

That's why my own TURTLE parser is configurable to selectively report or ignore these errors. In addition, I can relax the TURTLE syntax to accept popular violations like redundant delimiters, and/or try to recover from lexical errors as much as possible, even if I then lose some ill-formed triples together with a limited number of proper triples around them (GIGO mode, for Garbage In Garbage Out).

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
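A tiny, deliberately broken Turtle fragment (hypothetical IRIs) showing several of the violations listed above; a strict parser rejects the whole file, while a lenient, GIGO-style parser can keep the first triple and skip the rest:

```turtle
@prefix ex: <http://example.com/> .

ex:good ex:p "a perfectly fine triple" .
ex:s ex:comment "a string with a raw
newline" .                     # newline inside a quoted string
ex:path/extra ex:p ex:o .      # '/' in the local part of a QName
"literal" ex:p ex:o .          # literal subject
ex:s ex:p <relative iri> .     # invalid symbols between '<' and '>'
```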
Re: [semanticweb] ANN: DBpedia 3.5 released
On Tue, 2010-04-13 at 21:58 +0300, ba...@goldmail.de wrote:

A fact of my experience for many years: the homepage of my grandma is more accessible than the flagship(!) of 'linked data', dbpedia.org... Anyone who has used the endpoint dbpedia.org/sparql intensively knows what I mean: after one or two hours or so, it hangs; I try dbpedia.org with Firefox, Opera, IE, and it hangs too; after 5 minutes I try dbpedia.org and see the page; I put my simple query to dbpedia.org/sparql again, and it is OK. For years it has been the same story in the same rhythm.

They say the everlasting problem for professional cosmetics is the growing quality of the optics and media used for movies; celebrities should continue to look perfect. But I can bet you've never paid attention to that fact while looking at the final result. Similarly, growing database size, growing hit rate and growing complexity of queries are not obviously visible from outside, but they turn the hosting into a race. We're improving the underlying RDBMS as fast as we can, just to prevent the service from a total halt. One might wish to provide a better service on their own RDBMS and thus make a good advertisement, but nobody else both wants to and can do that, so we're alone under this load.

If you wish, you may help us with hosting and/or equipment, or simply set up a mirror site, and we would be glad to redirect some part of the load to your cluster. Even an inexpensive EC2 mirror would help to some degree.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
Re: [semanticweb] ANN: DBpedia 3.5 released
Hello Leigh,

Out of interest, do you actually share any metrics on usage levels, common SPARQL queries, etc?

Sorry, I can't help you much. Admins keep some logs for operational purposes, but I'm not authorized to provide them. If you would like to see them, please make an official request to Kingsley, stating the reason for the request, etc.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
Re: DBpedia hosting burden
Dan,

Are there any scenarios around e.g. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs?

If I were The Emperor of LOD, I'd ask all grand dukes of datasources to put fresh dumps on some torrent with control of the UL/DL ratio :) For reasons I can't understand, this idea is proposed a few times per year but never tried.

Another approach is to implement scalable and safe patch/diff on RDF graphs, plus subscription to them. That's what I'm writing ATM. Using this toolkit, it would be quite cheap to place a local copy of LOD on any appropriate box in any workgroup. A local copy will not require any hi-end equipment, for two reasons: the database can be much smaller than the public one (one may install only a subset of LOD), and it will usually be less sensitive to the RAM/disk ratio (a small number of clients will result in better locality, because any given individual tends to browse interrelated data, whereas a crowd produces a chaotic sequence of requests). Crawlers and mobile apps will not migrate to local copies, but some complicated queries will go away from the bottleneck server, and that would be good enough.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com
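One simple way to represent such a graph patch (a hypothetical sketch, not the format the toolkit actually uses) is a pair of SPARQL 1.1 Update operations that a subscriber replays in sequence number order; the graph IRI and the triples are illustrative:

```sparql
# Hypothetical patch for graph <http://example.com/lod-subset>:
# deletes first, then inserts, so replaying patches in order
# reproduces the publisher's graph state.
PREFIX ex: <http://example.com/>

DELETE DATA {
  GRAPH ex:lod-subset { ex:Berlin ex:population "3431700" . }
};
INSERT DATA {
  GRAPH ex:lod-subset { ex:Berlin ex:population "3460725" . }
}
```

The appeal of this shape is that a subscriber needs no special machinery beyond an ordinary SPARQL 1.1 Update endpoint, and patches compose: applying them in order is equivalent to applying their concatenation.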
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hello Eyal,

This benchmark is somewhat unfair, as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc., with given columns), while the triple representation is generic (arbitrary s, p, o). One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from the data on the fly.

That will be our next big extension -- updateable RDF Views, as proposed in http://esw.w3.org/topic/UpdatingRelationalDataViaSPARUL . So we will be able to load BSBM data as RDF and query it via a SPARQL web service endpoint; thus we will masquerade the relational storage entirely.

Best Regards,

Ivan Mikhailov, OpenLink Software
http://virtuoso.openlinksw.com
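The intent of updateable RDF Views, sketched with entirely hypothetical IRIs: a SPARUL insert against the mapped graph would be translated into SQL against the underlying relational table (an assumed Products table here), so the client never touches the relational schema:

```sparql
# Hypothetical: the graph <http://example.com/bsbm> is an updateable
# RDF View over a relational Products table; this insert would be
# rewritten into an SQL INSERT on that table.
PREFIX ex: <http://example.com/>
INSERT DATA {
  GRAPH ex:bsbm {
    ex:Product12345  ex:label     "A new product" ;
                     ex:producer  ex:Producer7 .
  }
}
```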
RE: BSBM With Triples and Mapped Relational Data in Virtuoso
Hello Andy,

SPARQL already has parameterized queries! Because it has explicit (named) variables, these can be used to set variables scoped just outside the query string.

I'd like to know in advance that some variables should be bound in the environment, so I'd like some way to distinguish between local and external variables. Indeed, it is technically possible to compile the statement once for each distinct set of variables in the environment; I simply do not like that :) Tastes differ, of course, but I'd like to have the environment double-checked and to be able to compare the expected and actual lists of passed parameters. Yes, that sort of check will add extra syntax (or extra pragmas) to the language, making queries less compact, but I prefer to type a few more characters if this results in accurate diagnostics of a protocol error in a distributed system.

Best Regards,

Ivan Mikhailov, OpenLink Software.
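The difference can be sketched with a purely hypothetical pragma (`define input:param` below is invented for illustration, not an existing feature): in the implicit form, nothing marks ?minPrice as external, so a caller's typo silently yields an unbound variable and an empty result; an explicit declaration would let the compiler reject a call whose parameter list doesn't match:

```sparql
# Implicit: ?minPrice is just a free variable. If the caller binds
# ?minprice (wrong case) instead, the query still runs and returns nothing.
SELECT ?p WHERE { ?p <http://example.com/price> ?x . FILTER (?x >= ?minPrice) }

# Hypothetical explicit form: the engine can verify that exactly one
# external parameter named ?minPrice was supplied, and report a protocol
# error otherwise.
# define input:param "?minPrice"
SELECT ?p WHERE { ?p <http://example.com/price> ?x . FILTER (?x >= ?minPrice) }
```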