Re: [Virtuoso-users] Reification alternative

2010-10-13 Thread Ivan Mikhailov
Hello Aldo,

I'd recommend keeping RDF_QUAD unchanged and using RDF Views to keep
n-ary things in separate tables. The reason is that access to RDF_QUAD
is heavily optimized; we've never polished any other table to such a
degree (and I hope we never will :), and any changes may result in
severe penalties in scalability. Triggers should be possible as well,
but we haven't tried them, because it is relatively cheap to redirect
data manipulations to other tables. Both the file loader and the SPARUL
internals are flexible enough that it may be more convenient to change
different tables depending on parameters: the loader can call arbitrary
callback functions for each parsed triple, and SPARUL manipulations are
configurable via the define output:route pragma at the beginning of the
query.

In this case there will be no need to write special SQL to triplify
data from those wide tables, because RDF Views will do that
automatically. Moreover, it's possible to automatically create triggers
from RDF Views that will materialize changes in wide tables in RDF_QUAD
(say, if you need inference). So instead of editing RDF_QUAD and
letting triggers on RDF_QUAD reproduce the changes in wide tables, you
may edit wide tables and let triggers reproduce the changes in
RDF_QUAD. The second approach is much more flexible, and it promises
better performance due to much less activity in triggers. For a
cluster, I'd say that the second variant is the only possible thing,
because fast manipulations with RDF_QUAD are _really_ complicated
there.
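
For illustration, a minimal hand-written sketch of the second approach,
assuming a hypothetical wide table DB.DBA.EVENTS and a made-up
vocabulary; the RDF Views code generator would emit something more
complete, covering updates and deletes and one quad per mapped column:

create trigger EVENTS_TO_QUAD after insert on DB.DBA.EVENTS
  referencing new as N
{
  -- mirror the new row into the quad store as one quad
  insert soft DB.DBA.RDF_QUAD (G, S, P, O)
    values
      (iri_to_id ('http://example.org/events'),
       iri_to_id (sprintf ('http://example.org/event/%d', N.EVENT_ID)),
       iri_to_id ('http://example.org/ontology#author'),
       DB.DBA.RDF_MAKE_OBJ_OF_SQLVAL (N.AUTHOR));
};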

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com


On Wed, 2010-10-13 at 12:57 -0300, Aldo Bucchi wrote:
 Hi Mirko,
 
 Here's a tip that is a bit software-bound, but it may prove useful to
 keep in mind.
 
 Virtuoso's Quad Store is implemented atop an RDF_QUAD table with 4
 columns (g, s, p, o). This is very straightforward. It may even seem
 naive at first glance. ( a table!!? )
 
 Now, the great part is that the architecture is very open. You can
 actually modify the table via SQL statements directly: insert, delete,
 update, etc. You can even add columns and triggers to it.
 
 Some ideas:
 * Keep track of n-ary relations in the same table by using accessory
 columns ( time, author, etc ).
 * Add a trigger and log each add/delete to a separate table where you
 also store more data (see the sketch after this list).
 * When consuming this data, you can use SQL, or you can run a SPARQL
 construct based on a SQL query, so as to triplify the n-tuple as you
 wish.
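 
 A minimal sketch of the second idea, with a hypothetical audit table
 and made-up column names (RDF_QUAD's exact column types vary across
 Virtuoso versions, so the log table mirrors them loosely as "any"):
 
 create table DB.DBA.QUAD_LOG (
   QL_TS datetime,
   QL_OP varchar,        -- 'I' for insert, 'D' for delete
   QL_G any, QL_S any, QL_P any, QL_O any );
 
 create trigger RDF_QUAD_LOG_I after insert on DB.DBA.RDF_QUAD
   referencing new as N
 {
   insert into DB.DBA.QUAD_LOG (QL_TS, QL_OP, QL_G, QL_S, QL_P, QL_O)
     values (now (), 'I', N.G, N.S, N.P, N.O);
 };
 
 create trigger RDF_QUAD_LOG_D after delete on DB.DBA.RDF_QUAD
   referencing old as R
 {
   insert into DB.DBA.QUAD_LOG (QL_TS, QL_OP, QL_G, QL_S, QL_P, QL_O)
     values (now (), 'D', R.G, R.S, R.P, R.O);
 };
 
 Replaying QUAD_LOG in timestamp order up to a chosen instant rebuilds
 the store's state at that moment, which is exactly the time machine
 described below.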
 
 The bottom-line suggestion here is: take a look at what's possible when
 you escape SPARQL-only thinking and start working in a hybrid
 environment ( SQL + SPARQL ).
 Also note that the self-contained nature of RDF assertions ( facts,
 statements ) makes it possible to do all sorts of tricks by extending
 them into 3+ tuple structures.
 
 My coolest experiment so far is a time machine. I log adds and deletes
 and can recreate the state of the system ( Quad Store ) at any point
 in time.
 
 Imagine a Queue management system where you can replay the state of
 the system, for example.
 
 Regards,
 A





Re: [Virtuoso-users] Reification alternative

2010-10-13 Thread Ivan Mikhailov
Aldo,

On Wed, 2010-10-13 at 16:02 -0300, Aldo Bucchi wrote:
 From the docs:
 
 output:route: works only for SPARUL operators and tells the SPARQL
 compiler to generate procedure names that differ from the default. As
 a result, the effect of the operator will depend on the application.
 That is for tricks. E.g., consider an application that extracts
 metadata from DAV resources stored in Virtuoso and puts them into RDF
 storage to make them visible from outside. When a web application has
 permissions and credentials to execute a SPARUL query, the changed
 metadata can be written to the DAV resource (and after that a trigger
 will update them in the RDF storage), transparently for all other
 parts of the application.
 
 Where can I find more docs on this feature?
 ( I don't actually need this, just asking )

Oops, looks like these functions are not yet in the User's Guide. They
will appear there soon.

To make a custom repository for RDF data usable from SPARUL, one should
create two functions: one to deal with inserts or deletes of
individually listed triples, and one to manipulate at graph level, as
in the SPARUL CLEAR GRAPH statement. If the repository is named NOTARY,
then the first function should be named
DB.DBA.SPARQL_ROUTE_DICT_CONTENT_NOTARY (DICT_CONTENT due to the types
of arguments it gets --- triples to insert or delete are passed in
DICTionary objects), and the second should be named
DB.DBA.SPARQL_ROUTE_MDW_NOTARY (MDW stands for "mass destruction
weapon" and warns about the effect that the function under development
may produce while not fully debugged).

Arguments for both functions are in the same order:

DB.DBA.SPARQL_ROUTE_DICT_CONTENT_NOTARY (
  in graph_to_edit varchar,
  in operation_name varchar,            --- 'INSERT', 'DELETE' or 'MODIFY'
  in storage_name varchar or null,      --- value of define input:storage
  in output_storage_name varchar or null, --- reserved, now NULL
  in output_format_name varchar or null,  --- value of define output:format
  in dict_of_triples_to_delete,         --- NULL is passed for INSERT
  in dict_of_triples_to_insert,         --- NULL is passed for DELETE
  NULL,                                 --- reserved
  in uid_and_gs_cbk any,                --- authentication data (numeric UID
                                        --- or vector of UID and name of an
                                        --- application-specific graph
                                        --- security callback function)
  in log_mode integer,
  in report_flag )                      --- 1 if the function creates a small
                                        --- result set with a human-friendly
                                        --- status report

DB.DBA.SPARQL_ROUTE_MDW_NOTARY (
  in graph_to_edit varchar,
  in operation_name varchar,            --- 'CREATE', 'DROP', or 'CLEAR'
  in storage_name varchar or null,      --- value of define input:storage
  in output_storage_name varchar or null, --- reserved, now NULL
  in output_format_name varchar or null,  --- value of define output:format
  in aux any,                           --- flags like 'QUIET'
  NULL,                                 --- reserved
  NULL,                                 --- reserved
  in uid_and_gs_cbk any,                --- authentication data (numeric UID
                                        --- or vector of UID and name of an
                                        --- application-specific graph
                                        --- security callback function)
  in log_mode integer,
  in report_flag )                      --- 1 if the function creates a small
                                        --- result set with a human-friendly
                                        --- status report
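
For completeness, a sketch of how a SPARUL query would engage such a
repository, under the assumption that the pragma takes the repository
name as a quoted string (graph IRI and triple are made up):

SPARQL
define output:route "NOTARY"
INSERT IN GRAPH <http://example.org/notary>
  { <http://example.org/doc/1>
      <http://example.org/ontology#signedBy> "Alice" . };

The compiler would then call DB.DBA.SPARQL_ROUTE_DICT_CONTENT_NOTARY
instead of the default insertion procedure.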


Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

P.S. As shown by Google, WMD is a more popular variant of the
abbreviation than MDW and, ironically, WMD also stands for the World
Movement for Democracy.






Re: Subjects as Literals

2010-07-06 Thread Ivan Mikhailov
After 7 days of discussion, are there any volunteers to implement this
proposal? Or will you just specify the wish while I implement it (and
Kingsley pays), for an unclear purpose? Sorry, no.

I should remind you one more time: without two scheduled
implementations right now and two complete implementations at CR time,
the discussion is just for fun.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com





Re: Subjects as Literals

2010-07-06 Thread Ivan Mikhailov
Antoine, all,

On Tue, 2010-07-06 at 20:54 +0100, Antoine Zimmermann wrote:

 Not only are there volunteers to implement tools which allow literals
 as subjects, but there are already implementations out there.
 As an example, take Ivan Herman's OWL 2 RL reasoner [1]. You can put 
 triples with literals as subject, and it will reason with them.
 Here in DERI, we also have prototypes processing generalised triples.

It is absolutely not a problem to add such support to, e.g., Virtuoso
as well: 1 day for the non-clustered version + 1 more day for the
cluster. But it would naturally kill scalability. Literals in the
subject position mean either outlining all literals or switching from
bitmap indexes to plain ones, and at the same time it blocks important
query rewriting.

We have seen triple store benchmark reports where the winner is up to
120 times faster than the loser, and nevertheless all participants are
in widespread use. With these reports in mind, I can make two
forecasts.

1. RDF is so young that even an epic fail like this feature would not
immediately throw an implementation off the market.

2. It will throw it off later.

 Other reasoners are dealing with literals as subjects. RIF 
 implementations are also able to parse triples with literals as 
 subjects, as it is required by the spec.
...
 Some people mentioned scalability issues when we allow literals as 
 subject. It might be detrimental to the scalability of query engines 
 over big triple stores, but allowing literals as subjects is perfectly 
 scalable when it comes to inference materialisation (see recent work on 
 computing the inference closure of 100 billion triples [2]).
 

Reasoners should get data from some place and put it into the same or
another place. There are three sorts of inputs: triple stores with real
data, dumps of real data, and synthetic benchmarks like LUBM. There are
two sorts of outputs: triple stores for real data and papers with nice
numbers. Without adequate triple store infrastructure at both ends (or
inside), any reasoner is simply unusable. [2] compares a reasoner that
cannot answer queries after preparing the result with a store that
works longer but is capable of doing something for its multiple clients
immediately after completing its work. If this is the best achieved and
most complete result, then volunteers are still required.

 Considering this amount of usage and use cases, which is certainly meant 
 to grow in the future, I believe that it is time to standardise 
 generalised RDF.

http://en.wikipedia.org/wiki/Second-system_effect

There were generalised RDFs before plain RDF came on the scene. Minsky
--- frames and slots. Winston --- knowledge graphs that are only a bit
more complicated than RDF. The fate of these approaches is known: great
impact on science, little use in industry.

 A possible compromise would be to define RDF 2 as /generalised RDF + 
 named graphs + deprecate stuff/, and have a sublanguage (or profile) 
 RDF# which forbids literals in subject and predicate positions, as well 
 as bnodes in predicate position.

Breaking a small market into two incompatible parts is as bad as asking
my mom what she would like to use on her netbook, ALSA or OSS. She
doesn't know (neither do I), and she doesn't want to choose which half
of her sound applications will crash.

 Honestly, it's just about putting a W3C stamp on things that some people 
 are already using and doing.

If people live in love and happiness without a stamp on a paper, it
does not mean they live in sin ;) Similarly, people may use literals as
subjects without asking others and without any stamp.

Best Regards,
Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

 [2] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen,
 and Henri Bal. OWL reasoning with WebPIE: calculating the closure of
 100 billion triples. In Proceedings of ESWC 2010.





Re: RDF and its discontents

2010-07-05 Thread Ivan Mikhailov
 in the metadata of the view.



The oldest etiquette rule is "A mammoth should be eaten in parts, not
as a whole." Couldn't that be a slogan for the semweb activity? Step by
step, no promises of silver bullets, with attention to benchmarks,
legacy interfaces/protocols, existing data sources, etc.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com






Re: Subjects as Literals, [was Re: The Ordered List Ontology]

2010-07-02 Thread Ivan Mikhailov
On Fri, 2010-07-02 at 08:50 +0100, Graham Klyne wrote:
 [cc's trimmed]
 
 I'm with Jeremy here, the problem's economic not technical.
 
 If we could introduce subjects-as-literals in a way that:
 (a) doesn't invalidate any existing RDF, and
 (b) doesn't permit the generation of RDF/XML that existing applications 
 cannot 
 parse,
 
 then I think there's a possible way forward.

Yes, there's such a way; moreover, it would be just a small subset of a
richer language that is widely used already, so it would require very
little coding.

When there's a need for SPARQL-like extensions, such as
subjects-as-literals, there's a straightforward way of serializing data
as SPARQL fragments. Note that in this way not only
subjects-as-literals become available, but even SPARQL 1.1 features
like path expressions, or "for each X" expressed as ?X in an
appropriate place of an appropriate BGP.
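
To make the idea concrete, here is a made-up fragment in that style;
note that standard SPARQL does not allow a literal in the subject
position either, so this relies on the receiving parser accepting the
extension:

PREFIX ex: <http://example.org/>
INSERT DATA {
  "42"^^<http://www.w3.org/2001/XMLSchema#integer>
    ex:isAnswerTo ex:TheQuestion .
}

A consumer without the extension fails at parse time rather than
silently misreading the data, which is the usual trade-off with such
fragments.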

Best Regards,

Ivan Mikhailov,
OpenLink Software
http://virtuoso.openlinksw.com





Re: Subjects as Literals, [was Re: The Ordered List Ontology]

2010-07-02 Thread Ivan Mikhailov
On Fri, 2010-07-02 at 12:42 +0200, Richard Cyganiak wrote:
 Hi Yves,

 On 2 Jul 2010, at 11:15, Yves Raimond wrote:
  I am not arguing for each vendor to implement that. I am arguing for
  removing this arbitrary limitation from the RDF spec. Also marked as
  an issue since 2000:
  http://www.w3.org/2000/03/rdf-tracking/#rdfms-literalsubjects
 
 The demand that W3C modify the specs to allow literals as subjects  
 should be rejected on a simple principle: Those who demand that  
 change, including yourself, have failed to put their money where their  
 mouth is. Where is the alternative specification that documents the  
 syntactic and semantic extension? Where are the proposed RDF/XML++  
 and RDFa++ that support literals as subjects? Where are the patches  
 to Jena, Sesame, Redland and ARC2 that support these changes?

+1, with a small correction: I'd expect a patch for Virtuoso as well ;)

Actually, the approval of a new spec will require two adequate
implementations. I can't imagine that existing vendors would decide to
waste their time making their products worse in terms of speed, disk
footprint, and scalability. The most efficient criticism is sabotage,
you know.

Some new vendor may of course try to become a strikebreaker, but his
benchmark runs will look quite poor, because others will continue to
optimize any SPARQL BGP like

?s ?p ?o .
?o ?p2 ?s2 .

into the more selective

?s ?p ?o . FILTER (isREFERENCE (?o)) .
?o ?p2 ?s2 .

and this sort of rewriting will easily bring them two orders of
magnitude in speed on a simple query with fewer than 10 triple
patterns. Keeping in mind that Bio2RDF people tend to write queries
with 20-30 triple patterns, mostly connected into long chains, the
speed difference on real-life queries will be a blocking issue.

-

The discussion is quite long; I'm sorry I can't continue to track it
accurately, as I'm on the critical path of a new Virtuoso release.

If somebody is interested in
  a whole list of reasons why I will not put this feature into the DB
  core,
or
  a whole list of workarounds for it at the DB application level,
or
  a detailed history of the round Earth and Columbus,
then
  ping me and I'll write a page at the ESW wiki.

Best Regards,

Ivan Mikhailov
OpenLink Virtuoso
http://virtuoso.openlinksw.com





Re: DBpedia hosting burden

2010-04-15 Thread Ivan Mikhailov
 Last time I checked (which was quite a while ago, though), loading
 DBpedia into a normal triple store such as Jena TDB didn't work very
 well due to many issues with the DBpedia RDF (e.g., problems with the
 URIs of external links scraped from Wikipedia).

Agree. Common errors in LOD are:

-- single-quoted and double-quoted strings with newlines;
-- bnode predicates (but a SPARQL processor may ignore them!);
-- variables, but triples with variables are ignored;
-- literal subjects, but triples with them are ignored;
-- '/', '#', '%' and '+' in the local part of a QName (QName with path);
-- invalid symbols between '<' and '>', i.e. in relative IRIs.

That's why my own TURTLE parser is configurable to selectively report
or ignore these errors. In addition, I can relax the TURTLE syntax to
accept popular violations like redundant delimiters, and/or try to
recover from lexical errors as much as possible, even if I then lose
some ill-formed triples together with a limited number of proper
triples around them (GIGO mode, for Garbage In, Garbage Out).
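
In Virtuoso this surfaces as the "flags" argument of the DB.DBA.TTLP
loader function. A minimal sketch, with a made-up file name and graph
IRI; the exact bitmask of tolerated violations is version-dependent, so
treat the flags value as illustrative and check the documentation of
your build:

DB.DBA.TTLP (
  file_to_string_output ('dumps/dbpedia_fragment.ttl'),  -- source text
  '',                              -- base IRI
  'http://example.org/dbpedia',    -- target graph
  255 );                           -- permissive flags (illustrative)

With flags 0 the parser stays strict and reports every violation.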

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com





Re: [semanticweb] ANN: DBpedia 3.5 released

2010-04-14 Thread Ivan Mikhailov
On Tue, 2010-04-13 at 21:58 +0300, ba...@goldmail.de wrote:
  A fact from my experience over many years:
 
  The homepage of my grandma is more accessible than the flagship(!) of
  'linked data', dbpedia.org...

 Someone who has used the endpoint dbpedia.org/sparql intensively
 knows what I mean:
 
 After one or two hours or so, it hangs. I try dbpedia.org with FFox,
 Opera, and IE; it hangs there too. After 5 minutes I try dbpedia.org
 and see the page; for dbpedia.org/sparql I put in my simple query
 again, and it is ok.
 
 For years it has been the same story, in the same rhythm.

They say the everlasting problem for professional cosmetics is the
ever-growing quality of the optics and media used for movies:
celebrities should continue to look perfect. But I can bet you've never
paid attention to that fact while looking at the final result.

Similarly, the growing database size, growing hit rate, and growing
complexity of queries are not obviously visible from outside, but they
turn the hosting into a race. We're improving the underlying RDBMS as
fast as we can, just to keep the service from a total halt. One might
wish to provide a better service on their own RDBMS and thus make a
good advertisement, but nobody else wants to do that _and_ can do that,
so we're alone under this load.

If you wish, you may help us with hosting and/or equipment, or simply
set up a mirror site, and we would be glad to redirect some part of the
load to your cluster. Even an inexpensive $2 mirror would help to some
degree.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com





Re: [semanticweb] ANN: DBpedia 3.5 released

2010-04-14 Thread Ivan Mikhailov
Hello Leigh,

 Out of interest, do you actually share any metrics on usage levels,
 common sparql queries, etc?

Sorry, I can't help you much. Admins keep some logs for operational
purposes, but I'm not authorized to provide them. If you would like to
see them, please make an official request to Kingsley, stating the
reason for the request, etc.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com





Re: DBpedia hosting burden

2010-04-14 Thread Ivan Mikhailov
Dan,

 Are there any scenarios around eg. BitTorrent that could be explored?
 What if each of the static files in http://dbpedia.org/sitemap.xml
 were available as torrents (or magnet: URIs)? I realise that would
 only address part of the problem/cost, but it's a widely used
 technology for distributing large files; can we bend it to our needs?

If I were The Emperor of LOD, I'd ask all grand dukes of data sources
to put fresh dumps on some torrent with control of the UL/DL ratio :)
For a reason I can't understand, this idea is proposed a few times per
year but never tried.

The other approach is to implement a scalable and safe patch/diff on
RDF graphs, plus subscriptions to them. That's what I'm writing ATM.
Using this toolkit, it would be quite cheap to place a local copy of
LOD on any appropriate box in any workgroup. A local copy will not
require any hi-end equipment, for two reasons: the database can be much
smaller than the public one (one may install only a subset of LOD), and
it will usually be less sensitive to the RAM/disk ratio (a small number
of clients results in better locality, because any given individual
tends to browse interrelated data whereas a crowd produces a chaotic
sequence of requests). Crawlers and mobile apps will not migrate to
local copies, but some complicated queries will move away from the
bottleneck server, and that would be good enough.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com




Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-23 Thread Ivan Mikhailov

Hello Eyal,

 this benchmark is somewhat unfair, as the relational stores have one
 advantage compared to the native triple stores: the relational data
 structure is fixed (Products, Producers, Reviews, etc., with given
 columns), while the triple representation is generic (arbitrary
 s, p, o).
 
 One can question whether such flexibility is relevant in practice,
 and if so, one may try to extract such structured patterns from the
 data on-the-fly.

That will be our next big extension -- updateable RDF Views, as
proposed in http://esw.w3.org/topic/UpdatingRelationalDataViaSPARUL .
So we will be able to load BSBM data as RDF and query it via a SPARQL
web service endpoint; thus we will mask the relational storage
entirely.
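
Either way, a client should not be able to tell the difference; a query
like the following (using the BSBM vocabulary namespace as published
with the benchmark, against an assumed endpoint and dataset) would run
unchanged whether the data lives as native triples or as mapped
relational rows:

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?product ?label
WHERE { ?product a bsbm:Product ; rdfs:label ?label . }
LIMIT 10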

Best Regards,

Ivan Mikhailov,
OpenLink Software
http://virtuoso.openlinksw.com





RE: BSBM With Triples and Mapped Relational Data in Virtuoso

2008-08-10 Thread Ivan Mikhailov

Hello Andy,

 SPARQL already has parameterized queries!  Because it has explicit
 (named) variables, these can be used to set variables scoped just
 outside the query string.

I'd like to know in advance that some variables should be bound in the
environment, so I'd like to have some way to distinguish between local
and external variables. Indeed, it is technically possible to compile
the statement once for each distinct set of variables in the
environment; I simply don't like that :) Tastes differ, of course, but
I'd like to have the environment double-checked and be able to compare
the expected and actual lists of passed parameters. Yes, that sort of
check will add extra syntax (or extra pragmas) to the language, making
queries less compact, but I prefer to type a few more chars if this
results in accurate diagnostics of a protocol error in a distributed
system.
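
A sketch of the ambiguity (prefix and names made up): nothing in the
query text below says whether ?maxPrice is a variable to be joined
against the data or a parameter the caller must bind, so a forgotten
binding silently yields an empty result instead of a protocol error:

PREFIX ex: <http://example.org/>
SELECT ?product
WHERE {
  ?product ex:price ?price .
  FILTER (?price < ?maxPrice)  # intended as an external parameter,
                               # but syntactically just another variable
}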

Best Regards,

Ivan Mikhailov,
OpenLink Software.