RE: Identifiers of graphs within spaces

Stephen Bayliss Fri, 10 Feb 2012 04:33:59 -0800

Hi Alessandro

> Good! I'll also be glad if you have any figures to share re 
> the size of 
> the graphs in triples, size of the VM and loading times.


Some indicative timings: these are on a 2GB JVM for Stanbol, on a 64-bit
OpenSUSE 11.4 VM (4GB) running alongside some other VMs (and other server
processes), so these timings are an indication only.

Graph size 72M loading in 72,366ms
Graph size 185M loading in 341,178ms

These timings are of the same order of magnitude as loading the graphs into
a triplestore (Mulgara) - so this performance is looking acceptable to us.
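
As a rough sanity check on the two figures above, here is a minimal sketch of
the throughput arithmetic. It assumes "M" means megabytes of serialized RDF
(the thread does not say so explicitly); the class and method names are made
up for illustration and are not part of Stanbol.

```java
import java.util.Locale;

/** Illustrative only: converts the reported load times into MB per second. */
final class LoadThroughput {

    /** Throughput in MB/s for a graph of sizeMb megabytes loaded in elapsedMs milliseconds. */
    static double mbPerSecond(double sizeMb, long elapsedMs) {
        return sizeMb / (elapsedMs / 1000.0);
    }

    public static void main(String[] args) {
        // Figures taken from the timings quoted above; "M" is assumed to be megabytes.
        System.out.println(String.format(Locale.ROOT, "72M graph:  %.2f MB/s", mbPerSecond(72, 72_366)));
        System.out.println(String.format(Locale.ROOT, "185M graph: %.2f MB/s", mbPerSecond(185, 341_178)));
    }
}
```

Under that assumption the larger graph is about 2.6 times the size but takes
about 4.7 times as long, so loading does not appear to scale linearly on this
(shared) VM.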

I was hoping to supply triple counts, but executing any SPARQL queries (from
the Stanbol SPARQL endpoint) on the above is triggering an out of memory
exception - do you have any suggestions on this? (I seem to recall that you
anticipated that there was likely to be an issue here.)

When I get a chance I'll load them back into Mulgara and get some counts from
there.

> I wouldn't be surprised to see multiple graphs if you were to 
> load the 
> same ontology from multiple input streams, but since you are 
> passing the 
> TcManager to the input source this should not happen.
> 
> I will look into this hopefully later today. Please refer to the same 
> issue you opened

Confirmed this is occurring with the above graphs.  These are RDF graphs in
SKOS format, so you should easily be able to find some data to test with (eg
from http://www.w3.org/2004/02/skos/).

> On another note, there is indeed an issue about defining a policy 
> concerning what should happen if you are submitting an input 
> source with 
> an ontology with the same ontology ID (not the Clerezza graph 
> ID but the 
> IRI of the one owl:Ontology individual in the content) as a 
> stored one. 
> Whether it implies
> 
> (a) the creation of a new graph, as it is now, or
> (b) a graph replacement,
> (c) a brutal, monotonic addition of all triples to the existing graph,
> (d) no action / an exception, or
> (e) some sort of sophisticated (DL-consistent?) merge.

Some good questions, I will have to think about this.  Overall, one would
want to be able to cope with "good" situations where people properly version
their ontologies using different ontology identifiers, and the "bad"
situations where they do not (and therefore there could be a need to manage
two "versions" with the same identifier?).  I'll come back to you on this...
(but it strikes me initially that (a) would cover this scenario, what do you
think?)
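
To make the options concrete, here is a toy sketch of policies (a) to (d)
against an in-memory map. Everything in it (DuplicatePolicy, ToyOntologyStore,
triples as plain strings) is hypothetical and is not the OntoNet API; option
(e) is left out because a real merge needs reasoning that a sketch cannot
fake.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical policies for an incoming ontology whose ontology ID is already stored. */
enum DuplicatePolicy { NEW_GRAPH, REPLACE, APPEND, REJECT }

/** Toy store: ontology ID -> list of graphs, each graph modelled as a set of triple strings. */
final class ToyOntologyStore {
    private final Map<String, List<Set<String>>> graphsById = new HashMap<>();

    void add(String ontologyId, Set<String> triples, DuplicatePolicy policy) {
        List<Set<String>> graphs = graphsById.computeIfAbsent(ontologyId, k -> new ArrayList<>());
        if (graphs.isEmpty()) {            // first time this ID is seen: always store it
            graphs.add(new HashSet<>(triples));
            return;
        }
        switch (policy) {
            case NEW_GRAPH:                // (a) keep the old graph, add another with the same ID
                graphs.add(new HashSet<>(triples));
                break;
            case REPLACE:                  // (b) drop what was stored, keep only the new graph
                graphs.clear();
                graphs.add(new HashSet<>(triples));
                break;
            case APPEND:                   // (c) monotonic addition to the most recent graph
                graphs.get(graphs.size() - 1).addAll(triples);
                break;
            case REJECT:                   // (d) refuse the duplicate
                throw new IllegalStateException("Ontology ID already stored: " + ontologyId);
        }
    }

    /** Graphs currently held for the given ontology ID (empty if none). */
    List<Set<String>> graphs(String ontologyId) {
        return graphsById.getOrDefault(ontologyId, Collections.emptyList());
    }
}
```

With NEW_GRAPH, two submissions under one ontology ID leave two graphs side by
side, which is the "two versions with the same identifier" case; the caller
then has to tell them apart by the graph key rather than the ontology ID.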



Thanks again.
Steve



> -----Original Message-----
> From: Alessandro Adamou [mailto:[email protected]] 
> Sent: 09 February 2012 15:18
> To: [email protected]
> Subject: Re: Identifiers of graphs within spaces
> 
> 
> Hi Steve, good to know your progress is putting the OntoNet 
> changes to 
> good use!
> 
> > I'd like to report back that using the GraphContentInputSource to load
> > our large ontologies is working well now; I'll let you know if anything
> > crops up in further testing.
> 
> Good! I'll also be glad if you have any figures to share re 
> the size of 
> the graphs in triples, size of the VM and loading times.
> 
> > On the identifiers -
> >
> > This seems to work fine, we can capture these and use them 
> to manage 
> > updates and deletes so that the graph can be deleted and added back.
> >
> > But, multiple graphs seem to be created.
> 
> Hmmmmmm
> 
> I wouldn't be surprised to see multiple graphs if you were to 
> load the 
> same ontology from multiple input streams, but since you are 
> passing the 
> TcManager to the input source this should not happen.
> 
> I will look into this hopefully later today. Please refer to the same 
> issue you opened
> 
> https://issues.apache.org/jira/browse/STANBOL-426
> 
> On another note, there is indeed an issue about defining a policy 
> concerning what should happen if you are submitting an input 
> source with 
> an ontology with the same ontology ID (not the Clerezza graph 
> ID but the 
> IRI of the one owl:Ontology individual in the content) as a 
> stored one. 
> Whether it implies
> 
> (a) the creation of a new graph, as it is now, or
> (b) a graph replacement,
> (c) a brutal, monotonic addition of all triples to the existing graph,
> (d) no action / an exception, or
> (e) some sort of sophisticated (DL-consistent?) merge.
> 
> This is still to be thought about. As Acuity, which of these policies 
> would you be happiest about?
> 
> Best regards
> Alessandro
> 
> 
> > Example;  adding an ontology using
> >
> > OntologyInputSource<?, TcProvider> src =
> >     new GraphContentInputSource(is, (String) null, tcManager);
> > String ID = space.addOntology(src);
> >
> > (tcManager grabbed via SCR)
> >
> > This is getting an ID: 
> > ontonet::http://stanbol.apache.org/1328805977033
> >
> > However, querying using SPARQL I am seeing two graphs with the same 
> > content, the additional graph being (in this case):
> >
> > 
> > org.apache.stanbol.ontologymanager.ontonet.api.io.GraphContentInputSource-1328805976740
> >
> > (and the same if I directly do tcManager.listMGraphs() and
> > tcManager.listTripleCollections())
> >
> > Any idea what's going on here?
> >
> > Thanks
> > Steve
> >
> >
> >
> >> -----Original Message-----
> >> From: Stephen Bayliss [mailto:[email protected]]
> >> Sent: 13 January 2012 17:28
> >> To: [email protected]
> >> Subject: RE: Identifiers of graphs within spaces
> >>
> >>
> >> Hi Alessandro
> >>
> >> Thanks very much for this - we're working through the 
> changes.  One 
> >> quick
> >> question:
> >>
> >>> - you can supply the TcProvider to the 
> GraphContentInputSource. If 
> >>> it is the same as the TcManager singleton instance, we skip
> >> copying all the
> >>> triples to yet another Graph. Should take considerably less time
> >> Should we be grabbing the TcProvider with an OSGi SCR @Reference 
> >> annotation, or TcManager.getInstance() ?
> >>
> >> Steve
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Alessandro Adamou [mailto:[email protected]]
> >>> Sent: 11 January 2012 11:59
> >>> To: [email protected]
> >>> Subject: Re: Identifiers of graphs within spaces
> >>>
> >>>
> >>> Dear Steve,
> >>>
> >>> thanks for your feedback and sorry for not coming back to
> >> you earlier
> >>> but I was on vacation until just the other day.
> >>>
> >>> I have committed an update to OntoNet that should address your 
> >>> inquiries:
> >>> - addOntology() on spaces and sessions now returns the 
> String that 
> >>> you can use as a key to identify the ontology in the
> >> OntologyProvider (or
> >>> the graph in the TcManager if you create a UriRef from it).
> >>> - you can export scopes, spaces and sessions as Clerezza 
> objects if 
> >>> needed - does not give you the OWL-oriented view on the
> >> graph but can
> >>> save much computing power. I will probably employ it on 
> the REST API
> >>> - you can supply the TcProvider to the 
> GraphContentInputSource. If 
> >>> it is the same as the TcManager singleton instance, we skip
> >> copying all the
> >>> triples to yet another Graph. Should take considerably 
> less time; on 
> >>> the other hand it prevents this method from being used to *update*
> >>> graphs. Note
> >>> that there are protected binding methods in OntologyInputSource
> >>> implementations for triple providers, physical IRIs etc.
> >>> - other minor optimizations
> >>>
> >>> It would be great to share a benchmarking method to 
> assess network 
> >>> scalability. So far I have managed to load a 200MB RDF/XML graph 
> >>> using a 256MB VM without out-of-memory errors.
> >>>
> >>> Also thanks for the post on the IKS blog (I am telling you here 
> >>> because I don't know if you and Martin are following an 
> IKS mailing
> >>> list)! I am
> >>> working on an adopter-oriented one, and it would be great to
> >>> include an
> >>> overview on the Acuity experience with Stanbol-Fedora - what
> >>> it does and
> >>> what benefit it gets from Stanbol. Would you like to share?
> >>> Unfortunately, I have been able to tell only my side of the
> >> story so
> >>> far, as the link at [1] keeps timing out on me :(
> >>>
> >>> Thanks a lot, keep up the good work!
> >>>
> >>> Alessandro
> >>>
> >>> [1] fedora-stanbol.acuityunlimited.net:18080/orbeon/stanbol-fedora
> >>> /data-browser
> >>>
> >>>
> >>> On 12/30/11 6:08 PM, Stephen Bayliss wrote:
> >>>> Hi Alessandro
> >>>>
> >>>> Thanks very much for your responses.
> >>>>
> >>>>> Dear Steve,
> >>>>>
> >>>>> On 12/19/11 6:22 PM, Stephen Bayliss wrote:
> >>>>>> Our use-case is thus:
> >>>>>>
> >>>>>> 1) Create OntologyContentInputSource(stream)
> >>>>> Perhaps you're better off with a GraphContentInputSource(InputStream),
> >>>>> so it won't have to go through the burden of converting from
> >>>>> OWLOntology to Graph just in order to store it (everything is
> >>>>> stored as Clerezza graphs
> >>>>> anyhow). Note that OWLOntology exports of scopes, spaces and
> >>>>> ontologies
> >>>>> within is possible regardless of the input source
> >>> (although it is THE
> >>>>> bottleneck of the current implementation, see my comment to
> >>>>> STANBOL-433).
> >>>>>
> >>>>> I'm now adding the possibility to specify the TcProvider in the
> >>>>> GraphContentInputSource constructor. This should also save
> >>> the burden
> >>>>> of copying the triples from the in-memory SimpleGraph to
> >> the Graph
> >>>>> stored in the TcManager (IF you pass the TcManager singleton as
> >>> TcProvider).
> >>>> Great, we'll take a look at the GraphContentInputSource and the
> >>>> TcProvider constructor argument.
> >>>>
> >>>>>>       - as our content is behind authentication, the stream
> >>>>> is provided
> >>>>>> by an HTTP client
> >>>>>>       - the content has an identifier (URI) assigned by
> >>> the external
> >>>>>> system (independent of the contents of the stream/ontology)
> >>>>>> 2) Load OntologyInputSource into the space with
> >>>>>> CustomOntologySpace.addOntology(...)
> >>>>>> 3) When updated content comes along:
> >>>>>>       - remove the original (from the store as well as 
> the space)
> >>>>>>       - add the updated content
> >>>>>>
> >>>>>> As the OntologyInputSource was created from a stream, it
> >>>>> doesn't have
> >>>>>> a physical IRI (I think?),
> >>>>> correct
> >>>> Actually logically it does have a physical IRI - the one
> >>> that our HTTP
> >>>> client sourced the input stream from - so if there was an
> >> option to
> >>>> specify the physical IRI somehow, maybe this would in fact
> >>> do the job?
> >>>>>> so at (2) we don't have a "KReS identifier" for it
> >>>>>> - so if we want to replace the ontology in the future with
> >>>>> an updated
> >>>>>> version I can't see currently an easy way of determining which
> >>>>>> ontology to remove from the space and then delete it prior
> >>>>> to adding
> >>>>>> the updated content.
> >>>>> if the ontology is named (i.e. it does have a logical IRI
> >>> even if not
> >>>>> a physical one), you could simply call
> >>>>> OntologyProvider#getKey(logicalIRI), but if you would like
> >>> something
> >>>>> simpler... see my next comment below.
> >>>>>
> >>>>>> I can list the graph keys through the OntologyProvider;
> >>> but I think
> >>>>>> what I need is to know (or be able to set?) the key when
> >>> adding it?
> >>>>> Would it be enough if this key were the return value of
> >>>>> addOntology() ?
> >>>> If there's no logical way of passing in an identifier that
> >>> we wish to
> >>>> use for the graph, then I think this would do the job; we
> >>> can maintain
> >>>> our own map/index of the graph keys vs the content
> >>> provider's URIs for
> >>>> these graphs.
> >>>>
> >>>>
> >>>>>> Also I can see that if I get the TcProvider I can do a
> >>>>>> .deleteTripleCollection(UriRef ref) - how would this
> >>> UriRef link in
> >>>>>> with the above (when I look at the identifiers of the 
> ontologies
> >>>>> retrieved using the keys from listGraphs, these are
> >>>>>> "Anonymous-xyz" and don't have an IRI).
> >>>>> I'll have to look into this one, fortunately I've still
> >>> got some time
> >>>>> on it.
> >>>> Great, thanks!
> >>>>
> >>>>> All the best,
> >>>>>
> >>>>> Alessandro
> >>>>>
> >>>>> --
> >>>>> M.Sc. Alessandro Adamou
> >>>>>
> >>>>> Alma Mater Studiorum - Università di Bologna
> >>>>> Department of Computer Science
> >>>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
> >>>>>
> >>>>> Semantic Technology Laboratory (STLab)
> >>>>> Institute for Cognitive Science and Technology (ISTC) National
> >>>>> Research Council (CNR) Via Nomentana 56, 00161 Rome - Italy
> >>>>>
> >>>>>
> >>>>> "As for the charges against me, I am unconcerned. I am
> >>> beyond their
> >>>>> timid, lying morality, and so I am beyond caring." (Col.
> >> Walter E.
> >>>>> Kurtz)
> >>>>>
> >>>>>
> >>>
> >>> --
> >>> M.Sc. Alessandro Adamou
> >>>
> >>> Alma Mater Studiorum - Università di Bologna
> >>> Department of Computer Science
> >>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
> >>>
> >>> Semantic Technology Laboratory (STLab)
> >>> Institute for Cognitive Science and Technology (ISTC) National
> >>> Research Council (CNR) Via Nomentana 56, 00161 Rome - Italy
> >>>
> >>>
> >>> "As for the charges against me, I am unconcerned. I am beyond
> >>> their timid, lying morality, and so I am beyond caring."
> >>> (Col. Walter E. Kurtz)
> >>>
> >>>
> >>
> >
> 
> 
> -- 
> M.Sc. Alessandro Adamou
> 
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
> 
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
> 
> 
> "As for the charges against me, I am unconcerned. I am beyond 
> their timid, lying morality, and so I am beyond caring."
> (Col. Walter E. Kurtz)
> 
> Not sent from my iSnobTechDevice
> 
>
