Sorry for the delay
On 25/02/13 12:42, Dick Murray wrote:
Questions regarding the memory footprint and
SPARQL_QueryGeneral.MaxTriples (100*1000), which is a static final int.
To be clear - this is only used when loading data as part of FROM/FROM
NAMED. In that case, an in-memory graph/dataset is used. And reading
in data just to query it and throw it away is quite time-consuming.
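For example, a query like this sent to Fuseki makes the server fetch the
graph into a fresh in-memory model just for that one execution (a sketch;
the service and data URLs are placeholders):

import com.hp.hpl.jena.query.* ;

String service = "http://localhost:3030/ds/query" ;   // hypothetical endpoint
String qs = "SELECT * FROM <http://example/data.ttl> WHERE { ?s ?p ?o }" ;
QueryExecution qExec = QueryExecutionFactory.sparqlService(service, qs) ;
ResultSetFormatter.out(qExec.execSelect()) ;
qExec.close() ;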
The execute method on SPARQL_Query calls decideDataset(action, query,
queryStringLog) to return the Dataset against which to execute the query.
This in turn builds a Dataset from a DatasetDescription, which loops
through graphURLs and namedGraphs. Each iteration creates a default Model
(in memory) and loads in triples using a SinkTriplesToGraph via the
RiotReader. I'm assuming this uses the sink send method (I got lost in the
interface when tracing the hierarchy)?
Yes
I'm assuming that the graphs/triples aren't duplicated? But there is an
overhead as the triples are "sinked"?
No (not duplicated)
It should not be an overhead - it's one extra method call on each triple.
As of Fuseki 0.2.6, this is now a StreamRDF, much the same thing as a
sink, but it models the output of parsers better. All parsing now
outputs via a StreamRDF and there are a myriad of implementations, from
ones that put the output in a graph to ones that print directly (so you
can have streaming parse-to-print).
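For instance, streaming parse-to-print, assuming StreamRDFLib.writer
(which writes whatever it receives as N-triples) and a local file "data.ttl":

import org.apache.jena.riot.RDFDataMgr ;
import org.apache.jena.riot.system.StreamRDF ;
import org.apache.jena.riot.system.StreamRDFLib ;

StreamRDF printer = StreamRDFLib.writer(System.out) ;  // print as parsed
RDFDataMgr.parse(printer, "data.ttl") ;                // nothing accumulated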
GraphLoadUtils.readUtil comes down to:
Lang lang = RDFLanguages.filenameToLang(uri, RDFLanguages.RDFXML) ;
StreamRDF sink = StreamRDFLib.graph(graph) ;
sink = new SinkRDFLimited(sink, limit) ;
InputStream input = Fuseki.webStreamManager.open(uri) ;
RDFDataMgr.parse(sink, input, uri, lang, null) ;
and "RDFDataMgr.parse(StreamRDF" is the core that drives allparsing
operations nowadays (Jena 2.10.0)
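SinkRDFLimited above is just one more wrapper in that stream. A sketch of
the idea, assuming the StreamRDFWrapper helper in org.apache.jena.riot.system
(delegating by hand works the same if your version lacks it); it shows the
one extra method call per triple:

import org.apache.jena.riot.RiotException ;
import org.apache.jena.riot.system.StreamRDF ;
import org.apache.jena.riot.system.StreamRDFWrapper ;
import com.hp.hpl.jena.graph.Triple ;

// Count triples as they pass through; abort once the limit is passed.
class LimitingStreamRDF extends StreamRDFWrapper {
    private final long limit ;
    private long count = 0 ;

    LimitingStreamRDF(StreamRDF dest, long limit) {
        super(dest) ;
        this.limit = limit ;
    }

    @Override public void triple(Triple triple) {
        if ( ++count > limit )
            throw new RiotException("Limit "+limit+" exceeded") ;
        super.triple(triple) ;   // forward to the destination stream
    }
}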
The SinkLimited class uses the MaxTriples value and throws a
RiotException("Limit "+limit+" exceeded") from the send(T thing) method.
How do I get around this?
Typically by loading the data into a database beforehand, via SPARQL
Update (e.g. LOAD), the upload operations, or offline.
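For example, a one-off LOAD through the update endpoint puts the data in
the store once, so later queries don't pay the re-read cost (a sketch; the
endpoint and URLs are placeholders):

import com.hp.hpl.jena.update.* ;

UpdateRequest req = UpdateFactory.create(
    "LOAD <http://example/data.ttl> INTO GRAPH <http://example/g>") ;
UpdateProcessor proc =
    UpdateExecutionFactory.createRemote(req, "http://localhost:3030/ds/update") ;
proc.execute() ;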
Reading in 100K triples for a one-time-use dataset is not something to be
done lightly.
What's the use case here?
TDB, with FROM/FROM NAMED, works differently.
It uses the graphs named to construct an execution that only applies to
those graphs. The graphs come from the local database and are not copied.
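So against TDB something like this reads straight out of the database (a
sketch; the database location and graph name are placeholders):

import com.hp.hpl.jena.query.* ;
import com.hp.hpl.jena.tdb.TDBFactory ;

Dataset ds = TDBFactory.createDataset("/path/to/DB") ;
String qs = "SELECT * FROM NAMED <http://example/g> "
          + "WHERE { GRAPH ?g { ?s ?p ?o } }" ;
QueryExecution qExec = QueryExecutionFactory.create(qs, ds) ;
ResultSetFormatter.out(qExec.execSelect()) ;
qExec.close() ;

The execution is restricted to <http://example/g>; no triples are copied.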
Andy
Dick.