Re: Jena and datatypes

Andy Seaborne Wed, 14 Oct 2020 03:13:01 -0700



On 14/10/2020 01:58, Zalan Kemenczy wrote:

Hi Andy, thanks for the reply!

TDB1 and TDB2 do different things.  TDB2 preserves integer subtype
datatypes whereas TDB1 does not.

It's the lexical form that is canonicalized - not the datatype.

The distinction between "070", "70", "+70" etc. is lost and they all become "70".
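
As a rough illustration of what canonicalizing the lexical form means, here is a sketch using java.math.BigInteger. It is not the TDB code, and `canonicalInteger` is a made-up helper name: it parses the lexical form to a value and writes the value back out.

```java
import java.math.BigInteger;

class LexicalFormSketch {
    // Hypothetical helper: parse the lexical form to a value, then print
    // the value back. Leading zeros and a leading '+' do not survive.
    static String canonicalInteger(String lexical) {
        return new BigInteger(lexical).toString();
    }

    public static void main(String[] args) {
        for (String lex : new String[] {"070", "70", "+70"}) {
            System.out.println("\"" + lex + "\" -> \"" + canonicalInteger(lex) + "\"");
        }
    }
}
```

All three inputs come out as "70", which is the loss of distinction described above. The datatype is not involved at all in this step.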



Just to confirm that I understand what you mean by "preserves integer
subtypes": when I insert the following two triples into either a TDB2
instance, or the default memory model...

(triple _e <http://xmlns.com/foaf/0.1/age> "70"^^<
http://www.w3.org/2001/XMLSchema#int>)
(triple _e <http://xmlns.com/foaf/0.1/age> "70"^^<
http://www.w3.org/2001/XMLSchema#integer>)

...there are actually two distinct triples in the model. If I insert the
same two triples into a TDB1 instance, there is only a single triple in the
model. Do I have this correct?


Yes.

2 triples for TDB2 and in-memory
1 triple for TDB1


My follow-up question is then around the purpose of `sameValueAs`, which is
described in the typed literal howto (
https://jena.apache.org/documentation/notes/typed-literals.html):

"There is a well defined notion of when two typed literals should be equal,
based on the equality defined for the datatype in question. Jena2
implements this equality function by using the method sameValueAs. Thus two
literals (“13”, xsd:int) and (“13”, xsd:decimal) will test as sameValueAs
each other but neither will test sameValueAs (“13”, xsd:string)."
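
To make the quoted paragraph concrete, a minimal sketch of the difference between comparing terms and comparing values. This is plain Java, not the Jena API: `Lit`, `sameTerm` and `sameValue` are invented names, and the numeric handling is simplified to a few datatypes and BigDecimal comparison.

```java
import java.math.BigDecimal;
import java.util.Set;

class SameValueSketch {
    // Hypothetical stand-in for a typed literal: lexical form plus datatype name.
    static final class Lit {
        final String lex;
        final String datatype;
        Lit(String lex, String datatype) { this.lex = lex; this.datatype = datatype; }
    }

    // Assumption: only these datatypes are treated as numeric in this sketch.
    static final Set<String> NUMERIC =
        Set.of("xsd:int", "xsd:long", "xsd:integer", "xsd:decimal");

    // Term equality: same lexical form and same datatype.
    static boolean sameTerm(Lit a, Lit b) {
        return a.lex.equals(b.lex) && a.datatype.equals(b.datatype);
    }

    // Value equality: numeric literals compare by numeric value,
    // everything else falls back to term equality.
    static boolean sameValue(Lit a, Lit b) {
        if (NUMERIC.contains(a.datatype) && NUMERIC.contains(b.datatype)) {
            return new BigDecimal(a.lex).compareTo(new BigDecimal(b.lex)) == 0;
        }
        return sameTerm(a, b);
    }

    public static void main(String[] args) {
        Lit asInt = new Lit("13", "xsd:int");
        Lit asDecimal = new Lit("13", "xsd:decimal");
        Lit asString = new Lit("13", "xsd:string");
        System.out.println(sameValue(asInt, asDecimal)); // true
        System.out.println(sameTerm(asInt, asDecimal));  // false
        System.out.println(sameValue(asInt, asString));  // false
    }
}
```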

Where are these `sameValueAs` semantics relevant? I don't immediately see
that they play a role in triple insertion, deletion or unification in
queries. Given the two triples above, if I query either the TDB2 model or
the in memory model:

(bgp (triple ?e <http://xmlns.com/foaf/0.1/age> "70"^^<
http://www.w3.org/2001/XMLSchema#integer>))

I only get back a single binding, for the xsd:integer. Is `sameValueAs`
also specific to TDB1?

"sameValueAs" is a Java API feature and only applies to plain (non-transactional) in-memory graphs.


There is a lot of history here.

In TDB1, it would be xs:integer, which in Java is BigInteger.

Fuseki only uses plain in-memory graphs for complex dataset sets with assemblers and ja:Model.


Normally, it's using transactional TIM, TDB1 or TDB2.

(TIM = "Transactions In Memory")

sameValueAs is quite an old feature. A good idea if you see RDF from Java, mapping to Java types like int and long. That was the original Jena.


It affects Graph.find(s,p,o) as well.
See below for an example program.

In plain in-memory, it works by having specialised indexing for access but it does not play well with XSD F&O nor with storage. (Storage could be done at the expense of more complicated and larger indexes. "A small matter of programming" but load performance is affected.)
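
A toy picture of what indexing by value can look like, to show why a lookup for one term can return terms with other datatypes. This is only a sketch under assumptions: the class and method names are invented, it handles integer lexical forms only, and it is not how Jena's in-memory graph is actually implemented.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ValueIndexSketch {
    // Every distinct term (lexical form + datatype) is stored...
    private final Set<String> terms = new LinkedHashSet<>();
    // ...but the access index is keyed by the numeric value.
    private final Map<BigInteger, List<String>> byValue = new HashMap<>();

    void add(String lex, String datatype) {
        String term = "\"" + lex + "\"^^" + datatype;
        if (terms.add(term)) {
            byValue.computeIfAbsent(new BigInteger(lex), k -> new ArrayList<>()).add(term);
        }
    }

    int size() {
        return terms.size();
    }

    // Lookup goes through the value, so every term with that value matches.
    List<String> find(String lex) {
        return byValue.getOrDefault(new BigInteger(lex), List.of());
    }

    public static void main(String[] args) {
        ValueIndexSketch index = new ValueIndexSketch();
        index.add("70", "xsd:int");
        index.add("70", "xsd:long");
        System.out.println("Size: " + index.size()); // Size: 2
        System.out.println(index.find("70"));        // both terms are returned
    }
}
```

Two terms are stored, and a find on the value 70 returns both of them, whichever datatype was asked for.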

But with SPARQL and XSD evaluation and wanting to work with the actual RDF terms (lexical form and xs datatype), working directly in values, hiding the details, does not work out.

An example of a difference: in F&O it says xs:int+xs:long -> xs:integer whereas in Java it is int+long -> long.


ARQ has its own XSD/F&O expression evaluator and does not use sameValueAs.

And Jena generally tries to preserve legacy - what started as a necessary difference for SPARQL has become a whole system of its own.

Stripping it all back to a modern, consistent core aligned to where RDF and SPARQL have gone since the core of Jena was designed is both attractive and would have significant impact on existing users' code.

When that spec says "the same type" in the context of that spec at
that point, does it mean same xs:numeric basic types: xs:integer,
xs:decimal, xs:float and xs:double, or exact datatype? (The later 3.1
adds "primitive datatype" but that post-dates SPARQL.)

Generally, the interpretation seems to be the former - so long+long is
xs:integer+xs:integer (and can't overflow except for the implementation
limit on xs:integer) which is what Jena does.
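
The practical consequence can be seen in plain Java. In this sketch xs:integer is modelled by java.math.BigInteger (an assumption made for illustration; `javaAdd` and `integerAdd` are invented names), and the two additions are compared at the edge of the long range.

```java
import java.math.BigInteger;

class NumericPromotionSketch {
    // Java semantics: long + long stays long and wraps on overflow.
    static long javaAdd(long a, long b) {
        return a + b;
    }

    // xs:integer-style semantics: both operands are raised to an
    // arbitrary-precision integer first, so the sum cannot overflow.
    static BigInteger integerAdd(long a, long b) {
        return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
    }

    public static void main(String[] args) {
        System.out.println(javaAdd(Long.MAX_VALUE, 1L));    // -9223372036854775808
        System.out.println(integerAdd(Long.MAX_VALUE, 1L)); // 9223372036854775808
    }
}
```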

I can't find any examples on the web that show long + long -> long.
All examples are "integer + integer" suggesting arguments are raised to
one of the four xs:numeric types.


The safest approach if you want xs:long is to cast:

xsd:long(?age + "1"^^xsd:long)

The text in the latest spec 3.1 is:
https://www.w3.org/TR/xpath-functions/#op.numeric

I'd be interested in knowing what other systems do here.


Thanks for the link Andy, that was very helpful.



Gory details:

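    // Imports use the Jena 3.x package names; the class name is only a placeholder.
    import org.apache.jena.atlas.iterator.Iter;
    import org.apache.jena.graph.Graph;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;
    import org.apache.jena.sparql.sse.SSE;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb.TDBFactory;
    import org.apache.jena.tdb2.DatabaseMgr;

    public class DatatypesExample {
        public static void main(String...a) {
            // Same test against each kind of dataset.
            dwim("Plain", DatasetGraphFactory.create());
            dwim("TIM", DatasetGraphFactory.createTxnMem());
            dwim("TDB1", TDBFactory.createDatasetGraph());
            dwim("TDB2", DatabaseMgr.createDatasetGraph());
        }

        private static void dwim(String label, DatasetGraph dsg) {
            System.out.println("== "+label);

            Graph g = dsg.getDefaultGraph();
            Node s = SSE.parseNode(":s");
            Node p = SSE.parseNode(":p");
            // Same value, different integer subtypes.
            Node x1 = SSE.parseNode("'70'^^xsd:int");
            Node x2 = SSE.parseNode("'70'^^xsd:long");
            Triple t1 = Triple.create(s, p, x1);
            Triple t2 = Triple.create(s, p, x2);
            Txn.executeWrite(dsg, ()->{
                g.add(t1);
                g.add(t2);
                System.out.println("Size: "+g.size());
                // Look up by the xsd:long term and print the objects found.
                Iter.print(g.find(s, p, x2).mapWith(Triple::getObject));
            });
        }
    }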
==>

== Plain
Size: 2
"70"^^http://www.w3.org/2001/XMLSchema#long
"70"^^http://www.w3.org/2001/XMLSchema#int
== TIM
Size: 2
"70"^^http://www.w3.org/2001/XMLSchema#long
== TDB1
Size: 1
"70"^^http://www.w3.org/2001/XMLSchema#integer
== TDB2
Size: 2
"70"^^http://www.w3.org/2001/XMLSchema#long
