Re: RDF Patch - experiences suggesting changes

2016-10-20 Thread Rob Vesse
I think we’re both coming at this from different angles, hence the disagreement. 
Your assumption seems to be that patches are packets, or perhaps messages, 
exchanged between parts of the system, whereas my assumption is that the patch 
is similar to a journal and represents a continuous stream of changes, with 
potentially many packets (in your parlance) concatenated together. In your 
scenario I agree that there is probably no need for reversible at the 
transaction level; however, in my scenario this could be extremely useful.

 Ultimately you probably have to make a call one way or another. You have 
actual practical implementation experience with this format, so you are probably 
better placed to decide what works in the real world. As with the notion of 
REPEAT, it is not as if a decision made now is a permanent one; you can always 
revise the standard in the future, provided you include a way to define what 
version of the standard a patch conforms to.

Rob

On 20/10/2016 13:50, "Andy Seaborne"  wrote:



On 19/10/16 10:51, Rob Vesse wrote:
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> I don't understand what capabilities are enabled by transaction
> granularity if there are multiple transactions in a single patch.
> Concrete examples of where it helps?
>
> However, I've normally been working with one transaction per patch anyway.
>
> Allowing multiple transactions per patch is for making a collection of
> (semantically) related changes into a unit, by consolidating small
> patches into "today's changes" (c.f. git squash).
>
> Leaving the transaction boundaries in gives internal checkpoints, not
> just one big transaction. It also makes the consolidated patch
> decomposable (unlike squash).
>
> Internal checkpoints are useful not just for keeping the transaction
> manageable but also to be able to restart a very large update in case it
> failed part way through for system reasons (server power cut, user
> reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.
>
> I think the thought is that a producer of a patch can decide whether
> each transaction being recorded should be reversible or not. For
> example if you are adding a very large dataset to an already large database
> you probably don’t want to slow down the import process by having to
> check whether every triple/quad is already in the database as you
> import it. Therefore you might choose to output a non-reversible
> transaction for performance reasons.
>
> On the other hand if you’re accepting a small change to the data then
> that cost is probably acceptable and you would output a reversible
> transaction.
>
> I am not arguing that you shouldn’t have transaction boundaries, in
> fact I think they are essential, but simply that you may want to be
> able to annotate the properties of a transaction beyond just stating the
> boundaries.

Rob,

I agree the producer needs to have control.  What I am asking is why one 
patch unit (packet) would have multiple transactions with different 
characteristics in it.  The properties of a patch packet include 
reversibility of contents. A patch overall isn't reversible unless each 
transaction within it is, so there is now an opportunity for errors.

I think the unit of a patch packet is enough - it is supposed to be a sensible 
set of changes to move the dataset from one consistent state to another. 
  In developing that set of changes, there may have been several 
transactions (c.f. git squash).  It happens to give a checkpoint effect 
on large patches as well.

Analogy that may not help : a "TB/TC" is a database-transaction and a 
"patch" is more like a "business transaction".


(The use of "transaction" may not be the best - "action"? But with a 
need for "abort" as well as "commit", "transaction" fits.)

Andy







Re: RDF Patch - experiences suggesting changes

2016-10-20 Thread A. Soroka
For my cases, I would like intra-patch transactions because I have several 
different possible implementations of "patch"-- in other words, a patch might 
be an HTTP request, a section of a journal on a filesystem, a feed from a queue 
between time x and time y, an isolated file, etc. Having an independent notion 
of transaction would let me easily keep a common entity (transaction) in my 
systems even though the concrete manifestation of "patch" is varying.
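
For illustration, a minimal sketch (hypothetical quads, using the TB/TC tokens 
from this thread) of two transactions that could equally be two HTTP requests, 
two entries in a journal, or one concatenated patch file:

TB
QA <http://example/s1> <http://example/p> "a" .
TC
TB
QD <http://example/s2> <http://example/p> "b" .
TC

The TB/TC-delimited unit stays the same entity whichever carrier is used.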

---
A. Soroka
The University of Virginia Library

> On Oct 20, 2016, at 8:50 AM, Andy Seaborne  wrote:
> 
> 
> 
> On 19/10/16 10:51, Rob Vesse wrote:
>> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>> 
>>I don't understand what capabilities are enabled by transaction
>>granularity if there are multiple transactions in a single patch.
>>Concrete examples of where it helps?
>> 
>>However, I've normally been working with one transaction per patch anyway.
>> 
>>Allowing multiple transactions per patch is for making a collection of
>>(semantically) related changes into a unit, by consolidating small
>>patches into "today's changes" (c.f. git squash).
>> 
>>Leaving the transaction boundaries in gives internal checkpoints, not
>>just one big transaction. It also makes the consolidated patch
>>decomposable (unlike squash).
>> 
>>Internal checkpoints are useful not just for keeping the transaction
>>manageable but also to be able to restart a very large update in case it
>>failed part way through for system reasons (server power cut, user
>>reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.
>> 
>> I think the thought is that a producer of a patch can decide whether
>> each transaction being recorded should be reversible or not. For
>> example if you are adding a very large dataset to an already large database
>> you probably don’t want to slow down the import process by having to
>> check whether every triple/quad is already in the database as you
>> import it. Therefore you might choose to output a non-reversible
>> transaction for performance reasons.
>> 
>> On the other hand if you’re accepting a small change to the data then
>> that cost is probably acceptable and you would output a reversible
>> transaction.
>> 
>> I am not arguing that you shouldn’t have transaction boundaries, in
>> fact I think they are essential, but simply that you may want to be
>> able to annotate the properties of a transaction beyond just stating the
>> boundaries.
> 
> Rob,
> 
> I agree the producer needs to have control.  What I am asking is why one 
> patch unit (packet) would have multiple transactions with different 
> characteristics in it.  The properties of a patch packet include reversibility 
> of contents. A patch overall isn't reversible unless each transaction within 
> it is, so there is now an opportunity for errors.
> 
> I think the unit of a patch packet is enough - it is supposed to be a sensible set 
> of changes to move the dataset from one consistent state to another.  In 
> developing that set of changes, there may have been several transactions 
> (c.f. git squash).  It happens to give a checkpoint effect on large patches 
> as well.
> 
> Analogy that may not help : a "TB/TC" is a database-transaction and a "patch" 
> is more like a "business transaction".
> 
> 
> (The use of "transaction" may not be the best - "action"? But with a need for 
> "abort" as well as "commit", "transaction" fits.)
> 
>   Andy



Re: RDF Patch - experiences suggesting changes

2016-10-20 Thread Andy Seaborne



On 19/10/16 10:51, Rob Vesse wrote:

On 14/10/2016 17:09, "Andy Seaborne"  wrote:

I don't understand what capabilities are enabled by transaction
granularity if there are multiple transactions in a single patch.
Concrete examples of where it helps?

However, I've normally been working with one transaction per patch anyway.

Allowing multiple transactions per patch is for making a collection of
(semantically) related changes into a unit, by consolidating small
patches into "today's changes" (c.f. git squash).

Leaving the transaction boundaries in gives internal checkpoints, not
just one big transaction. It also makes the consolidated patch
decomposable (unlike squash).

Internal checkpoints are useful not just for keeping the transaction
manageable but also to be able to restart a very large update in case it
failed part way through for system reasons (server power cut, user
reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.

I think the thought is that a producer of a patch can decide whether
each transaction being recorded should be reversible or not. For
example if you are adding a very large dataset to an already large database
you probably don’t want to slow down the import process by having to
check whether every triple/quad is already in the database as you
import it. Therefore you might choose to output a non-reversible
transaction for performance reasons.

On the other hand if you’re accepting a small change to the data then
that cost is probably acceptable and you would output a reversible
transaction.

I am not arguing that you shouldn’t have transaction boundaries, in
fact I think they are essential, but simply that you may want to be
able to annotate the properties of a transaction beyond just stating the
boundaries.


Rob,

I agree the producer needs to have control.  What I am asking is why one 
patch unit (packet) would have multiple transactions with different 
characteristics in it.  The properties of a patch packet include 
reversibility of contents. A patch overall isn't reversible unless each 
transaction within it is, so there is now an opportunity for errors.


I think the unit of a patch packet is enough - it is supposed to be a sensible 
set of changes to move the dataset from one consistent state to another. 
 In developing that set of changes, there may have been several 
transactions (c.f. git squash).  It happens to give a checkpoint effect 
on large patches as well.


Analogy that may not help : a "TB/TC" is a database-transaction and a 
"patch" is more like a "business transaction".



(The use of "transaction" may not be the best - "action"? But with a 
need for "abort" as well as "commit", "transaction" fits.)


Andy


Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Andy Seaborne



On 19/10/16 11:34, Stian Soiland-Reyes wrote:

I had a quick go, and with gzip the penalty from using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). Using
"R" also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes.)


Good point ... for certain restricted patches like all QA or all QD 
where reordering (necessary for parallel processing) is possible.


At this point, specifying RDF Patch v2 without R, until the interactions 
with gzip etc. compression are better understood, seems to me to be the way 
forward.


It's easier to add later than add now and remove.

FYI:

The RIOT parsers do interning of Nodes using a 1000-slot LRU cache (so 
not large) - this leads to 30%, sometimes 50%, less memory being used 
due to shared terms.  In practice, it results in interning all properties 
in a vocabulary (1000 well-used properties being quite unusual), which 
R does not do.


Andy





Using "R" could also restrict possible compression pattern, for instance in :

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA   .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:

Yes but ANY is a form of lossy compression. You lose the actual details of what 
was removed. Also it can only be used for removals and yields no benefit for 
additions.

 On the other hand REPEAT is lossless compression.

 However if you apply a general-purpose compression like gzip on top of the 
patch you probably get just as good compression without needing any special 
tokens. In my experience REPEAT is more useful in compact binary formats where 
you can use fewer bytes to encode it than either the term itself or a reference 
to the term in some lookup table.

On 14/10/2016 17:09, "Andy Seaborne"  wrote:

These two together seem a bit contradictory.  The advantage of ANY, with
versions, is that it is a form of compression.










Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Andy Seaborne

They must be in the order presented.

QD then QA is a different outcome to QA then QD.
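
For example (hypothetical quad), if the dataset does not contain the quad, then

QD <http://example/s> <http://example/p> "v" .
QA <http://example/s> <http://example/p> "v" .

leaves the quad present (the QD is a no-op, the QA adds it), while the same two 
rows in QA-then-QD order leave it absent.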

On 19/10/16 15:00, Stian Soiland-Reyes wrote:

Obviously that would be the case for the flat file (no transactions) and
for the order of any transactions.

So if that is the case also *inside* a transaction, then you are
effectively doing suboperations with a new transactional state per line in
the transaction.

How about restricting transactions to always have the order DDD... AAA...?


Because it is a stream


That would help on reversibility as well, as you can't then remove triples
added in the same transaction. (Reversibility is just to swap A/D blocks.)


The reverse of QD depends on the data.

If the quad existed at the start, the reverse is a no-op, else it's QA.


Perhaps DDD... AAA... ordering could be a restriction only for reversible
transactions, as it could prevent a more "naive" log approach from being used
with transactions?


The important point is that higher-level semantics (that word!) can be 
imposed by systems on top of a basic patch format.


The headers indicate the additional properties of the patch.

Andy



On 19 Oct 2016 1:40 pm, "Rob Vesse"  wrote:


I am pretty sure that the intent is that a patch must be read in linear
order, i.e. it is not designed for parallel processing.

On 19/10/2016 11:34, "Stian Soiland-Reyes"  wrote:

I had a quick go, and with gzip the penalty from using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). Using
"R" also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes.)

Using "R" could also restrict possible compression pattern, for
instance in :

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA    .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:
> Yes but ANY is a form of lossy compression. You lose the actual
details of what was removed. Also it can only be used for removals and
yields no benefit for additions.
>
>  On the other hand REPEAT is lossless compression.
>
>  However if you apply a general-purpose compression like gzip on top
of the patch you probably get just as good compression without needing any
special tokens. In my experience REPEAT is more useful in compact binary
formats where you can use fewer bytes to encode it than either the term
itself or a reference to the term in some lookup table.
>
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> These two together seem a bit contradictory.  The advantage of ANY, with
> versions, is that it is a form of compression.
>
>
>
>



--
Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718










Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Rob Vesse
I am pretty sure that the intent is that a patch must be read in linear order, 
i.e. it is not designed for parallel processing.

On 19/10/2016 11:34, "Stian Soiland-Reyes"  wrote:

I had a quick go, and with gzip the penalty from using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). Using
"R" also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes.)

Using "R" could also restrict possible compression pattern, for instance in 
:

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA    .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:
> Yes but ANY is a form of lossy compression. You lose the actual details 
of what was removed. Also it can only be used for removals and yields no 
benefit for additions.
>
>  On the other hand REPEAT is lossless compression.
>
>  However if you apply a general-purpose compression like gzip on top of 
the patch you probably get just as good compression without needing any special 
tokens. In my experience REPEAT is more useful in compact binary formats where 
you can use fewer bytes to encode it than either the term itself or a reference 
to the term in some lookup table.
>
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> These two together seem a bit contradictory.  The advantage of ANY, with
> versions, is that it is a form of compression.
>
>
>
>



-- 
Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718







Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Stian Soiland-Reyes
I had a quick go, and with gzip the penalty from using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). Using
"R" also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes.)

Using "R" could also restrict possible compression pattern, for instance in :

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA    .\nA R R" (which sometimes would work well).
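
For comparison, a sketch (hypothetical URIs) of the same pair of rows written 
with "R" repeating the subject and predicate of the previous row:

A <http://example/vocab#subject> <http://example/vocab#predicate> <http://example/object1> .
A R R <http://example/object2> .

The long terms then appear only once, so there is less literal repetition left 
for a byte-level compressor to find.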



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:
> Yes but ANY is a form of lossy compression. You lose the actual details of 
> what was removed. Also it can only be used for removals and yields no benefit 
> for additions.
>
>  On the other hand REPEAT is lossless compression.
>
>  However if you apply a general-purpose compression like gzip on top of the 
> patch you probably get just as good compression without needing any special 
> tokens. In my experience REPEAT is more useful in compact binary formats 
> where you can use fewer bytes to encode it than either the term itself or a 
> reference to the term in some lookup table.
>
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> These two together seem a bit contradictory.  The advantage of ANY, with
> versions, is that it is a form of compression.
>
>
>
>



-- 
Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718


Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Rob Vesse
Yes but ANY is a form of lossy compression. You lose the actual details of what 
was removed. Also it can only be used for removals and yields no benefit for 
additions.

 On the other hand REPEAT is lossless compression.

 However if you apply a general-purpose compression like gzip on top of the 
patch you probably get just as good compression without needing any special 
tokens. In my experience REPEAT is more useful in compact binary formats where 
you can use fewer bytes to encode it than either the term itself or a reference 
to the term in some lookup table.

On 14/10/2016 17:09, "Andy Seaborne"  wrote:

These two together seem a bit contradictory.  The advantage of ANY, with 
versions, is that it is a form of compression.






Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Rob Vesse
On 14/10/2016 17:09, "Andy Seaborne"  wrote:

I don't understand what capabilities are enabled by transaction 
granularity if there are multiple transactions in a single patch. 
Concrete examples of where it helps?

However, I've normally been working with one transaction per patch anyway.

Allowing multiple transactions per patch is for making a collection of 
(semantically) related changes into a unit, by consolidating small 
patches into "today's changes" (c.f. git squash).

Leaving the transaction boundaries in gives internal checkpoints, not 
just one big transaction. It also makes the consolidated patch 
decomposable (unlike squash).

Internal checkpoints are useful not just for keeping the transaction 
manageable but also to be able to restart a very large update in case it 
failed part way through for system reasons (server power cut, user 
reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.

 I think the thought is that a producer of a patch can decide whether each 
transaction being recorded should be reversible or not. For example if you are 
adding a very large dataset to an already large database you probably don’t want to 
slow down the import process by having to check whether every triple/quad is 
already in the database as you import it. Therefore you might choose to output 
a non-reversible transaction for performance reasons.

On the other hand if you’re accepting a small change to the data then that cost 
is probably acceptable and you would output a reversible transaction.

 I am not arguing that you shouldn’t have transaction boundaries, in fact I 
think they are essential, but simply that you may want to be able to annotate the 
properties of a transaction beyond just stating the boundaries.






Re: RDF Patch - experiences suggesting changes

2016-10-17 Thread Andy Seaborne



On 14/10/16 11:59, A. Soroka wrote:
...


6/ Packets of change.

To have 4 (label a patch with reversible) and 5 (the version details), there 
needs to be somewhere to put the information. Having it in the patch itself 
means that the whole unit can be stored in a file.  If it is in the protocol, 
like HTTP for E-tags, then the information becomes separated.  That is not to 
say that it can't also be in the protocol, but it needs support in the data 
format.


As long as the sort of information about which we are thinking makes sense on a 
per-transaction basis, that could be as I suggest above, as "metadata" on BEGIN.


So a patch packet for a single transaction:

PARENT <uuid1>
VERSION <uuid2>
REVERSIBLE   (optional)
TB
QA ...
QD ...
PA ...
PD ...
TC
H <hash>

where QA and QD are "quad add" and "quad delete", and "PA" and "PD" are "add 
prefix" and "delete prefix".


I'm suggesting something more like:

TB PARENT <uuid1> VERSION <uuid2> REVERSIBLE
QA ...
QD ...
PA ...
PD ...
TC H <hash>

Or even just positionally:

TB <uuid1> <uuid2> REVERSIBLE
QA ...
QD ...
PA ...
PD ...
TC <hash>


An "HTTP header"-like form (key-value lines) would be better because it 
is an open design where new header fields can be put in as needed. 
(e.g. "Author", "Date", "Signed-off-by").


It could be done as a different style, maybe identical to HTTP, so the 
patch-specific parsing begins at the first TB, or include a blank line 
or marker like "". Or the header could reuse the same parsing 
tokens, though values may end up in strings for grouping more often.
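
A sketch of how such a header block might look (field names illustrative, not 
a fixed spec), with a blank line separating the header from the patch body:

Parent: <uuid1>
Version: <uuid2>
Reversible: true
Author: "alice"

TB
QA <http://example/s> <http://example/p> <http://example/o> .
TC
H <hash>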


Andy





Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Andy Seaborne



On 14/10/16 15:22, Rob Vesse wrote:

Thanks for sending this out

Another use case that springs to mind is for write ahead logging
particularly for reversible patches.


Yes. Whether it has to be reversible depends on how it relates to the
journal.  If it is the journal, it only has to be a replayable log. If 
the commit journal is separate, it may need to be reversible.



On the subject of prefixes, I agree that being able to record prefix
definitions is useful and I am strongly in favour of not using
them to compact the data. As you say, it actually makes reading and
writing the data slower as well as requiring additional state to be
recorded during processing.

I like the use of transaction boundaries. I also like A. Soroka’s
suggestion of making the reversible flag be applied to transaction
begin rather than to the patch as a whole, though I don’t see any
problem with supporting both forms. I think reversible patches are an
essential feature.


I don't understand what capabilities are enabled by transaction 
granularity if there are multiple transactions in a single patch. 
Concrete examples of where it helps?


However, I've normally been working with one transaction per patch anyway.

Allowing multiple transactions per patch is for making a collection of 
(semantically) related changes into a unit, by consolidating small 
patches into "today's changes" (c.f. git squash).


Leaving the transaction boundaries in gives internal checkpoints, not 
just one big transaction. It also makes the consolidated patch 
decomposable (unlike squash).


Internal checkpoints are useful not just for keeping the transaction 
manageable but also to be able to restart a very large update in case it 
failed part way through for system reasons (server power cut, user 
reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.



For the version control aspect I would be tempted to not constrain it
to UUID and simply say that it is an identifier for the parent state
to which the patch is applied. This will then allow people the
freedom to use hash algorithms, simple counters, etc. or any other
version identification scheme they desired. I might even be tempted
to suggest that it should be a URI so that people can use identifiers
in their own namespaces to reduce the chance of collisions.


As long as the ref is globally unique (so not counters without a uniquifier).

I mentioned UUIDs really to turn up the contrast. It is not naming a web 
resource if it is a version.  The web resource is mutable - it's the 
dataset.  If someone wants to use http: versions for a 
way-back-database, that's cool, but making that the way for systems that 
don't have temporal capabilities (the majority) gets into philosophical 
debates.


And to keep patches protocol independent.

I have separate work on a protocol for keeping two datasets synced (soft 
consistency).



I can see the value of supporting metadata about the patch both
within it and in any protocol used to communicate it. Checksums are
fine although if you include this then you probably need to define
exactly how each checksum should be calculated.


Yes.



As for some of the other suggestions you have received:

- I would be strongly against including an ANY term. As soon as you
get into wild cards you may as well just use SPARQL Update. Plus the
meaning of the wild card is dependent on the dataset to which it is
applied which completely defeats the purpose of being a canonical
description of changes

> - I am strongly for including the REPEAT term.

This has the potential to offer significant compression particularly
if the system producing the patch chooses to group changes by subject
and predicate à la Turtle and most other syntaxes.


These two together seem a bit contradictory.  The advantage of ANY, with 
versions, is that it is a form of compression.


Without a version, I agree that it is stepping towards a higher-level 
language for changes.



The compression by subject/predicate leaves me mixed - compression after 
hashing would treat them as more orthogonal.  Compressing even with R ...


My rule of thumb is x8 to x10 compression of N-Triples/N-Quads.  That's 
not all coming from same-subject etc.  I assume it comes from 
effectively spotting the namespaces and making them compression tokens.


> - Having a term for the default graph could prove useful

Andy



Rob


On 13/10/2016 16:32, "Andy Seaborne"  wrote:

I've been using modified RDF Patch for the data exchanged to keep
multiple datasets synchronized.

My primary use case is having multiple copies of the datasets for a
high availability solution.  It has to be a general solution for any
data.

There are some changes to the format that this work has highlighted.

[RDF Patch - v1] https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets
in-step, we also need to record changes to 

Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Rob Vesse
Thanks for sending this out

Another use case that springs to mind is for write ahead logging particularly 
for reversible patches.

On the subject of prefixes, I agree that being able to record prefix definitions 
is useful and I am strongly in favour of not using them to compact the data. As 
you say, it actually makes reading and writing the data slower as well as 
requiring additional state to be recorded during processing.

 I like the use of transaction boundaries. I also like A. Soroka’s suggestion of 
making the reversible flag be applied to transaction begin rather than to the 
patch as a whole, though I don’t see any problem with supporting both forms. I 
think reversible patches are an essential feature.

 For the version control aspect I would be tempted to not constrain it to UUID 
and simply say that it is an identifier for the parent state to which the patch 
is applied. This will then allow people the freedom to use hash algorithms, 
simple counters, etc. or any other version identification scheme they desired. I 
might even be tempted to suggest that it should be a URI so that people can use 
identifiers in their own namespaces to reduce the chance of collisions.

 I can see the value of supporting metadata about the patch both within it and 
in any protocol used to communicate it. Checksums are fine although if you 
include this then you probably need to define exactly how each checksum should 
be calculated.

 As for some of the other suggestions you have received:

- I would be strongly against including an ANY term. As soon as you get into 
wild cards you may as well just use SPARQL Update. Plus the meaning of the wild 
card is dependent on the dataset to which it is applied which completely 
defeats the purpose of being a canonical description of changes
- I am strongly for including the REPEAT term. This has the potential to offer 
significant compression particularly if the system producing the patch chooses 
to group changes by subject and predicate à la Turtle and most other syntaxes.
- Having a term for the default graph could prove useful

Rob


On 13/10/2016 16:32, "Andy Seaborne"  wrote:

I've been using modified RDF Patch for the data exchanged to keep 
multiple datasets synchronized.

My primary use case is having multiple copies of the datasets for a high 
availability solution.  It has to be a general solution for any data.

There are some changes to the format that this work has highlighted.

[RDF Patch - v1]
https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets in-step, 
we also need to record changes to prefixes.  While they don't change the 
meaning of the data, application developers and users like prefixes.

2/ Remove the in-data prefixes feature.

RDF Patch has the feature to define prefixes in the data and use them 
for prefix names later in the data using @prefix.

This seems to have no real advantage, it can slow things down (c.f. 
N-Triples parsing is faster than Turtle parsing - prefixes are part of 
that), and it generally complicates the data form.

When including "add"/"delete" prefixes on the dataset (1) it also makes 
it quite confusing.

Whether the "R" for "repeat" entry from previous row should also be 
removed is an open question.

3/ Record transaction boundaries.

(A.3 in RDF Patch v1)
http://afs.github.io/rdf-patch/#transaction-boundaries

Having the transaction boundaries recorded means that they can be 
replayed when applying the patch.  While often a patch will be one 
transaction, patches can be consolidated by concatenation.

There are 3 operations:

TB, TC, TA - Transaction Begin, Commit, Abort.

Abort is useful to include because to know whether a transaction in a 
patch is going to commit or abort means waiting until the end.  That 
could be buffering client-side, or buffering server-side (or not writing 
the patch to a file) and having a means to discard a patch stream.

Instead, allow a transaction to record an abort, and say that aborted 
transactions in patches can be discarded downstream.

4/ Reversibility is a patch feature.

The RDF Patch v1 document includes "canonical patch" (section 9)
http://afs.github.io/rdf-patch/#canonical-patches

Such a patch is reversible (it can undo changes) if the adds and deletes 
are recorded only if they lead to a real change.  "Add quad" must mean 
"there was no quad in the set before".  But this only makes sense if the 
whole patch has this property.

RDF Patches are in general entries in a "redo log" - you can apply the 
patch over and over again and it will end up in the same state (they are 
idempotent).

A 

Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Andy Seaborne



On 14/10/16 12:09, A. Soroka wrote:

+1 to ANY, because it offers the potential to actually remove lines
from a patch. For example, removing a whole graph could shrink pretty
rapidly. But just to be clear, ANY would seem to be illegal inside a
reversible patch, right?


Yes.



--- A. Soroka The University of Virginia Library


Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread A. Soroka
+1 to ANY, because it offers the potential to actually remove lines from a 
patch. For example, removing a whole graph could shrink pretty rapidly. But 
just to be clear, ANY would seem to be illegal inside a reversible patch, right?

---
A. Soroka
The University of Virginia Library

> On Oct 14, 2016, at 5:34 AM, Andy Seaborne  wrote:
> 
> Hi Paul,
> 
> The general goal of RDF Patch is to be "assembler" for changes, or "N-Triples 
> for changes" - and there is no pattern matching capability.
> 
> In your example you'd have to know the old value:
> 
> TB
> QD :Server :instanceType "r3.xlarge" .
> QA :Server :instanceType "r3.2xlarge" .
> TC
> 
> 
> (aside: I wonder if instead of the 3/4 rule for triples/quads, a marker for 
> the default graph is better so the tuple is always QD and 4 terms.
> 
> QD _ :Server :instanceType "r3.xlarge" .
> 
> or have TD, TA
> )
> 
> On 13/10/16 17:02, Paul Houle wrote:
>> There is another use case for an "RDF Patch" which applies to
>> hand-written models.  For instance I have a model which describes a job
>> that is run in AWS that looks like
>> 
>> @prefix : 
>> @prefix parameter: 
>> 
>> :Server
>>   :subnetId "subnet-e0ab0197";
>>   :baseImage "ami-ea602afd";
>>   :instanceType "r3.xlarge";
>>   :keyName "o2key";
>>   :keyFile "~/AMZN Keys/o2key.ppk" ;
>>   :securityGroupIds "sg-bca0b2d9" ;
>>   :todo "dbpedia-load" ;
>>   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
>>   parameter:GRAPH_NAME "http://dbpedia.org/" ;
>>   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
>>   :iamProfile  ;
>>   :instanceName "Image Build Server";
>>   :qBase  .
>> 
>> one thing you might want to do is modify it so it uses a different
>> :baseImage or a different :instanceType and a natural way to do that is
>> to say
>> 
>> 'remove :Server :instanceType ?x and insert :Server :instanceType
>> "r3.2xlarge"'
> 
> SPARQL Update can provide the "pattern matching" (or some subset like 
> SparqlPatch [https://www.w3.org/2001/sw/wiki/SparqlPatch]):
> 
> 
> DELETE { :Server :instanceType ?x }
> INSERT { :Server :instanceType "r3.2xlarge" }
> WHERE  { :Server :instanceType ?x }
> 
> or
> 
> DELETE WHERE { :Server :instanceType ?x }
> ;
> INSERT DATA { :Server :instanceType "r3.2xlarge" }
> 
> 
> That said, the one useful addition to RDF Patch which is "pattern matching" 
> might be limited bulk delete.
> 
> QD <g> ANY ANY .
> 
> because listing all the triples to delete when they can be found from the 
> data anyway is a big space saving.
> 
>   Andy
> 
>> but better than that if you have a schema that says ":instanceType is a
>> single valued property" you can write another graph like
>> 
>> :Server
>>   :instanceType "r3.2xlarge" .
>> 
>> and merge it with the first graph to get the desired effect.
>> 
>> More generally this fits into the theme that "the structure of
>> commonsense knowledge is that there are rules,  then exceptions to the
>> rules,  then exceptions to the exceptions of the rules,  etc."
>> For instance I extracted a geospatial database out of Freebase that was
>> about 10 million facts and I found I had to add and remove about 10
>> facts on the route to a 99% success rate at a geospatial recognition
>> task.  A disciplined approach to "agreeing to disagree" goes a long way
>> to solve the problem that specific applications require us to split
>> hairs in different ways.
>> 
>> 
>> 



Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread A. Soroka
Thoughts in-line. (Incidentally, my immediate interest in RDF Patch is pretty 
similar; robustness via distribution, but there's also a smaller, more 
theoretical interest for me in automatically "shredding" or "sharding" datasets 
across networks for higher persistence and query throughput.)

---
A. Soroka
The University of Virginia Library

> On Oct 13, 2016, at 11:32 AM, Andy Seaborne  wrote:
> 
> ...
> 1/ Record changes to prefixes
> 
> Just handling quads/triples isn't enough - to keep two datasets in-step, we 
> also need to record changes to prefixes.  While they don't change the meaning 
> of the data, application developers and users like prefixes.

Boo, hiss, but I can see your point. The worry to me would be the inevitable 
semantic overloading that will come with it. But I guess that cake has already 
been baked by all the other RDF formats except N-Triples.

> 2/ Remove the in-data prefixes feature.
> 
> RDF Patch has the feature to define prefixes in the data and use them for 
> prefix names later in the data using @prefix.
> 
> This seems to have no real advantage, it can slow things down (c.f. N-Triples 
> parsing is faster than Turtle parsing - prefixes are part of that), and it 
> generally complicates the data form.
> 
> When including "add"/"delete" prefixes on the dataset (1) it also makes it 
> quite confusing.
> 
> Whether the "R" for "repeat" entry from previous row should also be removed 
> is an open question.

I would agree with removing R, and the reason is that it doesn't remove lines. 
In other words, the abbreviation it offers is pretty minimal. On the other 
hand, it is relatively cheap to implement (4 slots of state) so I wouldn't 
argue very much to remove it.

> 3/ Record transaction boundaries.
> 
> (A.3 in RDF Patch v1)
> http://afs.github.io/rdf-patch/#transaction-boundaries
> 
> Having the transaction boundaries recorded means that they can be replayed 
> when applying the patch.  While often a patch will be one transaction, 
> patches can be consolidated by concatenation.
> 
> There are 3 operations:
> 
> TB, TC, TA - Transaction Begin, Commit, Abort.
> 
> Abort is useful to include because to know whether a transaction in a patch 
> is going to commit or abort means waiting until the end.  That could be 
> buffering client-side, or buffering server-side (or not writing the patch to 
> a file) and having a means to discard a patch stream.
> 
> Instead, allow a transaction to record an abort, and say that aborted 
> transactions in patches can be discarded downstream.

This is very good stuff. It would be nice to include a definition of 
"transaction-compact" in which no TA may appear. It would enable RDF Patch 
readers to make a very convenient assumption. 

> 4/ Reversibility is a patch feature.
> 
> The RDF Patch v1 document includes "canonical patch" (section 9)
> http://afs.github.io/rdf-patch/#canonical-patches
> 
> Such a patch is reversible (it can undo changes) if the adds and deletes are 
> recorded only if they lead to a real change.  "Add quad" must mean "there was 
> no quad in the set before".  But this only makes sense if the whole patch has 
> this property.
> ...
> What would be useful is to label the patch itself to say whether it is 
> reversible.

Just a thought-- you could change BEGIN to permit "flags". So you could have:

BEGIN REVERSIBLE
patch
patch
patch
END

and you get "canonicity" on a per-transaction level. A patch could optionally 
make explicit its wrapping BEGIN and END for this kind of use.

> 5/ "RDF Git"
> 
> A patch should be able to record where it can be applied.  If RDF Patch is 
> being used to keep two datasets in-step, then some checking to know that the 
> patch can be applied to a copy because it is a patch created from the 
> previous version
> 
> So give each version of the dataset a UUID for a version then record the old 
> ("parent") UUID and the new UUID in the patch.
> ...
> Or some system may want to apply any patch and so create a tree of changes.  
> For the use case of keeping two datasets in-step, that's not what is wanted 
> but other use cases may be better served by having the primary version chain 
> sorted out by higher level software; a patch may be a "proposed change".

Yes, the roaring success of Git (and other DVCS) may imply that letting patches 
be pure changes (not connected to particular versions of the dataset, just 
"isolated" deltas) is the right way to think about them. The word "patch", 
itself, is usefully suggestive. That doesn't mean avoiding any versioning info, 
just making clear that datasets have versions, and the UUIDs associated with a 
given patch refer to where it _came from_, but you can still apply it to 
whatever you want (like cherry-picking Git commits).

Or another way to think about it: any dataset is just the sum of a series of 
patches (a random dataset with no history has an implicit history of one 
"virtual" patch with nothing but adds). So those UUIDs are roughly 

Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Andy Seaborne

Hi Paul,

The general goal of RDF Patch is to be "assembler" for changes, or 
"N-Triples for changes" - and there is no pattern matching capability.


In your example you'd have to know the old value:

TB
QD :Server :instanceType "r3.xlarge" .
QA :Server :instanceType "r3.2xlarge" .
TC


(aside: I wonder if instead of the 3/4 rule for triples/quads, a marker 
for the default graph is better so the tuple is always QD and 4 terms.


QD _ :Server :instanceType "r3.xlarge" .

or have TD, TA
)

On 13/10/16 17:02, Paul Houle wrote:

There is another use case for an "RDF Patch" which applies to
hand-written models.  For instance I have a model which describes a job
that is run in AWS that looks like

@prefix : 
@prefix parameter: 

:Server
   :subnetId "subnet-e0ab0197";
   :baseImage "ami-ea602afd";
   :instanceType "r3.xlarge";
   :keyName "o2key";
   :keyFile "~/AMZN Keys/o2key.ppk" ;
   :securityGroupIds "sg-bca0b2d9" ;
   :todo "dbpedia-load" ;
   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
   parameter:GRAPH_NAME "http://dbpedia.org/" ;
   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
   :iamProfile  ;
   :instanceName "Image Build Server";
   :qBase  .

one thing you might want to do is modify it so it uses a different
:baseImage or a different :instanceType and a natural way to do that is
to say

'remove :Server :instanceType ?x and insert :Server :instanceType
"r3.2xlarge"'


SPARQL Update can provide the "pattern matching" (or some subset like 
SparqlPatch [https://www.w3.org/2001/sw/wiki/SparqlPatch]):



DELETE { :Server :instanceType ?x }
INSERT { :Server :instanceType "r3.2xlarge" }
WHERE  { :Server :instanceType ?x }

or

DELETE WHERE { :Server :instanceType ?x }
;
INSERT DATA { :Server :instanceType "r3.2xlarge" }


That said, the one useful addition to RDF Patch which is "pattern 
matching" might be limited bulk delete.


QD <g> ANY ANY .

because listing all the triples to delete when they can be found from 
the data anyway is a big space saving.


Andy


but better than that if you have a schema that says ":instanceType is a
single valued property" you can write another graph like

:Server
   :instanceType "r3.2xlarge" .

and merge it with the first graph to get the desired effect.

More generally this fits into the theme that "the structure of
commonsense knowledge is that there are rules,  then exceptions to the
rules,  then exceptions to the exceptions of the rules,  etc."
For instance I extracted a geospatial database out of Freebase that was
about 10 million facts and I found I had to add and remove about 10
facts on the route to a 99% success rate at a geospatial recognition
task.  A disciplined approach to "agreeing to disagree" goes a long way
to solve the problem that specific applications require us to split
hairs in different ways.





Re: RDF Patch - experiences suggesting changes

2016-10-13 Thread Paul Houle
There is another use case for an "RDF Patch" which applies to
hand-written models.  For instance I have a model which describes a job
that is run in AWS that looks like

@prefix : 
@prefix parameter: 

:Server
   :subnetId "subnet-e0ab0197";
   :baseImage "ami-ea602afd";
   :instanceType "r3.xlarge";
   :keyName "o2key";
   :keyFile "~/AMZN Keys/o2key.ppk" ;
   :securityGroupIds "sg-bca0b2d9" ;
   :todo "dbpedia-load" ;
   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
   parameter:GRAPH_NAME "http://dbpedia.org/" ;
   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
   :iamProfile  ;
   :instanceName "Image Build Server";
   :qBase  .

one thing you might want to do is modify it so it uses a different
:baseImage or a different :instanceType and a natural way to do that is
to say

'remove :Server :instanceType ?x and insert :Server :instanceType
"r3.2xlarge"'

but better than that if you have a schema that says ":instanceType is a
single valued property" you can write another graph like

:Server
   :instanceType "r3.2xlarge" .

and merge it with the first graph to get the desired effect.

More generally this fits into the theme that "the structure of
commonsense knowledge is that there are rules,  then exceptions to the
rules,  then exceptions to the exceptions of the rules,  etc."

For instance I extracted a geospatial database out of Freebase that was
about 10 million facts and I found I had to add and remove about 10
facts on the route to a 99% success rate at a geospatial recognition
task.  A disciplined approach to "agreeing to disagree" goes a long way
to solve the problem that specific applications require us to split
hairs in different ways.



-- 
  Paul Houle
  paul.ho...@ontology2.com

On Thu, Oct 13, 2016, at 11:32 AM, Andy Seaborne wrote:
> I've been using modified RDF Patch for the data exchanged to keep 
> multiple datasets synchronized.
> 
> My primary use case is having multiple copies of the datasets for a high 
> availability solution.  It has to be a general solution for any data.
> 
> There are some changes to the format that this work has highlighted.
> 
> [RDF Patch - v1]
> https://afs.github.io/rdf-patch/
> 
> 
> 1/ Record changes to prefixes
> 
> Just handling quads/triples isn't enough - to keep two datasets in-step, 
> we also need to record changes to prefixes.  While they don't change the 
> meaning of the data, application developers and users like prefixes.
> 
> 2/ Remove the in-data prefixes feature.
> 
> RDF Patch has the feature to define prefixes in the data and use them 
> for prefix names later in the data using @prefix.
> 
> This seems to have no real advantage, it can slow things down (c.f. 
> N-Triples parsing is faster than Turtle parsing - prefixes are part of 
> that), and it generally complicates the data form.
> 
> When including "add"/"delete" prefixes on the dataset (1) it also makes 
> it quite confusing.
> 
> Whether the "R" for "repeat" entry from previous row should also be 
> removed is an open question.
> 
> 3/ Record transaction boundaries.
> 
> (A.3 in RDF Patch v1)
> http://afs.github.io/rdf-patch/#transaction-boundaries
> 
> Having the transaction boundaries recorded means that they can be 
> replayed when applying the patch.  While often a patch will be one 
> transaction, patches can be consolidated by concatenation.
> 
> There are 3 operations:
> 
> TB, TC, TA - Transaction Begin, Commit, Abort.
> 
> Abort is useful to include because to know whether a transaction in a 
> patch is going to commit or abort means waiting until the end.  That 
> could be buffering client-side, or buffering server-side (or not writing 
> the patch to a file) and having a means to discard a patch stream.
> 
> Instead, allow a transaction to record an abort, and say that aborted 
> transactions in patches can be discarded downstream.
> 
> 4/ Reversibility is a patch feature.
> 
> The RDF Patch v1 document includes "canonical patch" (section 9)
> http://afs.github.io/rdf-patch/#canonical-patches
> 
> Such a patch is reversible (it can undo changes) if the adds and deletes 
> are recorded only if they lead to a real change.  "Add quad" must mean 
> "there was no quad in the set before".  But this only makes sense if the 
> whole patch has this property.
> 
> RDF Patches are in general entries in a "redo log" - you can apply the 
> patch over and over again and it will end up in the same state (they are 
> idempotent).
> 
> A reversible patch is also an "undo log" entry and if you apply it in 
> reverse order, it acts to undo the patch played forwards.
> 
> Testing whether a triple or quad is already present while performing 
> updates is not cheap - and in some cases where the patch is being 
> computed without reference to an existing dataset may not be possible.
> 
> What would be useful is to label the patch 

RDF Patch - experiences suggesting changes

2016-10-13 Thread Andy Seaborne
I've been using modified RDF Patch for the data exchanged to keep 
multiple datasets synchronized.


My primary use case is having multiple copies of the datasets for a high 
availability solution.  It has to be a general solution for any data.


There are some changes to the format that this work has highlighted.

[RDF Patch - v1]
https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets in-step, 
we also need to record changes to prefixes.  While they don't change the 
meaning of the data, application developers and users like prefixes.
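
For illustration, a sketch of prefix-change rows using the "PA"/"PD" (add 
prefix / delete prefix) tokens discussed elsewhere in this thread; the exact 
syntax here is hypothetical:

PA "ex" <http://example/> .
PD "old" .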


2/ Remove the in-data prefixes feature.

RDF Patch has the feature to define prefixes in the data and use them 
for prefix names later in the data using @prefix.


This seems to have no real advantage, it can slow things down (c.f. 
N-Triples parsing is faster than Turtle parsing - prefixes are part of 
that), and it generally complicates the data form.


When including "add"/"delete" prefixes on the dataset (1) it also makes 
it quite confusing.


Whether the "R" for "repeat" entry from previous row should also be 
removed is an open question.


3/ Record transaction boundaries.

(A.3 in RDF Patch v1)
http://afs.github.io/rdf-patch/#transaction-boundaries

Having the transaction boundaries recorded means that they can be 
replayed when applying the patch.  While often a patch will be one 
transaction, patches can be consolidated by concatenation.


There are 3 operations:

TB, TC, TA - Transaction Begin, Commit, Abort.

Abort is useful to include because to know whether a transaction in a 
patch is going to commit or abort means waiting until the end.  That 
could be buffering client-side, or buffering server-side (or not writing 
the patch to a file) and having a means to discard a patch stream.


Instead, allow a transaction to record an abort, and say that aborted 
transactions in patches can be discarded downstream.
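
A sketch (hypothetical quads) of a stream in which an aborted transaction is 
recorded in-line and can be dropped by a downstream consumer:

TB
QA <http://example/s> <http://example/p> "tentative" .
TA
TB
QA <http://example/s> <http://example/p> "final" .
TC

A consumer replaying this can discard everything from the first TB up to and 
including the TA.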


4/ Reversibility is a patch feature.

The RDF Patch v1 document includes "canonical patch" (section 9)
http://afs.github.io/rdf-patch/#canonical-patches

Such a patch is reversible (it can undo changes) if the adds and deletes 
are recorded only if they lead to a real change.  "Add quad" must mean 
"there was no quad in the set before".  But this only makes sense if the 
whole patch has this property.


RDF Patches are in general entries in a "redo log" - you can apply the 
patch over and over again and it will end up in the same state (they are 
idempotent).


A reversible patch is also an "undo log" entry and if you apply it in 
reverse order, it acts to undo the patch played forwards.
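
A sketch (hypothetical quad) of a reversible change and its undo - the reverse 
is obtained by swapping QA/QD and playing the rows in reverse order:

TB
QD <http://example/s> <http://example/p> "old" .
QA <http://example/s> <http://example/p> "new" .
TC

played in reverse:

TB
QD <http://example/s> <http://example/p> "new" .
QA <http://example/s> <http://example/p> "old" .
TC

This only works if the forward patch records real changes - e.g. the QA really 
did add a quad that was previously absent.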


Testing whether a triple or quad is already present while performing 
updates is not cheap - and in some cases, where the patch is being 
computed without reference to an existing dataset, it may not be possible.


What would be useful is to label the patch itself to say whether it is 
reversible.


5/ "RDF Git"

A patch should be able to record where it can be applied.  If RDF Patch 
is being used to keep two datasets in-step, then some checking is needed 
to know that the patch can be applied to a copy because it is a patch 
created from the previous version.


So give each version of the dataset a UUID, then record the 
old ("parent") UUID and the new UUID in the patch.


If the versions are checked and enforced, we get a chain of versions and 
patches that lead from one state to another without risk of concurrent 
changes getting mixed in.


This is like git - a patch can be accepted if the versions align, 
otherwise it is rejected (more like a git repo not accepting a push than a 
merge conflict).
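
An illustrative sketch (hypothetical UUIDs) using the PARENT/VERSION fields 
described below: patch 1 takes the dataset from version <uuid-A> to <uuid-B>, 
and patch 2 from <uuid-B> to <uuid-C>:

Patch 1:   PARENT <uuid-A>   VERSION <uuid-B>
Patch 2:   PARENT <uuid-B>   VERSION <uuid-C>

A copy still at <uuid-A> accepts patch 1 and rejects patch 2 until patch 1 has 
been applied.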


Or some system may want to apply any patch and so create a tree of 
changes.  For the use case of keeping two datasets in-step, that's not 
what is wanted but other use cases may be better served by having the 
primary version chain sorted out by higher level software; a patch may 
be a "proposed change".


6/ Packets of change.

To have 4 (label a patch with reversible) and 5 (the version details), 
there needs to be somewhere to put the information. Having it in the 
patch itself means that the whole unit can be stored in a file.  If it 
is in the protocol, like HTTP for E-tags, then the information becomes 
separated.  That is not to say that it can't also be in the protocol, but 
it needs support in the data format.


7/ Checksum

Another feature to add to the packet is a checksum. A hash (which one? 
git uses SHA1) from start of packet header, including the initial 
version (UUID), the version on applying the patch (UUID) and the changes 
(i.e. start of packet to after the DOT of the last line of change), 
makes the packet robust to editing after creating it.   Like git; git 
uses it as the "object id".


So a patch packet for a single transaction:

PARENT <uuid1>
VERSION <uuid2>
REVERSIBLE   (optional)
TB
QA ...
QD ...
PA ...
PD ...
TC
H <hash>

where QA and QD are "quad add" and "quad delete", and "PA" and "PD" are "add 
prefix" and "delete prefix".