Re: [OSM-legal-talk] OSM for training ML machines

2019-04-10 Thread Tom Lee via legal-talk
I'll defer to others on the finer points of how downstream or intermediate
ML products fit into the licensing picture, but this did catch my eye:

> If you need an example:  Take a translator for geographic names trained
> using OSM data.  This translator in practical use will spit out names
> or name components identical to those from the OSM database (if it does
> not it'd be pretty useless).  These names - in sufficient volume -
> evidently form a derivative database IMO - even if they are not the
> result of a literal copy but result from 'knowledge' encoded in a
> neural network.

I have sometimes seen similar arguments about intellectual property brought
up in engineering-focused conversations, which propose elaborate technical
mechanisms by which data might be transformed, then recreated, and in the
process its intellectual property rights somehow purged. There are already
several community guidelines explaining why this isn't acceptable. Beyond
that, my own sense is that this is at odds with how the legal system
approaches these questions (and reflects confusion about what
"transformative use" means). If a party has custody of proprietary data,
feeds it through a black box that a judge and jury don't really understand,
and the original data comes out the other side--well, you can see why it's
a hard argument to win.

I think the risk of recreating OSM data via ML trickery is pretty low: it's
an elaborate approach that offers dubious legal advantages. If the question
is whether OSM could claim rights over a fictional but plausible map-like
output from an OSM-trained ML model, I'd say the question is more open. But
I suspect this is a less worrisome scenario for most.

On Wed, Apr 10, 2019 at 10:12 AM Christoph Hormann wrote:

> On Wednesday 10 April 2019, althio wrote:
> >
> > You may have skipped parts of my message, so excuse me if I repeat a
> > few lines. You quoted only two sentences and I slightly wonder if you
> > genuinely read the whole.
>
> I am sorry if i left the impression that i was specifically criticizing
> your ideas - i was more referring to the general course of the
> discussion towards a rather mechanical exegesis of the ODbL based on a
> simplistic view of how algorithms work as a mechanical process
> converting well defined input data into well defined output data.
>
> > [...]
> > I don't think my original message can be read as "sweepingly declare
> > any output of algorithms as having no copyright connection".
>
> I did not mean to imply that - but since your line of reasoning only
> covers this case it is to be expected that people assume this is the
> only relevant case.
>
> > [...]
> >
> > My final two cents:
> > Take the Geocoding guideline, replace "Geocoding" by "Machine
> > Learning" and this is, in my humble opinion, an acceptable first
> > draft for discussion.
>
> But as far as i understand you, you up-front want to declare
> the "database" behind the Machine Learning, i.e. the adaptive part of
> the algorithms that gets modified through training, to be a produced
> work and therefore not subject to share-alike.
>
> If not i don't see the practical usefulness in applying the geocoding
> guideline to this in analogy because while for geocoding the individual
> result is a frequent practical use case Machine Learning and similar
> algorithms are mostly used to produce bulk results which are usually
> substantial in terms of database law.
>
> As far as the Horizontal Layers guideline and the concept of produced
> works in general is concerned - the only consistent view of these
> concepts is IMO to consider them to be limited exclusively to cases
> when you are talking about things produced for and used only for direct
> human consumption.
>
> --
> Christoph Hormann
> http://www.imagico.de/
___
legal-talk mailing list
legal-talk@openstreetmap.org
https://lists.openstreetmap.org/listinfo/legal-talk


Re: [OSM-legal-talk] OSM for training ML machines

2019-04-10 Thread Christoph Hormann
On Wednesday 10 April 2019, althio wrote:
>
> You may have skipped parts of my message, so excuse me if I repeat a
> few lines. You quoted only two sentences and I slightly wonder if you
> genuinely read the whole.

I am sorry if i left the impression that i was specifically criticizing
your ideas - i was more referring to the general course of the
discussion towards a rather mechanical exegesis of the ODbL based on a
simplistic view of how algorithms work as a mechanical process
converting well defined input data into well defined output data.

> [...]
> I don't think my original message can be read as "sweepingly declare
> any output of algorithms as having no copyright connection".

I did not mean to imply that - but since your line of reasoning only
covers this case it is to be expected that people assume this is the
only relevant case.

> [...]
>
> My final two cents:
> Take the Geocoding guideline, replace "Geocoding" by "Machine
> Learning" and this is, in my humble opinion, an acceptable first
> draft for discussion.

But as far as i understand you, you up-front want to declare
the "database" behind the Machine Learning, i.e. the adaptive part of
the algorithms that gets modified through training, to be a produced
work and therefore not subject to share-alike.

If not, i don't see the practical usefulness in applying the geocoding
guideline to this by analogy because, while for geocoding the individual
result is a frequent practical use case, Machine Learning and similar
algorithms are mostly used to produce bulk results which are usually
substantial in terms of database law.

As far as the Horizontal Layers guideline and the concept of produced
works in general is concerned - the only consistent view of these
concepts is IMO to consider them to be limited exclusively to cases
when you are talking about things produced for and used only for direct
human consumption.

--
Christoph Hormann
http://www.imagico.de/



Re: [OSM-legal-talk] OSM for training ML machines

2019-04-10 Thread althio
Christoph,

You may have skipped parts of my message, so excuse me if I repeat a few lines.
You quoted only two sentences and I slightly wonder if you genuinely
read the whole.
If I misread your critique, please help me by quoting the exact,
detailed part where you disagree, not only the introductory
summary.


> > A typical "learned" model, based on a ML algorithm and a substantial
> > extract of OSM data:
> > That seems like a Produced Work to me.
>
> Maybe i have not been clear enough with my comment - approaching this
> matter based on gut feeling and wishful thinking (seems like...)
> without considering the practical effects is a very bad idea.

I stand by my initial assessment:
In a **typical use case** for applying ML algorithms (not merely
replicating the training data in bulk), I consider Produced Work the
best fit.


> You can design 'learning' algorithms to essentially replicate the
> training data so to just sweepingly declare any output of algorithms as
> having no copyright connection to training data is a recipe for disaster [...]

If you replicate the original OSM data in a substantial amount, this
no longer qualifies.
This is underlined in my first message under:
"licence for the results (outputs): **provided they are an
insubstantial extract or contain no OSM data** [...]"
and
"If the results (outputs) are used to create a new database that
contains the whole or a substantial part of the contents of the OSM
database, this new database would be considered a Derivative Database
and would trigger share-alike obligations under section 4.4.b of the
ODbL. [shameless plug of Geocoding guideline]"

I don't think my original message can be read as "sweepingly declare
any output of algorithms as having no copyright connection".
I don't think we can have a fruitful discussion if you selectively
read messages or redact some important parts.


> is a recipe for disaster (if you subscribe to the spirit of the ODbL) or a
> recipe for success (if your goal is to abolish share-alike and attribution
> through the back door - which of course many corporate OSM data users would
> find highly desirable).

No other comment on this section.


> as also said concentrating exclusively on the produced work vs. derivative 
> database is not really helpful,
> [snip]
> that does not mean that the output of this algorithm, [...], is not a 
> derivative database.

I consider your interpretation very similar to mine.
I fail to see what you are criticizing.


> If you need an example:  Take a translator for geographic names [...].
> These names - in sufficient volume - evidently form a derivative database IMO

I agree with you.
Yet I don't see why you provide this example, or where you disagree with me.

For the record: I find the geocoding example more interesting since it
already has practical applications, it provides a parallel for the
data process of ML and it comes with a Community-LegalWG guideline.


> When considering this subject, maybe think of it less as a question of
> copying data, think of it more as a process of mimicry.

My final two cents:
Take the Geocoding guideline, replace "Geocoding" by "Machine
Learning" and this is, in my humble opinion, an acceptable first draft
for discussion.
Is this draft suitable, or are there any parts that do not hold up against
reality or practical effects?
Is there a need to take into account the type of input and output
data, and whether the output data is suitable for inclusion in a
geographical database such as OSM?
See also bits of the Horizontal Layers guideline, such as "If you
improve data used in the OpenStreetMap layer, such as additions or
factual corrections, then you need to share those improvements." Would
they apply? How could it be extended to non-map products?

-- althio



Re: [OSM-legal-talk] OSM for training ML machines

2019-04-10 Thread Christoph Hormann
On Wednesday 10 April 2019, althio wrote:
>
> A typical "learned" model, based on a ML algorithm and a substantial
> extract of OSM data:
> That seems like a Produced Work to me.
>
> Hence...
> [...]

Maybe i have not been clear enough with my comment - approaching this
matter based on gut feeling and wishful thinking (seems like...)
without considering the practical effects is a very bad idea.

You can design 'learning' algorithms to essentially replicate the
training data, so to just sweepingly declare any output of algorithms as
having no copyright connection to training data is a recipe for
disaster (if you subscribe to the spirit of the ODbL) or a recipe for
success (if your goal is to abolish share-alike and attribution through
the back door - which of course many corporate OSM data users would
find highly desirable).

And, as also said, concentrating exclusively on the produced work vs.
derivative database distinction is not really helpful, in particular since we have
established a long time ago that using a produced work to reconstruct
semantic information of substantial volume will not set you free of the
requirements of the ODbL regarding derivative databases.  So even if
you have a basis for considering the algorithm trained with OSM data a
produced work, that does not mean that the output of this algorithm,
which might be data of exactly the same type as in the OSM database, is
not a derivative database.

If you need an example:  Take a translator for geographic names trained
using OSM data.  This translator in practical use will spit out names
or name components identical to those from the OSM database (if it does
not it'd be pretty useless).  These names - in sufficient volume -
evidently form a derivative database IMO - even if they are not the
result of a literal copy but result from 'knowledge' encoded in a
neural network.

When considering this subject, maybe think of it less as a question of
copying data, think of it more as a process of mimicry.
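
A minimal sketch of the mimicry point, assuming a toy "translator" that is
nothing more than a lookup learned from (local name, translated name) pairs.
The `train`, `translate` and `replicated_fraction` helpers and the sample
pairs are hypothetical and stand in for a real neural network; the sample
pairs are made up for illustration and are not taken from the OSM database.

```python
# Toy illustration only: a "model" that learns name translations by
# memorising its training pairs. A real network generalises more, but for
# names it has seen it is expected to return the same strings.

def train(pairs):
    """Build the 'learned' state: a mapping from source name to target name."""
    model = {}
    for source, target in pairs:
        model[source] = target
    return model


def translate(model, name):
    """Return the learned translation, or the input unchanged if unseen."""
    return model.get(name, name)


def replicated_fraction(model, pairs):
    """Share of training targets that the model reproduces verbatim."""
    if not pairs:
        return 0.0
    hits = sum(1 for source, target in pairs
               if translate(model, source) == target)
    return hits / len(pairs)


# Hypothetical sample data for the sketch.
training_pairs = [
    ("München", "Munich"),
    ("Köln", "Cologne"),
    ("Wien", "Vienna"),
]

model = train(training_pairs)
print([translate(model, source) for source, _ in training_pairs])
print(replicated_fraction(model, training_pairs))
```

Nothing in the model is a literal copy of a database table, yet running it
over enough inputs gives back the training names one by one, which is the
situation the translator example describes.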

--
Christoph Hormann
http://www.imagico.de/



Re: [OSM-legal-talk] OSM for training ML machines

2019-04-10 Thread althio
I will add my 2 cents in the same pot as Kathleen.

A typical "learned" model, based on a ML algorithm and a substantial
extract of OSM data:
That seems like a Produced Work to me.

Hence...
- licence for the training inputs (underlying database, data
structures built before learning): release under ODbL (Derivative
Database; publish the entire database; or alterations; or algorithm)
- licence for the model (weights, internal data structures built
during learning): Produced Work, release under any license that you
like (Share Alike: no), required to credit OpenStreetMap (Attribution:
yes)
- licence for the results (outputs): provided they are an
insubstantial extract or contain no OSM data, release under any
license that you like (Share Alike: no), not required to credit
OpenStreetMap (Attribution: no)

If the results (outputs) are used to create a new database that
contains the whole or a substantial part of the contents of the OSM
database, this new database would be considered a Derivative Database
and would trigger share-alike obligations under section 4.4.b of the
ODbL. [shameless plug of Geocoding guideline]

In fact, I think the Geocoding guideline is a very good starting point
and could be extended to cover other applications (ML-based or not).
Geocoder underlying database ~equivalent~ training inputs
Geocoder application ~equivalent~ ML-based model
Geocoding results ~equivalent~ model outputs
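
The three-way split above can be written down as a small lookup, purely as a
sketch of this proposal and not as a statement of what the ODbL requires.
The `Terms` type, the `proposed_terms` function and the stage names are
hypothetical; `attribution` is `None` wherever the proposal above does not
spell it out.

```python
from typing import NamedTuple, Optional


class Terms(NamedTuple):
    classification: str
    share_alike: bool
    attribution: Optional[bool]  # None = not spelled out in the proposal


def proposed_terms(stage: str, substantial_osm_extract: bool = False) -> Terms:
    """Map a pipeline stage to the terms suggested in this message.

    stage is one of "training_inputs", "model" or "outputs" (hypothetical
    labels). substantial_osm_extract only matters for "outputs".
    """
    if stage == "training_inputs":
        # Geocoder underlying database ~equivalent~ training inputs
        return Terms("Derivative Database (ODbL)", True, None)
    if stage == "model":
        # Geocoder application ~equivalent~ ML-based model
        return Terms("Produced Work", False, True)
    if stage == "outputs":
        # Geocoding results ~equivalent~ model outputs
        if substantial_osm_extract:
            # Outputs assembled into a database holding a substantial part
            # of OSM: share-alike is triggered again.
            return Terms("Derivative Database (ODbL)", True, None)
        return Terms("insubstantial extract / no OSM data", False, False)
    raise ValueError(f"unknown stage: {stage!r}")


print(proposed_terms("model"))
print(proposed_terms("outputs", substantial_osm_extract=True))
```

The only branch that depends on volume is the last one, which is where the
Geocoding guideline draws its line as well.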

This is my understanding or interpretation of the current materials:
https://opendatacommons.org/licenses/odbl/1.0/
https://wiki.osmfoundation.org/wiki/Licence/Community_Guidelines/Produced_Work_-_Guideline
https://wiki.openstreetmap.org/wiki/Open_Data_License/Produced_Work_-_Guideline
https://wiki.osmfoundation.org/wiki/Licence/Community_Guidelines/Geocoding_-_Guideline

-- althio

On Tue, 9 Apr 2019 at 15:35, Kathleen Lu via legal-talk wrote:
>
> My two cents:
> I'm not sure what you mean by internal data structures. If OSM data is used 
> to train a ML algorithm, then I would think that the training inputs could be 
> a substantial extract (possibly a trivial transformation of an extract). But 
> what is trained would be an algorithm/weights, which I generally do not think 
> of as a database at all? But since it uses an OSM database, a Produced Work 
> seems the right concept:
> "a work (such as an image, audiovisual material, text,
> or sounds) resulting from using the whole or a Substantial part of the
> Contents (via a search or other query) from this Database, a Derivative
> Database, or this Database as part of a Collective Database."
> -Kathleen
>
>
>
> On Tue, Apr 9, 2019 at 5:06 AM Frederik Ramm wrote:
>>
>> Hi,
>>
>> is it a community consensus that, when someone uses OSM to train their
>> machine learning "black box", the internal data structures built during
>> learning constitute a derivative database? Or are there people who argue
>> that somehow the "black box" can ingest OSM data at will and still
>> remain 100% intellectual property of its operator?
>>
>> Further, assuming that we have a system that has ingested OSM by deep
>> learning and we say that this means its internal database is ODbL, what
>> would this mean for the output later produced by the same machine?
>>
>> Bye
>> Frederik
>>
>> --
>> Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09" E008°23'33"


Re: [OSM-legal-talk] OSM for training ML machines

2019-04-09 Thread Christoph Hormann
On Tuesday 09 April 2019, Frederik Ramm wrote:
>
> is it a community consensus that, when someone uses OSM to train
> their machine learning "black box", the internal data structures
> built during learning constitute a derivative database? Or are there
> people who argue that somehow the "black box" can ingest OSM data at
> will and still remain 100% intellectual property of its operator?
>
> Further, assuming that we have a system that has ingested OSM by deep
> learning and we say that this means its internal database is ODbL,
> what would this mean for the output later produced by the same
> machine?

I see two underlying questions in this that are both not really specific 
to OSM and the ODbL:

* does training a neural network or some other kind of self
learning/self adjusting algorithm create a derivative work of the
training data?

* under what circumstances does running/applying an algorithm (which is
not commonly understood to produce a derivative work of the algorithm
itself) disseminate so much of itself in its output (the extreme case
of this being a self replicating program) that its output needs to be
considered a derivative work of the algorithm itself?

I find both of these to be fascinating and significant questions but, as
said before, i suppose there is already significant literature on this,
so it might not make that much sense to contemplate in isolation what
answers the OSM community would like to these questions without looking
at how this is seen elsewhere.

What makes things more complicated in the OSM case is the distinction 
between produced work and derivative database.  That is indeed a 
question we need to discuss in the OSM community specifically.  But it 
does not really make sense to start this discussion before having some 
kind of consensus on the more fundamental questions mentioned before.

And in that context i'd like to quote something i said here
last June:

> And yes, we probably need a broader discussion on the topic of
> analytic use of OSM data, in particular in the context of 'big data',
> and how this relates to the ODbL.  It seems to me opinions on this
> are too much based on wishful thinking and too little aim to form a
> consistent framework that supports desirable and harmless use cases
> but does not create loopholes against the spirit of the license.

-- 
Christoph Hormann
http://www.imagico.de/
