Re: [OSM-legal-talk] OSM for training ML machines
I'll defer to others on the finer points of how downstream or intermediate ML products fit into the licensing picture, but this did catch my eye:

> If you need an example: Take a translator for geographic names trained
> using OSM data. This translator in practical use will spit out names
> or name components identical to those from the OSM database (if it does
> not it'd be pretty useless). These names - in sufficient volume -
> evidently form a derivative database IMO - even if they are not the
> result of a literal copy but result from 'knowledge' encoded in a
> neural network.

I have sometimes seen similar arguments about intellectual property brought up in engineering-focused conversations, which propose elaborate technical mechanisms by which data might be transformed, then recreated, and in the process its intellectual property rights somehow purged. There are already several community guidelines explaining why this isn't acceptable. Beyond that, my own sense is that this is at odds with how the legal system approaches these questions (and reflects confusion about what "transformative use" means). If a party has custody of proprietary data, feeds it through a black box that a judge and jury don't really understand, and the original data comes out the other side--well, you can see why it's a hard argument to win.

I think the risk of recreating OSM data via ML trickery is pretty low: it's an elaborate approach that offers dubious legal advantages.

If the question is whether OSM could claim rights over a fictional but plausible map-like output from an OSM-trained ML model, I'd say the question is more open. But I suspect this is a less worrisome scenario for most.

On Wed, Apr 10, 2019 at 10:12 AM Christoph Hormann wrote:

> On Wednesday 10 April 2019, althio wrote:
> >
> > You may have skipped parts of my message, so excuse me if I repeat a
> > few lines. You quoted only two sentences and I slightly wonder if you
> > genuinely read the whole.
>
> I am sorry if i left the impression that i was specifically criticizing
> your ideas - i was more referring to the general course of the
> discussion towards a rather mechanical exegesis of the ODbL based on a
> simplistic view of how algorithms work as a mechanical process
> converting well defined input data into well defined output data.
>
> > [...]
> > I don't think my original message can be read as "sweepingly declare
> > any output of algorithms as having no copyright connection".
>
> I did not mean to imply that - but since your line of reasoning only
> covers this case it is to be expected that people assume this is the
> only relevant case.
>
> > [...]
> >
> > My final two cents:
> > Take the Geocoding guideline, replace "Geocoding" by "Machine
> > Learning" and this is, in my humble opinion, an acceptable first
> > draft for discussion.
>
> But as far as i understand you, you up-front want to declare
> the "database" behind the Machine Learning, i.e. the adaptive part of
> the algorithms that gets modified through training, to be a produced
> work and therefore not subject to share-alike.
>
> If not i don't see the practical usefulness in applying the geocoding
> guideline to this in analogy because while for geocoding the individual
> result is a frequent practical use case Machine Learning and similar
> algorithms are mostly used to produce bulk results which are usually
> substantial in terms of database law.
>
> As far as the Horizontal Layers guideline and the concept of produced
> works in general is concerned - the only consistent view of these
> concepts is IMO to consider them to be limited exclusively to cases
> when you are talking about things produced for and used only for direct
> human consumption.
>
> --
> Christoph Hormann
> http://www.imagico.de/

___
legal-talk mailing list
legal-talk@openstreetmap.org
https://lists.openstreetmap.org/listinfo/legal-talk
Re: [OSM-legal-talk] OSM for training ML machines
On Wednesday 10 April 2019, althio wrote:
>
> You may have skipped parts of my message, so excuse me if I repeat a
> few lines. You quoted only two sentences and I slightly wonder if you
> genuinely read the whole.

I am sorry if i left the impression that i was specifically criticizing your ideas - i was more referring to the general course of the discussion towards a rather mechanical exegesis of the ODbL based on a simplistic view of how algorithms work as a mechanical process converting well defined input data into well defined output data.

> [...]
> I don't think my original message can be read as "sweepingly declare
> any output of algorithms as having no copyright connection".

I did not mean to imply that - but since your line of reasoning only covers this case it is to be expected that people assume this is the only relevant case.

> [...]
>
> My final two cents:
> Take the Geocoding guideline, replace "Geocoding" by "Machine
> Learning" and this is, in my humble opinion, an acceptable first
> draft for discussion.

But as far as i understand you, you up-front want to declare the "database" behind the Machine Learning, i.e. the adaptive part of the algorithms that gets modified through training, to be a produced work and therefore not subject to share-alike.

If not i don't see the practical usefulness in applying the geocoding guideline to this in analogy because while for geocoding the individual result is a frequent practical use case Machine Learning and similar algorithms are mostly used to produce bulk results which are usually substantial in terms of database law.

As far as the Horizontal Layers guideline and the concept of produced works in general is concerned - the only consistent view of these concepts is IMO to consider them to be limited exclusively to cases when you are talking about things produced for and used only for direct human consumption.

--
Christoph Hormann
http://www.imagico.de/
Re: [OSM-legal-talk] OSM for training ML machines
Christoph,

You may have skipped parts of my message, so excuse me if I repeat a few lines. You quoted only two sentences and I slightly wonder if you genuinely read the whole. If I misread your critique, please help me and maybe quote the exact and detailed part where you disagree, not the introductory summary only.

> > A typical "learned" model, based on a ML algorithm and a substantial
> > extract of OSM data:
> > That seems like a Produced Work to me.
> >
> Maybe i have not been clear enough with my comment - approaching this
> matter based on gut feeling and wishful thinking (seems like...)
> without considering the practical effects is a very bad idea.

I stand by my initial assessment: In a **typical use case** for applying ML algorithms (not just replicate the training data in bulk), I consider Produced Work as the best fit.

> You can design 'learning' algorithms to essentially replicate the
> training data so to just sweepingly declare any output of algorithms as
> having no copyright connection to training data is a recipe for disaster [...]

If you replicate the original OSM data, in a substantial amount, this does not qualify any more. This is underlined in my first message under:

"licence for the results (outputs): **provided they are an insubstantial extract or contain no OSM data** [...]"

and

"If the results (outputs) are used to create a new database that contains the whole or a substantial part of the contents of the OSM database, this new database would be considered a Derivative Database and would trigger share-alike obligations under section 4.4.b of the ODbL. [shameless plug of Geocoding guideline]"

I don't think my original message can be read as "sweepingly declare any output of algorithms as having no copyright connection". I don't think we can have a fruitful discussion if you selectively read messages or redact some important parts.

> is a recipe for disaster (if you subscribe to the spirit of the ODbL) or a
> recipe for success (if your goal is to abolish share-alike and attribution
> through the back door - which of course many corporate OSM data users would
> find highly desirable).

No other comment on this section.

> as also said concentrating exclusively on the produced work vs. derivative
> database is not really helpful,
> [snip]
> that does not mean that the output of this algorithm, [...], is not a
> derivative database.

I consider your interpretation very similar to mine. I fail to see what you are criticizing.

> If you need an example: Take a translator for geographic names [...].
> These names - in sufficient volume - evidently form a derivative database IMO

I agree with you. Yet I don't see why you provide this example, or where you disagree with me. For the record: I find the geocoding example more interesting since it already has practical applications, it provides a parallel for the data process of ML and it comes with a Community-LegalWG guideline.

> When considering this subject, maybe think of it less as a question of
> copying data, think of it more as a process of mimicry.

My final two cents:
Take the Geocoding guideline, replace "Geocoding" by "Machine Learning" and this is, in my humble opinion, an acceptable first draft for discussion.

Is this draft suitable, or are there any parts that do not hold against reality or practical effects? Is there a need to take into account the type of input and output data, and whether the output data is suitable for inclusion in a geographical database such as OSM?

See also bits of the Horizontal Layers guideline, such as "If you improve data used in the OpenStreetMap layer, such as additions or factual corrections, then you need to share those improvements." Would they apply? How could it be extended for non-map products?

--
althio
Re: [OSM-legal-talk] OSM for training ML machines
On Wednesday 10 April 2019, althio wrote:
>
> A typical "learned" model, based on a ML algorithm and a substantial
> extract of OSM data:
> That seems like a Produced Work to me.
>
> Hence...
> [...]

Maybe i have not been clear enough with my comment - approaching this matter based on gut feeling and wishful thinking (seems like...) without considering the practical effects is a very bad idea.

You can design 'learning' algorithms to essentially replicate the training data so to just sweepingly declare any output of algorithms as having no copyright connection to training data is a recipe for disaster (if you subscribe to the spirit of the ODbL) or a recipe for success (if your goal is to abolish share-alike and attribution through the back door - which of course many corporate OSM data users would find highly desirable).

And as also said concentrating exclusively on the produced work vs. derivative database is not really helpful, in particular since we have established a long time ago that using a produced work to reconstruct semantic information of substantial volume will not set you free of the requirements of the ODbL regarding derivative databases. So even if you have a basis for considering the algorithm trained with OSM data a produced work, that does not mean that the output of this algorithm, which might be data of exactly the same type as in the OSM database, is not a derivative database.

If you need an example: Take a translator for geographic names trained using OSM data. This translator in practical use will spit out names or name components identical to those from the OSM database (if it does not it'd be pretty useless). These names - in sufficient volume - evidently form a derivative database IMO - even if they are not the result of a literal copy but result from 'knowledge' encoded in a neural network.

When considering this subject, maybe think of it less as a question of copying data, think of it more as a process of mimicry.

--
Christoph Hormann
http://www.imagico.de/
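The point above, that a 'learning' algorithm can be designed to essentially replicate its training data, can be sketched with a deliberately trivial model. This is only an illustration under stated assumptions: the name pairs are invented placeholders rather than an OSM extract, and `train`, `translate` and `bulk_outputs` are hypothetical helper names, not part of any real library. The 'model' is just a frequency table, far simpler than a neural network, but it shows the same effect: queried in bulk, it hands back the names it was trained on.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs (local name, English name). These are invented
# for illustration and are not taken from the OSM database.
TRAINING_PAIRS = [
    ("Köln", "Cologne"),
    ("München", "Munich"),
    ("Wien", "Vienna"),
    ("Wien", "Vienna"),
    ("Wien", "Wien"),
]


def train(pairs):
    """Build the 'model': for each input name, count the translations seen."""
    model = defaultdict(Counter)
    for source, target in pairs:
        model[source][target] += 1
    return model


def translate(model, name):
    """Return the most frequent translation seen in training, or None if unseen."""
    if name not in model:
        return None
    return model[name].most_common(1)[0][0]


def bulk_outputs(model, names):
    """Run the model over many inputs, as a bulk user of the model would."""
    return {name: translate(model, name) for name in names}


model = train(TRAINING_PAIRS)
outputs = bulk_outputs(model, ["Köln", "München", "Wien", "Graz"])

# Outputs that are literally names stored during training.
training_targets = {target for _, target in TRAINING_PAIRS}
reproduced = [value for value in outputs.values() if value in training_targets]
```

In this sketch three of the four bulk outputs are names stored during training, and only the unseen input yields nothing. Whether a real model behaves this way depends on its design and on the volume of queries run against it.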
Re: [OSM-legal-talk] OSM for training ML machines
I will add my 2 cents in the same pot as Kathleen.

A typical "learned" model, based on a ML algorithm and a substantial extract of OSM data:
That seems like a Produced Work to me.

Hence...
- licence for the training inputs (underlying database, data structures built before learning): release under ODbL (Derivative Database; publish the entire database; or alterations; or algorithm)
- licence for the model (weights, internal data structures built during learning): Produced Work, release under any license that you like (Share Alike: no), required to credit OpenStreetMap (Attribution: yes)
- licence for the results (outputs): provided they are an insubstantial extract or contain no OSM data, release under any license that you like (Share Alike: no), not required to credit OpenStreetMap (Attribution: no)

If the results (outputs) are used to create a new database that contains the whole or a substantial part of the contents of the OSM database, this new database would be considered a Derivative Database and would trigger share-alike obligations under section 4.4.b of the ODbL. [shameless plug of Geocoding guideline]

In fact, I think the Geocoding guideline is a very good starting point and could be extended to cover other applications (ML-based or not).

Geocoder underlying database ~equivalent~ training inputs
Geocoder application ~equivalent~ ML-based model
Geocoding results ~equivalent~ model outputs

This is my understanding or interpretation of the current materials:
https://opendatacommons.org/licenses/odbl/1.0/
https://wiki.osmfoundation.org/wiki/Licence/Community_Guidelines/Produced_Work_-_Guideline
https://wiki.openstreetmap.org/wiki/Open_Data_License/Produced_Work_-_Guideline
https://wiki.osmfoundation.org/wiki/Licence/Community_Guidelines/Geocoding_-_Guideline

--
althio

On Tue, 9 Apr 2019 at 15:35, Kathleen Lu via legal-talk wrote:
>
> My two cents:
> I'm not sure what you mean by internal data structures. If OSM data is used
> to train a ML algorithm, then I would think that the training inputs could be
> a substantial extract (possibly a trivial transformation of an extract). But
> what is trained would be an algorithm/weights, which I generally do not think
> of as a database at all? But since it uses an OSM database, a Produced Work
> seems the right concept:
> "a work (such as an image, audiovisual material, text,
> or sounds) resulting from using the whole or a Substantial part of the
> Contents (via a search or other query) from this Database, a Derivative
> Database, or this Database as part of a Collective Database."
> -Kathleen
>
>
>
> On Tue, Apr 9, 2019 at 5:06 AM Frederik Ramm wrote:
>>
>> Hi,
>>
>> is it a community consensus that, when someone uses OSM to train their
>> machine learning "black box", the internal data structures built during
>> learning constitute a derivative database? Or are there people who argue
>> that somehow the "black box" can ingest OSM data at will and still
>> remain 100% intellectual property of its operator?
>>
>> Further, assuming that we have a system that has ingested OSM by deep
>> learning and we say that this means its internal database is ODbL, what
>> would this mean for the output later produced by the same machine?
>>
>> Bye
>> Frederik
>>
>> --
>> Frederik Ramm ## eMail frede...@remote.org ## N49°00'09" E008°23'33"
Re: [OSM-legal-talk] OSM for training ML machines
On Tuesday 09 April 2019, Frederik Ramm wrote:
>
> is it a community consensus that, when someone uses OSM to train
> their machine learning "black box", the internal data structures
> built during learning constitute a derivative database? Or are there
> people who argue that somehow the "black box" can ingest OSM data at
> will and still remain 100% intellectual property of its operator?
>
> Further, assuming that we have a system that has ingested OSM by deep
> learning and we say that this means its internal database is ODbL,
> what would this mean for the output later produced by the same
> machine?

I see two underlying questions in this that are both not really specific to OSM and the ODbL:

* does training a neural network or some other kind of self learning/self adjusting algorithm create a derivative work of the training data?
* under what circumstances does running/applying an algorithm (which is not commonly understood to produce a derivative work of the algorithm itself) disseminate so much of itself in its output (the extreme case of this being a self replicating program) that its output needs to be considered a derivative work of the algorithm itself?

I find both of these to be fascinating and significant questions but as said before i suppose there is already significant literature on this so it might not make that much sense to contemplate how the OSM community would like the answers to these questions to be in isolation without looking at how this is seen elsewhere.

What makes things more complicated in the OSM case is the distinction between produced work and derivative database. That is indeed a question we need to discuss in the OSM community specifically. But it does not really make sense to start this discussion before having some kind of consensus on the more fundamental questions mentioned before.

And i'd like to in that context quote myself with something i said here last June:

> And yes, we probably need a broader discussion on the topic of
> analytic use of OSM data, in particular in the context of 'big data',
> and how this relates to the ODbL. It seems to me opinions on this
> are too much based on wishful thinking and too little aim to form a
> consistent framework that supports desirable and harmless use cases
> but does not create loopholes against the spirit of the license.

--
Christoph Hormann
http://www.imagico.de/