For better or worse, it seems clear that the cat is out of the bag.
Identity detection through stylometry is now an established technology, and
you can easily find code on GitHub or elsewhere (e.g.
https://github.com/jabraunlin/reddit-user-id) to accomplish it, provided you
have the time and energy to build a data set and train the model. Back in
2017, there was even a start-up offering this as a service. Whatever danger
is embodied in Amir's code, it's only a matter of time before that danger is
ubiquitous. As for the worst-case scenario (governments using the technology
to hunt down dissidents), I imagine this is already happening.

So while I agree there is a moral consideration in releasing this software,
I don't think the moral implications are actually that large. Eventually, we
will just have to accept that creating separate accounts is not an effective
way to protect your identity. That said, I think taking precautions to
minimize (or at least slow down) the potential abuse of this technology is
sensible. TheDJ offered many good suggestions in this vein, so I won't
repeat them here.

Overall, though, I think moving ahead with this tool is a good idea, and I
hope you are able to come to a solution that is acceptable to everyone. The
WMF is also interested in this technology (as a potential mitigation for IP
masking), so the outcome may help inform their work as well.

On Fri, Aug 7, 2020 at 5:51 AM Derk-Jan Hartman <[email protected]> wrote:

> Like others, I see several problems:
> 1. If the code is public, someone can duplicate it and bypass our internal
> 'safekeeping', because it uses public data.
> 2. Risk of misuse by either incompetence or malice
> 3. Risk of accidentally exposing legitimate sockpuppets even in the most
> closed off situations.
> 4. Give ppl insight into how the AI works
>
> My answers to those:
>
> 1. I have no problem with keeping this code in a private repo (yet
> technically open-sourced). We also run private mailing lists and have
> private repos for configuration secrets. Yes, it is a bit of a stretch,
> but.. IAR. At the same time, from the description, it seems like something
> any AI developer with a bit of determination can reproduce... so... for
> how long will this matter?
> 2. NDA + OAuth access for those who need it. Aggressive action logging of
> usage of the software. Showing these logs to all users of the tool to
> enforce social control: "User X investigated the matches of account: Y",
> "User Z investigated match on previously known sockpuppet BlockedQ".
> 3. Usage-wise, I'd have two flows.
>     1. Matches: Surface 'matches' that match previously known sockpuppets
> (will require keeping track of that list). Only disclose details of a match
> upon additional user action (logged).
>     2. Requests: Enter specific account name(s) and request if there are
> matches on/between that/those name(s). (logged)
>     Those flows might have different levels of match certainty perhaps...
>     If you want to go even further: require sign-off on a request by
> another user before you can actually view the matches.
> 4. That does leave you with the problem of how you can give ppl insight
> into why an AI matched something.. that is a hard problem. I don't know
> enough about that problem space.
>
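
Inline, on points 2 and 3 above: the logging and sign-off flow could be
sketched roughly as follows. This is only an illustration under stated
assumptions, not an existing tool. The class name AuditedMatchViewer, its
methods, and the `lookup` callback (standing in for whatever actually
produces the matches) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedMatchViewer:
    """Hypothetical wrapper: every request and view is logged, and match
    details are only released after a second user signs off."""
    log: list = field(default_factory=list)        # shown to all tool users
    requests: dict = field(default_factory=dict)   # request id -> (requester, account)
    approvals: dict = field(default_factory=dict)  # request id -> approver

    def _record(self, actor, action, target):
        stamp = datetime.now(timezone.utc).isoformat()
        self.log.append(f"{stamp} {actor} {action}: {target}")

    def request(self, requester, account):
        """Register a request to see matches for `account`; returns its id."""
        request_id = len(self.requests) + 1
        self.requests[request_id] = (requester, account)
        self._record(requester, "requested matches of account", account)
        return request_id

    def sign_off(self, approver, request_id):
        """A different user must approve the request before it can be viewed."""
        requester, account = self.requests[request_id]
        if approver == requester:
            raise PermissionError("sign-off must come from another user")
        self.approvals[request_id] = approver
        self._record(approver, "signed off on request for account", account)

    def view(self, request_id, lookup):
        """Release match details only for approved requests, and log the view."""
        requester, account = self.requests[request_id]
        if request_id not in self.approvals:
            raise PermissionError("request has not been signed off yet")
        self._record(requester, "investigated the matches of account", account)
        return lookup(account)
```

With this shape, calling view() before sign_off() raises PermissionError,
and after an approved view the shared log ends with an entry of the form
"<timestamp> UserX investigated the matches of account: AccountY".
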
> DJ
>
> > On 6 Aug 2020, at 04:33, Amir Sarabadani <[email protected]> wrote:
> >
> > Hey,
> > I have an ethical question that I haven't been able to answer yet. I have
> > been asking around but have no definite answer, so I'm asking a larger
> > audience in the hope of a solution.
> >
> > For almost a year now, I have been developing an NLP-based AI system to
> > catch sock puppets (two users pretending to be different but actually
> > the same person). It's based on the way they speak. The way we speak is
> > like a fingerprint: it's unique to us and really hard to forge or change
> > on demand (unlike IP/UA). As a result, if you apply some basic AI
> > techniques to Wikipedia discussions (which can be really lengthy, trust
> > me), the datasets and sock puppets shine.
> >
> > Here's an example; I highly recommend looking at these graphs. I compared
> > two pairs of users: one pair that are not sock puppets, and another pair
> > of known socks (a user who got banned indefinitely but came back hidden
> > under another username). [1][2] These graphs are based on one of several
> > aspects of this AI system.
> >
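
Inline, for readers unfamiliar with the idea: comparing the word
distributions of two users can be sketched in a few lines. This is a toy
illustration and not Amir's system (whose code is not public). The function
names, the use of plain word frequencies, and cosine similarity as the
comparison measure are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercased word in `text`."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse word-frequency dicts (0.0 to 1.0)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

Two users whose comments share no words at all score 0.0, and a text
compared with itself scores 1.0 (up to floating-point rounding). A real
system would need far more than this single feature, plus enough text per
account for the frequencies to be meaningful.
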
> > I have talked about this with WMF and other CUs, to build it and help us
> > understand and catch socks, especially the ones that have enough
> > resources to change their IP/UA regularly (like sock farms and/or UPEs).
> > Also, with the increase of mobile internet providers and the horrible way
> > they assign IPs to their users, this can get really handy in some SPI
> > ("Sock puppet investigation") [3] cases.
> >
> > The problem is that this tool, while being built only on public
> > information, actually has the power to expose legitimate sock puppets:
> > people who live under oppressive governments and edit on sensitive
> > topics. Disclosing such connections between two accounts can cost people
> > their lives.
> >
> > So, this code is not going to be public, period. But we need to have this
> > code in Wikimedia Cloud Services so that people like CUs in other wikis
> > are able to use it as a web-based tool instead of me running it for them
> > upon request. But the WMCS terms of use explicitly say code should never
> > be closed-source, and this is our principle. What should we do? Should I
> > pay a corporate cloud provider for this and put such important code and
> > data there? Should we amend the terms of use to have some exceptions like
> > this one?
> >
> > The most plausible solution suggested so far (thanks Huji) is to have a
> > shell of a code that would be useless without data, keep the code that
> > produces the data (out of dumps) closed (which is fine, running that code
> > is not too hard even on enwiki), and update the data myself. This might
> > be doable (I'm around 30% sure; it still might expose too much), but it
> > wouldn't cover future cases similar to mine, and I think a more long-term
> > solution is needed here. Also, it would reduce the bus factor to 1, and
> > maintenance would be complicated.
> >
> > What should we do?
> >
> > Thanks
> > [1] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_fawiki_1.png
> > [2] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_fawiki_2.png
> > [3] https://en.wikipedia.org/wiki/Wikipedia:SPI
> > --
> > Amir (he/him)
> > _______________________________________________
> > Wikitech-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l