I think an important thing to note is that this is public information, so
such a model, whether better or worse, can easily be built by any AI
enthusiast. The potential for misuse is limited, as the system is
relatively easy to game, and I don't think the model's results will hold
more water than behavioural analysis done by a human (which some editors
excel at). Theoretically, feeding such edits into an assessment system
similar to ClueBot and having expert sockpuppet hunters review them would
produce a much more accurate and more "dangerous" model, so to speak - but
since it is built on public information, it shouldn't be closed source,
and keeping it closed probably only stifles innovation (compare GPT-3's
eventual release).

If the concern is privacy, it's probably best to dismantle the entire
project - but then again, anyone who wants to can simply put in the hours
required to do something similar, so there's not much point. I also think
that by raising this here, you have probably triggered the Streisand
effect: more people are now aware of your model and its possible
repercussions. That said, transparency is quite integral to all
open-source communities. In the end, it all comes down to your choice;
there's no right answer as far as I can tell.
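
As an aside, the word-distribution comparison described further down the
thread can be sketched in a few lines. This is only a toy illustration of
the general idea (my own guess at the approach, not Amir's actual code):
build a normalized word-frequency distribution per user, then measure how
far apart two distributions are, e.g. with the Jensen-Shannon divergence.

```python
from collections import Counter
import math

def word_distribution(text):
    """Normalized word-frequency distribution of a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    0 means identical distributions; 1 means disjoint vocabularies.
    """
    vocab = set(p) | set(q)
    # Midpoint distribution keeps the KL terms finite.
    m = {w: (p.get(w, 0) + q.get(w, 0)) / 2 for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy texts standing in for two accounts' talk-page comments.
user_a = word_distribution("honestly i think the article should honestly stay")
user_b = word_distribution("honestly the article should stay i honestly think")
user_c = word_distribution("delete this page it violates notability guidelines")

print(jensen_shannon(user_a, user_b))  # 0.0: identical word habits
print(jensen_shannon(user_a, user_c))  # ~1.0: disjoint vocabularies
```

A real system would of course use far richer features than raw unigram
counts (function words, n-grams, punctuation habits), but the comparison
step is the same shape.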

Best,
QEDK

On Fri, Aug 7, 2020, 02:19 John Erling Blad <[email protected]> wrote:

> Nice idea! First time I wrote about this being possible was back in
> 2008-ish.
>
> The problem is quite trivial: you use some observable feature to
> fingerprint an adversary. The adversary can then game the system if the
> observable feature can somehow be changed or modified. To avoid this, the
> observable features are usually chosen to be physical properties that
> can't be easily changed.
>
> In this case the features are words and/or relations between words, and
> then the question is “Can the adversary change the choice of words?” Yes
> he can, because the choice of words is not an inherent physical property
> of the user. In fact there are several programs that help users express
> themselves more fluently, and such systems will change the observable
> features, i.e. the choice of words. The program will move the observable
> features (the words) from one user-specific distribution to another, more
> program-specific distribution. You will observe the users a priori to be
> different, but with the program they will a posteriori be more similar.
>
> A real problem is your own poisoning of the training data. That happens
> when you find some subject to be the same as your postulated one, and
> then feed that information back into your training data. But if you don't
> do that, your training data will start to rot, because humans change over
> time. It is bad any way you do it.
>
> Even more fun is an adversary that knows what you are doing, and tries to
> negate your detection algorithm, or even fool you into believing he is
> someone else. It is after all nothing more than word count and statistics.
> What will you do when someone edits a Wikipedia-page and your system tells
> you “This revision is most likely written by Jimbo”?
>
> Several such programs exist, and I'm a bit perplexed that they are not in
> more use among Wikipedia's editors. Some of them are more aggressive, and
> can propose quite radical rewrites of the text. I use one of them, and it
> is not the best, but still it corrects me all the time.
>
> I believe it would be better to create a system where users are
> internally identified and externally authenticated. (The former is
> biometric identification, and must adhere to privacy laws.)
>
> On Thu, Aug 6, 2020 at 4:33 AM Amir Sarabadani <[email protected]>
> wrote:
>
> > Hey,
> > I have an ethical question that I haven't been able to answer yet. I
> > have been asking around, but with no definite answer so far, so I'm
> > asking it to a larger audience in the hope of a solution.
> >
> > For almost a year now, I have been developing an NLP-based AI system
> > to catch sock puppets (two users pretending to be different people but
> > actually being the same person). It's based on the way they speak. The
> > way we speak is like a fingerprint: it's unique to us, and it's really
> > hard to forge or change on demand (unlike an IP or user agent). As a
> > result, if you apply some basic AI techniques to Wikipedia discussions
> > (which can be really lengthy, trust me), the sock puppets shine
> > through in the data.
> >
> > Here's an example; I highly recommend looking at these graphs. I
> > compared two pairs of users: one pair that are not sock puppets, and
> > one pair of known socks (a user who got banned indefinitely but came
> > back hidden under another username). [1][2] These graphs are based on
> > one of several aspects of this AI system.
> >
> > I have talked about this with the WMF and other CUs; the goal is to
> > build something that helps us understand and catch socks, especially
> > the ones that have enough resources to change their IP/UA regularly
> > (like sock farms and/or UPEs). Also, with the rise of mobile internet
> > providers and the horrible way they assign IPs to their users, this
> > can come in really handy in some SPI ("sock puppet investigation") [3]
> > cases.
> >
> > The problem is that this tool, while being built only on public
> > information, actually has the power to expose legitimate sock puppets:
> > people who live under oppressive governments and edit on sensitive
> > topics. Disclosing such connections between two accounts can cost
> > people their lives.
> >
> > So, this code is not going to be public, period. But we need to have
> > this code in Wikimedia Cloud Services so that people like CUs on other
> > wikis can use it as a web-based tool instead of me running it for them
> > on request. But the WMCS terms of use explicitly say code should never
> > be closed-source, and this is our principle. What should we do? Should
> > I pay a corporate cloud provider and put such important code and data
> > there? Should we amend the terms of use to allow exceptions like this
> > one?
> >
> > The most plausible solution suggested so far (thanks Huji) is to have
> > a shell of the code that would be useless without data, keep the code
> > that produces the data (out of dumps) closed (which is fine; running
> > that code is not too hard, even on enwiki), and update the data
> > myself. This might be doable (I'm around 30% sure; it still might
> > expose too much), but it wouldn't cover future cases similar to mine,
> > and I think a more long-term solution is needed here. Also, it would
> > reduce the bus factor to 1, and maintenance would be complicated.
> >
> > What should we do?
> >
> > Thanks
> > [1] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_fawiki_1.png
> > [2] https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_fawiki_2.png
> > [3] https://en.wikipedia.org/wiki/Wikipedia:SPI
> > --
> > Amir (he/him)
> > _______________________________________________
> > Wikitech-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
