Re: [Wiki-research-l] Research on automatically created articles

2016-09-13 Thread Tilman Bayer
The new issue of the Wikimedia Research Newsletter contains a review
of the paper by Denny, also mentioning the debate on this list and on ANI:
https://blog.wikimedia.org/2016/09/12/research-newsletter-august-2016/
(there is some additional discussion in the comments there and on the
talk page of the Signpost version)




-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Research on automatically created articles

2016-08-23 Thread Stuart A. Yeates
For the sake of completeness, the archival URL for the thread at ANI is

https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/IncidentArchive931#Moving_discussion_from_wikimedia_research_mailing_list

cheers
stuart

--
...let us be heard from red core to black sky



Re: [Wiki-research-l] Research on automatically created articles

2016-08-15 Thread Samuel Klein
Thanks Sidd for responding actively in this thread.

The biggest problem here: the algorithms used in this research were bad.
They produced nonsense that wasn't remotely grammatical.  You should have
caught most of these problems.  (The early version of the bot (for just
plays) had a poor success rate as well, but it seemed plausible that a
template for tiny play articles could be effectively filled out with
automation.)

Two interesting results IMO:
 + A nonsensical article with a decent first sentence & sections, and refs
(however random), can serve as encouragement to write a real article.
Possibly more of an encouragement than just the first sentence alone.  I
believe there's some related research into how people respond to cold
emails that include mistakes & nonsense.  (Surely there's a more effective,
non-offensive way to produce similar results)
 + We could use even a naive measure of the coverage & consistency of new
article review.  (If it drops below a certain threshold, we could do
something like change the background color & search-engine metadata for
pages that haven't been properly reviewed yet)

For future researchers:
If we encourage people to spend more time making tools work – rather than
doing something simple (even counterproductive) and writing a paper about
it – everyone will benefit.  The main namespace is full of bots, both fully
automatic and requiring a human to run them. Anyone considering or
implementing wiki automation should look at them and talk to the community
of bot maintainers.

Sam



-- 
Samuel Klein  @metasj   w:user:sj  +1 617 529 4266


Re: [Wiki-research-l] Research on automatically created articles

2016-08-15 Thread siddhartha banerjee
Ziko,

Thanks for your detailed email. Agree on all the comments.

Some earlier comments might have been harsh, but I understand that there is
a valid reason behind them, and I recognize the dedication of so many people
who have helped Wikipedia reach where it is today.

We should have been more diligent in finding out policies and rules
(including IRB) before entering content on Wikipedia. We promise not to
repeat anything of this sort in the future, and I am also trying to
summarize all that has been discussed here to prevent such unpleasant
experiences with other researchers in this area.

-- Sidd


Re: [Wiki-research-l] Research on automatically created articles

2016-08-15 Thread Ziko van Dijk
Dear Sidd,

Thank you for your multiple replies to this list. I am happy to see the
commitment to find the balance between research needs and the needs and
ethics of a complicated website/community such as Wikipedia.

I hope that the comments on this list don't shock you, even if they
sometimes appear to be a little harsh. But this comes from the long-standing
dedication of people who are involved in Wikipedia in several ways.

And honestly, the Wikimedia movement, the WMF, the community etc. still
don't have a thorough and coherent concept for Wikipedia-related
research. We don't want to forbid it in general, but we had some very
unpleasant experiences in the past. Anybody but the "community" can only
recommend good practices, and getting a clear opinion from the community
can be difficult.

My impression is that researchers sometimes want to test the social system
of Wikipedia, to see it as a kind of social instrument in which the "crowd
intelligence" evaluates content and rejects or accepts it. The quality
assessment of the content produced is passed on to the "crowd intelligence".

Actually, in a certain way I do the same when I write a Wikipedia article:
it can be refused by the community or not. The difference is in the motive:
I want to improve Wikipedia, not to test a social system. Therefore, I
carefully read and reread my texts before putting them on Wikipedia. But
someone who wants to create articles automatically and then have the social
system test them puts them unedited on Wikipedia in order not to "spoil"
the results.

I can fully understand when Wikipedians and others regard this behavior as
an abuse of Wikipedia. And for any researcher, the final problem will
always be that "Wikipedia" cannot be a partner for the researcher. The
Wikipedia community wouldn't agree to have researchers contribute
questionable content for the purpose of testing the social system / the
quality of content. They tend to regard this behavior of researchers as a
kind of vandalism.

We have seen this in all varieties, e.g. when someone gives a presentation
about Wikipedia and deliberately vandalises an article in order to
demonstrate how quickly the social system responds to vandalism. I find it
also questionable when a teacher lets students contribute articles and
awards credit points depending on whether the article is deleted or not,
or becomes a featured article. That is in my view a kind of outsourcing of
the evaluation that should be done by the teacher himself - based on his
own pedagogical criteria, not on the encyclopedic and sometimes strange
decisions of the Wikipedia community.

As others have noted, there are several ethical problems that can occur,
e.g. with regard to multiple accounts.

Sorry for the long mail.

Kind regards
Ziko




Re: [Wiki-research-l] Research on automatically created articles

2016-08-14 Thread siddhartha banerjee
Agreed about the people issue.
Editors were made to edit/delete without consent.
IRB application also will be added to this list.

- Sidd


Re: [Wiki-research-l] Research on automatically created articles

2016-08-14 Thread Stuart A. Yeates
I disagree.

You continue to treat the problems as information systems issues; they are
people / ethnographic issues.

This is typified in your complete failure to separate WMF, admins and
editors. WMF host this mailing list and the web servers involved and are
irrelevant unless we're dealing with libel, slander or denial of service
attacks (which we don't appear to be) or applying for grants. en.wiki
editors are the rank and file group of people who you've been making work
for (fixing pages, suggesting pages for deletion, etc). en.wiki admins enforce
the consensus of editors (block users, delete pages based on an established
consensus, etc).

cheers
stuart

--
...let us be heard from red core to black sky

On Mon, Aug 15, 2016 at 10:16 AM, siddhartha banerjee 
wrote:

> Hello,
>
> Based on the discussion and suggestion in the Admin incidents page:
> https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Incidents#their_results,
> I have gone to each of the articles (that still
> existed) and made corrections and changes necessary -- both in terms of the
> content written as well as unreliable sources. I have requested
> administrators to check if my edits still have issues, and I would go back
> and change anything else required. I guess my advisor would be posting to
> this thread only later this week, so before that I wanted to summarize all
> that I learnt during the discussion here and on the incidents page.
>
> 1. Multiple accounts policy: Do not use multiple user accounts to post
> content.
> 2. Research ethics:  There was a serious issue in assumptions made (even
> by other researchers as can be seen from the multiple papers mentioned who
> work in this area). Furthermore, when our previous work
> (https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-01-28/Recent_research)
> was mentioned on Wikimedia
> newsletter, it did not provide any indication to us about the issues with
> legitimacy about this kind of research. But, based on that, the assumptions
> were inappropriate. It is better to involve the WMF community by letting
> them know about any project prior to its start and engaging them such that
> best decisions could be taken and such similar situations do not arise.
> As an administrator mentioned in the discussion and I think is very
> important to note: 'you not only denied the community the opportunity to
> decide whether we wish to allow/participate in this research, you precluded
> any efforts we might have made to minimize the disruption and affect a
> quick clean-up'.
> Based on the last few emails, it seems that IRB is waived; however, that
> waiver should be stamped (but this should be after the community has been
> informed of a task -- if research might cause some disruption, it should
> not be done at any cost). Also, it would be better to create articles in a
> different namespace. The problem here was that clicking on red-links
> directly went to the article creation markup page -- which should have been
> put into draft space. But still, even creating drafts implies that other
> editors are looking at them, which should not be done without prior consent.
> Testing of any content should be done offline, and not on Wikipedia -- as
> it can potentially be disruptive. Even with moderate quality content, it
> implies wasted time for editors. I plan to bring all of these to the notice of
> the research committee who had approved this work such that similar issues
> do not happen in the future. Also, I plan to write on this and share it
> with the wider community who have worked or are working on similar problems
> [I am not sure if they have already been contacted by someone from WMF]. I
> think it would be better if they could also be brought into the
> discussion.
> One thing I would quote from the discussion in the incident page: "Because
> researchers and institutions need to realize that this project is not a
> laboratory for their work, not unless they make an effort to work with the
> community" and this is also very important.
> My apologies for the extra work that had to be done by the numerous
> editors to edit the content and clean it up -- that cannot be reverted now
> but can definitely be prevented in future. We did not add any content after
> Feb earlier this year and have promised in that discussion not to create
> anything more. If we want to do some analysis, we plan to use other
> crowdsourcing techniques (such as Amazon Mechanical Turk) to assess the
> quality of the generated content.
>
> Please add anything you think that I have missed and also regarding the
> clean-up as I have tried to remove the irrelevant material from all the
> articles edited using the usernames.
>
>
> Thanks,
> Sidd

Re: [Wiki-research-l] Research on automatically created articles

2016-08-14 Thread siddhartha banerjee
Thanks Kerry. In part 2, when I mentioned that the waiver should be stamped,
I meant applying to the IRB and then getting the stamp. Sorry for not being
clear.

Thanks,
Sidd


Re: [Wiki-research-l] Research on automatically created articles

2016-08-14 Thread Kerry Raymond
One missing item is:

Submit an application to the IRB.

Kerry


Re: [Wiki-research-l] Research on automatically created articles

2016-08-14 Thread Kerry Raymond
 

Yes, the *IRB* might have chosen to waive the requirement of consent, but it 
appears the IRB was never given the opportunity to review the proposed 
research. The student researcher and/or their advisor is not in a position to 
make that decision on the IRB’s behalf by not submitting an application.

And as per (3), it would appear that recruiting Wikipedia readers and editors 
to review the generated articles could have been a simple alternative to 
placing them in Wikipedia main space and avoided the problem being discussed. 
Perhaps if the application had been properly presented to the IRB, that might 
have been the outcome. 

As a Dean of Research in an IT faculty (prior to my retirement), I am well 
aware that research students (and sometimes their advisors/supervisors) often 
take the view that “I’m doing research on software, not people, I don’t need to 
worry about ethics”. I agree there is no ethics issue in relation to the 
automatic construction of Wikipedia-like articles, but there is an issue about 
putting them into Wikipedia mainspace and/or its processes to observe the 
reaction to them (which did not seem to occur in the prior research mentioned).

Kerry

 

From: Wiki-research-l [mailto:wiki-research-l-boun...@lists.wikimedia.org] On 
Behalf Of Giovanni Luca Ciampaglia
Sent: Saturday, 13 August 2016 12:57 AM
To: Research into Wikimedia content and communities 
<wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Research on automatically created articles

Kerry,

I haven't had the time to read the paper itself, but regarding your comments on 
the need for informed consent, I would like to point out that, at least from 
what I have gleaned in this thread so far, it seems to me that consent could 
probably have been waived. Let me quote the relevant regulations here (cf. 
45 CFR 46.116(d), 
<http://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/#46.116>):

 

 

(d) An IRB may approve a consent procedure which does not include, or which 
alters, some or all of the elements of informed consent set forth in this 
section, or waive the requirements to obtain informed consent provided the IRB 
finds and documents that:

 (1) The research involves no more than minimal risk to the subjects;

 (2) The waiver or alteration will not adversely affect the rights and welfare 
of the subjects;

 (3) The research could not practicably be carried out without the waiver or 
alteration; and

 (4) Whenever appropriate, the subjects will be provided with additional 
pertinent information after participation.

 

I think that, given the few article titles seen so far, the research involved 
no more than minimal risk to users and editors. Requiring the researchers to 
obtain informed consent from *every* person who came across those pages without 
assistance from the developers seems infeasible; and waiving consent in this 
case does not seem to adversely affect the rights and welfare of subjects 
either. Ditto for follow-up information.

Should they have applied to get exempt status from their IRB (i.e. essentially 
to get a stamp of approval on what I just said)? Yes, they should have. This 
doesn't change the nature of their research though, which is what matters to 
the present discussion.

Could they have put the articles in another namespace instead of the main one? 
Yes, but that is a question of being considerate to other users/editors, not 
about whether the research is legitimate/ethical or not.

Best

Giovanni

<http://glciampaglia.com> Giovanni Luca Ciampaglia ∙ Assistant Research 
Scientist, Indiana University

 

 

On Fri, Aug 12, 2016 at 4:22 AM, Kerry Raymond <kerry.raym...@gmail.com 
<mailto:kerry.raym...@gmail.com> > wrote:

And to its policies

http://guru.psu.edu/policies/RP03.html

With particular reference to

"Intervention includes both physical procedures by which data are gathered (for 
example, venipuncture) and manipulations of the participant or the 
participant’s environment that are performed for research purposes."

Putting those articles into Wikipedia manipulated the environment of Wikipedia 
readers and editors.

Now I am not saying that huge harm was done; you would have to ask those who 
subsequently edited the articles (a known group) and those who read the 
articles (an unknown group) to find out if they are unhappy about what took 
place.

What I am saying is that if consideration had been given to the question of who 
is impacted by this research plan, then maybe the research plan would have been 
redesigned to prevent the problem, and we would not have to have this 
conversation.

Kerry

 

Sent from my iPad


On 12 Aug 2016, at 6:08 PM, Kerry Raymond <kerry.raym...@gmail.com 
<mailto:kerry.raym...@gmail.com> > wrote:

I draw attention to Penn State's IRB website

 

https://www.research.psu.edu/irb/submit

Sent from my iPad


O

Re: [Wiki-research-l] Research on automatically created articles

2016-08-13 Thread Federico Leva (Nemo)
It's worth noting that research exists which *actively* sought to change 
real-life behaviour of Wikipedia visitors, such as 
https://www.econstor.eu/handle/10419/127472 whose authors expanded 
articles about certain Spanish cities in order to make tourists visit 
those cities more.


Nemo



Re: [Wiki-research-l] Research on automatically created articles

2016-08-12 Thread siddhartha banerjee
Hi,

My advisor, Prof. Mitra, is travelling this week. He said he will be
posting his thoughts to this thread later next week.

Also, one thing he wanted me to mention here is the following:
Although the content in the articles was generated by an algorithm, a
human -- I -- took those articles and posted them online. We randomly chose
a few articles and checked whether any objectionable content was collected
from the web. We planned to remove those before posting on Wikipedia. We
did not create a bot that went and created the articles randomly. We
generated the content offline and then copy-pasted the content of randomly
selected articles. While we decided to remove objectionable content,
we did not make any changes to sentences anywhere other than that because
that would void checking for linguistic consistency -- which was our sole
purpose. Also, it was done in 'good faith' and hence we just worked on the
bare minimum number of articles to get an idea, not let a bot create random
junk. Our algo does not have the capability of judging whether the cited
references (when we search on Google) are reliable or not, but we thought that
reviewers on Wikipedia would remove content from such links as well as
references if they are unreliable. While some references were removed
because of such reasons (eg https://en.wikipedia.org/wiki/Atripliceae),
there were some articles removed as promotional content (which, as
well, our algo cannot really determine).

Thanks for the comments here, we will keep them in mind if we do anything
similar to this in the future, and I will try to inform other researchers
who work in this area.

Thanks,
Sidd


Re: [Wiki-research-l] Research on automatically created articles

2016-08-12 Thread Giovanni Luca Ciampaglia
​Kerry,

I haven't had the time to read the paper itself, but regarding your
comments on the need for informed consent, I would like to point out that,
at least from what I have gleaned in this thread so far, it seems to me
that consent could probably have been waived. Let me quote the relevant
regulations here (cf. 45 CFR 46.116(d)):


> (d) An IRB may approve a consent procedure which does not include, or which
> alters, some or all of the elements of informed consent set forth in this
> section, or waive the requirements to obtain informed consent provided the
> IRB finds and documents that:
>
>  (1) The research involves no more than minimal risk to the subjects;
>
>  (2) The waiver or alteration will not adversely affect the rights and
> welfare of the subjects;
>
>  (3) The research could not practicably be carried out without the waiver
> or alteration; and
>
>  (4) Whenever appropriate, the subjects will be provided with additional
> pertinent information after participation.

I think that, given the few article titles seen so far, the research
involved no more than minimal risk to users and editors. Requiring the
researchers to obtain informed consent from *every* person who came across
those pages, without assistance from the developers, seems infeasible; and
waiving consent in this case does not seem to adversely affect the rights
and welfare of subjects either. Ditto for follow-up information.

Should they have applied to get exempt status from their IRB (i.e.
essentially to get a stamp of approval on what I just said)? Yes, they
should have. This doesn't change the nature of their research though, which
is what matters to the present discussion.

Could they have put the articles in another namespace instead of the main
one? Yes, but that is a question of being considerate to other
users/editors, not about whether the research is legitimate/ethical or not.

Best

Giovanni



Giovanni Luca Ciampaglia *∙* Assistant Research Scientist, Indiana University


On Fri, Aug 12, 2016 at 4:22 AM, Kerry Raymond 
wrote:

> And to its policies
>
> http://guru.psu.edu/policies/RP03.html
>
> With particular reference to
>
> "Intervention includes both physical procedures by which data are
> gathered (for example, venipuncture) and manipulations of the participant
> or the participant’s environment that are performed for research purposes."
>
> Putting those articles into Wikipedia manipulated the environment of
> Wikipedia readers and editors.
>
> Now I am not saying that huge harm was done; you would have to ask those
> who subsequently edited the articles (a known group) and those who read the
> articles (an unknown group) to find out if they are unhappy about what took
> place.
>
> What I am saying is that if consideration had been given to the question of
> who is impacted by this research plan, then maybe the research plan would
> have been redesigned to prevent the problem, and we would not have to have
> this conversation.
>
> Kerry
>

Re: [Wiki-research-l] Research on automatically created articles

2016-08-12 Thread Kerry Raymond
I draw attention to Penn State's IRB website

https://www.research.psu.edu/irb/submit



Re: [Wiki-research-l] Research on automatically created articles

2016-08-12 Thread Kerry Raymond
I am asking you to share the documentation of the ethical clearance or 
exemption your institution would have required, not what people did or didn't 
say to you as part of conference reviewing or at conferences. Ethical clearance 
is a process that should have been undertaken before your research commenced, 
not when you are writing the paper or attending a conference. Are you saying 
you undertook the research without any consideration of the ethics? Does your 
university have no guidelines about this?

The Wikipedia guidelines about content analysis are not particularly relevant 
here. You were not analysing existing Wikipedia articles but injecting new 
articles of dubious quality into Wikipedia.

Nor is the data about individuals my point. If you wasted people's time 
reacting to the articles created, you did them harm. If people derived 
incorrect information from reading your articles, you did them harm. None of 
those people were aware they were part of your research experiment; that means 
they did not have informed consent in relation to choosing to participate in 
your experiment. You could have generated the articles and sought the opinions 
of readers and editors of Wikipedia on those articles without placing them into 
Wikipedia itself. That way would have enabled informed consent; others not 
wishing to take part would not be misled into doing so.



Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread siddhartha banerjee
I thought I should add this too as I missed it in the previous email.
This link:
https://en.wikipedia.org/wiki/Wikipedia:Ethically_researching_Wikipedia
talks about Content Analysis (looking at the number of references or the
amount of content removed), which is what we did (with the few articles).
We followed it because it says such analysis is "generally considered exempt
from such requirements and does not require an IRB approval."
My advisor should be able to add more thoughts on it (I have asked him
to reply on this thread).

Thanks,
Sidd






Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread siddhartha banerjee
As I have mentioned earlier, this is not the first work on article
generation. These are among the first works we know of:
https://people.csail.mit.edu/csauper/pubs/sauper-sm-thesis.pdf
https://people.csail.mit.edu/regina/my_papers/wiki.pdf
None of these mentioned anything about human subjects, as ultimately no
personal information is used (about the person who is deleting, etc.). Nor
did any reviewers or attendees at the conferences in this area raise
questions about this aspect.
Also,
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-01-28/Recent_research
is relevant here as it talks about our previous work.

If a "record of someone doing something" is relevant from a human subjects
point of view, then any data on Wikipedia can be used to find the editors (if
not the real person). For example:
https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/viewFile/3505/3968
https://alchemy.cs.washington.edu/papers/wu08/wu08.pdf
I have met several researchers who work with data (revisions from
Wikipedia), and nothing about IRB ever came up.

Nevertheless, as I said, if there are concrete rules, I think it would help
the research community as a whole to know what can or cannot be done, and
also to ask for permissions.
I appreciate the suggestions that Stuart mentioned in a previous email about
experimenting on would-be-deleted articles or articles lacking sources. But
as of now we are not planning anything, and if we do, we will for sure get in
touch with Denny (who had a video chat with me before starting this thread)
and try to learn the best ways of doing it.

I have asked my PhD advisor (the other author on the paper) to check this
thread, and he will be able to give more input as I am not very qualified
to comment on these aspects.

Thanks,
Sidd


Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread Stuart A. Yeates
You interacted with living people through Wikipedia and you recorded
information about whether they deleted your references. Many people (like
myself) are directly traceable from their Wikipedia accounts to their
real-life identities.

How is a record of someone doing something not about them?

cheers
stuart


--
...let us be heard from red core to black sky



Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread siddhartha banerjee
As I mentioned earlier, I was not sure about the multiple account policy. I
got the notification about the incident being raised, and I will be happy
with whatever decisions Wiki administrators make.

As Denny mentioned, we did not plan anything large-scale, only a
small group of edits. Furthermore, we stated that the results were only
valid as of a particular date before the submission of that conference
paper, and things may have already changed a lot (articles removed, edited
further, etc.). We have not made any additions since Feb, nor do we plan to
do anything further. Whatever we do will be offline.

To Denny's point about other researchers trying to do the same kind of
research, I do see research in this area coming up, and it might make sense
to have certain rules (although I do not have much idea of how rules on
Wiki work in general). I know this because some researchers
have contacted me previously about this work, and they are also looking into
similar areas. One example in this area of work is the following, which is
very recent:
http://snap.stanford.edu/wikiworkshop2016/papers/wikiworkshop_icwsm2016_pochampally.pdf

Regarding human subjects, neither the conference reviewers nor anyone
from Wikimedia mentioned anything about that earlier. Our
previous works were featured earlier in Wikimedia newsletters (links in
earlier emails) and still nothing was mentioned about it, nor did we find any
information on Wikipedia in general about it. As per the requirements,
approval would be necessary for "Data about living individuals through
intervention or interaction" or "Identifiable private information about
living individuals." As mentioned, the "about" part is very important,
because no data about editors was used or collected in the research.

If the rules do change, please let me know; I will keep following the
thread and will try to inform all researchers who work in this area if
they get in touch with me.

-- Siddhartha


Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread Kerry Raymond
Presuming the research is being conducted under the usual ethics regimes, 
putting articles into mainspace Wikipedia is putting them in front of readers 
and into the workflows and activities of editors. This would appear to me to 
constitute experiments on human subjects, which usually introduces issues of 
informed consent and potential to harm them. Can we be shown the ethical 
approval documents for this particular project to see how these concerns were 
addressed?

Kerry


Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread Stuart A. Yeates
I think you misunderstand the nature of en.wiki.

en.wiki is not a rule-based automaton; en.wiki is an autonomous community
that works by consensus.

I cannot imagine a set of research rules constructed outside en.wiki that
lets you 'safely' interact with it. Observe it, maybe, but not interact
with it. I can also imagine certain kinds of observation (or certain
results coming out of observation) making further observation difficult.

The best advice I can provide is to team up with an experienced editor or
two.

[For editing for educational rather than research purposes see
https://en.wikipedia.org/wiki/Wikipedia:Education_program ]

cheers
stuart

--
...let us be heard from red core to black sky

On Fri, Aug 12, 2016 at 11:04 AM, Ziko van Dijk  wrote:

> Hello,
>
> Do we have a collection of already existing and relevant policies and
> statements, at least for English Wikipedia? On Meta I found this page
> https://meta.wikimedia.org/wiki/Research:Wikipedia_Research_Management
> whose main statement is that research is too varied and complex to allow
> just a few recommendations.
>
> At first sight, I find it difficult to read something relevant from
> https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not
>
> I imagine that guidelines could be helpful with regard to a) research that
> includes editing wiki pages, b) the editing of students or pupils for
> educational purposes.
>
> Research and educational activity should not disturb the efforts of the
> Wikipedia community to create and improve encyclopedic content.
> Disturbance can occur from creating substandard content and engaging in
> activities that disrupt workflows. ...
>
> These guidelines could be only a recommendation, as long as the Wikipedia
> communities don't change their rules. But it'd be great, anyway, if the
> guidelines could be based somehow on existing Wikipedia rules.
>
> Kind regards
> Ziko
>
>
>
>
>
> 2016-08-12 0:41 GMT+02:00 Denny Vrandečić :
>
>> So here's the list of accounts that were used in order to create the
>> articles:
>>
>> https://en.wikipedia.org/wiki/Special:Contributions/Brownweepy
>> https://en.wikipedia.org/wiki/Special:Contributions/Theatremania
>> https://en.wikipedia.org/wiki/Special:Contributions/Bhopebhai
>> https://en.wikipedia.org/wiki/Special:Contributions/Dicdac123
>> https://en.wikipedia.org/wiki/Special:Contributions/MightyPepper
>>
>> Also some edits may have been done through IPs.
>>
>> In discussion with Sidd it was clear that they did not plan to ever
>> mass-create a large number of articles, and it is only these 50 articles or
>> so we can clean up now. I am not terribly worried about this particular
>> work (according to the paper there were 47 surviving articles at the time
>> of writing, i.e. in Spring).
>>
>> What I am concerned about is the fact that there will be more such
>> experiments from other groups. It would be great to set up a few rules for
>> this kind of behavior, so that we can at least point to them. If the only
>> rule that was broken here was the "don't use multiple accounts" rule, I am
>> not sure whether that would be sufficient.
>>
>> Cheers,
>> Denny
>>
>>
>>
>> On Wed, Aug 10, 2016 at 1:47 AM Stuart A. Yeates 
>> wrote:
>>
>>> * The previous work you cite appears to have created articles in the
>>> draft namespace rather than the article namespace. This is a very important
>>> and very relevant detail, meaning your situation is in no way comparable to
>>> the previous work from my point of view
>>> * You appear to be solving a problem that the community of wikipedia
>>> editors does not have. We have enough low-quality stub articles that need
>>> human effort to improve and we're not really interested in more unless
>>> either (a) they demonstrably combat some of the systematic biases we're
>>> struggling with or (b) they demonstrably attract new cohorts users to do
>>> that improvement. Note that the examples discussed in the research
>>> newsletter are a non-English writer and a women writer. These are important
>>> details.
>>> * Your paper appears not to attempt to make any attempt to measure the
>>> statistical significance of your results; this isn't science.
>>> * Most of your sources are _really_ _really_ bad.
>>> https://en.wikipedia.org/wiki/Talonid Contains 8 unique refs, one of
>>> which is good, one of which is a passable and the others should be removed
>>> immediately (but I won't because it'll make it harder for third parties
>>> reading this conversation to follow it.).
>>>
>>> If you want to properly evaluate your technique, try this: Randomly pick
>>> N articles from
>>> https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats
>>> splitting them into control
>>> and subjects randomly. Parse each subject article for sentences that your
>>> system appears to understand. For each sentence your thing you understand
>>> look for reliable sources to support that sentence. 

Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread Ziko van Dijk
Hello,

Do we have a collection of already existing and relevant policies and
statements, at least for English Wikipedia? On Meta I found this page
https://meta.wikimedia.org/wiki/Research:Wikipedia_Research_Management
whose main statement is that research is too varied and complex to be
covered by a few recommendations.

At first sight, I find it hard to see anything relevant in
https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not

I imagine that guidelines could be helpful with regard to a) research that
includes editing wiki pages, b) editing by students or pupils for
educational purposes.

Research and educational activity should not disturb the efforts of the
Wikipedia community to create and improve encyclopedic content. Disturbance
can result from creating substandard content and engaging in activities
that disrupt workflows. ...

These guidelines could only be a recommendation, as long as the Wikipedia
communities don't change their rules. But it'd be great, anyway, if the
guidelines could be based somehow on existing Wikipedia rules.

Kind regards
Ziko





2016-08-12 0:41 GMT+02:00 Denny Vrandečić :

> So here's the list of accounts that were used in order to create the
> articles:
>
> https://en.wikipedia.org/wiki/Special:Contributions/Brownweepy
> https://en.wikipedia.org/wiki/Special:Contributions/Theatremania
> https://en.wikipedia.org/wiki/Special:Contributions/Bhopebhai
> https://en.wikipedia.org/wiki/Special:Contributions/Dicdac123
> https://en.wikipedia.org/wiki/Special:Contributions/MightyPepper
>
> Also some edits may have been done through IPs.
>
> In discussion with Sidd it was clear that they did not plan to ever
> mass-create a large number of articles, and it is only these 50 articles or
> so we can clean up now. I am not terribly worried about this particular
> work (according to the paper there were 47 surviving articles at the time
> of writing, i.e. in Spring).
>
> What I am concerned about is the fact that there will be more such
> experiments from other groups. It would be great to set up a few rules for
> this kind of behavior, so that we can at least point to them. If the only
> rule that was broken here was the "don't use multiple accounts" rule, I am
> not sure whether that would be sufficient.
>
> Cheers,
> Denny
>
>
>
> On Wed, Aug 10, 2016 at 1:47 AM Stuart A. Yeates 
> wrote:
>
>> * The previous work you cite appears to have created articles in the
>> draft namespace rather than the article namespace. This is a very important
>> and very relevant detail, meaning your situation is in no way comparable to
>> the previous work from my point of view
>> * You appear to be solving a problem that the community of wikipedia
>> editors does not have. We have enough low-quality stub articles that need
>> human effort to improve and we're not really interested in more unless
>> either (a) they demonstrably combat some of the systematic biases we're
>> struggling with or (b) they demonstrably attract new cohorts users to do
>> that improvement. Note that the examples discussed in the research
>> newsletter are a non-English writer and a women writer. These are important
>> details.
>> * Your paper appears not to attempt to make any attempt to measure the
>> statistical significance of your results; this isn't science.
>> * Most of your sources are _really_ _really_ bad.
>> https://en.wikipedia.org/wiki/Talonid Contains 8 unique refs, one of
>> which is good, one of which is a passable and the others should be removed
>> immediately (but I won't because it'll make it harder for third parties
>> reading this conversation to follow it.).
>>
>> If you want to properly evaluate your technique, try this: Randomly pick
>> N articles from
>> https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats
>> splitting them into control and subjects randomly. Parse
>> each subject article for sentences that your system appears to understand.
>> For each sentence your thing you understand look for reliable sources to
>> support that sentence. Add a single ref to a single statement in each
>> article. Add all the refs using a single account with a message on the user
>> page about the nature of the edits. If you're not able to add any refs,
>> mark it as a failure. Measure article lifespan for each group.
>>
>> If you're in a hurry and want fast results, work with articles less than
>> a week old (hint: articles IDs are numerically increasing sequence) or the
>> intersection of
>> https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats
>> and Category:Articles_for_deletion Both of these groups
>> of articles are actively being considered for deletion.
>>
>> cheers
>> stuart
>>
>>
>> --
>> ...let us be heard from red core to black sky
>>
>> On Wed, Aug 10, 2016 at 9:30 AM, siddhartha banerjee 
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am the first author of the paper that Denny has referred. 

Re: [Wiki-research-l] Research on automatically created articles

2016-08-11 Thread Denny Vrandečić
So here's the list of accounts that were used in order to create the
articles:

https://en.wikipedia.org/wiki/Special:Contributions/Brownweepy
https://en.wikipedia.org/wiki/Special:Contributions/Theatremania
https://en.wikipedia.org/wiki/Special:Contributions/Bhopebhai
https://en.wikipedia.org/wiki/Special:Contributions/Dicdac123
https://en.wikipedia.org/wiki/Special:Contributions/MightyPepper

Also some edits may have been done through IPs.

In discussion with Sidd it was clear that they did not plan to ever
mass-create a large number of articles, and it is only these 50 articles or
so we can clean up now. I am not terribly worried about this particular
work (according to the paper there were 47 surviving articles at the time
of writing, i.e. in Spring).

What I am concerned about is the fact that there will be more such
experiments from other groups. It would be great to set up a few rules for
this kind of behavior, so that we can at least point to them. If the only
rule that was broken here was the "don't use multiple accounts" rule, I am
not sure whether that would be sufficient.

Cheers,
Denny



On Wed, Aug 10, 2016 at 1:47 AM Stuart A. Yeates  wrote:

> * The previous work you cite appears to have created articles in the draft
> namespace rather than the article namespace. This is a very important and
> very relevant detail, meaning your situation is in no way comparable to the
> previous work from my point of view
> * You appear to be solving a problem that the community of wikipedia
> editors does not have. We have enough low-quality stub articles that need
> human effort to improve and we're not really interested in more unless
> either (a) they demonstrably combat some of the systematic biases we're
> struggling with or (b) they demonstrably attract new cohorts users to do
> that improvement. Note that the examples discussed in the research
> newsletter are a non-English writer and a women writer. These are important
> details.
> * Your paper appears not to attempt to make any attempt to measure the
> statistical significance of your results; this isn't science.
> * Most of your sources are _really_ _really_ bad.
> https://en.wikipedia.org/wiki/Talonid Contains 8 unique refs, one of
> which is good, one of which is a passable and the others should be removed
> immediately (but I won't because it'll make it harder for third parties
> reading this conversation to follow it.).
>
> If you want to properly evaluate your technique, try this: Randomly pick N
> articles from
> https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats
> splitting them into control and subjects randomly. Parse each subject
> article for sentences that your system appears to understand. For each
> sentence your thing you understand look for reliable sources to support
> that sentence. Add a single ref to a single statement in each article. Add
> all the refs using a single account with a message on the user page about
> the nature of the edits. If you're not able to add any refs, mark it as a
> failure. Measure article lifespan for each group.
>
> If you're in a hurry and want fast results, work with articles less than a
> week old (hint: articles IDs are numerically increasing sequence) or the
> intersection of
> https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats
> and Category:Articles_for_deletion Both of these groups of articles are
> actively being considered for deletion.
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On Wed, Aug 10, 2016 at 9:30 AM, siddhartha banerjee 
> wrote:
>
>> Hello Everyone,
>>
>> I am the first author of the paper that Denny has referred. Firstly, I
>> want to thank Denny for asking me to join this list and know more about
>> this discussion.
>>
>> 1. Regarding quality, we know that there are issues, and even in the
>> conference, I have repeatedly told the audience that I am not satisfied
>> with the quality of the content generated. However, the percentage of
>> articles that were not removed when the paper was submitted was minimal. I
>> have sent Denny a list of accounts that were used and it might have been
>> possible that several articles created have been removed from those
>> accounts within the last couple of months. I was not aware of the multiple
>> account policy.
>>
>> 2. The area of Wikipedia article generation have been explored by others
>> in the past. [http://www.aclweb.org/anthology/P09-1024,
>> http://wwwconference.org/proceedings/www2011/companion/p161.pdf] We were
>> not aware of any rules regarding these sort of experiments. However, we do
>> understand that such experiments can harm the general quality of this great
>> encyclopedic resource, hence we did out analysis on bare minimum articles.
>> In fact, we did our initial work on it back in 2014, and Wikimedia research
>> even covered details about our paper here --
>> 

Re: [Wiki-research-l] Research on automatically created articles

2016-08-10 Thread Stuart A. Yeates
* The previous work you cite appears to have created articles in the draft
namespace rather than the article namespace. This is a very important and
very relevant detail, meaning your situation is in no way comparable to the
previous work from my point of view.
* You appear to be solving a problem that the community of Wikipedia
editors does not have. We have enough low-quality stub articles that need
human effort to improve and we're not really interested in more unless
either (a) they demonstrably combat some of the systematic biases we're
struggling with or (b) they demonstrably attract new cohorts of users to do
that improvement. Note that the examples discussed in the research
newsletter are a non-English writer and a woman writer. These are important
details.
* Your paper appears not to make any attempt to measure the statistical
significance of your results; this isn't science.
* Most of your sources are _really_ _really_ bad.
https://en.wikipedia.org/wiki/Talonid contains 8 unique refs, one of which
is good, one of which is passable and the others should be removed
immediately (but I won't because it'll make it harder for third parties
reading this conversation to follow it).
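
To illustrate the significance point in the list above, here is a minimal
sketch of one possible check: a two-sided Fisher exact test on a 2x2 table
of surviving versus deleted articles in two groups. The choice of test, the
function name and the counts in the usage line are assumptions made for
illustration only; nothing here comes from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Illustrative layout (an assumption, not taken from the paper):
      row 1 = generated articles (survived, deleted)
      row 2 = comparison articles (survived, deleted)
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Made-up placeholder counts, only to show the call.
    print(fisher_exact_two_sided(18, 2, 11, 9))
```

A small p-value would suggest the survival rates of the two groups really
differ; with only a few dozen articles per group, an exact test like this
is usually a safer choice than a normal approximation.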

If you want to properly evaluate your technique, try this: Randomly pick N
articles from the subcats of
https://en.wikipedia.org/wiki/Category:Articles_lacking_sources and split
them randomly into control and subjects. Parse each subject article for
sentences that your system appears to understand. For each sentence your
system understands, look for reliable sources to support that sentence. Add
a single ref to a single statement in each article. Add all the refs using a
single account with a message on the user page about the nature of the
edits. If you're not able to add any refs, mark it as a failure. Measure
article lifespan for each group.
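
The random split and the final comparison described above could be sketched
roughly as follows. This is only an outline under assumptions: the function
names, the fixed seed and the placeholder titles are hypothetical, and
fetching titles or deletion status from Wikipedia is deliberately left out.

```python
import random


def split_control_subjects(titles, seed=0):
    """Shuffle the sampled titles reproducibly and cut them into two groups.

    Returns (control, subjects). With an odd number of titles the subjects
    group gets the extra one.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be audited later
    shuffled = list(titles)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def survival_rate(group, deleted_titles):
    """Fraction of a group's articles that are not in the set of deleted titles."""
    if not group:
        raise ValueError("group must not be empty")
    surviving = [title for title in group if title not in deleted_titles]
    return len(surviving) / len(group)


if __name__ == "__main__":
    # Placeholder titles standing in for the N randomly picked articles.
    sample = ["Article %d" % i for i in range(10)]
    control, subjects = split_control_subjects(sample, seed=42)
    print("control:", control)
    print("subjects:", subjects)
```

Survival rate at the end of a fixed observation window is used here as a
simple stand-in for "article lifespan"; a fuller analysis would record
per-article deletion dates.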

If you're in a hurry and want fast results, work with articles less than a
week old (hint: article IDs are a numerically increasing sequence) or the
intersection of the
https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats
and Category:Articles_for_deletion. Both of these groups of articles are
actively being considered for deletion.

cheers
stuart


--
...let us be heard from red core to black sky

On Wed, Aug 10, 2016 at 9:30 AM, siddhartha banerjee 
wrote:

> Hello Everyone,
>
> I am the first author of the paper that Denny has referred to. Firstly, I
> want to thank Denny for asking me to join this list and know more about
> this discussion.
>
> 1. Regarding quality, we know that there are issues, and even in the
> conference, I have repeatedly told the audience that I am not satisfied
> with the quality of the content generated. However, the percentage of
> articles that were not removed when the paper was submitted was minimal. I
> have sent Denny a list of accounts that were used and it might have been
> possible that several articles created have been removed from those
> accounts within the last couple of months. I was not aware of the multiple
> account policy.
>
> 2. The area of Wikipedia article generation has been explored by others
> in the past. [http://www.aclweb.org/anthology/P09-1024,
> http://wwwconference.org/proceedings/www2011/companion/p161.pdf] We were
> not aware of any rules regarding this sort of experiment. However, we do
> understand that such experiments can harm the general quality of this great
> encyclopedic resource, hence we did our analysis on bare minimum articles.
> In fact, we did our initial work on it back in 2014, and Wikimedia research
> even covered details about our paper here --
> https://blog.wikimedia.org/2015/02/02/wikimedia-research-newsletter-january-2015/#Bot_detects_theatre_play_scripts_on_the_web_and_writes_Wikipedia_articles_about_them
>
> If questions were raised at that point, we would surely not have done
> anything further on this, or rather do things offline without creating or
> adding any content on Wikipedia.
>
> I understand your point about imposing rules and I think it makes sense.
> However, during this research, we were not aware of any rules, hence
> continued our work.
> As I have told Denny, our purpose was to check whether we could create
> bare minimal articles which could be eventually improved by authors on
> Wikipedia, and also to see if they are totally removed. But, it was done
> with a few articles and we did not create anything beyond that point. Also,
> we did not do any manual modifications to the articles although we saw
> quality issues because it would void our analysis and claims.
>
> Thanks everyone for your time and the great work you are doing for the
> Wikipedia community.
>
> Regards,
> Sidd
>
>
>
>
>
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>

Re: [Wiki-research-l] Research on automatically created articles

2016-08-09 Thread Ziko van Dijk
Hello Denny,

I agree with all three points. The experiment reminds me of "babelfish
accidents" as we called them in de.WP, and the experiments of Google and
Microsoft to "support" "translations" between Wikipedias.

Very strange, this repeated "Dick Barbour is legendary in..."

Kind regards
Ziko


2016-08-09 20:29 GMT+02:00 Denny Vrandečić :

> Hi all,
>
> I found a paper at IJCAI 2016, which left me quite curious:
> https://siddbanpsu.github.io/publications/ijcai16-banerjee.pdf
>
> In short, they find red links, classify them, find the closest similar
> articles, use the section titles from these articles to decide on sections,
> search for content for the sections, paraphrase it, and write complete
> Wikipedia articles.
>
> Then they uploaded the articles to Wikipedia, and from the 50 uploaded
> articles, only 3 got deleted. The rest stayed. I was rather excited when I
> heard that - were the articles really that good?
>
> Then I took a look at the articles and... well, judge for yourself. The
> paper only mentions three articles of the 47 survivors:
>
> https://en.wikipedia.org/wiki/Dick_Barbour
>
> https://en.wikipedia.org/wiki/Atripliceae (here is the last version as
> created by the bot before significant human clean-up:
> https://en.wikipedia.org/w/index.php?title=Atripliceae=697456858 )
>
> https://en.wikipedia.org/wiki/Talonid
>
> I have connected with the first author and he promised me to give a list
> of all articles as soon as he can get it, which will be in a few weeks
> because he is away from his university computer right now. He was able to
> produce one more article though:
>
> https://en.wikipedia.org/wiki/Sonia_Bianchetti_Garbato
>
> (Also, see history for the extent of human clean-up)
>
> I am not writing to talk badly about the authors or about the reviewing
> practice at IJCAI, or about the state of research in that area. Also, I
> really do not want to discourage research in this area.
>
> I have a few questions, though:
>
> 1) the fact that so many of these articles have survived for half a year
> indicates that there are some problems with our review processes. Does
> someone want to investigate why these articles survived in the given
> state?
>
> 2) as far as I know we don't have rules for this kind of experiment, but
> maybe we should. In particular, I feel that BLPs should not be created by
> an experimental approach like this one. Should we set up rules for this
> kind of experiment?
>
> 3) Wikipedia contributors are participating in these experiments without
> consent. I find that worrisome, and would like to hear what others think.
>
> I have invited the first author to join this list.
>
> I understand the motivation: by exposing from the beginning that these
> articles were created by bots, they would have been scrutinized differently
> than articles written by humans. Therefore they remained quiet about the
> fact (but are willing to reveal it now, now that the experiment is over -
> they also explicitly don't have any intentions of expanding the scope of
> the experiment at the given point of time).
>
> Cheers,
> Denny
>
>
>
>
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l