Re: [Wikitech-l] research-oriented toolserver?

2009-03-14 Thread Daniel Kinzler
Morten Warncke-Wang wrote:
 Hi all,
 
 Judging by the replies we think we've failed to communicate clearly
 some of the ideas we wanted to put forward, and we'd like to take the
 opportunity to try to clear that up.
 
 We did not want to narrow this down to be only about a third party
 toolserver.  Before we initiated contact we noticed the need for
 adding more resources to the existing cluster.  Therefore we also had
 in mind the idea of augmenting the toolserver, rather than attempt to
 create a competitor for it.  For instance this could help allow the
 toolserver to also host applications requiring some amounts of text
 crunching, which is currently not feasible as far as we can tell.

That would be excellent.

 Additionally we think there could perhaps be two paths to account
 creation, one for Wikipedians and one for researchers, with the
 research path laid out with clearer documentation on the requirements
 projects would need to fit the toolserver and what the application
 should contain, which combined with faster feedback would aid to make
 the process easier for the researchers.

I think this should be done for all accounts. Why only researchers?

 We hope that this clears up some central points in our ideas
 surrounding a research oriented toolserver.  Currently we are
 exploring several ideas and this particular one might not become more
 than a thought and a thread on a mailing list.  Nonetheless perhaps
 there are thoughts here that can become more solid somewhere down the
 line.

In order to develop ideas, it would be useful to get some idea of what kind of
resources you think you can contribute, and under what terms and in what
timeframe. I know that talking money in public is usually a bad idea, especially
if the money isn't really there yet. If you like, contact me in private,
preferably at my office address, daniel.kinzler AT wikimedia.de. I'm
responsible for toolserver operations, so I suppose it's my job to look into 
this.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] research-oriented toolserver?

2009-03-14 Thread Daniel Kinzler
Brian wrote:
 I think what the toolserver guys are saying is that they've got the
 data (e.g., a replica of the master database) and they are willing to
 expand operations to include larger-scale computations, and so yes
 they are willing to become more research oriented. They just need
 the extra hardware of course. I think it's difficult to estimate how
 much but here are some applications that I would like to make or see
 made sooner or later:
 
 * WikiBlame - A Lucene index of the history of all projects that can
 instantly find the authors of a pasted snippet. I'm not clear on the
 memory requirements of hosting an app like this after the index is
 created, but the index will be terabyte-size at 35% of the text dump.

Note that WikiTrust can do this too, and will probably go into testing soon. For
now, the database for WikiTrust will be off-site, but if it goes live on
Wikipedia, the hardware would be run at the main WMF cluster, and not on the
toolserver.
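
As a rough illustration of the snippet-to-author lookup discussed above, here is a
minimal Python sketch. It is neither Lucene nor WikiTrust's method: it indexes
word n-grams of each revision and reports the earliest revision that contains
every n-gram of a pasted snippet. The names SnippetIndex, add_revision and blame
are hypothetical, and the sketch assumes revision ids increase over time.
Containing all n-grams does not prove the snippet appears contiguously, so this
is only an approximation.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of word n-grams in text (one shorter tuple if text has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


class SnippetIndex:
    """Toy inverted index: n-gram -> set of revision ids containing it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)
        self.authors = {}

    def add_revision(self, rev_id, author, text):
        self.authors[rev_id] = author
        for gram in shingles(text, self.n):
            self.postings[gram].add(rev_id)

    def blame(self, snippet):
        """Return (rev_id, author) of the earliest revision containing all n-grams, or None."""
        grams = shingles(snippet, self.n)
        if not grams:
            return None
        candidates = set.intersection(*(self.postings.get(g, set()) for g in grams))
        if not candidates:
            return None
        first = min(candidates)  # assumption: revision ids increase over time
        return first, self.authors[first]
```

A real index over the full history would need on-disk postings and compression;
the sketch only shows the lookup logic.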

 * WikiBlame for images - an image similarity algorithm over all images
 in all projects that can find all places a given image is being used.
 I believe there is a one-time major cpu cost when first analyzing the
 images and then a much lesser realtime comparison cost. Again, the
 memory requirements of hosting such an app are unclear.

That would be very nice to have...

 * A vandalism classifier bot that uses the entire history of a wiki in
 order to predict whether the current edit is vandalism. Basically, a
 major extension of existing published work on automatically detecting
 vandalism, which only used several hundred edits. This would require
 major cpu resources for training but very little cost for real-time
 classification.

Pretty big for a toolserver project. But an excellent research topic!
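
To illustrate the split between expensive training and cheap classification
mentioned above, here is a minimal Python sketch of a word-count naive Bayes
classifier. Naive Bayes is an assumption chosen for brevity; the thread does not
say which method the published work used. The class name, labels and sample
edits are hypothetical.

```python
import math
from collections import Counter


class EditClassifier:
    """Toy naive Bayes over the words of an edit, with add-one smoothing."""

    LABELS = ("vandalism", "ok")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, label, text):
        """Training touches every historical edit once: this is the costly part."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        """Classification only needs the stored counts, so it is cheap per edit."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log((self.doc_counts[label] + 1) / (total_docs + len(self.LABELS)))
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```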

 * Dumps, including extended dump formats such as a natural language
 parse of the full text of the recent version of a wiki made readily
 available for researchers.
 
 Finally, there are many worthwhile projects that have been presented
 at past Wikimanias or published in the literature that deserve to be
 kept up to date as the encyclopedia continues to grow. Permanent
 hosting for such projects would be a worthwhile goal, as would
 reaching out to these researchers. If the foundation can afford such
 an endeavor, the hardware cost is actually not that great. Perhaps
 datacenter fees are.

Please don't forget that the toolserver is NOT run by the Wikimedia Foundation.
It's run by Wikimedia Germany, which has maybe a tenth of the foundation's
budget. If the foundation is interested in supporting us further, that's great,
we just need to keep responsibilities clear: is the foundation running a
project, or is the foundation helping us (Wikimedia Germany) to run a project?...

-- daniel



Re: [Wikitech-l] research-oriented toolserver?

2009-03-14 Thread Brian
How will WikiTrust accomplish the WikiBlame function? I think I know
what WikiTrust is: http://trust.cse.ucsc.edu/

What gives it the function that you can enter a piece of wiki code
from the history of any wiki - totally out of context - and it returns
the authors?

On Sat, Mar 14, 2009 at 2:02 AM, Daniel Kinzler dan...@brightbyte.de wrote:
 Brian wrote:
 * WikiBlame - A Lucene index of the history of all projects that can
 instantly find the authors of a pasted snippet. I'm not clear on the
 memory requirements of hosting an app like this after the index is
 created, but the index will be terabyte-size at 35% of the text dump.

 Note that WikiTrust can do this too, and will probably go into testing soon. For
 now, the database for WikiTrust will be off-site, but if it goes live on
 Wikipedia, the hardware would be run at the main WMF cluster, and not on the
 toolserver.



Re: [Wikitech-l] research-oriented toolserver?

2009-03-13 Thread Morten Warncke-Wang
Hi all,

Judging by the replies we think we've failed to communicate clearly
some of the ideas we wanted to put forward, and we'd like to take the
opportunity to try to clear that up.

We did not want to narrow this down to be only about a third party
toolserver.  Before we initiated contact we noticed the need for
adding more resources to the existing cluster.  Therefore we also had
in mind the idea of augmenting the toolserver, rather than attempt to
create a competitor for it.  For instance this could help allow the
toolserver to also host applications requiring some amounts of text
crunching, which is currently not feasible as far as we can tell.

Additionally we think there could perhaps be two paths to account
creation, one for Wikipedians and one for researchers, with the
research path laid out with clearer documentation on the requirements
projects would need to fit the toolserver and what the application
should contain, which combined with faster feedback would aid to make
the process easier for the researchers.

We hope that this clears up some central points in our ideas
surrounding a research oriented toolserver.  Currently we are
exploring several ideas and this particular one might not become more
than a thought and a thread on a mailing list.  Nonetheless perhaps
there are thoughts here that can become more solid somewhere down the
line.

Morten Warncke-Wang, Research Assistant
John Riedl, Professor
GroupLens Research
www.grouplens.org



Re: [Wikitech-l] research-oriented toolserver?

2009-03-12 Thread River Tarnell

Aryeh Gregor:
 If I understand correctly, the only change being contemplated here is not
 replicating the databases that are entirely secret (databases of private
 wikis).

this is correct.

 I might be misunderstanding, though.  If only entire databases need to
 be hidden, why can't the toolserver just be set up not to replicate
 those, given that MySQL supports that?

because it would require a proxy server under WMF control that filtered out the
evil tables and provided a clean replicated feed to the toolserver, which is
a lot more effort (and more fragile) than just moving the bad data.
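
The filtering proxy described above can be pictured as a filter over the
replication stream. The following Python sketch assumes a heavily simplified
event model; real MySQL replication events are binary log entries, and the
database names, the Event class and clean_feed are hypothetical. As far as I
understand, MySQL's replica-side filters such as replicate-ignore-db are applied
on the replica after it has already received the events, which is presumably
why the filtering would have to happen on a host under WMF control.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical names of private wiki databases; not taken from the thread.
PRIVATE_DATABASES = frozenset({"privatewiki", "boardwiki"})


@dataclass(frozen=True)
class Event:
    """Simplified stand-in for one replicated write."""
    database: str
    statement: str


def clean_feed(events: Iterable[Event], private=PRIVATE_DATABASES) -> Iterator[Event]:
    """Yield only the events whose database is not private."""
    for event in events:
        if event.database not in private:
            yield event
```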

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-12 Thread River Tarnell

Morten Warncke-Wang:
 What do you think?  Seem like a useful idea if we can find sufficient
 resources, and put together a management plan?

no, like Daniel said, this is a waste of time and effort.  i originally assumed
that a research toolserver would be different in some technical sense, which
might make at least some sense (although i've argued against that elsewhere in
this thread).  however, i completely fail to understand your reasoning here.

is there some backstory i'm missing?  did you apply for a Toolserver account
and were rejected because you aren't a Wikipedia editor?  does the WM-DE have a
history of doing this?  (i'm certainly not aware of it, if so...) 

if you want to improve the account approval process at the Toolserver, doesn't
it make more sense to do that, rather than creating a completely new project to
fix one small issue?

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-12 Thread River Tarnell

Brion Vibber:
 Could be done. We're also fine with new toolserver roots as long as we
 approve em too for now.

it would have been nice if the Toolserver was aware of this ;-)

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-12 Thread Daniel Kinzler
Anthony wrote:
 On Tue, Mar 10, 2009 at 12:29 AM, Andrew Garrett and...@werdn.us wrote:
 
 On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858sn...@yahoo.com.au
 wrote:
 Currently all data, including private data, is replicated to the
 toolserver. We could not do this with a third-party server.
 My understanding is that the toolserver(/s) are owned by the
 german chapter and not by wikimedia directly so why is private data
 being replicated onto them?
 Because it was chosen as the best technical solution. Is there a
 specific problem with private data being on the toolserver? If so,
 what?

 You should be aware that toolserver roots are approved by the
 foundation before becoming roots.
 
 
 You answer the questions in your first paragraph with your sentence in the
 second.   Think Cathedral vs. Bazaar.
 
 On Tue, Mar 10, 2009 at 4:27 AM, Daniel Kinzler dan...@brightbyte.de wrote:
 Robert Rohde wrote:
 On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett and...@werdn.us wrote:
 Logistically it would be nice to have a means of providing an
 exclusively public data replica for purposes such as research, though
 I can certainly see how that could get technically messy.

 As far as I know, there is simply no efficient way to do this currently.
 
 How much information does the live feed provide?  Every revision, or just a
 subset of revisions?  How much would it cost the WMF to provide a single
 near-live stream of every revision?

A feed service for all revisions is available, see
http://meta.wikimedia.org/wiki/Wikimedia_update_feed_service. Search engines
like to use it (think: answers.com) and they are made to pay for it. Researchers
should generally get it for free. Just ask brion.

This doesn't provide notifications in the range of seconds (which might be needed
for vandal-fighting tools), but should be quite sufficient to keep a text
database up to date. For real-time notifications, the only decent method is the
RC feed on IRC, but that's hard to parse and messages frequently get truncated.

Having better means for distributing notifications of changes is something i'm
quite interested in. XMPP would be a very good choice, I think, I wrote about it
a while ago here: http://brightbyte.de/page/RecentChanges_via_Jabber. I did
not write about including full revision text or diffs in the notifications, but
that's sure possible. It may be a bit too heavy for a general purpose feed, but
it would be feasible when using PubSub, I think. Anyway, getting this
implemented would be nice. If anyone has time and/or money he could commit
towards this, that would be excellent :)
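
To make the PubSub idea more concrete, here is a minimal in-process Python
sketch. It is not XMPP and uses no XMPP library; it only shows the model in
which a client subscribes to a node (assumed here to be one node per wiki) and
receives just the changes published there, optionally carrying the revision text
or a diff in the payload. ChangeBroker and the payload fields are hypothetical.

```python
from collections import defaultdict


class ChangeBroker:
    """In-process stand-in for a PubSub service with one node per wiki."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, node, callback):
        """Register callback to be called with every change published on node."""
        self._subscribers[node].append(callback)

    def publish(self, node, change):
        """Deliver change to every subscriber of node and return the delivery count."""
        callbacks = self._subscribers.get(node, [])
        for callback in callbacks:
            callback(change)
        return len(callbacks)
```

A vandal-fighting tool would subscribe to a light node without text, while a
tool that maintains a text database could subscribe to a heavier node that
includes it.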

-- daniel



Re: [Wikitech-l] research-oriented toolserver?

2009-03-12 Thread Brion Vibber
On 3/12/09 7:14 AM, River Tarnell wrote:
 Brion Vibber:
 Could be done. We're also fine with new toolserver roots as long as we
 approve em too for now.

 it would have been nice if the Toolserver was aware of this ;-)

I was pretty sure this came up in an IRC chat a few months ago; my 
apologies if we didn't both realize it. :)

-- brion



Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread River Tarnell

Aryeh Gregor:
 I don't think the toolserver is used for backups.

it is, but only in the sense that it's our only off-site copy of the database.
it was not created to act as a backup...

  At least I hope it's not, given its reliability (which is quite good, but
  quite good is scary for backups).

... however, if we had enough money to support the toolserver properly, i think
it would be perfectly reliable as a backup.  that's something that might change
this year.

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread River Tarnell

Robert Rohde:
 In particular, I think it is useful to separate tools from analysis.

why?

 Tools need high availability and low lag relative to the live site, but
 analysis doesn't care if it gets out of date and should use scheduling etc.
 to balance large loads.

what is preventing people from using the current toolserver for this analysis?
what do we need to change about the platform that will enable people to run it
on the current toolserver?

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread River Tarnell

Andrea Forte:
 Let me know if you have a grant proposal you'd like help with!

well, i'm still not sure what exactly people need.  perhaps the various
academic people could produce a list of what they want to do on the toolserver
and what's missing at the moment?  (e.g. fast text access, search, ...)

then we can look at the best way to provide this, including where the money
should come from.

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Brian
I vote for making the toolserver the head-node to a much larger
beowulf cluster that has a well configured job scheduler. The data
that needs to be crunched is already right there - it makes sense to
put a research cluster there as well.

There will always be a limited supply of resources. Perhaps there
should be a public approval system for the resources, where the
community gets to pick which jobs should get added to the queue based
on public analysis of the code and a description of the computation.

There will be no shortage of participants ;)

On Wed, Mar 11, 2009 at 2:37 AM, River Tarnell
ri...@loreley.flyingparchment.org.uk wrote:

 Robert Rohde:
 In particular, I think it is useful to separate tools from analysis.

 why?

 Tools need high availability and low lag relative to the live site, but
 analysis doesn't care if it gets out of date and should use scheduling etc.
 to balance large loads.

 what is preventing people from using the current toolserver for this analysis?
 what do we need to change about the platform that will enable people to run it
 on the current toolserver?

        - river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread River Tarnell

Brian:
 I vote for making the toolserver the head-node to a much larger beowulf
 cluster that has a well configured job scheduler.

so the issue is that more CPU is needed to run the research jobs?  how much
more?  do you have an example of a job and what it would require to run here?

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Brian
Sure - creating a Lucene index of the entire revision history of all
Wikipedias for a WikiBlame extension.

More realistically (although I would like to do the above) a natural
language parse of the current revision of the english wikipedia. Based
on the supposed availability of this hardware, I'd say it could be
done in less than a week.

https://wiki.toolserver.org/view/Servers

I have to say the toolserver has grown a lot from that first donated server ^_^

On Wed, Mar 11, 2009 at 3:00 AM, River Tarnell
ri...@loreley.flyingparchment.org.uk wrote:

 Brian:
 I vote for making the toolserver the head-node to a much larger beowulf
 cluster that has a well configured job scheduler.

 so the issue is that more CPU is needed to run the research jobs?  how much
 more?  do you have an example of a job and what it would require to run here?

        - river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread River Tarnell

Robert Rohde:
 The starting point is providing full-text history availability and once you
 have that there are a number of different projects (like wikiblame) which
 would desire to pull and process every revision in some way.

okay, so full text access has been a 'would be nice' thing for a while.  i
added an item to this year's shopping list for it.

it seems more useful to provide the text in uncompressed form, instead of the
MediaWiki internal form that's almost impossible to work with.  does that seem
reasonable?

 Some of the code I've worked with would probably take weeks to run
 single-threaded against enwiki, but that can be made practical if one is
 willing to throw enough cores at the problem.

well, this probably isn't something we could afford ourselves, but if there's
enough interest in a batch computing infrastructure, it's probably worth
talking to external organisations about this.

 From an exterior point of view it often seems like toolserver is
 significantly lagged or tools are going down, and from that I have generally
 assumed that it operates relatively close to capacity a lot of the time.

that is correct.  the way it works is we run at or over capacity for a while,
until we can afford new hardware, then things are fast for a while, until we
reach capacity again.  this repeats every year or so.  (interestingly, this is
exactly how Wikipedia worked in the first few years.)

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Aryeh Gregor
On Wed, Mar 11, 2009 at 11:20 AM, Brion Vibber br...@wikimedia.org wrote:
 Quite so. :) Replication is fantastic against outright failure, but by
 itself doesn't help against data loss within the system, which gets
 replicated right along with it.

 we're working on ensuring we've got regular snapshots as well, though
 this isn't up yet. Regular snapshots plus the replication binlogs
 provide for point-in-time restoration.

Maybe you need to move the DB servers to ZFS on Solaris too.  ;)
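
The point-in-time restoration mentioned in the quoted text (a snapshot plus the
replication binlogs) can be sketched in Python as follows. The data model is an
assumption made for illustration: the snapshot is a plain dict, and the binlog
is a list of (timestamp, key, value) writes in timestamp order, with None
marking a delete. Real MySQL binlogs hold statements or row events, and the
function name restore_to is hypothetical.

```python
def restore_to(snapshot, snapshot_time, binlog, target_time):
    """Rebuild the state as of target_time.

    snapshot: dict of key -> value captured at snapshot_time.
    binlog: list of (timestamp, key, value) writes in timestamp order;
            a value of None means the key was deleted.
    """
    if target_time < snapshot_time:
        raise ValueError("target_time predates the snapshot")
    state = dict(snapshot)
    for timestamp, key, value in binlog:
        if timestamp <= snapshot_time:
            continue  # already reflected in the snapshot
        if timestamp > target_time:
            break  # stop just before the unwanted change
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

Replication alone would have copied the unwanted write to every replica; the
replay stops before it, which is what makes the snapshot necessary.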



Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Brion Vibber
On 3/11/09 9:43 AM, Aryeh Gregor wrote:
 On Wed, Mar 11, 2009 at 12:35 PM, Platonides platoni...@gmail.com wrote:
 I know. That's precisely what i'm addressing. From your email, WMF is
 reorganising their databases so the toolserver can get more admins
 (less private data is replicated/stored at ts).
 Any such change to the schema would be pretty big, IMHO (and yet
 incomplete).

 If I understand correctly, the only change being contemplated here is
 not replicating the databases that are entirely secret (databases of
 private wikis).  Toolserver roots would still have access to things
 like the recentchanges table and hidden revisions on public wikis, and
 would presumably still have to sign NDAs or act as Foundation agents
 or whatever to access those.

 I might be misunderstanding, though.  If only entire databases need to
 be hidden, why can't the toolserver just be set up not to replicate
 those, given that MySQL supports that?

Could be done. We're also fine with new toolserver roots as long as we 
approve em too for now.

-- brion



Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Platonides
River Tarnell wrote:
 it seems more useful to provide the text in uncompressed form, instead of the
 MediaWiki internal form that's almost impossible to work with.  does that seem
 reasonable?

The tools should get the text in uncompressed form. The interface to do
that is not so important.
Given the amount of text, I don't think storing text with some kind of
compression is something to discard right away.

A common data access interface would be interesting. Perhaps as a C
library to link, include as php extension... Then implement it for
different sources:
-Toolserver text replication
-WikiProxy
-Mysql mediawiki database
-Mediawiki API
-XML dump

Then applications just need to be designed for the text interface,
debugged with a local install, tested with a small dump, deployed on
toolserver...
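
The common interface suggested above could look roughly like the following
sketch. It is written in Python rather than as the C library proposed in the
message, purely for brevity, and every name in it (TextSource, DictSource,
XmlDumpSource, word_count) is hypothetical. Only two of the listed backends are
shown: an in-memory one for local debugging and one that loads a small XML dump.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class TextSource(ABC):
    """Common interface: applications only ever call get_text()."""

    @abstractmethod
    def get_text(self, rev_id):
        """Return the uncompressed wikitext of a revision, or None if unknown."""


class DictSource(TextSource):
    """Backend for local debugging: revisions held in memory."""

    def __init__(self, revisions):
        self._revisions = dict(revisions)

    def get_text(self, rev_id):
        return self._revisions.get(rev_id)


def _local(tag):
    """Strip an XML namespace prefix such as '{uri}revision'."""
    return tag.rsplit("}", 1)[-1]


class XmlDumpSource(TextSource):
    """Backend for a small XML dump, loaded fully into memory."""

    def __init__(self, stream):
        self._revisions = {}
        for _, elem in ET.iterparse(stream):
            if _local(elem.tag) != "revision":
                continue
            # Only direct children are read, so a nested contributor id is ignored.
            fields = {_local(child.tag): child.text for child in elem}
            self._revisions[int(fields["id"])] = fields.get("text") or ""

    def get_text(self, rev_id):
        return self._revisions.get(rev_id)


def word_count(source, rev_id):
    """Example application code, written against TextSource only."""
    text = source.get_text(rev_id)
    return None if text is None else len(text.split())
```

Because word_count depends only on TextSource, the same code can be debugged
against DictSource, tested against a small dump, and later pointed at a
toolserver or API backend.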




Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread DaB.
On Tuesday 10 March 2009 01:07:36, phoebe ayers wrote:
 their Wikipedia-related research has been
 put on hold for a few months because of the delay. (It seems like
 there is a big backlog of account requests right now and only one
 person working on them?)

Hello, I'm DaB., the lazy guy who normally approves the accounts.
I'm sorry that your request is taking so long; perhaps I can explain why.
You requested your account at the end of last year. At that time our servers
were quite loaded and we were waiting for additional hardware, so I decided not
to create new accounts for the time being. At the beginning of December we
planned which new servers to buy, and we hoped to buy them in December. For
some reason that did not work out, and we did not buy them until January. So I
decided to create no new accounts before the delivery. But it took several
weeks until the servers were delivered, one more week to set them up, and
another week to check them. Now we have the resources to create new accounts,
but then I got the flu (and still have it). I hope that I can create new
accounts soon. Daniel was kind enough to offer his help, so it should not take
much longer.

And BTW: I saw all the emails, wiki-emails and wiki-messages that you and some
others sent; you were not ignored.

Sincerely,
DaB.

-- 
wp-blog.de



Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Morten Warncke-Wang
Hello everyone.  We started the conversation with Phoebe about the
possibility of a research-oriented toolserver that could be used by
researchers who wish to explore novel gadgets or other tools for
Wikipedia users.  The toolserver could provide back-end support for
these gadgets.

By the phrase research-oriented toolserver we are looking for
similar services to what is available in the existing toolserver
cluster.  From what we've heard of the research infrastructures being
developed at Syracuse and Concordia, they will be valuable for
researchers who are in need of full text data access on a large scale.
The research toolserver, by contrast, would be for tools that need
live access to Wikipedia databases, but that would only access the
full text on a small scale through the Wikipedia API.

The major difference from our perspective is how applications for new
accounts would be handled.  Our idea is to be able to hand out
accounts based around the likelihood of effective research, rather
than on visibility within Wikipedia, or on the usefulness of the
resulting tool to the larger Wikipedia community.  The latter two
cases are already handled well by the existing toolserver and its
application process.  Accounts on the research toolserver would be
approved based on the quality of the research ideas, and the ability
of the proposing team to carry out the research.  

The research toolserver would need a more transparent decision-making
process for approving accounts.  The basis for decisions should be
clear to applicants so they're able to write better applications, and
denied applications should be returned with feedback about why the
decision was made.

What do you think?  Seem like a useful idea if we can find sufficient
resources, and put together a management plan?

Morten Warncke-Wang, Research Assistant
John Riedl, Professor
GroupLens Research
www.grouplens.org



Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread John Doe
The current toolserver user base is always willing to help. I for one am
willing to review and run queries on the database if requested. I also am a
very active python programmer and can use that to assist. If you have
requests let me know. (Unless I'm losing it) there have been accounts that
are only created to be used for database queries. But until then feel free
to email or contact me. I think that improving and expanding the current TS
is the best option as further duplications will result in lower performance.

Betacommand

On Wed, Mar 11, 2009 at 7:05 PM, Morten Warncke-Wang mor...@cs.umn.edu wrote:

 Hello everyone.  We started the conversation with Phoebe about the
 possibility of a research-oriented toolserver that could be used by
 researchers who wish to explore novel gadgets or other tools for
 Wikipedia users.  The toolserver could provide back-end support for
 these gadgets.

 By the phrase research-oriented toolserver we are looking for
 similar services to what is available in the existing toolserver
 cluster.  From what we've heard of the research infrastructures being
 developed at Syracuse and Concordia, they will be valuable for
 researchers who are in need of full text data access on a large scale.
 The research toolserver, by contrast, would be for tools that need
 live access to Wikipedia databases, but that would only access the
 full text on a small scale through the Wikipedia API.

 The major difference from our perspective is how applications for new
 accounts would be handled.  Our idea is to be able to hand out
 accounts based around the likelihood of effective research, rather
 than on visibility within Wikipedia, or on the usefulness of the
 resulting tool to the larger Wikipedia community.  The latter two
 cases are already handled well by the existing toolserver and its
 application process.  Accounts on the research toolserver would be
 approved based on the quality of the research ideas, and the ability
 of the proposing team to carry out the research.

 The research toolserver would need a more transparent decision-making
 process for approving accounts.  The basis for decisions should be
 clear to applicants so they're able to write better applications, and
 denied applications should be returned with feedback about why the
 decision was made.

 What do you think?  Seem like a useful idea if we can find sufficient
 resources, and put together a management plan?

 Morten Warncke-Wang, Research Assistant
 John Riedl, Professor
 GroupLens Research
 www.grouplens.org



Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Aryeh Gregor
On Wed, Mar 11, 2009 at 7:05 PM, Morten Warncke-Wang mor...@cs.umn.edu wrote:
 The major difference from our perspective is how applications for new
 accounts would be handled.  Our idea is to be able to hand out
 accounts based around the likelihood of effective research, rather
 than on visibility within Wikipedia, or on the usefulness of the
 resulting tool to the larger Wikipedia community.  The latter two
 cases are already handled well by the existing toolserver and its
 application process.  Accounts on the research toolserver would be
 approved based on the quality of the research ideas, and the ability
 of the proposing team to carry out the research.

As far as I know, the account approval process on the toolserver is
fairly lax.  As long as you have some credible Wikipedia-related
reason to use the toolserver, whether tools or research, you should be
able to get an account.  Am I wrong?  Have any researchers been
rejected from the toolserver?


Re: [Wikitech-l] research-oriented toolserver?

2009-03-11 Thread Daniel Kinzler
 The major difference from our perspective is how applications for new
 accounts would be handled.  Our idea is to be able to hand out
 accounts based around the likelihood of effective research, rather
 than on visibility within Wikipedia, or on the usefulness of the
 resulting tool to the larger Wikipedia community.  The latter two
 cases are already handled well by the existing toolserver and its
 application process.  Accounts on the research toolserver would be
 approved based on the quality of the research ideas, and the ability
 of the proposing team to carry out the research.  
 
 The research toolserver would need a more transparent decision-making
 process for approving accounts.  The basis for decisions should be
 clear to applicants so they're able to write better applications, and
 denied applications should be returned with feedback about why the
 decision was made.
 
 What do you think?  Seem like a useful idea if we can find sufficient
 resources, and put together a management plan?

If the only problem solved by setting up a dedicated research cluster is that of
the account approval system, then by all means let's fix the system on the
toolserver, and keep things together. Apart from the fact that full database
replication to a third party system is very unlikely to happen for legal
reasons, it would be a waste of hardware and effort.

For a system with a very much different focus, such as text crunching, a
separate cluster seems worth considering, even though I'd of course prefer to
have everything available to our users. But a second system with a spec very
similar to ours (live replicated meta data) seems wasteful, even if replication
was technically and legally feasible.

Let's try to fix the problems of the current toolserver, starting with the
application process and continuing with a plan for how research projects
could contribute to the hardware platform and infrastructure software.

As to the approval policy: research projects are usually approved, if their
resource requirements are not too steep. Utility to the Wikimedia user community
is only one factor that is considered; it's not required for research projects.
Making the process more transparent and giving feedback more swiftly is indeed
something we should work on. In fact, I will try to set aside a fixed amount of
working time for this.

-- daniel




Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Daniel Kinzler
Robert Rohde schrieb:
 On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett and...@werdn.us wrote:
 On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858sn...@yahoo.com.au wrote:
 Currently all data, including private data, is replicated to the
 toolserver. We could not do this with a third-party server.
 My understanding is that the toolserver(/s) are owned by the
 german chapter and not by wikimedia directly so why is private data
 being replicated onto them?
 Because it was chosen as the best technical solution. Is there a
 specific problem with private data being on the toolserver? If so,
 what?
 
 I'd say the added worries about security and access approval are a
 problem partially bundled up with that, even if they can be worked
 around.
 
 Logistically it would be nice to have a means of providing an
 exclusively public data replica for purposes such as research, though
 I can certainly see how that could get technically messy.

As far as I know, there is simply no efficient way to do this currently. MySQL's
replication can be told to omit entire tables, but not individual columns or
even rows. That would be required though. With the new revision-deletion
feature, we have even more trouble.
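
To illustrate the gap: skipping a whole table is a replication setting, but
hiding individual rows or columns would need a separate sanitizing step
somewhere between master and replica. A minimal sketch of such a step, assuming
MediaWiki's rev_deleted bitfield values (1 = text hidden, 2 = comment hidden,
4 = user hidden). The function name and the row-as-dict format are hypothetical
and not anything the toolserver actually runs:

```python
# Hypothetical sketch: redact one row of MediaWiki's revision table before it
# would be written to a public-only replica. The bit values follow MediaWiki's
# rev_deleted field; the dict-based row format is made up for illustration.
DELETED_TEXT = 1      # revision text hidden
DELETED_COMMENT = 2   # edit summary hidden
DELETED_USER = 4      # author hidden


def sanitize_revision(row):
    """Return a copy of `row` with the hidden fields blanked out."""
    public = dict(row)  # never modify the caller's row
    flags = row.get("rev_deleted", 0)
    if flags & DELETED_TEXT:
        public["rev_text_id"] = None
    if flags & DELETED_COMMENT:
        public["rev_comment"] = None
    if flags & DELETED_USER:
        public["rev_user"] = None
        public["rev_user_text"] = None
    return public
```

Stock replication has no hook for running something like this per row, which
is why a public-only replica is hard to provide today.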

So, toolserver roots need to be trusted and approved by the foundation. However,
account *approval* doesn't require root access. It doesn't require any access,
technically. Account *creation* of course does, but that's not much of a
problem (except currently, because of infrastructure changes due to new servers,
but that will be fixed soon).

To avoid confusion: *two* Daniels can do approval: DaB and me. Neither of us
has much time currently - DaB does it every now and then, and I don't do it at
all, admittedly - I'm caught up in organizing the dev meeting and hardware
orders besides doing my regular development jobs. I suppose we should streamline
the process, yes. This would be a good topic for the developer meeting, maybe.


-- daniel



Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Daniel Kinzler
Bilal Abdul Kader schrieb:
 Greetings,
 We are setting up a research server at Concordia University (Canada) that is
 dedicated for Wikipedia. We would love to share the resources with anyone
 interested.
 
 In case anyone needs help setting it up, we would love to help as well.
 
 bilal

There's a project for a biggish research cluster for Wikipedia data awaiting
funding at Syracuse University. I forwarded your mail to one of the people
involved. Perhaps you can join forces.

 
 On Mon, Mar 9, 2009 at 8:07 PM, phoebe ayers phoebe.w...@gmail.com wrote:
 
 Hi all,
 I'm not sure exactly where to raise this, so am asking here.

 A researcher I have been in touch with has proposed starting a 2nd,
 research-oriented Wikimedia toolserver. He thinks his lab can pay for
 the hardware and would be willing to maintain it, if they could get
 help setting it up. He got this idea after a member of his research
 group tried (unsuccessfully so far -- no response) to get an account
 on the current toolserver; their Wikipedia-related research has been
 put on hold for a few months because of the delay. (It seems like
 there is a big backlog of account requests right now and only one
 person working on them?)  This research group has done some
 interesting Wikipedia research to date and I expect they could do more
 with access to the right data.

I apologize for the delay. Perhaps you can send me some details in private, and
I'll look at it. DaB doesn't have much time lately, and we had some major
changes in infrastructure to take care of, which caused some delays.

 Personally, I think a dedicated toolserver is a great idea for the
 research community, but I know very little about the technical issues
 involved and/or whether this has been proposed before. Please comment,
 and I can pass on replies and put the researcher in touch with the
 tech team if it seems like a good idea.

Whether it makes sense to run a separate cluster largely depends on what kind of
data you need access to, and in what time frame. If you work mostly on secondary
data like link tables, and you need the data in near-real time, use
toolserver.org. That's what it's there for, and it's unlikely you can set up
anything that could get the same data with low latency.

However, if you work mostly on full text, toolserver.org is not so useful
- there's no direct access to full page text there, nor to search
indexes. Having a dedicated cluster for research on textual content, perhaps
providing content in various pre-processed forms, would be a very good idea.
This is what the project I mentioned above aims at, and I'll be happy to support
this effort officially, as Wikimedia Germany's tech guy.


-- daniel



Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread River Tarnell

phoebe ayers:
 Personally, I think a dedicated toolserver is a great idea for the research
 community, but I know very little about the technical issues involved and/or
 whether this has been proposed before. Please comment, and I can pass on
 replies and put the researcher in touch with the tech team if it seems like a
 good idea.

i don't understand what "research-oriented toolserver" means.  what will the
research-toolserver provide that the current toolserver doesn't provide?

is the only issue the time it takes for accounts to be created?  this is a
WM-DE issue; the more people who complain to WM-DE about this, the more likely
it is to be resolved.  (so far, i've had zero communications from WM-DE about
how the only people able to approve accounts are so busy with other things
nowadays.  on the other hand, i didn't ask them about it either; i suppose they
don't bother monitoring the toolserver most of the time.)

we recently conducted a survey of toolserver users, and account approval (not
creation) was generally felt to be quite slow.  once i produce a report from
the results of that survey, we might be able to get WM-DE to do something about
it.

most of the issues with the current toolserver come down to money.  we don't
have enough money to afford redundant databases, so any failure is a major
problem and creates inconvenience for users.  we don't have enough money for a
paid admin, so it often takes a long time for things to get done.  we don't
have enough money to upgrade hardware when we need it, so things are often slow
until the money is available.  i think the only non-money issue is that the
Wikimedia Foundation won't allow us to add any more admins until they do some
internal reorganisation of their databases, which we've been waiting for for
several months now.  

the more separate toolservers we have, the less efficiently the money is spent.
sure, every chapter and university could have their own toolserver, but i don't
see how that's a better situation than these people contributing to a single
toolserver in order to fix the problems that prevent people from using it.
i've lost count of how often i've heard "the toolserver sucks; let's start our
own".  what i don't understand is why no one says "the toolserver sucks; how
can we make it better?".  (there _has_ been some interest from other chapters
recently about how to improve the toolserver; however, most chapters don't have
a lot of money to spend.  a single additional database server for the
toolserver would cost at least EUR8'000.)

in the past, we had a lot of problems getting WM-DE to do anything for the
toolserver (it seemed everyone there was busy with something else), but that's
been better recently, so i think we're making some progress.

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread River Tarnell

Aryeh Gregor:
 Oh.  Why does a single specific person have to handle the approval of
 all toolserver account requests, then?

because accounts have to be approved by WM-DE, and WM-DE has designated this
person to approve accounts on their behalf.

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread phoebe ayers
Thanks for the responses, all.

Daniel and Bilal: the notes about the possible servers at Syracuse and
Concordia are very interesting; it sounds like the researchers
interested in such things should team up.

Daniel: I am not sure what type of data is needed -- this is not my
project (I'm only the messenger!) but I'll pass along your message and
send you private details (and encourage the researcher to reply
himself).

River: Well, you say that part of the issue with the toolserver is
money and time... and this person that I've been talking to is
offering to throw money and time at the problem. So, what can they
constructively do?

All: Like I said, I am unclear on the technical issues involved, but
as for why a separate research toolserver might be useful... :
I see a difference in the type of information a researcher might want
to pull (public data, large sets of related page information,
full-text mining, ??) and the types of tools that the current
toolserver mainly supports (editcount tools, catscan, etc). I also see
a difference in how the two groups might be authenticated -- there's a
difference between being a trusted Wikipedian or trusted Wikimedia
developer and being a trusted technically-competent researcher (for
instance, I recognized the affiliation of the person who was trying to
apply, because I've read their research papers; but if you were going
on wikimedia status alone, they don't have any).

-- Phoebe

-- 
* I use this address for lists; send personal messages to phoebe.ayers
at gmail.com *



Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread River Tarnell

phoebe ayers:
 River: Well, you say that part of the issue with the toolserver is money and
 time... and this person that I've been talking to is offering to throw money
 and time at the problem. So, what can they constructively do?
 
i think this is being discussed privately now...

 I see a difference in the type of information a researcher might want to pull
 (public data, large sets of related page information, full-text mining, ??)
 and the types of tools that the current toolserver mainly supports (editcount
 tools, catscan, etc).

so, what is missing from the current toolserver that prevents researchers from
working with large data sets?

 I also see a difference in how the two groups might be authenticated --
 there's a difference between being a trusted Wikipedian or trusted Wikimedia
 developer and being a trusted technically-competent researcher 

i don't see why access to the toolserver would be restricted to Wikipedia
editors.  in fact, i'd be happier giving access to a recognised academic expert
than some random guy on Wikipedia.

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Andrea Forte
I've been trying to do some work mining the full en dump with revision
history and was involved in getting together the Syracuse grant
proposal. To give you an idea, for me personally, the incentive for a
new resource is a need for a server (perhaps a cluster) to support
full-text queries at a reasonable speed. People at various research
institutions duplicate this effort over and over.

Andrea



On Tue, Mar 10, 2009 at 2:26 PM, phoebe ayers phoebe.w...@gmail.com wrote:
 Thanks for the responses, all.

 Daniel and Bilal: the notes about the possible servers at Syracuse and
 Concordia are very interesting; it sounds like the researchers
 interested in such things should team up.

 Daniel: I am not sure what type of data is needed -- this is not my
 project (I'm only the messenger!) but I'll pass along your message and
 send you private details (and encourage the researcher to reply
 himself).

 River: Well, you say that part of the issue with the toolserver is
 money and time... and this person that I've been talking to is
 offering to throw money and time at the problem. So, what can they
 constructively do?

 All: Like I said, I am unclear on the technical issues involved, but
 as for why a separate research toolserver might be useful... :
 I see a difference in the type of information a researcher might want
 to pull (public data, large sets of related page information,
 full-text mining, ??) and the types of tools that the current
 toolserver mainly supports (editcount tools, catscan, etc). I also see
 a difference in how the two groups might be authenticated -- there's a
 difference between being a trusted Wikipedian or trusted Wikimedia
 developer and being a trusted technically-competent researcher (for
 instance, I recognized the affiliation of the person who was trying to
 apply, because I've read their research papers; but if you were going
 on wikimedia status alone, they don't have any).

 -- Phoebe

 --
 * I use this address for lists; send personal messages to phoebe.ayers
 at gmail.com *



Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread River Tarnell

Andrea Forte:
 To give you an idea, for me personally, the incentive for a new resource is a
 need for a server (perhaps a cluster) to support full-text queries at a
 reasonable speed. 

then why not help us do this on the existing toolserver, so everyone can have
access to it, instead of duplicating it yet again somewhere else?

there are many toolserver users who would like direct access to text, and the
ability to search it.

- river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Robert Rohde
On Tue, Mar 10, 2009 at 1:27 PM, River Tarnell
ri...@loreley.flyingparchment.org.uk wrote:

 phoebe ayers:
 River: Well, you say that part of the issue with the toolserver is money and
 time... and this person that I've been talking to is offering to throw money
 and time at the problem. So, what can they constructively do?

 i think this is being discussed privately now...

If other research groups are interested in contributing to this, who
should they be talking to?

<snip>

 i don't see why access to the toolserver would be restricted to Wikipedia
 editors.  in fact, i'd be happier giving access to a recognised academic expert
 than some random guy on Wikipedia.

The converse of this is that some recognized experts would probably
prefer to administer their own server/cluster rather than relying on
some random guy with Wikimedia DE (or wherever) to get things done.

-Robert Rohde



Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Andrea Forte
Let me know if you have a grant proposal you'd like help with!

Andrea

On Tue, Mar 10, 2009 at 4:30 PM, River Tarnell
ri...@loreley.flyingparchment.org.uk wrote:

 Andrea Forte:
 To give you an idea, for me personally, the incentive for a new resource is a
 need for a server (perhaps a cluster) to support full-text queries at a
 reasonable speed.

 then why not help us do this on the existing toolserver, so everyone can have
 access to it, instead of duplicating it yet again somewhere else?

 there are many toolserver users who would like direct access to text, and the
 ability to search it.

        - river.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Daniel Kinzler
Robert Rohde schrieb:
 On Tue, Mar 10, 2009 at 1:27 PM, River Tarnell
 ri...@loreley.flyingparchment.org.uk wrote:

 phoebe ayers:
 River: Well, you say that part of the issue with the toolserver is money and
 time... and this person that I've been talking to is offering to throw money
 and time at the problem. So, what can they constructively do?
 i think this is being discussed privately now...
 
 If other research groups are interested in contributing to this, who
 should they be talking to?

Wikimedia Germany. That is, I guess, me. Send mail to daniel dot kinzler at
wikimedia dot de. I'll forward it as appropriate.

 i don't see why access to the toolserver would be restricted to Wikipedia
 editors.  in fact, i'd be happier giving access to a recognised academic expert
 than some random guy on Wikipedia.
 
 The converse of this is that some recognized experts would probably
 prefer to administer their own server/cluster rather than relying on
 some random guy with Wikimedia DE (or wherever) to get things done.

An academic institution may also get a serious research grant for this - that
would be more complicated if the money were handled via the German chapter.
Though it's something we are, of course, also interested in.

Basically, if we could all work on making the toolserver THE ONE PLACE for
working with Wikipedia's data, that would be perfect. If, for some reason, it
makes sense to build a separate cluster, I propose to give it a distinct purpose
and profile: let it provide facilities for fulltext research, with low priority
for the update latency, and high priority of having fulltext in various forms,
with search indexes, word lists, and all the fun.

Regards,
Daniel




Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Brion Vibber
On 3/10/09 5:29 PM, Aryeh Gregor wrote:
 On Tue, Mar 10, 2009 at 7:54 PM, Platonides platoni...@gmail.com wrote:
 Is mediawiki table structure going to change?

 Yes, it changes on a regular basis.

 Moreover, any more private method for sharing the tables (eg. a trigger
 deleting the row when rev_deleted is set) would precisely lose the
 backup ability the toolserver is performing.

 I don't think the toolserver is used for backups.  At least I hope
 it's not, given its reliability (which is quite good, but quite good
 is scary for backups).

The existence of the replicas on toolserver is one of our backups. 
Obviously we want to improve our offsite backups to include complete 
offline snapshots as well. It's in progress. :)

-- brion



[Wikitech-l] research-oriented toolserver?

2009-03-09 Thread phoebe ayers
Hi all,
I'm not sure exactly where to raise this, so am asking here.

A researcher I have been in touch with has proposed starting a 2nd,
research-oriented Wikimedia toolserver. He thinks his lab can pay for
the hardware and would be willing to maintain it, if they could get
help setting it up. He got this idea after a member of his research
group tried (unsuccessfully so far -- no response) to get an account
on the current toolserver; their Wikipedia-related research has been
put on hold for a few months because of the delay. (It seems like
there is a big backlog of account requests right now and only one
person working on them?)  This research group has done some
interesting Wikipedia research to date and I expect they could do more
with access to the right data.

Personally, I think a dedicated toolserver is a great idea for the
research community, but I know very little about the technical issues
involved and/or whether this has been proposed before. Please comment,
and I can pass on replies and put the researcher in touch with the
tech team if it seems like a good idea.

-- user: first post on wikitech phoebe

-- 
* I use this address for lists; send personal messages to phoebe.ayers
at gmail.com *



Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread Andrew Garrett
On Tue, Mar 10, 2009 at 11:07 AM, phoebe ayers phoebe.w...@gmail.com wrote:
 Personally, I think a dedicated toolserver is a great idea for the
 research community, but I know very little about the technical issues
 involved and/or whether this has been proposed before. Please comment,
 and I can pass on replies and put the researcher in touch with the
 tech team if it seems like a good idea.

Currently all data, including private data, is replicated to the
toolserver. We could not do this with a third-party server.

-- 
Andrew Garrett



Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread Casey Brown
On Mon, Mar 9, 2009 at 9:33 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 . . . and this fact is also apparently a major reason for the slowness
 of new user review.  New roots can't be added to the toolserver until
 the private data is moved off, so there are too few roots to add new
 users.

Really?  We just got a new root (Werdna) and normally regular roots do
not handle new accounts anyway -- that job rests with the WMDE
contact, currently DaB, doesn't it?

-- 
Casey Brown
Cbrown1023

---
Note:  This e-mail address is used for mailing lists.  Personal emails sent to
this address will probably get lost.



Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread Andrew Garrett
On Tue, Mar 10, 2009 at 12:33 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 . . . and this fact is also apparently a major reason for the slowness
 of new user review.  New roots can't be added to the toolserver until
 the private data is moved off, so there are too few roots to add new
 users.

The bottleneck is in approval (by Wikimedia DE's representative
Daniel), not in creating their accounts.

-- 
Andrew Garrett
Sent from: Sydney Nsw Australia.


Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread Aryeh Gregor
On Mon, Mar 9, 2009 at 9:54 PM, Andrew Garrett and...@werdn.us wrote:
 The bottleneck is in approval (by Wikimedia DE's representative
 Daniel), not in creating their accounts.

Oh.  Why does a single specific person have to handle the approval of
all toolserver account requests, then?



Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread K. Peachey
 Currently all data, including private data, is replicated to the
 toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the
german chapter and not by wikimedia directly so why is private data
being replicated onto them?



Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread Andrew Garrett
On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858sn...@yahoo.com.au wrote:
 Currently all data, including private data, is replicated to the
 toolserver. We could not do this with a third-party server.
 My understanding is that the toolserver(/s) are owned by the
 german chapter and not by wikimedia directly so why is private data
 being replicated onto them?

Because it was chosen as the best technical solution. Is there a
specific problem with private data being on the toolserver? If so,
what?

You should be aware that toolserver roots are approved by the
foundation before becoming roots.

-- 
Andrew Garrett



Re: [Wikitech-l] research-oriented toolserver?

2009-03-09 Thread Robert Rohde
On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett and...@werdn.us wrote:
 On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858sn...@yahoo.com.au wrote:
 Currently all data, including private data, is replicated to the
 toolserver. We could not do this with a third-party server.
 My understanding is that the toolserver(/s) are owned by the
 german chapter and not by wikimedia directly so why is private data
 being replicated onto them?

 Because it was chosen as the best technical solution. Is there a
 specific problem with private data being on the toolserver? If so,
 what?

I'd say the added worries about security and access approval are a
problem partially bundled up with that, even if they can be worked
around.

Logistically it would be nice to have a means of providing an
exclusively public data replica for purposes such as research, though
I can certainly see how that could get technically messy.

-Robert Rohde
