Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Amir Ladsgroup
On Wed, Jan 22, 2014 at 10:31 AM, Matthew Flaschen
wrote:

> On 01/21/2014 09:47 PM, Amir Ladsgroup wrote:
>
>> One of the things I can't understand is why we are extracting summaries of
>> pages for Yahoo. Is it our job to do it? The dumps are really huge,
>> e.g. for Wikidata:
>> wikidatawiki-20140106-abstract.xml
>> <dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml>: 14.1 GB
>> Compare it to the full history:
>> wikidatawiki-20140106-pages-meta-history.xml.bz2
>> <dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2>: 8.8 GB
>>
>
> That's because the Yahoo one isn't compressed.
>
Why? Can we make it compressed? It's really annoying to see that huge file
there for (almost) no reason.


> I'm not sure if Yahoo still uses those abstracts, but I wouldn't be
> surprised at all if other people are.
>
> Matt Flaschen
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Amir
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Matthew Flaschen

On 01/21/2014 09:47 PM, Amir Ladsgroup wrote:

One of the things I can't understand is why we are extracting summaries of
pages for Yahoo. Is it our job to do it? The dumps are really huge,
e.g. for Wikidata:
wikidatawiki-20140106-abstract.xml: 14.1 GB
Compare it to the full history:
wikidatawiki-20140106-pages-meta-history.xml.bz2: 8.8 GB


That's because the Yahoo one isn't compressed.

I'm not sure if Yahoo still uses those abstracts, but I wouldn't be 
surprised at all if other people are.


Matt Flaschen


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Amir Ladsgroup
One of the things I can't understand is why we are extracting summaries of
pages for Yahoo. Is it our job to do it? The dumps are really huge,
e.g. for Wikidata:
wikidatawiki-20140106-abstract.xml: 14.1 GB
Compare it to the full history:
wikidatawiki-20140106-pages-meta-history.xml.bz2: 8.8 GB

So why are we doing this?
Best


On Wed, Jan 22, 2014 at 4:10 AM, Anthony  wrote:

> If you're going to use xz then you wouldn't even have to recompress the
> blocks that haven't changed and are already well compressed.
>
>
> On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer  wrote:
>
> > Ack, sorry for the (no subject); again in the right thread:
> >
> > > For external uses like XML dumps integrating the compression
> > > strategy into LZMA would however be very attractive. This would also
> > > benefit other users of LZMA compression like HBase.
> >
> > For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
> >
> > That has a 4 MB buffer, compression ratios within 15-25% of
> > current 7zip (or histzip), and goes at 30MB/s on my box,
> > which is still 8x faster than the status quo (going by a 1GB
> > benchmark).
> >
> > Trying to get quick-and-dirty long-range matching into LZMA isn't
> > feasible for me personally and there may be inherent technical
> > difficulties. Still, I left a note on the 7-Zip boards as folks
> > suggested; feel free to add anything there:
> > https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
> >
> > Thanks for the reply,
> > Randall
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Amir
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Anthony
If you're going to use xz then you wouldn't even have to recompress the
blocks that haven't changed and are already well compressed.
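
A minimal sketch of that property, using Python's stdlib lzma bindings (the
block contents and preset below are illustrative assumptions, not anything
from the dump pipeline): independently compressed .xz streams concatenate
into a valid multi-stream file, so blocks that are already compressed can be
copied through verbatim and only new or changed blocks need compressing.

    import lzma

    old_block = b"<revision>unchanged text</revision>\n" * 1000
    new_block = b"<revision>freshly edited text</revision>\n" * 1000

    # Pretend this was produced by an earlier run and kept on disk.
    old_xz = lzma.compress(old_block, preset=3)

    # Only the new block gets compressed this time around.
    new_xz = lzma.compress(new_block, preset=3)

    # Concatenated .xz streams form a valid multi-stream file...
    combined = old_xz + new_xz

    # ...which lzma.decompress (like the xz CLI) reads back as one payload.
    assert lzma.decompress(combined) == old_block + new_block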


On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer  wrote:

> Ack, sorry for the (no subject); again in the right thread:
>
> > For external uses like XML dumps integrating the compression
> > strategy into LZMA would however be very attractive. This would also
> > benefit other users of LZMA compression like HBase.
>
> For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
>
> That has a 4 MB buffer, compression ratios within 15-25% of
> current 7zip (or histzip), and goes at 30MB/s on my box,
> which is still 8x faster than the status quo (going by a 1GB
> benchmark).
>
> Trying to get quick-and-dirty long-range matching into LZMA isn't
> feasible for me personally and there may be inherent technical
> difficulties. Still, I left a note on the 7-Zip boards as folks
> suggested; feel free to add anything there:
> https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
>
> Thanks for the reply,
> Randall
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Announcement: Sam Smith (phuedx) joins Wikimedia as Growth Engineer in Features

2014-01-21 Thread Matthew Flaschen

On 01/21/2014 01:50 PM, Terry Chay wrote:

Hello everyone,

It’s with great pleasure that I’m announcing that Sam Smith[1] has
joined the Wikimedia Foundation as a Software Engineer in Features
Engineering. He'll be working with the Growth team.[2]


Welcome, Sam. :)  It's been great working with you so far, and I'm glad 
to have you officially on the team.


Matt Flaschen


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Wmfall] Announcement: Sam Smith (phuedx) joins Wikimedia as Growth Engineer in Features

2014-01-21 Thread Steven Walling
On Tue, Jan 21, 2014 at 1:50 PM, Terry Chay  wrote:

> Hello everyone,
>
> It’s with great pleasure that I’m announcing that Sam Smith[1] has joined
> the Wikimedia Foundation as a Software Engineer in Features Engineering.
> He'll be working with the Growth team.[2]
>
> Before joining us, Sam was previously a member of the Last.fm web team
> (web-slingers) where he helped to build the new catalogue pages, the
> Last.fm Spotify app, the new (*the only*) user on-boarding flow, and
> helped immortalise his favorite band (Maybeshewill) in most of their unit
> test cases. Before that he worked on everything from Java job schedulers,
> to Pascal windows installers, to Microsoft server sysadmining, to Zend
> Framework and symfony migrations. Ask him which was the worst—my
> money is on the PHP migration[3]. He received his master's degree in physics
> at the University of Warwick.
>
> Sam is based in London (the capital of England, not the city in Ontario or
> the settlement on the island of Kiribati). He lives in Surrey Quays with his
> wife, Lisa, and his 19-month-old son, George. His hobbies include juggling
> (he's juggled for 6 years), unicycling (one day he's going to attempt the
> distance record… one day!), climbing (specifically bouldering, he's
> actually really afraid of heights), coffee (it's not really a hobby, it's
> an obsession), and playing Lineage 1[4]. Ask him to do some unicycling up a
> boulder while drinking coffee and playing Lineage… now *that* would be
> juggling!
>
> His first official day is today, Tuesday, January 21, 2014. (What? On
> time? He signed his contract last year, so I had a lot of time to prepare.
> Having said that, I didn't start this e-mail until this morning so balance
> has been restored to the force.)
>
> Please join me in a not-belated welcome of Sam Smith to the Wikimedia
> Foundation. :-)
>

\o/ Very very glad to have you aboard Sam. And not just because you're a
fellow coffee snob. ;-)

Sam is our people. From our very first transatlantic conversation, I got a
clear sense that Sam doesn't just like solving engineering problems. He
really deeply cares about doing right by users, making sure they have a
decent (maybe even a fun and enjoyable!) experience. I'm privileged to have
someone of his experience, intelligence, and empathy for users on our team.

-- 
Steven Walling,
Product Manager
https://wikimediafoundation.org/
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Randall Farmer
Ack, sorry for the (no subject); again in the right thread:

> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.

For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.

That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30MB/s on my box,
which is still 8x faster than the status quo (going by a 1GB
benchmark).

Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/

Thanks for the reply,
Randall
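
For anyone who wants to sanity-check those numbers on their own sample of
dump XML, a rough benchmark sketch using Python's stdlib bz2 and lzma
bindings (lzma preset 3 corresponds to the ~4 MB-dictionary xz -3 level
mentioned above; exact ratios and speeds will of course vary by machine and
input):

    import bz2, lzma, sys, time

    # Usage: python bench.py sample.xml
    data = open(sys.argv[1], "rb").read()

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        secs = time.perf_counter() - start
        print(f"{name:>7}: {len(out) / len(data):6.1%} of original, "
              f"{len(data) / secs / 1e6:6.1f} MB/s")

    bench("bzip2-9", lambda d: bz2.compress(d, compresslevel=9))
    bench("xz -3",   lambda d: lzma.compress(d, preset=3))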
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] (no subject)

2014-01-21 Thread Randall Farmer
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.

For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.

That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30MB/s on my box,
which is still 8x faster than the status quo (going by a 1GB
benchmark).

Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/

Thanks for the reply,
Randall



On Tue, Jan 21, 2014 at 2:19 PM, Randall Farmer  wrote:

> > For external uses like XML dumps integrating the compression
> > strategy into LZMA would however be very attractive. This would also
> > benefit other users of LZMA compression like HBase.
>
> For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
>
> That has a 4 MB buffer, compression ratios within 15-25% of
> current 7zip (or histzip), and goes at 30MB/s on my box,
> which is still 8x faster than the status quo (going by a 1GB
> benchmark).
>
> Re: trying to get long-range matching into LZMA, first, I
> couldn't confidently hack on liblzma. Second, Igor might
> not want to do anything as niche-specific as this (but who
> knows!). Third, even with a faster matching strategy, the
> LZMA *format* seems to require some intricate stuff (range
> coding) that may be a blocker to getting the ideal speeds
> (honestly not sure).
>
> In any case, I left a note on the 7-Zip boards as folks have
> suggested: 
> https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
>
> Thanks for the reply,
> Randall
>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Announcement: Sam Smith (phuedx) joins Wikimedia as Growth Engineer in Features

2014-01-21 Thread Terry Chay
Hello everyone, 

It’s with great pleasure that I’m announcing that Sam Smith[1] has joined the 
Wikimedia Foundation as a Software Engineer in Features Engineering. He'll be
working with the Growth team.[2]

Before joining us, Sam was previously a member of the Last.fm web team 
(web-slingers) where he helped to build the new catalogue pages, the Last.fm 
Spotify app, the new (the only) user on-boarding flow, and helped immortalise 
his favorite band (Maybeshewill) in most of their unit test cases. Before that 
he worked on everything from Java job schedulers, to Pascal Windows installers,
to Microsoft server sysadmining, to Zend Framework and symfony migrations. Ask
him which was the worst—my money is on the PHP migration[3]. He
received his master's degree in physics at the University of Warwick.

Sam is based in London (the capital of England, not the city in Ontario or the 
settlement on the island of Kiribati). He lives in Surrey Quays with his wife,
Lisa, and his 19-month-old son, George. His hobbies include juggling (he's
juggled for 6 years), unicycling (one day he's going to attempt the distance 
record… one day!), climbing (specifically bouldering, he's actually really 
afraid of heights), coffee (it's not really a hobby, it's an obsession), and 
playing Lineage 1[4]. Ask him to do some unicycling up a boulder while drinking 
coffee and playing Lineage… now that would be juggling!

His first official day is today, Tuesday, January 21, 2014. (What? On time? He
signed his contract last year, so I had a lot of time to prepare. Having said 
that, I didn't start this e-mail until this morning so balance has been 
restored to the force.)

Please join me in a not-belated welcome of Sam Smith to the Wikimedia 
Foundation. :-)

Take care,

terry

P.S. Because Jared is demanding we include a picture for new staff and 
contractors, here is one[5]:


[1] https://github.com/phuedx
[2] https://www.mediawiki.org/wiki/Growth
[3] http://www.codinghorror.com/blog/2012/06/the-php-singularity.html
[4] http://en.wikipedia.org/wiki/Lineage_(video_game)
[5] http://en.wikipedia.org/wiki/Heston_Blumenthal 
https://commons.wikimedia.org/wiki/File:Heston_Blumenthal_Chef_Whites.jpg




terry chay  최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum 
of all knowledge. That's our commitment.”

p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
e: tc...@wikimedia.org
i: http://terrychay.com/
w: http://meta.wikimedia.org/wiki/User:Tychay
aim: terrychay

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Status of the new PDF Renderer

2014-01-21 Thread C. Scott Ananian
Amir, Gerard:
The easiest way to test locally at the moment is to use the standalone
'mw-ocg-bundler' and 'mw-ocg-latexer' node packages.  There are good
installation instructions in the READMEs, see:

https://npmjs.org/package/mw-ocg-bundler
https://npmjs.org/package/mw-ocg-latexer

and let me know if I need to document anything better.

This will let you pull individual articles from an arbitrary wiki, and
then typeset them with xelatex.

There is currently good support for quite a number of languages.  My
standard test case contains:
http://ar.wikipedia.org/wiki/ليونيل_ميسي
http://ar.wikipedia.org/wiki/بشير_الثاني_الشهابي
http://ar.wikipedia.org/wiki/حمزة_بن_عبد_المطلب
http://ar.wikipedia.org/wiki/إسطنبول
http://ar.wikipedia.org/wiki/الحرب_الإنجليزية_الزنجبارية
http://de.wikipedia.org/wiki/Papier
http://en.wikipedia.org/wiki/Durian
http://es.wikipedia.org/wiki/Latas_de_sopa_Campbell
http://fa.wikipedia.org/wiki/کعبه_زرتشت
http://fr.wikipedia.org/wiki/Trachylepis_atlantica
http://he.wikipedia.org/wiki/ספרטה
http://hi.wikipedia.org/wiki/रामायण
http://it.wikipedia.org/wiki/La_vita_è_meravigliosa
http://ja.wikipedia.org/wiki/熊野三山本願所
http://ja.wikipedia.org/wiki/金星の日面通過
http://ko.wikipedia.org/wiki/조화진동자
http://ml.wikipedia.org/wiki/മലയാളം
http://pl.wikipedia.org/wiki/Efekt_potwierdzenia
http://pt.wikipedia.org/wiki/Scaphyglottis
http://ru.wikipedia.org/wiki/Битва_при_Платеях
http://simple.wikipedia.org/wiki/Taoism
http://vi.wikipedia.org/wiki/Vệ_tinh_tự_nhiên_của_Sao_Thiên_Vương
http://zh.wikipedia.org/wiki/納粹德國海軍

and a few other English articles.  That said, I don't read most of
these languages, so I've mostly been trying to ensure that our output
matches the HTML displayed by the wiki.  It is quite possible I've
chosen bad-looking fonts, or that there are other details that could
be improved.  (For example, the way that Vietnamese stacked accents were
rendered was bad for a while; I've fixed that now.) Comments eagerly requested!
  --scott

ps. there are a number of minor issues with citations in RTL
languages, even in our standard HTML rendering on the wikis; it
appears that our citation templates should be more aggressive about
adding  tags or lang attributes to ensure that citations of LTR
sources in an RTL article are displayed as nicely as possible.  If
these fixes are made to the source, the latex output should inherit
them.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Wikimedia-l] Reasonator use in Wikipedias

2014-01-21 Thread Ryan Kaldari
Can you explain how such a {{Reasonator}} template would actually work? You
say that it would be a stand-in until the article was actually written, but
how would it know when the article is actually written? Is there a way to
access the target article's state via Lua?

From a community perspective, linking to external sites from body content
is normally frowned upon (on en.wiki at least), even if the link is to a
sister project. There are two main reasons for this:
1. It discourages the creation of new articles via redlinks
2. It can be confusing for readers to be sent to other sites while surfing
Wikipedia content. (This is one of the reasons why the WMF Multimedia team has
been developing the Media Viewer.)

My suggestion would be to leave the redlinks intact, but to provide a
pop-up when hovering over the redlinks (similar to Navigation pop-ups (
https://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups)). This
pop-up could provide a small set of core data (via an ajax request) and
also a link to the full Reasonator page. I would probably implement this as
a gadget first and do a few design iterations based on user feedback before
proposing it as something for readers.

Ryan Kaldari


On Tue, Jan 21, 2014 at 10:02 AM, Magnus Manske  wrote:

> On a technical note, Reasonator is pure JavaScript, so should be easily
> portable, even to a Wikipedia:Reasonator.js page (or several pages, with
> support JS).
>
> git here:
> https://bitbucket.org/magnusmanske/reasonator
>
>
> On Tue, Jan 21, 2014 at 5:13 PM, Ryan Lane  wrote:
>
> > On Tue, Jan 21, 2014 at 7:17 AM, Gerard Meijssen
> > wrote:
> >
> > > Hoi,
> > >
> > > At this moment Wikipedia "red links" provide no information whatsoever.
> > > This is not cool.
> > >
> > > In Wikidata we often have labels for the missing (=red link) articles.
> We
> > > can and do provide information from Wikidata in a reasonable way that
> is
> > > informative in the "Reasonator". We also provide additional search
> > > information on many Wikipedias.
> > >
> > > In the Reasonator we have now implemented "red lines" [1]. They
> indicate
> > > when a label does not exist in the primary language that is in use.
> > >
> > > What we are considering is creating a template {{Reasonator}} that will
> > > present information based on what is available in Wikidata. Such a
> > template
> > > would be a stand in until an article is actually written. What we would
> > > provide is information that is presented in the same way as we provide
> it
> > > as this moment in time [2]
> > >
> > > This may open up a box of worms; Reasonator is NOT using any caching.
> > There
> > > may be lots of other reasons why you might think this proposal is evil.
> > All
> > > the evil that is technical has some merit but, you have to consider
> that
> > > the other side of the equation is that we are not "sharing in the sum
> of
> > > all knowledge" even when we have much of the missing requested
> > information
> > > available to us.
> > >
> > > One saving (technical) grace, Reasonator loads round about as quickly
> as
> > > WIkidata does.
> > >
> > > As this is advance warning, I hope that you can help with the issues
> that
> > > will come about. I hope that you will consider the impact this will
> have
> > on
> > > our traffic and measure to what extend it grows our data.
> > >
> > > The Reasonator pages will not show up prettily on mobile phones .. so
> > does
> > > Wikidata by the way. It does not consider Wikipedia zero. There may be
> > more
> > > issues that may require attention. But again, it beats not serving the
> > > information that we have to those that are requesting it.
> > >
> >
> > I have a strong feeling you're going to bring labs to its knees.
> >
> > Sending editors to labs is one thing, but you're proposing sending
> readers
> > to labs, to a service that isn't cached.
> >
> > If reasonator is something we want to support for something like this,
> > maybe we should consider turning it into a production service?
> >
> > - Ryan
> > ___
> > Wikimedia-l mailing list
> > wikimedi...@lists.wikimedia.org
> > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > 
> >
>
>
>
> --
> undefined
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Wikimedia-l] Reasonator use in Wikipedias

2014-01-21 Thread Magnus Manske
On a technical note, Reasonator is pure JavaScript, so should be easily
portable, even to a Wikipedia:Reasonator.js page (or several pages, with
support JS).

git here:
https://bitbucket.org/magnusmanske/reasonator


On Tue, Jan 21, 2014 at 5:13 PM, Ryan Lane  wrote:

> On Tue, Jan 21, 2014 at 7:17 AM, Gerard Meijssen
> wrote:
>
> > Hoi,
> >
> > At this moment Wikipedia "red links" provide no information whatsoever.
> > This is not cool.
> >
> > In Wikidata we often have labels for the missing (=red link) articles. We
> > can and do provide information from Wikidata in a reasonable way that is
> > informative in the "Reasonator". We also provide additional search
> > information on many Wikipedias.
> >
> > In the Reasonator we have now implemented "red lines" [1]. They indicate
> > when a label does not exist in the primary language that is in use.
> >
> > What we are considering is creating a template {{Reasonator}} that will
> > present information based on what is available in Wikidata. Such a
> template
> > would be a stand in until an article is actually written. What we would
> > provide is information that is presented in the same way as we provide it
> > as this moment in time [2]
> >
> > This may open up a box of worms; Reasonator is NOT using any caching.
> There
> > may be lots of other reasons why you might think this proposal is evil.
> All
> > the evil that is technical has some merit but, you have to consider that
> > the other side of the equation is that we are not "sharing in the sum of
> > all knowledge" even when we have much of the missing requested
> information
> > available to us.
> >
> > One saving (technical) grace, Reasonator loads round about as quickly as
> > WIkidata does.
> >
> > As this is advance warning, I hope that you can help with the issues that
> > will come about. I hope that you will consider the impact this will have
> on
> > our traffic and measure to what extend it grows our data.
> >
> > The Reasonator pages will not show up prettily on mobile phones .. so
> does
> > Wikidata by the way. It does not consider Wikipedia zero. There may be
> more
> > issues that may require attention. But again, it beats not serving the
> > information that we have to those that are requesting it.
> >
>
> I have a strong feeling you're going to bring labs to its knees.
>
> Sending editors to labs is one thing, but you're proposing sending readers
> to labs, to a service that isn't cached.
>
> If reasonator is something we want to support for something like this,
> maybe we should consider turning it into a production service?
>
> - Ryan
> ___
> Wikimedia-l mailing list
> wikimedi...@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> 
>



-- 
undefined
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Gabriel Wicke
On 01/21/2014 01:23 AM, Randall Farmer wrote:
> Anyway, I'm saying too many fundamentally unimportant words. If the status
> quo re: compression in fact causes enough pain to give histzip a fuller
> look, or if there's some way to redirect the tech in it towards a useful
> end, it would be great to hear from interested folks; if not, it was fun
> work but there may not be much more to do or say.

Efficient compression with large match windows is very interesting for
storing history in databases like Cassandra as well. When storing a
wikitext dump in Cassandra, gzip with its 32k sliding window yields a db
size of about 16-18% of the input text size. This could be much better
if repetitions larger than 32k could be caught. With more verbose HTML
this is even more important, as more articles will be larger than 32k.

For internal uses tool support is not very important, so a port of
histzip / rzip could work well. For external uses like XML dumps
integrating the compression strategy into LZMA would however be very
attractive. This would also benefit other users of LZMA compression like
HBase.

Gabriel
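
To make the sliding-window effect concrete, a tiny sketch with Python's
stdlib zlib and lzma bindings (the 64 KiB block size is an arbitrary choice
for illustration): an exact repeat that sits further back than gzip's
32 KiB window is invisible to it, while LZMA's multi-megabyte dictionary
still catches it.

    import lzma, os, zlib

    page = os.urandom(64 * 1024)   # one incompressible 64 KiB "revision"
    history = page + page          # an identical copy, 64 KiB further back

    print("zlib:", len(zlib.compress(history, 9)))         # ~128 KiB: match missed
    print("lzma:", len(lzma.compress(history, preset=3)))  # ~64 KiB: match found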

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Reasonator use in Wikipedias

2014-01-21 Thread Ryan Lane
On Tue, Jan 21, 2014 at 7:17 AM, Gerard Meijssen
wrote:

> Hoi,
>
> At this moment Wikipedia "red links" provide no information whatsoever.
> This is not cool.
>
> In Wikidata we often have labels for the missing (=red link) articles. We
> can and do provide information from Wikidata in a reasonable way that is
> informative in the "Reasonator". We also provide additional search
> information on many Wikipedias.
>
> In the Reasonator we have now implemented "red lines" [1]. They indicate
> when a label does not exist in the primary language that is in use.
>
> What we are considering is creating a template {{Reasonator}} that will
> present information based on what is available in Wikidata. Such a template
> would be a stand in until an article is actually written. What we would
> provide is information that is presented in the same way as we provide it
> as this moment in time [2]
>
> This may open up a box of worms; Reasonator is NOT using any caching. There
> may be lots of other reasons why you might think this proposal is evil. All
> the evil that is technical has some merit but, you have to consider that
> the other side of the equation is that we are not "sharing in the sum of
> all knowledge" even when we have much of the missing requested information
> available to us.
>
> One saving (technical) grace, Reasonator loads round about as quickly as
> WIkidata does.
>
> As this is advance warning, I hope that you can help with the issues that
> will come about. I hope that you will consider the impact this will have on
> our traffic and measure to what extend it grows our data.
>
> The Reasonator pages will not show up prettily on mobile phones .. so does
> Wikidata by the way. It does not consider Wikipedia zero. There may be more
> issues that may require attention. But again, it beats not serving the
> information that we have to those that are requesting it.
>

I have a strong feeling you're going to bring labs to its knees.

Sending editors to labs is one thing, but you're proposing sending readers
to labs, to a service that isn't cached.

If reasonator is something we want to support for something like this,
maybe we should consider turning it into a production service?

- Ryan
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Wikimedia-l] Reasonator use in Wikipedias

2014-01-21 Thread Magnus Manske
Note: the [2] link goes to the test site. The correct link is:
tools.wmflabs.org/reasonator/?lang=oc&q=35610


On Tue, Jan 21, 2014 at 3:17 PM, Gerard Meijssen
wrote:

> Hoi,
>
> At this moment Wikipedia "red links" provide no information whatsoever.
> This is not cool.
>
> In Wikidata we often have labels for the missing (=red link) articles. We
> can and do provide information from Wikidata in a reasonable way that is
> informative in the "Reasonator". We also provide additional search
> information on many Wikipedias.
>
> In the Reasonator we have now implemented "red lines" [1]. They indicate
> when a label does not exist in the primary language that is in use.
>
> What we are considering is creating a template {{Reasonator}} that will
> present information based on what is available in Wikidata. Such a template
> would be a stand in until an article is actually written. What we would
> provide is information that is presented in the same way as we provide it
> as this moment in time [2]
>
> This may open up a box of worms; Reasonator is NOT using any caching. There
> may be lots of other reasons why you might think this proposal is evil. All
> the evil that is technical has some merit but, you have to consider that
> the other side of the equation is that we are not "sharing in the sum of
> all knowledge" even when we have much of the missing requested information
> available to us.
>
> One saving (technical) grace, Reasonator loads round about as quickly as
> WIkidata does.
>
> As this is advance warning, I hope that you can help with the issues that
> will come about. I hope that you will consider the impact this will have on
> our traffic and measure to what extend it grows our data.
>
> The Reasonator pages will not show up prettily on mobile phones .. so does
> Wikidata by the way. It does not consider Wikipedia zero. There may be more
> issues that may require attention. But again, it beats not serving the
> information that we have to those that are requesting it.
> Thanks,
>   GerardM
>
>
> [1]
>
> http://ultimategerardm.blogspot.nl/2014/01/reasonator-is-red-lining-your-data.html
> [2] http://tools.wmflabs.org/reasonator/test/?lang=oc&q=35610
> ___
> Wikimedia-l mailing list
> wikimedi...@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> 




-- 
undefined
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] How to collaborate when writing OAuth applications?

2014-01-21 Thread Chris Steipp
Yeah, it's not possible to drop it yourself yet. Let me, or any OAuth admin
(stewards), know that you want it dropped, and we can reject it.
On Jan 21, 2014 6:31 AM, "Dan Andreescu"  wrote:

> >
> > Another question is: i would like to drop my first "test-app"
> > consumer. How can I do it?
> >
>
> I'm not sure, but I would like to drop a consumer as well.  Last time I
> asked it was not yet possible.
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Reasonator use in Wikipedias

2014-01-21 Thread Gerard Meijssen
Hoi,

At this moment Wikipedia "red links" provide no information whatsoever.
This is not cool.

In Wikidata we often have labels for the missing (=red link) articles. We
can and do provide information from Wikidata in a reasonable way that is
informative in the "Reasonator". We also provide additional search
information on many Wikipedias.

In the Reasonator we have now implemented "red lines" [1]. They indicate
when a label does not exist in the primary language that is in use.

What we are considering is creating a template {{Reasonator}} that will
present information based on what is available in Wikidata. Such a template
would be a stand-in until an article is actually written. What we would
provide is information that is presented in the same way as we provide it
at this moment in time [2].

This may open up a box of worms; Reasonator is NOT using any caching. There
may be lots of other reasons why you might think this proposal is evil. All
the evil that is technical has some merit but, you have to consider that
the other side of the equation is that we are not "sharing in the sum of
all knowledge" even when we have much of the missing requested information
available to us.

One saving (technical) grace: Reasonator loads round about as quickly as
Wikidata does.

As this is advance warning, I hope that you can help with the issues that
will come about. I hope that you will consider the impact this will have on
our traffic and measure to what extent it grows our data.

The Reasonator pages will not show up prettily on mobile phones .. neither does
Wikidata, by the way. It does not consider Wikipedia Zero. There may be more
issues that may require attention. But again, it beats not serving the
information that we have to those that are requesting it.
Thanks,
  GerardM


[1]
http://ultimategerardm.blogspot.nl/2014/01/reasonator-is-red-lining-your-data.html
[2] http://tools.wmflabs.org/reasonator/test/?lang=oc&q=35610
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] How to collaborate when writing OAuth applications?

2014-01-21 Thread Dan Andreescu
>
> Another question is: i would like to drop my first "test-app"
> consumer. How can I do it?
>

I'm not sure, but I would like to drop a consumer as well.  Last time I
asked it was not yet possible.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] How to collaborate when writing OAuth applications?

2014-01-21 Thread Cristian Consonni
2014/1/21 Dan Andreescu :
> Hi Cristian,
>
> We did basically the same thing with Wikimetrics.  For development, we just
> registered a consumer that redirects to localhost:5000 and committed the
> consumer key and secret.  Here's the relevant config file:
>
> https://git.wikimedia.org/blob/analytics%2Fwikimetrics.git/f0aa046c401f0726ac93dc6a5424cd4a33ef86f1/wikimetrics%2Fconfig%2Fweb_config.yaml
>
> Let me know if you have any questions.  The "production" configuration is
> just hidden and kept secret on the wikimetrics server.

This makes a lot of sense; I will register a wtosm-test consumer and
configure it to point to localhost.

Another question is: I would like to drop my first "test-app"
consumer. How can I do it?

Thank you.

Cristian

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Wikimedia engineering report, December 2013

2014-01-21 Thread Guillaume Paumier
On Tue, Jan 21, 2014 at 12:07 PM, Guillaume Paumier
 wrote:
>
> The report covering Wikimedia engineering activities in November 2013 is now 
> available.

s/November/December, obviously :)

-- 
Guillaume Paumier

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Wikimedia engineering report, December 2013

2014-01-21 Thread Guillaume Paumier
Hi,

The report covering Wikimedia engineering activities in November 2013 is
now available.

Wiki version:
https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/December
Blog version:
https://blog.wikimedia.org/2014/01/21/engineering-report-december-2013/

We're also proposing a shorter, simpler and translatable version of this
report that does not assume specialized technical knowledge:
https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/December/summary

Below is the HTML text of the report.

As always, feedback is appreciated on the usefulness of the report and its
summary, and on how to improve them.

--

Major news in December include:

   - a retrospective on Language Engineering events, including the language
   summit in Pune, India;
   - the launch of a draft feature on the English Wikipedia, to provide a
   gentler start for Wikipedia articles.

*Note: We’re also providing a shorter, simpler and translatable version of
this report that does not assume specialized technical knowledge.*

Personnel

Work with us

Are you looking to work for Wikimedia? We have a lot of hiring coming up,
and we really love talking to active community members about these roles.

   - VP of Engineering
   - Software Engineer – Growth
   - Software Engineer – VisualEditor (Features)
   - Software Engineer – Language Engineering
   - Software Engineer
   - QA Automation Engineer
   - Full Stack Developer – Analytics
   - Analytics – Product Manager
   - Operations Engineer
   - Sr. Operations Engineer
   - Operations Security Engineer
   - Graphic Design Interns – Paid

Announcements

   - Sherah Smith joined the Wikimedia Foundation as Fundraising Engineer
   in Features Engineering (announcement).
   - Kunal Mehta joined the Wikimedia Foundation as a contractor in
   Features Engineering (announcement).
   - Andrew Green joined the Wikimedia Foundation as a contractor in
   Features Engineering, working on the Education Program (announcement).

 Technical Operations

*Datacenter RFP*
As part of our ongoing work in selecting a new location for our next
datacenter, members of the team traveled to several candidate locations
throughout the US to tour facilities, meet facility staff, and otherwise
continue the selection process. Following this process, we have been able
to shortlist our bid proposals, and have begun the final selection process.
We hope to complete bid selection and legal review in January.

Work continues on migrating our remaining services to our Ashburn
datacenter. Consolidation and migration of databases, fundraising
infrastructure, Labs, as well as progress on updating the configuration
(puppetization) of several miscellaneous services, was accomplished in
December. Additionally, “triage” of the hardware within the facility was
performed, with an eye towards what will be delivered to Ashburn, what will
end up in our new facility, and what will be decommissioned.

*Wikimedia Labs*
Andrew Bogott purged empty projects and stale instances, resulting in
more accurate usage statistics for Labs:

   - Number of projects: 140
   - Number of instances: 403
   - Amount of RAM in use (in MBs): 1,592,832
   - Amount of allocated storage (in GBs): 21,525
   - Number of virtual CPUs in use: 797
   - Number of users: 2,425

Tool Labs saw a bump in usage as the winter holidays provided an
opportunity for volunteers to migrate tools from the Toolserver and work on
new projects; there are now 531 tools managed by 435 users, ranging from
simple database queries to elaborate editing adjuncts using the ne

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Randall Farmer
>
> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools.


Briefly, yes, CPU-hours don't cost too much, but I don't think the
potential win is limited to the direct CPU-hours saved.
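
For scale, the figures from the original post quoted below (about 10 TB
unpacked, a few MB/s per core for 7zip, 32 cores) pencil out to roughly a
day; a quick check, assuming 3 MB/s per core:

    unpacked_bytes = 10e12            # ~10 TB of enwiki full history
    throughput = 3e6 * 32             # assumed 3 MB/s/core across 32 cores
    print(unpacked_bytes / throughput / 3600, "hours")  # ~29 hours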

In more detail: For Wikimedia a quicker-running task is probably easier to
manage, maybe less likely to fail and thus need human attention; dump users
get more up-to-date content if dump processing's quicker; users who get
histzip also get a tool they can (for example) use to quickly pack a
modified XML file through in a pipeline. It's a relatively small
(500-line), hackable tool and could serve as a base for later work: for
instance, I've tried to rig the format so future compressors make
backwards-compatible archives they can insert into without recompressing
all the TBs of input. There are pages on meta going a few years back about
ideas for improving compression speed, and there were past format changes
for operational reasons (chunking full-history dumps) and other
dump-related proposals in Wikimedia-land (a project this past summer about
a new dump tool), so I don't think I'm entirely swatting at gnats by trying
to work up another possible tool.

I'm talking about keeping at least one of the current, widely supported
formats around, which I think would limit hardship for existing users. I'm
sort of curious how many full-history-dump users there are and if they have
anything to say. You mentioned porting; histzip is a Go program that's easy
to cross-compile for different OSes/architectures (as I have for
Windows/Mac/Linux on the github page, though not various BSDs).

> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats.


7-Zip is definitely a very cool and flexible program. I think it can
actually run faster than it does in the current dumps setup: -mx=3
maintains ratios better than bzip's, but runs faster than bzip. That's a
few times slower than histzip|bzip and slightly larger output, but it's a
boost from the status quo. (There's an argument for maintaining that, not
bzip, as the widely-supported format, which I'd mentioned in the
xmldatadumps-l branch of this thread, or for just changing the 7z settings
and calling it a day)

Interesting to hear from Nemo that Pavlov was interested in long-range
zipping. histzip doesn't have source he could drop into his C program (it's
in Go) and it's really aimed at a narrow niche (long repetitions at a
certain distance) so I doubt I could get it integrated there.

Anyway, I'm saying too many fundamentally unimportant words. If the status
quo re: compression in fact causes enough pain to give histzip a fuller
look, or if there's some way to redirect the tech in it towards a useful
end, it would be great to hear from interested folks; if not, it was fun
work but there may not be much more to do or say.


On Mon, Jan 20, 2014 at 4:49 PM, Bjoern Hoehrmann  wrote:

> * Randall Farmer wrote:
> >As I understand, compressing full-history dumps for English Wikipedia and
> >other big wikis takes a lot of resources: enwiki history is about 10TB
> >unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
> >over a day of server time. There's been talk about ways to speed that up
> in
> >the past.[1]
>
> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools. There
> might be hundreds of downstream users and if every one of them has to
> spend a couple of minutes adopting to a new format, that can quickly
> outweigh any savings, as a simple example.
>
> >Technical datadaump aside: *How could I get this more thoroughly tested,
> >then maybe added to the dump process, perhaps with an eye to eventually
> >replacing for 7zip as the alternate, non-bzip2 compressor?* Who do I talk
> >to to get started? (I'd dealt with Ariel Glenn before, but haven't seen
> >activity from Ariel lately, and in any case maybe playing with a new tool
> >falls under Labs or some other heading than dumps devops.) Am I nuts to be
> >even asking about this? Are there things that would definitely need to
> >change for integration to be possible? Basically, I'm trying to get this
> >from a tech demo to something with real-world utility.
>
> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats. Igor Pavlov tends to be quite responsive through the SF.net
> bug tracker. In any case, they might be able to give directions how this
> might become, or