Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Neil Patel Quinn
Also, for the curious, the request for dedicated HTML dumps is tracked in
this Phabricator task: https://phabricator.wikimedia.org/T182351

On Thu, 3 May 2018 at 15:19, Bartosz Dziewoński  wrote:

> On 2018-05-03 20:54, Aidan Hogan wrote:
> > I am wondering what is the fastest/best way to get a local dump of
> > English Wikipedia in HTML? We are looking just for the current versions
> > (no edit history) of articles for the purposes of a research project.
>
> The Kiwix project provides HTML dumps of Wikipedia for offline reading:
> http://www.kiwix.org/downloads/
>
> Their downloads use the ZIM file format; it looks like there are libraries
> available for reading it in many programming languages:
> http://www.openzim.org/wiki/Readers
>
> --
> Bartosz Dziewoński
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Neil Patel Quinn 
(he/him/his)
product analyst, Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Bartosz Dziewoński

On 2018-05-03 20:54, Aidan Hogan wrote:
I am wondering what is the fastest/best way to get a local dump of 
English Wikipedia in HTML? We are looking just for the current versions 
(no edit history) of articles for the purposes of a research project.


The Kiwix project provides HTML dumps of Wikipedia for offline reading: 
http://www.kiwix.org/downloads/


Their downloads use the ZIM file format; it looks like there are libraries 
available for reading it in many programming languages: 
http://www.openzim.org/wiki/Readers
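
If you go the ZIM route, a minimal Python sketch using the python-libzim
bindings might look like the following. The file name is a placeholder, and
the module/class names (libzim.reader.Archive, get_entry_by_path, get_item)
are assumptions based on current python-libzim releases, so check the
readers page above for the library and version you actually use.

# Sketch: pull one article's HTML out of a Kiwix ZIM file.
from libzim.reader import Archive

zim = Archive("wikipedia_en_all_nopic.zim")   # placeholder file name
print("entry count:", zim.entry_count)

# Note: older ZIM files prefix article paths with the "A/" namespace.
entry = zim.get_entry_by_path("Chile")
html = bytes(entry.get_item().content).decode("utf-8")
print(html[:200])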


--
Bartosz Dziewoński

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Neil Patel Quinn
Hey Aidan!

I would suggest checking out RESTBase (
https://www.mediawiki.org/wiki/RESTBase), which offers an API for
retrieving HTML versions of Wikipedia pages. It's maintained by the
Wikimedia Foundation and used by a number of production Wikimedia services,
so you can rely on it.

I don't believe there are any prepared dumps of this HTML, but you should
be able to iterate through the RESTBase API, as long as you follow the
rules (from https://en.wikipedia.org/api/rest_v1/); a rough fetch-loop
sketch follows the list:

   - Limit your clients to no more than 200 requests/s to this API. Each
   API endpoint's documentation may detail more specific usage limits.
   - Set a unique User-Agent or Api-User-Agent header that allows us to
   contact you quickly. Email addresses or URLs of contact pages work well.

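For what it's worth, a minimal Python sketch of such a fetch loop might look
roughly like this (the User-Agent string, contact address, title list, and
output file names are placeholders, not a blessed client):

import time
import requests

# Identify yourself per the API rules; the address here is a placeholder.
HEADERS = {"User-Agent": "wiki-html-research/0.1 (researcher@example.org)"}
BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"

titles = ["Chile", "Santiago", "Atacama Desert"]  # in practice, iterate a full title list

for title in titles:
    resp = requests.get(BASE + requests.utils.quote(title, safe=""), headers=HEADERS)
    resp.raise_for_status()
    with open(title.replace("/", "_") + ".html", "w", encoding="utf-8") as f:
        f.write(resp.text)
    time.sleep(0.05)  # ~20 requests/s from this client, well under the 200/s cap
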


On Thu, 3 May 2018 at 14:26, Aidan Hogan  wrote:

> Hi Fae,
>
> On 03-05-2018 16:18, Fæ wrote:
> > On 3 May 2018 at 19:54, Aidan Hogan  wrote:
> >> Hi all,
> >>
> >> I am wondering what is the fastest/best way to get a local dump of
> English
> >> Wikipedia in HTML? We are looking just for the current versions (no edit
> >> history) of articles for the purposes of a research project.
> >>
> >> We have been exploring using bliki [1] to do the conversion of the
> source
> >> markup in the Wikipedia dumps to HTML, but the latest version seems to
> take
> >> on average several seconds per article (including after the most common
> >> templates have been downloaded and stored locally). This means it would
> take
> >> several months to convert the dump.
> >>
> >> We also considered using Nutch to crawl Wikipedia, but with a reasonable
> >> crawl delay (5 seconds) it would take several months to get a copy of every
> >> article in HTML (or at least the "reachable" ones).
> >>
> >> Hence we are a bit stuck right now and not sure how to proceed. Any
> help,
> >> pointers or advice would be greatly appreciated!!
> >>
> >> Best,
> >> Aidan
> >>
> >> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
> >
> > Just in case you have not thought of it, how about taking the XML dump
> > and converting it to the format you are looking for?
> >
> > Ref
> https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
> >
>
> Thanks for the pointer! We are currently attempting to do something like
> that with bliki. The issue is that we are interested in the
> semi-structured HTML elements (like lists, tables, etc.) which are often
> generated through external templates with complex structures. Often from
> the invocation of a template in an article, we cannot even tell if it
> will generate a table, a list, a box, etc. E.g., it might say "Weather
> box" in the markup, which gets converted to a table.
>
> Although bliki can help us to interpret and expand those templates, each
> page takes quite a long time to process, meaning months of computation time to get the
> semi-structured data we want from the dump. Due to these templates, we
> have not had much success yet with this route of taking the XML dump and
> converting it to HTML (or even parsing it directly); hence we're still
> looking for other options. :)
>
> Cheers,
> Aidan
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Neil Patel Quinn 
(he/him/his)
product analyst, Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Aidan Hogan

Hi Fae,

On 03-05-2018 16:18, Fæ wrote:

On 3 May 2018 at 19:54, Aidan Hogan  wrote:

Hi all,

I am wondering what is the fastest/best way to get a local dump of English
Wikipedia in HTML? We are looking just for the current versions (no edit
history) of articles for the purposes of a research project.

We have been exploring using bliki [1] to do the conversion of the source
markup in the Wikipedia dumps to HTML, but the latest version seems to take
on average several seconds per article (including after the most common
templates have been downloaded and stored locally). This means it would take
several months to convert the dump.

We also considered using Nutch to crawl Wikipedia, but with a reasonable
crawl delay (5 seconds) it would take several months to get a copy of every
article in HTML (or at least the "reachable" ones).

Hence we are a bit stuck right now and not sure how to proceed. Any help,
pointers or advice would be greatly appreciated!!

Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home


Just in case you have not thought of it, how about taking the XML dump
and converting it to the format you are looking for?

Ref 
https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia



Thanks for the pointer! We are currently attempting to do something like 
that with bliki. The issue is that we are interested in the 
semi-structured HTML elements (like lists, tables, etc.) which are often 
generated through external templates with complex structures. Often from 
the invocation of a template in an article, we cannot even tell if it 
will generate a table, a list, a box, etc. E.g., it might say "Weather 
box" in the markup, which gets converted to a table.


Although bliki can help us to interpret and expand those templates, each 
page takes quite a long time to process, meaning months of computation time to get the 
semi-structured data we want from the dump. Due to these templates, we 
have not had much success yet with this route of taking the XML dump and 
converting it to HTML (or even parsing it directly); hence we're still 
looking for other options. :)


Cheers,
Aidan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] GSoC 2018 Introduction: Hagar Shilo

2018-05-03 Thread Eran Rosenthal
Good luck / בהצלחה!


On Thu, May 3, 2018 at 7:39 PM, Amir E. Aharoni <
amir.ahar...@mail.huji.ac.il> wrote:

> Welcome / ברוכה הבאה!
>
> On Thu, 3 May 2018 at 19:27, Hagar Shilo <hagarshi...@mail.tau.ac.il> wrote:
>
> > Hi All,
> >
> > My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv
> > University, Israel.
> >
> > This summer I will be working on a user search menu and user filters for
> > Wikipedia's "Recent changes" section. Here is the workplan:
> > https://phabricator.wikimedia.org/T190714
> >
> > My mentors are Moriel and Roan.
> >
> > I am looking forward to becoming a Wikimedia developer and an open source
> > contributor.
> >
> > Cheers,
> > Hagar
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Fæ
On 3 May 2018 at 19:54, Aidan Hogan  wrote:
> Hi all,
>
> I am wondering what is the fastest/best way to get a local dump of English
> Wikipedia in HTML? We are looking just for the current versions (no edit
> history) of articles for the purposes of a research project.
>
> We have been exploring using bliki [1] to do the conversion of the source
> markup in the Wikipedia dumps to HTML, but the latest version seems to take
> on average several seconds per article (including after the most common
> templates have been downloaded and stored locally). This means it would take
> several months to convert the dump.
>
> We also considered using Nutch to crawl Wikipedia, but with a reasonable
> crawl delay (5 seconds) it would take several months to get a copy of every
> article in HTML (or at least the "reachable" ones).
>
> Hence we are a bit stuck right now and not sure how to proceed. Any help,
> pointers or advice would be greatly appreciated!!
>
> Best,
> Aidan
>
> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Just in case you have not thought of it, how about taking the XML dump
and converting it to the format you are looking for?

Ref 
https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
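
For scale: streaming titles and current wikitext out of the pages-articles
dump is cheap; it's the wikitext-to-HTML conversion that is slow. A minimal
Python sketch of the streaming part (the dump file name and the namespace
URI are placeholders to check against the dump you download):

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version varies by dump

with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            # hand title/text to whatever wikitext-to-HTML converter you use
            elem.clear()  # drop the parsed subtree to keep memory use low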

Fae
-- 
fae...@gmail.com https://commons.wikimedia.org/wiki/User:Fae

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Thank you for participating in the global Wikimedia survey!

2018-05-03 Thread Edward Galvez
Hello everyone,

I would like to share my deepest gratitude for everyone who responded to
the Wikimedia Communities and Contributors Survey. The survey has already
closed for this year. The quality of the results has improved because more
people responded. We heard from over 200 people who work in volunteer
developer spaces like Phabricator, IRC, MediaWiki, mailing lists, and many
others, which is a solid increase from last year.

We are already analyzing the data and hope to have something published on
Meta in a couple of months. Be sure to watch Community Engagement Insights
for when we publish the reports. We will also message those individuals who
signed up on the thank-you page or sent us an email to receive updates
about the report. Feel free to reach out to me directly at egalvez[at]
wikimedia.org or on my talk page on Meta.

Thank you again to everyone for sharing your opinions with us!


-- 
Edward Galvez
Evaluation Strategist, Surveys
Learning & Evaluation
Community Engagement
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Aidan Hogan

Hi all,

I am wondering what is the fastest/best way to get a local dump of 
English Wikipedia in HTML? We are looking just for the current versions 
(no edit history) of articles for the purposes of a research project.


We have been exploring using bliki [1] to do the conversion of the 
source markup in the Wikipedia dumps to HTML, but the latest version 
seems to take on average several seconds per article (including after 
the most common templates have been downloaded and stored locally). This 
means it would take several months to convert the dump.


We also considered using Nutch to crawl Wikipedia, but with a reasonable 
crawl delay (5 seconds) it would take several months to get a copy of every 
article in HTML (or at least the "reachable" ones).


Hence we are a bit stuck right now and not sure how to proceed. Any 
help, pointers or advice would be greatly appreciated!!


Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] GSoC 2018 Introduction: Hagar Shilo

2018-05-03 Thread Amir E. Aharoni
Welcome / ברוכה הבאה!

On Thu, 3 May 2018 at 19:27, Hagar Shilo <hagarshi...@mail.tau.ac.il> wrote:

> Hi All,
>
> My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv
> University, Israel.
>
> This summer I will be working on a user search menu and user filters for
> Wikipedia's "Recent changes" section. Here is the workplan:
> https://phabricator.wikimedia.org/T190714
>
> My mentors are Moriel and Roan.
>
> I am looking forward to becoming a Wikimedia developer and an open source
> contributor.
>
> Cheers,
> Hagar
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] GSoC 2018 Introduction: Hagar Shilo

2018-05-03 Thread Hagar Shilo
Hi All,

My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv
University, Israel.

This summer I will be working on a user search menu and user filters for
Wikipedia's "Recent changes" section. Here is the workplan:
https://phabricator.wikimedia.org/T190714

My mentors are Moriel and Roan.

I am looking forward to becoming a Wikimedia developer and an open source
contributor.

Cheers,
Hagar
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Scribunto localization/code review request

2018-05-03 Thread Martin Urbanec
Hello all,

Can somebody with +2 on Scribunto please review
https://gerrit.wikimedia.org/r/403603 for
https://phabricator.wikimedia.org/T184512?

Thanks!

Martin
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] [MediaWiki-announce] MediaWiki 1.31.0-rc.0 now available

2018-05-03 Thread Chad Horohoe
Hi,

I'd like to announce the immediate availability of MediaWiki 1.31.0-rc.0,
the first release candidate for 1.31.x. Links at the end of the e-mail. The
tag has been signed and pushed to Git.

This is not a final release and should not be used for production websites.
There are several major outstanding bugs on the workboard. As always, please
do try out the release candidate in a test environment. It's how we find
bugs that didn't surface in initial development :)

Full release notes:
https://phabricator.wikimedia.org/diffusion/MW/browse/REL1_31/RELEASE-NOTES-1.31
https://www.mediawiki.org/wiki/Release_notes/1.31

**
Download:
https://releases.wikimedia.org/mediawiki/1.31/mediawiki-1.31.0-rc.0.tar.gz

Core only, no extensions:
https://releases.wikimedia.org/mediawiki/1.31/mediawiki-core-1.31.0-rc.0.tar.gz

GPG signatures:
https://releases.wikimedia.org/mediawiki/1.31/mediawiki-core-1.31.0-rc.0.tar.gz.sig
https://releases.wikimedia.org/mediawiki/1.31/mediawiki-1.31.0-rc.0.tar.gz.sig

Public keys:
https://www.mediawiki.org/keys/keys.html

--
Chad Horohoe
___
MediaWiki announcements mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Wikimedia-l] Τι σας κάνει ευτυχισμένη αυτήν την εβδομάδα? / What's making you happy this week? (Week of 22 April 2018)

2018-05-03 Thread Pine W
Hi Oleg.

Thanks for the suggestion. I have considered requesting that Wikimedia
volunteers translate "What's making you happy this week?" into their
languages. However, because this initiative appears to have a relatively
modest level of community support, and because volunteer translators' time
is so valuable, I am reluctant to request translators' time. If I continue
to be involved with this initiative for a few more months, or if I make a
particularly bad mistake in translation, then I may change my mind. (:

Regards,

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

On Mon, Apr 30, 2018 at 11:36 AM, Saint Johann  wrote:

> Or you could’ve just give a link somewhere with those messages to a page
> where someone could drop you a translation in their language ;-)
>
> Oleg
>
>
>
> On 30/04/2018 21:05, Pine W wrote:
>
>> Thank you for that information, Alexandros. :) I rely extensively on
>> machine translation, and online resources like Wiktionary, when I do
>> translations for these weekly threads in most languages. There will
>> probably be more linguistic errors in the future. Comments regarding
>> translations would be appreciated.
>>
>> Pine
>> ( https://meta.wikimedia.org/wiki/User:Pine )
>>
>> On Mon, Apr 30, 2018 at 2:57 AM, Alexandros Kosiaris <
>> akosia...@wikimedia.org> wrote:
>>
>> Aside from the actual content (it's nice to see the legal case in
>>> Greece ending), seeing the subject in Greek was one more reason I
>>> became happy. But I think a small correction is in place. A more
>>> appropriate way of saying "What's making you happy this week?" would
>>> be "Τι σας κάνει ευτυχείς αυτήν την εβδομάδα;", where "ευτυχείς" (an
>>> adjective) would be the plural form of happy, which is used both when
>>> addressing groups of people and when being polite. Alternatively
>>> "ευτυχισμένους", could be used, with exact same meaning, just using
>>> the participle form (and in the appropriate conjugation) instead of
>>> the adjective. "Xαρούμενους" (again a participle, just of a different
>>> verb) would also be valid with the same meaning for most people,
>>> although if one wants to be pedantic, "χαρά" is closer to "joy" than
>>> "happiness".
>>>
>>> Regards,
>>>
>>> --
>>> Alexandros Kosiaris 
>>>
>>> ___
>>> Wikitech-l mailing list
>>> Wikitech-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l