Re: [Xmldatadumps-l] pbzip2 proposal

2012-01-27 Thread Federico Leva (Nemo)
Richard Jelinek, 28/01/2012 00:38: don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't... There's quite an old comparison here: https://www.mediawiki.org/wiki/Dbzip2 but https://wikitech.wikimedia.org/view/Dumps/Parallelization
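
The parallel-compression idea behind pbzip2/dbzip2 boils down to compressing independent blocks concurrently and concatenating the resulting bzip2 members, which ordinary decompressors accept. A minimal Python sketch of that idea (not pbzip2 or dbzip2 itself; chunk size, worker count and filenames are illustrative):

```python
# Minimal sketch of parallel bzip2: compress independent chunks with several
# workers, then concatenate the resulting bzip2 members (standard bzip2,
# lbzip2 and Python's bz2.open all accept multi-stream files).
import bz2
from concurrent.futures import ProcessPoolExecutor

CHUNK = 8 * 1024 * 1024  # 8 MiB per block; illustrative, not tuned


def compress_chunk(data: bytes) -> bytes:
    return bz2.compress(data)


def parallel_bzip2(src_path: str, dst_path: str, workers: int = 4) -> None:
    # For brevity all chunks are submitted eagerly; a real tool would bound
    # the in-flight queue so a multi-GB dump is not read into memory at once.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst, \
            ProcessPoolExecutor(max_workers=workers) as pool:
        chunks = iter(lambda: src.read(CHUNK), b"")
        for member in pool.map(compress_chunk, chunks):
            dst.write(member)


if __name__ == "__main__":
    parallel_bzip2("pages-articles.xml", "pages-articles.xml.bz2")
```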

Re: [Xmldatadumps-l] [Wikitech-l] Fwd: Old English Wikipedia image dump from 2005

2012-01-31 Thread Federico Leva (Nemo)
K. Peachey, 31/01/2012 10:17: On Tue, Jan 31, 2012 at 6:13 PM, Ariel T. Glenn wrote: You don't need an account to read the content, only to edit. Ariel I believe they mean watchlisting (so they get email notifs) (If email alerts are even activated over there) You can also use Atom feeds an

Re: [Xmldatadumps-l] Malware reported in mirror

2012-07-02 Thread Federico Leva (Nemo)
Kevin Day, 02/07/2012 05:27: [Found trojan] /z/public/pub/wikimedia/images/wiktionary/fj/c/c4/citibank-car-loan.pdf [Found exploit] /z/public/pub/wikimedia/images/wikisource/ar/7/7d/الحراب_في_صدر_البهاء_والباب.pdf [Found exploit] /z/public/pub/wikimedia/images/wikisource

Re: [Xmldatadumps-l] [Wikitech-l] HTML wikipedia dumps: Could you please provide them, or make public the code for interpreting templates?

2012-09-09 Thread Federico Leva (Nemo)
Shouldn't you be using ZIM, and aren't dumpHTML and siblings The Right Way to do it? See also http://openzim.org/Build_your_ZIM_file Nemo

Re: [Xmldatadumps-l] Format

2012-11-09 Thread Federico Leva (Nemo)
Platonides, this reminds me: have you/we ever documented https://gerrit.wikimedia.org/r/#/c/6717/ somewhere? And do we have some system in place to avoid such problems (import/export incompatibilities) to come up again? John, 09/11/2012 22:32: I am actually looking to re-write that tool to avo

Re: [Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-08 Thread Federico Leva (Nemo)
Ariel T. Glenn, 08/01/2013 09:26: The issue is that the bad character was added in 2004, see https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385 I've requested removal and revdeletion: https://zh.w

Re: [Xmldatadumps-l] Housekeeping categories?

2013-02-13 Thread Federico Leva (Nemo)
I don't think there's any simple/reliable way: your only option is probably to cross the whole category tree and find out whether a category is not a (sub-){1,100}category of https://en.wikipedia.org/wiki/Category:Articles or equivalent... and hope there are not too many loops! Nemo
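
The traversal described here is essentially a breadth-first walk of the subcategory graph with a visited set to survive the loops. A rough Python sketch, where the `subcats` mapping stands in for data you would load from the categorylinks dump or the API (all names below are illustrative):

```python
# Walk the subcategory graph from a content root; anything unreachable is a
# candidate housekeeping category. The visited set is what tames the loops.
from collections import deque

def reachable_from(root, subcats):
    """Return every category reachable from `root` via subcategory edges."""
    seen = {root}
    queue = deque([root])
    while queue:
        for child in subcats.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Toy data standing in for the real categorylinks graph.
toy = {
    "Category:Articles": ["Category:History"],
    "Category:Wikipedia administration": ["Category:Stubs"],
}
content = reachable_from("Category:Articles", toy)
print("Category:Stubs" in content)  # False -> likely a housekeeping category
```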

Re: [Xmldatadumps-l] Housekeeping categories?

2013-02-23 Thread Federico Leva (Nemo)
Robert Crowe, 23/02/2013 21:58: I tried finding all the subcategories of Category:Wikipedia_administration, but unfortunately that includes many non-administration categories also. Will administration categories be limited to those that contain only: - Categories - Files - Talk pages Or are t

Re: [Xmldatadumps-l] [Fwd: organizing Wikimedia mirrors]

2013-02-25 Thread Federico Leva (Nemo)
Ariel T. Glenn, 25/02/2013 12:01: Forwarding in case there are folks on this list interested. What I really want is something sourceforge-like: it knows which mirror has a copy of the file and has the most bandwidth available for the user. Kiwix.org uses MirrorBrain to manage a bandwidth which

Re: [Xmldatadumps-l] Wikidata project and interwiki links removed in wiki text

2013-03-04 Thread Federico Leva (Nemo)
François Bonzon, 04/03/2013 16:35: How can I now extract interwiki links from dumps? Is there a separate Wikidata dump I should download? What attributes to look for to join Wikidata and separate language wiki dumps? Thanks for your help. http://dumps.wikimedia.org/huwiki/20130224/huwiki-20130

Re: [Xmldatadumps-l] Wikidata project and interwiki links removed in wiki text

2013-03-04 Thread Federico Leva (Nemo)
François Bonzon, 04/03/2013 18:22: I confirm I now see interwiki language links originating from Wikidata in wiki--langlinks.sql.gz dumps, with the format described in the 2nd link you sent. However, this is a MySQL dump, not an XML dump. Language links are then no longer available in XML data dump
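
For anyone in the same situation, the langlinks SQL dump can be mined without loading it into MySQL. A hedged sketch, assuming the standard langlinks columns (ll_from, ll_lang, ll_title); the filename below is illustrative:

```python
# Recover interlanguage links from a *-langlinks.sql.gz dump by scanning the
# INSERT statements with a regex, instead of importing the table into MySQL.
import gzip
import re

ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")

def iter_langlinks(path):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO `langlinks`"):
                for ll_from, ll_lang, ll_title in ROW.findall(line):
                    yield int(ll_from), ll_lang, ll_title.replace("\\'", "'")

if __name__ == "__main__":
    for row in iter_langlinks("huwiki-20130224-langlinks.sql.gz"):
        print(row)  # e.g. (source page_id, 'de', 'Target title')
        break
```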

Re: [Xmldatadumps-l] Processing french dump

2013-03-21 Thread Federico Leva (Nemo)
Benoit Lelong, 11/12/2012 16:11: I am currently planning to process the latest French dump. I would like to ask if somebody has already found or used a good OpenNLP French sentence detection model. If yes, please let me know where to find one. What have you found? Probably wiktionary-l is a better

Re: [Xmldatadumps-l] I need Database tables Mapping to DB Dumps

2013-04-10 Thread Federico Leva (Nemo)
Imran Latif, 10/04/2013 23:06: I'm doing a research project on Wikipedia, so I need the Wikipedia data. I decided to use the database dumps of Wikipedia for this purpose, but there are too many files there and I don't know which file populates which table. Would you please provide some information t
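
The short answer usually given: each *.sql.gz file is a plain MySQL dump of the table named in its filename, while the XML files feed page/revision/text through importDump.php or a converter. A small illustrative Python mapping of the most common files (not exhaustive; see https://meta.wikimedia.org/wiki/Data_dumps for the full list):

```python
# Hedged sketch of the filename-to-table mapping; only common files listed.
DUMP_TO_TABLE = {
    "page.sql.gz": "page",
    "categorylinks.sql.gz": "categorylinks",
    "pagelinks.sql.gz": "pagelinks",
    "langlinks.sql.gz": "langlinks",
    "redirect.sql.gz": "redirect",
    "image.sql.gz": "image",
    "pages-articles.xml.bz2": "page + revision + text (via importDump.php)",
}

def table_for(filename: str) -> str:
    for suffix, table in DUMP_TO_TABLE.items():
        if filename.endswith(suffix):
            return table
    return "unknown - check the documentation"

print(table_for("fiwiki-20130323-langlinks.sql.gz"))  # langlinks
```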

Re: [Xmldatadumps-l] I need Database tables Mapping to DB Dumps

2013-04-10 Thread Federico Leva (Nemo)
Imran Latif, 11/04/2013 08:26: Thanks for replying, your reply makes sense. I just need to confirm that if I use the following dump http://dumps.wikimedia.org/fiwiki/20130323/ and download all SQL and XML files and populate my tables using some utility, then the whole Wikipedia data is configu

Re: [Xmldatadumps-l] Pagecounts data missing (2009/09/21 - 2009/10/01)

2013-05-02 Thread Federico Leva (Nemo)
Giovanni Luca Ciampaglia, 02/05/2013 22:40: Hi, I noticed that some pagecounts data files are missing, namely the files in the interval (2009092116 - 2009100100) (ends excluded). See http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/ Does anybody know the reason why these da

Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]

2013-05-06 Thread Federico Leva (Nemo)
Randall Farmer, 06/05/2013 08:37: To wrap up what I started earlier, here's a slightly tweaked copy of the last script I sent around [...] But, all that said, declaring blks2.py a (kinda fun to work on!) dead end. :) If you're done with it, you may want to drop it on a Wikimedia repo like

Re: [Xmldatadumps-l] Help needed for Volunteer Uganda project - Importing enwiki-20130403-pages-articles-multistream.xml.bz2

2013-06-03 Thread Federico Leva (Nemo)
Richard Ive, 02/06/2013 12:38: Hi all, I am helping the charity Volunteer Uganda set up an offline eLearning computer system with 15 Raspberry Pis and a cheap desktop computer as a server. Why aren't you using Kiwix? Reportedly, it even runs standalone on a Raspberry Pi without problems. Ne

Re: [Xmldatadumps-l] Help needed for Volunteer Uganda project - Importing enwiki-20130403-pages-articles-multistream.xml.bz2

2013-06-03 Thread Federico Leva (Nemo)
Richard Ive, 03/06/2013 12:06: In all honesty, I didn't know about it until now. Everything else we are using is web based (Khan Academy Lite, ebooks and emedia), so for our model the Wikipedia website works best. I would guess it is cheaper to buy a £300 desktop with 2TB for the wiki MySQL dat

Re: [Xmldatadumps-l] Image dump tarball status?

2013-07-08 Thread Federico Leva (Nemo)
Kevin Day, 08/07/2013 03:38: We've got our hardware back up, but during our outage the Wikimedia folks did a datacenter move. The source where we were grabbing all the images from isn't running right now, so there's no new image data to build new tarballs from. As soon as the Wikimedia people

Re: [Xmldatadumps-l] Extracted page abstracts for Yahoo

2013-07-29 Thread Federico Leva (Nemo)
Andreas Meier, 28/07/2013 22:48: Hello, there is a problem with the extracted page abstracts for Yahoo on the big wikis moved to the new infrastructure. During generation everything seems to be fine, but it ended with a 159 KB file. Another question: why is this step not parallelized? Sorry, I

Re: [Xmldatadumps-l] Installing dumps too slow

2013-08-23 Thread Federico Leva (Nemo)
I've copied the above info to Nemo

Re: [Xmldatadumps-l] relationship between logging and page_restrictions

2013-09-12 Thread Federico Leva (Nemo)
Xavier Vinyals Mirabent, 12/09/2013 20:01: Are the values in the columns pr_id and log_id equivalent? I'm trying to select all changes in editing protection status for Wikipedia articles but the table Page_restrictions doesn't contain a time stamp, and the table logging doesn't specify the kind o

Re: [Xmldatadumps-l] [Wikitech-l] Bulk download

2013-09-23 Thread Federico Leva (Nemo)
Jeremy Baron, 23/09/2013 16:11: On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" wrote: > I have a list of about 1.8 million images which I have to download from commons.wikimedia.org. Is there any simple way to do this which doesn't
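
One simple approach often suggested for bulk fetching is to derive each file's direct upload.wikimedia.org URL from the MD5-based shard path Commons uses, rather than hitting the API for every title. A hedged Python sketch; the output directory and User-Agent are assumptions, and a real 1.8-million-file job should throttle, resume and retry:

```python
# Derive the direct Commons URL for a file title (without the "File:" prefix)
# via the MD5 shard path, then download it.
import hashlib
import os
import urllib.parse
import urllib.request

def commons_url(title: str) -> str:
    name = title.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("https://upload.wikimedia.org/wikipedia/commons/"
            f"{digest[0]}/{digest[:2]}/{urllib.parse.quote(name)}")

def fetch(title: str, outdir: str = "images") -> None:
    os.makedirs(outdir, exist_ok=True)
    req = urllib.request.Request(
        commons_url(title),
        headers={"User-Agent": "bulk-image-fetch-sketch/0.1 (contact address)"})
    with urllib.request.urlopen(req) as resp, \
            open(os.path.join(outdir, title.replace(" ", "_")), "wb") as out:
        out.write(resp.read())

if __name__ == "__main__":
    fetch("Example.jpg")  # illustrative title
```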

[Xmldatadumps-l] "Tarballs" of all 2004-2012 Commons files now available at archive.org

2013-10-13 Thread Federico Leva (Nemo)
WikiTeam has just finished archiving all Wikimedia Commons files up to 2012 (and some more) on the Internet Archive: https://archive.org/details/wikimediacommons So far it's about 24 TB of archives and there are also a hundred torrents you can help seed, ranging from a few hundred MB to over a TB,

Re: [Xmldatadumps-l] [Commons-l] [wikiteam-discuss:699] "Tarballs" of all 2004-2012 Commons files now available at archive.org

2013-10-15 Thread Federico Leva (Nemo)
Paul A. Houle, 15/10/2013 00:35: I’d like to see the Commons backups available in the AMZN S3 cloud, even if it is only as “requester pays”. Frankly, my experience is that getting data from the Internet Archive is so slow that I wonder if they are on the Moon. When did you last try? They

Re: [Xmldatadumps-l] Question : Import dump files

2013-11-05 Thread Federico Leva (Nemo)
장기숭, 06/11/2013 05:58: I tried to import the SQL into MySQL and to import the XML using importDump.php, but it doesn't work well. Have you read ? Nemo Question 1) How to import the XML and SQL? 2) Will whole files be imported? ( wikidatawiki-lat

Re: [Xmldatadumps-l] Get images of article.

2013-11-08 Thread Federico Leva (Nemo)
Yannick Guigui, 08/11/2013 10:11: Please, I want to get all images of the French and English Wikipedia; how much would it cost to get them on a hard disk? I can't download them because I don't have enough bandwidth from my country. What do you need them for? Originals would be about 2+1 TB and anyone can dow

Re: [Xmldatadumps-l] Get images of article.

2013-11-08 Thread Federico Leva (Nemo)
demo (3 min, in French) of the webapp: https://www.youtube.com/watch?v=0f-HJhOw1-U If I get small images in French and English to download to the app, my problem will be resolved. Thanks a lot, Federico On Friday 8 November 2013, Federico Leva (Nemo) wrote: Yannick Guigui, 08/11/2013 10:11:

Re: [Xmldatadumps-l] ENwiktionary database dump issue

2013-12-09 Thread Federico Leva (Nemo)
Tina Lukša, 08/12/2013 18:01: Hello! I am using WikiTaxi for importing wiki databases and their offline usage. I've never had issues before but the latest two English wiktionary databases haven't been working correctly. As seen in the screenshot, the translations from various languages can't be s

Re: [Xmldatadumps-l] ENwiktionary database dump issue

2013-12-09 Thread Federico Leva (Nemo)
imported wiki. There were never requirements for any extensions - it's a straightforward portable app. Unfortunately, once I update a wiki I delete the old import, so I neither know the exact date of the working dump nor can I take a screencap to prove that everything worked just fine prior to the update

Re: [Xmldatadumps-l] Missing page in zh.wiktionary.org dump?: https://zh.wiktionary.org/wiki/Template:漢語寫法 or %E6%B1%89%E8%AF%AD%E5%86%99%E6%B3%95

2013-12-18 Thread Federico Leva (Nemo)
gnosygnu, 19/12/2013 06:01: Hi. I'm not sure if this is a dump issue, but I thought I'd start off here. I had a user report a missing page in zh.wiktionary.org : https://sourceforge.net/p/xowa/tickets/291/. It seems that a Main namespace page (學生) references a template

Re: [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-20 Thread Federico Leva (Nemo)
Randall Farmer, 20/01/2014 23:39: Hi, everyone. tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the same avg compression ratio as 7zip. [...] Wow! Thanks for continuing work on this. Technical data dump aside: *How could I get this more thoroughly tested, then maybe

Re: [Xmldatadumps-l] Template expansion inconsistency

2014-02-22 Thread Federico Leva (Nemo)
wp mirror, 22/02/2014 23:40: Still, it would be nice if the dump files could be fixed. Fixed? is the full page name as it's supposed to be. Either you're doing something wrong with the import, or the import script/special page has a bug (not uncommon, but needs a bug report with steps to re

Re: [Xmldatadumps-l] Template expansion inconsistency

2014-02-23 Thread Federico Leva (Nemo)
wp mirror, 23/02/2014 15:26: c) Third best, would be to patch `mwxml2sql'. This I also favor, but would like some guidance from its author, Ariel Glenn, before I start hacking. This seems the most likely. Probably, mwxml2sql has to be fixed so that it does whatever importDump.php/Special:Impo

Re: [Xmldatadumps-l] [Wikitech-l] Compressing full-history dumps faster

2014-03-08 Thread Federico Leva (Nemo)
Randall Farmer, 21/01/2014 23:26: Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/

Re: [Xmldatadumps-l] Where's abstract.xml.gz?

2014-03-14 Thread Federico Leva (Nemo)
Clem Wang, 13/03/2014 23:16: I'd rather minimize traffic by downloading the compressed version of the file. Does the "Accept-Encoding: gzip" header not work? Nemo
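
The suggestion is plain HTTP content negotiation: request the uncompressed file but let the server gzip it in transit via the Accept-Encoding request header. A hedged sketch (the URL is illustrative and whether compression is applied depends on the server configuration):

```python
# Ask the server to gzip the response in transit, then decompress locally.
import gzip
import urllib.request

url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml"
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
print(len(body), "bytes after decompression")
```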

Re: [Xmldatadumps-l] Hindi

2014-04-23 Thread Federico Leva (Nemo)
Benoit Lelong, 24/04/2014 00:18: Has anybody already tried to compute the Hindi Wikipedia dump? Ahem, compute in what way? Nemo

Re: [Xmldatadumps-l] Help for dump

2014-06-02 Thread Federico Leva (Nemo)
Yannick Guigui, 02/06/2014 10:59: It’s about Dump of Wikipedia. I’m working in a project which uses Wikipedia Database to display articles offline. Nice! But please don't use the raw database + images. What you need is already available: http://kiwix.org If it's a problem for you to get about

Re: [Xmldatadumps-l] LATEST DUMPS

2014-06-11 Thread Federico Leva (Nemo)
Alex Druk, 09/06/2014 10:34: I wonder if anyone knows when dumps for May data will be ready? Usually dump preparation for the previous month's data starts on the 2nd-8th of the next month (http://dumps.wikimedia.org/enwiki/). However, June dump preparation for May data has not started yet http://dumps.wikimedia.o

Re: [Xmldatadumps-l] incremental dump issues

2014-07-04 Thread Federico Leva (Nemo)
wp mirror, 04/07/2014 23:33: > 2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means that, for the largest wikis (commonswiki, enwiki, wikidatawiki) importation cannot be completed before the `xincr' for the next day is posted. Did you try https://meta.wikimedia.org/wiki/Da

Re: [Xmldatadumps-l] incremental dump issues

2014-07-04 Thread Federico Leva (Nemo)
wp mirror, 05/07/2014 06:11: > Dear Federico, thanks for the links. The advice on I have already implemented. Bits of are also implemented. I am not clear about `

Re: [Xmldatadumps-l] Getting display: mw:Help:Magic words#Other on the page

2014-10-13 Thread Federico Leva (Nemo)
Arquillos, Diana, 13/10/2014 21:36: Since the dump from August we updated the content and since then our pages don't render properly. What dump were you using before that? It seems that it doesn't render properly the modules/templates contained in double braces {{}}, but not always. So

Re: [Xmldatadumps-l] Proposal: Stop dumping inactive/closed wikis

2015-01-17 Thread Federico Leva (Nemo)
Richard Jelinek, 17/01/2015 19:59: latest at our servers is aar-20141223.xml.bz with 22974 bytes 22974 entire bytes! What a terrible waste! If we want to save space, I propose to cut some formats of en.wiki dumps. *That* would save a lot of resources. ;-) Nemo

Re: [Xmldatadumps-l] Proposal: Stop dumping inactive/closed wikis

2015-01-18 Thread Federico Leva (Nemo)
Richard Jelinek, 18/01/2015 09:45: Not a big deal - is it? Indeed not a big deal, I'd set up an email filter. Nemo

Re: [Xmldatadumps-l] No Wikivoyage dumps since 2014

2015-02-05 Thread Federico Leva (Nemo)
Nicolas Raoul, 05/02/2015 08:45: hoping to use this content with Wikivoyage offline browsers like Kiwix. ZIM files for Kiwix don't rely on XML dumps. In fact, the latest ZIM was produced a few hours ago. http://download.kiwix.org/zim/wikivoyage/?C=M;O=A Nemo

Re: [Xmldatadumps-l] difference between data dumps

2015-02-12 Thread Federico Leva (Nemo)
126, 12/02/2015 15:38: I got stuck with an open source project which calls for enwiki-latest-pages-articles.xml.bz2 while I only have enwiki-latest-pages-articles-multistream.xml.bz2; the network is too bad for me to download another large file, so I wondered what is the difference
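
The practical difference: both files contain the same pages, and for plain sequential reading they are interchangeable, because the multistream file is just a series of independent bzip2 streams (which bz2.open reads straight through); the multistream variant additionally ships an index so you can jump to a single block. A hedged Python sketch of that random access, with the offset taken from the companion -index file's "offset:page_id:title" lines:

```python
# Decompress a single bzip2 stream of a multistream dump, starting at `offset`.
import bz2

def read_stream(dump_path: str, offset: int) -> bytes:
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    with open(dump_path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out.extend(decomp.decompress(chunk))
    return bytes(out)

# Example (offset value is illustrative):
# xml = read_stream("enwiki-latest-pages-articles-multistream.xml.bz2", 600323092)
```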

Re: [Xmldatadumps-l] Proposal or question for a dump of all .svg files in commons.

2015-06-19 Thread Federico Leva (Nemo)
D. Hansen, 19/06/2015 23:09: One suggestion was to download commonswiki-20150417-all-titles, which I did. But this file does contain deleted names and renamed names, and names that partly have "File:" and some that don't have "File:" or a similar indicator at the start. Doing just a small sample
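
A hedged sketch of the filtering step, assuming the all-titles file is gzipped with two tab-separated columns (page_namespace, page_title) and a header row; the filename is illustrative, and renamed or since-deleted titles will still slip through:

```python
# Keep only File: namespace (namespace 6) entries ending in .svg.
import gzip

def svg_titles(path):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        next(f)  # skip the header line
        for line in f:
            ns, _, title = line.rstrip("\n").partition("\t")
            if ns == "6" and title.lower().endswith(".svg"):
                yield "File:" + title

if __name__ == "__main__":
    for i, title in enumerate(svg_titles("commonswiki-20150417-all-titles.gz")):
        print(title)
        if i >= 9:
            break
```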

Re: [Xmldatadumps-l] Need an old dump

2015-07-05 Thread Federico Leva (Nemo)
Saurabh Sarda, 05/07/2015 08:03: The dump file is named enwiki-20081008-pages-articles.xml.bz2 It's the Nth time someone comes and asks for this dump, check the archives! http://article.gmane.org/gmane.org.wikimedia.xmldatadumps/1069/ At this point you could try torrent search engines, but it's pr

Re: [Xmldatadumps-l] Flow, LiquidThreads, and XML dumps

2015-07-10 Thread Federico Leva (Nemo)
wp mirror, 11/07/2015 02:51: If not, can you forecast when Flow will appear in the dumps? Before any deployment, I expect. https://phabricator.wikimedia.org/T89398 If Flow is being deployed on actual wikis without this feature, then this becomes a critical bug of "Unbreak now" priority. Nemo

Re: [Xmldatadumps-l] Use of dumps in mediawiki

2015-09-07 Thread Federico Leva (Nemo)
Yoni Lamri, 07/09/2015 12:01: My simple question: how do I correctly install a Wikipedia mirror from dumps in MediaWiki? Did you follow https://meta.wikimedia.org/wiki/Data_dumps/Tools_for_importing ? My goal: create an offline Wikimedia server, from FR, EN or PT dumps (1 language only). [.

Re: [Xmldatadumps-l] Use of dumps in mediawiki

2015-09-07 Thread Federico Leva (Nemo)
Yoni Lamri, 07/09/2015 16:50: Our company is creating a partnership with the Wikipedia foundation and we cannot use Kiwix, which is not coming from this foundation. The Wikipedia Foundation doesn't exist and Kiwix is an official Wikimedia tool. Nemo

Re: [Xmldatadumps-l] [Wiki-research-l] Download of pageviews dataset

2015-11-11 Thread Federico Leva (Nemo)
Cristian Consonni, 11/11/2015 15:09: I am working with a student on scientific citation on Wikipedia and, very simply put, we would like to use the pageview dataset to have a rough measure of how many times a paper was viewed thanks to Wikipedia.[*] The full dataset is, as of now, ~ 4.7TB in siz
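
Since the hourly files are plain space-separated text (project, page title, view count, transferred bytes), a filtering pass that keeps only the titles of interest avoids ever materialising the full 4.7 TB. A hedged sketch; filenames, project code and titles are illustrative:

```python
# Tally views for a small set of titles from hourly pageview files.
import gzip
from collections import Counter

def tally(paths, project="en", titles=frozenset({"PageRank"})):
    totals = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) >= 3 and parts[0] == project and parts[1] in titles:
                    totals[parts[1]] += int(parts[2])
    return totals

if __name__ == "__main__":
    print(tally(["pageviews-20151101-000000.gz"]))
```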

Re: [Xmldatadumps-l] corrupted files english december

2015-12-28 Thread Federico Leva (Nemo)
Luigi Assom, 28/12/2015 16:47: Or p.s. is DBpedia working with Wikimedia staff, or are they two completely separate things? Completely separate. I was wondering why the wikis release dumps every month, while DBpedia does so roughly each year. Probably because DBpedia isn't an automated process and their maps ne

Re: [Xmldatadumps-l] 2016-01 dumps halted?

2016-01-10 Thread Federico Leva (Nemo)
Presumably still https://phabricator.wikimedia.org/T121348 As you can see at https://phabricator.wikimedia.org/project/feed/1519/ , recent days have been dedicated to a conference. Nemo

Re: [Xmldatadumps-l] [Wikitech-l] Wikipedia dumps

2016-01-20 Thread Federico Leva (Nemo)
Bernardo Sulzbach, 20/01/2016 19:10: > If you could decompress it correctly (20151201), don't mind my report. I would just want it removed if it was confirmed to be a problematic file. This is soon unneeded as new dumps are being generated: https://dumps.wikimedia.org/enwiki/20160113/ The

Re: [Xmldatadumps-l] Extracting featured article meta-history dumps

2016-01-31 Thread Federico Leva (Nemo)
Anmol Dalmia, 28/01/2016 13:14: Is there a list of such articles and their article ids or any such tool that can crawl through to produce these lists? https://meta.wikimedia.org/wiki/Wikidata/Development/Badges https://www.wikidata.org/wiki/Wikidata:Data_access Nemo

Re: [Xmldatadumps-l] old dumps

2016-02-14 Thread Federico Leva (Nemo)
Have you already tried all the mirrors? https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Dumps Nemo

Re: [Xmldatadumps-l] Wikipedia 20150205 dump

2016-02-18 Thread Federico Leva (Nemo)
Praveen Balaji, 18/02/2016 18:45: I was wondering if someone could point me to the English Wikipedia "enwiki-20150205-pages-articles-multistream" dump from which the 2015-04 DBpedia dumps were extracted. They used to be hosted on dump.wikipedia.org but are 404 now. You

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-05-15 Thread Federico Leva (Nemo)
Ariel Glenn WMF, 04/05/2016 14:33: You can access it at http://wikimedia.crc.nd.edu/other/ so please do! Great news, especially because it's ten times faster than dumps.wikimedia.org! Finally, every time I need a dataset to quickly verify a sudden idea I have, the download becomes a matter of

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-06-17 Thread Federico Leva (Nemo)
Ariel Glenn WMF, 17/06/2016 13:21: For folks from specific institutions that suddenly no longer have access, I can forward institution names along and hope that helps. It would be nice to whitelist the wmflabs.org servers, which would benefit from a faster server to download this stuff from. N

Re: [Xmldatadumps-l] [Wikitech-l] wikidatawiki.xml.bz2 failed integrity check

2016-08-10 Thread Federico Leva (Nemo)
"lbzip2 -t /public/dumps/public/wikidatawiki/20160801/wikidatawiki-20160801-pages-articles.xml.bz2" succeeds for me on Labs. You should compare the checksum of your copy with https://dumps.wikimedia.org/wikidatawiki/20160801/wikidatawiki-20160801-sha1sums.txt (says c6a823508240d161e481e5d00459

Re: [Xmldatadumps-l] Portal:Current events in revision history dumps?

2016-09-08 Thread Federico Leva (Nemo)
Govind, 08/09/2016 12:03: I'm experimenting with the revision history dumps of Wikipedia. Do you mean the English Wikipedia? I have some confusion regarding the archiving of Portal:Current Events in the revision history dumps. The title you mention only has 2 edits: https://en.wikipedia.org/w/index.php?title=P

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Federico Leva (Nemo)
Federico Leva (Nemo), 17/06/2016 14:59: Ariel Glenn WMF, 17/06/2016 13:21: For folks from specific institutions that suddenly no longer have access, I can forward institution names along and hope that helps. It would be nice to whitelist the wmflabs.org servers, which would benefit from a

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Federico Leva (Nemo)
Ok. Ariel Glenn WMF, 27/09/2016 11:47: http://dumps.wikimedia.your.org/other/mediacounts/daily/2016/ There are mediacounts here, is the download speed acceptable? Oh yes, that's around 50 MiB/s. I did not see this directory linked from their main page so I thought they had removed it; I'll

Re: [Xmldatadumps-l] Wikipedia page IDs

2016-12-03 Thread Federico Leva (Nemo)
Renato Stoffalette Joao, 03/12/2016 14:47: Secondly, could anybody kindly explain to me whether some Wikipedia pages changed their IDs in the past? Or if so, point me to where this might be documented? https://www.mediawiki.org/wiki/Manual:Page_table#page_id Please avoid such massive crosspostin

[Xmldatadumps-l] Fwd: Divide XML dumps by page.page_namespace (and figure out what to do with the "pages-articles" dump)

2017-01-17 Thread Federico Leva (Nemo)
Input requested: https://lists.wikimedia.org/pipermail/wikitech-l/2017-January/087393.html , https://phabricator.wikimedia.org/T99483 Personally I think that the main issue is the slowness of some of the tools people use (including dumps.wikimedia.org itself), so I tried to improve the docs a

Re: [Xmldatadumps-l] revised index.html for dumps?

2017-01-30 Thread Federico Leva (Nemo)
Aww, but the monobook background is so *cute*. :( A server kitten just died. Nemo

[Xmldatadumps-l] 2009 pages-meta-current XML dumps torrents

2017-02-16 Thread Federico Leva (Nemo)
A dozen "historical" dump torrents have appeared on some open trackers: https://torrentproject.se/?t=pages-meta-current (https://archive.fo/UK2Qp ) In the last 4 days I couldn't find any seeder though. Nemo

Re: [Xmldatadumps-l] For entity research data

2017-09-05 Thread Federico Leva (Nemo)
Good morning, 06/09/2017 05:42: We are interested in Entity Research. We haven’t found a way to do that through the channels on the web page and were wondering if you have any ideas on how such data could be collected? Have you tried Wikidata? That's where Wikipedia data largely is. https://ww

Re: [Xmldatadumps-l] Official .torrent site for dumps files!?

2017-09-18 Thread Federico Leva (Nemo)
Felipe Ewald, 18/09/2017 04:31: Is this the official Wikimedia Foundation site for .torrents of the dump files? It's not official, but it seems to work ok. Nemo

Re: [Xmldatadumps-l] inter-language wikipedia links dump

2018-04-07 Thread Federico Leva (Nemo)
David Straub, 07/04/2018 18:02: I am trying to identify which en.wikipedia dump contains the links between the English language version of wikipedia and other language versions for individual articles.

Re: [Xmldatadumps-l] change to output file numbering of big wikis

2018-05-31 Thread Federico Leva (Nemo)
Ariel Glenn WMF, 31/05/2018 14:36: The reason for the increase is that near the end of the run there are usually just a few big wikis taking their time at completing. If they run with 6 processes at once, they'll finish up a bit sooner. Thanks for putting more CPUs at work for us. :) Federico

Re: [Xmldatadumps-l] Current size of English Wikipedia Dump

2018-09-26 Thread Federico Leva (Nemo)
Andy Famiglietti, 26/09/2018 22:37: Does anyone know the actual size of the decompressed dump? Silly grep solution from a WMF Labs machine: $ find /public/dumps/public/enwiki/20180901 -name "enwiki*pages-meta-history*7z" -print0 | xargs -0 -n1 -I§ sh -c "7z l § | tail -n 1 | grep -Eo '^ +[0-

Re: [Xmldatadumps-l] image dumps tarballs

2019-03-20 Thread Federico Leva (Nemo)
Samuel Hoover, 20/03/19 19:51: Are Wikipedia or Commons image dumps available anywhere? The biggest we have is at . We still have to catch up after 2016 (the longer the wait, the less to-be-deleted content slips in). Federico

Re: [Xmldatadumps-l] image dumps tarballs

2019-03-20 Thread Federico Leva (Nemo)
Samuel Hoover, 20/03/19 21:24: Does that mean Commons is currently culling its content? and that it makes most sense to wait for a post 2016 dump until after housecleaning is complete? No, I just mean that it takes time to identify copyright violations and so on. Most deletions happen for con

Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-03 Thread Federico Leva (Nemo)
Fluffy Cat, 04/05/20 06:00: Besides being visually inconvenient, a lot of these unidentified strings replace actual information in the wiki, due to which that info becomes inaccessible. I have faced this problem on every WikiTaxi page that I have used. This is normal: WikiTaxi attempts

Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-04 Thread Federico Leva (Nemo)
Fluffy Cat, 04/05/20 10:40: Kiwix was what I tried as well, but it has a 37 GB download which I was trying to avoid. Have you tried wikipedia_en_all_mini_2020-04.zim, which is only 11 GB, less than the latest dump? The most suitable way to save on size depends on what your purposes and resour

Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-04 Thread Federico Leva (Nemo)
Fluffy Cat, 04/05/20 17:27: I believe the 37 GB English Wiki contains all articles without images, which is what I am looking for. So, it might take a few months but I shall probably download that one, through Kiwix. Meanwhile, Wikitaxi should be fine. If it takes you months to download it, I'm

Re: [Xmldatadumps-l] difference between the xml dumps of the english wikipedia and the pages themselves

2020-06-10 Thread Federico Leva (Nemo)
Fidel Sergio Gil Guevara, 10/06/20 15:25: Do the XML file dumps use the tag names rather than some other form of URL resolution to create these [[]] tags? No. The dumps contain the wikitext as is. Maybe you have an older version of the dump, before this edit?
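
Since the dump stores raw wikitext, extracting the [[...]] targets is up to the consumer. A hedged sketch with a deliberately approximate regex (it ignores nesting, files and templates):

```python
# Pull link targets out of raw wikitext with a simple, approximate regex.
import re

WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def link_targets(wikitext: str) -> list:
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

print(link_targets("See [[Apollo 11|the mission]] and [[Moon#Orbit]]."))
# -> ['Apollo 11', 'Moon']
```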

Re: [Xmldatadumps-l] List of dumped wikis, discrepancy with Wikidata

2020-08-03 Thread Federico Leva (Nemo)
In short, you need to fix Wikidata. I think that's maintained manually. Authoritative sources can be found in the usual places: https://noc.wikimedia.org/ colin johnston, 03/08/20 09:15: > This does not seem to comply with foundation data protection retention policy for article removal No such

[Xmldatadumps-l] Re: Part of pages missing in N0 enterprise dumps

2022-02-13 Thread Federico Leva (Nemo)
On 13/02/22 21:16, Erik del Toro wrote: Can they be found somewhere else? In N6 or N14? For me it seems that articles/pages that have a colon like Anexo: or Conjugaison: are not included. These are not namespace 0. Perhaps the export process forgot to respect $wgContentNamespaces? Federico

[Xmldatadumps-l] Re: Newbie question: Which file?

2022-03-11 Thread Federico Leva (Nemo)
Not a silly question at all, as there are many options. On 06/02/22 18:33, Hugh Barnard via Xmldatadumps-l wrote: I'd like XML or HTML, no images, to make a crawl of UK local elections, [...] It sounds like an exploratory phase where you may benefit from a higher-level look at the data a

[Xmldatadumps-l] Re: Part of pages missing in N0 enterprise dumps

2022-03-18 Thread Federico Leva (Nemo)
On 18/03/22 14:04, Erik del Toro wrote: Just wanted to tell you that http://aarddict.org users and dictionary creators also stumbled over these missing namespaces and are now suggesting to continue scraping these. So is scraping the expected approach? Thanks for mentioning this. Not sure

[Xmldatadumps-l] Re: Wiki Dump Question

2023-02-27 Thread Federico Leva (Nemo)
On 25/02/23 06:20, Chris Couch wrote: I have been using my Synology file server to share the wiki dump torrents Thanks for sharing! Which torrent files are you talking about exactly? There are no official WMF-provided torrents for the XML dumps, so there are various unofficial ones. Do

[Xmldatadumps-l] Re: How to get wikipedia data dump?

2023-11-29 Thread Federico Leva (Nemo)
On 29/11/23 07:52, hrishipate...@gmail.com wrote: I'm currently looking for the latest Wikipedia data dumps that include the complete history of Wikipedia edits, for research purposes. https://meta.wikimedia.org/wiki/Data_dumps contains some information. If you mean the non-deleted revision h

[Xmldatadumps-l] Re: sha-256

2024-04-20 Thread Federico Leva (Nemo)
Adding new checksum files may or may not be a big deal. If the snapshot hosts have enough memory to keep the files in cache a bit longer, so they don't need to be read back from disk, running new checksums may be very fast. https://wikitech.wikimedia.org/wiki/Dumps has more information on the
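
A hedged sketch of why the extra checksum can be cheap: if the file is read only once (or is still in the page cache), md5, sha1 and sha256 can all be fed from the same pass, so the added cost is CPU rather than extra disk I/O, and the results can then be compared with the published md5sums/sha1sums files. The path below is illustrative:

```python
# Compute several digests in a single streaming pass over the file.
import hashlib

def all_digests(path: str, bufsize: int = 1 << 20) -> dict:
    hashes = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            for h in hashes.values():
                h.update(block)
    return {name: h.hexdigest() for name, h in hashes.items()}

if __name__ == "__main__":
    print(all_digests("enwiki-latest-pages-articles.xml.bz2"))
```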