[Xmldatadumps-l] Re: sha-256

2024-04-20 Thread Federico Leva (Nemo)
Adding new checksum files may or may not be a big deal. If the snapshot hosts have enough memory to keep the files in cache a bit longer, so they don't need to be read back from disk, running new checksums may be very fast. https://wikitech.wikimedia.org/wiki/Dumps has more information on the
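
For context, a minimal sketch of how such checksum files could be produced with standard tools (this is not the actual dumps code; the directory and file names are hypothetical examples, and the output format simply mirrors the existing md5sums/sha1sums files):

  # Hypothetical example: write a sha256sums file for one dump run,
  # in the same "<hash>  <filename>" format as the existing checksum files.
  cd /public/dumps/public/enwiki/20240401 &&
    sha256sum enwiki-20240401-*.bz2 enwiki-20240401-*.gz enwiki-20240401-*.7z \
      > enwiki-20240401-sha256sums.txt

Whether this is cheap in practice depends, as noted above, on the dump files still being in the page cache when the checksums are computed.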

[Xmldatadumps-l] Re: How to get wikipedia data dump?

2023-11-29 Thread Federico Leva (Nemo)
On 29/11/23 07:52, hrishipate...@gmail.com wrote: I'm currently looking for the latest Wikipedia data dumps that include the complete history of Wikipedia edits for research purposes. https://meta.wikimedia.org/wiki/Data_dumps contains some information. If you mean the non-deleted revision

[Xmldatadumps-l] Re: Wiki Dump Question

2023-02-27 Thread Federico Leva (Nemo)
On 25/02/23 06:20, Chris Couch wrote: I have been using my Synology file server to share the wiki dump torrents. Thanks for sharing! Which torrent files are you talking about exactly? There are no official WMF-provided torrents for the XML dumps, so there are various unofficial ones.

[Xmldatadumps-l] Re: Part of pages missing in N0 enterprise dumps

2022-03-18 Thread Federico Leva (Nemo)
On 18/03/22 14:04, Erik del Toro wrote: Just wanted to tell you that http://aarddict.org users and dictionary creators also stumbled over these missing namespaces and are now suggesting to continue scraping these. So is scraping the expected approach? Thanks for mentioning this. Not

[Xmldatadumps-l] Re: Newbie question: Which file?

2022-03-11 Thread Federico Leva (Nemo)
Not a silly question at all, as there are many options. On 06/02/22 18:33, Hugh Barnard via Xmldatadumps-l wrote: I'd like XML or HTML, no images, to make a crawl of UK local elections, [...] It sounds like an exploratory phase where you may benefit from a higher-level look at the data

[Xmldatadumps-l] Re: Part of pages missing in N0 enterprise dumps

2022-02-13 Thread Federico Leva (Nemo)
On 13/02/22 21:16, Erik del Toro wrote: Can they be found somewhere else? In N6 or N14? For me it seems that articles/pages that have a colon like Anexo: or Conjugaison: are not part of it. These are not namespace 0. Perhaps the export process forgot to respect $wgContentNamespaces? Federico

Re: [Xmldatadumps-l] List of dumped wikis, discrepancy with Wikidata

2020-08-03 Thread Federico Leva (Nemo)
In short, you need to fix Wikidata. I think that's maintained manually. Authoritative sources can be found in the usual places: https://noc.wikimedia.org/ Colin Johnston, 03/08/20 09:15: This does not seem to comply with foundation data protection retention policy for article removal. No

Re: [Xmldatadumps-l] difference between the xml dumps of the english wikipedia and the pages themselves

2020-06-10 Thread Federico Leva (Nemo)
Fidel Sergio Gil Guevara, 10/06/20 15:25: Do the XML file dumps use the tag names rather than some other form of URL resolution to create these [[]] tags? No. The dumps contain the wikitext as is. Maybe you have an older version of the dump, before this edit?

Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-04 Thread Federico Leva (Nemo)
Fluffy Cat, 04/05/20 10:40: Kiwix was what I tried as well, but it has a 37 GB download which I was trying to avoid. Have you tried wikipedia_en_all_mini_2020-04.zim, which is only 11 GB, less than the latest dump? The most suitable way to save on size depends on what your purposes and

Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-04 Thread Federico Leva (Nemo)
Fluffy Cat, 04/05/20 06:00: Besides being inconvenient regarding visuals, a lot of these unidentified strings replace actual information in the wiki, due to which that info becomes inaccessible. I have faced this problem in every WikiTaxi page that I have used. This is normal: WikiTaxi attempts

Re: [Xmldatadumps-l] image dumps tarballs

2019-03-20 Thread Federico Leva (Nemo)
Samuel Hoover, 20/03/19 21:24: Does that mean Commons is currently culling its content? And that it makes most sense to wait for a post-2016 dump until after housecleaning is complete? No, I just mean that it takes time to identify copyright violations and so on. Most deletions happen for

Re: [Xmldatadumps-l] Current size of English Wikipedia Dump

2018-09-26 Thread Federico Leva (Nemo)
Andy Famiglietti, 26/09/2018 22:37: Does anyone know the actual size of the decompressed dump? Silly grep solution from a WMF Labs machine: $ find /public/dumps/public/enwiki/20180901 -name "enwiki*pages-meta-history*7z" -print0 | xargs -0 -n1 -I§ sh -c "7z l § | tail -n 1 | grep -Eo '^
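
The command above is cut off by the archive; a hedged reconstruction of the same idea (assuming p7zip's "7z l" summary line, where the uncompressed size is the third field) might look like:

  # Sum the uncompressed sizes reported by "7z l" for each history file.
  # The summary-line field layout is an assumption; adjust the index if needed.
  find /public/dumps/public/enwiki/20180901 -name "enwiki*pages-meta-history*7z" -print0 |
    xargs -0 -n1 -I§ sh -c '7z l "§" | tail -n 1' |
    awk '{total += $3} END {printf "%.1f GiB\n", total / 1024^3}'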

Re: [Xmldatadumps-l] change to output file numbering of big wikis

2018-05-31 Thread Federico Leva (Nemo)
Ariel Glenn WMF, 31/05/2018 14:36: The reason for the increase is that near the end of the run there are usually just a few big wikis taking their time to complete. If they run with 6 processes at once, they'll finish up a bit sooner. Thanks for putting more CPUs to work for us. :)

Re: [Xmldatadumps-l] inter-language wikipedia links dump

2018-04-07 Thread Federico Leva (Nemo)
David Straub, 07/04/2018 18:02: I am trying to identify which en.wikipedia dump contains the links between the English language version of wikipedia and other language versions for individual articles.

Re: [Xmldatadumps-l] Official .torrent site for dumps files!?

2017-09-18 Thread Federico Leva (Nemo)
Felipe Ewald, 18/09/2017 04:31: Is this the official Wikimedia Foundation site for .torrents of the dump files? It's not official, but it seems to work OK. Nemo

Re: [Xmldatadumps-l] For entity research data

2017-09-06 Thread Federico Leva (Nemo)
Good morning, 06/09/2017 05:42: We are interested in Entity Research. We haven’t found a way to do that through the channels on the web page and were wondering if you have any ideas on how such data could be collected? Have you tried Wikidata? That's where Wikipedia data largely is.

[Xmldatadumps-l] 2009 pages-meta-current XML dumps torrents

2017-02-16 Thread Federico Leva (Nemo)
A dozen "historical" dump torrents have appeared on some open trackers: https://torrentproject.se/?t=pages-meta-current (https://archive.fo/UK2Qp) In the last 4 days I couldn't find any seeder, though. Nemo

Re: [Xmldatadumps-l] revised index.html for dumps?

2017-01-30 Thread Federico Leva (Nemo)
Aww, but the monobook background is so *cute*. :( A server kitten just died. Nemo

[Xmldatadumps-l] Fwd: Divide XML dumps by page.page_namespace (and figure out what to do with the "pages-articles" dump)

2017-01-17 Thread Federico Leva (Nemo)
Input requested: https://lists.wikimedia.org/pipermail/wikitech-l/2017-January/087393.html , https://phabricator.wikimedia.org/T99483 Personally I think that the main issue is the slowness of some of the tools people use (including dumps.wikimedia.org itself), so I tried to improve the docs

Re: [Xmldatadumps-l] Wikipedia page IDs

2016-12-03 Thread Federico Leva (Nemo)
Renato Stoffalette Joao, 03/12/2016 14:47: Secondly, could anybody kindly explain to me whether some Wikipedia pages have changed their IDs in the past? Or, if so, point me to where this might be documented? https://www.mediawiki.org/wiki/Manual:Page_table#page_id Please avoid such massive

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Federico Leva (Nemo)
Federico Leva (Nemo), 17/06/2016 14:59: Ariel Glenn WMF, 17/06/2016 13:21: For folks from specific institutions that suddenly no longer have access, I can forward institution names along and hope that helps. It would be nice to whitelist the wmflabs.org servers, which would benefit from

Re: [Xmldatadumps-l] Portal:Current events in revision history dumps?

2016-09-08 Thread Federico Leva (Nemo)
Govind, 08/09/2016 12:03: I'm experimenting with the revision history dumps of Wikipedia. Do you mean the English Wikipedia? I'm confused about the archiving of Portal:Current Events in the revision history dumps. The title you mention only has 2 edits:

Re: [Xmldatadumps-l] [Wikitech-l] wikidatawiki.xml.bz2 failed integrity check

2016-08-10 Thread Federico Leva (Nemo)
"lbzip2 -t /public/dumps/public/wikidatawiki/20160801/wikidatawiki-20160801-pages-articles.xml.bz2" succeeds for me on Labs. You should compare the checksum of your copy with https://dumps.wikimedia.org/wikidatawiki/20160801/wikidatawiki-20160801-sha1sums.txt (says

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-06-17 Thread Federico Leva (Nemo)
Ariel Glenn WMF, 17/06/2016 13:21: For folks from specific institutions that suddenly no longer have access, I can forward institution names along and hope that helps. It would be nice to whitelist the wmflabs.org servers, which would benefit from a faster server to download this stuff from.

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-05-15 Thread Federico Leva (Nemo)
Ariel Glenn WMF, 04/05/2016 14:33: You can access it at http://wikimedia.crc.nd.edu/other/ so please do! Great news, especially because it's ten times faster than dumps.wikimedia.org! Finally, every time I need a dataset to quickly verify a sudden idea I have, the download becomes a matter

Re: [Xmldatadumps-l] Wikipedia 20150205 dump

2016-02-18 Thread Federico Leva (Nemo)
Praveen Balaji, 18/02/2016 18:45: I was wondering if someone could point me to the English Wikipedia "enwiki-20150205-pages-articles-multistream" dump from which the 2015-04 DBpedia dumps were extracted. They used to be hosted on dump.wikipedia.org but are 404 now. You

Re: [Xmldatadumps-l] corrupted files english december

2015-12-28 Thread Federico Leva (Nemo)
Luigi Assom, 28/12/2015 16:47: Or, P.S., is DBpedia working with Wikimedia staff, or are they two completely separate things? Completely separate. I was wondering why the wikis release dumps every month, while DBpedia does so roughly once a year. Probably because DBpedia isn't an automated process and their maps

Re: [Xmldatadumps-l] Use of dumps in mediawiki

2015-09-07 Thread Federico Leva (Nemo)
Yoni Lamri, 07/09/2015 12:01: My simple question: how do I correctly install a Wikipedia mirror from dumps in MediaWiki? Did you follow https://meta.wikimedia.org/wiki/Data_dumps/Tools_for_importing ? My goal: create an offline wiki server, from FR, EN or PT dumps (one language only).
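
For reference, the simplest route documented on that page is the importDump.php maintenance script; a minimal sketch, assuming a working MediaWiki install and using an example file name (for large wikis the SQL-based tools listed on the same page are much faster):

  # Import page text from an XML dump into an existing MediaWiki install,
  # then rebuild the derived tables. Run from the MediaWiki root directory.
  bzcat frwiki-latest-pages-articles.xml.bz2 | php maintenance/importDump.php
  php maintenance/rebuildrecentchanges.php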

Re: [Xmldatadumps-l] Need an old dump

2015-07-05 Thread Federico Leva (Nemo)
Saurabh Sarda, 05/07/2015 08:03: The dump file is named enwiki-20081008-pages-articles.xml.bz2 It's the Nth time someone has come asking for this dump; check the archives! http://article.gmane.org/gmane.org.wikimedia.xmldatadumps/1069/ At this point you could try torrent search engines, but it's

Re: [Xmldatadumps-l] No Wikivoyage dumps since 2014

2015-02-05 Thread Federico Leva (Nemo)
Nicolas Raoul, 05/02/2015 08:45: hoping to use this content with Wikivoyage offline browsers like Kiwix. ZIM files for Kiwix don't rely on XML dumps. In fact, the latest ZIM was produced a few hours ago. http://download.kiwix.org/zim/wikivoyage/?C=M;O=A Nemo

Re: [Xmldatadumps-l] incremental dump issues

2014-07-05 Thread Federico Leva (Nemo)
wp mirror, 05/07/2014 06:11: Dear Federico, thanks for the links. The advice on https://meta.wikimedia.org/wiki/Data_dumps/ImportDump.php I have already implemented. Bits of https://www.mediawiki.org/wiki/Manual:Performance_tuning are also implemented. I am not clear about ``setting

Re: [Xmldatadumps-l] incremental dump issues

2014-07-04 Thread Federico Leva (Nemo)
wp mirror, 04/07/2014 23:33: 2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means that, for the largest wikis (commonswiki, enwiki, wikidatawiki), importation cannot be completed before the `xincr' for the next day is posted. Did you try

Re: [Xmldatadumps-l] LATEST DUMPS

2014-06-11 Thread Federico Leva (Nemo)
Alex Druk, 09/06/2014 10:34: I wonder if anyone knows when the dumps for May data will be ready? Usually dump preparation for the previous month's data starts on the 2nd-8th of the next month (http://dumps.wikimedia.org/enwiki/). However, the June dump preparation for May data has not started yet

Re: [Xmldatadumps-l] Help for dump

2014-06-02 Thread Federico Leva (Nemo)
Yannick Guigui, 02/06/2014 10:59: It's about the Wikipedia dump. I'm working on a project which uses the Wikipedia database to display articles offline. Nice! But please don't use the raw database + images. What you need is already available: http://kiwix.org If it's a problem for you to get about

Re: [Xmldatadumps-l] [Wikitech-l] Compressing full-history dumps faster

2014-03-08 Thread Federico Leva (Nemo)
Randall Farmer, 21/01/2014 23:26: Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there:

Re: [Xmldatadumps-l] Template expansion inconsistency

2014-02-22 Thread Federico Leva (Nemo)
wp mirror, 22/02/2014 23:40: Still, it would be nice if the dump files could be fixed. Fixed? The title element is the full page name, as it's supposed to be. Either you're doing something wrong with the import, or the import script/special page has a bug (not uncommon, but it needs a bug report with steps

Re: [Xmldatadumps-l] Installing dumps too slow

2013-08-23 Thread Federico Leva (Nemo)
I've copied the above info to https://meta.wikimedia.org/wiki/Data_dumps/Tools_for_importing#Converting_to_SQL_first Nemo

Re: [Xmldatadumps-l] Help needed for Volunteer Uganda project - Importing enwiki-20130403-pages-articles-multistream.xml.bz2

2013-06-03 Thread Federico Leva (Nemo)
Richard Ive, 02/06/2013 12:38: Hi all, I am helping the charity Volunteer Uganda set up an offline eLearning computer system with 15 Raspberry Pis and a cheap desktop computer for a server. Why aren't you using Kiwix? Reportedly, it even runs standalone on a Raspberry Pi without problems.

Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]

2013-05-06 Thread Federico Leva (Nemo)
Randall Farmer, 06/05/2013 08:37: To wrap up what I started earlier, here's a slightly tweaked copy of the last script I sent around [...] But, all that said, I'm declaring blks2.py a (kinda fun to work on!) dead end. :) If you're done with it, you may want to drop it on a Wikimedia repo like

Re: [Xmldatadumps-l] Pagecounts data missing (2009/09/21 - 2009/10/01)

2013-05-02 Thread Federico Leva (Nemo)
Giovanni Luca Ciampaglia, 02/05/2013 22:40: Hi, I noticed that some pagecounts data files are missing, namely the files in the interval (2009092116 - 2009100100) (ends excluded). See http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/ Does anybody know the reason why these

Re: [Xmldatadumps-l] I need Database tables Mapping to DB Dumps

2013-04-11 Thread Federico Leva (Nemo)
Imran Latif, 11/04/2013 08:26: Thanks for replying, your reply makes sense. I just need to confirm that if I use the following dump, http://dumps.wikimedia.org/fiwiki/20130323/, and download all SQL and XML files and populate my tables using some utility, then the whole Wikipedia data is
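
For the .sql.gz files, loading them is just a matter of feeding them to MySQL; a hedged sketch, with database name and file pattern taken from the link above as examples (the XML files, i.e. the page text, still need a separate import step):

  # Each per-table dump recreates and fills one table in the target database
  # (credentials assumed to be configured in ~/.my.cnf).
  for f in fiwiki-20130323-*.sql.gz; do
      zcat "$f" | mysql fiwiki
  done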

Re: [Xmldatadumps-l] I need Database tables Mapping to DB Dumps

2013-04-10 Thread Federico Leva (Nemo)
Imran Latif, 10/04/2013 23:06: I'm doing a research project on Wikipedia, so I need the Wikipedia data. I decided to use the database dumps of Wikipedia for this purpose, but there are too many files there and I don't know which file populates which table. Would you please provide some information

Re: [Xmldatadumps-l] Processing french dump

2013-03-21 Thread Federico Leva (Nemo)
Benoit Lelong, 11/12/2012 16:11: I am currently planning to process the latest French dump. I would like to ask if somebody has already found or used a good OpenNLP French sentence detection model. If yes, please let me know where to find one. What have you found? Probably wiktionary-l is a

Re: [Xmldatadumps-l] Wikidata project and interwiki links removed in wiki text

2013-03-04 Thread Federico Leva (Nemo)
François Bonzon, 04/03/2013 16:35: How can I now extract interwiki links from dumps? Is there a separate Wikidata dump I should download? What attributes should I look for to join Wikidata and the separate language wiki dumps? Thanks for your help.

Re: [Xmldatadumps-l] Wikidata project and interwiki links removed in wiki text

2013-03-04 Thread Federico Leva (Nemo)
François Bonzon, 04/03/2013 18:22: I confirm I now see interwiki language links originating from Wikidata in the languagewiki-date-langlinks.sql.gz dumps, with the format described in the 2nd link you sent. However, this is a MySQL dump, not an XML dump. Language links are then no longer available in
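
Once a langlinks dump is loaded into MySQL, the links can be read per source page by ID; a small sketch (the database name and page ID 12345 are placeholders, while the column names ll_from, ll_lang and ll_title are the actual schema):

  # List the interlanguage links recorded for one source page,
  # identified by its page_id (ll_from references page.page_id).
  mysql -N enwiki -e "SELECT ll_lang, ll_title FROM langlinks WHERE ll_from = 12345;"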

Re: [Xmldatadumps-l] Housekeeping categories?

2013-02-23 Thread Federico Leva (Nemo)
Robert Crowe, 23/02/2013 21:58: I tried finding all the subcategories of Category:Wikipedia_administration, but unfortunately that includes many non-administration categories also. Will administration categories be limited to those that contain only: - Categories - Files - Talk pages Or are

Re: [Xmldatadumps-l] Housekeeping categories?

2013-02-13 Thread Federico Leva (Nemo)
I don't think there's any simple/reliable way: your only option is probably traversing the whole category tree and finding out whether a category is not a (sub-){1,100}category of https://en.wikipedia.org/wiki/Category:Articles or equivalent... and hope there are not too many loops! Nemo

Re: [Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-08 Thread Federico Leva (Nemo)
Ariel T. Glenn, 08/01/2013 09:26: The issue is that the bad character was added in 2004, see https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385 I've requested removal and revdeletion:

Re: [Xmldatadumps-l] [Wikitech-l] Fwd: Old English Wikipedia image dump from 2005

2012-01-31 Thread Federico Leva (Nemo)
K. Peachey, 31/01/2012 10:17: On Tue, Jan 31, 2012 at 6:13 PM, Ariel T. Glenn ar...@wikimedia.org wrote: You don't need an account to read the content, only to edit. Ariel I believe they mean watchlisting (so they get email notifs) (if email alerts are even activated over there). You can