Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-31 Thread Petr Onderka
> > For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level > options like SQLite are widely used. > LevelDB is > pretty cool too. > I think that with the amount of data we're dealing with, it makes sense to have the file format under tight cont

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-10 Thread Petr Onderka
On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer wrote: > > Keeping the dumps in a text-based format doesn't make sense, because > that can't be updated efficiently, which is the whole reason for the new > dumps. > > First, glad to see there's motion here. > > It's definitely true that recompressin

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-07 Thread Ariel T. Glenn
On Sun, 07-07-2013, at 21:09 -0700, Randall Farmer wrote: > Sorry, reading back over this thread late. > > > > What I hope for is a format that allows dumps to be produced much > more > > rapidly, where the time to produce the incrementals grows only as > the > > number of edits

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-04 Thread Petr Onderka
On Wed, Jul 3, 2013 at 11:29 PM, Tyler Romeo wrote: > You should look into maybe using cmake or some other automated build system > to handle the cross-platform compatibility. I will look into that. > Also, are you planning on using > C++11 features? (Just asking because I'm a big C++11 fan.

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Tyler Romeo
You should look into maybe using cmake or some other automated build system to handle the cross-platform compatibility. Also, are you planning on using C++11 features? (Just asking because I'm a big C++11 fan. ;) ).

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Petr Onderka
I'm writing it in C++. If you want, you can follow my progress in the operations/dumps/incremental repo, branch gsoc [1] (but there is almost nothing there yet). And I don't have any computers with non-x86 architecture, so I won't be able to test that. [1]: https://git.wikimedia.org/log/operat

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Petr Onderka
ailto: > wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Petr Onderka > Sent: Wednesday, July 03, 2013 4:04 PM > To: Wikimedia developers; Wikipedia Xmldatadumps-l > Subject: Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new > incremental dumps > > A reply to

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Petr Onderka
The problem is that appending is not enough, especially if you want to keep the current format. 1. With the current format you almost could append new pages, but not new revisions of existing pages, because they belong in the middle of the XML. 2. We also need to handle deletions (and undeletions)

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Brion Vibber
On Wed, Jul 3, 2013 at 7:49 AM, Erik Zachte wrote: > > it will now be a command line application that outputs the data as > uncompressed XML, in the same format as current dumps. > > That will help a great deal. But I assume your application will be for > Linux only? > So it would help to still g

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Erik Zachte
ia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Petr Onderka Sent: Wednesday, July 03, 2013 4:04 PM To: Wikimedia developers; Wikipedia Xmldatadumps-l Subject: Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps A reply to all those who basically want t

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Magnus Manske
Thanks, that sounds like a good solution. On Wed, Jul 3, 2013 at 3:04 PM, Petr Onderka wrote: > A reply to all those who basically want to keep the current XML dumps: > > I have decided to change the primary way of reading the dumps: it will now > be a command line application that outputs the

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-03 Thread Petr Onderka
A reply to all those who basically want to keep the current XML dumps: I have decided to change the primary way of reading the dumps: it will now be a command line application that outputs the data as uncompressed XML, in the same format as current dumps. This way, you should be able to use the n

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-02 Thread Ariel T. Glenn
On Tue, 02-07-2013, at 11:47 +0100, Neil Harris wrote: > The simplest possible dump format is the best, and there's already a > thriving ecosystem around the current XML dumps, which would be broken > by moving to a binary format. Binary file formats and APIs defined by > cod

Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-02 Thread Neil Harris
On 01/07/13 23:21, Nicolas Torzec wrote: Hi there, In principle, I understand the need for binary formats and compression in a context with limited resources. On the other hand, plain text formats are easy to work with, especially for third-party users and organizations. Playing the devil adv