Re: [Wikitech-l] Suggested file format of new incremental dumps

Petr Onderka Mon, 01 Jul 2013 11:17:28 -0700

>
> I was envisioning that we would produce "diff dumps" in one pass
> (presumably in a much shorter time than the fulls we generate now) and
> would apply those against previous fulls (in the new format) to produce
> new fulls, hopefully also in less time.  What do you have in mind for
> the production of the new fulls?
>


What I originally imagined was that the full dump would be modified directly
and that a description of the changes made to it would also be written to
the diff dump.
But now I think that creating the diff and then applying it makes more
sense, because it's simpler.
I also think that doing the two in the same pass will be faster than doing
them separately, because it's less work (no need to read and parse the diff
back from disk).
So what I imagine now is something like this:

1. Read information about a change in a page/revision
2. Create diff object in memory
3. Write the diff object to the diff file
4. Apply the diff object to the full dump
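
A rough Python sketch of that loop, just to make the four steps concrete.
Everything in it is an assumption for illustration: the RevisionChange
class, the JSON-lines diff file and the dict standing in for the full dump
are all hypothetical and say nothing about the real format.

```python
import io
import json
from dataclasses import dataclass, asdict


@dataclass
class RevisionChange:
    """Hypothetical in-memory diff object describing one new revision."""
    page_id: int
    rev_id: int
    text: str


def process_changes(changes, diff_file, full_dump):
    """Write each change to the diff file and apply it to the full dump in one pass."""
    for change in changes:                                # 1. read a change
        diff = RevisionChange(**change)                   # 2. create diff object in memory
        diff_file.write(json.dumps(asdict(diff)) + "\n")  # 3. write it to the diff file
        # 4. apply the same object to the full dump (here: a nested dict)
        full_dump.setdefault(diff.page_id, {})[diff.rev_id] = diff.text


changes = [
    {"page_id": 1, "rev_id": 10, "text": "first"},
    {"page_id": 1, "rev_id": 11, "text": "second"},
    {"page_id": 2, "rev_id": 12, "text": "other page"},
]
diff_file = io.StringIO()   # stands in for the diff dump on disk
full_dump = {}              # stands in for the full dump
process_changes(changes, diff_file, full_dump)
```

The point is only that the diff object is built once and used twice, so
the diff file never has to be read back and parsed.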


> It might be worth seeing how large the resulting en wp history files are
> going to be if you compress each revision separately for version 1 of
> this project.  My fear is that even with 7z it's going to make the size
> unwieldy.  If the thought is that it's a first round prototype, not
> meant to be run on large projects, that's another story.
>

I do expect that a full dump of enwiki using this compression would be way
too big.
So yes, this was meant just to have something working, so that I can
concentrate on doing compression properly later (after the mid-term).
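
To illustrate why per-revision compression is expected to be wasteful,
here is a small Python experiment. It is a sketch under assumptions: the
revision texts are synthetic, and Python's lzma module writes .xz streams
rather than 7z archives, so the absolute numbers mean nothing. Only the
comparison between the two sizes is of interest.

```python
import lzma

# Hypothetical page history: 20 revisions that differ only slightly.
base = " ".join(f"word{i}" for i in range(500))
revisions = [f"{base} edit number {n}".encode() for n in range(20)]

# Every revision compressed as its own LZMA stream.
separate_size = sum(len(lzma.compress(rev)) for rev in revisions)

# All revisions of the page compressed together in one LZMA stream,
# so the compressor can reuse what it saw in earlier revisions.
together_size = len(lzma.compress(b"\n".join(revisions)))

print(f"separately: {separate_size} bytes, together: {together_size} bytes")
```

Compressing separately repeats the per-stream overhead and cannot exploit
the similarity between consecutive revisions, which is where most of the
savings in a history dump come from.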


> I'm not sure about removing the restrictions data; someone must have
> wanted it, like the other various fields that have crept in over time.
> And we should expect there will be more such fields over time...
>

If I understand the code in XmlDumpWriter.openPage correctly, that data
comes from the page_restrictions column of the page table [1], which doesn't
seem to be used in non-ancient versions of MediaWiki.

I did think about versioning the page and revision objects in the dump, but
I'm not sure how exactly to handle upgrades from one version to another.
For now, I think I'll have just one global "data version" per file, but
I'll make sure that adding a version to each object in the future will be
possible.
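
As a sketch of what one global "data version" per file could look like, in
Python: the magic bytes, the one-byte version field and the function names
below are all hypothetical, chosen only to show a reader rejecting files
that are newer than it understands.

```python
import io
import struct

MAGIC = b"MWID"                  # hypothetical file signature
DATA_VERSION = 1                 # single global data version for the whole file
HEADER = struct.Struct("<4sB")   # 4-byte magic followed by a 1-byte version


def write_header(f, version=DATA_VERSION):
    """Write the file header carrying the global data version."""
    f.write(HEADER.pack(MAGIC, version))


def read_header(f):
    """Read the header and refuse files this reader cannot understand."""
    magic, version = HEADER.unpack(f.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a dump file")
    if version > DATA_VERSION:
        raise ValueError(f"unsupported data version {version}")
    return version


buf = io.BytesIO()
write_header(buf)
buf.seek(0)
version = read_header(buf)
```

Per-object versions could be added later by bumping the global version
and giving each object its own version field in the new layout.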


> We need to get some of the wikidata users in on the model/format
> discussion, to see what use they plan to make of those fields and what
> would be most convenient for them.
>
> It's quite likely that these new fulls will need to be split into chunks
> much as we do with the current en wp files.  I don't know what that
> would mean for the diff files.  Currently we split in an arbitrary way
> based on sequences of page numbers, writing out separate stub files and
> using those for the content dumps.  Any thoughts?
>

If possible, I would prefer to keep everything in a single file.
If that won't be possible, I think it makes sense to split on page ids, but
make the split ids visible (probably in the file names) and keep them
unchanged from month to month.
If it turns out that a single chunk grows too big, we might consider adding
a "split" instruction to diff dumps, but that's probably not necessary now.
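
A small Python sketch of what splitting on fixed page ids could look like.
The boundaries and the file naming scheme are made up for illustration;
the only point is that a page id always maps to the same chunk file as
long as the boundaries stay the same from month to month.

```python
import bisect

# Hypothetical first page id of each chunk; fixed from month to month.
CHUNK_STARTS = [1, 1_000_000, 5_000_000]


def chunk_file_name(page_id, wiki="enwiki"):
    """Return the (hypothetical) name of the chunk file holding page_id."""
    if page_id < CHUNK_STARTS[0]:
        raise ValueError(f"invalid page id {page_id}")
    # Index of the last chunk whose first page id is <= page_id.
    index = bisect.bisect_right(CHUNK_STARTS, page_id) - 1
    return f"{wiki}-pages-from-{CHUNK_STARTS[index]}.dump"


names = [chunk_file_name(pid) for pid in (1, 999_999, 1_000_000, 7_000_000)]
```

With stable boundaries, a diff for a given chunk can always be applied to
the chunk file of the same name from the previous month.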

Petr Onderka

[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions