Dear Ariel,

0) Context.

I am trying to understand how the XML dumps are split (as seen for enwiki,
frwiki, dewiki, etc.).  This is because I would like to write a script
that can recognize when a complete set of, say, `pages-articles' split
dumps has been posted (even if the `pages-meta-history' split dumps are
not yet complete).  To that end, I have some questions.

1) Naming.

Most wikis with split files (`dewiki', `frwiki', `wikidatawiki', and six
others) are split into four pieces. There is a one-to-one correspondence
between the `pages' and `stub' split files.  It is easy to write code for
this case.

How are the split dumps for the `enwiki' (and soon the `frwiki' and
`dewiki') named?  I notice that the page range of the last `pages' split
file changes every month. There are no page ranges on the `stub' files.
There is a many-to-one correspondence between `pages-meta-history' and
`stub-meta-history' split files.  It is harder to write code for this case.
It is also not possible to use the `mwxml2sql' transform tool unless there
is a one-to-one correspondence between `pages' and `stub' files.
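For concreteness, here is the kind of filename pattern my script currently
assumes.  The regex is my own guess from reading the directory listings (part
number before `.xml', a `pNNNpNNN' page range on `pages' files but not on
`stub' files), not documented behaviour, so please correct it if the naming
scheme is different:

```python
import re

# Assumed shape of a split-dump filename, e.g.
#   enwiki-20151002-pages-articles1.xml-p000000010p000010000.bz2
#   enwiki-20151002-stub-articles1.xml.gz   (stubs carry no page range)
SPLIT_RE = re.compile(
    r'^(?P<wiki>\w+)-(?P<date>\d{8})-'
    r'(?P<job>pages-articles|pages-meta-current|pages-meta-history|'
    r'stub-articles|stub-meta-current|stub-meta-history)'
    r'(?P<part>\d+)\.xml'
    r'(?:-p(?P<pstart>\d+)p(?P<pend>\d+))?'   # page range, absent on stubs
    r'\.(?:bz2|gz|7z)$'
)

def parse_split_name(name):
    """Return a dict of the filename's components, or None if the
    name does not look like a split-dump file."""
    m = SPLIT_RE.match(name)
    return m.groupdict() if m else None
```

If the naming really is this regular, pairing a `pages' part with its `stub'
part is just a matter of matching the `part' field; the page ranges then
matter only for the last part, whose upper bound moves every month.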

2) Splitting.

How are the dumps split?

There seems to be a one-to-one correspondence between `pages-articles' and
`stub-articles' files.  Yet, the `enwiki-20151002' dumps are split in an
anomalous way.  The `pages-articles' dumps are split into 28 files, while
the `stub-articles' dumps are split into 27 files. Likewise with the
`pages-meta-current' (28 files) and `stub-meta-current' dumps (27 files).
Should my code be able to handle this as valid, or flag it as a bug?

There is a many-to-one correspondence between `pages-meta-history' and
`stub-meta-history' files.  What rule governs that mapping, so that a
script can rely on it?
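To illustrate what my script would have to decide: given a directory listing,
it can compare the set of part numbers on the `pages' side against the `stub'
side and flag any mismatch, such as the 28-versus-27 case above.  This sketch
assumes only that the part number is the run of digits before `.xml':

```python
import re

# Part number is assumed to be the digits just before `.xml',
# e.g. `...pages-articles27.xml...' -> part 27.
PART_RE = re.compile(r'(pages|stub)-(articles|meta-current|meta-history)(\d+)\.xml')

def part_numbers(filenames, job):
    """Set of part numbers seen for one job, e.g. 'pages-articles'."""
    prefix, _, suffix = job.partition('-')
    return {int(m.group(3))
            for f in filenames
            for m in [PART_RE.search(f)]
            if m and m.group(1) == prefix and m.group(2) == suffix}

def mismatched_parts(filenames):
    """Part numbers present on one side but not the other
    (symmetric difference of pages-articles vs stub-articles)."""
    return (part_numbers(filenames, 'pages-articles')
            ^ part_numbers(filenames, 'stub-articles'))
```

For `enwiki-20151002' this check would report part 28 as unmatched; my
question is whether that should be treated as valid output or as a bug.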

3) Posting.

When split dumps are generated, are the files posted one-by-one, or
atomically as a complete set?  In other words, how do we recognize when a
`pages-articles' dump set is complete, even if the `pages-meta-history'
dump set is missing?
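Absent an authoritative completion marker, the best my script can do is a
heuristic: poll the listing and declare a split set done once its part
numbers are contiguous from 1 and the listing has stopped changing.  This is
purely a sketch of that guess, and it is exactly the kind of fragile logic
I am hoping your answer will let me replace:

```python
import time

def looks_complete(parts):
    """Heuristic: a split set looks complete if the part numbers
    seen so far are contiguous starting at 1.  This cannot prove
    completeness, since the final part count varies month to month."""
    return bool(parts) and parts == set(range(1, max(parts) + 1))

def wait_for_stable_set(list_parts, poll_seconds=0):
    """Poll a listing function returning a set of part numbers;
    declare the set done once it looks complete and is unchanged
    between two consecutive polls."""
    previous = None
    while True:
        parts = list_parts()
        if looks_complete(parts) and parts == previous:
            return parts
        previous = parts
        time.sleep(poll_seconds)
```

If instead the status page (once the 'Partial Dump' labelling is fixed) is
the intended signal, I would much rather key off that than off polling.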

Sincerely Yours,
Kent

On Fri, Dec 4, 2015 at 4:24 AM, Ariel T. Glenn <agl...@wikimedia.org> wrote:

> On Thu, 03-12-2015, at 15:30 -0700, Bryan White wrote:
> > I see where almost all the dumps have "Dump complete" next to them
> > and the data has been transferred to labs.  Problem is, the dumps are
> > not complete.  Is this the new paradigm?... After each stage of the
> > dump, label them done and then transfer what files were generated?
> > Wash, rinse and repeat?
> >
> > Bryan
>
> Transferring each file that is complete when the rsync runs is the new
> paradigm, which has been happening since sometime last month. The
> marking of all dumps as 'Dump complete' is a bug from my last deploy 2
> days ago; I have to track that down.  It should be listing them as
> 'Partial Dump'.
>
> Ariel
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
