https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #4 from Kiran Mathew Koshy <[email protected]> ---
Regarding zimdiff, which is the first phase of my GSoC project, here is the
file format I'm proposing:

The zimdiff tool will take two ZIM files as input. Let's call them start_file
and end_file. The purpose is to generate a third file, diff_file, which can be
used to update the contents of start_file to obtain end_file.

Files:
1. start_file
2. end_file
3. diff_file

File format for diff_file:

The diff_file can be a ZIM file itself, with a few additional metadata entries.
The (non-metadata) articles inside the diff_file will be:

1. Articles that have been added to end_file and were not present in
start_file. These will share a common namespace. The metadata will contain a
list of the new articles and their corresponding namespaces, so that the
original namespace information is not lost.

2. Articles that have been modified (the new version of each article). A list
of modified articles will be kept in the metadata; using this list, the old
articles can be removed and the new ones inserted when the update is applied.
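
To make the classification above concrete, here is a rough sketch (not the
actual zimdiff code). It assumes each ZIM file has already been read into a
dict mapping (namespace, url) to article content; that layout and the
classify() helper are hypothetical and only meant for illustration.

```python
def classify(start, end):
    """Split the articles of two archives into added, modified and deleted.

    `start` and `end` are assumed to be dicts mapping (namespace, url)
    to article content. This layout is illustrative, not the ZIM API.
    """
    added = sorted(k for k in end if k not in start)
    deleted = sorted(k for k in start if k not in end)
    modified = sorted(k for k in end if k in start and start[k] != end[k])
    return added, modified, deleted


# Toy stand-ins for the contents of start_file and end_file.
start_file = {("A", "Paris"): b"old text", ("A", "Rome"): b"same"}
end_file = {("A", "Paris"): b"new text", ("A", "Rome"): b"same",
            ("A", "Oslo"): b"brand new"}

added, modified, deleted = classify(start_file, end_file)
print(added, modified, deleted)
```

Only the added and modified articles would need to be stored in the diff_file;
the deleted ones only need to be listed.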

There is an alternate approach, using the patience-diff algorithm (supported
by git), which will take much longer to complete, given the huge number of
articles.
The alternate approach is as follows:
For all modified articles, apply a diff algorithm (patience-diff preferred,
since it will [...]) between the original and final article, and store the
output instead of the entire new article.
Advantages: 
(i) Most updates in Wikipedia are tiny, so this approach will save a lot of
space, and the zimdiff file would be considerably smaller.
(ii) The computation is done on servers, and not very often (once every two
weeks or so), so time isn't really a constraint.

Disadvantages:
(i) More complex algorithm, and therefore more prone to errors.
(ii) More time required to compute.
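
A minimal sketch of the per-article diff idea follows. Python's standard
library has no patience-diff, so this uses difflib.SequenceMatcher as a
stand-in; the make_delta() and apply_delta() helpers and the delta layout are
assumptions for illustration, not a proposed on-disk format.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Return a list of operations that turn `old` into `new`.

    Unchanged runs are stored as ("copy", start, end) references into
    `old`; only inserted or replaced text is stored literally.
    """
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))
        # "delete" needs no entry: the old range is simply not copied.
    return delta


def apply_delta(old, delta):
    """Rebuild the new article from the old one plus the delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


old_article = "Paris is the capital of France. It has 2 million inhabitants."
new_article = "Paris is the capital of France. It has 2.1 million inhabitants."

delta = make_delta(old_article, new_article)
rebuilt = apply_delta(old_article, delta)
stored = sum(len(op[1]) for op in delta if op[0] == "data")
print(rebuilt == new_article)
print(stored, "literal characters stored instead of", len(new_article))
```

Character-level matching like this is quadratic in the worst case, so a real
implementation would more likely diff line by line; the point is only that a
small edit yields a delta much smaller than the full article.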


New articles in the metadata:

1. Hashes of start_file and end_file.
2. List of new articles, and their namespaces.
3. List of modified articles, and their corresponding old articles.
4. List of deleted articles.
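
As an illustration, the four entries above could be collected like this. It is
only a sketch: the key names, the choice of SHA-256 and the can_apply() check
are assumptions (the proposal names no hash algorithm, and using the hash to
verify the start_file is my guess at its purpose).

```python
import hashlib
import json


def file_hash(data):
    """Hash raw file bytes. SHA-256 is an assumption, not part of the proposal."""
    return hashlib.sha256(data).hexdigest()


def build_metadata(start_bytes, end_bytes, added, modified, deleted):
    """Collect the four proposed metadata entries in one dict.

    `added` maps each new article to its original namespace, `modified`
    maps each modified article to the old article it replaces, and
    `deleted` is a list of articles to remove.
    """
    return {
        "start_hash": file_hash(start_bytes),
        "end_hash": file_hash(end_bytes),
        "new_articles": added,
        "modified_articles": modified,
        "deleted_articles": deleted,
    }


def can_apply(metadata, start_bytes):
    """Check that a diff_file matches the start_file it is applied to."""
    return metadata["start_hash"] == file_hash(start_bytes)


meta = build_metadata(b"start contents", b"end contents",
                      added={"Oslo": "A"},
                      modified={"Paris": "Paris"},
                      deleted=["Atlantis"])
print(json.dumps(meta, indent=2))
print(can_apply(meta, b"start contents"))
print(can_apply(meta, b"some other file"))
```

After the update, the end_hash could be compared against the resulting file in
the same way.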

Please add your comments, especially on the method for storing updated
articles.

