https://bugzilla.wikimedia.org/show_bug.cgi?id=47406
--- Comment #4 from Kiran Mathew Koshy <[email protected]> ---
Regarding zimdiff, which is the first phase of my GSoC project, here is the file format I'm proposing:

The zimdiff tool will take two ZIM files as input. Let's call them start_file and end_file. The purpose is to generate a third file, diff_file, which can be used to update the contents of start_file to obtain end_file.

Files:
1. start_file
2. end_file
3. diff_file

File format for diff_file:

The diff_file can be a ZIM file itself, with a few additional metadata entries. The (non-metadata) articles inside the diff_file will be:

1. Articles that have been added to end_file and were not present in start_file. These will share a common namespace. The metadata will contain a list of the new articles and their corresponding namespaces, so that the original namespace information is not lost.
2. Articles that have been modified (the new version of each article). A list of modified articles will be kept in the metadata; using this list, the old articles can be removed and the new articles inserted when the update is applied.

There is an alternate approach, using the patience-diff algorithm (available in git), which will take much longer to complete, given the huge number of articles. The alternate approach is as follows:

For all modified articles, apply a diff algorithm (patience-diff preferred, since it will b) between the original and final article, and store the output instead of the entire new article.

Advantages:
(i) Most updates in Wikipedia are tiny, so this approach will save a lot of space, and the diff_file would be considerably smaller.
(ii) Computation is done on servers, and not very often (once every two weeks or so), so time isn't really a constraint.

Disadvantages:
(i) More complex algorithm, thereby prone to errors.
(ii) More time required to compute.

New entries in metadata:
1. Hashes of start_file and end_file.
2. List of new articles and their namespaces.
3. List of modified articles and their corresponding old articles.
4. List of deleted articles.

Please add your comments, especially on the method for storing updated articles.
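
To make the first approach (storing whole new and modified articles) a bit more concrete, here is a rough Python sketch. It is only an illustration under several assumptions: a ZIM file is modeled as an in-memory dict mapping (namespace, url) to the article bytes, no real ZIM library is used, and the names file_hash, make_diff and apply_diff are hypothetical. For simplicity it keeps each article under its original namespace instead of moving new articles into a common namespace.

```python
import hashlib


def file_hash(articles):
    """Hash all articles in a stable order (stand-in for hashing the ZIM file)."""
    h = hashlib.sha256()
    for namespace, url in sorted(articles):
        content = articles[(namespace, url)]
        h.update(namespace.encode("utf-8") + b"\0")
        h.update(url.encode("utf-8") + b"\0")
        # Length prefix keeps adjacent articles from running together.
        h.update(len(content).to_bytes(8, "big"))
        h.update(content)
    return h.hexdigest()


def make_diff(start, end):
    """Build a diff holding the proposed metadata lists plus full article bodies."""
    added = sorted(k for k in end if k not in start)
    deleted = sorted(k for k in start if k not in end)
    modified = sorted(k for k in end if k in start and start[k] != end[k])
    return {
        "metadata": {
            "start_hash": file_hash(start),
            "end_hash": file_hash(end),
            "added": added,
            "modified": modified,
            "deleted": deleted,
        },
        # Only new and modified articles are stored; unchanged ones are omitted.
        "articles": {k: end[k] for k in added + modified},
    }


def apply_diff(start, diff):
    """Update start with diff and check the result against the stored end hash."""
    meta = diff["metadata"]
    if file_hash(start) != meta["start_hash"]:
        raise ValueError("diff_file was not built against this start_file")
    result = dict(start)
    for key in meta["deleted"]:
        del result[key]
    # Inserting the stored bodies both adds new articles and replaces old ones.
    result.update(diff["articles"])
    if file_hash(result) != meta["end_hash"]:
        raise ValueError("result does not match end_file")
    return result
```

The two hashes let the updater refuse a diff_file that was built against a different start_file, and confirm afterwards that the result really is end_file. The alternate approach would replace the full bodies of modified articles with per-article diff output, leaving the metadata lists unchanged.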
