https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #5 from Kelson [Emmanuel Engelhart] <kel...@kiwix.org> ---
I also think we should use the ZIM format for the ZIM diff format. I'm not sure
I fully understand Kiran's proposal, but here is the current state of my
thoughts on this topic.

ZIM files contain:
* Articles: text, metadata, pictures, sounds, videos... Multimedia content
often makes up most of the size of a ZIM file
* A header with some technical data
* A sorted title list and a sorted url list

The ZIM diff file must contain the data needed to compute:
start_file + diff_file = end_file

So you need to be able to:
* Delete/replace/add articles
* Replace/update headers
* Rewrite url/title lists
... and all of this in a way that yields exactly the same end_file at the end
of the patch process.

To add articles, the easiest way is simply to add every article that is
available in the diff_file and not available in the start_file.

To replace an article, follow the same process as for adding one. I'm not a fan
of having text article diffs in the diff_file because (1) this makes everything
more complicated to code, (2) I'm not convinced at all that this saves a lot of
storage space, because Wikipedia articles are pretty heavily modified, and this
is especially true if you have a lot of multimedia content in the diff file,
and (3) this will really slow down the patch process... and this process will
already be pretty slow.

To delete articles, we need a list stored somewhere as metadata; see my
paragraph about additional metadata at the end of this comment.
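
To make the add/replace/delete rules above concrete, here is a minimal sketch
of how the content of a diff_file could be computed. It is only an
illustration: both files are modelled as plain Python dicts mapping url to
content, and make_diff is a hypothetical helper, not part of any existing ZIM
library.

```python
def make_diff(start_articles, end_articles):
    """Return (changed, delete_urls) for two {url: content} mappings.

    Hypothetical in-memory model: a real ZIM file is not a dict.
    """
    # New and modified articles are stored whole (no per-article text diff).
    changed = {
        url: content
        for url, content in end_articles.items()
        if start_articles.get(url) != content
    }
    # Urls present in start_file but absent from end_file must be listed
    # explicitly, because the patch process cannot guess them otherwise.
    delete_urls = sorted(url for url in start_articles if url not in end_articles)
    return changed, delete_urls
```

Unchanged articles do not appear in either result, so they cost nothing in
the diff_file.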

* To replace/update headers, I propose simply overwriting the old start_file
ones with the diff_file ones (except diff-specific metadata).

* Both url/title lists must be recomputed during the patch process... but IMO
we don't need to store anything related to that here.
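
As a rough illustration of why nothing needs to be stored for the lists: once
the final set of articles is known, both lists can be rebuilt by sorting. The
sketch below assumes a {url: title} dict and plain string ordering; the exact
ordering rules of the real ZIM format are not reproduced here, and
rebuild_indexes is a hypothetical name.

```python
def rebuild_indexes(articles):
    """articles: {url: title}. Return (url_list, title_list), both sorted."""
    # The url list is just the urls in sorted order.
    url_list = sorted(articles)
    # The title list holds the same urls ordered by title; ties are broken
    # by url so the result is deterministic.
    title_list = sorted(articles, key=lambda url: (articles[url], url))
    return url_list, title_list
```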

So, we could end up with a diff file containing only the new/modified articles,
new metadata values and a few additional diff infos, which could be stored in a
special metadata entry "M/Diff" with a couple of values. I see at least two
mandatory values:
* "originuuid", containing the uuid of the start_file
* "deleteUrls", with the list of urls to be deleted.
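
Putting the pieces together, the patch process could look roughly like the
sketch below. It is an assumption-laden illustration, not an implementation:
files are modelled as dicts with "uuid", "metadata" and "articles" keys, the
"M/Diff" entry is modelled as a nested dict, the end_file uuid is assumed to
come from the diff_file header, and apply_diff is a hypothetical name.

```python
def apply_diff(start, diff):
    """Compute end_file from start_file + diff_file (in-memory model)."""
    info = diff["metadata"]["M/Diff"]

    # Refuse to patch a file the diff was not made for.
    if info["originuuid"] != start["uuid"]:
        raise ValueError("diff_file was not made for this start_file")

    # Work on copies so start_file is left untouched.
    articles = dict(start["articles"])
    for url in info["deleteUrls"]:
        articles.pop(url, None)
    # Added and replaced articles are handled the same way: overwrite.
    articles.update(diff["articles"])

    # Overwrite header/metadata values, except the diff-specific entry.
    metadata = dict(start["metadata"])
    metadata.update(
        {key: value for key, value in diff["metadata"].items() if key != "M/Diff"}
    )

    return {"uuid": diff["uuid"], "metadata": metadata, "articles": articles}
```

The url/title lists are deliberately absent from this model, since they would
be recomputed from the resulting articles afterwards.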

-- 
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
