[Bug 47406] Incremental update: zimdiff zimpatch

2014-10-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

Andre Klapper aklap...@wikimedia.org changed:

   What|Removed |Added

   Priority|Unprioritized   |Lowest

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2014-10-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

Nemo federicol...@tiscali.it changed:

   What|Removed |Added

   Priority|Lowest  |Normal

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2014-02-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #19 from Nemo federicol...@tiscali.it ---
(In reply to Gerrit Notification Bot from comment #18)
 Change 82449 had a related patch set uploaded by Nemo bis:
 Incremental update: zimdiff and zimpatch
 
 https://gerrit.wikimedia.org/r/82449

Has anyone ever used that code?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2014-02-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #20 from Kelson [Emmanuel Engelhart] kel...@kiwix.org ---
We have currently have a bottleneck on side code reviewing at openzim.
I have to work to integrate this, it's not forgotten.
I use this code yes and it works, although I still want to fix a few details
before releasing.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2014-02-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #18 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 82449 had a related patch set uploaded by Nemo bis:
Incremental update: zimdiff and zimpatch

https://gerrit.wikimedia.org/r/82449

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-10-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #17 from Quim Gil q...@wikimedia.org ---
If you have open tasks or bugs left, one possibility is to list them at
https://www.mediawiki.org/wiki/Google_Code-In and volunteer yourself as mentor.

We have heard from Google and free software projects participating in Code-in
that students participating in this programs have done a great work finishing
and polishing GSoC projects, many times mentores by the former GSoC student.
The key is to be able to split the pending work in little tasks.

More information in the wiki page. If you have questions you can ask there or
you can contact me directly.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-10-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #16 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
GSoC 2013 Summary:

Thanks to everyone, I managed to pass GSoC 2013. Its been a great summer, and
I'll be sticking around for more.

Regarding zimdiff and zimpatch- Both tools are ready for action- 
https://github.com/wikimedia/openzim

The Kiwix integration is not complete yet, there is some UI stuff left, and
some work in linking the zimpatch class with javascript. I will be working with
these in the near future.

https://github.com/kiwix/kiwix_mirror

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-09-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #15 from Kelson [Emmanuel Engelhart] kel...@kiwix.org ---
@Quim

The project is for now not completed. Kiran is still working on it and will
continue until the deadline: monday. What is actually done represents around
70% of the whole project and I have some expectations that Kiran could reach
80%. Kiran is more optimistic.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-09-17 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #14 from Quim Gil q...@wikimedia.org ---
GSoC soft pencils down date was yesterday and all coding must stop on 23
September. Has this project been completed?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-09-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #13 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 82449 had a related patch set uploaded by Kiran mathew koshy 1993:
Bug 47406 Incremental update: zimdiff and zimpatch.

https://gerrit.wikimedia.org/r/82449

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-09-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

Gerrit Notification Bot gerritad...@wikimedia.org changed:

   What|Removed |Added

 Status|NEW |PATCH_TO_REVIEW

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-07-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #12 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
Zimdiff and zimpatch has been completed, and currently we are in the testing
phase.

I've fixed most of the bugs, but i'm sure there are some left. For instance, we
haven't achieved a checksum match between the generated zim file and the
original zim file. (The checksum matches in smaller zim files,but fails in
larger files.)
One probable reason is because of different compression algorithms used in
older files.

Everyone is invited to check out the code and post their comments.

zimdiff:
https://github.com/kiranmathewkoshy/zimdiff

zimpatch:
https://github.com/kiranmathewkoshy/zimpatch

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #11 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
Zimdiff:

https://github.com/kiranmathewkoshy/zimdiff.git

right on schedule.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #5 from Kelson [Emmanuel Engelhart] kel...@kiwix.org ---
I also think we should use the ZIM format for the ZIM diff format. I'm not sure
to fully understand the proposition of Kiran, but here is the actual status of
my thoughts about this topic.

ZIM files contain:
* articles: text, metadata, pictures, sounds, videos... Multimedia contents
often compose the majority of the size of a ZIM file
* A header with a few technical data
* A sorted title list and a sorted url list

The ZIM diff file must contain the necessary data to allow to make:
start_file + diff_file = end_file

So you need to be able to:
* delete/replace/add articles
* replace/update headers
* Rewrite url/title lists
... and everything in a way you will get exactly the same end_file at the end
of the patch process.

To add articles, the easier way is simply to add all article which are
available in the diff_file and not available in the start_file.

To replace article, follow the same process like to add article. I'm not a fan
of having text article diffs in the diff_file because (1) This makes everything
more complicated to code (2) I'm not convinced at all that this saves a lot of
mass storage space because Wikipedia articles are pretty heavily modified. This
is true especially if you have many multimedia contents in the diff file (3)
This will really slow-down the patch process... and this process will already
be pretty slow.

To delete articles, we need a list somewhere as a metadata, see my paragraph
about additional metadata at the end of this comment.

* To replace/update headers, I propose simply to overwrite the old start_file
ones with the diff_files ones (except diff specific metadata).

* Both url/title list must be recomputed during the patch process... but we
don't need IMO to store anything related to that here.

So, we could achieve to have a diff file with only the new/modified articles,
new metadata values and a few aditional diff infos which could be stored in a
special metadata entry M/Diff with couple of values. I see at least two
mandatory values:
* originuuid, containing the uuid of the start_file
* deleteUrls, with the list of urls to be deleted.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #6 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
Yes, a diff algorithm would make the code quite complex.

However, I don't think the patch process would be slowed down if a diff
algorithm is implemented, since the most time consuming portion involving file
diff is the creation of the diff file.

How about if we leave a provision for the diff algorithm ?
 i.e, for an updated article, either the entire new article can be stored in
the diff_file, or the output of the diff algorithm can be stored.
Once the rest of the code is implemented, the diff algorithm can be done, as an
alternative to storing the entire article.

Once completed, the final version of the program will compute the diff of the
articles, and use the smaller of the two(diff and new article).

A list of articles whose diff is given will be maintained in metadata.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #7 from Kelson [Emmanuel Engelhart] kel...@kiwix.org ---
@ Kiran

You proposition makes sense to me but I would propose to make the zimdiff tool
working fine without this before working on that. So, if everything works well
and you have time left, then this would be an interesting improvement.

My last concern is realated to the ZIM format internals: the ability to produce
exactly the same file end_file at the end of the patch process. That means that
the articles are exactly stored in the same clusters, in the same order, etc...
Do you have invested time on that to see how this could be guaranteed?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

Christian Puehringer c...@gmx.at changed:

   What|Removed |Added

 CC||c...@gmx.at

--- Comment #8 from Christian Puehringer c...@gmx.at ---
I agree with that it is a good idea to support both replacement of entire
articles and diffs. 
This is because I believe that omission of diff may have a huge impact on file
size: In particular articles which are updated infrequently, and there are a
lot of them, often have changes which only affect a single line, such as
spelling fixes or interwiki link updates. Thus not supporting diffs may lead to
very large diff file. 
On the other hand, when diff-support is available, it still may make sense
to store complete articles instead of diffs for some articles, for example
to reduce processing effort in diff creation, but also merge.

One other point which is worth considering is disk space usage during merge. In
particular on mobile devices, when the planned download manager is implemented,
the incremental update capability would be very useful. However, if the update
process requires free space for the old, the diff, and the new file, it won't
be useable in many cases because there is not enough freespace on mobile
devices.
Therefore something like in-place updating the old zim file, and merging while
dowloading the diff would make sense.
For sure this is not trivial, and hardly possible to be implemented in the
GSoC, but it would make sense to keep this in mind when defining the diff and
merge processes and the diff file format, so that this feature could be added
in the future. For example, with such an approach it is important that the zim
file stays consistend during the update, and later resuming is possible.
Note: If actual in-place update is not feasible, the zim-split feature could be
used, e.g. on download of full zim file is split in let's say 64 MB chunks. On
update new chunk (size may change) is written completely before old is deleted.
Thus during update only 64 MB additional storage is required, while without
in-place update tens of GB could be necessary.

And other interesting feature could be to support on-the-fly merge: When this
feature is used the diff zim file is not merged during update. Instead it is
just stored besides the old zim file, and the zimlib merges the articles from
both (or more) files on access. Benefit is that the end user does not need to
run the probably pretty long running merge task. In addition it is faster,
because no recompression needs to be done. Furthermore, no additional storage
during update is required. Drawback is that is uses more storage, thus it
depends on the size of the diff-files, whether this approach is actually
feasible. 
Anyway, I think its worth considering this feature, as it should not require
much additional effort, in fact it could even be an intermediate step for the
currently planned approach.
For sure for this on-the-fly approach supporting multiple deltas would make
sense, but this could be implemented later.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #9 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
http://openzim.org/wiki/Zimdiff 
Page has been created for zimdiff.
I have added the pseudo code there.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #10 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
@Christian Puehringer
Although I don't fully understand the in-place update, I think it is a good
idea to support the on-the-fly merge- This can be quite useful on mobile
devices where the zimpatch would take quite a long time.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-06-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

--- Comment #4 from Kiran Mathew Koshy kiranmathewko...@gmail.com ---
Regarding zimdiff, which is the first phase of my GSoC project, here is the
file format I'm proposing:

The zimdiff tool will take two zim files as input. Lets call them start_file
and end_file. The purpose is to generate a third file, diff_file, which can be
used to update the contents of start_file to obtain end_file.

Files: 
1.start_file
2.end_file
3.diff_file

File format for diff_file:

The diff_file can be a ZIM file itself, with a few additional metadata entries.
The articles(non-metadata) inside the diff_file will be:

1.the articles that have been added to the  end_file, which were not present in
start_file. These will share a common namespace. There will be a list of new
articles in the metadata and their corresponding namespaces, so that the
original namespace information is not lost.

2. Articles that have been modified.(the new version of the article).A list of
articles that have been modified will be kept among metadata, and using this
list, the old articles can be removed and the new articles inserted when the
update is done.

There is an alternate approach, using the patience-diff algorithm (used by
git), which will take a much longer time to complete, given the huge number of
articles.
The alternate approach is as follows:
For all modified articles, apply a diff(patience-diff preferred, since it will
b) algorithm between the original and final article and store the output
instead of the entire new article.
Advantages: 
(i) Most of the updates in Wikipedia are tiny ones, so this approach will free
a lot of space, and the zimdiff file would be considerably smaller.
(ii) Computation is done in servers, and not very often(once in two weeks or
so), so time isn't exactly a constraint.

Disadvantages:
(i) More complex algorithm, thereby prone 
(ii) More time required to compute. 


New articles in Metadata:

1. Hashes of the start_file and end_file.
2. List of new articles, and their namespaces
3. List of modified articles, and their corresponding old articles.
4. List of deleted articles.

Please add your comments, especially on the method for storing updated
articles.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 47406] Incremental update: zimdiff zimpatch

2013-05-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

Kelson [Emmanuel Engelhart] kel...@kiwix.org changed:

   What|Removed |Added

Summary|zimdiff  zimpatch  |Incremental update: zimdiff
   || zimpatch

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l