[Wikitech-l] GSoC 2013 Summary : Incremental updates for Kiwix

2013-10-08 Thread Kiran Mathew Koshy
Hello everyone,

This summer, I worked on the project ZIM incremental updates for Kiwix
(https://www.mediawiki.org/wiki/User:Kiran_mathew_1993/ZIM_incremental_updates_for_Kiwix)
as part of GSoC 2013, under my mentors Emmanuel Engelhart and Tommi
Mäkitalo.


The tools zimdiff and zimpatch, used for incremental updates to a ZIM
file, have been created. zimdiff creates a diff file between two versions
of a ZIM file, and zimpatch applies the diff file to the original file to
obtain the new version. These tools have been added as classes to the
existing zimlib library (part of the openZIM project).
Some integration into Kiwix has been done, mostly on the server side. Part
of the client-side integration is still left, and I am working on it.
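
To illustrate the idea (the real zimdiff and zimpatch are C++ classes
inside zimlib; the snippet below is only a conceptual sketch that models
an archive as a plain dict of URL -> content, an assumption made for the
example and not the zimlib API):

# Conceptual sketch only: an "archive" is modelled as a dict mapping
# article URL -> content. This is not the zimlib API.

def zim_diff(old, new):
    """Record what changed between two versions of an archive."""
    upsert = {url: data for url, data in new.items() if old.get(url) != data}
    delete = [url for url in old if url not in new]
    return {"upsert": upsert, "delete": delete}

def zim_patch(old, diff):
    """Apply a diff to the old version to reproduce the new one."""
    patched = dict(old)
    for url in diff["delete"]:
        patched.pop(url, None)
    patched.update(diff["upsert"])
    return patched

# Example: patching version 1 with the diff yields version 2.
v1 = {"A/Foo": "foo v1", "A/Bar": "bar"}
v2 = {"A/Foo": "foo v2", "A/Baz": "baz"}
assert zim_patch(v1, zim_diff(v1, v2)) == v2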

https://github.com/kiranmathewkoshy/kiwix_mirror
https://github.com/kiranmathewkoshy/openzim

It's been a great experience working with my mentors, and I intend to
stick around for more.
-- 
Kiran Mathew Koshy
Electrical Engineering,
IIT Patna,
Patna,
India.


[Wikitech-l] GSoC: Incremental updates for Kiwix Reader (offline Wikipedia) - Application

2013-05-01 Thread Kiran Mathew Koshy
Hello,

I have submitted my project application for GSoC '13. Please review it.

Link: https://www.mediawiki.org/wiki/User:Kiran_mathew_1993

Thanks,

-- 
Kiran Mathew Koshy
Electrical Engineering,
IIT Patna,
Patna,
India.

Re: [Wikitech-l] GSoC Project

2013-04-29 Thread Kiran Mathew Koshy
First of all, let me thank everyone who has commented on this thread.
Sorry about not responding earlier; my exams are going on. You can
certainly expect more responses from me once they are over.


On Tue, Apr 30, 2013 at 4:18 AM, Emmanuel Engelhart
<emman...@engelhart.org> wrote:

 Dear Kiran

 Before commenting on your proposal, let me thank:
 * Quim for having renamed this thread... I wouldn't have got a chance to
 read it otherwise.
 * Gnosygnu and Sumana for their previous answers.

 Your email points to three problems:
 (1) The size of the offline dumps
 (2) Server mode of the offline solution
 (3) The need for incremental updates

 Regarding (1), I disagree. We have the ZIM format, which is open, has an
 extremely efficient standard implementation, provides high compression
 rates and fast random access: http://www.openzim.org

 Regarding (2), Kiwix, which is a ZIM reader, already does it: you can
 either share Kiwix on a network disk or use Kiwix's HTTP-compatible
 daemon called kiwix-serve: http://www.kiwix.org/wiki/Kiwix-serve

 Regarding (3), I agree. This is an old feature request in the openZIM
 project. It's both on the roadmap and in the bug tracker:
 * http://www.openzim.org/wiki/Roadmap
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=47406

 But I also think the solution you propose isn't adapted to the problem.
 Setting up a MediaWiki is not easy, it's resource-intensive, and you
 don't need all that power (of the software setup) for the usage you have
 in mind.



 On the other hand, with ZIM you have a format which provides everything
 you need and runs on devices which cost only a few dozen USD, and we
 will make this incremental update trivial for the end user (it's just
 a matter of time ;).



I don't think power is much of a priority, but I agree the ZIM format
would be easier, since the reader works directly from the ZIM file.




 So, to fix that problem, here is my approach: we should implement two
 tools that I call zimdiff and zimpatch:
 * zimdiff is a tool able to compute the difference between two ZIM files
 * zimpatch is a tool able to patch a ZIM file with a ZIM diff file

 The incremental update process would be:
 * Compute a ZIM diff file (done by the ZIM provider)
 * Download and patch the old ZIM file with the ZIM diff file (done by
 the user)

 We could implement two modes for zimpatch, lazy and normal:
 * lazy mode: simple merge of the files and rewriting of the index (fast,
 but needs a lot of mass storage)
 * normal mode: recompute a new file (slow, but needs less mass storage)

 Regarding the ZIM diff file format... the discussion is open, but it
 looks like we could simply reuse the ZIM format, and zimpatch would work
 like a zimmerge (which does not exist; it's just for the explanation).

 Everything could be done, IMO, in only a few hundred smart lines of
 C++. I would be really surprised if this needs more than 2000 lines.
 But, to do that, we need a pretty talented C++ developer, maybe you?


Yes, this is quite an easy task. I can do this. I can go through the ZIM
format and the zimlib library in a few days.

Regarding *zimpatch*, I think it would be better to implement both
methods (although I prefer the second one). The user can then select the
one they want, depending on their configuration.
Lastly, we can add *zimdiff as an automated task on the server*. zimpatch
and downloading the ZIM file can also be automated and added to Kiwix.
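
A rough sketch of what such a server-side task could look like, assuming
zimdiff can be invoked as a command-line tool taking the old file, the
new file, and an output path (the invocation below is a placeholder, not
the tool's documented usage):

# Hypothetical automation: whenever a new full ZIM dump appears, build a
# diff against the previous one and publish it next to the full dumps.
# The zimdiff arguments below are placeholders; check the real usage.
import subprocess
from pathlib import Path

DUMP_DIR = Path("/srv/zim")        # where full ZIM dumps land (assumed)
DIFF_DIR = Path("/srv/zim/diffs")  # where diff files are published (assumed)

def publish_diff(old_zim: Path, new_zim: Path) -> Path:
    DIFF_DIR.mkdir(parents=True, exist_ok=True)
    diff_path = DIFF_DIR / (old_zim.stem + "_to_" + new_zim.stem + ".zimdiff")
    subprocess.run(["zimdiff", str(old_zim), str(new_zim), str(diff_path)],
                   check=True)
    return diff_path

if __name__ == "__main__":
    dumps = sorted(DUMP_DIR.glob("*.zim"))
    if len(dumps) >= 2:
        print("published", publish_diff(dumps[-2], dumps[-1]))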


If there's time left, I can port the zimlib library to Python or PHP, so
it becomes easier for people to hack on.

If you have any more suggestions, please comment. I'll submit the
proposal in ~12 hours (again, exams).


 If you or someone else is interested, we would probably be able to find
 a tutor.

 Kind regards
 Emmanuel

 PS: Wikimedia has an offline-centric mailing list, let me add it in CC:
 https://lists.wikimedia.org/mailman/listinfo/offline-l

[Wikitech-l] GSoC Project Idea

2013-04-26 Thread Kiran Mathew Koshy
Hi guys,

I have an idea of my own for my GSoC project that I'd like to share with
you. It's not a perfect one, so please forgive any mistakes.

The project is related to the existing GSoC project *Incremental Data
dumps*, but is in no way a replacement for it.


*Offline Wikipedia*

For a long time, a lot of offline solutions for Wikipedia have sprung up
on the internet. All of these have been unofficial solutions, and all
have limitations. A major problem is the *increasing size of the data
dumps*, and the problem of *updating the local content*.

Consider the situation in a place where internet access is costly or
unavailable. (For the purpose of discussion, let's consider a school in a
third-world country.) Internet speeds are extremely slow, and accessing
Wikipedia directly from the web is out of the question.
Such a school would greatly benefit from an instance of Wikipedia on a
local server. Up to here, the school can use any of the freely available
offline Wikipedia solutions to make a local instance. The problem arises
when the database in the local instance becomes obsolete. The client is
then required to download an entire new dump (approx. 10 GB in size) and
load it into the database.
Another problem is that most third-party programs *do not allow network
access*, so a new instance of the database is required (approx. 40 GB) on
each installation. For instance, in a school with around 50 desktops,
each desktop would require a 40 GB database. Plus, *updating* them
becomes even more difficult.

So here's my *idea*:
Modify the existing MediaWiki software to add a few PHP/Python scripts
which will automatically update the database and run in the background.
(Details on how the update is done are described later.)
Initially, the modified MediaWiki will take an XML dump / SQL dump (SQL
dump preferred) as input and create the local instance of Wikipedia.
Later on, updates will be added to the database automatically by the
script.

The installation process is extremely easy: it just requires a server
package like XAMPP and the MediaWiki bundle.


Process of updating:

There will be two methods of updating the server. Both will be
implemented in the MediaWiki bundle. Method 2 requires the functionality
of incremental data dumps, so it can be completed only after that
functionality is available. Perhaps I can collaborate with the student
selected for incremental data dumps.

Method 1 (online update): A list of all pages is made and published by
Wikipedia. This can be in an XML format. The only information in the XML
file will be the page IDs and the last-touched date. This file will be
downloaded by the MediaWiki bundle, and the page IDs will be compared
with the pages of the existing local database.

Case 1: A page ID in the XML file that is not in the local database:
denotes a new page.
Case 2: A page which is present in the local database but not among the
page IDs: denotes a deleted page.
Case 3: A page whose 'last touched' date in the XML file differs from the
one in the local database: denotes an edited page.

In each case, the change is made in the local database, and if the new
page data is required, it is obtained using the MediaWiki API.
These offline instances of Wikipedia will only be used in cases where
internet speeds are very low, so they *won't cause much load on the
servers*.
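
A minimal sketch of how such an update check might look, assuming a
hypothetical published page list of the form
<pages><page id="..." touched="..."/></pages> and a local mapping of page
ID to last-touched date; the page-text fetch uses the standard MediaWiki
API (action=query with prop=revisions):

# Sketch of the online-update check (Method 1). PAGELIST_URL and the
# local_pages mapping are assumptions made for the example; the text
# fetch uses the standard MediaWiki API (action=query & prop=revisions).
import json
import urllib.request
import xml.etree.ElementTree as ET

PAGELIST_URL = "https://example.org/pagelist.xml"  # hypothetical page list
API_URL = "https://en.wikipedia.org/w/api.php"

def load_remote_pages():
    """Parse the published list into {page_id: last_touched}."""
    with urllib.request.urlopen(PAGELIST_URL) as resp:
        root = ET.parse(resp).getroot()
    return {int(p.get("id")): p.get("touched") for p in root.iter("page")}

def classify(local_pages, remote_pages):
    """Split page IDs into the three cases: new, deleted, edited."""
    new = [pid for pid in remote_pages if pid not in local_pages]
    deleted = [pid for pid in local_pages if pid not in remote_pages]
    edited = [pid for pid in remote_pages
              if pid in local_pages and local_pages[pid] != remote_pages[pid]]
    return new, deleted, edited

def fetch_wikitext(page_id):
    """Fetch the current wikitext of one page via the MediaWiki API."""
    query = ("?action=query&prop=revisions&rvprop=content&format=json"
             "&pageids=" + str(page_id))
    with urllib.request.urlopen(API_URL + query) as resp:
        data = json.load(resp)
    page = data["query"]["pages"][str(page_id)]
    return page["revisions"][0]["*"]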

Method 2 (offline update): (Requires the functionality of the existing
project Incremental data dumps.)
In this case, the incremental data dumps are downloaded by the user
(admin) and fed to the MediaWiki installation the same way the original
dump is fed (as a normal file), and the corresponding changes are made by
the bundle. Since I'm not aware of the XML format used in incremental
updates, I cannot describe it now.

Advantages: An offline solution can be provided for regions where
internet access is a scarce resource. This would greatly benefit
developing nations, and would help in making the world's information
more freely and openly available to everyone.

All comments are welcome!

PS: About me: I'm a 2nd-year undergraduate student at the Indian
Institute of Technology, Patna. I code for fun.
Languages: C/C++, Python, PHP, etc.
Hobbies: CUDA programming, robotics, etc.

-- 
Kiran Mathew Koshy
Electrical Engineering,
IIT Patna,
Patna
