https://bugzilla.wikimedia.org/show_bug.cgi?id=21860

--- Comment #21 from John Erling Blad <[email protected]> 2010-07-16 01:54:41 UTC ---
It seems the arguments are starting to shift: the discussion is now more about
whether such functionality should be allowed at all than about whether there
are real reasons not to supply it. Trying to stop or hinder the additional
services seems rather counterproductive, so I don't want to argue about why and
how that should be done.

If the API supplies some functionality, it must be assumed that someone will
use it to build additional services. In this case the base functionality is
supplying the text of revisions. Transferring that text takes a long time and
makes some additional services difficult, though not impossible, to implement.
At the same time the servers take on additional load from transferring the
content.

The services _are_ possible to implement, and someone will build them. I have
made some such services myself, and surely someone else will too. In this
situation it is wise to carefully consider which measures can be taken to
decrease the load on the system. Right now we are stuck with a solution that
creates a very large load on the servers.

By calculating a digest on the servers instead of in the clients, the total
load from transferring the information will be lower. If the digests are
calculated on each call, it would probably be wise to use digests that are
cheap to compute. This still puts some strain on the database servers, as they
have to serve the text for each revision. By storing the digests in the
database, more expensive digests can be used, even costly locality-sensitive
hashing functions. All such digests are fast to transfer, decreasing the load
on the servers.
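To make the bandwidth argument concrete, here is a minimal sketch in Python (MediaWiki itself is PHP, so this is purely illustrative): the server hashes the revision text once and the API can then return a 32-character digest instead of the full text.

```python
import hashlib

def revision_digest(text: str) -> str:
    """Return a hex MD5 digest of a revision's text.

    Computed server-side, this lets an API response carry 32 characters
    per revision instead of the full wikitext, which may be kilobytes.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()

old = "The quick brown fox jumps over the lazy dog."
new = "The quick brown fox jumps over the lazy dog."
# Equal digests imply, with overwhelming probability, equal content.
print(revision_digest(old) == revision_digest(new))  # True
```

A client comparing two revisions for equality never needs the text at all, only the two digests.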

I propose that we store at least one digest in the database. This should be
stored in such a way that it can be used for revert detection on the recent
changes page. It should probably be something like MD5 over an uncompacted
version of the content; check out string grammar if it's unknown to you. It
should also be possible to use it for revisions in the API, to support services
like WikiDashboard. There are probably several lists that should be adjusted
accordingly. To support services like history flow we need something like a
nilsimsa digest on identifiable blocks in the content. I would propose that
each paragraph gets a nilsimsa digest. I'm not sure this needs to be stored in
the database, as it will be requested rather seldom; it can probably be
calculated on the fly.
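The per-block digest idea can be sketched as follows (Python for illustration; MD5 stands in for nilsimsa, which is not in any standard library, so this version tracks only exact paragraph matches rather than near-matches, and the blank-line paragraph split is a simplification of real wikitext structure):

```python
import hashlib

def paragraph_digests(wikitext: str) -> list[str]:
    """Digest each paragraph separately so a history-flow-style tool can
    track which blocks survive, move, or vanish between revisions.

    Paragraphs are taken to be blank-line separated, a simplification.
    MD5 here is a stand-in for a locality-sensitive hash like nilsimsa:
    it detects identical paragraphs, not merely similar ones.
    """
    paras = [p.strip() for p in wikitext.split("\n\n") if p.strip()]
    return [hashlib.md5(p.encode("utf-8")).hexdigest() for p in paras]

rev1 = "Intro paragraph.\n\nA stable paragraph."
rev2 = "A stable paragraph.\n\nIntro paragraph, now edited."
# Digests shared between revisions mark paragraphs that moved unchanged.
shared = set(paragraph_digests(rev1)) & set(paragraph_digests(rev2))
print(len(shared))  # 1: only "A stable paragraph." survived verbatim
```

With a real nilsimsa implementation the set intersection would become a nearest-neighbour comparison, so lightly edited paragraphs could also be matched.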

I guess at least one digest should be stored in the recent changes table for
page revisions. If digests are missing for older entries that shouldn't do
much harm; the interesting thing here is to detect whether a new version is
unique within a short timeframe. I would keep digests for comparison in
memcached for, let's say, the 4-5 last revisions of an article. Note that this
digest must be reasonably collision-free across the complete set of articles
and revisions, i.e. something like MD5 or another very long digest.
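A sketch of that revert check, again in Python for illustration, with a plain in-process dict of bounded deques standing in for memcached (the page names, the window of 5, and the function names are all my own placeholders):

```python
import hashlib
from collections import defaultdict, deque

# Stand-in for memcached: per page, the digests of the last 5 revisions.
_recent = defaultdict(lambda: deque(maxlen=5))

def record_revision(page: str, text: str) -> bool:
    """Record a new revision's digest for the page.

    Returns True when the digest exactly matches one of the recently
    cached digests, i.e. the edit restores a recent version and can be
    flagged as a revert on the recent changes page.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    is_revert = digest in _recent[page]
    _recent[page].append(digest)
    return is_revert

record_revision("Example", "original text")          # False: first sighting
record_revision("Example", "vandalised text")        # False: new content
print(record_revision("Example", "original text"))   # True: back to a cached version
```

Because the window only holds a handful of digests per page, the memory cost stays trivial while still catching the common revert-within-minutes pattern.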

In the revisions table, at least one digest should be stored to identify
revisions. Such digests can be more lightweight and be computed over compacted
versions of the content. They can be rather short integers. Remember that they
are used to detect similar versions, not the larger sets from the recent
changes table.
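One possible shape for such a lightweight digest (my own sketch, not an implementation proposal): compact the content by collapsing whitespace and case, then take a cheap 32-bit hash such as CRC32. Collisions are tolerable here because the digest only nominates candidate similar revisions for a closer comparison.

```python
import re
import zlib

def compact_digest(text: str) -> int:
    """A short integer digest over a compacted form of the content.

    Lower-casing and collapsing whitespace means trivially reformatted
    versions hash identically. CRC32 is just a placeholder for any cheap
    32-bit hash; matches are candidates, not proof of similarity.
    """
    compacted = re.sub(r"\s+", " ", text).strip().lower()
    return zlib.crc32(compacted.encode("utf-8"))

a = "Some   article text.\n"
b = "some article TEXT."
print(compact_digest(a) == compact_digest(b))  # True: same compacted form
```

An integer column like this is also far cheaper to index and compare in the database than a 128-bit MD5 string.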

For history flow I would compute the digests on the fly in the web server. Such
computations would be rather heavy, but they would be performed seldom.
