https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
--- Comment #21 from John Erling Blad <[email protected]> 2010-07-16 01:54:41 UTC --- It seems the argument is starting to shift: this is now more a discussion about whether such functionality should be allowed than about whether there are real reasons not to supply it. Trying to stop or hinder the additional services seems rather counterproductive, so I don't want to argue about why and how that should be done. If the API supplies some functionality, it must be assumed that someone will use it to make additional services available.

In this case the base functionality is supplying the text of revisions. That text takes a long time to transfer, which makes some additional services difficult, but not impossible, to implement, and at the same time the transfers put additional load on the servers. The services _are_ possible to implement and someone will build them; I have made some such services myself, and surely someone else will too. In this situation it is wise to carefully consider which measures can be taken to decrease the load on the system. Right now we are stuck with a solution that creates a very large load on the servers.

By calculating a digest on the servers instead of in the clients, the total load from transferring the information will be lower. If the digests are calculated on each call, it would probably be wise to use digests that are cheap to compute; this still puts some strain on the database servers, as they have to serve the text for each revision. By storing the digests in the database, heavier digests can be used, and even some expensive-to-compute locality hashing functions. All such digests are fast to transfer, decreasing the load on the servers.

I propose that we store at least one digest in the database. It should be stored in such a way that it can be used for revert detection on the recent changes page.
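A minimal sketch of the server-side digest idea, assuming a simple whitespace normalization before hashing (the function names and the normalization rule are illustrative, not MediaWiki's actual scheme):

```python
import hashlib

def revision_digest(text: str) -> str:
    """Digest of a normalized copy of the revision text, so content that
    differs only in trailing whitespace or line endings hashes the same."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_revert(new_text: str, earlier_digests: list[str]) -> bool:
    """A new revision whose digest matches an earlier revision's digest
    restores that earlier content exactly: a candidate revert."""
    return revision_digest(new_text) in earlier_digests
```

With digests available server-side, clients comparing revisions fetch only short hex strings instead of the full revision texts, which is exactly the transfer-load saving argued for above.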
Probably this should use something like MD5 on an uncompacted version of the content (check out string grammar if it's unknown to you). It should also be possible to use the digest for revisions in the API, to support services like WikiDashboard. There are probably several lists that should be adjusted accordingly.

To support services like history flow we need something like a nilsimsa digest on identifiable blocks in the content. I would propose that each paragraph gets a nilsimsa digest. I'm not sure this needs to be stored in the database, as it will be requested rather seldom; probably the digests can be calculated on the fly.

I guess at least one digest should be stored in the recent changes table for page revisions. If older entries lack digests, that shouldn't do much harm; the interesting thing here is to detect whether a new version is unique within a short timeframe. I would keep digests for comparison in memcached for, let's say, the 4-5 last revisions of an article. Note that this digest must be reasonably collision-free over the complete set of articles and revisions, that is, something like MD5 or another very long digest.

The revisions table should store at least one digest to identify revisions. Such digests can be more lightweight and use compacted versions of the content; they should be rather short integers. Remember that those will be used to detect similar versions, not the larger sets from the recent changes table. For history flow I would compute the digests on the fly in the web server. Such computations would be rather heavy, but they will be done seldom.
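To illustrate the per-paragraph locality digests computed on the fly, here is a sketch using a simhash-style construction built on MD5 as a stand-in for nilsimsa (which is not in the Python standard library). Like nilsimsa, it is locality-sensitive: similar paragraphs yield fingerprints with a small Hamming distance, while ordinary digests like MD5 change completely on any edit.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Locality-sensitive fingerprint: each whitespace token votes on
    every bit position; similar token sets give similar fingerprints."""
    votes = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def paragraph_digests(wikitext: str) -> list[int]:
    """One locality digest per non-empty paragraph, computed on the fly
    rather than stored, as proposed above for history-flow support."""
    return [simhash(p) for p in wikitext.split("\n\n") if p.strip()]
```

A history-flow service could then match paragraphs across revisions by picking, for each paragraph, the earlier paragraph with the smallest Hamming distance below some threshold.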
