https://bugzilla.wikimedia.org/show_bug.cgi?id=52056
Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #15 from Philippe Verdy <[email protected]> ---
It is so easy to derive a spammed image by changing a few random bits in it (for example within invisible embedded metadata such as camera info or the creator software version string, or by adding randomly selected backgrounds around the bad image) that I think checking the SHA1 digest is of little use for detecting spammed images.

SHA1 is the wrong method to identify spammed images. A much better method would be based on image subsampling, with a distance threshold on color plane values, ignoring all metadata fields but taking the ICC profiles into account to produce the accurate final colors before subsampling.

An image could be identified by building bounding boxes between the most contrasting pixels, in order to eliminate the effect of realigning the image with custom internal margins of variable sizes. Once this is done, the subsampling can be correctly aligned to a box of 512x512 pixels (if the image is not square, its smaller dimension will be set between 256 and 512 and its larger dimension to a multiple of 512, creating a horizontal or vertical band of 512x512 squares). SHA1 can then be computed on subblocks of 8x8 pixels, and the number of common subblocks gives a score for possible copies. Above some threshold, this score would raise an alert for human inspection, in a specific category or report showing the two images (the one identified as spam or as infringing a copyright, and the new image).

There probably exist newer algorithms to help match comparable images.
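As a rough illustration of the block-hash idea described above, here is a minimal Python sketch. It is not an existing MediaWiki feature, and every name in it is hypothetical. It makes several simplifying assumptions: images are plain lists of rows of 8-bit gray values (no file decoding, no ICC handling), margins are detected by comparing against the top-left pixel, the image is squashed to a 32x32 square instead of a band of 512x512 squares, and a fixed quantization step stands in for the "distance threshold".

```python
import hashlib

GRID = 32     # normalized edge length; the comment suggests 512, kept small here
BLOCK = 8     # sub-block edge length, as in the comment
STEP = 32     # assumed quantization step standing in for the "distance threshold"
TOL = 16      # assumed tolerance when deciding what counts as margin


def crop_to_content(img, tol=TOL):
    """Drop margins that match the top-left pixel within `tol`."""
    bg = img[0][0]
    rows = [y for y, row in enumerate(img) if any(abs(p - bg) > tol for p in row)]
    cols = [x for x in range(len(img[0])) if any(abs(row[x] - bg) > tol for row in img)]
    if not rows or not cols:
        return img  # uniform image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def normalize(img, size=GRID, step=STEP):
    """Nearest-neighbour resample to size x size, then quantize gray levels."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] // step for x in range(size)]
            for y in range(size)]


def block_hashes(img, block=BLOCK):
    """SHA-1 of every block x block tile, in row-major order."""
    size = len(img)
    out = []
    for by in range(0, size, block):
        for bx in range(0, size, block):
            tile = bytes(img[y][x] for y in range(by, by + block)
                         for x in range(bx, bx + block))
            out.append(hashlib.sha1(tile).hexdigest())
    return out


def similarity(a, b):
    """Fraction of tiles whose hashes match at the same position."""
    ha = block_hashes(normalize(crop_to_content(a)))
    hb = block_hashes(normalize(crop_to_content(b)))
    return sum(x == y for x, y in zip(ha, hb)) / len(ha)


def whole_sha1(img):
    """SHA-1 over all pixels, standing in for the file digest."""
    return hashlib.sha1(bytes(p for row in img for p in row)).hexdigest()


# Synthetic 64x64 grayscale test pattern and a padded, slightly altered copy.
original = [[((x // 8) * 32 + (y // 8) * 16) % 256 for x in range(64)]
            for y in range(64)]
padded = [[255] * 74 for _ in range(5)]
padded += [[255] * 5 + row[:] + [255] * 5 for row in original]
padded += [[255] * 74 for _ in range(5)]
padded[15][15] += 1   # one-level change at content pixel (10, 10)

print(whole_sha1(original) == whole_sha1(padded))  # False
print(similarity(original, padded))                # 1.0
```

The whole-image digest changes as soon as a margin is added, while the per-block score still reports a full match. Note that plain quantization is a crude threshold: a value sitting right on a bucket boundary can flip buckets after a tiny change, so a real implementation would probably need a proper distance measure rather than exact hash equality.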
For example, Google is able to recognize people's faces or monuments automatically in any photo, using heuristic methods that can correct for differences in lighting, changes of resolution, image cropping, border decorations, slight rotations...

Many spammed images also display text (e.g. domain names or tiny URLs), and OCR could recognize this text as an additional way to identify spam (we could also forbid the display of external URLs, notably those hosted on tiny URL providers).

Is there existing work somewhere on automating the recognition of image subjects, and a way to develop an extension that compares new incoming images with well-known bad images, in a special page where the problematic images are not publicly downloadable or reusable, so that Commons does not become the distribution vector, notably for phishing emails?

Do we monitor security alerts about phishing emails containing images that could be hosted on Commons or on another wiki?

Can we also develop identification mechanisms for other media types (notably PDF, ePUB, audio and video) without using the basic SHA1 signature?

--
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
