https://bugzilla.wikimedia.org/show_bug.cgi?id=52056

Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #15 from Philippe Verdy <[email protected]> ---
It's so easy to derive a variant of a spammed image by changing a few random
bits in it (including within invisible embedded metadata, such as camera info
or the creator software version string, or by adding some randomly selected
image background around the bad image) that I think checking the SHA1 digest
is of little use for detecting spammed images.
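
To illustrate the point, here is a minimal sketch (the byte string is only a
stand-in for real image file contents, and the helper names are made up for
this example) showing that flipping a single bit gives an unrelated digest:

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA1 digest of the data as a hex string."""
    return hashlib.sha1(data).hexdigest()

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of the data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

# Stand-in for the bytes of an uploaded image file (assumption for the demo).
original = bytes(range(256)) * 4
tampered = flip_bit(original, 1000)

print(sha1_hex(original))
print(sha1_hex(tampered))
print("same digest:", sha1_hex(original) == sha1_hex(tampered))
```

The two files differ by one bit, yet an exact-digest blacklist would treat
them as unrelated.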

SHA1 is the wrong method to identify spammed images. A method based on image
subsampling, with some distance threshold on color plane values, ignoring all
metadata fields but taking the ICC profiles into account to produce the
accurate final colors before subsampling, would be much better.
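
A rough sketch of the subsampling idea, under simplifying assumptions: the
image is already decoded to a 2D list of grayscale values (so metadata is
ignored by construction), ICC color correction is left out, and the grid size
and threshold are arbitrary values chosen for illustration:

```python
def subsample(pixels, grid=8):
    """Average a 2D list of grayscale values down to grid x grid cells.

    Assumes the image is at least grid pixels wide and high.
    """
    h, w = len(pixels), len(pixels[0])
    out = []
    for gy in range(grid):
        y0, y1 = gy * h // grid, (gy + 1) * h // grid
        row = []
        for gx in range(grid):
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out

def similar(a, b, grid=8, threshold=10.0):
    """True if every subsampled cell of a is within threshold of b's cell."""
    sa, sb = subsample(a, grid), subsample(b, grid)
    return all(
        abs(sa[y][x] - sb[y][x]) <= threshold
        for y in range(grid)
        for x in range(grid)
    )

# Synthetic 32x32 test images (assumptions for the demo, not real uploads).
base = [[(x * 8) % 256 for x in range(32)] for y in range(32)]
noisy = [[min(255, v + ((x + y) % 3)) for x, v in enumerate(row)]
         for y, row in enumerate(base)]
inverted = [[255 - v for v in row] for row in base]

print(similar(base, noisy))     # slightly altered copy
print(similar(base, inverted))  # different picture
```

Unlike an exact digest, the comparison tolerates small per-pixel changes
because each cell is an average over many pixels.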

Images could be identified by creating identifiable bounding boxes between the
most contrasting pixels, in order to eliminate the effect of image realignment
with custom internal margins of variable sizes. Once this is done, the
subsampling can be correctly aligned to a box of 512x512 pixels (if the image
is not square, its minimum width/height will be set between 256 and 512, and
the maximum will be set to a multiple of 512, creating a horizontal or vertical
band of 512x512 squares). SHA1 can then be computed on subblocks of 8x8 pixels
to count the number of common subblocks, giving a similarity score for
possible copies.
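
The steps above could be sketched roughly as follows. This is only an
illustration under several assumptions of mine: grayscale 2D lists as input,
the bounding box taken as everything that differs from the top-left pixel by
more than a tolerance, a square nearest-neighbour resize (64x64 by default
here instead of 512x512, to keep the demo fast), and coarse quantization of
pixel values before hashing so that tiny changes do not alter every block:

```python
import hashlib

def crop_margins(pixels, tolerance=16):
    """Crop to the bounding box of pixels that differ from the top-left one."""
    bg = pixels[0][0]
    rows = [y for y, row in enumerate(pixels)
            if any(abs(v - bg) > tolerance for v in row)]
    cols = [x for x in range(len(pixels[0]))
            if any(abs(row[x] - bg) > tolerance for row in pixels)]
    if not rows or not cols:
        return pixels  # uniform image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in pixels[rows[0]:rows[-1] + 1]]

def resize_nearest(pixels, size):
    """Nearest-neighbour resize to a size x size square."""
    h, w = len(pixels), len(pixels[0])
    return [[pixels[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def block_hashes(pixels, block=8, levels=8):
    """SHA1 of each block x block tile, after quantizing values to levels."""
    size = len(pixels)
    step = 256 // levels
    hashes = []
    for by in range(0, size, block):
        for bx in range(0, size, block):
            data = bytes(pixels[y][x] // step
                         for y in range(by, by + block)
                         for x in range(bx, bx + block))
            hashes.append(hashlib.sha1(data).hexdigest())
    return hashes

def match_score(a, b, size=64):
    """Fraction of same-position subblocks whose hashes are identical."""
    ha = block_hashes(resize_nearest(crop_margins(a), size))
    hb = block_hashes(resize_nearest(crop_margins(b), size))
    return sum(1 for x, y in zip(ha, hb) if x == y) / len(ha)

# Synthetic demo images (assumptions): a 64x64 picture, the same picture
# surrounded by black margins of uneven width, and an unrelated picture.
picture = [[100 + ((x * 3 + y * 5) % 100) for x in range(64)] for y in range(64)]
padded = ([[0] * 84 for _ in range(5)]
          + [[0] * 9 + row + [0] * 11 for row in picture]
          + [[0] * 84 for _ in range(13)])
other = [[255 - v for v in row] for row in picture]

print(match_score(picture, padded))
print(match_score(picture, other))
```

Comparing subblocks by position is one reading of "common subblocks";
counting blocks shared anywhere in the two images would be another.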

Above some threshold, this similarity score will raise an alert for human
inspection in a specific category or report showing the two images (the one
already identified as spam or as infringing a copyright, and the new image).

There are probably newer algorithms to help match comparable images. For
example, Google is able to recognize people's faces or monuments automatically
from any photo, using heuristic methods that can correct for the effects of
differences in lighting, changes of resolution, image cropping, border
decorations, slight rotations...

Many spammed images also display text (e.g. domain names or tiny URLs), and
some OCR may recognize this text as an additional method to identify spam (we
could also forbid the display of external URLs, notably those hosted on tiny
URL providers).
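
As a minimal sketch of that last check, assuming some OCR step has already
produced a plain text string from the image (the OCR itself is not shown, and
the host list and function names below are illustrative, not an existing
blacklist or API):

```python
import re

# Illustrative list of URL shortener hosts (assumption, far from complete).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}

# Matches things that look like a domain name, optionally followed by a path.
DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})(/\S*)?", re.IGNORECASE)

def find_domains(ocr_text):
    """Return the lowercased domain names found in the recognized text."""
    return [m.group(1).lower() for m in DOMAIN_RE.finditer(ocr_text)]

def flags_shortener(ocr_text):
    """True if the text mentions a host from the shortener list."""
    for domain in find_domains(ocr_text):
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in SHORTENER_HOSTS:
            return True
    return False

print(find_domains("Visit bit.ly/abc123 now"))
print(flags_shortener("Visit bit.ly/abc123 now"))
print(flags_shortener("Photo taken near example.org offices"))
```

A real deployment would need to cope with OCR errors and deliberately
obfuscated text, which this sketch does not attempt.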

Is there existing work somewhere on automating the recognition of image
subjects, and a way to develop an extension that compares new incoming images
with some well-known bad images, in a special page where the problematic
images will not be publicly downloadable/reusable, so that Commons will not be
the distribution vector, notably for phishing emails? Do we monitor security
alerts about phishing emails containing images that could be hosted on Commons
or on another wiki?

Can we also develop identification mechanisms for other media types (notably
PDF, ePUB, audio and video) without using the basic SHA1 digest?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
