Re: [Wikitech-l] [Commons-l] Elog.io now up w/ Commons data

2014-12-11 Thread Jonas Öberg
Hi Cornelius!

For images that it matches against the catalog, it should give accurate
information. If it doesn't, please use the report link to let us know!

You're right, though, that for images it doesn't find in its catalog, we
don't provide any information. That's the equivalent of saying this
picture may or may not be openly licensed, but right now we have no
information to tell either way.

Sincerely,
Jonas
On 11 Dec 2014 15:57, Cornelius Kibelka cornelius.kibe...@wikimedia.de
wrote:

 Wow, what a nice and interesting browser extension. Congrats!

 Just a question: as far as I can see, the tool doesn't give the complete
 and correct licensing information, as the source is missing. Or am I
 mistaken?

 Best
 Cornelius

 2014-12-10 19:30 GMT+01:00 Jonas Öberg jo...@commonsmachinery.se:

 Dear all,

 thanks for all your help with answering questions and giving feedback
 over the last couple of months. I'm happy to say that we're finally at
 a stage where we've hashed 22,452,638 images from Wikimedia Commons
 and launched Elog.io in public beta: http://elog.io/

 Elog.io consists of an open API as well as browser plugins that can query
 for information about images using a perceptual hash that's quick and
 easy to calculate in a browser.

 What the browser extensions allow you to do is match an image you find
 in the wild against Wikimedia Commons. If it can be matched against
 an image from Commons, it'll show you the title, author, and license,
 and give you links back to Wikimedia and the license, plus a quick and
 handy "Copy as HTML" that copies the image and attribution as an HTML
 snippet for pasting into Word, LibreOffice, WordPress, etc.

 Our API provides lookup functions to find information using a URL (the
 Commons page name URL) or using the perceptual hash. You get the
 information back as JSON in the W3C Media Annotations format. Of course,
 the information you get back is no better than what the Commons API
 provides, so if you already have a page name URL, you may as well
 query it directly, and rely on our API only for searching by
 perceptual hashes.
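
 To illustrate, a lookup by hash could look roughly like the sketch
 below. The endpoint and parameter names here are placeholders rather
 than the documented routes; see http://docs.cmcatalog.apiary.io for
 the actual API.

     import requests

     # Hypothetical lookup: find catalog entries whose blockhash lies
     # within a small Hamming distance of ours. The endpoint and the
     # parameter names are illustrative, not the real API.
     phash = "0" * 64  # placeholder for a real 64-hex-char blockhash
     resp = requests.get("https://catalog.elog.io/lookup/hash",
                         params={"hash": phash, "distance": 10})
     print(resp.json())  # metadata in W3C Media Annotations format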

 The algorithm we use for calculating perceptual hashes, which you'll
 need in order to query our API, is at http://blockhash.io/


 Sincerely,
 Jonas





 --
 Cornelius Kibelka

 International Affairs
 Werkstudent | student trainee

 Wikimedia Deutschland e.V.
 Tempelhofer Ufer 23-24
 10963 Berlin

 Tel.: +49 30 219158260
 http://wikimedia.de

 http://wikimedia.de/Stellen
 Imagine a world in which every single human being has free access to
 the sum of all human knowledge. Help us make that happen!
 http://spenden.wikimedia.de/

 Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
 (Society for the Promotion of Free Knowledge). Registered in the register
 of associations of the Amtsgericht Berlin-Charlottenburg under number
 23855 B. Recognised as charitable by the Finanzamt für Körperschaften I
 Berlin, tax number 27/681/51985.




Re: [Wikitech-l] Open source mobile image recognition in Wikipedia

2014-12-06 Thread Jonas Öberg
Hi Adrien!

 Using the visual word approach I use in Pastec would enable the matching
 of modified images but would also require a lot more resources. Thus, while
 your hash is 256 bits long, an image signature in the Pastec index is
 approximately 8 KB.

8 KB still isn't too bad. It sounds like it could be useful.

 Similarly, I guess that the search complexity of your hash approach is O(1),
 while in Pastec it is much more complicated: first tf-idf ranking and
 then two geometrical re-rankings...

Close to O(1), at least. How does Pastec scale to many images? You
mentioned having about 400,000 currently, which is still a rather fair
number, but what about the full ~22M of Wikimedia Commons? I'm
assuming that since tf-idf is a well-known method in text mining,
there are well-understood and optimised search algorithms for it. Perhaps
something like Elasticsearch would be useful right away, too?

That would be an advantage, since with our blockhash we've had to
implement the relevant search algorithms ourselves, for lack of existing
implementations.
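
Even the naive baseline is easy to state, though. A linear scan over a
hash-to-page-name mapping looks like the Python sketch below (an
illustration, not our production index):

    def best_match(query_hash, catalog):
        # catalog: dict mapping 256-bit hash (int) -> Commons page name.
        # Returns the (hash, page) pair with the smallest Hamming distance.
        return min(catalog.items(),
                   key=lambda kv: bin(kv[0] ^ query_hash).count("1"))

Making that fast at the scale of the full ~22M images is where the real
work lies.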

One problem that we see, and which was discussed recently on the
commons-l mailing list, is the possibility of using approaches like
yours and ours to identify duplicate images in Commons. We've
generated a list of 21,274 duplicate pairs, but some of them aren't
actually duplicates, just very similar. Most commonly this is map
data, like [1] and [2], where just a specific region differs.

I'm hypothesizing that your ORB detection would have better success
there, since it would hopefully detect the colored area as a feature
and be able to distinguish the two from each other.

In general, my feeling is that your work with ORB and our work with
blockhashes complement each other nicely. They address different use
cases but have the same purpose, so being able to search using both
would sometimes be an advantage. What is your strategy for scaling
beyond your existing 400,000 images, and is there some way we can
cooperate on this? As we go about hashing additional sets (Flickr is a
prime candidate), it would be interesting for us if we could generate
both our blockhash and your ORB visual-word signature in an easy way,
since we retrieve the images anyway.

[1] 
https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Trujillo_Alto.png
[2] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Carolina.png


Re: [Wikitech-l] Open source mobile image recognition in Wikipedia

2014-11-24 Thread Jonas Öberg
Hi Adrien,

this looks very interesting - I'm happy to see your work, and I briefly
looked into your sources and API. With your 440,000 images, do you
have any clear idea about the accuracy of ORB? To explain: I'm working
on Elog.io, which provides a *similar* service and API[1] to yours,
but uses a rather different algorithm and store, and targets a different
use case. Our algorithm is a variant of a blockhash[2] algorithm, which
does no feature detection at all, but which can easily run in
a browser or on a mobile platform (we have versions for JavaScript, C and
Python) to generate 256-bit hashes of images. With a Hamming distance
calculation, we then determine the quality of a match.
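
To make that concrete, here is a minimal Python sketch of the idea: a
simplified block-mean hash plus a Hamming distance. It is not the exact
Blockhash algorithm (which, for instance, computes medians per
horizontal band; see http://blockhash.io for the real thing):

    from statistics import median

    from PIL import Image  # assumes Pillow is available

    def block_hash(path, grid=16):
        # Grayscale, 4x4 pixels per block -> grid*grid blocks (256 here).
        w = grid * 4
        img = Image.open(path).convert("L").resize((w, w))
        px = list(img.getdata())
        blocks = [sum(px[(by * 4 + y) * w + bx * 4 + x]
                      for y in range(4) for x in range(4))
                  for by in range(grid) for bx in range(grid)]
        m = median(blocks)
        # One bit per block: brighter than the median or not -> 256 bits.
        return sum(1 << i for i, b in enumerate(blocks) if b > m)

    def hamming(h1, h2):
        # Number of differing bits between the two 256-bit hashes.
        return bin(h1 ^ h2).count("1")

Two files are then treated as a likely match when the distance is
small, e.g. hamming(block_hash("a.jpg"), block_hash("b.png")) <= 10
(the threshold of 10 is illustrative).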

We work primarily on a use case of verbatim use, with a user getting
images from Wikimedia and re-using them elsewhere. Algorithms without
feature detection give very bad results for any modifications to an
image, like rotation, cropping, etc. But since that's not within our
use case, it works; the flip side, of course, is that you can't
photograph something (a newspaper article with an image, for instance)
and then match it against a set of images, as Pastec is designed to do.

The other difference is that our database store isn't specifically
tailored to our hashes: we use W3C Media Annotations to store any kind
of metadata about images, and could equally well store your ORB
signatures, assuming they can be serialised.

To give you some numbers: for our use case (verbatim use, potentially
with a format change such as JPG to PNG, and scaling down to 100 px
width) we can successfully match ca. 87% of cases, and we have a
collision rate (different images resulting in the same or nearly the
same hash) of ca. 1.2%. Both numbers are against the Wikimedia Commons
set.

While we currently have the full ~22M images from Wikimedia Commons in
our database, we're still ironing out the kinks in the system and
making some additional improvements. If you think that we should
consider ORB instead of, or in addition to, our current algorithms, we'd
love to give that a try, and it would obviously be very interesting if
we could end up with signatures compatible with your database.

Sincerely,
Jonas

[1] http://docs.cmcatalog.apiary.io
[2] http://blockhash.io


On 24 November 2014 at 11:25, Adrien Maglo adr...@visualink.io wrote:
 Hello,


 I am not sure this is the right mailing list to introduce this project, but
 I have just released Displee, a small Android app that lets you search
 for images in the English Wikipedia by taking pictures:
 https://play.google.com/store/apps/details?id=org.visualink.displee
 It is a kind of open-source Google Goggles for images from en.wikipedia.org.

 I have developed Displee as a demonstrator of Pastec http://pastec.io, my
 open-source image recognition index and search engine for mobile apps.
 The index hosted on my server in France currently contains about 440,000
 images. They may not be the most relevant ones, but this is a start. ;-)
 I also have other ideas for improving this tiny app if it is of interest
 to the community.

 Displee source code (MIT) is available here:
 https://github.com/Visu4link/displee
 Pastec source code (LGPL) is available here:
 https://github.com/Visu4link/pastec
 The source code of the Displee back-end is not released yet. It is basically
 a Python 3 Django application.

 I will be glad to receive your feedback and answer any questions!

 Best regards,


 --
 Adrien Maglo
 Pastec developer
 http://www.pastec.io
 +33 6 27 94 34 41




-- 
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention


[Wikitech-l] apihighlimits and bot flags

2014-09-19 Thread Jonas Öberg
Dear all,

We're building a Firefox addon to perceptually match images in Commons
against images found elsewhere, so that people can see that they come
from Commons even if they appear on other web sites.
https://moqups.com/jonaso/lopej41Z has a quick mockup.

On https://commons.wikimedia.org/wiki/Commons:Bots/Requests/CommonsHasher
we've requested the apihighlimits right (after discussion on commons-l
starting here: 
https://lists.wikimedia.org/pipermail/commons-l/2014-September/007325.html)
in order to be able to retrieve more than 50 records at once from the
API.
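
For context, the batching against the API looks roughly like the
following sketch (the 50-title cap per request is the standard limit;
apihighlimits raises it to 500):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    BATCH = 50  # becomes 500 once the account has apihighlimits

    def image_info(titles):
        # Yield imageinfo records for File: pages, BATCH titles at a time.
        for i in range(0, len(titles), BATCH):
            r = requests.get(API, params={
                "action": "query",
                "prop": "imageinfo",
                "iiprop": "url|sha1",
                "titles": "|".join(titles[i:i + BATCH]),
                "format": "json",
            })
            for page in r.json()["query"]["pages"].values():
                yield page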

According to EugeneZelenko, who tried to grant this right, it could not
be granted through the normal interface. The question, then: is
apihighlimits included in the bot flag, and if not, how can the
apihighlimits right be granted?


Sincerely,

-- 
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention


Re: [Wikitech-l] apihighlimits and bot flags

2014-09-19 Thread Jonas Öberg
Thanks, Bartosz and Petr, much appreciated; this clears up the question nicely :)
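
For anyone scripting this, a quick way to check whether the right is
active for your API session is the standard userinfo query (a minimal
sketch):

    import requests

    # Ask the API which rights the current session's account has; an
    # unauthenticated request reports the anonymous user's rights.
    r = requests.get("https://commons.wikimedia.org/w/api.php", params={
        "action": "query", "meta": "userinfo",
        "uiprop": "rights", "format": "json",
    })
    print("apihighlimits" in r.json()["query"]["userinfo"]["rights"])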

Sincerely,
Jonas

On 19 September 2014 10:24, Bartosz Dziewoński matma@gmail.com wrote:
 Yes, the 'apihighlimits' *permission* is included in the 'bot' *group* (and
 the 'sysop' group, too). You can see the available groups and the
 permissions assigned to them at
 https://commons.wikimedia.org/wiki/Special:ListGroupRights

 --
 Bartosz Dziewoński





-- 
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention
