On Sat, Mar 7, 2015 at 10:25 PM, Emw <[email protected]> wrote:

> Amir,
>
> What is the false positive rate of your algorithm when dealing with
> fictitious humans and (non-fictitious) non-human organisms? That is, how
> often does your program classify such non-humans as humans?

I'll give you an exact number for the German Wikipedia in several hours.
> Regarding the latter, note that items about individual dogs, elephants,
> chimpanzees and even trees can use properties that are otherwise extremely
> skewed towards humans. For example, Prometheus (Q590010) [1], an extremely
> old tree, has claims for *date of birth* (P569), *date of death* (P570),
> even *killed by* (P157). Non-human animals can also have kinship claims
> (e.g. *mother*, *brother*, *child*), among other properties typically used
> on humans.

The trick to avoiding such errors is to give a large negative score to
items that have a group D or E category. Feature engineering for this task
is a little complicated. First, I group the categories of a wiki by the
share of their members that are known humans: if more than 80% of a
category's members are known to be humans, it is a group A category, and
so on down to group D (0%). An article can then be parameterized by the
number of categories it has in each group. For example, an article about a
human typically looks like 5,3,2,0,0, while an article about a tree might
look like 1,0,0,6,7. Having one or several group A categories alongside
several group D categories is what keeps the bot from making such false
statements. How can a bot do this at all? Because of the huge training set
we already have, and neural network algorithms.
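To make the parameterization concrete, here is a rough Python sketch. Only
two cutoffs are stated above (group A: more than 80% human members; the
bottom group: 0%), so the middle thresholds and all names here are
illustrative placeholders, not Kian's actual code:

    GROUPS = ['A', 'B', 'C', 'D', 'E']
    CUTOFFS = [0.8, 0.5, 0.2, 0.0]  # share of known humans per group

    def group_of(human_members, total_members):
        """Assign a category to a group by its share of known humans."""
        ratio = human_members / total_members
        for group, cutoff in zip(GROUPS, CUTOFFS):
            if ratio > cutoff:
                return group
        return 'E'  # no members known to be human

    def feature_vector(article_categories, category_group):
        """category_group: precomputed map of category -> 'A'..'E'."""
        counts = dict.fromkeys(GROUPS, 0)
        for cat in article_categories:
            if cat in category_group:
                counts[category_group[cat]] += 1
        # e.g. [5, 3, 2, 0, 0] for a typical human article
        return [counts[g] for g in GROUPS]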
Best

> Best,
> Eric
>
> https://www.wikidata.org/wiki/User:Emw
>
> 1. Prometheus. https://www.wikidata.org/wiki/Q590010
>
> On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup <[email protected]>
> wrote:
>
>> Hey Markus,
>> Thanks for your insight :)
>>
>> On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch
>> <[email protected]> wrote:
>>
>>> Hi Amir,
>>>
>>> In spite of all due enthusiasm, please evaluate your results (with
>>> humans!) before making automated edits. In fact, I would contradict
>>> Magnus here and say that such an approach would best be suited to
>>> provide meaningful (pre-filtered) *input* to people who play a Wikidata
>>> game, rather than bypassing the game (and humans) altogether. The
>>> expected error rates are quite high for such an approach, but it can
>>> still save a lot of work for humans.
>>
>> There is a "certainty factor": by using it, the bot can save a lot of
>> work without making such errors.
>>
>>> As for the next steps, I would suggest that you have a look at the work
>>> that others have done already. Try Google Scholar:
>>>
>>> https://scholar.google.com/scholar?q=machine+learning+wikipedia
>>>
>>> As you can see, there are countless works on using machine learning
>>> techniques on Wikipedia, both for information extraction (e.g.,
>>> understanding link semantics) and for things like vandalism detection.
>>> I am sure that one could get a lot of inspiration from there, both on
>>> potential applications and on technical hints on how to improve result
>>> quality.
>>
>> Yes, I will definitely use them, thanks.
>>
>>> You will find that people are using many different approaches in these
>>> works. The good old ANN is still a relevant algorithm in practice, but
>>> there are many other techniques, such as SVMs, Markov models, or random
>>> forests, which have been found to work better than ANNs in many cases.
>>> Not saying that a three-layer feed-forward ANN cannot do some jobs as
>>> well, but I would not restrict myself to one ML approach if a whole
>>> arsenal of algorithms is available, most of them pre-implemented in
>>> libraries (the first Google hit has a lot of relevant projects listed:
>>> http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/).
>>> I would certainly recommend that you don't implement any of the
>>> standard ML algorithms from scratch.
>>
>> I use the backpropagation algorithm, and for my personal ML work I use
>> Octave, but for Wikipedia I use Python, for two main reasons: it
>> integrates with other Wikipedia-related tools like pywikibot, and Octave
>> and MATLAB perform badly on big data sets. I had to write those parts
>> from scratch, since I couldn't find a suitable library in Python. Even
>> an algorithm like BFGS was missing (I found one in scipy, but I wasn't
>> sure it works correctly, and there was no documentation).
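(To make "from scratch" concrete, here is a toy version of the kind of
three-layer feed-forward network and backpropagation step I mean, in
Python with numpy. The layer sizes, learning rate, and squared-error loss
are illustrative choices, not Kian's actual code, and biases are omitted
for brevity.)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One feature vector per row (e.g. the category counts above);
    # y is 1 for human, 0 for not human.
    X = np.array([[5, 3, 2, 0, 0],
                  [1, 0, 0, 6, 7]], dtype=float)
    y = np.array([[1.0], [0.0]])

    n_in, n_hidden = X.shape[1], 4
    W1 = np.random.randn(n_in, n_hidden) * 0.1
    W2 = np.random.randn(n_hidden, 1) * 0.1
    alpha = 0.5  # learning rate, arbitrary

    for _ in range(1000):
        # forward pass: input -> hidden -> output
        hidden = sigmoid(X @ W1)
        out = sigmoid(hidden @ W2)
        # backward pass: gradient of squared error through the sigmoids
        delta_out = (out - y) * out * (1 - out)
        delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
        W2 -= alpha * (hidden.T @ delta_out)
        W1 -= alpha * (X.T @ delta_hidden)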
>>> In practice, the most challenging task for successful ML is often
>>> feature engineering: the question which features you use as an input
>>> to your learning algorithm. This is far more important than the choice
>>> of algorithm. Wikipedia in particular offers you so many relevant
>>> pieces of information with each article that are not just mere
>>> keywords (links, categories, in-links, ...), and it is not easy to
>>> decide which of these to feed into your learner. This will be
>>> different for each task you solve (subject classification is
>>> fundamentally different from vandalism detection, and even different
>>> types of vandalism would require very different techniques). You
>>> should pick hard or very large tasks to make sure that the tweaking
>>> you need in each case takes less time than you would need as a human
>>> to solve the task manually ;-)
>>
>> Yes, feature engineering is the most important thing, and it can be
>> tricky, but feature engineering is a lot easier in Wikidata than in
>> Wikipedia (and easier in Wikipedia than in most other places).
>> Anti-vandalism bots are a lot easier in Wikidata than in Wikipedia,
>> because edits in Wikidata are limited to certain kinds (like removing
>> a sitelink, etc.), which is not the case in Wikipedia.
>>
>>> Anyway, it's an interesting field, and we could certainly use some
>>> effort to exploit the countless works in this field for Wikidata. But
>>> you should be aware that this is no small challenge and that there is
>>> no universal solution that will work well even for all the tasks that
>>> you have mentioned in your email.
>>
>> Of course. I have spent a lot of time studying this, and I would be
>> happy if anyone who knows about neural networks or AI contributed too.
>>
>>> Best wishes,
>>>
>>> Markus
>>>
>>> On 07.03.2015 18:21, Magnus Manske wrote:
>>>
>>>> Congratulations on this bold step towards the Singularity :-)
>>>>
>>>> As for tasks, basically everything us mere humans do in the Wikidata
>>>> game: https://tools.wmflabs.org/wikidata-game/
>>>>
>>>> Some may require text parsing. Not sure how to get that working; I
>>>> haven't spent much time with (artificial) neural nets in a while.
>>>>
>>>> On Sat, Mar 7, 2015 at 12:36 PM Amir Ladsgroup <[email protected]>
>>>> wrote:
>>>>
>>>> Some useful tasks that I'm looking for a way to do are:
>>>> * Anti-vandal bot (or: how we can quantify an edit).
>>>> * Auto-labeling for humans (that's the next task).
>>>> * Add more :)
>>>>
>>>> On Sat, Mar 7, 2015 at 3:54 PM, Amir Ladsgroup <[email protected]>
>>>> wrote:
>>>>
>>>> Hey,
>>>> I spent the last few weeks working on this with the lights off [1],
>>>> and now it's ready to work!
>>>>
>>>> Kian is a three-layered neural network with a flexible number of
>>>> inputs and outputs. So if we can parameterize a job, we can teach him
>>>> easily and get the job done.
>>>>
>>>> For example, as the first job, we want to add P31:Q5 (human) to
>>>> Wikidata items based on the categories of their articles in
>>>> Wikipedia. The only thing we need to do is get a list of items with
>>>> P31:Q5 and a list of items that are not humans (P31 exists, but
>>>> without Q5 in it), then get the list of category links in any wiki we
>>>> want [2], and at last feed these files to Kian and let him learn.
>>>> Afterwards, if we give Kian other articles and their categories, he
>>>> classifies them as human, not human, or failed to determine. As a
>>>> test I gave him the categories of the ckb wiki (a small wiki) and it
>>>> worked pretty well; now I'm creating the training set from the German
>>>> Wikipedia, and the next step will be the English Wikipedia. The
>>>> number of P31:Q5 claims will increase drastically this week.
>>>>
>>>> I would love comments or ideas for tasks that Kian can do.
>>>>
>>>> [1]: Because I love surprises
>>>> [2]: "select pp_value, cl_to from page_props join categorylinks
>>>>      on pp_page = cl_from where pp_propname = 'wikibase_item';"
>>>>
>>>> Best
>>>> --
>>>> Amir
>>>>
>>>> --
>>>> Amir
>>
>> --
>> Amir
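PS: to make the "certainty factor" mentioned above concrete: Kian only
acts when the network's output is far enough from the decision boundary,
and everything in between is left to humans. A minimal sketch (the 0.9 and
0.1 thresholds are illustrative, not the actual values):

    def classify(score, upper=0.9, lower=0.1):
        """score: network output in [0, 1] for 'is a human'."""
        if score >= upper:
            return 'human'                # safe to add P31:Q5
        if score <= lower:
            return 'not human'
        return 'failed to determine'      # leave for humans to review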
--
Amir

_______________________________________________
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
