On Sat, Mar 7, 2015 at 10:25 PM, Emw <[email protected]> wrote:

> Amir,
>
> What is the false positive rate of your algorithm when dealing with
> fictitious humans and (non-fictitious) non-human organisms? That is, how
> often does your program classify such non-humans as humans?

I'll give you an exact number for the German Wikipedia in several hours.
> Regarding the latter, note that items about individual dogs, elephants,
> chimpanzees and even trees can use properties that are otherwise extremely
> skewed towards humans. For example, Prometheus (Q590010) [1], an extremely
> old tree, has claims for *date of birth* (P569), *date of death* (P570),
> even *killed by* (P157). Non-human animals can also have kinship claims
> (e.g. *mother*, *brother*, *child*), among other properties typically used
> on humans.

The trick to avoiding such errors is to give a large negative score to
items that have a group D or E category. Feature engineering for this task
is a little complicated. First, I group the categories of a wiki by the
share of their members that are known humans: if more than 80% of a
category's members are known to be humans, it is a group A category, and
so on down to group D (0%). An article can then be parameterized by the
number of categories it has in each group. For example, an article about a
human typically looks like 5,3,2,0,0, while an article about a tree might
look like 1,0,0,6,7. Having one or several group A categories alongside
several group D categories is what keeps the bot from making such false
statements. How can a bot do this at all? Because of the huge training set
we already have, and neural network algorithms.
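To make the parameterization concrete, here is a rough Python sketch. Only
two cutoffs are stated above (group A: more than 80% human members; the
bottom group: 0%), so the middle thresholds and all names here are
illustrative placeholders, not Kian's actual code:

    GROUPS = ['A', 'B', 'C', 'D', 'E']
    CUTOFFS = [0.8, 0.5, 0.2, 0.0]  # share of known humans per group

    def group_of(human_members, total_members):
        """Assign a category to a group by its share of known humans."""
        ratio = human_members / total_members
        for group, cutoff in zip(GROUPS, CUTOFFS):
            if ratio > cutoff:
                return group
        return 'E'  # no members known to be human

    def feature_vector(article_categories, category_group):
        """category_group: precomputed map of category -> 'A'..'E'."""
        counts = dict.fromkeys(GROUPS, 0)
        for cat in article_categories:
            if cat in category_group:
                counts[category_group[cat]] += 1
        # e.g. [5, 3, 2, 0, 0] for a typical human article
        return [counts[g] for g in GROUPS]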
Best

> Best,
> Eric
>
> https://www.wikidata.org/wiki/User:Emw
>
> 1. Prometheus. https://www.wikidata.org/wiki/Q590010
>
> On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup <[email protected]>
> wrote:
>
>> Hey Markus,
>> Thanks for your insight :)
>>
>> On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch
>> <[email protected]> wrote:
>>
>>> Hi Amir,
>>>
>>> In spite of all due enthusiasm, please evaluate your results (with
>>> humans!) before making automated edits. In fact, I would contradict
>>> Magnus here and say that such an approach would best be suited to
>>> provide meaningful (pre-filtered) *input* to people who play a Wikidata
>>> game, rather than bypassing the game (and humans) altogether. The
>>> expected error rates are quite high for such an approach, but it can
>>> still save a lot of work for humans.
>>
>> There is a "certainty factor": by using it, the bot can save a lot of
>> work without making such errors.
>>
>>> As for the next steps, I would suggest that you have a look at the work
>>> that others have done already. Try Google Scholar:
>>>
>>> https://scholar.google.com/scholar?q=machine+learning+wikipedia
>>>
>>> As you can see, there are countless works on using machine learning
>>> techniques on Wikipedia, both for information extraction (e.g.,
>>> understanding link semantics) and for things like vandalism detection.
>>> I am sure that one could get a lot of inspiration from there, both on
>>> potential applications and on technical hints on how to improve result
>>> quality.
>>
>> Yes, I will definitely use them, thanks.
>>
>>> You will find that people are using many different approaches in these
>>> works. The good old ANN is still a relevant algorithm in practice, but
>>> there are many other techniques, such as SVMs, Markov models, or random
>>> forests, which have been found to work better than ANNs in many cases.
>>> Not saying that a three-layer feed-forward ANN cannot do some jobs as
>>> well, but I would not restrict myself to one ML approach if a whole
>>> arsenal of algorithms is available, most of them pre-implemented in
>>> libraries (the first Google hit has a lot of relevant projects listed:
>>> http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/).
>>> I would certainly recommend that you don't implement any of the
>>> standard ML algorithms from scratch.
>>
>> I use the backpropagation algorithm, and for my personal ML work I use
>> Octave, but for Wikipedia I use Python, for two main reasons: it
>> integrates with other Wikipedia-related tools like pywikibot, and Octave
>> and MATLAB perform badly on big data sets. I had to write those parts
>> from scratch, since I couldn't find a suitable library in Python. Even
>> an algorithm like BFGS was missing (I found one in scipy, but I wasn't
>> sure it works correctly, and there was no documentation).
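(To make "from scratch" concrete, here is a toy version of the kind of
three-layer feed-forward network and backpropagation step I mean, in
Python with numpy. The layer sizes, learning rate, and squared-error loss
are illustrative choices, not Kian's actual code, and biases are omitted
for brevity.)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One feature vector per row (e.g. the category counts above);
    # y is 1 for human, 0 for not human.
    X = np.array([[5, 3, 2, 0, 0],
                  [1, 0, 0, 6, 7]], dtype=float)
    y = np.array([[1.0], [0.0]])

    n_in, n_hidden = X.shape[1], 4
    W1 = np.random.randn(n_in, n_hidden) * 0.1
    W2 = np.random.randn(n_hidden, 1) * 0.1
    alpha = 0.5  # learning rate, arbitrary

    for _ in range(1000):
        # forward pass: input -> hidden -> output
        hidden = sigmoid(X @ W1)
        out = sigmoid(hidden @ W2)
        # backward pass: gradient of squared error through the sigmoids
        delta_out = (out - y) * out * (1 - out)
        delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
        W2 -= alpha * (hidden.T @ delta_out)
        W1 -= alpha * (X.T @ delta_hidden)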
>>> In practice, the most challenging task for successful ML is often
>>> feature engineering: the question which features you use as an input
>>> to your learning algorithm. This is far more important than the choice
>>> of algorithm. Wikipedia in particular offers you so many relevant
>>> pieces of information with each article that are not just mere
>>> keywords (links, categories, in-links, ...), and it is not easy to
>>> decide which of these to feed into your learner. This will be
>>> different for each task you solve (subject classification is
>>> fundamentally different from vandalism detection, and even different
>>> types of vandalism would require very different techniques). You
>>> should pick hard or very large tasks to make sure that the tweaking
>>> you need in each case takes less time than you would need as a human
>>> to solve the task manually ;-)
>>
>> Yes, feature engineering is the most important thing, and it can be
>> tricky, but feature engineering is a lot easier in Wikidata than in
>> Wikipedia (and easier in Wikipedia than in most other places).
>> Anti-vandalism bots are a lot easier in Wikidata than in Wikipedia,
>> because edits in Wikidata are limited to certain kinds (like removing
>> a sitelink, etc.), which is not the case in Wikipedia.
>>
>>> Anyway, it's an interesting field, and we could certainly use some
>>> effort to exploit the countless works in this field for Wikidata. But
>>> you should be aware that this is no small challenge and that there is
>>> no universal solution that will work well even for all the tasks that
>>> you have mentioned in your email.
>>
>> Of course. I have spent a lot of time studying this, and I would be
>> happy if anyone who knows about neural networks or AI contributed too.
>>
>>> Best wishes,
>>>
>>> Markus
>>>
>>> On 07.03.2015 18:21, Magnus Manske wrote:
>>>
>>>> Congratulations on this bold step towards the Singularity :-)
>>>>
>>>> As for tasks, basically everything us mere humans do in the Wikidata
>>>> game: https://tools.wmflabs.org/wikidata-game/
>>>>
>>>> Some may require text parsing. Not sure how to get that working; I
>>>> haven't spent much time with (artificial) neural nets in a while.
>>>>
>>>> On Sat, Mar 7, 2015 at 12:36 PM Amir Ladsgroup <[email protected]>
>>>> wrote:
>>>>
>>>> Some useful tasks that I'm looking for a way to do are:
>>>> * Anti-vandal bot (or: how we can quantify an edit).
>>>> * Auto-labeling for humans (that's the next task).
>>>> * Add more :)
>>>>
>>>> On Sat, Mar 7, 2015 at 3:54 PM, Amir Ladsgroup <[email protected]>
>>>> wrote:
>>>>
>>>> Hey,
>>>> I spent the last few weeks working on this with the lights off [1],
>>>> and now it's ready to work!
>>>>
>>>> Kian is a three-layered neural network with a flexible number of
>>>> inputs and outputs. So if we can parameterize a job, we can teach him
>>>> easily and get the job done.
>>>>
>>>> For example, as the first job, we want to add P31:Q5 (human) to
>>>> Wikidata items based on the categories of their articles in
>>>> Wikipedia. The only thing we need to do is get a list of items with
>>>> P31:Q5 and a list of items that are not humans (P31 exists, but
>>>> without Q5 in it), then get the list of category links in any wiki we
>>>> want [2], and at last feed these files to Kian and let him learn.
>>>> Afterwards, if we give Kian other articles and their categories, he
>>>> classifies them as human, not human, or failed to determine. As a
>>>> test I gave him the categories of the ckb wiki (a small wiki) and it
>>>> worked pretty well; now I'm creating the training set from the German
>>>> Wikipedia, and the next step will be the English Wikipedia. The
>>>> number of P31:Q5 claims will increase drastically this week.
>>>>
>>>> I would love comments or ideas for tasks that Kian can do.
>>>>
>>>> [1]: Because I love surprises
>>>> [2]: "select pp_value, cl_to from page_props join categorylinks
>>>>      on pp_page = cl_from where pp_propname = 'wikibase_item';"
>>>>
>>>> Best
>>>> --
>>>> Amir
>>>>
>>>> --
>>>> Amir
>>
>> --
>> Amir
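PS: to make the "certainty factor" mentioned above concrete: Kian only
acts when the network's output is far enough from the decision boundary,
and everything in between is left to humans. A minimal sketch (the 0.9 and
0.1 thresholds are illustrative, not the actual values):

    def classify(score, upper=0.9, lower=0.1):
        """score: network output in [0, 1] for 'is a human'."""
        if score >= upper:
            return 'human'                # safe to add P31:Q5
        if score <= lower:
            return 'not human'
        return 'failed to determine'      # leave for humans to review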
--
Amir

_______________________________________________
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
