This looks really cool and valuable. Thanks for your work on it, Aiko! I took your prototype for a test run, but was a little surprised by the first three tasks it gave me:

https://tools.wmflabs.org/aiko-citationhunt/en?id=90bb8e4a
https://tools.wmflabs.org/aiko-citationhunt/en?id=d0f3447f
https://tools.wmflabs.org/aiko-citationhunt/en?id=d49f1b38

For all three of them, the text appears to already be referenced. (For the second one, the second sentence doesn't have the reference immediately following it, so I can see why that would be a problem, but the tool highlighted the first sentence as well.) Is this a bug, or is the tool telling me that the sources used are unreliable, or am I just misunderstanding something?

Emufarmers

On Sat, Mar 7, 2020 at 9:03 AM Ai-Jou Chou <[email protected]> wrote:

> Hi all,
>
> I’m happy to announce the outcome of an Outreachy internship
> <https://phabricator.wikimedia.org/T233707> that I’m finishing up. It is a
> new tool and public dataset named Citation Detective which tool developers
> and researchers can now use for their projects.
>
> Citation Detective <https://meta.wikimedia.org/wiki/Citation_Detective>
> contains sentences that have been identified as needing a citation using a
> machine learning-based classifier published earlier last year
> <https://arxiv.org/pdf/1902.11116.pdf> by WMF researchers and
> collaborators. As part of Outreachy, I developed a tool
> <https://github.com/AikoChou/citationdetective> (hosted on Toolforge
> <https://tools.wmflabs.org>) to run through Wikipedia and extract
> high-scoring sentences along with contextual information.
>
> As an example use case for this data, I also created a proof of concept for
> integrating Citation Detective and Citation Hunt
> <https://tools.wmflabs.org/citationhunt>. Check out my prototype Citation
> Hunt <https://tools.wmflabs.org/aiko-citationhunt>, which uses Citation
> Detective to import sentences that would not normally be featured in
> Citation Hunt. The repository for that is here
> <https://github.com/AikoChou/citationhunt>.
>
> This dataset currently includes sentences from ~120,000 randomly selected
> articles from the English Wikipedia. In future work, we hope to expand this
> to more language Wikipedia projects and a greater number of articles. It is
> also possible to expand the database to contain more fields in a future
> version according to feedback from tool developers and researchers. More
> use cases for this type of data were identified in a design research
> project
> <https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements/API_design_research>
> conducted last year by Jonathan Morgan.
>
> You can find more information in our Wiki Workshop submission
> <https://commons.wikimedia.org/wiki/File:Citation_Detective_WikiWorkshop2020.pdf>
> and in my blog <https://rollingmist.home.blog/> which documented the whole
> journey.
>
> Thank you very much!
>
> Kind regard,
> Aiko
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
