Hi,

I just pulled and compiled the dictionary-annotator and am looking through the code. I'm looking for something that is faster than UIMA Concept-Mapper. I don't need all the functionality of Concept-Mapper, but I do need the following:

* match all, e.g. if the dict entries are "a b c", "a b" and "b c" and the input is "a b c", I need to match "a b c", "a b" and "b c"
* skip tokens, e.g. if the dict entry is "a c d", it should match on the input "a b c d"

Can someone familiar with the new dictionary annotator save me some time and say whether it supports these matching strategies? Also, any sense of how the system scales?

Thanks / Dan

-----Original Message-----
From: Peter Klügl [mailto:[email protected]]
Sent: Tuesday, March 14, 2017 12:52 AM
To: [email protected]
Subject: Re: New dictionary annotator
Hi,

it's now March and I have not yet found the time to compare the different annotators in your benchmark. I just wanted to mention that I have not forgotten about this and that it is still on my todo list. However, it could easily be April before I find the time.

Best,

Peter

On 08.12.2016 at 10:43, Donatas Remeika wrote:
> Hi,
>
> Peter, I did some benchmark on 20 newsgroups texts. The results can be
> found here: https://github.com/tokenmill/dictionary-annotator
> I didn't measure memory usage, just compared how fast different
> annotators do the job.
>
> Best regards,
> Donatas
>
> On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <[email protected]> wrote:
>
>> Hi,
>>
>> for the UIMA Ruta paper, I used the enron email dataset [1], but it
>> is probably not optimal here.
>>
>> I think we can find a standard scenario (data+terminology), maybe
>> something like Genia with MeSH or wikipedia with geonames. Just a
>> quick guess. I can help setting something up, but probably not before
>> February.
>>
>> Best,
>>
>> Peter
>>
>> [1] https://www.cs.cmu.edu/~enron/
>>
>> On 05.12.2016 at 12:56, Donatas Remeika wrote:
>>> Hi,
>>>
>>> Thanks for feedback.
>>> Yes, it would be interesting to see benchmark results. Maybe you
>>> know where I could find examples and data for doing benchmarks in
>>> UIMA?
>>>
>>> Best regards,
>>> Donatas
>>>
>>> On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> a very nice annotator, thank you.
>>>>
>>>> Do you have figures how the annotator compares to the others with
>>>> respect to speed and memory usage?
>>>>
>>>> Storing the complete tokens will maybe provide challenges in
>>>> scenarios with parallelization if the dictionary is not shared between
>>>> annotators.
>>>>
>>>> Would you be interested to set up a benchmark?
>>>>
>>>> Because of the limitations of the dictionaries in ruta, I also
>>>> created a new simple dictionary annotator, but it lives now in our
>>>> own components repository. Maybe I'll contribute it sometimes to
>>>> ruta since it provides exactly the functionality the ruta dictionaries
>>>> miss.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> On 30.11.2016 at 15:38, Donatas Remeika wrote:
>>>>> Hi,
>>>>>
>>>>> Just wanted to let you know that we created a new (probably one
>>>>> more) dictionary annotator.
>>>>>
>>>>> Reasons for creating it was:
>>>>> - Quite often we used Ruta in our pipelines only because of its
>>>>>   MARKTABLE action which is able to set several features on annotation
>>>>> - Sometimes dictionaries contain duplicate entries with different
>>>>>   features and we need to create annotations for each entry
>>>>> - Possibility to use custom dictionary entries tokenizer (default
>>>>>   is whitespace tokenizer)
>>>>>
>>>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>>>> Big thanks to their developers!
>>>>>
>>>>> Code with examples can be found
>>>>> https://github.com/tokenmill/dictionary-annotator
>>>>>
>>>>> BTW, maybe someone knows Concept Mapper alternative, which is more
>>>>> uimaFIT friendly?
>>>>>
>>>>> Best regards,
>>>>> Donatas
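
For reference, the two matching strategies Dan asks about at the top of the thread ("match all" and "skip tokens") can be pinned down with a small sketch. This is plain Python written only to illustrate the expected behaviour; it is not the API of dictionary-annotator, ConceptMapper or Ruta, and the function names (match_all, match_with_skips) and the max_skip parameter are hypothetical. Whitespace tokenization is assumed, and nothing here says whether any of the annotators in this thread actually supports these strategies.

```python
def tokenize(text):
    # Assumption: plain whitespace tokenization.
    return text.split()


def match_all(tokens, entries):
    """Return (start, end, entry) for every dictionary entry that occurs
    as a contiguous token run, including overlapping and nested matches."""
    entry_set = {tuple(tokenize(e)) for e in entries}
    max_len = max((len(e) for e in entry_set), default=0)
    hits = []
    for start in range(len(tokens)):
        # Try every span length up to the longest dictionary entry.
        for end in range(start + 1, min(len(tokens), start + max_len) + 1):
            span = tuple(tokens[start:end])
            if span in entry_set:
                hits.append((start, end, " ".join(span)))
    return hits


def _extend(tokens, pos, rest, max_skip):
    """Try to match the remaining entry tokens after index pos, allowing
    up to max_skip input tokens between consecutive entry tokens.
    Returns the exclusive end index of the match, or None."""
    if not rest:
        return pos + 1
    for nxt in range(pos + 1, min(len(tokens), pos + 2 + max_skip)):
        if tokens[nxt] == rest[0]:
            end = _extend(tokens, nxt, rest[1:], max_skip)
            if end is not None:
                return end
    return None


def match_with_skips(tokens, entry, max_skip=1):
    """Return (start, end) spans where the entry's tokens appear in order,
    with at most max_skip skipped tokens between neighbours."""
    entry_tokens = tokenize(entry)
    if not entry_tokens:
        return []
    spans = []
    for start, tok in enumerate(tokens):
        if tok == entry_tokens[0]:
            end = _extend(tokens, start, entry_tokens[1:], max_skip)
            if end is not None:
                spans.append((start, end))
    return spans


print(match_all(tokenize("a b c"), ["a b c", "a b", "b c"]))
print(match_with_skips(tokenize("a b c d"), "a c d", max_skip=1))
```

With the examples from Dan's message, the first call reports all three entries ("a b", "a b c" and "b c"), and the second reports one span covering "a b c d". The brute-force span enumeration is only meant to make the semantics concrete; a real annotator would more likely use a trie or similar index, which is where the scaling question comes in.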
