Hi Peter,

I ran some benchmarks on the 20 Newsgroups texts. The results can be found here:
https://github.com/tokenmill/dictionary-annotator

I didn't measure memory usage; I only compared how fast the different annotators do the job.

Best regards,
Donatas

On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl <[email protected]> wrote:

> Hi,
>
> for the UIMA Ruta paper, I used the enron email dataset [1], but it is
> probably not optimal here.
>
> I think we can find a standard scenario (data+terminology), maybe
> something like Genia with MeSH or wikipedia with geonames. Just a quick
> guess. I can help setting something up, but probably not before February.
>
> Best,
>
> Peter
>
> [1] https://www.cs.cmu.edu/~enron/
>
> On 05.12.2016 at 12:56, Donatas Remeika wrote:
> > Hi,
> >
> > Thanks for feedback.
> > Yes, it would be interesting to see benchmark results. Maybe you know where
> > I could find examples and data for doing benchmarks in UIMA?
> >
> > Best regards,
> > Donatas
> >
> > On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> a very nice annotator, thank you.
> >>
> >> Do you have figures how the annotator compares to the others with
> >> respect to speed and memory usage?
> >>
> >> Storing the complete tokens will maybe provide challenges in scenarios
> >> with parallelization if the dictionary is not shared between annotators.
> >>
> >> Would you be interested to set up a benchmark?
> >>
> >> Because of the limitations of the dictionaries in ruta, I also created a
> >> new simple dictionary annotator, but it lives now in our own components
> >> repository. Maybe I'll contribute it sometimes to ruta since it provides
> >> exactly the functionality the ruta dictionaries miss.
> >>
> >> Best,
> >>
> >> Peter
> >>
> >> On 30.11.2016 at 15:38, Donatas Remeika wrote:
> >>> Hi,
> >>>
> >>> Just wanted to let you know that we created a new (probably one more)
> >>> dictionary annotator.
> >>>
> >>> Reasons for creating it was:
> >>> - Quite often we used Ruta in our pipelines only because of its MARKTABLE
> >>>   action which is able to set several features on annotation
> >>> - Sometimes dictionaries contain duplicate entries with different features
> >>>   and we need to create annotations for each entry
> >>> - Possibility to use custom dictionary entries tokenizer (default is
> >>>   whitespace tokenizer)
> >>>
> >>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE. Big
> >>> thanks to their developers!
> >>>
> >>> Code with examples can be found
> >>> https://github.com/tokenmill/dictionary-annotator
> >>>
> >>> BTW, maybe someone knows Concept Mapper alternative, which is more uimaFIT
> >>> friendly?
> >>>
> >>> Best regards,
> >>> Donatas
