Re: Processing Bamun characters

2016-12-08 Thread Marshall Schor
Hi Nelson,

I can't see the characters (sorry).

This might be caused by a mismatch between the actual encoding of the file
being read and the encoding declared in the XML header.  Can you check that
those two are the same?

See
http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
for example.
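
The check suggested above can be sketched in a few lines of Python. This is
only an illustration, not part of UIMA: the helper names `declared_encoding`
and `matches_declared_encoding` are hypothetical, the sample documents are
made up, and U+16800 is used as a stand-in supplementary-plane character. The
check is also one-sided: a decode failure proves a mismatch, but a successful
decode does not prove the declaration is right.

```python
import re

def declared_encoding(xml_bytes, default="utf-8"):
    """Return the encoding named in the XML declaration, or the default.

    Only works when the declaration itself is ASCII-readable, which holds
    for UTF-8 and other ASCII-compatible encodings.
    """
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\x27]([\w.-]+)["\x27]', xml_bytes
    )
    return match.group(1).decode("ascii").lower() if match else default

def matches_declared_encoding(xml_bytes):
    """True if the bytes decode without error under the declared encoding."""
    try:
        xml_bytes.decode(declared_encoding(xml_bytes))
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# Hypothetical sample documents containing one supplementary-plane character.
good_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?><doc>\U00016800</doc>'
).encode("utf-8")

# Declared as US-ASCII, but the bytes on disk are really UTF-8.
mismatched_bytes = (
    '<?xml version="1.0" encoding="US-ASCII"?><doc>\U00016800</doc>'
).encode("utf-8")

print(declared_encoding(good_bytes), matches_declared_encoding(good_bytes))
print(declared_encoding(mismatched_bytes),
      matches_declared_encoding(mismatched_bytes))
```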

-Marshall

On 12/8/2016 4:20 PM, nelson rivera wrote:
> I tried to process the following text in a service deployed in UIMA-AS,
> because it is input to my application. This is the text: 榀  榐  �  �.
> These characters belong to the Bamun language, and apparently they are
> not invalid XML characters, because tools such as browsers interpret
> and display them. After getting a new input CAS for processing, setting
> the text, and sending the request, I get the exception shown below in
> UIMA-AS. The framework keeps working and recovers correctly; it just
> does not process these characters.
> Could you tell me what happens with these characters? Is one of them an
> invalid character for the UIMA-AS framework?
>
>
>
> 04:00:31.606 - 14:
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> WARNING:
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> Character reference 

Processing Bamun characters

2016-12-08 Thread nelson rivera
I tried to process the following text in a service deployed in UIMA-AS,
because it is input to my application. This is the text: 榀  榐  �  �.
These characters belong to the Bamun language, and apparently they are
not invalid XML characters, because tools such as browsers interpret
and display them. After getting a new input CAS for processing, setting
the text, and sending the request, I get the exception shown below in
UIMA-AS. The framework keeps working and recovers correctly; it just
does not process these characters.
Could you tell me what happens with these characters? Is one of them an
invalid character for the UIMA-AS framework?



04:00:31.606 - 14:
org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
WARNING:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
Character reference 
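
The question above turns on which code points XML 1.0 allows. The Char
production in the XML 1.0 specification permits #x9, #xA, #xD, #x20-#xD7FF,
#xE000-#xFFFD and #x10000-#x10FFFF. A supplementary-plane character is
therefore legal as a whole code point, while the two UTF-16 surrogate code
units that represent it in a Java string are not legal on their own. The
Python sketch below illustrates this. U+16800 (the first code point of the
Bamum Supplement block) is a stand-in, since the original characters did not
survive in this message, and the helper names are hypothetical. Whether lone
surrogates are what the truncated exception above complains about is an
assumption; the log does not confirm it.

```python
def is_xml10_char(code_point):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )

def utf16_code_units(ch):
    """Return the UTF-16 code units used to represent the string."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# Stand-in character (assumption): first code point of Bamum Supplement.
sample = "\U00016800"
units = utf16_code_units(sample)

# The whole code point is a legal XML 1.0 character.
print(hex(ord(sample)), is_xml10_char(ord(sample)))

# Each surrogate half, taken alone, is not.
for unit in units:
    print(hex(unit), is_xml10_char(unit))
```

If a serializer wrote each surrogate half as its own numeric character
reference, a conforming XML parser would have to reject the result, even
though the character itself is valid.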

Re: New dictionary annotator

2016-12-08 Thread Donatas Remeika
Hi,

Peter, I ran some benchmarks on the 20 Newsgroups texts. The results can be
found here: https://github.com/tokenmill/dictionary-annotator
I didn't measure memory usage; I only compared how fast the different
annotators do the job.

Best regards,
Donatas

On Mon, Dec 5, 2016 at 2:35 PM Peter Klügl wrote:

> Hi,
>
>
> for the UIMA Ruta paper, I used the Enron email dataset [1], but it is
> probably not optimal here.
>
>
> I think we can find a standard scenario (data + terminology), maybe
> something like GENIA with MeSH, or Wikipedia with GeoNames. Just a quick
> guess. I can help set something up, but probably not before February.
>
>
> Best,
>
>
> Peter
>
>
> [1] https://www.cs.cmu.edu/~enron/
>
> Am 05.12.2016 um 12:56 schrieb Donatas Remeika:
> > Hi,
> >
> > Thanks for feedback.
> > Yes, it would be interesting to see benchmark results. Maybe you know where
> > I could find examples and data for doing benchmarks in UIMA?
> >
> > Best regards,
> > Donatas
> >
> >
> > On Mon, Dec 5, 2016 at 10:52 AM Peter Klügl wrote:
> >
> >> Hi,
> >>
> >>
> >> a very nice annotator, thank you.
> >>
> >>
> >> Do you have figures on how the annotator compares to the others with
> >> respect to speed and memory usage?
> >>
> >> Storing the complete tokens may pose challenges in parallelized
> >> scenarios if the dictionary is not shared between annotators.
> >>
> >> Would you be interested in setting up a benchmark?
> >>
> >>
> >> Because of the limitations of the dictionaries in Ruta, I also created a
> >> new simple dictionary annotator, but for now it lives in our own
> >> components repository. Maybe I'll contribute it to Ruta sometime, since it
> >> provides exactly the functionality the Ruta dictionaries lack.
> >>
> >>
> >> Best,
> >>
> >>
> >> Peter
> >>
> >>
> >> Am 30.11.2016 um 15:38 schrieb Donatas Remeika:
> >>> Hi,
> >>>
> >>> Just wanted to let you know that we created a new (probably one more)
> >>> dictionary annotator.
> >>>
> >>> The reasons for creating it were:
> >>>  - Quite often we used Ruta in our pipelines only because of its
> >>> MARKTABLE action, which is able to set several features on an annotation
> >>>  - Sometimes dictionaries contain duplicate entries with different
> >>> features, and we need to create an annotation for each entry
> >>>  - The possibility to use a custom tokenizer for dictionary entries
> >>> (the default is a whitespace tokenizer)
> >>>
> >>> It was inspired by both the DKPro dictionary-annotator and Ruta
> >>> MARKTABLE. Big thanks to their developers!
> >>>
> >>> Code with examples can be found at
> >>> https://github.com/tokenmill/dictionary-annotator
> >>>
> >>> BTW, does anyone know of a Concept Mapper alternative that is more
> >>> uimaFIT-friendly?
> >>>
> >>> Best regards,
> >>> Donatas
> >>>
> >>
>
>
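
The announcement quoted above names three requirements: several features per
annotation, one annotation for each duplicate dictionary entry, and a
pluggable tokenizer that defaults to whitespace splitting. The Python sketch
below shows one way those requirements fit together. It is not the tokenmill
implementation and does not use the UIMA API; every function name and the
sample dictionary are hypothetical.

```python
import re
from collections import defaultdict

def whitespace_tokenizer(text):
    """Default tokenizer: split on whitespace, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def build_dictionary(entries, tokenizer=whitespace_tokenizer):
    """Map each tokenized phrase to the feature dicts of ALL its entries."""
    index = defaultdict(list)
    for phrase, features in entries:
        key = tuple(token for token, _, _ in tokenizer(phrase))
        index[key].append(features)  # duplicates are kept, not overwritten
    return index

def annotate(text, index, tokenizer=whitespace_tokenizer):
    """Return (begin, end, features) for every matching dictionary entry."""
    tokens = tokenizer(text)
    max_len = max((len(key) for key in index), default=0)
    annotations = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            window = tokens[start:start + length]
            key = tuple(token for token, _, _ in window)
            for features in index.get(key, []):
                annotations.append((window[0][1], window[-1][2], features))
    return annotations

# Hypothetical dictionary with a duplicate phrase carrying different features.
entries = [
    ("New York", {"type": "city", "country": "US"}),
    ("New York", {"type": "state", "country": "US"}),
    ("York", {"type": "city", "country": "UK"}),
]
index = build_dictionary(entries)
result = annotate("I moved to New York", index)
for begin, end, features in result:
    print(begin, end, features)
```

A different tokenizer can be passed to both `build_dictionary` and `annotate`,
as long as it returns the same (token, begin, end) triples.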