RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Steve Bethard wrote: I spent some time writing a script for diff-ing CASes. I urge anyone interested in comparing cTakes CASes / output to use this type of approach. Comparison of program output is a post-process task, and unless absolutely necessary, code to juggle data and metadata belongs

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim, One might want to compare the Sentence detector that uses end of line characters as sentence splitters with one that does not. Such a change in sentence splitting would not only affect the sentence type discoveries but also practically every type that follows. Another might want to

Re: cTakes output predictability

2014-10-07 Thread britt fitch
The option Sean mentioned of writing your own custom consumer (without the UIMA id that is causing your issues) should meet these needs I believe. Britt Fitch Wired Informatics 265 Franklin St Ste 1702 Boston, MA 02110 http://wiredinformatics.com
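As a rough illustration of the post-process comparison idea discussed in this thread, here is a minimal Python sketch. It is not cTakes or UIMA code: it assumes each run has already been exported to a list of plain dictionaries, and the field names ("id", "xmi:id", "type", "begin", "end") and the helper names normalize and diff_runs are hypothetical. Run-specific ID fields are dropped and ordering is ignored, so only differences in actual values are reported.

```python
from collections import Counter

# Hypothetical names for run-specific identifier fields; a real export may differ.
ID_FIELDS = frozenset({"id", "xmi:id"})


def normalize(annotations, ignore=ID_FIELDS):
    """Return a multiset of annotations with the ignored fields dropped.

    Each annotation becomes a sorted tuple of (field, value-as-string) pairs,
    so neither field order nor list order affects the comparison.
    """
    return Counter(
        tuple(sorted((key, str(value)) for key, value in ann.items() if key not in ignore))
        for ann in annotations
    )


def diff_runs(run_a, run_b):
    """Return (only_in_a, only_in_b) after ignoring IDs and ordering."""
    a, b = normalize(run_a), normalize(run_b)
    return sorted((a - b).elements()), sorted((b - a).elements())


if __name__ == "__main__":
    first = [
        {"id": 1, "type": "Sentence", "begin": 0, "end": 10},
        {"id": 2, "type": "Token", "begin": 0, "end": 3},
    ]
    # Same content, different IDs and order: reported as identical.
    second = [
        {"id": 7, "type": "Token", "begin": 0, "end": 3},
        {"id": 9, "type": "Sentence", "begin": 0, "end": 10},
    ]
    print(diff_runs(first, second))
```

With a comparison like this, differing IDs or ordering between runs produce no diff, while a changed offset or type value shows up on both sides of the result.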

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean, Well of course that makes plenty of sense. When testing different cTakes configurations, you would expect different output. In our testing we've found several cases where running with the same configuration outputs different data under different moons. Having consistent results helps us know

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
I think we may really prefer the first method. Since it doesn't appear that there are any consequences with moving forward with changing the code, we would really like to move forward with this approach. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014

RE: cTakes output predictability

2014-10-07 Thread Masanz, James J.
FWIW, I agree with Sean that comparing should be a post-processing step and trying to get UIMA internal IDs to match on subsequent runs is not worth opening the code for. -Original Message- From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] Sent: Tuesday, October 07, 2014 10:56

Re: cTakes output predictability

2014-10-07 Thread britt fitch
I think changing the code raises at least some concerns of affecting others, while adding a custom consumer raises zero. Given how easy it is to write a custom consumer, that is my vote. Britt Fitch Wired Informatics 265 Franklin St Ste 1702 Boston, MA 02110

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
I think it would be helpful actually, as digging deeper into the issue has highlighted to me a few places in the code that actually cause inconsistent results to be returned when running the same document through multiple times. I think having the code base be predictable will make it easier to

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
It concerns me a bit that making the code return consistent results would be so concerning. This should be the default mode of operation. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 09:59 AM, britt fitch wrote: I think changing the code raises at

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Jay, I agree. This does lead to reproducible unit tests, which helps us out in the long term. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/06/2014 05:38 PM, jay vyas wrote: I'm not a cTakes expert by any means, but in general, I like that idea

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim, It concerns me a bit that making the code return consistent results would be so concerning. Could you please clarify what you mean by consistent results? Do you mean ordering and IDs or are you talking about actual type values not matching? This should be the default mode of

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean, No, you're not a jerk. These are things worth considering, and I understand your concerns with touching various points of the codebase. I'll talk with our group over here and see where we want to go. We are really interested in cTakes behaving well, so we are usually pretty careful in

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean, Yes, I mean actual type values not matching. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 10:46 AM, Finan, Sean wrote: Hi Kim, It concerns me a bit that making the code return consistent results would be so concerning. Could you

Re: cTakes output predictability

2014-10-07 Thread Bruce Tietjen
I did not intend to step on anyone's toes. One of the reasons I proposed the changes was to try to make it extremely obvious when there are significant differences in output from the cTakes pipeline when running the same document again, and once identified, make it easier to identify the source of

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
I'm just about sapped on this topic. What comes below is my final writing. Kim wrote: Yes, I mean actual type values not matching. Ok, this is a very serious problem and should have nothing to do with ordering and/or IDs. I repeat: this should have nothing to do with ordering or ids.

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Bruce, Could you send the record over that you are seeing this on? Thanks, Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 11:20 AM, Bruce Tietjen wrote: I did not intend to step on anyone's toes. One of the reasons I proposed the changes was

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim, Great Catch! I think that by now this thread may be discarded by most as spam. So, I'm back (apologies - I know that you are tired of me by now). I checked the code that you pointed to ... I really dislike looking at older cTakes code because I'm filled with an overwhelming urge to

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean, Alright, it seems that rather than doing the sorted approach, we want to manage these individually. I'll create tickets on all of the items we have found so far. This is just one example. Then maybe we can move our discussion of how to solve each one to discussions around that ticket