[ 
https://issues.apache.org/jira/browse/UIMA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601642#action_12601642
 ] 

Jeffrey Sorensen commented on UIMA-1041:
----------------------------------------


The problem is caused by the ConvertUnicodeStringRef function in the
Pythonator source, due to the behavior of PyUnicode_DecodeUTF16 as
documented here:
http://docs.python.org/api/builtinCodecs.html

Byte order marks will not be copied into the target string. Looking at the
Python source code, the following comment can be found in the
PyUnicode_DecodeUTF16 source:

    /* Check for BOM marks (U+FEFF) in the input and adjust current
       byte order setting accordingly. In native mode, the leading BOM
       mark is skipped, in all other modes, it is copied to the output
       stream as-is (giving a ZWNBSP character). */

This suggests that providing any nonzero value for the byteorder parameter
will cause byte-order marks to be preserved.
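
To make the effect on offsets concrete, here is a toy model of the rule in the
quoted comment. It is not Python's implementation: decode_model and offset_of
are hypothetical helpers, the input is assumed to be in native byte order
already, and the text is assumed to begin with U+FEFF.

```cpp
#include <cstddef>
#include <string>

// Toy model (NOT Python's decoder) of the rule in the quoted comment:
//   byteorder == 0 : a leading U+FEFF is treated as a BOM and skipped
//   byteorder != 0 : a leading U+FEFF is copied through as ZWNBSP
// Assumes the input code units are already in native byte order.
static std::u16string decode_model(const std::u16string &in, int byteorder) {
  const char16_t kBom = 0xFEFF;
  if (byteorder == 0 && !in.empty() && in[0] == kBom) {
    return in.substr(1);  // BOM dropped: every later offset shifts down by 1
  }
  return in;              // BOM kept: offsets match the original buffer
}

// Offset (in UTF-16 code units) of the first occurrence of ch in s.
static std::size_t offset_of(const std::u16string &s, char16_t ch) {
  return s.find(ch);
}
```

In this model, a character at offset 2 in the original buffer is found at
offset 1 when byteorder is 0, which would match the "one too low" offsets
described in the issue if the text does start with a byte order mark.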

Hence, my proposed replacement code is as follows:

static bool ConvertUnicodeStringRef(const UnicodeStringRef &ref,
        PyObject **rv) {
  if (sizeof(Py_UNICODE) == sizeof(UChar)) {
    // same code-unit width: copy the buffer directly
    *rv = PyUnicode_FromUnicode((const Py_UNICODE*) ref.getBuffer(),
        ref.length());
    if (*rv == 0) return false;
  } else {
    // test for big-endian, preset python decoder for native order
    // this will prevent PyUnicode_DecodeUTF16 from deleting byte order marks
    union { long l; char c[sizeof(long)]; } u;
    u.l = 1;
    // 1 = big-endian, -1 = little-endian
    int byteorder = (u.c[sizeof(long) - 1] == 1) ? 1 : -1;
    PyObject *r = PyUnicode_DecodeUTF16(
       (const char *) ref.getBuffer(), ref.getSizeInBytes(), 0, &byteorder);
    if (r == 0) return false;
    *rv = r;
  }
  return true;
}

where the test for endianness was lifted from this page:
http://unixpapa.com/incnote/byteorder.html
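
As an alternative sketch of the same check, the byte order can be probed with
memcpy instead of reading a union member. The helper name
native_utf16_byteorder is hypothetical; the return values follow the byteorder
convention used by PyUnicode_DecodeUTF16 (1 for big-endian, -1 for
little-endian).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the byteorder value matching the host's
// native order, 1 for big-endian and -1 for little-endian.
static int native_utf16_byteorder() {
  const std::uint16_t probe = 0x0102;
  unsigned char bytes[sizeof probe];
  // Copy the object representation so the individual bytes can be inspected.
  std::memcpy(bytes, &probe, sizeof probe);
  // If the high-order byte is stored first, the host is big-endian.
  return (bytes[0] == 0x01) ? 1 : -1;
}
```

The result could be assigned to the byteorder variable in place of the union
test; the behavior should be the same on ordinary big- and little-endian hosts.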

Jeff


> UIMACPP Pythonator issues with annotation offsets and lengths - off by 1 
> errors
> -------------------------------------------------------------------------------
>
>                 Key: UIMA-1041
>                 URL: https://issues.apache.org/jira/browse/UIMA-1041
>             Project: UIMA
>          Issue Type: Bug
>          Components: C++ Framework
>         Environment: RedHat, UIMACPP 2.2.2 release candidate 01, uima base 
> 2.2.2
>            Reporter: Marshall Schor
>
> The sample python script when run in the document analyzer shows annotations 
> where the highlight is always missing the last character, and the details 
> show the offsets for the begin and end to be both one too low.
> To reproduce, run the sample script in the python directory of the 
> scriptators (after doing a build/install of the pythonator following the 
> directions in the python directory in python.html).
