[
https://issues.apache.org/jira/browse/UIMA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601642#action_12601642
]
Jeffrey Sorensen commented on UIMA-1041:
----------------------------------------
The problem is caused by the ConvertUnicodeStringRef function in the
pythonnator source, and is due to the behavior of PyUnicode_DecodeUTF16 as
documented here
http://docs.python.org/api/builtinCodecs.html
According to that page, byte order marks will not be copied into the target
string. A dropped leading U+FEFF would account for offsets that are one too
low. Looking at the source code for Python, the following comment can be
found in the PyUnicode_DecodeUTF16 source:
    /* Check for BOM marks (U+FEFF) in the input and adjust current
       byte order setting accordingly. In native mode, the leading BOM
       mark is skipped, in all other modes, it is copied to the output
       stream as-is (giving a ZWNBSP character). */
This suggests that passing an explicit byte order (-1 or 1) through the
byteorder parameter, instead of leaving the decoder in native mode (0), will
cause byte order marks to be preserved.
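
To make the effect on offsets concrete, here is a minimal sketch in plain
C++. It does not call Python at all: decodeSketch is a hypothetical stand-in
that only mimics the BOM handling described in the comment above, and the
sample text in the usage note is made up.

```cpp
#include <string>

// Hypothetical stand-in for the decoder behaviour described above.
// This is NOT the Python implementation; it only models the BOM handling.
// byteorder == 0 models "native mode", -1 or 1 an explicit byte order.
std::u16string decodeSketch(const std::u16string &in, int byteorder) {
    if (byteorder == 0 && !in.empty() && in[0] == 0xFEFF) {
        // Native mode: the leading BOM is treated as a hint and skipped,
        // so every following character moves one position to the left.
        return in.substr(1);
    }
    // Explicit byte order: the BOM is copied through as U+FEFF (ZWNBSP).
    return in;
}
```

For a buffer holding U+FEFF followed by "xyz", the text "xyz" is found at
offset 0 in native mode and at offset 1 with an explicit byte order, which
is the kind of one-position shift reported in this issue.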
Hence, my proposed replacement code is as follows:
static bool ConvertUnicodeStringRef(const UnicodeStringRef &ref,
                                    PyObject **rv) {
  if (sizeof(Py_UNICODE) == sizeof(UChar)) {
    *rv = PyUnicode_FromUnicode((const Py_UNICODE*) ref.getBuffer(),
                                ref.length());
  } else {
    // Detect the host byte order and pass it to the decoder explicitly
    // (1 = big-endian, -1 = little-endian). With an explicit byte order,
    // PyUnicode_DecodeUTF16 no longer deletes byte order marks.
    union { long l; char c[sizeof(long)]; } u;
    u.l = 1;
    int byteorder = (u.c[sizeof(long) - 1] == 1) ? 1 : -1;
    PyObject *r = PyUnicode_DecodeUTF16(
        (const char *) ref.getBuffer(), ref.getSizeInBytes(), 0, &byteorder);
    if (r==0) return false;
    *rv = r;
  }
  return true;
}
where the test for endianness was lifted from this page
http://unixpapa.com/incnote/byteorder.html
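
For anyone who wants to try the endianness test on its own, here is a small
standalone sketch. hostByteOrder is a hypothetical helper name, not something
in the pythonnator source, and it inspects the bytes through an unsigned char
pointer rather than a union, which sidesteps reading an inactive union member
in C++. The return values follow the byteorder convention used above.

```cpp
// Returns 1 on a big-endian host and -1 on a little-endian host, matching
// the values passed through the byteorder parameter in the code above.
// hostByteOrder is a hypothetical helper name used only for illustration.
int hostByteOrder() {
    const long probe = 1;
    // Looking at an object's bytes through unsigned char is well defined.
    const unsigned char *bytes =
        reinterpret_cast<const unsigned char *>(&probe);
    // On a big-endian host the least significant byte is stored last.
    return (bytes[sizeof(long) - 1] == 1) ? 1 : -1;
}
```

The result could be stored in an int and its address passed as the last
argument of PyUnicode_DecodeUTF16, as in the proposed function.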
Jeff
> UIMACPP Pythonator issues with annotation offsets and lengths - off by 1
> errors
> -------------------------------------------------------------------------------
>
> Key: UIMA-1041
> URL: https://issues.apache.org/jira/browse/UIMA-1041
> Project: UIMA
> Issue Type: Bug
> Components: C++ Framework
> Environment: RedHat, UIMACPP 2.2.2 release candidate 01, uima base
> 2.2.2
> Reporter: Marshall Schor
>
> The sample python script when run in the document analyzer shows annotations
> where the highlight is always missing the last character, and the details
> show the offsets for the begin and end to be both one too low.
> To reproduce, run the sample script in the python directory of the
> scriptators (after doing a build/install of the pythonator following the
> directions in the python directory in python.html).