[ https://issues.apache.org/jira/browse/SOLR-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748172#action_12748172 ]
Yonik Seeley commented on SOLR-1091:
------------------------------------

Looking closer at the byte sequence, this really looks like invalid UTF8 since the 8th bit is set on every byte. The decoder is probably just doing the best that it can with this, but the result isn't going to be what you want in any case.

> "phps" (serialized PHP) writer produces invalid output
> ------------------------------------------------------
>
>                 Key: SOLR-1091
>                 URL: https://issues.apache.org/jira/browse/SOLR-1091
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.3
>         Environment: Sun JRE 1.6.0 on Centos 5
>            Reporter: frank farmer
>            Priority: Minor
>             Fix For: 1.4
>
>
> The serialized PHP output writer can output invalid string lengths for
> certain (unusual) input values. Specifically, I had a document containing
> the following 6-byte character sequence: \xED\xAF\x80\xED\xB1\xB8
> I was able to create a document in the index containing this value without
> issue; however, when fetching the document back out using the serialized PHP
> writer, it returns a string like the following:
> s:4:"􀁸";
> Note that the string length specified is 4, while the string is actually 6
> bytes long.
> When using PHP's native serialize() function, it correctly sets the length to
> 6:
> # php -r 'var_dump(serialize("\xED\xAF\x80\xED\xB1\xB8"));'
> string(13) "s:6:"􀁸";"
> The "wt=php" writer, which produces output to be parsed with eval(), doesn't
> have any trouble with this string.
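For anyone who wants to reproduce the numbers in the report, here is a minimal Python sketch. It rests on assumptions the thread does not confirm: that the six bytes are a UTF-16 surrogate pair encoded as two 3-byte sequences, and that a lenient decoder joins the pair into one code point before the writer counts the length. The helper names (`php_serialize_bytes`, `is_strict_utf8`, `reencode_surrogate_pair`) are hypothetical and are not part of Solr or PHP.

```python
# The 6 bytes quoted in the issue description.
RAW = b"\xed\xaf\x80\xed\xb1\xb8"


def php_serialize_bytes(data: bytes) -> bytes:
    """Frame a byte string as PHP's serialize() does: s:<byte count>:"<bytes>";"""
    return b"s:" + str(len(data)).encode("ascii") + b':"' + data + b'";'


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_surrogate_pair(data: bytes) -> bytes:
    """Assumed lenient path: decode the two surrogate halves, join them
    through UTF-16, and re-encode the resulting code point as UTF-8."""
    halves = data.decode("utf-8", "surrogatepass")        # two lone surrogates
    joined = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return joined.encode("utf-8")


if __name__ == "__main__":
    print(is_strict_utf8(RAW))                  # strict decoders reject the input
    print(php_serialize_bytes(RAW))             # length field counts the raw bytes
    print(len(reencode_surrogate_pair(RAW)))    # byte count after the lenient round trip
```

If those assumptions hold, the round trip yields a single 4-byte UTF-8 sequence, which would match the `s:4:` length the reporter saw, while counting the raw input bytes gives the `s:6:` that PHP's native serialize() produces.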