[jira] [Commented] (COUCHDB-3173) Views return corrupt data for text fields containing non-BMP characters
[ https://issues.apache.org/jira/browse/COUCHDB-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545556#comment-15545556 ]

ASF GitHub Bot commented on COUCHDB-3173:
-----------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/couchdb-couch/pull/202

> Views return corrupt data for text fields containing non-BMP characters
> -----------------------------------------------------------------------
>
>                 Key: COUCHDB-3173
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-3173
>             Project: CouchDB
>          Issue Type: Bug
>          Components: JavaScript View Server
>    Affects Versions: 2.0.0
>            Reporter: Loke
>
> When inserting a non-BMP character (i.e. a character with a Unicode code point
> above {{U+FFFF}}), the content gets corrupted after reading it from a view.
> At every instance of such a character, an extra {{U+FFFD REPLACEMENT
> CHARACTER}} is inserted into the text.
> To reproduce, use the following commands.
> Create the document containing a field with the character {{U+1F604 SMILING
> FACE WITH OPEN MOUTH AND SMILING EYES}}:
> {noformat}
> $ curl -X PUT -d '{"type":"foo","value":"😄"}' http://localhost:5984/foo/foo2
> {"ok":true,"id":"foo2","rev":"1-d7da3cd352ef74f6391cc13601081214"}
> {noformat}
> Get the document to ensure that it was saved properly:
> {noformat}
> $ curl -X GET http://localhost:5984/foo/foo2
> {"_id":"foo2","_rev":"1-d7da3cd352ef74f6391cc13601081214","type":"foo","value":"😄"}
> {noformat}
> Create a view that will return that document:
> {noformat}
> $ curl --user user:password -X PUT -d
> '{"language":"javascript","views":{"v":{"map":"function(doc){if(doc.type===\"foo\")emit(doc._id,doc);}"}}}'
> http://localhost:5984/foo/_design/bugdemo
> {"ok":true,"id":"_design/bugdemo","rev":"1-817af2dafecb4cf8213aa7063551daac"}
> {noformat}
> Get the document from the view:
> {noformat}
> $ curl -X GET http://localhost:5984/foo/_design/bugdemo/_view/v
> {"total_rows":1,"offset":0,"rows":[
> {"id":"foo2","key":"foo2","value":{"_id":"foo2","_rev":"1-d7da3cd352ef74f6391cc13601081214","type":"foo","value":"😄�"}}
> ]}
> {noformat}
> Now we can see that the field {{value}} contains two characters: the
> original character as well as {{U+FFFD}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
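For readers less familiar with the terms in the report, the following is a minimal JavaScript sketch of what "non-BMP" means in practice. It splits U+1F604 into its UTF-16 surrogate pair and lists the code points of the original value next to a value shaped like the one the view returned. The helper names `toSurrogatePair` and `codePoints` are made up for this illustration and are not part of CouchDB.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF || codePoint > 0x10FFFF) {
    throw new RangeError("not a non-BMP code point");
  }
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // high (leading) half
  const low = 0xDC00 + (offset & 0x3FF);  // low (trailing) half
  return [high, low];
}

// List the code points of a string as "U+XXXX" labels.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

const original = String.fromCodePoint(0x1F604);
// Same shape as the value reported from the view: original plus U+FFFD.
const corrupted = original + "\uFFFD";

console.log(toSurrogatePair(0x1F604).map((u) => u.toString(16)));
console.log(codePoints(original));
console.log(codePoints(corrupted));
```

Comparing the two `codePoints` results is a quick way to confirm the symptom on a value copied out of a view response.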
[jira] [Commented] (COUCHDB-3173) Views return corrupt data for text fields containing non-BMP characters
[ https://issues.apache.org/jira/browse/COUCHDB-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1554#comment-1554 ]

ASF subversion and git services commented on COUCHDB-3173:
----------------------------------------------------------

Commit 37d3778172ca354f124334edf13bc09d9abc28bc in couchdb-couch's branch refs/heads/master from [~paul.joseph.davis]
[ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=37d3778 ]

Fix CouchJS character replacement

This was a bad backport from an old bug. We accidentally backed up when
looking at the second half of a surrogate pair. Instead the backup
should only happen when we see a low half of a surrogate pair with no
preceding high half.

COUCHDB-3173
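The pairing rule described in the commit message can be sketched in JavaScript as follows. This is an illustration only, not the couchjs C source: the function name `decodeUtf16` is hypothetical, and replacing a lone high half with U+FFFD is an assumption that the commit message does not state. The point it shows is that a low half following a high half completes a pair, and only a low half with no preceding high half is treated as an error.

```javascript
// Illustrative sketch of the surrogate pairing rule, not the couchjs code.
// Takes an array of UTF-16 code units and returns an array of code points.
function decodeUtf16(units) {
  const out = [];
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    const isHigh = u >= 0xD800 && u <= 0xDBFF;
    const isLow = u >= 0xDC00 && u <= 0xDFFF;
    if (isHigh) {
      const next = units[i + 1];
      if (next !== undefined && next >= 0xDC00 && next <= 0xDFFF) {
        // Valid pair: combine both halves and consume the low half too.
        out.push(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
        i++;
      } else {
        // Assumption: a high half with no low half becomes U+FFFD.
        out.push(0xFFFD);
      }
    } else if (isLow) {
      // Low half with no preceding high half: the only error case
      // the commit message names.
      out.push(0xFFFD);
    } else {
      out.push(u);
    }
  }
  return out;
}

const toHex = (cps) => cps.map((cp) => cp.toString(16));
console.log(toHex(decodeUtf16([0xD83D, 0xDE04]))); // valid pair
console.log(toHex(decodeUtf16([0xDE04])));         // stray low half
```

Because the low half is consumed together with its high half, the loop never revisits it, which is the behaviour the fix is described as restoring.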
[jira] [Commented] (COUCHDB-3173) Views return corrupt data for text fields containing non-BMP characters
[ https://issues.apache.org/jira/browse/COUCHDB-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545538#comment-15545538 ]

ASF GitHub Bot commented on COUCHDB-3173:
-----------------------------------------

GitHub user davisp opened a pull request:

    https://github.com/apache/couchdb-couch/pull/202

    Fix CouchJS character replacement

    This was a bad backport from an old bug. We accidentally backed up when
    looking at the second half of a surrogate pair. Instead the backup
    should only happen when we see a low half of a surrogate pair with no
    preceding high half.

    COUCHDB-3173

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloudant/couchdb-couch 3173-fix-couchjs-character-replacement

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/couchdb-couch/pull/202.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #202

commit 37d3778172ca354f124334edf13bc09d9abc28bc
Author: Paul J. Davis
Date:   2016-10-04T14:45:36Z

    Fix CouchJS character replacement

    This was a bad backport from an old bug. We accidentally backed up when
    looking at the second half of a surrogate pair. Instead the backup
    should only happen when we see a low half of a surrogate pair with no
    preceding high half.

    COUCHDB-3173
[jira] [Commented] (COUCHDB-3173) Views return corrupt data for text fields containing non-BMP characters
[ https://issues.apache.org/jira/browse/COUCHDB-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545522#comment-15545522 ]

Paul Joseph Davis commented on COUCHDB-3173:
--------------------------------------------

Fixed. PR incoming.
[jira] [Commented] (COUCHDB-3173) Views return corrupt data for text fields containing non-BMP characters
[ https://issues.apache.org/jira/browse/COUCHDB-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545511#comment-15545511 ]

Paul Joseph Davis commented on COUCHDB-3173:
--------------------------------------------

Here's a simpler reproducer:

https://gist.github.com/davisp/3cc1a0e5b0de04a3c027f694d5a4bc31

The contents of the gist are pasted below for posterity, but I dunno how well Jira and Chrome will store the raw byte values:

repro.js:

["reset", {"reduce_limit":"true", "timeout":5000}]
["add_fun", "function(doc){if(doc.type===\"foo\")emit(doc._id,doc);}"]
["map_doc", {"_id":"foo2","_rev":"1-d7da3cd352ef74f6391cc13601081214","type":"foo","value":"😄"}]

run.sh:

cat repro.js | ./bin/couchjs share/server/main.js

Should have a fix in a few minutes if I'm lucky.
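Since the raw bytes of the test character may not survive being pasted through Jira or a mail archive, one option is to regenerate the reproducer input from the code point itself. The sketch below assumes Node.js (for the `Buffer` global) and rebuilds the three command lines quoted above, then checks that the UTF-8 bytes of U+1F604 (F0 9F 98 84) are present. The variable names are made up for this illustration.

```javascript
// Rebuild the reproducer input from the code point, so the test
// character does not depend on copy/paste preserving raw bytes.
// Assumes Node.js: Buffer is a Node global, not standard JavaScript.
const smiley = String.fromCodePoint(0x1F604);

// The three view-server commands quoted in the comment above.
const commands = [
  ["reset", { reduce_limit: "true", timeout: 5000 }],
  ["add_fun", 'function(doc){if(doc.type==="foo")emit(doc._id,doc);}'],
  ["map_doc", {
    _id: "foo2",
    _rev: "1-d7da3cd352ef74f6391cc13601081214",
    type: "foo",
    value: smiley,
  }],
];

// One JSON command per line, as in the pasted repro.js.
const reproText = commands.map((c) => JSON.stringify(c)).join("\n") + "\n";
const reproBytes = Buffer.from(reproText, "utf8");

// UTF-8 encoding of U+1F604 is the four bytes F0 9F 98 84.
const hasRawSmiley = reproBytes.includes(Buffer.from([0xF0, 0x9F, 0x98, 0x84]));
console.log(hasRawSmiley);
```

Writing `reproBytes` to a file with `fs.writeFileSync("repro.js", reproBytes)` should give an input equivalent to the gist, which can then be piped through couchjs as in run.sh.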