[ https://issues.apache.org/jira/browse/SOLR-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748184#action_12748184 ]
frank farmer commented on SOLR-1091:
------------------------------------

My concern is not that Solr do anything specific with this garbled data, only that wt=phps always returns a string that can be run through unserialize() without error.

Here's the exact case in which I encountered this bug, which may help explain why I reported this issue in the first place:

1) Somehow, a user inserted the aforementioned sequence of bytes in some user-editable content in my application.
2) My code blindly passed that data directly into Solr (in retrospect, I should probably be filtering anything that's not valid UTF-8).
3) Users ran queries which included the affected document.
4) My code tried to unserialize() the output, and failed with a PHP error (simply replacing the offending "s:4:" with "s:6:" caused the output to unserialize without issue, however). This caused my users to be unable to retrieve results for many queries.

Long story short, if you let users insert arbitrary byte sequences into your index (which I'll admit is naive, but I'm sure I'm not the only one who's done this), and you use wt=phps, a malicious user can effectively cause a DoS.

Again, I don't care about actually getting these bytes back out of Solr unmangled. I only care that the output of wt=phps makes it through unserialize() without causing a PHP error.

> "phps" (serialized PHP) writer produces invalid output
> ------------------------------------------------------
>
>                 Key: SOLR-1091
>                 URL: https://issues.apache.org/jira/browse/SOLR-1091
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.3
>         Environment: Sun JRE 1.6.0 on Centos 5
>            Reporter: frank farmer
>            Priority: Minor
>             Fix For: 1.4
>
>
> The serialized PHP output writer can output invalid string lengths for
> certain (unusual) input values. Specifically, I had a document containing
> the following 6 byte character sequence: \xED\xAF\x80\xED\xB1\xB8
> I was able to create a document in the index containing this value without
> issue; however, when fetching the document back out using the serialized PHP
> writer, it returns a string like the following:
> s:4:"􀁸";
> Note that the string length specified is 4, while the string is actually 6
> bytes long.
> When using PHP's native serialize() function, it correctly sets the length to
> 6:
> # php -r 'var_dump(serialize("\xED\xAF\x80\xED\xB1\xB8"));'
> string(13) "s:6:"􀁸";"
> The "wt=php" writer, which produces output to be parsed with eval(), doesn't
> have any trouble with this string.
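
The length mismatch described above can be reproduced at the byte level. The following is a minimal Python sketch and is not taken from Solr or from the reporter's application. The helper names `is_valid_utf8` and `php_serialize_bytes` are hypothetical. It shows that the six reported bytes are not strict UTF-8 (they are two UTF-16 surrogates encoded separately), that a byte-counting serializer declares a length of 6, and that the pair decodes to a single code point whose UTF-8 form is 4 bytes. Whether Solr's phps writer arrived at 4 in this way is an assumption; the report does not say.

```python
# Six bytes from the report: two UTF-16 surrogates, each encoded as 3 bytes.
RAW = b"\xED\xAF\x80\xED\xB1\xB8"


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def php_serialize_bytes(data: bytes) -> bytes:
    """Hypothetical helper: build a PHP serialized string from raw bytes.

    The declared length is the byte count of the payload, matching what
    PHP's own serialize() reports in the issue description.
    """
    return b's:' + str(len(data)).encode("ascii") + b':"' + data + b'";'


# Decode leniently to expose the two surrogate code points, then let a
# UTF-16 round trip combine them into one supplementary code point.
surrogates = RAW.decode("utf-8", errors="surrogatepass")
paired = surrogates.encode("utf-16", errors="surrogatepass").decode("utf-16")

print(is_valid_utf8(RAW))                      # strict UTF-8 check
print(php_serialize_bytes(RAW))                # byte-counted serialization
print([hex(ord(c)) for c in surrogates])       # the two surrogates
print(hex(ord(paired)), len(paired.encode("utf-8")))  # code point, UTF-8 size
```

A check like `is_valid_utf8` applied before indexing would correspond to the filtering mentioned in step 2 of the comment. It guards the application side only and does not change what the phps writer emits.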