[ https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel closed NUTCH-1591.
----------------------------------

> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
>                 Key: NUTCH-1591
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, indexer, parser, storage
>    Affects Versions: 2.2
>         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>            Reporter: Jason Howes
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: NUTCH-1591.patch, NUTCH-1591.zip, Nutch1591Test.java
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are aligned
> (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the underlying
> array is the same as ByteBuffer.remaining()). In many cases this assumption
> does not hold. The correct way to convert a ByteBuffer to a String (or stream
> thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(),
>     buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed
> content contains data from other columns (as well as garbage content) because
> the Cassandra client library returns ByteBuffers that are views on top of a
> larger byte[]. It also seems that others have hit this problem:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
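
To make the difference between the two conversions in the description concrete, here is a minimal, self-contained sketch. The class and method names (ByteBufferStrings, naive, windowed) and the sample data are hypothetical and do not come from the Nutch patch. The explicit UTF-8 charset is also an assumption; the snippets in the report use the platform default charset. Both methods call array(), so they only apply to buffers backed by an accessible array (hasArray() returns true).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch or of the attached patch.
final class ByteBufferStrings {

    // Conversion criticized in the report: decodes the whole backing array,
    // ignoring arrayOffset(), position() and limit().
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Conversion proposed in the report: decodes only the readable window.
    // UTF-8 is an assumption made for this sketch.
    static String windowed(ByteBuffer buf) {
        return new String(buf.array(), buf.arrayOffset() + buf.position(),
                buf.remaining(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One larger byte[] holding several "columns", as a stand-in for a
        // client library that hands out views on a shared array.
        byte[] row = "col1|content|col3".getBytes(StandardCharsets.UTF_8);

        // A view over the 7 bytes starting at index 5; slice() gives it a
        // non-zero arrayOffset().
        ByteBuffer view = ByteBuffer.wrap(row, 5, 7).slice();

        System.out.println(naive(view));    // includes the neighbouring columns
        System.out.println(windowed(view)); // only the bytes inside the view
    }
}
```

Neither method changes the buffer's position, so the same view can be decoded again or passed on afterwards.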