[ 
https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1591.
----------------------------------

> Incorrect conversion of ByteBuffer to String
> --------------------------------------------
>
>                 Key: NUTCH-1591
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1591
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, indexer, parser, storage
>    Affects Versions: 2.2
>         Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>            Reporter: Jason Howes
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: NUTCH-1591.patch, NUTCH-1591.zip, Nutch1591Test.java
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion 
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are aligned
> (i.e. ByteBuffer.arrayOffset() and ByteBuffer.position() are both 0 and the
> length of the underlying array is the same as ByteBuffer.remaining()). This
> often does not hold. The correct way to convert a ByteBuffer to a String (or
> stream thereof) is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(), 
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
> content contains data from other columns (as well as garbage content) since 
> the Cassandra client library returns ByteBuffers that are views on top of a 
> larger byte[]. It also seems that others have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on 
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
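
For illustration, here is a minimal, self-contained sketch of the failure mode
described above. It is not taken from the attached patch or from
Nutch1591Test.java. The class name, method names and sample data are
hypothetical, and the explicit UTF-8 charset is an assumption (the snippets in
the report use the platform default charset, and StandardCharsets requires
Java 7 or later). The buffer is built as a slice over a larger byte[], which
mimics the kind of view the report says the Cassandra client returns.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch.
final class ByteBufferToString {

    // Naive conversion: decodes the whole backing array and ignores
    // arrayOffset(), position() and remaining().
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Offset-aware conversion: decodes only the bytes between the buffer's
    // position and limit. Requires an array-backed buffer (buf.hasArray()).
    static String correct(ByteBuffer buf) {
        return new String(buf.array(),
                          buf.arrayOffset() + buf.position(),
                          buf.remaining(),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One large array holding several "columns" (made-up sample data).
        byte[] backing = "key1|content|key2".getBytes(StandardCharsets.UTF_8);

        // A view over bytes 5..11 only; after slice(), arrayOffset() is 5.
        ByteBuffer view = ByteBuffer.wrap(backing, 5, 7).slice();

        System.out.println(naive(view));   // prints "key1|content|key2"
        System.out.println(correct(view)); // prints "content"
    }
}
```

Note that array() throws for direct or read-only buffers, so code that may
receive such buffers would need to check hasArray() first and fall back to
another decoding path.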



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
