[jira] [Commented] (ACCUMULO-2353) Test improvments to java.io.InputStream.seek() for possible Hadoop patch
[ https://issues.apache.org/jira/browse/ACCUMULO-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592238#comment-15592238 ]

Josh Elser commented on ACCUMULO-2353:
--------------------------------------

IMO, close this and, if someone wants to follow through with this in Hadoop or some JVM vendor, fantastic.

> Test improvments to java.io.InputStream.seek() for possible Hadoop patch
> ------------------------------------------------------------------------
>
>                 Key: ACCUMULO-2353
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2353
>             Project: Accumulo
>          Issue Type: Task
>         Environment: Java 6 update 45 or later
>                      Hadoop 2.2.0
>            Reporter: Dave Marion
>            Priority: Minor
>
> At some point (early Java 7, I think, then backported around Java 6 Update 45),
> the java.io.InputStream.skip() method was changed from using a byte[512] buffer
> to a byte[2048] buffer. The difference can be seen in DeflaterInputStream, which
> has not been updated:
> {noformat}
> public long skip(long n) throws IOException {
>     if (n < 0) {
>         throw new IllegalArgumentException("negative skip length");
>     }
>     ensureOpen();
>     // Skip bytes by repeatedly decompressing small blocks
>     if (rbuf.length < 512)
>         rbuf = new byte[512];
>     int total = (int)Math.min(n, Integer.MAX_VALUE);
>     long cnt = 0;
>     while (total > 0) {
>         // Read a small block of uncompressed bytes
>         int len = read(rbuf, 0, (total <= rbuf.length ? total : rbuf.length));
>         if (len < 0) {
>             break;
>         }
>         cnt += len;
>         total -= len;
>     }
>     return cnt;
> }
> {noformat}
> and java.io.InputStream in Java 6 Update 45:
> {noformat}
> // MAX_SKIP_BUFFER_SIZE is used to determine the maximum buffer size to
> // use when skipping.
> private static final int MAX_SKIP_BUFFER_SIZE = 2048;
>
> public long skip(long n) throws IOException {
>     long remaining = n;
>     int nr;
>     if (n <= 0) {
>         return 0;
>     }
>
>     int size = (int)Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
>     byte[] skipBuffer = new byte[size];
>     while (remaining > 0) {
>         nr = read(skipBuffer, 0, (int)Math.min(size, remaining));
>         if (nr < 0) {
>             break;
>         }
>         remaining -= nr;
>     }
>
>     return n - remaining;
> }
> {noformat}
> In sample tests I saw about a 20% improvement in skip() when seeking towards
> the end of a locally cached compressed file. Looking at the DecompressorStream
> in HDFS, the skip method is a near copy of the old InputStream method:
> {noformat}
> private byte[] skipBytes = new byte[512];
>
> @Override
> public long skip(long n) throws IOException {
>     // Sanity checks
>     if (n < 0) {
>         throw new IllegalArgumentException("negative skip length");
>     }
>     checkStream();
>
>     // Read 'n' bytes
>     int skipped = 0;
>     while (skipped < n) {
>         int len = Math.min(((int)n - skipped), skipBytes.length);
>         len = read(skipBytes, 0, len);
>         if (len == -1) {
>             eof = true;
>             break;
>         }
>         skipped += len;
>     }
>     return skipped;
> }
> {noformat}
> This task is to evaluate the changes to DecompressorStream, with a possible
> patch to HDFS and a possible bug request to Oracle to port the
> InputStream.skip() changes to DeflaterInputStream.skip().

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
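The buffer-size change the ticket proposes can be sketched as a standalone helper. This is only an illustration of the Java 6u45 InputStream.skip() approach (one buffer sized min(2048, n) instead of a fixed 512-byte buffer); the class and method names `SkipBufferSketch` and `skipWithLargeBuffer` are made up for the example and are not code from the ticket, the JDK, or Hadoop:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SkipBufferSketch {
    // Same constant the newer InputStream.skip() uses: cap the scratch
    // buffer at 2048 bytes instead of the old fixed 512.
    static final int MAX_SKIP_BUFFER_SIZE = 2048;

    // Skips up to n bytes of 'in' by reading into a scratch buffer,
    // returning how many bytes were actually skipped.
    static long skipWithLargeBuffer(InputStream in, long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        long remaining = n;
        int size = (int) Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
        byte[] skipBuffer = new byte[size];
        while (remaining > 0) {
            int nr = in.read(skipBuffer, 0, (int) Math.min(size, remaining));
            if (nr < 0) {          // end of stream reached early
                break;
            }
            remaining -= nr;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        // Skip halfway into a 10 000-byte stream: the loop runs three
        // times (2048 + 2048 + 904) instead of ten times with byte[512].
        InputStream in = new ByteArrayInputStream(new byte[10_000]);
        long skipped = skipWithLargeBuffer(in, 5_000);
        System.out.println("skipped=" + skipped);   // skipped=5000
    }
}
```

The fewer loop iterations (and fewer read() calls into the decompressor) are where the roughly 20% improvement reported in the description would come from; actual numbers depend on the stream and the decompressor.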
[jira] [Commented] (ACCUMULO-2353) Test improvments to java.io.InputStream.seek() for possible Hadoop patch
[ https://issues.apache.org/jira/browse/ACCUMULO-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586521#comment-15586521 ]

Dave Marion commented on ACCUMULO-2353:
---------------------------------------

Looks like this was fixed in Hadoop 2.8. What's the disposition for this ticket?
[jira] [Commented] (ACCUMULO-2353) Test improvments to java.io.InputStream.seek() for possible Hadoop patch
[ https://issues.apache.org/jira/browse/ACCUMULO-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898651#comment-13898651 ]

Josh Elser commented on ACCUMULO-2353:
--------------------------------------

Why file the ticket here and not in Hadoop-Common, [~dlmarion]?
[jira] [Commented] (ACCUMULO-2353) Test improvments to java.io.InputStream.seek() for possible Hadoop patch
[ https://issues.apache.org/jira/browse/ACCUMULO-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898724#comment-13898724 ]

Dave Marion commented on ACCUMULO-2353:
---------------------------------------

The rationale was to capture the issue and do some testing before cluttering up the other ticket systems.