[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142510#comment-16142510 ]

Dennis Huo commented on MAPREDUCE-6931:
---

Done.

> Remove TestDFSIO "Total Throughput" calculation
> ---
>
>                 Key: MAPREDUCE-6931
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: benchmarks, test
>    Affects Versions: 2.8.0
>            Reporter: Dennis Huo
>            Priority: Trivial
>         Attachments: MAPREDUCE-6931-001.patch
>
> The new "Total Throughput" line added in
> https://issues.apache.org/jira/browse/HDFS-9153 is currently calculated as
> {{toMB(size) / ((float)execTime)}} and claims to be in units of "MB/s", but
> {{execTime}} is in milliseconds; thus, the reported number is 1/1000x the
> actual value:
> {code:java}
>     String resultLines[] = {
>         "- TestDFSIO - : " + testType,
>         "Date & time: " + new Date(System.currentTimeMillis()),
>         "Number of files: " + tasks,
>         " Total MBytes processed: " + df.format(toMB(size)),
>         " Throughput mb/sec: " + df.format(size * 1000.0 / (time * MEGA)),
>         "Total Throughput mb/sec: " + df.format(toMB(size) / ((float)execTime)),
>         " Average IO rate mb/sec: " + df.format(med),
>         " IO rate std deviation: " + df.format(stdDev),
>         " Test exec time sec: " + df.format((float)execTime / 1000),
>         "" };
> {code}
> The different calculated fields can also use toMB and a shared
> milliseconds-to-seconds conversion to make it easier to keep units consistent.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
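The unit bug described above can be illustrated standalone. Below is a minimal sketch of the fix the description implies: divide by seconds, not milliseconds. The helper names `toMB` and `msToSecs` mirror the ones proposed in the description and are assumptions, not the committed patch:

```java
public class ThroughputFix {
    // 1 MB, matching TestDFSIO's MEGA constant (0x100000 bytes).
    private static final long MEGA = 0x100000;

    static float toMB(long bytes) {
        return ((float) bytes) / MEGA;
    }

    static float msToSecs(long timeMillis) {
        return timeMillis / 1000.0f;
    }

    // Buggy:  toMB(size) / execTime           -> MB per *millisecond*, 1/1000x too small
    // Fixed:  toMB(size) / msToSecs(execTime) -> MB per second
    static float totalThroughputMbSec(long size, long execTimeMs) {
        return toMB(size) / msToSecs(execTimeMs);
    }

    public static void main(String[] args) {
        // 1 GB processed in 10,000 ms should report 102.4 MB/s, not 0.1024.
        System.out.println(totalThroughputMbSec(1024L * 1024 * 1024, 10_000));
    }
}
```

The bug and fix differ only in whether `execTime` is converted before the division, which is why the reported number was exactly 1000x too small.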
[jira] [Commented] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125997#comment-16125997 ]

Dennis Huo commented on MAPREDUCE-6931:
---

Right, that line is part of the refactoring to make time and byte conversions consistently use the helper functions instead of duplicating the conversions in different places. So the current pull request keeps the refactoring but removes the "Total Throughput" line as you suggested. If you would prefer to also drop the refactoring and keep the hard-coded "(float)execTime / 1000" conversions, I can do that too; just let me know.
[jira] [Commented] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124768#comment-16124768 ]

Dennis Huo commented on MAPREDUCE-6931:
---

Hmm, looking at GitHub I only see the refactoring of the older messages, along with complete removal of the "Total Throughput" line. The confusion might be that there's only one commit, because I used "commit --amend" out of force of habit from other repos I've worked on where that convention is used for review-time changes to small patches. I could probably reconstruct the commit history if you prefer.
[jira] [Updated] (MAPREDUCE-6931) Remove TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Huo updated MAPREDUCE-6931:
---

    Summary: Remove TestDFSIO "Total Throughput" calculation  (was: Fix TestDFSIO "Total Throughput" calculation)
[jira] [Commented] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122505#comment-16122505 ]

Dennis Huo commented on MAPREDUCE-6931:
---

Fair enough, makes sense. I went ahead and removed that line while otherwise keeping the refactorings. I also updated my commit message and pull request title to reflect the "removal" rather than the "fix" of the line, but it sounds like the guidelines are to avoid editing JIRAs in place, so I'll leave that untouched.
[jira] [Commented] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation
[ https://issues.apache.org/jira/browse/MAPREDUCE-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122308#comment-16122308 ]

Dennis Huo commented on MAPREDUCE-6931:
---

Thanks for the explanation! I have no strong preference about removing the particular "Total Throughput" metric, but from my own experience using TestDFSIO in the past, I do find that the "average single-stream throughput" calculation historically provided by TestDFSIO can itself be somewhat misleading in characterizing a cluster, since it makes it difficult to infer the level of concurrency corresponding to that per-stream performance without backing out the numbers manually. I see the new metric as a useful measure of "Effective Aggregate Throughput", all-in, including overhead.

For example, if I use memory settings that only fit 1 container per physical machine at a time, my TestDFSIO will trickle through 1 task per machine at a time, and those single tasks will have very high single-stream throughput. If I instead do memory packing so that every machine runs, say, 64 tasks concurrently, then single-stream throughput will suffer significantly, while total walltime will decrease significantly. With a walltime-based calculation, I can see at a glance the approximate total throughput rating of my cluster when everything is running at full throttle; I'd expect increasing concurrency to increase aggregate throughput until IO limits are reached, at which point aggregate throughput becomes flat w.r.t. increasing concurrency, or declines slightly due to thrashing.

This could also be my cloud bias, where it becomes more important to characterize a full-blast cluster against a remote filesystem than to care so much about per-stream throughputs. It seems like an "effective aggregate throughput" calculation would help encompass the cluster-wide effects of things like optimal CPU oversubscription ratios, scheduler settings, speculative execution vs failure rates, etc.
I agree the wording and computation as-is might not be the right fit for this, though. I see a few options that might be worthwhile, possibly in some combination:

* Change the wording to "Effective Aggregate Throughput" to more accurately describe what the number means.
* Add a metric displaying the "time" as "Slot Seconds" or something like that, so that the user doesn't have to compute it by explicitly dividing "Total MBytes processed" by "Throughput mb/sec". This also helps clarify that the throughput is computed in terms of slot time, not walltime.
* Additionally, maybe provide a measure of "average concurrency": total slot time divided by walltime. This would legitimately account for scheduler overheads; if my whole test ran only 1 task in an hour, and that task only had 30 minutes of slot time, then a concurrency of 0.5 correctly characterizes the fact that I'm only squeezing out 0.5 utilization after factoring in delays.

In any case, I'm happy to just delete the one line in place to get the refactorings committed if you feel it's better not to change/add metrics, or if these are better discussed in a followup JIRA; let me know.

Re: MAPREDUCE vs. HDFS, I'll be sure to remember that TestDFSIO goes under HDFS in the future. For this one, I searched for "TestDFSIO" in JIRA and eyeballed that a plurality seemed to be under MAPREDUCE, a smaller fraction in HDFS, and the remaining ones in HADOOP. Combined with this code living under the hadoop-mapreduce directory, MAPREDUCE looked more correct.
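The three proposed metrics can be sketched as plain arithmetic. This is an illustrative standalone sketch; the method names are hypothetical and not from any committed patch:

```java
public class AggregateMetrics {
    // 1 MB, matching TestDFSIO's MEGA constant.
    private static final long MEGA = 0x100000;

    // "Effective Aggregate Throughput": total MB divided by wall-clock seconds,
    // so the number reflects cluster-wide concurrency, not per-stream speed.
    static double effectiveAggregateThroughputMbSec(long totalBytes, double walltimeSecs) {
        return ((double) totalBytes / MEGA) / walltimeSecs;
    }

    // "Slot Seconds": the sum of per-task execution times, i.e. the time basis
    // already implicit in the existing "Throughput mb/sec" line.
    static double slotSeconds(double[] taskSecs) {
        double sum = 0;
        for (double t : taskSecs) sum += t;
        return sum;
    }

    // "Average concurrency": slot time divided by walltime. A single task that ran
    // 30 minutes inside a 60-minute test yields 0.5, exposing scheduling overhead.
    static double averageConcurrency(double slotSecs, double walltimeSecs) {
        return slotSecs / walltimeSecs;
    }

    public static void main(String[] args) {
        double slots = slotSeconds(new double[] {1800.0}); // one 30-minute task
        System.out.println(averageConcurrency(slots, 3600.0)); // 60-minute walltime -> 0.5
    }
}
```

The 0.5 result in the example matches the one-task-per-hour scenario from the comment above: half the walltime was spent outside any task slot.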
[jira] [Created] (MAPREDUCE-6931) Fix TestDFSIO "Total Throughput" calculation
Dennis Huo created MAPREDUCE-6931:
---

             Summary: Fix TestDFSIO "Total Throughput" calculation
                 Key: MAPREDUCE-6931
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6931
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: benchmarks, test
    Affects Versions: 2.8.0
            Reporter: Dennis Huo
            Priority: Trivial
[jira] [Created] (MAPREDUCE-6759) JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives
Dennis Huo created MAPREDUCE-6759:
---

             Summary: JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives
                 Key: MAPREDUCE-6759
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6759
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission
            Reporter: Dennis Huo

During job submission, the {{JobResourceUploader}} currently iterates over for-loops of {{-libjars}}, {{-files}}, and {{-archives}} sequentially, which can significantly slow down job startup time when a large number of files need to be uploaded, especially when staging the files to a cloud object-store based FileSystem implementation like S3, GCS, WASB, etc., where round-trip latencies may be higher than HDFS despite good throughput when parallelized:

{code:title=JobResourceUploader.java}
if (files != null) {
  FileSystem.mkdirs(jtFs, filesDir, mapredSysPerms);
  String[] fileArr = files.split(",");
  for (String tmpFile : fileArr) {
    URI tmpURI = null;
    try {
      tmpURI = new URI(tmpFile);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
    Path tmp = new Path(tmpURI);
    Path newPath = copyRemoteFiles(filesDir, tmp, conf, replication);
    try {
      URI pathURI = getPathURI(newPath, tmpURI.getFragment());
      DistributedCache.addCacheFile(pathURI, conf);
    } catch (URISyntaxException ue) {
      // should not throw a uri exception
      throw new IOException("Failed to create uri for " + tmpFile, ue);
    }
  }
}

if (libjars != null) {
  FileSystem.mkdirs(jtFs, libjarsDir, mapredSysPerms);
  String[] libjarsArr = libjars.split(",");
  for (String tmpjars : libjarsArr) {
    Path tmp = new Path(tmpjars);
    Path newPath = copyRemoteFiles(libjarsDir, tmp, conf, replication);
    DistributedCache.addFileToClassPath(
        new Path(newPath.toUri().getPath()), conf, jtFs);
  }
}

if (archives != null) {
  FileSystem.mkdirs(jtFs, archivesDir, mapredSysPerms);
  String[] archivesArr = archives.split(",");
  for (String tmpArchives : archivesArr) {
    URI tmpURI;
    try {
      tmpURI = new URI(tmpArchives);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
    Path tmp = new Path(tmpURI);
    Path newPath = copyRemoteFiles(archivesDir, tmp, conf, replication);
    try {
      URI pathURI = getPathURI(newPath, tmpURI.getFragment());
      DistributedCache.addCacheArchive(pathURI, conf);
    } catch (URISyntaxException ue) {
      // should not throw a uri exception
      throw new IOException("Failed to create uri for " + tmpArchives, ue);
    }
  }
}
{code}

Parallelizing the upload of these files would improve job submission time.
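One way to parallelize those loops is to submit each copy as a task to a bounded thread pool and collect the results in input order. The following is a minimal self-contained sketch of that pattern; the `copyRemoteFile` callback stands in for the real `copyRemoteFiles` call and is an assumption, not Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelUploader {
    // Upload every entry of a comma-separated resource list concurrently,
    // preserving input order in the returned list of destination paths.
    static List<String> uploadAll(String commaSeparated,
                                  Function<String, String> copyRemoteFile,
                                  int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String resource : commaSeparated.split(",")) {
                futures.add(pool.submit(() -> copyRemoteFile.apply(resource)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // rethrows any per-file failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in "upload": prefix each name with a hypothetical staging directory.
        List<String> staged = uploadAll("a.jar,b.jar,c.jar",
            name -> "/staging/" + name, 4);
        System.out.println(staged); // [/staging/a.jar, /staging/b.jar, /staging/c.jar]
    }
}
```

Collecting each `Future` after submitting everything keeps failure semantics similar to the sequential loop (the first failed copy aborts submission) while letting high-latency round trips overlap.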
[jira] [Created] (MAPREDUCE-6758) TestDFSIO should parallelize its creation of control files on setup
Dennis Huo created MAPREDUCE-6758:
---

             Summary: TestDFSIO should parallelize its creation of control files on setup
                 Key: MAPREDUCE-6758
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6758
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: test
            Reporter: Dennis Huo

TestDFSIO currently performs a sequential for-loop to create {{nrFiles}} control files in the {{controlDir}}, a subdirectory of the overall {{test.build.data}} directory, which may be a non-HDFS FileSystem implementation:

{code:java}
private void createControlFile(FileSystem fs,
                               long nrBytes, // in bytes
                               int nrFiles) throws IOException {
  LOG.info("creating control file: " + nrBytes + " bytes, " + nrFiles + " files");

  Path controlDir = getControlDir(config);
  fs.delete(controlDir, true);

  for (int i = 0; i < nrFiles; i++) {
    String name = getFileName(i);
    Path controlFile = new Path(controlDir, "in_file_" + name);
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, config, controlFile,
                                         Text.class, LongWritable.class,
                                         CompressionType.NONE);
      writer.append(new Text(name), new LongWritable(nrBytes));
    } catch (Exception e) {
      throw new IOException(e.getLocalizedMessage());
    } finally {
      if (writer != null)
        writer.close();
      writer = null;
    }
  }
  LOG.info("created control files for: " + nrFiles + " files");
}
{code}

When testing against an object-store based filesystem with higher round-trip latency than HDFS (like S3 or GCS), job setup that might take only seconds on HDFS ends up taking minutes or even tens of minutes if the test uses thousands of control files. In the same vein as other JIRAs under [https://issues.apache.org/jira/browse/HADOOP-11694], the control-file creation should be parallelized/multithreaded to efficiently launch large TestDFSIO jobs against FileSystem impls with high round-trip latency but which can still support high overall throughput/QPS.
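The sequential loop above can be fanned out to a thread pool with `invokeAll`, since each control file is independent. A minimal self-contained sketch of that shape follows; the `writeOne` callback is a stand-in for the `SequenceFile.createWriter`/`append`/`close` sequence, and the file-name format is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BiConsumer;

public class ParallelControlFiles {
    // Create nrFiles control files concurrently on a fixed-size pool;
    // writeOne(name, nrBytes) replaces the per-file SequenceFile write.
    static void createControlFiles(int nrFiles, long nrBytes, int threads,
                                   BiConsumer<String, Long> writeOne)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < nrFiles; i++) {
                final String name = "in_file_" + i; // illustrative name format
                tasks.add(() -> { writeOne.accept(name, nrBytes); return null; });
            }
            // invokeAll blocks until every file is written or a task fails.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // surface any per-file exception
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        ConcurrentLinkedQueue<String> written = new ConcurrentLinkedQueue<>();
        createControlFiles(8, 1024L, 4, (name, bytes) -> written.add(name));
        System.out.println(written.size()); // 8
    }
}
```

Because each write is a small round trip, the speedup against a high-latency object store is roughly the pool size until the store's request-rate limits are reached.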