[jira] [Assigned] (TEZ-1205) Remove profiling keyword from APIs/configs
[ https://issues.apache.org/jira/browse/TEZ-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V reassigned TEZ-1205: Assignee: Gopal V (was: Rajesh Balamohan) Remove profiling keyword from APIs/configs - Key: TEZ-1205 URL: https://issues.apache.org/jira/browse/TEZ-1205 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Gopal V Priority: Blocker Given that the current functionality to support profiling actually just implies augmenting of command line options for a specified set of tasks, we can word the APIs and configs to be more general purpose. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1205) Remove profiling keyword from APIs/configs
[ https://issues.apache.org/jira/browse/TEZ-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078913#comment-14078913 ] Gopal V commented on TEZ-1205: -- This is passed in as two properties - one declaring the actual cmd-opts and the other selecting the vertex/task it is targeted at. "additional" does not explain the relationship between the two. Remove profiling keyword from APIs/configs - Key: TEZ-1205 URL: https://issues.apache.org/jira/browse/TEZ-1205 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Gopal V Priority: Blocker Given that the current functionality to support profiling actually just implies augmenting of command line options for a specified set of tasks, we can word the APIs and configs to be more general purpose. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V reassigned TEZ-1360: Assignee: Gopal V Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V It would be good for a task to get info about the total task count of its vertex. With this there would be an equivalent for map-reduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using them could be ported to Tez more easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1360: - Attachment: TEZ-1332.1.patch Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V Attachments: TEZ-1332.1.patch It would be good for a task to get info about the total task count of its vertex. With this there would be an equivalent for map-reduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using them could be ported to Tez more easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1372) Fix preWarm to work after recent API changes
[ https://issues.apache.org/jira/browse/TEZ-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089848#comment-14089848 ] Gopal V commented on TEZ-1372: -- [~bikassaha]: getting to this after HIVE-7601 and HIVE-7639 gets a test run. Fix preWarm to work after recent API changes Key: TEZ-1372 URL: https://issues.apache.org/jira/browse/TEZ-1372 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-1372.1.patch, TEZ-1372.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1372) Fix preWarm to work after recent API changes
[ https://issues.apache.org/jira/browse/TEZ-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091011#comment-14091011 ] Gopal V commented on TEZ-1372: -- Missed PreWarmVertex.java in diff? Fix preWarm to work after recent API changes Key: TEZ-1372 URL: https://issues.apache.org/jira/browse/TEZ-1372 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-1372.1.patch, TEZ-1372.2.patch, TEZ-1372.3.patch, TEZ-1372.svg -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1205) Remove profiling keyword from APIs/configs
[ https://issues.apache.org/jira/browse/TEZ-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1205: - Assignee: Rajesh Balamohan (was: Gopal V) Remove profiling keyword from APIs/configs - Key: TEZ-1205 URL: https://issues.apache.org/jira/browse/TEZ-1205 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Rajesh Balamohan Priority: Blocker Attachments: TEZ-1205.1.patch, TEZ-1205.2.patch Given that the current functionality to support profiling actually just implies augmenting of command line options for a specified set of tasks, we can word the APIs and configs to be more general purpose. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093368#comment-14093368 ] Gopal V commented on TEZ-1360: -- [~hitesh]: looks like I'll have to start over after the API changes. Moving parallelism to TaskContext. Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V Attachments: TEZ-1360.1.patch It would be good for a task to get info about the total task count of its vertex. With this there would be an equivalent for map-reduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using them could be ported to Tez more easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1360: - Attachment: TEZ-1360.2.patch Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V Attachments: TEZ-1360.1.patch, TEZ-1360.2.patch It would be good for a task to get info about the total task count of its vertex. With this there would be an equivalent for map-reduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using them could be ported to Tez more easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1411) Address initial feedback on swimlanes
[ https://issues.apache.org/jira/browse/TEZ-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098965#comment-14098965 ] Gopal V commented on TEZ-1411: -- [~jeagles]: sure, that shouldn't be a big issue. The reason to use SVG was to have the clickable links. Address initial feedback on swimlanes - Key: TEZ-1411 URL: https://issues.apache.org/jira/browse/TEZ-1411 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Gopal V Priority: Blocker Fix For: 0.5.0 A few other good-to-have things: 1) A wrapper script that takes care of the command chaining with a single appId as input from the user. 2) A legend in the README or in the svg itself about what is what. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1411) Address initial feedback on swimlanes
[ https://issues.apache.org/jira/browse/TEZ-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14099297#comment-14099297 ] Gopal V commented on TEZ-1411: -- [~jeagles]: You can produce a zoomed-out view by modifying the -t variable. I intend to rewrite this tool without regex-based log parsing, pulling all the information from ATS/SimpleLoggingHistoryService directly. The latter is trivial to use - just add this to tez-site.xml to log ATS-like info into HDFS:
{code}
<property>
  <name>tez.simple.history.logging.dir</name>
  <value>${fs.default.name}/tez-history/</value>
</property>
{code}
I would encourage you to use either of those, because I'll try to push out more tooling I have built for post-hoc analysis of that data. Address initial feedback on swimlanes - Key: TEZ-1411 URL: https://issues.apache.org/jira/browse/TEZ-1411 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Gopal V Priority: Blocker Fix For: 0.5.0 Attachments: TEZ-1411.1.patch, large.am.history.txt A few other good-to-have things: 1) A wrapper script that takes care of the command chaining with a single appId as input from the user. 2) A legend in the README or in the svg itself about what is what. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1390) Replace byte[] with ByteBuffer as the type of user payload in the API
[ https://issues.apache.org/jira/browse/TEZ-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101350#comment-14101350 ] Gopal V commented on TEZ-1390: -- Need to put a warning around
{code}
...
+byte[] bytes = new byte[bb.limit() - bb.position()];
+bb.get(bytes);
{code}
This only works consistently because every single call to DagTypeConverters.convertFromTezUserPayload(userPayload) produces a throw-away reference. Also, for consistency, we could use the {{size}} variable in the array allocation as well. Replace byte[] with ByteBuffer as the type of user payload in the API - Key: TEZ-1390 URL: https://issues.apache.org/jira/browse/TEZ-1390 Project: Apache Tez Issue Type: Improvement Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Priority: Blocker Attachments: TEZ-1390.1.patch, TEZ-1390.2.patch, TEZ-1390.3.patch, TEZ-1390.4.txt, pig.payload.txt This is just an API change. Internally we can continue to use byte[] since that's a much bigger change. The translation from ByteBuffer to byte[] in the API layer should not have a perf impact. -- This message was sent by Atlassian JIRA (v6.2#6252)
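The warning above is about ByteBuffer's stateful cursor: {{bb.get(bytes)}} advances the buffer's position, so the copy is only safe when the ByteBuffer is a throw-away reference. A minimal sketch of the defensive alternative (hypothetical names, not the DagTypeConverters code) reads through a duplicate(), which shares the content but not the position:

```java
import java.nio.ByteBuffer;

public class PayloadCopy {
    // Copies the remaining bytes without disturbing the caller's buffer:
    // reading through a duplicate leaves the original position untouched.
    public static byte[] toBytes(ByteBuffer payload) {
        ByteBuffer view = payload.duplicate();     // shares content, not position
        byte[] bytes = new byte[view.remaining()]; // same as limit() - position()
        view.get(bytes);                           // advances only the duplicate
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap(new byte[]{1, 2, 3});
        byte[] copy = toBytes(bb);
        // The original buffer is still fully readable after the copy.
        System.out.println(copy.length + " " + bb.remaining()); // prints 3 3
    }
}
```

With the plain {{bb.get(bytes)}} pattern, a second caller holding the same reference would see an exhausted buffer.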
[jira] [Created] (TEZ-1466) Fix JDK8 builds of tez
Gopal V created TEZ-1466: Summary: Fix JDK8 builds of tez Key: TEZ-1466 URL: https://issues.apache.org/jira/browse/TEZ-1466 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Assignee: Gopal V Priority: Trivial Attachments: TEZ-1466.1.patch Tez fails to build on JDK8 due to stricter generics checks on a unit test {code} sortedDataMap = TreeMultimap.create(this.correctComparator, Ordering.natural()); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1360: - Attachment: TEZ-1360.4.patch Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V Fix For: 0.5.1 Attachments: TEZ-1360.1.patch, TEZ-1360.2.patch, TEZ-1360.3.patch, TEZ-1360.4.patch It would be good for a task to get info about the total task count of its vertex. With this there would be an equivalent for map-reduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using them could be ported to Tez more easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TEZ-1469) AM/Session LRs are not shipped to vertices in new API use-case
[ https://issues.apache.org/jira/browse/TEZ-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104286#comment-14104286 ] Gopal V edited comment on TEZ-1469 at 8/20/14 6:26 PM: --- I think this is a breakage between Session and non-Session modes. When I use sessions, this works as expected. was (Author: gopalv): I think this is a breakage between Session and on-Session modes. When I use sessions, this works as expected. AM/Session LRs are not shipped to vertices in new API use-case -- Key: TEZ-1469 URL: https://issues.apache.org/jira/browse/TEZ-1469 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Priority: Blocker Attachments: TEZ-1469.1.patch, tez-broadcast-example.tgz Previously in the tez codebase, the session LRs were part of each vertex's LRs, automatically. During 0.5.0 API changes, the following no longer provides local LRs to the vertices, even if it is part of the session LR. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1469) AM/Session LRs are not shipped to vertices in new API use-case
[ https://issues.apache.org/jira/browse/TEZ-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104286#comment-14104286 ] Gopal V commented on TEZ-1469: -- I think this is a breakage between Session and non-Session modes. When I use sessions, this works as expected. AM/Session LRs are not shipped to vertices in new API use-case -- Key: TEZ-1469 URL: https://issues.apache.org/jira/browse/TEZ-1469 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Priority: Blocker Attachments: TEZ-1469.1.patch, tez-broadcast-example.tgz Previously in the tez codebase, the session LRs were part of each vertex's LRs, automatically. During 0.5.0 API changes, the following no longer provides local LRs to the vertices, even if it is part of the session LR. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1469) AM/Session LRs are not shipped to vertices in new API use-case
[ https://issues.apache.org/jira/browse/TEZ-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104378#comment-14104378 ] Gopal V commented on TEZ-1469: -- Fixed that issue in the DAGAppMaster, but still only getting tez LRs and the pb binary as session resources. Tracked that down to
{code}
Map<String, LocalResource> sessionJars =
    new HashMap<String, LocalResource>(tezJarResources.size() + 1);
sessionJars.putAll(tezJarResources);
sessionJars.put(TezConstants.TEZ_PB_BINARY_CONF_NAME, binaryConfLRsrc);
DAGProtos.PlanLocalResourcesProto proto =
    DagTypeConverters.convertFromLocalResources(sessionJars);
sessionJarsPBOutStream = TezCommonUtils.createFileForAM(fs, sessionJarsPath);
proto.writeDelimitedTo(sessionJarsPBOutStream);
{code}
Tez does not use any user-provided jar in session resources, which is how hive-exec.jar is shipped IIRC. AM/Session LRs are not shipped to vertices in new API use-case -- Key: TEZ-1469 URL: https://issues.apache.org/jira/browse/TEZ-1469 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Priority: Blocker Attachments: TEZ-1469.1.patch, tez-broadcast-example.tgz Previously in the tez codebase, the session LRs were part of each vertex's LRs, automatically. During 0.5.0 API changes, the following no longer provides local LRs to the vertices, even if it is part of the session LR. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1469) AM/Session LRs are not shipped to vertices in new API use-case
[ https://issues.apache.org/jira/browse/TEZ-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104462#comment-14104462 ] Gopal V commented on TEZ-1469: -- I'm fine with the DAG as well, if that is guaranteed to not re-localize during a run. The downside is that both AM and DAG need exactly the same JARs for hive, so please provide a way to debug these issues in production as well. AM/Session LRs are not shipped to vertices in new API use-case -- Key: TEZ-1469 URL: https://issues.apache.org/jira/browse/TEZ-1469 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Priority: Blocker Attachments: TEZ-1469.1.patch, tez-broadcast-example.tgz Previously in the tez codebase, the session LRs were part of each vertex's LRs, automatically. During 0.5.0 API changes, the following no longer provides local LRs to the vertices, even if it is part of the session LR. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1469) AM/Session LRs are not shipped to vertices in new API use-case
[ https://issues.apache.org/jira/browse/TEZ-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1469: - Attachment: TEZ-1469.2.patch Patch to DAGAppMaster to add AM Local Resources even if not in session mode. AM/Session LRs are not shipped to vertices in new API use-case -- Key: TEZ-1469 URL: https://issues.apache.org/jira/browse/TEZ-1469 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Priority: Blocker Attachments: TEZ-1469.1.patch, TEZ-1469.2.patch, tez-broadcast-example.tgz Previously in the tez codebase, the session LRs were part of each vertex's LRs, automatically. During 0.5.0 API changes, the following no longer provides local LRs to the vertices, even if it is part of the session LR. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1479) Disambiguate between ShuffleInputEventHandler and ShuffleInputEventHandlerImpl (which are not related)
Gopal V created TEZ-1479: Summary: Disambiguate between ShuffleInputEventHandler and ShuffleInputEventHandlerImpl (which are not related) Key: TEZ-1479 URL: https://issues.apache.org/jira/browse/TEZ-1479 Project: Apache Tez Issue Type: Bug Reporter: Gopal V common.shuffle.impl.ShuffleInputEventHandler is not related to shuffle.common.impl.ShuffleInputEventHandlerImpl This is extremely confusing and needs refactoring internally to be readable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1332) tez swimlanes UI tool
[ https://issues.apache.org/jira/browse/TEZ-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106580#comment-14106580 ] Gopal V commented on TEZ-1332: -- This patch was all python - can't quite figure out how this broke anything. tez swimlanes UI tool - Key: TEZ-1332 URL: https://issues.apache.org/jira/browse/TEZ-1332 Project: Apache Tez Issue Type: New Feature Reporter: Gopal V Assignee: Gopal V Priority: Blocker Fix For: 0.5.0 Attachments: TEZ-1332.1.patch Import https://github.com/t3rmin4t0r/tez-swimlanes into trunk. Also move from using the AM INFO logs to using SimpleHistoryLogging/ATS data to draw the diagrams. The goal is to be able to draw diagrams like, for a developer to debug performance issues - http://people.apache.org/~gopalv/query27.svg -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1360: - Attachment: TEZ-1360.6.patch Patch with doc changes from me and unit test from [~rajesh.balamohan]. Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V Fix For: 0.5.1 Attachments: TEZ-1360.1.patch, TEZ-1360.2.patch, TEZ-1360.4.patch, TEZ-1360.5.patch, TEZ-1360.6.patch It would be good for a task to get info about the total task count of its vertex. With this there would be an equivalent for map-reduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using them could be ported to Tez more easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1489) Broadcast Shuffle should call freeResources() on FetchedInput
Gopal V created TEZ-1489: Summary: Broadcast Shuffle should call freeResources() on FetchedInput Key: TEZ-1489 URL: https://issues.apache.org/jira/browse/TEZ-1489 Project: Apache Tez Issue Type: Bug Reporter: Gopal V BroadcastShuffle does not seem to free up the buffer space allocated by the FetchedInputs during the task runtime. SimpleFetchedInputAllocator::freeResources is never called as per my logging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1491) Tez reducer-side merge's counter update is slow
Gopal V created TEZ-1491: Summary: Tez reducer-side merge's counter update is slow Key: TEZ-1491 URL: https://issues.apache.org/jira/browse/TEZ-1491 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V TezMerger$MergeQueue::next() shows up in profiles due to a synchronized block in a tight loop. Part of the slow operation was due to DataInputBuffer issues identified earlier in HADOOP-10694, but along with that approx. 11% of my lock-prefix calls were originating from the following line
{code}
mergeProgress.set(totalBytesProcessed * progPerByte);
{code}
in two places within the core loop. !perf-top-counters.png! -- This message was sent by Atlassian JIRA (v6.2#6252)
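The cost here is that the flagged line calls a synchronized setter once per merged record. One common mitigation - a sketch under assumed names, not the actual TEZ-1491 fix - is to batch the synchronized update so it fires only every N records:

```java
public class BatchedProgress {
    // Stand-in for Hadoop's Progress: a synchronized float setter, which is
    // the lock-prefixed call the profile flagged in the merge loop.
    static class Progress {
        private float value;
        synchronized void set(float v) { value = v; }
        synchronized float get() { return value; }
    }

    static final int PROGRESS_BATCH = 10_000; // hypothetical batching interval

    // Hypothetical merge loop: touch the synchronized setter only once per
    // PROGRESS_BATCH records (and once at the very end), not per record.
    public static float drive(long totalRecords, float progPerRecord) {
        Progress progress = new Progress();
        for (long processed = 1; processed <= totalRecords; processed++) {
            // ... per-record merge work would happen here ...
            if (processed % PROGRESS_BATCH == 0 || processed == totalRecords) {
                progress.set(processed * progPerRecord);
            }
        }
        return progress.get();
    }

    public static void main(String[] args) {
        System.out.println(drive(25_000, 1.0f / 25_000));
    }
}
```

The reported progress stays accurate to within one batch while the lock traffic drops by the batching factor.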
[jira] [Updated] (TEZ-1491) Tez reducer-side merge's counter update is slow
[ https://issues.apache.org/jira/browse/TEZ-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1491: - Attachment: perf-top-counters.png Tez reducer-side merge's counter update is slow --- Key: TEZ-1491 URL: https://issues.apache.org/jira/browse/TEZ-1491 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: perf-top-counters.png TezMerger$MergeQueue::next() shows up in profiles due to a synchronized block in a tight loop. Part of the slow operation was due to DataInputBuffer issues identified earlier in HADOOP-10694, but along with that approx. 11% of my lock-prefix calls were originating from the following line
{code}
mergeProgress.set(totalBytesProcessed * progPerByte);
{code}
in two places within the core loop. !perf-top-counters.png! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1488) Implement HashComparatorBytesWritable in TezBytesComparator
[ https://issues.apache.org/jira/browse/TEZ-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1488: - Attachment: TEZ-1488.1.patch Implement HashComparatorBytesWritable in TezBytesComparator - Key: TEZ-1488 URL: https://issues.apache.org/jira/browse/TEZ-1488 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1488.1.patch Speed up TezBytesComparator by ~20% when used in PipelinedSorter. This moves part of the key comparator into the partition comparator, which is a single register operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
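The "single register operation" above refers to the proxy-comparator idea: derive an int from each key's leading bytes so that most comparisons reduce to one integer compare, falling back to the full byte-wise compare only on proxy ties. A self-contained sketch of the technique (hypothetical names, not the TezBytesComparator code):

```java
public class ProxyCompareSketch {
    // Pack up to the first 4 bytes of the key into an int, big-endian, so
    // that unsigned int order matches byte-lexicographic order; short keys
    // are zero-padded, which ties back to the length check on fallback.
    public static int getProxy(byte[] key) {
        int proxy = 0;
        for (int i = 0; i < 4; i++) {
            int b = (i < key.length) ? (key[i] & 0xFF) : 0;
            proxy = (proxy << 8) | b;
        }
        return proxy;
    }

    // Compare proxies first (one int compare); only proxy ties pay for the
    // full byte-wise comparison.
    public static int compare(byte[] a, byte[] b) {
        int pa = getProxy(a), pb = getProxy(b);
        if (pa != pb) {
            return Integer.compareUnsigned(pa, pb);
        }
        return fullCompare(a, b);
    }

    private static int fullCompare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

The correctness requirement is that the proxy order never contradicts the full comparator; a prefix-derived proxy satisfies that for lexicographic byte keys.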
[jira] [Commented] (TEZ-1083) Enable IFile RLE for DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109453#comment-14109453 ] Gopal V commented on TEZ-1083: -- This looks alright - it just needs a roll-over check for the sameKeys long variable. The worst-case value for that is near O(n^2), so it might overflow before totalKeys does. For performance, it can be assumed that if sameKeys is 0, isRLENeeded == true - instead of checking within the loop. Enable IFile RLE for DefaultSorter -- Key: TEZ-1083 URL: https://issues.apache.org/jira/browse/TEZ-1083 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Gopal V Attachments: TEZ-1083.1.patch Generate RLE IFiles for DefaultSorter and use it to fast-forward map-side merge. -- This message was sent by Atlassian JIRA (v6.2#6252)
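The roll-over concern above is the standard long-overflow hazard when accumulating a near-quadratic tally. A hedged sketch of one overflow-safe counter (illustrative, not the DefaultSorter code) - the sign trick is the same one {{Math.addExact}} uses to detect overflow:

```java
public class RleCounter {
    // Saturating add: if the tally would overflow a long, pin it at
    // Long.MAX_VALUE instead of wrapping around to a negative value.
    public static long saturatingAdd(long total, long delta) {
        long sum = total + delta;
        // Overflow occurred iff both operands have a sign bit the sum lacks.
        if (((total ^ sum) & (delta ^ sum)) < 0) {
            return Long.MAX_VALUE;
        }
        return sum;
    }
}
```

A saturated count still answers the only question RLE needs - "were there any repeated keys?" - without ever going negative.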
[jira] [Commented] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109461#comment-14109461 ] Gopal V commented on TEZ-1492: -- The BufferUtils class needs re-namespacing as well, as part of this patch. IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1489) Broadcast Shuffle should call freeResources() on FetchedInput
[ https://issues.apache.org/jira/browse/TEZ-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109472#comment-14109472 ] Gopal V commented on TEZ-1489: -- The buffer is being cleared up correctly - but the unreserve() is not getting called, so the internal check switches to Disk even though buffers are unused. Broadcast Shuffle should call freeResources() on FetchedInput - Key: TEZ-1489 URL: https://issues.apache.org/jira/browse/TEZ-1489 Project: Apache Tez Issue Type: Bug Reporter: Gopal V BroadcastShuffle does not seem to free up the buffer space allocated by the FetchedInputs during the task runtime. SimpleFetchedInputAllocator::freeResources is never called as per my logging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1489) Broadcast Shuffle should call freeResources() on FetchedInput
[ https://issues.apache.org/jira/browse/TEZ-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109476#comment-14109476 ] Gopal V commented on TEZ-1489: -- UnorderedKVReader::moveToNextInput() might be a good place for this? Broadcast Shuffle should call freeResources() on FetchedInput - Key: TEZ-1489 URL: https://issues.apache.org/jira/browse/TEZ-1489 Project: Apache Tez Issue Type: Bug Reporter: Gopal V BroadcastShuffle does not seem to free up the buffer space allocated by the FetchedInputs during the task runtime. SimpleFetchedInputAllocator::freeResources is never called as per my logging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1497) Add tez-broadcast-example into tez-examples/
Gopal V created TEZ-1497: Summary: Add tez-broadcast-example into tez-examples/ Key: TEZ-1497 URL: https://issues.apache.org/jira/browse/TEZ-1497 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Assignee: Gopal V Modify https://github.com/t3rmin4t0r/tez-broadcast-example into a usable example inside tez-examples. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110264#comment-14110264 ] Gopal V commented on TEZ-1492: -- Thanks [~rajesh.balamohan], can I have the same diff with git mv instead of the big change-sets? IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch, TEZ-1492.3.patch, TEZ-1492.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TEZ-1497) Add tez-broadcast-example into tez-examples/
[ https://issues.apache.org/jira/browse/TEZ-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V resolved TEZ-1497. -- Resolution: Not a Problem Add tez-broadcast-example into tez-examples/ Key: TEZ-1497 URL: https://issues.apache.org/jira/browse/TEZ-1497 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Assignee: Gopal V Modify https://github.com/t3rmin4t0r/tez-broadcast-example into a usable example inside tez-examples. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1503) UnorderedKVInput.getReader() should return KeyValuesReader
[ https://issues.apache.org/jira/browse/TEZ-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111095#comment-14111095 ] Gopal V commented on TEZ-1503: -- The Unordered input will not satisfy the contract of KeyValuesReader at all - it will not return K, List<V>. UnorderedKVInput.getReader() should return KeyValuesReader -- Key: TEZ-1503 URL: https://issues.apache.org/jira/browse/TEZ-1503 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Currently OrderedGroupedKVInput.getReader() returns KeyValuesReader and UnorderedKVInput.getReader() returns KeyValueReader. It would be useful to return KeyValuesReader for UnorderedKVInput to be consistent with OrderedGroupedKVInput. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111819#comment-14111819 ] Gopal V commented on TEZ-1492: -- Alright, this looks good - +1 IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch, TEZ-1492.3.patch, TEZ-1492.4.patch, TEZ-1492.5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1503) UnorderedKVInput.getReader() should return KeyValuesReader
[ https://issues.apache.org/jira/browse/TEZ-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111859#comment-14111859 ] Gopal V commented on TEZ-1503: -- Yes, I think this change would make it easier to write wrong code. If someone changes the edge type for a vertex, I'd rather get a class cast exception instead of my vertices working, but generating odd results due to the assumptions around K, List<V> for things like aggregations. UnorderedKVInput.getReader() should return KeyValuesReader -- Key: TEZ-1503 URL: https://issues.apache.org/jira/browse/TEZ-1503 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Currently OrderedGroupedKVInput.getReader() returns KeyValuesReader and UnorderedKVInput.getReader() returns KeyValueReader. It would be useful to return KeyValuesReader for UnorderedKVInput to be consistent with OrderedGroupedKVInput. -- This message was sent by Atlassian JIRA (v6.2#6252)
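The aggregation hazard described above is concrete: a KeyValuesReader groups *adjacent* equal keys, which yields exactly one group per key only when the input is sorted. A tiny sketch (assumed helper, not Tez code) of how that same grouping logic silently splits keys on unordered input:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupingContract {
    // Grouping by adjacency (what a KeyValuesReader over *sorted* input
    // effectively does): emit a new group whenever the key changes.
    public static List<String> groupAdjacent(List<String> keys) {
        List<String> groups = new ArrayList<>();
        String prev = null;
        for (String k : keys) {
            if (!k.equals(prev)) {
                groups.add(k);
                prev = k;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sorted input: one group per distinct key.
        System.out.println(groupAdjacent(Arrays.asList("a", "a", "b"))); // [a, b]
        // Unordered input: key "a" appears as two separate groups, which
        // silently breaks per-key aggregation instead of failing loudly.
        System.out.println(groupAdjacent(Arrays.asList("a", "b", "a"))); // [a, b, a]
    }
}
```

This is why a ClassCastException on an edge-type mismatch is the safer failure mode than an unordered input masquerading as K, List<V>.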
[jira] [Commented] (TEZ-1509) Set a useful default value for java opts
[ https://issues.apache.org/jira/browse/TEZ-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112988#comment-14112988 ] Gopal V commented on TEZ-1509: -- -1 on the -XX:+UseCompressedStrings. The rest looks good. Set a useful default value for java opts -- Key: TEZ-1509 URL: https://issues.apache.org/jira/browse/TEZ-1509 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah A subset of the following should be considered for the defaults: -server -XX:+UseCompressedStrings -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1501) Add a test dag to generate load on the getTask RPC
[ https://issues.apache.org/jira/browse/TEZ-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114052#comment-14114052 ] Gopal V commented on TEZ-1501: -- Looks good - +1 Just needs an fs.deleteOnExit() for the PAYLOAD file for cleanups. Add a test dag to generate load on the getTask RPC -- Key: TEZ-1501 URL: https://issues.apache.org/jira/browse/TEZ-1501 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-1501.1.txt, TEZ-1501.2.txt -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1488) Implement HashComparatorBytesWritable in TezBytesComparator
[ https://issues.apache.org/jira/browse/TEZ-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114448#comment-14114448 ] Gopal V commented on TEZ-1488: -- I think we should call the interface what it really is - a ProxyComparator? I will do the renames and write docs. Implement HashComparatorBytesWritable in TezBytesComparator - Key: TEZ-1488 URL: https://issues.apache.org/jira/browse/TEZ-1488 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1488.1.patch Speed up TezBytesComparator by ~20% when used in PipelinedSorter. This moves part of the key comparator into the partition comparator, which is a single register operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
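The "single register operation" claim above can be sketched as follows (hypothetical names, not the actual Tez ProxyComparator interface): derive a fixed-width integer "proxy" from the key prefix, so most comparisons reduce to one integer compare, with the full byte-wise compare reserved for ties.

```java
import java.util.Arrays;

// Minimal sketch of the proxy-comparator idea: an int proxy built from the
// first bytes of the key, ordered consistently with the full key comparison.
public class ProxyCompareSketch {
    // Proxy: first (up to) 4 key bytes, big-endian, zero-padded.
    static int getProxy(byte[] key) {
        int p = 0;
        for (int i = 0; i < 4; i++) {
            p = (p << 8) | (i < key.length ? (key[i] & 0xff) : 0);
        }
        return p;
    }

    static int compare(byte[] a, byte[] b) {
        // Fast path: one register-width unsigned comparison on the proxies.
        int cmp = Integer.compareUnsigned(getProxy(a), getProxy(b));
        if (cmp != 0) return cmp;
        // Slow path: full lexicographic compare only on a proxy tie.
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        System.out.println(compare("apple".getBytes(), "bee".getBytes()) < 0);    // true
        System.out.println(compare("apple".getBytes(), "apple".getBytes()) == 0); // true
    }
}
```

The correctness requirement is that the proxy ordering never contradicts the full ordering; it may only be coarser.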
[jira] [Updated] (TEZ-1488) Implement HashComparatorBytesWritable in TezBytesComparator
[ https://issues.apache.org/jira/browse/TEZ-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1488: - Attachment: TEZ-1488.3.patch Patch includes renames - will only apply with {{git apply -v -p0 TEZ-1488.3.patch}} Implement HashComparatorBytesWritable in TezBytesComparator - Key: TEZ-1488 URL: https://issues.apache.org/jira/browse/TEZ-1488 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1488.1.patch, TEZ-1488.2.patch, TEZ-1488.3.patch Speed up TezBytesComparator by ~20% when used in PipelinedSorter. This moves part of the key comparator into the partition comparator, which is a single register operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1488) Rename HashComparator to ProxyComparator and implement in TezBytesComparator
[ https://issues.apache.org/jira/browse/TEZ-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1488: - Summary: Rename HashComparator to ProxyComparator and implement in TezBytesComparator (was: Implement HashComparatorBytesWritable in TezBytesComparator) Rename HashComparator to ProxyComparator and implement in TezBytesComparator Key: TEZ-1488 URL: https://issues.apache.org/jira/browse/TEZ-1488 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Fix For: 0.6.0 Attachments: TEZ-1488.1.patch, TEZ-1488.2.patch, TEZ-1488.3.patch Speed up TezBytesComparator by ~20% when used in PipelinedSorter. This moves part of the key comparator into the partition comparator, which is a single register operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1488) Rename HashComparator to ProxyComparator and implement in TezBytesComparator
[ https://issues.apache.org/jira/browse/TEZ-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1488: - Fix Version/s: 0.6.0 Rename HashComparator to ProxyComparator and implement in TezBytesComparator Key: TEZ-1488 URL: https://issues.apache.org/jira/browse/TEZ-1488 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Fix For: 0.6.0 Attachments: TEZ-1488.1.patch, TEZ-1488.2.patch, TEZ-1488.3.patch Speed up TezBytesComparator by ~20% when used in PipelinedSorter. This moves part of the key comparator into the partition comparator, which is a single register operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1488) Rename HashComparator to ProxyComparator and implement in TezBytesComparator
[ https://issues.apache.org/jira/browse/TEZ-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1488: - Attachment: TEZ-1488.4.patch Rename HashComparator to ProxyComparator and implement in TezBytesComparator Key: TEZ-1488 URL: https://issues.apache.org/jira/browse/TEZ-1488 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Fix For: 0.6.0 Attachments: TEZ-1488.1.patch, TEZ-1488.2.patch, TEZ-1488.3.patch, TEZ-1488.4.patch Speed up TezBytesComparator by ~20% when used in PipelinedSorter. This moves part of the key comparator into the partition comparator, which is a single register operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1524) getDAGStatus seems to fork out the entire JVM
[ https://issues.apache.org/jira/browse/TEZ-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115806#comment-14115806 ] Gopal V commented on TEZ-1524: -- The cache does not cache misses. getDAGStatus seems to fork out the entire JVM - Key: TEZ-1524 URL: https://issues.apache.org/jira/browse/TEZ-1524 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Tracked down a consistent fork() call to {code} at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50) at org.apache.hadoop.security.Groups.getGroups(Groups.java:139) at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getRPCUserGroups(DAGClientAMProtocolBlockingPBServerImpl.java:75) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375) {code} [~hitesh] - would it make sense to cache this at all? -- This message was sent by Atlassian JIRA (v6.2#6252)
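The caching question raised above could be sketched like this (a hypothetical memoizing wrapper, not the actual TEZ-1524 patch). The key point from the comment is that a cache which "does not cache misses" still pays the expensive resolver (in Hadoop, a shell fork) on every lookup for an unresolvable user, so this sketch deliberately remembers empty results too:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: memoize the user -> groups lookup so a hot RPC path
// like getDAGStatus never re-runs an expensive resolver. Misses are cached
// as an empty list instead of being re-resolved on every call.
public class GroupCacheSketch {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> resolver;
    int resolverCalls = 0;  // exposed only so the demo can count forks

    GroupCacheSketch(Function<String, List<String>> resolver) {
        this.resolver = resolver;
    }

    List<String> getGroups(String user) {
        return cache.computeIfAbsent(user, u -> {
            resolverCalls++;
            List<String> groups = resolver.apply(u);
            // Map a null (miss) to an empty list so the miss is cached too;
            // computeIfAbsent would not cache a null return.
            return groups == null ? List.of() : groups;
        });
    }

    public static void main(String[] args) {
        GroupCacheSketch c = new GroupCacheSketch(u -> u.equals("gopal") ? List.of("users") : null);
        c.getGroups("nobody");
        c.getGroups("nobody");  // served from cache: no second resolver call
        System.out.println(c.resolverCalls);  // 1
    }
}
```

A production version would also need an expiry so group membership changes are eventually picked up; this sketch omits that.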
[jira] [Updated] (TEZ-1524) getDAGStatus seems to fork out the entire JVM
[ https://issues.apache.org/jira/browse/TEZ-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1524: - Attachment: TEZ-1524.1.patch getDAGStatus seems to fork out the entire JVM - Key: TEZ-1524 URL: https://issues.apache.org/jira/browse/TEZ-1524 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Attachments: TEZ-1524.1.patch Tracked down a consistent fork() call to {code} at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50) at org.apache.hadoop.security.Groups.getGroups(Groups.java:139) at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getRPCUserGroups(DAGClientAMProtocolBlockingPBServerImpl.java:75) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375) {code} [~hitesh] - would it make sense to cache this at all? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1525) BroadcastLoadGen testcase
Gopal V created TEZ-1525: Summary: BroadcastLoadGen testcase Key: TEZ-1525 URL: https://issues.apache.org/jira/browse/TEZ-1525 Project: Apache Tez Issue Type: Test Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Broadcast load generator test example -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1525) BroadcastLoadGen testcase
[ https://issues.apache.org/jira/browse/TEZ-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1525: - Attachment: TEZ-1525.1.patch BroadcastLoadGen testcase - Key: TEZ-1525 URL: https://issues.apache.org/jira/browse/TEZ-1525 Project: Apache Tez Issue Type: Test Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1525.1.patch Broadcast load generator test example -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1524) getDAGStatus seems to fork out the entire JVM
[ https://issues.apache.org/jira/browse/TEZ-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1524: - Attachment: TEZ-1524.2.patch getDAGStatus seems to fork out the entire JVM - Key: TEZ-1524 URL: https://issues.apache.org/jira/browse/TEZ-1524 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1524.1.patch, TEZ-1524.2.patch Tracked down a consistent fork() call to {code} at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50) at org.apache.hadoop.security.Groups.getGroups(Groups.java:139) at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getRPCUserGroups(DAGClientAMProtocolBlockingPBServerImpl.java:75) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375) {code} [~hitesh] - would it make sense to cache this at all? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1524) getDAGStatus seems to fork out the entire JVM on non-secure clusters
[ https://issues.apache.org/jira/browse/TEZ-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1524: - Summary: getDAGStatus seems to fork out the entire JVM on non-secure clusters (was: getDAGStatus seems to fork out the entire JVM) getDAGStatus seems to fork out the entire JVM on non-secure clusters Key: TEZ-1524 URL: https://issues.apache.org/jira/browse/TEZ-1524 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1524.1.patch, TEZ-1524.2.patch Tracked down a consistent fork() call to {code} at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50) at org.apache.hadoop.security.Groups.getGroups(Groups.java:139) at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getRPCUserGroups(DAGClientAMProtocolBlockingPBServerImpl.java:75) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375) {code} [~hitesh] - would it make sense to cache this at all? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1524) getDAGStatus seems to fork out the entire JVM on non-secure clusters
[ https://issues.apache.org/jira/browse/TEZ-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1524: - Attachment: TEZ-1524.3.patch Removed the stray println. getDAGStatus seems to fork out the entire JVM on non-secure clusters Key: TEZ-1524 URL: https://issues.apache.org/jira/browse/TEZ-1524 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1524.1.patch, TEZ-1524.2.patch, TEZ-1524.3.patch Tracked down a consistent fork() call to {code} at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83) at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50) at org.apache.hadoop.security.Groups.getGroups(Groups.java:139) at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getRPCUserGroups(DAGClientAMProtocolBlockingPBServerImpl.java:75) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375) {code} [~hitesh] - would it make sense to cache this at all? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1157) Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data
[ https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1157: - Attachment: TEZ-1157.7.patch Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data Key: TEZ-1157 URL: https://issues.apache.org/jira/browse/TEZ-1157 Project: Apache Tez Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Gopal V Labels: performance Attachments: TEZ-1152.WIP.patch, TEZ-1157.3.WIP.patch, TEZ-1157.4.WIP.patch, TEZ-1157.5.WIP.patch, TEZ-1157.6.patch, TEZ-1157.7.patch, TEZ-broadcast-shuffle+vertex-parallelism.patch Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1157) Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data
[ https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1157: - Attachment: TEZ-1157.8.patch Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data Key: TEZ-1157 URL: https://issues.apache.org/jira/browse/TEZ-1157 Project: Apache Tez Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Gopal V Labels: performance Attachments: TEZ-1152.WIP.patch, TEZ-1157.3.WIP.patch, TEZ-1157.4.WIP.patch, TEZ-1157.5.WIP.patch, TEZ-1157.6.patch, TEZ-1157.7.patch, TEZ-1157.8.patch, TEZ-broadcast-shuffle+vertex-parallelism.patch Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1157) Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data
[ https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1157: - Attachment: TEZ-1157.9.patch Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data Key: TEZ-1157 URL: https://issues.apache.org/jira/browse/TEZ-1157 Project: Apache Tez Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Gopal V Labels: performance Attachments: TEZ-1152.WIP.patch, TEZ-1157.3.WIP.patch, TEZ-1157.4.WIP.patch, TEZ-1157.5.WIP.patch, TEZ-1157.6.patch, TEZ-1157.7.patch, TEZ-1157.8.patch, TEZ-1157.9.patch, TEZ-broadcast-shuffle+vertex-parallelism.patch Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1535) Minor bug in computing min-time the reducer should run without getting killed
[ https://issues.apache.org/jira/browse/TEZ-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1535: - Labels: performance timeunits (was: performance) Minor bug in computing min-time the reducer should run without getting killed - Key: TEZ-1535 URL: https://issues.apache.org/jira/browse/TEZ-1535 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Rajesh Balamohan Labels: performance, timeunits Fix For: 0.6.0 ShuffleScheduler's shuffleProgressDuration is computed in milliseconds. ShufflePayload's runDuration is computed in microseconds (i.e. in OrderedPartitionedKVOutput.generateEventsOnClose()). This ends up producing a wrong value when computing the min-time the reducer should run without getting killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
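The millisecond/microsecond mix-up described above can be illustrated with a small sketch (illustrative values and method names, not the actual Tez code):

```java
import java.util.concurrent.TimeUnit;

// One side is tracked in milliseconds, the other arrives in microseconds.
// Comparing the raw numbers is off by a factor of 1000; normalizing through
// TimeUnit before comparing gives the intended answer.
public class UnitMismatch {
    // Did the task run at least as long as the budget? Only correct if the
    // microsecond value is converted to milliseconds first.
    static boolean ranAtLeast(long budgetMillis, long runDurationMicros) {
        return TimeUnit.MICROSECONDS.toMillis(runDurationMicros) >= budgetMillis;
    }

    public static void main(String[] args) {
        long budgetMillis = 5_000;   // min run-time budget: 5 s, in ms
        long runMicros = 3_000_000;  // actual run time: 3 s, reported in µs

        // Raw comparison treats 3,000,000 µs as 3,000,000 ms: wrong by 1000x.
        System.out.println(runMicros >= budgetMillis);           // true, but wrong
        // Normalized comparison: 3 s < 5 s, so the budget was not met.
        System.out.println(ranAtLeast(budgetMillis, runMicros)); // false
    }
}
```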
[jira] [Updated] (TEZ-1535) Minor bug in computing min-time the reducer should run without getting killed
[ https://issues.apache.org/jira/browse/TEZ-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1535: - Fix Version/s: 0.6.0 Minor bug in computing min-time the reducer should run without getting killed - Key: TEZ-1535 URL: https://issues.apache.org/jira/browse/TEZ-1535 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Rajesh Balamohan Labels: performance, timeunits Fix For: 0.6.0 ShuffleScheduler's shuffleProgressDuration is computed in milliseconds. ShufflePayload's runDuration is computed in microseconds (i.e in OrderedPartitionedKVOutput.generateEventsOnClose()) This would end up in wrong value in computing min-time the reducer should run without getting killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1157) Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data
[ https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1157: - Attachment: TEZ-1157.10.patch Fix TestTezRuntimeConfiguration failures for missing keys. This is the patch for commit. Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data Key: TEZ-1157 URL: https://issues.apache.org/jira/browse/TEZ-1157 Project: Apache Tez Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Gopal V Labels: performance Attachments: TEZ-1152.WIP.patch, TEZ-1157.10.patch, TEZ-1157.3.WIP.patch, TEZ-1157.4.WIP.patch, TEZ-1157.5.WIP.patch, TEZ-1157.6.patch, TEZ-1157.7.patch, TEZ-1157.8.patch, TEZ-1157.9.patch, TEZ-broadcast-shuffle+vertex-parallelism.patch, connections.png, latency.png Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1157) Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data
[ https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1157: - Description: Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. (results after this feature) !connections.png! !latency.png! was:Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data Key: TEZ-1157 URL: https://issues.apache.org/jira/browse/TEZ-1157 Project: Apache Tez Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Gopal V Labels: performance Fix For: 0.6.0 Attachments: TEZ-1152.WIP.patch, TEZ-1157.10.patch, TEZ-1157.3.WIP.patch, TEZ-1157.4.WIP.patch, TEZ-1157.5.WIP.patch, TEZ-1157.6.patch, TEZ-1157.7.patch, TEZ-1157.8.patch, TEZ-1157.9.patch, TEZ-broadcast-shuffle+vertex-parallelism.patch, connections.png, latency.png Currently tasks (belonging to same job) running in the same machine download its own copy of broadcast data. Optimization could be to download one copy in the machine, and the rest of the tasks can refer to this downloaded copy. (results after this feature) !connections.png! !latency.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1593) PipelinedSorter::compare() makes a key-copy to satisfy RawComparator interface
Gopal V created TEZ-1593: Summary: PipelinedSorter::compare() makes a key-copy to satisfy RawComparator interface Key: TEZ-1593 URL: https://issues.apache.org/jira/browse/TEZ-1593 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V The current implementation of PipelinedSorter has a slow section which revolves around key comparisons. {code} kvbuffer.position(istart); kvbuffer.get(ki, 0, ilen); kvbuffer.position(jstart); kvbuffer.get(kj, 0, jlen); // sort by key final int cmp = comparator.compare(ki, 0, ilen, kj, 0, jlen); {code} The kvbuffer.get into the arrays ki and kj are the slowest part of the comparator operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
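One way to avoid the kvbuffer.get() copies shown above, assuming the buffer is array-backed, is to run the comparison directly against the backing array with offsets; a sketch of that direction (not the actual fix):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Copy-free alternative to the slow section quoted above: instead of
// materializing each key into a temporary byte[] via ByteBuffer.get(),
// compare two regions of the backing array in place, matching the
// (byte[], off, len, byte[], off, len) shape of a RawComparator.
public class RawCompareSketch {
    // Lexicographic unsigned compare over two regions of the same array.
    static int compare(byte[] buf, int off1, int len1, int off2, int len2) {
        return Arrays.compareUnsigned(buf, off1, off1 + len1, buf, off2, off2 + len2);
    }

    public static void main(String[] args) {
        ByteBuffer kvbuffer = ByteBuffer.wrap("applebee".getBytes());
        byte[] backing = kvbuffer.array();
        // Keys "apple" (offset 0, len 5) and "bee" (offset 5, len 3): no copies.
        System.out.println(compare(backing, 0, 5, 5, 3) < 0);  // true
    }
}
```

This only works when hasArray() is true; a direct buffer would still need either a copy or an absolute-indexed compare loop.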
[jira] [Updated] (TEZ-1596) Secure Shuffle utils is extremely expensive for fast queries
[ https://issues.apache.org/jira/browse/TEZ-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1596: - Attachment: shuffle-secure.png Secure Shuffle utils is extremely expensive for fast queries Key: TEZ-1596 URL: https://issues.apache.org/jira/browse/TEZ-1596 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Attachments: shuffle-secure.png Generating the hash for YARN's secure shuffle is more expensive than the actual HTTP call once keep-alive is turned on. !shuffle-secure.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
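To make the cost above concrete, this is roughly what per-URL signing amounts to (a sketch using the JDK's javax.crypto, not YARN's actual SecureShuffleUtils code). The setup is the expensive part; doFinal() resets the Mac, so one initialized instance can sign many URLs:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Sketch of the per-fetch hashing being measured above: an HMAC over the
// fetch URL. Mac.getInstance() + init() is the setup cost to pay once
// (typically per fetcher thread); doFinal() computes a tag and resets the
// instance for reuse.
public class ShuffleHashSketch {
    static byte[] sign(Mac mac, String url) {
        // doFinal() returns the tag and resets the Mac for the next URL.
        return mac.doFinal(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");  // setup cost, paid once
        mac.init(new SecretKeySpec("shuffle-secret".getBytes(StandardCharsets.UTF_8), "HmacSHA1"));

        byte[] h1 = sign(mac, "/mapOutput?job=1&reduce=0");
        byte[] h2 = sign(mac, "/mapOutput?job=1&reduce=1");
        System.out.println(h1.length);                        // 20: HMAC-SHA1 tag size
        System.out.println(java.util.Arrays.equals(h1, h2));  // false: distinct URLs
    }
}
```

With HTTP keep-alive amortizing the connection cost, this hashing can dominate short fetches, which is the effect the attached profile shows.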
[jira] [Updated] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1141: - Affects Version/s: 0.5.0 DAGStatus.Progress should include number of failed attempts --- Key: TEZ-1141 URL: https://issues.apache.org/jira/browse/TEZ-1141 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Gopal V Currently it's impossible to know whether a job is seeing a lot of issues and failures because we only report running tasks. Eventually the job fails, but before that we have no indication that a bunch of task failures have been happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging
[ https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V reassigned TEZ-1609: Assignee: Gopal V Add hostname to logIdentifiers of fetchers for easy debugging - Key: TEZ-1609 URL: https://issues.apache.org/jira/browse/TEZ-1609 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Rajesh Balamohan Assignee: Gopal V -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V reassigned TEZ-1141: Assignee: Gopal V Hit this issue in a bad way today, need a way to debug this. DAGStatus.Progress should include number of failed attempts --- Key: TEZ-1141 URL: https://issues.apache.org/jira/browse/TEZ-1141 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Gopal V Currently it's impossible to know whether a job is seeing a lot of issues and failures because we only report running tasks. Eventually the job fails, but before that we have no indication that a bunch of task failures have been happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1622) Implement a tez jar equivalent script to avoid the complexities of hadoop jar
Gopal V created TEZ-1622: Summary: Implement a tez jar equivalent script to avoid the complexities of hadoop jar Key: TEZ-1622 URL: https://issues.apache.org/jira/browse/TEZ-1622 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Currently, the only way to run a tez job by hand is to setup multiple parameters like HADOOP_CLASSPATH and then do hadoop jar {{main-class}}. This is inconvenient and complex - find an easier way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging
[ https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149779#comment-14149779 ] Gopal V commented on TEZ-1609: -- Yes, they do print the URL and speed. Add hostname to logIdentifiers of fetchers for easy debugging - Key: TEZ-1609 URL: https://issues.apache.org/jira/browse/TEZ-1609 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1609.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1634) BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors
[ https://issues.apache.org/jira/browse/TEZ-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153290#comment-14153290 ] Gopal V commented on TEZ-1634: -- The change looks good, but is hard to read. Can you move the compressor close to the same location as the checksumOut finish? BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors --- Key: TEZ-1634 URL: https://issues.apache.org/jira/browse/TEZ-1634 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: BlockCompressorStream.with.logging.java, TEZ-1634.1.patch, stacktrace-with-comments.txt When IFile.Writer is closed, it explicitly calls compressedOut.finish(); and as part of FSDataOutputStream.close(), finish() is called again internally. Please refer to o.a.h.i.compress.BlockCompressorStream for more details on finish(). This leads to an additional 4 bytes being written to the IFile, which causes random failures during shuffle. It also prevents IFileInputStream from doing the proper checksumming. This error happens only when we try to fetch multiple attempt outputs using the same URL, and is easily reproducible with SnappyCompressionCodec. The first attempt output is downloaded by the fetcher and, due to the last 4 bytes in the stream, it wouldn't do the proper checksumming in IFileInputStream. This causes the subsequent attempt download to fail with the following exception. An example error from the shuffle phase is attached below. 
2014-09-15 09:54:22,950 WARN [fetcher [scope_41] #31] org.apache.tez.runtime.library.common.shuffle.impl.Fetcher: Invalid map id java.lang.IllegalArgumentException: Invalid header received: partition: 0 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:352) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160) I will attach the debug version of BlockCompressorStream with a thread dump (which validates that finish() is called twice in IFile.close()). This bug was present in earlier versions of Tez as well, and I was able to consistently reproduce it now on the local-vm itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
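A conventional guard against a double finish() of this kind is an idempotence flag; a minimal sketch of the failure mode and the guard (hypothetical stream and trailer format, not the actual TEZ-1634 patch):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch of the hazard above: a compressor-style stream appends a 4-byte
// trailer in finish(). If close() calls finish() again without a guard,
// the trailer is written twice and the reader's checksum/length accounting
// breaks. The boolean flag makes finish() idempotent.
public class FinishOnceSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private boolean finished = false;

    void write(byte[] data) { out.write(data, 0, data.length); }

    void finish() throws IOException {
        if (finished) return;  // second call (e.g. from close()) is a no-op
        finished = true;
        out.write(new byte[] {0, 0, 0, 0});  // 4-byte trailer, written once
    }

    void close() throws IOException { finish(); }  // mirrors a stream close() that re-finishes

    int size() { return out.size(); }

    public static void main(String[] args) throws IOException {
        FinishOnceSketch s = new FinishOnceSketch();
        s.write("payload".getBytes());
        s.finish();  // explicit finish, as IFile.Writer's close path does
        s.close();   // implicit finish again; the guard prevents 4 extra bytes
        System.out.println(s.size());  // 11 = 7 payload bytes + 4 trailer bytes
    }
}
```

Without the guard, size() would be 15 here, which is exactly the "additional 4 bytes" the issue describes.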
[jira] [Updated] (TEZ-1634) BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors
[ https://issues.apache.org/jira/browse/TEZ-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1634: - Attachment: TEZ-1634.2.patch Small cosmetic change, for easier debugging. Please review - [~rajesh.balamohan]. BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors --- Key: TEZ-1634 URL: https://issues.apache.org/jira/browse/TEZ-1634 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: BlockCompressorStream.with.logging.java, TEZ-1634.1.patch, TEZ-1634.2.patch, stacktrace-with-comments.txt When IFile.Writer is closed, it explicitly calls compressedOut.finish(); And as a part of FSDataOutputStream.close(), it again internally calls finish(). Please refer o.a.h.i.compress.BlockCompressorStream for more details on finish(). This leads to additional 4 bytes being written to IFile. This causes issues randomly during shuffle. Also, this prevents IFileInputStream to do the proper checksumming. This error happens only when we try to fetch multiple attempt outputs using the same URL. And is easily reproducible with SnappCompressionCodec. First attempt output would be downloaded by fetcher and due to the last 4 bytes in the stream, it wouldn't do the proper checksumming in IFileInputStream. This causes the subsequent attempt download to fail with the following exception. Example error in shuffle phase is attached below. 
2014-09-15 09:54:22,950 WARN [fetcher [scope_41] #31] org.apache.tez.runtime.library.common.shuffle.impl.Fetcher: Invalid map id java.lang.IllegalArgumentException: Invalid header received: partition: 0 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:352) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160) I will attach a debug version of BlockCompressorStream with a thread dump (which validates that finish() is called twice in IFile.close()). This bug was present in earlier versions of Tez as well, and I was able to reproduce it consistently on a local VM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
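As a note on the fix direction: the double-finish can be neutralized by making finish() idempotent, so that close() after an explicit finish() writes no extra trailer bytes. This is a minimal, hypothetical sketch — the class name `FinishOnceStream` and the 4-byte placeholder trailer are illustrative, not the actual Hadoop/Tez classes or patch.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: a stream whose finish() is safe to call twice.
public class FinishOnceStream extends FilterOutputStream {
    private boolean finished = false;

    public FinishOnceStream(OutputStream out) {
        super(out);
    }

    // Write the trailer only on the first call; later calls are no-ops,
    // so close() after an explicit finish() adds no extra bytes.
    public void finish() throws IOException {
        if (finished) {
            return;
        }
        finished = true;
        out.write(new byte[] {0, 0, 0, 0}); // placeholder 4-byte trailer
    }

    @Override
    public void close() throws IOException {
        finish(); // idempotent: no second trailer if already finished
        super.close();
    }
}
```

With this guard, IFile.Writer calling finish() explicitly and then close() calling it again internally would leave the stream length unchanged, so the downstream checksum would still line up.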
[jira] [Updated] (TEZ-1634) BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors
[ https://issues.apache.org/jira/browse/TEZ-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1634: - Fix Version/s: 0.6.0 BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors --- Key: TEZ-1634 URL: https://issues.apache.org/jira/browse/TEZ-1634 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0, 0.6.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.6.0 Attachments: BlockCompressorStream.with.logging.java, TEZ-1634.1.patch, TEZ-1634.2.patch, stacktrace-with-comments.txt When IFile.Writer is closed, it explicitly calls compressedOut.finish(); and as part of FSDataOutputStream.close(), finish() is internally called again. Please refer to o.a.h.i.compress.BlockCompressorStream for more details on finish(). This leads to an additional 4 bytes being written to the IFile, which causes random failures during shuffle and prevents IFileInputStream from doing proper checksumming. This error happens only when we try to fetch multiple attempt outputs using the same URL, and it is easily reproducible with SnappyCompressionCodec. The first attempt's output is downloaded by the fetcher, but due to the extra 4 bytes in the stream, IFileInputStream cannot checksum it properly. This causes the subsequent attempt download to fail with the following exception. An example error from the shuffle phase is attached below. 
2014-09-15 09:54:22,950 WARN [fetcher [scope_41] #31] org.apache.tez.runtime.library.common.shuffle.impl.Fetcher: Invalid map id java.lang.IllegalArgumentException: Invalid header received: partition: 0 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:352) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160) I will attach a debug version of BlockCompressorStream with a thread dump (which validates that finish() is called twice in IFile.close()). This bug was present in earlier versions of Tez as well, and I was able to reproduce it consistently on a local VM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1277) Tez Spill handler should truncate files to reserve space on disk
[ https://issues.apache.org/jira/browse/TEZ-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1277: - Description: Occasionally tasks fail due to full disks because the disks had space when the task was allocating via LocalDirAllocator, but the disk space was actually promised to many tasks instead of just one. This race condition shows up when a 1Gb spill can be done in ~10s or so. There is no way to do this via the hadoop-fs abstraction - but an SSD based spill wastes most of the IOPS on journal updates about the file length changing. was: Occasionally tasks fail due to full disks because the disks had space when the task was allocating via LocalDirAllocator, but the disk space was actually promised to many tasks instead of just one. This race condition shows up when a 1Gb spill can be done in ~10s or so. Tez Spill handler should truncate files to reserve space on disk Key: TEZ-1277 URL: https://issues.apache.org/jira/browse/TEZ-1277 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Gopal V Assignee: Gopal V Occasionally tasks fail due to full disks because the disks had space when the task was allocating via LocalDirAllocator, but the disk space was actually promised to many tasks instead of just one. This race condition shows up when a 1Gb spill can be done in ~10s or so. There is no way to do this via the hadoop-fs abstraction - but an SSD based spill wastes most of the IOPS on journal updates about the file length changing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1083) Enable IFile RLE for DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171963#comment-14171963 ] Gopal V commented on TEZ-1083: -- +1 - this enables RLE on the map-side spill. Reduce-side TezMerger needs the equivalent impl, as a different JIRA. Enable IFile RLE for DefaultSorter -- Key: TEZ-1083 URL: https://issues.apache.org/jira/browse/TEZ-1083 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Gopal V Attachments: TEZ-1083.1.patch, TEZ-1083.2.patch Generate RLE IFiles for DefaultSorter and use it to fast-forward map-side merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1277) Tez Spill handler should truncate files to reserve space on disk
[ https://issues.apache.org/jira/browse/TEZ-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171964#comment-14171964 ] Gopal V commented on TEZ-1277: -- Need to add a NativeIO impl to do {{fallocate}}. Tez Spill handler should truncate files to reserve space on disk Key: TEZ-1277 URL: https://issues.apache.org/jira/browse/TEZ-1277 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Gopal V Assignee: Gopal V Occasionally tasks fail due to full disks because the disks had space when the task was allocating via LocalDirAllocator, but the disk space was actually promised to many tasks instead of just one. This race condition shows up when a 1Gb spill can be done in ~10s or so. There is no way to do this via the hadoop-fs abstraction - but an SSD based spill wastes most of the IOPS on journal updates about the file length changing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
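A portable approximation of the reservation idea (pending the NativeIO {{fallocate}} impl the comment asks for) is to pre-extend the spill file to its expected size before writing. This sketch is hypothetical — `SpillReserver` is not a Tez class, and `RandomAccessFile.setLength` typically creates a sparse file rather than truly committing blocks, which is exactly why a real fallocate(2) call through native code is needed.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: pre-extend a spill file so its final length is
// established once, up front, instead of growing write-by-write (each
// growth being a journal update about the file length changing).
public class SpillReserver {
    public static RandomAccessFile reserve(File spill, long expectedBytes)
            throws IOException {
        RandomAccessFile raf = new RandomAccessFile(spill, "rw");
        // Sparse on most filesystems; a genuine space reservation would
        // need fallocate(2) via JNI/NativeIO, as the comment above notes.
        raf.setLength(expectedBytes);
        return raf;
    }
}
```

This avoids the repeated length-change journal traffic on SSD spills, but does not by itself prevent the many-tasks-promised-one-disk race the description mentions.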
[jira] [Updated] (TEZ-1525) BroadcastLoadGen testcase
[ https://issues.apache.org/jira/browse/TEZ-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1525: - Attachment: TEZ-1525.2.patch Rebase after TEZ-1479 BroadcastLoadGen testcase - Key: TEZ-1525 URL: https://issues.apache.org/jira/browse/TEZ-1525 Project: Apache Tez Issue Type: Test Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1525.1.patch, TEZ-1525.2.patch Broadcast load generator test example -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175874#comment-14175874 ] Gopal V commented on TEZ-1141: -- LGTM. I found that this doesn't track NM blacklisting, but that is a completely different problem. I've updated the patch on HIVE-7838 to use this, and it is useful for narrowing down query failures (particularly reducer OOMs). DAGStatus.Progress should include number of failed attempts --- Key: TEZ-1141 URL: https://issues.apache.org/jira/browse/TEZ-1141 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Hitesh Shah Attachments: TEZ-1141.1.patch Currently it's impossible to know whether a job is seeing a lot of issues and failures, because we only report running tasks. Eventually the job fails, but before that we have no indication that a bunch of task failures have been happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1525) BroadcastLoadGen testcase
[ https://issues.apache.org/jira/browse/TEZ-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1525: - Attachment: TEZ-1525.3.patch BroadcastLoadGen testcase - Key: TEZ-1525 URL: https://issues.apache.org/jira/browse/TEZ-1525 Project: Apache Tez Issue Type: Test Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Attachments: TEZ-1525.1.patch, TEZ-1525.2.patch, TEZ-1525.3.patch Broadcast load generator test example -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1690) TestMultiMRInput tests fail because of user collisions
[ https://issues.apache.org/jira/browse/TEZ-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1690: - Affects Version/s: 0.5.2 TestMultiMRInput tests fail because of user collisions -- Key: TEZ-1690 URL: https://issues.apache.org/jira/browse/TEZ-1690 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Labels: newbie If two users run mvn test on a machine, the paths in TestMultiMRInput collide and the tests fail. {code} testSingleSplit(org.apache.tez.mapreduce.input.TestMultiMRInput) Time elapsed: 0.037 sec ERROR! java.io.FileNotFoundException: /tmp/TestMultiMRInput/testSingleSplit/file1 (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:212) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:206) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:202) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:265) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:252) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:384) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906) at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1071) at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1371) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:272) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:294) at org.apache.tez.mapreduce.input.TestMultiMRInput.createInputData(TestMultiMRInput.java:277) at org.apache.tez.mapreduce.input.TestMultiMRInput.testSingleSplit(TestMultiMRInput.java:106) {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332)
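The usual fix for this class of collision is to scope the shared temp root by user. A minimal sketch, assuming nothing about the actual TEZ-1690 patch — `TestDirs` and its naming scheme are hypothetical:

```java
import java.io.File;

// Hypothetical sketch: derive a per-user test directory so two users
// running `mvn test` on the same machine never share /tmp paths.
public class TestDirs {
    public static File testRoot(String testName) {
        String user = System.getProperty("user.name");
        File root = new File(System.getProperty("java.io.tmpdir"),
                "TestMultiMRInput-" + user + File.separator + testName);
        root.mkdirs(); // idempotent; safe if the dir already exists
        return root;
    }
}
```

With this, the second user's run writes under its own `TestMultiMRInput-<user>` tree instead of hitting a `file1` owned by someone else.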
[jira] [Created] (TEZ-1693) ARCHIVE local resources are not supported in Tez DAGs
Gopal V created TEZ-1693: Summary: ARCHIVE local resources are not supported in Tez DAGs Key: TEZ-1693 URL: https://issues.apache.org/jira/browse/TEZ-1693 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V {code} 2014-10-21 16:42:17,919 ERROR [main]: exec.Task (TezTask.java:execute(180)) - Failed to execute tez graph. java.lang.IllegalArgumentException: LocalResourceType: ARCHIVE is not supported, only FILE is supported at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:365) at org.apache.tez.client.TezClient.submitDAG(TezClient.java:344) at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:368) at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:159) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1607) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1593) Refactor PipelinedSorter to remove all MMAP based ByteBuffer references
[ https://issues.apache.org/jira/browse/TEZ-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1593: - Summary: Refactor PipelinedSorter to remove all MMAP based ByteBuffer references (was: PipelinedSorter::compare() makes a key-copy to satisfy RawComparator interface) Refactor PipelinedSorter to remove all MMAP based ByteBuffer references --- Key: TEZ-1593 URL: https://issues.apache.org/jira/browse/TEZ-1593 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Labels: Performance The current implementation of PipelinedSorter has a slow section which revolves around key comparisons. {code} kvbuffer.position(istart); kvbuffer.get(ki, 0, ilen); kvbuffer.position(jstart); kvbuffer.get(kj, 0, jlen); // sort by key final int cmp = comparator.compare(ki, 0, ilen, kj, 0, jlen); {code} The kvbuffer.get calls into the arrays ki and kj are the slowest part of the comparator operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1593) Refactor PipelinedSorter to remove all MMAP based ByteBuffer references
[ https://issues.apache.org/jira/browse/TEZ-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1593: - Description: The current implementation of PipelinedSorter has a slow section which revolves around key comparisons - this was relevant when the implementation used direct byte buffers to back the kvbuffer. {code} kvbuffer.position(istart); kvbuffer.get(ki, 0, ilen); kvbuffer.position(jstart); kvbuffer.get(kj, 0, jlen); // sort by key final int cmp = comparator.compare(ki, 0, ilen, kj, 0, jlen); {code} The kvbuffer.get calls into the arrays ki and kj are the slowest part of the comparator operation. was: The current implementation of PipelinedSorter has a slow section which revolves around key comparisons. {code} kvbuffer.position(istart); kvbuffer.get(ki, 0, ilen); kvbuffer.position(jstart); kvbuffer.get(kj, 0, jlen); // sort by key final int cmp = comparator.compare(ki, 0, ilen, kj, 0, jlen); {code} The kvbuffer.get calls into the arrays ki and kj are the slowest part of the comparator operation. Refactor PipelinedSorter to remove all MMAP based ByteBuffer references --- Key: TEZ-1593 URL: https://issues.apache.org/jira/browse/TEZ-1593 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Gopal V Labels: Performance The current implementation of PipelinedSorter has a slow section which revolves around key comparisons - this was relevant when the implementation used direct byte buffers to back the kvbuffer. {code} kvbuffer.position(istart); kvbuffer.get(ki, 0, ilen); kvbuffer.position(jstart); kvbuffer.get(kj, 0, jlen); // sort by key final int cmp = comparator.compare(ki, 0, ilen, kj, 0, jlen); {code} The kvbuffer.get calls into the arrays ki and kj are the slowest part of the comparator operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
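Once the kvbuffer is backed by a plain byte[] rather than a direct ByteBuffer, the key copies into ki/kj can be dropped entirely: the comparator can read both keys in place against the shared array. A minimal sketch of that idea — `InPlaceCompare` is hypothetical, not the actual refactored PipelinedSorter code:

```java
// Hypothetical sketch: lexicographic unsigned-byte comparison of two
// key ranges inside one shared buffer, with no intermediate copies.
public class InPlaceCompare {
    public static int compare(byte[] buf, int istart, int ilen,
                              int jstart, int jlen) {
        int n = Math.min(ilen, jlen);
        for (int k = 0; k < n; k++) {
            // Mask to compare bytes as unsigned, like WritableComparator.
            int a = buf[istart + k] & 0xff;
            int b = buf[jstart + k] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return ilen - jlen; // shorter key sorts first on a shared prefix
    }
}
```

This replaces two kvbuffer.get copies plus a comparator call per comparison with a single pass over the buffer.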
[jira] [Commented] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180900#comment-14180900 ] Gopal V commented on TEZ-1141: -- The Hive progress UI is already crowded with 4 numbers per vertex. In general, we want failed-attempt tracking so that a user can see OOMs or task errors without waiting for a query to finish and grepping the logs. Adding killed attempts to the mix (as one number) doesn't help and will possibly confuse users (see the earlier comment on NM blacklisting). DAGStatus.Progress should include number of failed attempts --- Key: TEZ-1141 URL: https://issues.apache.org/jira/browse/TEZ-1141 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Hitesh Shah Attachments: TEZ-1141.1.patch Currently it's impossible to know whether a job is seeing a lot of issues and failures, because we only report running tasks. Eventually the job fails, but before that we have no indication that a bunch of task failures have been happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1141) DAGStatus.Progress should include number of failed attempts
[ https://issues.apache.org/jira/browse/TEZ-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180910#comment-14180910 ] Gopal V commented on TEZ-1141: -- The UI should be able to extract this information from TEZ_TASK_ATTEMPT_ID::status. The progress RPC we're talking about today comes directly from the AM, for clients like Hive. DAGStatus.Progress should include number of failed attempts --- Key: TEZ-1141 URL: https://issues.apache.org/jira/browse/TEZ-1141 Project: Apache Tez Issue Type: Improvement Affects Versions: 0.5.0 Reporter: Bikas Saha Assignee: Hitesh Shah Attachments: TEZ-1141.1.patch Currently it's impossible to know whether a job is seeing a lot of issues and failures, because we only report running tasks. Eventually the job fails, but before that we have no indication that a bunch of task failures have been happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1688) Add applicationId as a primary filter for all Timeline data for easier export
[ https://issues.apache.org/jira/browse/TEZ-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180934#comment-14180934 ] Gopal V commented on TEZ-1688: -- LGTM - +1. I noticed that this uses the same naming convention as YARN applications (instead of being tied to a TEZ_* name). That makes a lot of sense, when we integrate this with the RM data - but right now, it looks rather odd in the dumps. {code} primaryfilters: { TEZ_DAG_ID: [ dag_1413959022005_0046_1 ], TEZ_VERTEX_ID: [ vertex_1413959022005_0046_1_01 ], applicationId: [ application_1413959022005_0046 ] }, {code} That is not particularly relevant to fix, but I'm commenting for the sake of some documentation about this. Add applicationId as a primary filter for all Timeline data for easier export -- Key: TEZ-1688 URL: https://issues.apache.org/jira/browse/TEZ-1688 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1688.1.patch, TEZ-1688.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1688) Add applicationId as a primary filter for all Timeline data for easier export
[ https://issues.apache.org/jira/browse/TEZ-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180944#comment-14180944 ] Gopal V commented on TEZ-1688: -- Running through my extraction pipelines - {{TEZ_APPLICATION_ATTEMPT}} is missing the filter. The relatedEntities does not allow for a reverse lookup. Add applicationId as a primary filter for all Timeline data for easier export -- Key: TEZ-1688 URL: https://issues.apache.org/jira/browse/TEZ-1688 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1688.1.patch, TEZ-1688.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1596) Secure Shuffle utils is extremely expensive for fast queries
[ https://issues.apache.org/jira/browse/TEZ-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1596: - Description: Generating the hash for YARN's secure shuffle is more expensive than the actual HTTP call once keep-alive is turned on. !Shuffle_generateHash.png! was: Generating the hash for YARN's secure shuffle is more expensive than the actual HTTP call once keep-alive is turned on. !shuffle-secure.png! Secure Shuffle utils is extremely expensive for fast queries Key: TEZ-1596 URL: https://issues.apache.org/jira/browse/TEZ-1596 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Attachments: Shuffle_generateHash.png, TEZ-1596.hack.patch, shuffle-secure-drilldown.png, shuffle-secure.png Generating the hash for YARN's secure shuffle is more expensive than the actual HTTP call once keep-alive is turned on. !Shuffle_generateHash.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1698) Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez
[ https://issues.apache.org/jira/browse/TEZ-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1698: - Attachment: ProcfsBasedProcessTree.png Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Attachments: ProcfsBasedProcessTree.png ResourceCalculatorProcessTree scans all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez; unlike Tez, YARN has to do this because it only has the PID of the container-executor process (bash) and has to trace the bash -> java spawn inheritance. !ProcfsBasedProcessTree.png! The effect of this is less clearly visible with the profiler turned on, as it is primarily related to syscall overhead in the kernel (via the following codepath in YARN). {code} private List<String> getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes List<String> processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1698) Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez
Gopal V created TEZ-1698: Summary: Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Attachments: ProcfsBasedProcessTree.png ResourceCalculatorProcessTree scans all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez; unlike Tez, YARN has to do this because it only has the PID of the container-executor process (bash) and has to trace the bash -> java spawn inheritance. !ProcfsBasedProcessTree.png! The effect of this is less clearly visible with the profiler turned on, as it is primarily related to syscall overhead in the kernel (via the following codepath in YARN). {code} private List<String> getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes List<String> processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1698) Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez
[ https://issues.apache.org/jira/browse/TEZ-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1698: - Description: ResourceCalculatorProcessTree scans all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez; unlike Tez, YARN has to do this because it only has the PID of the container-executor process (bash) and has to trace the bash -> java spawn inheritance. !ProcfsBasedProcessTree.png! The latency effect of this is less clearly visible with the profiler turned on, as it is primarily related to the rate of syscalls + overhead in the kernel (via the following codepath in YARN). !ProcfsFiles.png! {code} private List<String> getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes List<String> processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} was: ResourceCalculatorProcessTree scans all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez; unlike Tez, YARN has to do this because it only has the PID of the container-executor process (bash) and has to trace the bash -> java spawn inheritance. !ProcfsBasedProcessTree.png! The effect of this is less clearly visible with the profiler turned on, as it is primarily related to syscall overhead in the kernel (via the following codepath in YARN). {code} private List<String> getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes List<String> processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Attachments: ProcfsBasedProcessTree.png, ProcfsFiles.png ResourceCalculatorProcessTree scans all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez; unlike Tez, YARN has to do this because it only has the PID of the container-executor process (bash) and has to trace the bash -> java spawn inheritance. !ProcfsBasedProcessTree.png! The latency effect of this is less clearly visible with the profiler turned on, as it is primarily related to the rate of syscalls + overhead in the kernel (via the following codepath in YARN). !ProcfsFiles.png! {code} private List<String> getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes List<String> processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1698) Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez
[ https://issues.apache.org/jira/browse/TEZ-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1698: - Attachment: ProcfsFiles.png Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Attachments: ProcfsBasedProcessTree.png, ProcfsFiles.png ResourceCalculatorProcessTree scans all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez; unlike Tez, YARN has to do this because it only has the PID of the container-executor process (bash) and has to trace the bash -> java spawn inheritance. !ProcfsBasedProcessTree.png! The latency effect of this is less clearly visible with the profiler turned on, as it is primarily related to the rate of syscalls + overhead in the kernel (via the following codepath in YARN). !ProcfsFiles.png! {code} private List<String> getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes List<String> processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
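Since a Tez task only needs its own usage, the whole-tree walk can be replaced by a single read of the current process's /proc/self/statm (total program size and RSS, in pages). This sketch is hypothetical — `SelfStatm` is not the actual ResourceCalculatorPlugin code — and only the statm parsing is shown, since the file itself is Linux-specific:

```java
// Hypothetical sketch: parse the first two fields of /proc/self/statm
// (total program size and resident set size, both in pages) so that
// memory usage needs one file read instead of a scan of every /proc/<pid>.
public class SelfStatm {
    public static long[] parse(String statmLine, long pageSize) {
        String[] f = statmLine.trim().split("\\s+");
        long virtualBytes = Long.parseLong(f[0]) * pageSize;
        long residentBytes = Long.parseLong(f[1]) * pageSize;
        return new long[] { virtualBytes, residentBytes };
    }
}
```

On Linux the caller would read the single line from `/proc/self/statm` and pass the page size from `sysconf(_SC_PAGESIZE)` (commonly 4096), avoiding the per-PID directory syscalls shown in the YARN codepath above.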
[jira] [Updated] (TEZ-1634) BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors
[ https://issues.apache.org/jira/browse/TEZ-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1634: - Fix Version/s: 0.5.2 BlockCompressorStream.finish() is called twice in IFile.close leading to Shuffle errors --- Key: TEZ-1634 URL: https://issues.apache.org/jira/browse/TEZ-1634 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0, 0.6.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.6.0, 0.5.2 Attachments: BlockCompressorStream.with.logging.java, TEZ-1634.1.patch, TEZ-1634.2.patch, stacktrace-with-comments.txt When IFile.Writer is closed, it explicitly calls compressedOut.finish(); and as part of FSDataOutputStream.close(), finish() is internally called again. Please refer to o.a.h.i.compress.BlockCompressorStream for more details on finish(). This leads to an additional 4 bytes being written to the IFile, which causes random failures during shuffle and prevents IFileInputStream from doing proper checksumming. This error happens only when we try to fetch multiple attempt outputs using the same URL, and it is easily reproducible with SnappyCompressionCodec. The first attempt's output is downloaded by the fetcher, but due to the extra 4 bytes in the stream, IFileInputStream cannot checksum it properly. This causes the subsequent attempt download to fail with the following exception. An example error from the shuffle phase is attached below. 
2014-09-15 09:54:22,950 WARN [fetcher [scope_41] #31] org.apache.tez.runtime.library.common.shuffle.impl.Fetcher: Invalid map id java.lang.IllegalArgumentException: Invalid header received: partition: 0 at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:352) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294) at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160) I will attach a debug version of BlockCompressorStream with a thread dump (which validates that finish() is called twice in IFile.close()). This bug was present in earlier versions of Tez as well, and I was able to reproduce it consistently on a local VM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1596) Secure Shuffle utils is extremely expensive for fast queries
[ https://issues.apache.org/jira/browse/TEZ-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185973#comment-14185973 ] Gopal V commented on TEZ-1596: -- Works as expected - +1 Secure Shuffle utils is extremely expensive for fast queries Key: TEZ-1596 URL: https://issues.apache.org/jira/browse/TEZ-1596 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Assignee: Rajesh Balamohan Attachments: Shuffle_generateHash.png, Shuffle_generateHash_afterFix.png, TEZ-1596.2.patch, TEZ-1596.hack.patch, shuffle-secure-drilldown.png, shuffle-secure.png Generating the hash for YARN's secure shuffle is more expensive than the actual HTTP call once keep-alive is turned on. !Shuffle_generateHash.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1701) ATS fixes to flush all history events and also using batching
[ https://issues.apache.org/jira/browse/TEZ-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186000#comment-14186000 ] Gopal V commented on TEZ-1701: -- Event counts add up with this patch. Within the ATS leveldb store, significant pauses were noticed during the GC delete passes. ATS fixes to flush all history events and also using batching - Key: TEZ-1701 URL: https://issues.apache.org/jira/browse/TEZ-1701 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Attachments: TEZ-1701.1.patch There are cases when the timeline server can get backlogged. To address this, the AM should wait for a longer period when sending events to it. Also, sending events in batches will reduce the load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1719) Allow IFile reducer merge-sort to disable crc32 checksums
Gopal V created TEZ-1719: Summary: Allow IFile reducer merge-sort to disable crc32 checksums Key: TEZ-1719 URL: https://issues.apache.org/jira/browse/TEZ-1719 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Next-gen filesystems like BTRFS and ZFS provide their own checksumming for disk data. Using PureJavaCrc32 for data written for temporary spills to such filesystems is a complete waste of CPU resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
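[Editorial note] The proposal above amounts to a config-driven bypass of the spill checksum. A minimal sketch, assuming a boolean toggle; the `checksumEnabled` constant is illustrative, not an actual Tez configuration key:

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the proposed toggle: when the filesystem (BTRFS/ZFS)
// already checksums blocks, skip the CRC32 pass over spill data entirely.
class SpillChecksum {
    // Would be read from configuration; shown as a mutable field for brevity.
    static boolean checksumEnabled = true;

    static long checksumOf(byte[] spill) {
        if (!checksumEnabled) {
            return 0L; // save the CPU: the filesystem already guards this data
        }
        CRC32 crc = new CRC32();
        crc.update(spill, 0, spill.length);
        return crc.getValue();
    }
}
```

The win is purely CPU-side: temporary spills never leave the node, so end-to-end integrity checking by the filesystem is sufficient for them.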
[jira] [Updated] (TEZ-1719) Allow IFile reducer merge-sort to disable crc32 checksums
[ https://issues.apache.org/jira/browse/TEZ-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1719: - Labels: Performance (was: ) Allow IFile reducer merge-sort to disable crc32 checksums - Key: TEZ-1719 URL: https://issues.apache.org/jira/browse/TEZ-1719 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Gopal V Labels: Performance Next-gen filesystems like BTRFS and ZFS provide their own checksumming for disk data. Using PureJavaCrc32 for data written for temporary spills to such filesystems is a complete waste of CPU resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1698) Cut down on ResourceCalculatorProcessTree overheads in Tez
[ https://issues.apache.org/jira/browse/TEZ-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1698: - Summary: Cut down on ResourceCalculatorProcessTree overheads in Tez (was: Use ResourceCalculatorPlugin instead of ResourceCalculatorProcessTree in Tez) Cut down on ResourceCalculatorProcessTree overheads in Tez -- Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Assignee: Rajesh Balamohan Attachments: ProcfsBasedProcessTree.png, ProcfsFiles.png, TEZ-1698.1.patch, TEZ-1698.2.patch ResourceCalculatorProcessTree scrapes all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez: YARN has to do this because it only has the PID of the container-executor process (bash) and must trace the bash-to-java spawn inheritance. !ProcfsBasedProcessTree.png! The latency effect is less clearly visible with the profiler turned on, since it is primarily a function of the syscall rate plus overhead in the kernel (via the following codepath in YARN). !ProcfsFiles.png!
{code}
private List<String> getProcessList() {
  String[] processDirs = (new File(procfsDir)).list();
  ...
  for (String dir : processDirs) {
    try {
      if ((new File(procfsDir, dir)).isDirectory()) {
        processList.add(dir);
      }
  ...

public void updateProcessTree() {
  if (!pid.equals(deadPid)) {
    // Get the list of processes
    List<String> processList = getProcessList();
    ...
    for (String proc : processList) {
      // Get information for each process
      ProcessInfo pInfo = new ProcessInfo(proc);
      if (constructProcessInfo(pInfo, procfsDir) != null) {
        allProcessInfo.put(proc, pInfo);
        if (proc.equals(this.pid)) {
          me = pInfo; // cache 'me'
          processTree.put(proc, pInfo);
        }
      }
    }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1698) Cut down on ResourceCalculatorProcessTree overheads in Tez
[ https://issues.apache.org/jira/browse/TEZ-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1698: - Attachment: TEZ-1698.3.patch [~rajesh.balamohan]: Can you test this version? Minor changes: the plugin is built only against Sun/Oracle JDKs, and CumulativeRSS now returns the total memory held by the JVM via Runtime.getTotalMemory(), since that includes the free heap memory which is held by the JVM (as finalized/collected garbage). Cut down on ResourceCalculatorProcessTree overheads in Tez -- Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Assignee: Rajesh Balamohan Attachments: ProcfsBasedProcessTree.png, ProcfsFiles.png, TEZ-1698.1.patch, TEZ-1698.2.patch, TEZ-1698.3.patch ResourceCalculatorProcessTree scrapes all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted work in Tez: YARN has to do this because it only has the PID of the container-executor process (bash) and must trace the bash-to-java spawn inheritance. !ProcfsBasedProcessTree.png! The latency effect is less clearly visible with the profiler turned on, since it is primarily a function of the syscall rate plus overhead in the kernel (via the following codepath in YARN). !ProcfsFiles.png!
{code}
private List<String> getProcessList() {
  String[] processDirs = (new File(procfsDir)).list();
  ...
  for (String dir : processDirs) {
    try {
      if ((new File(procfsDir, dir)).isDirectory()) {
        processList.add(dir);
      }
  ...

public void updateProcessTree() {
  if (!pid.equals(deadPid)) {
    // Get the list of processes
    List<String> processList = getProcessList();
    ...
    for (String proc : processList) {
      // Get information for each process
      ProcessInfo pInfo = new ProcessInfo(proc);
      if (constructProcessInfo(pInfo, procfsDir) != null) {
        allProcessInfo.put(proc, pInfo);
        if (proc.equals(this.pid)) {
          me = pInfo; // cache 'me'
          processTree.put(proc, pInfo);
        }
      }
    }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
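[Editorial note] The CumulativeRSS change described in the comment can be sketched in a few lines. This is a hedged illustration of the approach, not the patch itself; the class name is hypothetical:

```java
// Sketch of the TEZ-1698 idea: sample JVM memory from the Runtime instead of
// scraping /proc. Runtime.totalMemory() counts heap the JVM still holds
// (including collected garbage not yet returned to the OS), so it over-reports
// live data, but it avoids the per-sample syscall cost of walking /proc entirely.
class MxMemorySampler {
    static long cumulativeRss() {
        return Runtime.getRuntime().totalMemory();
    }
}
```

This trades a little accuracy (reported memory includes free heap) for zero filesystem traffic per sample, which is the overhead the attached profiles show.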
[jira] [Commented] (TEZ-1698) Cut down on ResourceCalculatorProcessTree overheads in Tez
[ https://issues.apache.org/jira/browse/TEZ-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191204#comment-14191204 ] Gopal V commented on TEZ-1698: -- +1 - Thanks Rajesh, this looks good. This isn't on by default, so it should be good for 0.5.2 as well. Cut down on ResourceCalculatorProcessTree overheads in Tez -- Key: TEZ-1698 URL: https://issues.apache.org/jira/browse/TEZ-1698 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Assignee: Rajesh Balamohan Attachments: ProcfsBasedProcessTree.png, ProcfsFiles.png, TEZ-1698.1.patch, TEZ-1698.2.patch, TEZ-1698.3.patch, TEZ-1698.4.patch ResourceCalculatorProcessTree scraps all of /proc/ for PIDs which are part of the current task's process group. This is mostly wasted in Tez, since unlike YARN which has to do this since it has the PID for the container-executor process (bash) and has to trace the bash - java spawn inheritance. !ProcfsBasedProcessTree.png! The latency effect of this is less clearly visible with the profiler turned on as this is primarily related to rate of syscalls + overhead in the kernel (via the following codepath in YARN). !ProcfsFiles.png! {code} private ListString getProcessList() { String[] processDirs = (new File(procfsDir)).list(); ... for (String dir : processDirs) { try { if ((new File(procfsDir, dir)).isDirectory()) { processList.add(dir); } ... public void updateProcessTree() { if (!pid.equals(deadPid)) { // Get the list of processes ListString processList = getProcessList(); ... for (String proc : processList) { // Get information for each process ProcessInfo pInfo = new ProcessInfo(proc); if (constructProcessInfo(pInfo, procfsDir) != null) { allProcessInfo.put(proc, pInfo); if (proc.equals(this.pid)) { me = pInfo; // cache 'me' processTree.put(proc, pInfo); } } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1725) Fix nanosecond to millis conversion in TezMxBeanResourceCalculator
[ https://issues.apache.org/jira/browse/TEZ-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191449#comment-14191449 ] Gopal V commented on TEZ-1725: -- Comparing LinuxResourceCalculatorPlugin vs TezMxBeanResourceCalculator:
|| Counter || LinuxResourceCalculatorPlugin || TezMxBeanResourceCalculator ||
| CPU_MILLISECONDS | 48458160 | 48059040 |
| PHYSICAL_MEMORY_BYTES | 6679073550336 | 7029569093632 |
| VIRTUAL_MEMORY_BYTES | 11706779303936 | 11492467920896 |
Fix nanosecond to millis conversion in TezMxBeanResourceCalculator -- Key: TEZ-1725 URL: https://issues.apache.org/jira/browse/TEZ-1725 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1725.1.patch, TEZ-1725.2.patch, TEZ-1725.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
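[Editorial note] The conversion under test is small but easy to get wrong by a factor of 1000. A minimal sketch, assuming the nanosecond source is a JMX CPU-time reading (e.g. com.sun.management.OperatingSystemMXBean.getProcessCpuTime()); the class name is illustrative:

```java
import java.util.concurrent.TimeUnit;

// JMX process CPU time is reported in nanoseconds, while the
// CPU_MILLISECONDS counter expects milliseconds: divide by 1,000,000,
// expressed here via TimeUnit so the conversion factor cannot be mistyped.
class CpuTimeConversion {
    static long nanosToMillis(long cpuNanos) {
        return TimeUnit.NANOSECONDS.toMillis(cpuNanos); // truncating division
    }
}
```

With the conversion correct, the two calculators in the table above agree to within about 1% on CPU_MILLISECONDS.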
[jira] [Commented] (TEZ-1725) Fix nanosecond to millis conversion in TezMxBeanResourceCalculator
[ https://issues.apache.org/jira/browse/TEZ-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191455#comment-14191455 ] Gopal V commented on TEZ-1725: -- The error bar seems to be within a percent, even on a 10 TB query. +1 - LGTM. Fix nanosecond to millis conversion in TezMxBeanResourceCalculator -- Key: TEZ-1725 URL: https://issues.apache.org/jira/browse/TEZ-1725 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-1725.1.patch, TEZ-1725.2.patch, TEZ-1725.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1733) TezMerger should sort FileChunks on decompressed size
Gopal V created TEZ-1733: Summary: TezMerger should sort FileChunks on decompressed size Key: TEZ-1733 URL: https://issues.apache.org/jira/browse/TEZ-1733 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V MAPREDUCE-3685 fixed the Merger sort order for file chunks to use the decompressed size, to cut down on CPU and IO costs. TezMerger needs an equivalent sorted TreeSet which sorts by the decompressed data sizes rather than the actual on-disk file sizes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
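[Editorial note] The requested ordering can be sketched with a comparator. FileChunk below is a stand-in record, not the real Tez class; it exists only to show the ordering change (merge segments sorted by raw, decompressed length instead of compressed on-disk length):

```java
import java.util.Comparator;
import java.util.TreeSet;

// Stand-in for the real Tez FileChunk, carrying both sizes.
class FileChunk {
    final String name;
    final long rawLength;        // decompressed data size
    final long compressedLength; // size on disk

    FileChunk(String name, long rawLength, long compressedLength) {
        this.name = name;
        this.rawLength = rawLength;
        this.compressedLength = compressedLength;
    }
}

class MergeOrder {
    // Order by decompressed size; tie-break on name so a TreeSet keeps
    // distinct chunks of equal size instead of collapsing them.
    static final Comparator<FileChunk> BY_RAW_SIZE =
        Comparator.comparingLong((FileChunk c) -> c.rawLength)
                  .thenComparing(c -> c.name);

    static TreeSet<FileChunk> newMergeQueue() {
        return new TreeSet<>(BY_RAW_SIZE);
    }
}
```

Sorting on compressed size misestimates merge cost whenever codec ratios differ between chunks; the raw length is what the merge actually has to read and compare.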
[jira] [Updated] (TEZ-1733) TezMerger should sort FileChunks on decompressed size
[ https://issues.apache.org/jira/browse/TEZ-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1733: - Attachment: TEZ-1733.1.patch TezMerger should sort FileChunks on decompressed size - Key: TEZ-1733 URL: https://issues.apache.org/jira/browse/TEZ-1733 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Attachments: TEZ-1733.1.patch MAPREDUCE-3685 fixed the Merger sort order for file chunks to use the decompressed size, to cut down on CPU and IO costs. TezMerger needs an equivalent sorted TreeSet which sorts by the decompressed data sizes rather than the actual on-disk file sizes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1733) TezMerger should sort FileChunks on decompressed size
[ https://issues.apache.org/jira/browse/TEZ-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1733: - Priority: Critical (was: Major) TezMerger should sort FileChunks on decompressed size - Key: TEZ-1733 URL: https://issues.apache.org/jira/browse/TEZ-1733 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Priority: Critical Attachments: TEZ-1733.1.patch MAPREDUCE-3685 fixed the Merger sort order for file chunks to use the decompressed size, to cut down on CPU and IO costs. TezMerger needs an equivalent sorted TreeSet which sorts by the decompressed data sizes rather than the actual on-disk file sizes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1733) TezMerger should sort FileChunks on decompressed size
[ https://issues.apache.org/jira/browse/TEZ-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-1733: - Target Version/s: 0.5.2 (was: 0.6.0) TezMerger should sort FileChunks on decompressed size - Key: TEZ-1733 URL: https://issues.apache.org/jira/browse/TEZ-1733 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Reporter: Gopal V Attachments: TEZ-1733.1.patch MAPREDUCE-3685 fixed the Merger sort order for file chunks to use the decompressed size, to cut down on CPU and IO costs. TezMerger needs an equivalent sorted TreeSet which sorts by the decompressed data sizes rather than the actual on-disk file sizes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)