[jira] [Updated] (MAPREDUCE-6757) Multithreaded mapper corrupts buffer pusher in nativetask
[ https://issues.apache.org/jira/browse/MAPREDUCE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Tianyi updated MAPREDUCE-6757:
---------------------------------
    Status: Patch Available  (was: Open)

> Multithreaded mapper corrupts buffer pusher in nativetask
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-6757
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6757
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nativetask
>    Affects Versions: 3.0.0-alpha1
>            Reporter: He Tianyi
>         Attachments: MAPREDUCE-6757..patch
>
> Multiple threads may call the {{collect}} method of the same
> {{NativeMapOutputCollectorDelegator}} instance at the same time. In that
> case, the buffer can be corrupted.
> This may occur when executing Hive queries with a custom script.
> Adding the {{synchronized}} keyword to the {{collect}} method would solve
> the problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-6757) Multithreaded mapper corrupts buffer pusher in nativetask
[ https://issues.apache.org/jira/browse/MAPREDUCE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Tianyi updated MAPREDUCE-6757:
---------------------------------
    Attachment: MAPREDUCE-6757..patch

> Multithreaded mapper corrupts buffer pusher in nativetask
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-6757
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6757
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nativetask
>    Affects Versions: 3.0.0-alpha1
>            Reporter: He Tianyi
>         Attachments: MAPREDUCE-6757..patch
[jira] [Created] (MAPREDUCE-6757) Multithreaded mapper corrupts buffer pusher in nativetask
He Tianyi created MAPREDUCE-6757:
------------------------------------

             Summary: Multithreaded mapper corrupts buffer pusher in nativetask
                 Key: MAPREDUCE-6757
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6757
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: nativetask
    Affects Versions: 3.0.0-alpha1
            Reporter: He Tianyi

Multiple threads may call the {{collect}} method of the same {{NativeMapOutputCollectorDelegator}} instance at the same time. In that case, the buffer can be corrupted.
This may occur when executing Hive queries with a custom script.
Adding the {{synchronized}} keyword to the {{collect}} method would solve the problem.
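The effect of the proposed fix can be sketched as below. This is a minimal illustration, not the actual {{NativeMapOutputCollectorDelegator}}: a delegator whose {{collect}} appends into a shared buffer, with {{synchronized}} serializing concurrent callers so the cursor cannot interleave.

```java
// Minimal sketch (hypothetical class, not Hadoop code): marking collect
// synchronized makes concurrent appends to the shared buffer safe, which is
// what the MAPREDUCE-6757 patch proposes for the real delegator.
public class SyncCollectDemo {
    private final byte[] buffer = new byte[1024];
    private int pos = 0; // shared cursor; unsynchronized updates can interleave

    // synchronized: only one thread advances the cursor at a time
    public synchronized void collect(byte[] kv) {
        System.arraycopy(kv, 0, buffer, pos, kv.length);
        pos += kv.length;
    }

    public synchronized int size() { return pos; }

    public static void main(String[] args) throws InterruptedException {
        SyncCollectDemo d = new SyncCollectDemo();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 50; j++) d.collect(new byte[]{1, 2});
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // 4 threads x 50 records x 2 bytes = 400 bytes, no lost updates
        System.out.println(d.size());
    }
}
```

Without {{synchronized}}, two threads can read the same `pos`, copy over each other's bytes, and lose updates, which is exactly the corruption described above.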
[jira] [Created] (MAPREDUCE-6756) native map output collector should be able to handle KV larger than io.sort.mb
He Tianyi created MAPREDUCE-6756:
------------------------------------

             Summary: native map output collector should be able to handle KV larger than io.sort.mb
                 Key: MAPREDUCE-6756
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6756
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: nativetask
    Affects Versions: 2.6.0
            Reporter: He Tianyi

Currently an exception is thrown if kvLength > io.sort.mb. This can occur in rare cases, especially for Hive queries.
Maybe we could implement a 'spill single record' function.
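The 'spill single record' idea can be sketched as follows. The class and names below are illustrative, not Hadoop APIs: a record larger than the sort buffer can never fit, so instead of throwing, it is written straight to the spill stream on its own.

```java
import java.io.ByteArrayOutputStream;

// Hedged sketch of the "spill single record" idea from MAPREDUCE-6756.
// OversizedRecordSpiller is a hypothetical class; the spill field stands in
// for a spill-file stream.
public class OversizedRecordSpiller {
    private final int bufferCapacity;  // analogous to io.sort.mb, in bytes
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final ByteArrayOutputStream spill;
    private int directSpills = 0;

    public OversizedRecordSpiller(int bufferCapacity, ByteArrayOutputStream spill) {
        this.bufferCapacity = bufferCapacity;
        this.spill = spill;
    }

    public void collect(byte[] kv) {
        if (kv.length > bufferCapacity) {
            // record can never fit in the buffer: spill it alone, don't throw
            spill.writeBytes(kv);
            directSpills++;
            return;
        }
        if (buffer.size() + kv.length > bufferCapacity) {
            flush();  // ordinary spill of the full buffer
        }
        buffer.writeBytes(kv);
    }

    public void flush() {
        spill.writeBytes(buffer.toByteArray());
        buffer.reset();
    }

    public int getDirectSpills() { return directSpills; }
}
```

A real implementation would also have to keep the spill files sorted and indexed per partition; this sketch only shows the dispatch between buffering and direct spilling.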
[jira] [Created] (MAPREDUCE-6755) Report correct spill/output size for map in nativetask
He Tianyi created MAPREDUCE-6755:
------------------------------------

             Summary: Report correct spill/output size for map in nativetask
                 Key: MAPREDUCE-6755
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6755
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: nativetask
    Affects Versions: 2.6.0
            Reporter: He Tianyi
            Priority: Minor

Currently nativetask always reports '-1' as the size when calling methods in {{MapOutputFile}}. This is an obstacle for custom {{MapOutputFile}} implementations where size matters (e.g. determining which disk to use).
This issue proposes estimating spill/output/index sizes in nativetask as the Java implementation does.
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335164#comment-15335164 ]

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Oh, just another thought: maybe scenarios like this can benefit from nativetask (from Intel).

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives from {{stdin}} one line per (k, v) tuple, in which values with identical keys are not grouped.
> This brings some inefficiency, especially for runtimes based on an interpreter (e.g. cpython):
> A. the user program has to compare each key with the previous one (while on the Java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on each record, even when there are multiple values with an identical key,
> C. if the key is long, this clearly hurts caching.
> So we need another InputWriter. But that is not enough, since the {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. We could compare keys in a custom InputWriter and group them there, but that is also inefficient. Some other changes are also needed.
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326214#comment-15326214 ]

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Hi, [~templedf]. Thanks for these ideas.
I think the null-key solution meets the need. It is particularly feasible in a customized environment, since the application framework can be specialized to do this (in whatever language is supported internally), but it may not be worth it for a general platform.
Moving to pyspark is certainly a better solution, given the advantages that a higher-level abstraction and the Spark computing model bring. However, there is still IPC overhead unless we also move to a JVM-based language. Any suggestions?

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321658#comment-15321658 ]

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Actually, in my experiments (in-house workload), converting strings back and forth is not the bottleneck (it makes no difference with typedbytes). But just grouping values makes a simple reducer 20% faster (for both text and typedbytes).
Also, many users implement mappers/reducers in C/C++, which I think can be more efficient than Java/Scala (smaller memory footprint, less GC, no virtual-call overhead, better SIMD support, etc.).

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
[jira] [Comment Edited] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321658#comment-15321658 ]

He Tianyi edited comment on MAPREDUCE-6712 at 6/8/16 11:34 PM:
---------------------------------------------------------------

Actually, in my experiments (in-house workload), converting strings back and forth is not the bottleneck (it makes no difference with typedbytes). But just grouping values makes a simple reducer 20% faster (for both text and typedbytes).
Also, many users implement mappers/reducers in C/C++, which I think can be more efficient than Java/Scala (smaller memory footprint, less GC, better SIMD support, etc.).

was (Author: he tianyi):
Actually, in my experiments (in-house workload), converting strings back and forth is not the bottleneck (it makes no difference with typedbytes). But just grouping values makes a simple reducer 20% faster (for both text and typedbytes).
Also, many users implement mappers/reducers in C/C++, which I think can be more efficient than Java/Scala (smaller memory footprint, less GC, no virtual-call overhead, better SIMD support, etc.).

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
[jira] [Created] (MAPREDUCE-6712) Support grouping values for reducer on java-side
He Tianyi created MAPREDUCE-6712:
------------------------------------

             Summary: Support grouping values for reducer on java-side
                 Key: MAPREDUCE-6712
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: He Tianyi
            Priority: Minor

In hadoop streaming, with TextInputWriter, the reducer program receives from {{stdin}} one line per (k, v) tuple, in which values with identical keys are not grouped.
This brings some inefficiency, especially for runtimes based on an interpreter (e.g. cpython):
A. the user program has to compare each key with the previous one (while on the Java side, records already come to the reducer in groups),
B. the user program has to perform {{read}}, then {{find}} or {{split}}, on each record, even when there are multiple values with an identical key,
C. if the key is long, this clearly hurts caching.
So we need another InputWriter. But that is not enough, since the {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. We could compare keys in a custom InputWriter and group them there, but that is also inefficient. Some other changes are also needed.
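The grouping the issue asks for can be sketched as below. This is an illustrative class, not the actual streaming {{InputWriter}} API: it assumes records arrive sorted by key, as they do at the reducer, and emits each key once followed by all of its values, so the child process never re-reads or re-compares a key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a grouping writer for MAPREDUCE-6712; the real
// change would live behind a writeValues-style InputWriter interface.
public class GroupingWriter {
    // groups values under each key, preserving arrival order
    private final Map<String, List<String>> groups = new LinkedHashMap<>();

    public void write(String key, String value) {
        groups.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // renders "key \t v1 \t v2 ..." lines, one line per key, so the child
    // process parses each key exactly once
    public List<String> lines() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            out.add(e.getKey() + "\t" + String.join("\t", e.getValue()));
        }
        return out;
    }
}
```

Since reducer input is already sorted, a streaming implementation would not need the map at all: it could buffer values only until the key changes, keeping memory bounded per group.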
[jira] [Created] (MAPREDUCE-6687) Allow specifying java home via job configuration
He Tianyi created MAPREDUCE-6687:
------------------------------------

             Summary: Allow specifying java home via job configuration
                 Key: MAPREDUCE-6687
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6687
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Components: applicationmaster
            Reporter: He Tianyi
            Priority: Minor

Suggest allowing the user to choose a preferred JVM implementation (or version) for launching Map/Reduce tasks, by specifying a java home via JobConf.
This is especially useful for running A/B tests on real workloads, or for benchmarking between JVM implementations.
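The proposal amounts to a configuration lookup with a cluster-wide fallback, roughly as below. Both the property name `mapreduce.task.java.home` and the helper are hypothetical, invented for this sketch; the patch would define the real key.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged illustration for MAPREDUCE-6687: resolve the JVM for child tasks
// from a (hypothetical) job property, falling back to the cluster default.
public class JavaHomeConfigDemo {
    // stands in for a JobConf/Configuration lookup with a default value
    public static String resolveJavaHome(Map<String, String> conf, String clusterDefault) {
        return conf.getOrDefault("mapreduce.task.java.home", clusterDefault);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapreduce.task.java.home", "/opt/jdk-experimental");
        // the task launch context would use this path to build the child
        // JVM command line
        System.out.println(resolveJavaHome(conf, "/usr/lib/jvm/default"));
    }
}
```

Per-job resolution is what enables the A/B testing mentioned above: two otherwise identical jobs can differ only in this one property.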
[jira] [Resolved] (MAPREDUCE-6488) Make buffer size in PipeMapRed configurable
[ https://issues.apache.org/jira/browse/MAPREDUCE-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Tianyi resolved MAPREDUCE-6488.
----------------------------------
    Resolution: Invalid

> Make buffer size in PipeMapRed configurable
> -------------------------------------------
>
>                 Key: MAPREDUCE-6488
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6488
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: He Tianyi
>            Assignee: He Tianyi
>
> The default buffer size in {{PipeMapRed}} is 128K.
> When a mapper input record is too large to fit in the buffer, {{MapRunner}} blocks until it is written. If the child process and the input reader are both slow (due to computation and decompression), decoding and reading will rarely overlap with each other, hurting performance.
> I suppose we should make the buffer size configurable.
[jira] [Commented] (MAPREDUCE-6488) Make buffer size in PipeMapRed configurable
[ https://issues.apache.org/jira/browse/MAPREDUCE-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901860#comment-14901860 ]

He Tianyi commented on MAPREDUCE-6488:
--------------------------------------

Went through this again; the buffer size has nothing to do with this. Please mark as Invalid.

> Make buffer size in PipeMapRed configurable
> -------------------------------------------
>
>                 Key: MAPREDUCE-6488
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6488
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: He Tianyi
>            Assignee: He Tianyi
[jira] [Created] (MAPREDUCE-6488) Make buffer size in PipeMapRed configurable
He Tianyi created MAPREDUCE-6488:
------------------------------------

             Summary: Make buffer size in PipeMapRed configurable
                 Key: MAPREDUCE-6488
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6488
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: He Tianyi
            Assignee: He Tianyi

The default buffer size in {{PipeMapRed}} is 128K.
When a mapper input record is too large to fit in the buffer, {{MapRunner}} blocks until it is written. If the child process and the input reader are both slow (due to computation and decompression), decoding and reading will rarely overlap with each other, hurting performance.
I suppose we should make the buffer size configurable.