[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336923#comment-15336923 ]

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

Pyspark actually does some cleverness under the covers to replace as much of your Python code with Scala as possible. Regardless, the performance advantages of Spark over MR are huge. Not being forced to serialize to HDFS between chained jobs and being able to cache data in memory are a big deal.

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
> Key: MAPREDUCE-6712
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: He Tianyi
> Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives each
> line representing a (k, v) tuple from {{stdin}}, in which values with an
> identical key are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes
> (e.g. cpython), coming from:
> A. the user program has to compare each key with the previous one (but on the
> java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}} on each
> record, even if there are multiple values with an identical key,
> C. if the key is long, this apparently introduces inefficiency for
> caching.
> Suppose we need another InputWriter. But this is not enough, since the
> interface of {{InputWriter}} defines {{writeKey}} and {{writeValue}}, not
> {{writeValues}}. We could compare keys in a custom InputWriter and group
> them, but this is also inefficient. Some other changes are also needed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335164#comment-15335164 ]

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Oh, just another thought: maybe scenarios like this can benefit from nativetask (from Intel).
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326214#comment-15326214 ]

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Hi, [~templedf]. Thanks for these ideas.

I think the null key solution meets the need. It is particularly feasible in a customized environment, since the application framework can be specialized to do this (whatever language is supported internally), but it may not be worth it for a general platform.

Moving to pyspark is certainly a better solution, given the advantages that a higher-level abstraction and the Spark computing model bring. However, there is still an IPC overhead unless we also move to a JVM-based language. Any suggestions?
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324821#comment-15324821 ]

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

[~He Tianyi], after chewing on this a bit more, I think I see a way that it could be done that wouldn't be too disruptive. What if the first value comes through with the key, and subsequent values come through with a null key, i.e.:

{noformat}
key1\tvalue1
\tvalue2
\tvalue3
key2\tvalue4
\tvalue5
{noformat}

That approach breaks secondary sort and all legacy Streaming reducers, so it would have to be controlled by a config param that is off by default. It's not an unreasonable approach, though. Would that meet your needs? I haven't looked at the Streaming code yet to see whether it's feasible, but I suspect it is.
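For illustration only, a reducer consuming the null-key format proposed in this comment might look like the sketch below. Nothing here is an existing Streaming feature: the function names `iter_grouped` and `sum_per_key` are hypothetical, the values are assumed to be integers so that there is something to reduce, and a genuinely empty key is assumed never to occur, since it could not be told apart from a continuation line.

```python
import io
from typing import Dict, Iterable, Iterator, Tuple


def iter_grouped(lines: Iterable[str], sep: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs from the proposed format.

    A line with a non-empty key starts a new group; a line whose key part
    is empty carries another value for the most recent key.
    """
    current_key = None
    for line in lines:
        key, _, value = line.rstrip("\n").partition(sep)
        if key:
            current_key = key
        elif current_key is None:
            raise ValueError("first record must carry a key")
        yield current_key, value


def sum_per_key(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical reduce step: sum integer values per key."""
    totals: Dict[str, int] = {}
    for key, value in iter_grouped(lines):
        totals[key] = totals.get(key, 0) + int(value)
    return totals


if __name__ == "__main__":
    sample = io.StringIO("key1\t1\n\t2\n\t3\nkey2\t4\n\t5\n")
    print(sum_per_key(sample))  # {'key1': 6, 'key2': 9}
```

The reducer no longer compares each key with the previous one, and a long key is transmitted once per group rather than once per value. A real Streaming reducer would read from `sys.stdin` instead of the in-memory sample.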
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323690#comment-15323690 ]

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

For C++ apps, there's Hadoop Pipes, which more closely models Java MapReduce. For Python, I strongly recommend taking a look at pyspark.

Hadoop Streaming is not intended to be high performance. The general argument for the use of Streaming is that the time spent writing a Java MapReduce job would be more than the time lost by using Streaming.

I don't see a way to resolve this issue in any reasonable way. If you include all values for a key in a single line, you have a strong chance of running the reducer out of memory trying to read it. The only way I can see it working is in the case of typedbytes or with regular strings using some unambiguous value separator. You'd have to require that the reducer read the list of values one at a time rather than reading the entire line. That seems like a pretty strict requirement and not something we'd want to enable in the general platform, especially when there is a clear and well-tested workaround: Java MapReduce.
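The requirement described in this comment, reading a key's values one at a time instead of reading the entire line, could look roughly like the sketch below on the reducer side. It rests on assumptions that are not in the thread: each record is laid out as key, separator, value, separator, value and so on, ended by a newline; the separator is the ASCII unit separator `\x1f`; and neither character ever appears inside a key or value. The name `iter_pairs` is hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_pairs(stream: TextIO, sep: str = "\x1f", end: str = "\n",
               chunk_size: int = 4096) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs while reading the stream in fixed-size chunks.

    Memory use is bounded by the chunk size plus the longest single field,
    not by the total length of a record.
    """
    key = None    # key of the record currently being read, if any
    pending = ""  # text of a field that has not been terminated yet
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        start = 0
        for i, ch in enumerate(chunk):
            if ch == sep or ch == end:
                field = pending + chunk[start:i]
                pending = ""
                start = i + 1
                if key is None:
                    key = field       # first field of a record is its key
                else:
                    yield key, field  # every later field is a value
                if ch == end:
                    key = None        # record finished
        pending += chunk[start:]
    if key is not None:
        # Last record was not newline-terminated; flush its final value.
        yield key, pending


if __name__ == "__main__":
    data = io.StringIO("k1\x1fa\x1fb\nk2\x1fc\n")
    for pair in iter_pairs(data, chunk_size=4):
        print(pair)
```

The per-character loop is written for clarity rather than speed. The point is only that the reducer never has to hold a whole record in memory, which is the constraint that makes a single-line-per-key format risky.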
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321658#comment-15321658 ]

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Actually, in my experiments (in-house workload), converting strings back and forth is not the bottleneck (using typedbytes makes no difference). But just grouping values makes a simple reducer 20% faster (for both text and typedbytes).

Also, many users implement their mapper/reducer in C/C++, which I think can be more efficient than java/scala (smaller memory footprint, less gc, no virtual call overhead, better SIMD support, etc.).
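For reference, the per-record work that the issue description lists under A and B looks roughly like this in a conventional Streaming reducer. This is a sketch, not code from the thread: the function name `reduce_sum` is hypothetical and summing integer values is an assumed workload.

```python
import io
from typing import Iterable, List, Tuple


def reduce_sum(lines: Iterable[str], sep: str = "\t") -> List[Tuple[str, int]]:
    """Sum integer values per key from ungrouped 'key<TAB>value' lines.

    Input is assumed to be sorted by key, as Streaming delivers it.
    """
    results: List[Tuple[str, int]] = []
    prev_key = None
    total = 0
    for line in lines:
        # Item B: every record is split, even when the key repeats.
        key, _, value = line.rstrip("\n").partition(sep)
        # Item A: every key is compared with the previous one.
        if key != prev_key:
            if prev_key is not None:
                results.append((prev_key, total))
            prev_key, total = key, 0
        total += int(value)
    if prev_key is not None:
        results.append((prev_key, total))
    return results


if __name__ == "__main__":
    sample = io.StringIO("a\t1\na\t2\nb\t5\n")
    print(reduce_sum(sample))  # [('a', 3), ('b', 5)]
```

Both the split and the comparison run once per value, so their cost grows with the number of values rather than the number of distinct keys.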
[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320692#comment-15320692 ]

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

Hadoop Streaming is limited by the fact that all intermediate data are passed as strings. In most cases the cost of translating those strings back into the intended data types makes Hadoop Streaming so much slower than Java MapReduce that tuning the Hadoop Streaming implementation won't make a significant dent. Turning strings into numbers is expensive. Using interpreted languages is expensive. If you want better performance, you should consider Java MapReduce or, better yet, Spark, e.g. pyspark.