[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-17 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336923#comment-15336923
 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

PySpark actually does some cleverness under the covers to replace as much of your 
Python code with Scala as possible.  Regardless, Spark's performance advantages 
over MR are huge.  Not being forced to serialize to HDFS between chained jobs 
and being able to cache data in memory are a big deal.
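
As a rough illustration of both points, here is a minimal PySpark sketch (the input path and field layout are made up for the example) that chains stages within one job and caches an intermediate result in memory instead of writing it to HDFS between stages:

{noformat}
# Hypothetical PySpark sketch: chained transformations run within one job, and
# an intermediate RDD can be cached in memory and reused by several actions.
from pyspark import SparkContext

sc = SparkContext(appName="grouping-example")

pairs = sc.textFile("hdfs:///tmp/events.tsv") \
          .map(lambda line: line.split("\t")) \
          .map(lambda cols: (cols[0], 1))

counts = pairs.reduceByKey(lambda a, b: a + b).cache()  # kept in memory

# Both actions reuse the cached result; with chained MR jobs each stage
# boundary would be serialized to HDFS instead.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
distinct_keys = counts.count()
{noformat}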

> Support grouping values for reducer on java-side
> -------------------------------------------------
>
> Key: MAPREDUCE-6712
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: He Tianyi
>Priority: Minor
>
> In Hadoop Streaming, with TextInputWriter, the reducer program receives each 
> line from {{stdin}} as a (k, v) tuple, and values with identical keys are not 
> grouped.
> This introduces inefficiency, especially for interpreter-based runtimes 
> (e.g. cpython):
> A. the user program has to compare each key with the previous one (on the 
> Java side, records already arrive at the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on 
> every record, even when there are multiple values with the same key,
> C. if the key is long, this also hurts caching.
> Suppose we add another InputWriter. That alone is not enough, since the 
> {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. We could compare keys in a custom InputWriter and group 
> values there, but that is also inefficient. Other changes are needed as well.
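
For illustration, a minimal sketch of the pattern points A and B describe, assuming a tab-separated text protocol and a trivial count-per-key reducer:

{noformat}
#!/usr/bin/env python
# Minimal sketch of what a streaming reducer must do today: every record is
# read from stdin, split on the tab, and its key compared against the previous
# one, because values arrive ungrouped (points A and B above).
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

# groupby still compares the key of every single record.
for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    count = sum(1 for _ in group)          # e.g. a simple count per key
    print("%s\t%d" % (key, count))
{noformat}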



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-16 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335164#comment-15335164
 ] 

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Oh, just another thought: maybe scenarios like this can benefit from nativetask 
(from Intel). 




[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-12 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326214#comment-15326214
 ] 

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Hi, [~templedf]. Thanks for these ideas.
I think the null-key solution meets the need. It is particularly feasible in a 
customized environment, since the application framework can be specialized to 
do this (in whatever language is supported internally), but it may not be worth 
it for a general platform.

Moving to pyspark is certainly a better solution, given the advantages that a 
higher-level abstraction and the Spark computing model bring. However, there is 
still IPC overhead unless we also move to a JVM-based language. Any 
suggestions?




[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-10 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324821#comment-15324821
 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

[~He Tianyi], after chewing on this a bit more, I think I see a way that it 
could be done that wouldn't be too disruptive.  What if the first value comes 
through with the key, and subsequent values come through with a null key, i.e.:

{noformat}
key1\tvalue1
\tvalue2
\tvalue3
key2\tvalue4
\tvalue5
{noformat}

That approach breaks secondary sort and all legacy Streaming reducers, so it 
would have to be controlled by a config param that is off by default.  It's not 
an unreasonable approach, though.  Would that meet your needs?  I haven't 
looked at the Streaming code yet to see whether it's feasible, but I suspect it 
is.
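
For what it's worth, a reducer consuming that layout could look roughly like the following sketch (assuming the hypothetical config option is enabled and the field separator stays a tab):

{noformat}
# Sketch of a reducer reading the proposed layout: a non-empty key field starts
# a new group, an empty key field means the value belongs to the current group.
# Values are buffered here for brevity; a real reducer could just as well
# process them one at a time.
import sys

def grouped(stream):
    key, values = None, []
    for line in stream:
        k, _, v = line.rstrip("\n").partition("\t")
        if k:                          # new key -> emit the previous group
            if key is not None:
                yield key, values
            key, values = k, [v]
        else:                          # empty key -> same group continues
            values.append(v)
    if key is not None:
        yield key, values

for key, values in grouped(sys.stdin):
    print("%s\t%d" % (key, len(values)))
{noformat}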




[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-09 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323690#comment-15323690
 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

For C++ apps, there's Hadoop Pipes, which more closely models Java MapReduce.  
For Python, I strongly recommend taking a look at pyspark.  Hadoop Streaming is 
not intended to be high performance.  The general argument for using Streaming 
is that the time spent writing a Java MapReduce job would be greater than the 
time lost by using Streaming.

I don't see a reasonable way to resolve this issue.  If you include all values 
for a key in a single line, you have a strong chance of running the reducer out 
of memory trying to read it.  The only way I can see it working is with 
typedbytes, or with regular strings using some unambiguous value separator.  
You'd have to require that the reducer read the list of values one at a time 
rather than reading the entire line.  That seems like a pretty strict 
requirement and not something we'd want to enable in the general platform, 
especially when there is a clear and well-tested workaround: Java MapReduce.
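
To make that last point concrete, reading values one at a time from such a line would mean consuming stdin in chunks and cutting on the separators incrementally, along the lines of this sketch (the value separator '\x01' and the chunk size are illustrative assumptions, not anything Streaming defines):

{noformat}
import sys

VALUE_SEP = "\x01"     # hypothetical unambiguous value separator

def iter_tokens(stream, chunk_size=64 * 1024):
    """Yield (token, end_of_record): a token ends at VALUE_SEP (more values
    follow for this key) or at a newline (last value for this key).  Only one
    partial token is ever buffered, never the whole line."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if buf:
                yield buf, True
            return
        buf += chunk
        while True:
            i_sep, i_nl = buf.find(VALUE_SEP), buf.find("\n")
            if i_sep == -1 and i_nl == -1:
                break
            if i_nl == -1 or (i_sep != -1 and i_sep < i_nl):
                yield buf[:i_sep], False
                buf = buf[i_sep + 1:]
            else:
                yield buf[:i_nl], True
                buf = buf[i_nl + 1:]

# The first token of each record would carry "key\tvalue1"; the caller splits
# it to recover the key, then consumes the remaining values one by one.
{noformat}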




[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-08 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321658#comment-15321658
 ] 

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Actually, in my experiments (in-house workload), converting strings back and 
forth is not the bottleneck (it makes no difference with typedbytes). But just 
grouping values makes a simple reducer 20% faster (for both text and 
typedbytes). 
Also, many users implement their mappers/reducers in C/C++, which I think can 
be more efficient than Java/Scala (smaller memory footprint, less GC, no 
virtual call overhead, better SIMD support, etc.). 




[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-08 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320692#comment-15320692
 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

Hadoop Streaming is limited by the fact that all intermediate data are passed 
as strings.  In most cases the cost of translating those strings back into the 
intended data types makes Hadoop Streaming so much slower than Java MapReduce 
that tuning the Hadoop Streaming implementation won't make a significant dent.  
Turning strings into numbers is expensive.  Using interpreted languages is 
expensive.  If you want better performance you should consider Java MapReduce, 
or better yet, Spark, e.g. pyspark. 
