He Tianyi created MAPREDUCE-6712:
------------------------------------

             Summary: Support grouping values for reducer on java-side
                 Key: MAPREDUCE-6712
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: He Tianyi
            Priority: Minor


In Hadoop streaming with TextInputWriter, the reducer program receives one 
line per (k, v) tuple on {{stdin}}; values that share a key are not 
grouped.
This is inefficient, especially for interpreter-based runtimes (e.g. CPython), 
because:
A. the user program has to compare each key with the previous one (although on 
the Java side records already arrive at the reducer in groups),
B. the user program has to perform a {{read}}, then a {{find}} or {{split}}, on 
every record, even when multiple values share the same key,
C. if keys are long, this also makes caching less efficient.
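
For illustration, here is a minimal sketch of what a streaming reducer has to 
do today to regroup values. It assumes tab-separated, key-sorted lines; the 
function name {{group_sorted_lines}} is hypothetical and not part of Hadoop.

```python
def group_sorted_lines(lines, sep="\t"):
    """Yield (key, [values]) from key-sorted 'key<sep>value' lines.

    Hypothetical helper: it only illustrates the per-record work that a
    streaming reducer performs when values arrive ungrouped.
    """
    prev_key = None
    values = []
    for line in lines:
        # Cost B: every record is split, even when the key repeats.
        key, _, value = line.rstrip("\n").partition(sep)
        # Cost A: every key is compared with the previous one.
        if prev_key is not None and key != prev_key:
            yield prev_key, values
            values = []
        prev_key = key
        values.append(value)
    # Flush the final group, if any input was seen.
    if prev_key is not None:
        yield prev_key, values
```

A reducer script would typically iterate over 
{{group_sorted_lines(sys.stdin)}}. Both the split and the comparison run once 
per record, in the interpreter, which is the overhead described above.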

I suppose we need another InputWriter. But that alone is not enough, since the 
{{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
{{writeValues}}. We could compare keys in a custom InputWriter and group the 
values there, but that is also inefficient. Some other changes are needed as 
well.
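
To make the limitation concrete, the following sketch models a writer that 
is only ever called per record and therefore has to compare keys itself. It 
is a Python model, not Hadoop code: the class {{GroupingWriter}}, its 
{{finish}} method and the output layout (key followed by all its values on 
one tab-separated line) are assumptions made for illustration; only the 
{{write_key}} / {{write_value}} pair mirrors the {{writeKey}} / {{writeValue}} 
methods mentioned above.

```python
import io


class GroupingWriter:
    """Hypothetical model of a per-record writer that groups values itself."""

    def __init__(self, out, sep="\t"):
        self.out = out
        self.sep = sep
        self.current_key = None   # key of the group currently being written
        self.pending_key = None   # key passed to the latest write_key call
        self.have_group = False   # True once at least one record was written

    def write_key(self, key):
        # The key cannot be emitted yet: we do not know whether it opens a
        # new group until it is compared with the current one.
        self.pending_key = key

    def write_value(self, value):
        if self.have_group and self.pending_key == self.current_key:
            # Same key as before: append the value to the open line.
            self.out.write(self.sep + value)
        else:
            # New key: close the previous line (if any) and start a new one.
            if self.have_group:
                self.out.write("\n")
            self.out.write(self.pending_key + self.sep + value)
            self.current_key = self.pending_key
            self.have_group = True

    def finish(self):
        # Terminate the last line so the consumer sees a complete record.
        if self.have_group:
            self.out.write("\n")
```

The key comparison in {{write_value}} repeats work that the framework has 
already done when it grouped the records for the reducer, which is why a 
values-level entry point would be preferable.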




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
