[jira] [Updated] (MAPREDUCE-6757) Multithreaded mapper corrupts buffer pusher in nativetask

2016-08-16 Thread He Tianyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Tianyi updated MAPREDUCE-6757:
---------------------------------
Status: Patch Available  (was: Open)

> Multithreaded mapper corrupts buffer pusher in nativetask
> ---------------------------------------------------------
>
> Key: MAPREDUCE-6757
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6757
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: nativetask
>Affects Versions: 3.0.0-alpha1
>Reporter: He Tianyi
> Attachments: MAPREDUCE-6757..patch
>
>
> Multiple threads may call the {{collect}} method of the same 
> {{NativeMapOutputCollectorDelegator}} instance at the same time. In this 
> case, the buffer can be corrupted.
> This may occur when executing Hive queries with a custom script.
> Adding the {{synchronized}} keyword to the {{collect}} method would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6757) Multithreaded mapper corrupts buffer pusher in nativetask

2016-08-16 Thread He Tianyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Tianyi updated MAPREDUCE-6757:
---------------------------------
Attachment: MAPREDUCE-6757..patch

> Multithreaded mapper corrupts buffer pusher in nativetask
> ---------------------------------------------------------
>
> Key: MAPREDUCE-6757
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6757
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: nativetask
>Affects Versions: 3.0.0-alpha1
>Reporter: He Tianyi
> Attachments: MAPREDUCE-6757..patch
>
>
> Multiple threads may call the {{collect}} method of the same 
> {{NativeMapOutputCollectorDelegator}} instance at the same time. In this 
> case, the buffer can be corrupted.
> This may occur when executing Hive queries with a custom script.
> Adding the {{synchronized}} keyword to the {{collect}} method would solve the problem.






[jira] [Created] (MAPREDUCE-6757) Multithreaded mapper corrupts buffer pusher in nativetask

2016-08-16 Thread He Tianyi (JIRA)
He Tianyi created MAPREDUCE-6757:


 Summary: Multithreaded mapper corrupts buffer pusher in nativetask
 Key: MAPREDUCE-6757
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6757
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: nativetask
Affects Versions: 3.0.0-alpha1
Reporter: He Tianyi


Multiple threads may call the {{collect}} method of the same 
{{NativeMapOutputCollectorDelegator}} instance at the same time. In this case, 
the buffer can be corrupted.
This may occur when executing Hive queries with a custom script.

Adding the {{synchronized}} keyword to the {{collect}} method would solve the problem.
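The fix can be sketched as a minimal, self-contained illustration (this is not the actual Hadoop class): a shared delegator whose {{collect}} advances a buffer position. Without synchronization, concurrent mapper threads can interleave and corrupt the position; marking {{collect}} as {{synchronized}} serializes access, which is what the patch proposes.

```java
public class SyncCollectSketch {
    static class Delegator {
        private final byte[] buffer = new byte[1024];
        private int pos = 0;

        // synchronized: only one thread may push into the buffer at a time
        synchronized void collect(byte[] kv) {
            if (pos + kv.length > buffer.length) {
                pos = 0;  // stand-in for flushing the full buffer
            }
            System.arraycopy(kv, 0, buffer, pos, kv.length);
            pos += kv.length;
        }

        synchronized int position() { return pos; }
    }

    public static void main(String[] args) throws InterruptedException {
        Delegator d = new Delegator();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                // 4 threads * 100 records * 8 bytes pushed concurrently
                for (int j = 0; j < 100; j++) {
                    d.collect(new byte[8]);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("final position: " + d.position());
    }
}
```

Without the {{synchronized}} keyword, two threads can read the same {{pos}}, copy over each other's bytes, and leave the position inconsistent with what was written.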






[jira] [Created] (MAPREDUCE-6756) native map output collector should be able to handle KV larger than io.sort.mb

2016-08-15 Thread He Tianyi (JIRA)
He Tianyi created MAPREDUCE-6756:


 Summary: native map output collector should be able to handle KV 
larger than io.sort.mb
 Key: MAPREDUCE-6756
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6756
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: nativetask
Affects Versions: 2.6.0
Reporter: He Tianyi


Currently, an exception is thrown if kvLength > io.sort.mb.
In rare cases this can occur, especially for Hive queries.

Maybe we could implement a 'spill single record' function.
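One way the 'spill single record' path could look, as a hedged sketch (the names and the byte-stream "spill file" are illustrative, not the actual nativetask API, and sorting of buffered records is omitted for brevity): a record larger than the sort buffer bypasses it and is written straight to the spill output instead of raising an exception.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class SpillSingleRecordSketch {
    static final int SORT_BUFFER_BYTES = 16;            // stand-in for io.sort.mb
    final ByteArrayOutputStream spillFile = new ByteArrayOutputStream();
    final byte[] sortBuffer = new byte[SORT_BUFFER_BYTES];
    int used = 0;

    void collect(byte[] kv) throws IOException {
        if (kv.length > SORT_BUFFER_BYTES) {
            // oversized record: it can never fit in the buffer, so instead
            // of throwing, spill this single record directly
            spillFile.write(kv);
            return;
        }
        if (used + kv.length > SORT_BUFFER_BYTES) {
            spillFile.write(sortBuffer, 0, used);       // ordinary spill
            used = 0;
        }
        System.arraycopy(kv, 0, sortBuffer, used, kv.length);
        used += kv.length;
    }
}
```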






[jira] [Created] (MAPREDUCE-6755) Report correct spill/output size for map in nativetask

2016-08-15 Thread He Tianyi (JIRA)
He Tianyi created MAPREDUCE-6755:


 Summary: Report correct spill/output size for map in nativetask
 Key: MAPREDUCE-6755
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6755
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: nativetask
Affects Versions: 2.6.0
Reporter: He Tianyi
Priority: Minor


Currently nativetask always reports '-1' as the size when calling methods in 
{{MapOutputFile}}.
This is an obstacle for custom {{MapOutputFile}} implementations where size 
matters (e.g. determining which disk to use).

This issue proposes to estimate spill/output/index sizes in nativetask as the 
Java implementation does.
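A hedged sketch of what such estimates might look like (the method names are illustrative, not the actual nativetask API). In the Java collector, each partition contributes one fixed-size index record of three longs (start offset, raw length, part length), so the index size is easy to predict; spill/output sizes can be derived from bytes actually written instead of passing -1.

```java
public class SizeEstimateSketch {
    // One index record per partition: three longs
    // (start offset, raw length, part length) = 24 bytes.
    static final int INDEX_RECORD_BYTES = 24;

    static long estimateIndexFileSize(int numPartitions) {
        return (long) numPartitions * INDEX_RECORD_BYTES;
    }

    // Output/spill size: bytes flushed so far plus what is still buffered,
    // rather than a constant -1.
    static long estimateOutputSize(long bytesWritten, long pendingBufferBytes) {
        return bytesWritten + pendingBufferBytes;
    }
}
```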






[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-16 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335164#comment-15335164
 ] 

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Oh, just another thought: maybe scenarios like this can benefit from nativetask 
(from Intel). 

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
> Key: MAPREDUCE-6712
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: He Tianyi
>Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives each 
> line representing a (k, v) tuple from {{stdin}}, in which values with 
> identical keys are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes 
> (e.g. cpython), because:
> A. the user program has to compare each key with the previous one (whereas 
> on the java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on 
> each record, even if there are multiple values with an identical key,
> C. if the key is large, this introduces caching inefficiency.
> Suppose we need another InputWriter. But this is not enough, since the 
> {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. Though we could compare keys in a custom InputWriter and 
> group them, that is also inefficient. Some other changes are needed as well.






[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-12 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326214#comment-15326214
 ] 

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Hi, [~templedf]. Thanks for these ideas.
I think the null-key solution meets the need. It is particularly feasible in a 
customized environment, since the application framework can be specialized to 
do this (in whatever language is supported internally), but it may not be 
worth it for a general platform.

Moving to pyspark is certainly a better solution, given the advantages that a 
higher-level abstraction and the Spark computing model bring. However, there 
is still IPC overhead unless we also move to a JVM-based language. Any 
suggestions?

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
> Key: MAPREDUCE-6712
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: He Tianyi
>Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives each 
> line representing a (k, v) tuple from {{stdin}}, in which values with 
> identical keys are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes 
> (e.g. cpython), because:
> A. the user program has to compare each key with the previous one (whereas 
> on the java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on 
> each record, even if there are multiple values with an identical key,
> C. if the key is large, this introduces caching inefficiency.
> Suppose we need another InputWriter. But this is not enough, since the 
> {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. Though we could compare keys in a custom InputWriter and 
> group them, that is also inefficient. Some other changes are needed as well.






[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-08 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321658#comment-15321658
 ] 

He Tianyi commented on MAPREDUCE-6712:
--------------------------------------

Actually, in my experiments (in-house workload) turning strings back and forth 
is not the bottleneck (it does not make a difference with typedbytes). But 
just grouping values makes a simple reducer 20% faster (for both text and 
typedbytes). 
Also, many users are using C/C++ to implement mappers/reducers, which I think 
can be more efficient than java/scala (smaller memory footprint, less GC, no 
virtual call overhead, better SIMD support, etc.). 

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
> Key: MAPREDUCE-6712
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: He Tianyi
>Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives each 
> line representing a (k, v) tuple from {{stdin}}, in which values with 
> identical keys are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes 
> (e.g. cpython), because:
> A. the user program has to compare each key with the previous one (whereas 
> on the java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on 
> each record, even if there are multiple values with an identical key,
> C. if the key is large, this introduces caching inefficiency.
> Suppose we need another InputWriter. But this is not enough, since the 
> {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. Though we could compare keys in a custom InputWriter and 
> group them, that is also inefficient. Some other changes are needed as well.






[jira] [Comment Edited] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-08 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321658#comment-15321658
 ] 

He Tianyi edited comment on MAPREDUCE-6712 at 6/8/16 11:34 PM:
---------------------------------------------------------------

Actually, in my experiments (in-house workload) turning strings back and forth 
is not the bottleneck (it does not make a difference with typedbytes). But 
just grouping values makes a simple reducer 20% faster (for both text and 
typedbytes). 
Also, many users are using C/C++ to implement mappers/reducers, which I think 
can be more efficient than java/scala (smaller memory footprint, less GC, 
better SIMD support, etc.). 


was (Author: he tianyi):
Actually in my experiements (in-house workload) turning strings back and forth 
is not the bottleneck (does not make a difference with typedbytes). But just 
grouping values make a simple reducer 20% faster (for both text and 
typedbytes). 
Also, many users are using C/C++ to implement mapper/reducer which I think is 
possible to be more efficient than java/scala (smaller memory footprint, less 
gc, no virtual call overhead, better SIMD support, etc.). 

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
> Key: MAPREDUCE-6712
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: He Tianyi
>Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives each 
> line representing a (k, v) tuple from {{stdin}}, in which values with 
> identical keys are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes 
> (e.g. cpython), because:
> A. the user program has to compare each key with the previous one (whereas 
> on the java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on 
> each record, even if there are multiple values with an identical key,
> C. if the key is large, this introduces caching inefficiency.
> Suppose we need another InputWriter. But this is not enough, since the 
> {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. Though we could compare keys in a custom InputWriter and 
> group them, that is also inefficient. Some other changes are needed as well.






[jira] [Created] (MAPREDUCE-6712) Support grouping values for reducer on java-side

2016-06-08 Thread He Tianyi (JIRA)
He Tianyi created MAPREDUCE-6712:


 Summary: Support grouping values for reducer on java-side
 Key: MAPREDUCE-6712
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: contrib/streaming
Reporter: He Tianyi
Priority: Minor


In hadoop streaming, with TextInputWriter, the reducer program receives each 
line representing a (k, v) tuple from {{stdin}}, in which values with 
identical keys are not grouped.
This brings some inefficiency, especially for interpreter-based runtimes 
(e.g. cpython), because:
A. the user program has to compare each key with the previous one (whereas on 
the java side, records already come to the reducer in groups),
B. the user program has to perform {{read}}, then {{find}} or {{split}}, on 
each record, even if there are multiple values with an identical key,
C. if the key is large, this introduces caching inefficiency.

Suppose we need another InputWriter. But this is not enough, since the 
{{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
{{writeValues}}. Though we could compare keys in a custom InputWriter and 
group them, that is also inefficient. Some other changes are needed as well.
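The proposed extension could be sketched like this. The interface and class names are hypothetical (the real streaming {{InputWriter}} defines only {{writeKey}} and {{writeValue}}): a {{writeValues}} hook emits a key once followed by all of its values, so an interpreter-side reducer no longer has to re-compare keys line by line.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

// Hypothetical grouped variant of streaming's InputWriter.
interface GroupedInputWriter<K, V> {
    void writeKey(K key) throws IOException;
    void writeValue(V value) throws IOException;
    // Proposed addition: write one key and its whole value group at once.
    void writeValues(K key, Iterator<V> values) throws IOException;
}

// Illustrative text encoding: key TAB v1 TAB v2 ... NEWLINE, one line per group.
class GroupedTextInputWriter<K, V> implements GroupedInputWriter<K, V> {
    private final Writer out;

    GroupedTextInputWriter(Writer out) { this.out = out; }

    public void writeKey(K key) throws IOException {
        out.write(key.toString());
        out.write('\t');
    }

    public void writeValue(V value) throws IOException {
        out.write(value.toString());
        out.write('\n');
    }

    public void writeValues(K key, Iterator<V> values) throws IOException {
        out.write(key.toString());
        while (values.hasNext()) {
            out.write('\t');
            out.write(values.next().toString());
        }
        out.write('\n');
    }
}
```

With this shape, the framework emits `k\tv1\tv2\t...` per group, and the child process splits once per group instead of once per record.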







[jira] [Created] (MAPREDUCE-6687) Allow specifying java home via job configuration

2016-04-26 Thread He Tianyi (JIRA)
He Tianyi created MAPREDUCE-6687:


 Summary: Allow specifying java home via job configuration
 Key: MAPREDUCE-6687
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6687
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: applicationmaster
Reporter: He Tianyi
Priority: Minor


Suggest allowing the user to use a preferred JVM implementation (or version) 
to launch Map/Reduce tasks by specifying the java home via JobConf. 

Especially useful for running A/B tests on real workloads, or benchmarks 
between JVM implementations.





[jira] [Resolved] (MAPREDUCE-6488) Make buffer size in PipeMapRed configurable

2015-10-02 Thread He Tianyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Tianyi resolved MAPREDUCE-6488.
----------------------------------
Resolution: Invalid

> Make buffer size in PipeMapRed configurable
> -------------------------------------------
>
> Key: MAPREDUCE-6488
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6488
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: He Tianyi
>Assignee: He Tianyi
>
> The default buffer size is 128K in {{PipeMapRed}}.
> When a mapper input record is large enough that it won't fit in the buffer, 
> {{MapRunner}} blocks until it is written. If the child process and the input 
> reader are both slow (due to computation and decompression), then decoding 
> and reading will rarely overlap with each other, hurting performance.
> I suppose we should make the buffer size configurable.





[jira] [Commented] (MAPREDUCE-6488) Make buffer size in PipeMapRed configurable

2015-09-21 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901860#comment-14901860
 ] 

He Tianyi commented on MAPREDUCE-6488:
--

Went through this again, buffer size has nothing to do with this.
Please mark as Invalid.

> Make buffer size in PipeMapRed configurable
> -------------------------------------------
>
> Key: MAPREDUCE-6488
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6488
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: He Tianyi
>Assignee: He Tianyi
>
> The default buffer size is 128K in {{PipeMapRed}}.
> When a mapper input record is large enough that it won't fit in the buffer, 
> {{MapRunner}} blocks until it is written. If the child process and the input 
> reader are both slow (due to computation and decompression), then decoding 
> and reading will rarely overlap with each other, hurting performance.
> I suppose we should make the buffer size configurable.





[jira] [Created] (MAPREDUCE-6488) Make buffer size in PipeMapRed configurable

2015-09-21 Thread He Tianyi (JIRA)
He Tianyi created MAPREDUCE-6488:


 Summary: Make buffer size in PipeMapRed configurable
 Key: MAPREDUCE-6488
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6488
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: He Tianyi
Assignee: He Tianyi


The default buffer size is 128K in {{PipeMapRed}}.

When a mapper input record is large enough that it won't fit in the buffer, 
{{MapRunner}} blocks until it is written. If the child process and the input 
reader are both slow (due to computation and decompression), then decoding and 
reading will rarely overlap with each other, hurting performance.

I suppose we should make the buffer size configurable.
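A hedged sketch of the change: read the buffer size from the job configuration, with the existing 128K constant as the default. The property name is illustrative, not an actual Hadoop key, and a plain {{Properties}} stands in for Hadoop's {{Configuration}}.

```java
import java.util.Properties;

public class PipeBufferConfigSketch {
    static final int DEFAULT_BUFFER_SIZE = 128 * 1024;  // current hard-coded 128K
    // Hypothetical property name; not an actual Hadoop configuration key.
    static final String BUFFER_SIZE_KEY = "stream.pipemapred.buffer.size";

    // Returns the configured buffer size, falling back to the old default.
    static int bufferSize(Properties conf) {
        return Integer.parseInt(
            conf.getProperty(BUFFER_SIZE_KEY,
                             String.valueOf(DEFAULT_BUFFER_SIZE)));
    }
}
```

Jobs with large records could then raise the value so that writing to the child process and reading its output overlap better.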


