[jira] [Comment Edited] (HIVE-17879) Can not find java.sql.date in JDK9 when building hive

2018-02-14 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363616#comment-16363616
 ] 

liyunzhang edited comment on HIVE-17879 at 2/14/18 8:08 AM:


Most of the work in HIVE-17879.patch belongs to 
https://github.com/kgyrtkirk/hive/tree/jdk9-trial; I just wrote it up as the 
patch. It is based on trunk: 88ca553.


was (Author: kellyzly):
Most of the work in HIVE-17879.patch belongs to 
https://github.com/kgyrtkirk/hive/tree/jdk9-trial; I just wrote it up as the 
patch.


[jira] [Commented] (HIVE-17879) Can not find java.sql.date in JDK9 when building hive

2018-02-14 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363616#comment-16363616
 ] 

liyunzhang commented on HIVE-17879:
---

Most of the work in HIVE-17879.patch belongs to 
https://github.com/kgyrtkirk/hive/tree/jdk9-trial; I just wrote it up as the 
patch.


[jira] [Updated] (HIVE-17879) Can not find java.sql.date in JDK9 when building hive

2018-02-14 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17879:
--
Attachment: HIVE-17879.patch

> Can not find java.sql.date in JDK9 when building hive
> -
>
> Key: HIVE-17879
> URL: https://issues.apache.org/jira/browse/HIVE-17879
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang
>Priority: Major
> Attachments: HIVE-17879.patch
>
>
> When building Hive with JDK 9, the build fails with the following error:
> {code}
> [ERROR] Failed to execute goal 
> org.datanucleus:datanucleus-maven-plugin:3.3.0-release:enhance (default) on 
> project hive-standalone-metastore: Error executing DataNucleus tool 
> org.datanucleus.enhancer.DataNucleusEnhancer: InvocationTargetException: 
> java/sql/Date: java.sql.Date -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
> goal org.datanucleus:datanucleus-maven-plugin:3.3.0-release:enhance (default) 
> on project hive-standalone-metastore: Error executing DataNucleus tool 
> org.datanucleus.enhancer.DataNucleusEnhancer
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
>   at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
>   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
>   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
>   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
>   at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.MojoExecutionException: Error executing 
> DataNucleus tool org.datanucleus.enhancer.DataNucleusEnhancer
>   at 
> org.datanucleus.maven.AbstractDataNucleusMojo.executeInJvm(AbstractDataNucleusMojo.java:350)
>   at 
> org.datanucleus.maven.AbstractEnhancerMojo.enhance(AbstractEnhancerMojo.java:266)
>   at 
> org.datanucleus.maven.AbstractEnhancerMojo.executeDataNucleusTool(AbstractEnhancerMojo.java:72)
>   at 
> org.datanucleus.maven.AbstractDataNucleusMojo.execute(AbstractDataNucleusMojo.java:126)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
>   ... 20 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> org.datanucleus.maven.AbstractDataNucleusMojo.executeInJvm(AbstractDataNucleusMojo.java:333)
>   ... 25 more
> Caused by: java.lang.NoClassDefFoundError: java/sql/Date
>   at org.datanucleus.ClassConstants.(ClassConstants.java:66)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.registerExtensions(NonManagedPluginRegistry.java:206)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.registerExtensionPoints(NonManagedPluginRegistry.java:155)
>   at org.datanucleus.plugin.PluginManager.(PluginManager.java:63)
>   

[jira] [Commented] (HIVE-14285) Explain outputs: map-entry ordering of non-primitive objects.

2018-02-12 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361692#comment-16361692
 ] 

liyunzhang commented on HIVE-14285:
---

[~kgyrtkirk]: 
{code}
I'm starting to remember...I think prior to this patch the order was the 
same...I just needed to flatten the keys if they were more complex objects 
(like: Path)
{code}
Yes, prior to the patch it also used a TreeMap, and a TreeMap orders entries 
by key; your patch does not change the order.
My *question* here is whether the Maps in the same stage are executed in 
parallel at runtime.
In the following example, {{Map 1}} and {{Map 4}} are both in Stage-1; at 
runtime they start together, so we don't care which of them ({{Map 1}} or 
{{Map 4}}) comes first in the explain output.
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Tez
 A masked pattern was here 
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
 A masked pattern was here 
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: srcpart
  filterExpr: ds is not null (type: boolean)
  Statistics: Num rows: 2000 Data size: 389248 Basic stats: 
COMPLETE Column stats: COMPLETE
  Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2000 Data size: 368000 Basic stats: 
COMPLETE Column stats: COMPLETE
Reduce Output Operator
  key expressions: _col0 (type: string)
  sort order: +
  Map-reduce partition columns: _col0 (type: string)
  Statistics: Num rows: 2000 Data size: 368000 Basic stats: 
COMPLETE Column stats: COMPLETE
Execution mode: llap
LLAP IO: no inputs
Map 4 
Map Operator Tree:
TableScan
  alias: srcpart_date
  filterExpr: ((date = '2008-04-08') and ds is not null) (type: 
boolean)
  Statistics: Num rows: 2 Data size: 736 Basic stats: COMPLETE 
Column stats: NONE
  Filter Operator
predicate: ((date = '2008-04-08') and ds is not null) 
(type: boolean)
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: ds (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
Group By Operator
  keys: _col0 (type: string)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
  Dynamic Partitioning Event Operator
Target column: ds (string)
Target Input: srcpart
Partition key expr: ds
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
Target Vertex: Map 1
Execution mode: llap
LLAP IO: no inputs
Reducer 2 
Execution mode: llap
Reduce Operator Tree:
  Merge Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col0 (type: string)
  1 _col0 (type: string)
Statistics: Num rows: 2200 Data size: 404800 Basic stats: 
COMPLETE Column stats: NONE
Group By Operator
  aggregations: count()
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
  Reduce Output Operator
sort order: 
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
value expressions: _col0 

[jira] [Commented] (HIVE-14285) Explain outputs: map-entry ordering of non-primitive objects.

2018-02-11 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360203#comment-16360203
 ] 

liyunzhang commented on HIVE-14285:
---

[~kgyrtkirk]: I want to ask whether the vertices in one stage (like Map 1 and 
Map 4 in Stage-1) are executed in parallel, even though Map 4 comes after 
Map 1 in the explain output.
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Tez
 A masked pattern was here 
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
 A masked pattern was here 
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: srcpart
  filterExpr: ds is not null (type: boolean)
  Statistics: Num rows: 2000 Data size: 389248 Basic stats: 
COMPLETE Column stats: COMPLETE
  Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2000 Data size: 368000 Basic stats: 
COMPLETE Column stats: COMPLETE
Reduce Output Operator
  key expressions: _col0 (type: string)
  sort order: +
  Map-reduce partition columns: _col0 (type: string)
  Statistics: Num rows: 2000 Data size: 368000 Basic stats: 
COMPLETE Column stats: COMPLETE
Execution mode: llap
LLAP IO: no inputs
Map 4 
Map Operator Tree:
TableScan
  alias: srcpart_date
  filterExpr: ((date = '2008-04-08') and ds is not null) (type: 
boolean)
  Statistics: Num rows: 2 Data size: 736 Basic stats: COMPLETE 
Column stats: NONE
  Filter Operator
predicate: ((date = '2008-04-08') and ds is not null) 
(type: boolean)
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: ds (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
Group By Operator
  keys: _col0 (type: string)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
  Dynamic Partitioning Event Operator
Target column: ds (string)
Target Input: srcpart
Partition key expr: ds
Statistics: Num rows: 2 Data size: 736 Basic stats: 
COMPLETE Column stats: NONE
Target Vertex: Map 1
Execution mode: llap
LLAP IO: no inputs
Reducer 2 
Execution mode: llap
Reduce Operator Tree:
  Merge Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col0 (type: string)
  1 _col0 (type: string)
Statistics: Num rows: 2200 Data size: 404800 Basic stats: 
COMPLETE Column stats: NONE
Group By Operator
  aggregations: count()
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
  Reduce Output Operator
sort order: 
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 3 
Execution mode: llap
Reduce Operator Tree:
  Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
File Output Operator
  compressed: false

[jira] [Commented] (HIVE-14285) Explain outputs: map-entry ordering of non-primitive objects.

2018-02-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359784#comment-16359784
 ] 

liyunzhang commented on HIVE-14285:
---

[~kgyrtkirk]: I found a problem with {{ExplainTask#getBasictypeKeyedMap}}. 
It converts the original input (a Map) to a TreeMap, which orders entries by 
key. For example, the original map
{code}
"Map 6" -> 
"Reducer 7" -> 
"Reducer 9" -> 
"Map 10" -> 
"Reducer 11" -> 
{code}
After this function, it will be 
{code}
"Map 10" -> 
"Map 6" -> 
"Reducer 11" -> 
"Reducer 7" -> 
"Reducer 9" -> 
{code}

{{Map 10}} comes before {{Map 6}} because the string "Map 10" is smaller than 
"Map 6". But {{Map 6}} is actually executed before {{Map 10}}, which may be 
confusing in the explain output. So I want to ask whether we care that the 
order in the explain output differs from the execution order. If we don't 
care, it is fine as it is now; if the explain order must match the execution 
order, we can refactor {{ExplainTask#getBasictypeKeyedMap}}.
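
As a self-contained illustration (plain Java, not the Hive code itself): 
natural String ordering compares "Map 10" and "Map 6" character by character, 
so '1' < '6' puts {{Map 10}} first; a comparator over the numeric suffix (a 
purely hypothetical alternative) restores the numeric order.
{code:java}
import java.util.Comparator;
import java.util.TreeMap;

public class VertexOrderSketch {
  public static void main(String[] args) {
    // Natural String ordering: '1' < '6', so "Map 10" sorts before "Map 6".
    TreeMap<String, String> natural = new TreeMap<>();
    natural.put("Map 6", "...");
    natural.put("Map 10", "...");
    System.out.println(natural.keySet()); // [Map 10, Map 6]

    // Hypothetical alternative: compare vertex names by their numeric suffix.
    Comparator<String> bySuffix =
        Comparator.comparingInt((String s) -> Integer.parseInt(s.replaceAll("\\D+", "")));
    TreeMap<String, String> numeric = new TreeMap<>(bySuffix);
    numeric.put("Map 6", "...");
    numeric.put("Map 10", "...");
    System.out.println(numeric.keySet()); // [Map 6, Map 10]
  }
}
{code}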

>  Explain outputs: map-entry ordering of non-primitive objects. 
> ---
>
> Key: HIVE-14285
> URL: https://issues.apache.org/jira/browse/HIVE-14285
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: HIVE-14285.1.patch
>
>
> In HIVE-12244 I've left behind some ugly backward compatible getters with 
> {{@Explain}} decorations to keep the qtests from breaking.
> There were heavy explain plan changes when I used {{Path}} objects as keys in 
> {{@Explain}} marked methods.
> I've looked into the causes of this:
>  * there is a {{TreeSet}} in there to keep all the keys in order.
>  * but: {{org.apache.hadoop.fs.Path}} uses a different sort order (inherited 
> from {{java.net.URI}}): it sorts the paths using the 
> priorities [schema, schemeSpecificPart, host, path, query, fragment].
>   Considering that the output is an explain result (possibly read by humans), 
> I don't think this sophisticated sort order can be useful.
> {{ExplainTask#outputMap}} always calls toString() on the keys before using 
> them...so the most painless solution would be to change all the keys inside 
> the treeset to simple strings (in case it's not a primitive already); this 
> would restore the original behaviour for me.
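
As a minimal sketch of that proposal (generic code over a plain key map; this 
is an assumption-laden illustration, not the actual {{ExplainTask}} change):
{code:java}
import java.util.Map;
import java.util.TreeMap;

public class StringKeyedSketch {
  /** Copies a map into a TreeMap keyed by toString(), so Path et al. sort as plain strings. */
  static TreeMap<String, Object> toStringKeyed(Map<?, ?> input) {
    TreeMap<String, Object> out = new TreeMap<>();
    for (Map.Entry<?, ?> entry : input.entrySet()) {
      out.put(String.valueOf(entry.getKey()), entry.getValue());
    }
    return out;
  }
}
{code}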





[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-07 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356365#comment-16356365
 ] 

liyunzhang commented on HIVE-18340:
---

[~stakiar]: {quote}
Hive-on-Tez has an implementation of DynamicValueRegistry that uses some 
special Tez APIs such as ProcessorContext#waitForAllInputsReady; how are we 
simulating this in HoS?
{quote}
{{ProcessorContext#waitForAllInputsReady}} is called by 
{{org.apache.hadoop.hive.ql.exec.tez.DynamicValueRegistryTez#init}} to read the 
runtime filter info. For HoS, I guess [~Jk_self] will read the info from HDFS, 
similar to Spark DPP.

If my understanding is wrong, [~stakiar], [~Jk_Self], please tell me.
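
For illustration, a minimal sketch of such an HDFS-based approach (all names 
here are hypothetical; this is not the HIVE-18340 code): an upstream task 
writes the serialized min-max/bloom filter to a well-known directory, and the 
consumer reads the raw payload back.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRuntimeFilterReader {
  /** Returns the raw bytes of the first filter file found under dir, or null if none. */
  public static byte[] readFirstFilter(Configuration conf, Path dir) throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(dir)) {
      try (FSDataInputStream in = fs.open(status.getPath())) {
        byte[] buf = new byte[(int) status.getLen()];
        in.readFully(buf); // deserialize into a min-max or bloom filter downstream
        return buf;
      }
    }
    return null; // no filter produced yet
  }
}
{code}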

> Dynamic Min-Max/BloomFilter runtime-filtering in HoS
> 
>
> Key: HIVE-18340
> URL: https://issues.apache.org/jira/browse/HIVE-18340
> Project: Hive
>  Issue Type: New Feature
>  Components: Spark
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Attachments: HIVE-18340.1.patch
>
>
> Tez implemented Dynamic Min-Max/BloomFilter runtime-filtering in HIVE-15269 
> and we should implement the same in HOS.





[jira] [Comment Edited] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-02-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350077#comment-16350077
 ] 

liyunzhang edited comment on HIVE-18301 at 2/2/18 9:59 AM:
---

[~xuefuz]:
{quote}Instead of putting the input path in each row, like what the patch is 
proposing, could we send a serialized IOContext object as a special row 
whenever the content of the object changes?
{quote}
This is a good idea. IOContext is bound to a split, not to each record, so I 
changed to the following approach.
{code:java}
 inputRDD1          inputRDD2
     | CopyFunction     | CopyFunction
 CopyRDD1           CopyRDD2
     |                  |
   MT_11              MT_12
     |                  |
    RT_1               RT_2
        \              /
            Union
{code}
In CopyRDD1, I store the IOContext as the tuple._1() of the result only when 
IOContext#getInputPath changes, and store null otherwise; this reduces the 
data-size increase of the solution. In MT_12, the IOContext is initialized 
when IOContext#getInputPath is null; once IOContext#getInputPath has a value, 
we need not initialize it again in the same thread. The patch is provided in 
[^HIVE-18301.3.patch]. I assume the file boundaries of the original RDD are 
the same as the file boundaries of the cached RDD; the solution only works 
under that assumption. Although I have not read the source code of the RDD 
cache, the solution passes in my test environment.
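
To make the idea concrete, here is a minimal, self-contained sketch (the class 
and field names are hypothetical, not the ones in HIVE-18301.3.patch): the 
copy step attaches the IOContext to tuple._1() only when the input path 
changes, and leaves it null otherwise.
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CopyFunctionSketch {

  /** Hypothetical stand-in for the relevant part of Hive's IOContext. */
  static final class SimpleIOContext {
    final String inputPath;
    SimpleIOContext(String inputPath) { this.inputPath = inputPath; }
  }

  /** Result tuple: _1 carries the IOContext only on a path change, else null. */
  static final class Tuple {
    final SimpleIOContext _1; // null unless the input path just changed
    final String _2;
    Tuple(SimpleIOContext _1, String _2) { this._1 = _1; this._2 = _2; }
  }

  /** Records arrive as [inputPath, value] pairs; emit the context once per path change. */
  static List<Tuple> copy(Iterator<String[]> records) {
    List<Tuple> out = new ArrayList<>();
    String lastPath = null;
    while (records.hasNext()) {
      String[] rec = records.next();
      String path = rec[0];
      // Attach the context only when the path changes; null keeps the cached RDD small.
      SimpleIOContext ctx = path.equals(lastPath) ? null : new SimpleIOContext(path);
      out.add(new Tuple(ctx, rec[1]));
      lastPath = path;
    }
    return out;
  }
}
{code}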


was (Author: kellyzly):
[~xuefuz]:
{quote}Instead of putting the input path in each row, like what the patch is 
proposing, could we send a serialized IOContext object as a special row 
whenever the content of the object changes?
{quote}
This is a good idea. IOContext is bound to a split, not to each record, so I 
changed to the following approach.
{code:java}
 inputRDD1          inputRDD2
     | CopyFunction     | CopyFunction
 CopyRDD1           CopyRDD2
     |                  |
   MT_11              MT_12
     |                  |
    RT_1               RT_2
        \              /
            Union
{code}
In CopyRDD1, I store the IOContext as the tuple._1() of the result only when 
IOContext#getInputPath changes, and store null otherwise; this reduces the 
data-size increase of the solution. In MT_12, the IOContext is initialized 
when IOContext#getInputPath is null; once IOContext#getInputPath has a value, 
we need not initialize it again in the same thread. Any suggestion?


[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-02-02 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Attachment: HIVE-18301.3.patch

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.1.patch, HIVE-18301.2.patch, 
> HIVE-18301.3.patch, HIVE-18301.patch
>
>
> Previously an IOContext problem was found in MapTran when the Spark RDD 
> cache was enabled (HIVE-8920), so we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when the RDD cache is enabled.





[jira] [Comment Edited] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-02-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350077#comment-16350077
 ] 

liyunzhang edited comment on HIVE-18301 at 2/2/18 9:54 AM:
---

[~xuefuz]:
{quote}Instead of putting the input path in each row, like what the patch is 
proposing, could we send a serialized IOContext object as a special row 
whenever the content of the object changes?
{quote}
This is a good idea. IOContext is bound to a split, not to each record, so I 
changed to the following approach.
{code:java}
 inputRDD1          inputRDD2
     | CopyFunction     | CopyFunction
 CopyRDD1           CopyRDD2
     |                  |
   MT_11              MT_12
     |                  |
    RT_1               RT_2
        \              /
            Union
{code}
In CopyRDD1, I store the IOContext as the tuple._1() of the result only when 
IOContext#getInputPath changes, and store null otherwise; this reduces the 
data-size increase of the solution. In MT_12, the IOContext is initialized 
when IOContext#getInputPath is null; once IOContext#getInputPath has a value, 
we need not initialize it again in the same thread. Any suggestion?


was (Author: kellyzly):
[~xuefuz]:
{quote}
Instead of putting the input path in each row, like what the patch is 
proposing, could we send a serialized IOContext object as a special row 
whenever the content of the object changes?
{quote}
This is a good idea. IOContext is bound to a split, not to each record, so I 
changed to the following approach.
{code}
 inputRDD1          inputRDD2
     | CopyFunction     | CopyFunction
 CopyRDD1           CopyRDD2
     |                  |
   MT_11              MT_12
     |                  |
    RT_1               RT_2
        \              /
            Union
{code}
In CopyRDD1, I store the IOContext as the tuple._1() of the result only when 
IOContext#getInputPath changes, and store null otherwise; this reduces the 
data-size increase of the solution. In MT_12, the IOContext is initialized 
when IOContext#getInputPath is null; once IOContext#getInputPath has a value, 
we need not initialize it again in the same thread.


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-02-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350077#comment-16350077
 ] 

liyunzhang commented on HIVE-18301:
---

[~xuefuz]:
{quote}
Instead of putting the input path in each row, like what the patch is 
proposing, could we send a serialized IOContext object as a special row 
whenever the content of the object changes?
{quote}
This is a good idea. IOContext is bound to a split, not to each record, so I 
changed to the following approach.
{code}
 inputRDD1          inputRDD2
     | CopyFunction     | CopyFunction
 CopyRDD1           CopyRDD2
     |                  |
   MT_11              MT_12
     |                  |
    RT_1               RT_2
        \              /
            Union
{code}
In CopyRDD1, I store the IOContext as the tuple._1() of the result only when 
IOContext#getInputPath changes, and store null otherwise; this reduces the 
data-size increase of the solution. In MT_12, the IOContext is initialized 
when IOContext#getInputPath is null; once IOContext#getInputPath has a value, 
we need not initialize it again in the same thread.



[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-31 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Attachment: HIVE-18301.2.patch



[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-31 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346418#comment-16346418
 ] 

liyunzhang commented on HIVE-18301:
---

[~lirui]: thanks for the question.
{quote}
The IOContext has quite a few other fields. I wonder whether they are 
available if the RDD is cached.
{quote}
Yes, the other IOContext fields are used in HiveContextAwareRecordReader and 
other places. We can skip fields that are only used in 
HiveContextAwareRecordReader, like IOContext#nextBlockStart and 
IOContext#isBlockPointer. Others, like IOContext#isBinarySearching, are also 
used in FilterOperator#process, so we cannot skip caching them.



[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-26 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340717#comment-16340717
 ] 

liyunzhang commented on HIVE-18301:
---

On the concern that this may make the RDD bigger:

Originally, tuple._1() is a 
[CombineHiveKey|https://github.com/apache/hive/blob/master/shims/common/src/main/java/org/apache/hadoop/hive/shims/CombineHiveKey.java]
 which stores an offset (a LongWritable); in the current solution, tuple._1() 
is the input path (an 
[org.apache.hadoop.io.Text|https://github.com/kellyzly/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java]),
 whose size depends on the length of the actual input file path.
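
As a rough, self-contained illustration of the size difference (the path below 
is made up): a LongWritable offset serializes to a fixed 8 bytes, while a Text 
key grows with the path string.
{code:java}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class KeySizeSketch {
  public static void main(String[] args) {
    LongWritable offset = new LongWritable(123456L); // serialized payload: fixed 8 bytes
    Text path = new Text("hdfs://nn:8020/warehouse/db.db/tbl/ds=2008-04-08/000093_0"); // made-up path
    System.out.println("offset payload: 8 bytes (fixed)");
    System.out.println("path payload: " + path.getLength() + " bytes (grows with the path)");
  }
}
{code}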



[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-26 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Attachment: (was: HIVE-18301.1.patch)



[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-26 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Status: Patch Available  (was: Open)

Attached HIVE-18301.1.patch to trigger Hive QA.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.1.patch, HIVE-18301.patch
>
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-26 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Attachment: HIVE-18301.1.patch

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.1.patch, HIVE-18301.patch
>
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-18378) Explain plan should show if a Map/Reduce Work is being cached

2018-01-25 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340647#comment-16340647
 ] 

liyunzhang edited comment on HIVE-18378 at 1/26/18 7:21 AM:


[~stakiar]: can you assign this jira to me? I guess this feature is needed by 
HIVE-17486, and once HIVE-17486 is resolved, it would be better to show whether 
the table is cached or not in explain.


was (Author: kellyzly):
[~stakiar]: can you assign this jira to me? I guess this feature is needed by 
HIVE-17486, and once HIVE-17486 is resolved, it would be better to show whether 
the work is cached in explain.

> Explain plan should show if a Map/Reduce Work is being cached
> -
>
> Key: HIVE-18378
> URL: https://issues.apache.org/jira/browse/HIVE-18378
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
>Priority: Major
>
> It would be nice if the explain plan showed what {{MapWork}} / {{ReduceWork}} 
> objects are being cached by Spark.
> The {{CombineEquivalentWorkResolver}} is the only code that triggers Spark 
> caching, so we should be able to modify it so that it displays whether a work 
> object will be cached or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-18378) Explain plan should show if a Map/Reduce Work is being cached

2018-01-25 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang reassigned HIVE-18378:
-

Assignee: liyunzhang

> Explain plan should show if a Map/Reduce Work is being cached
> -
>
> Key: HIVE-18378
> URL: https://issues.apache.org/jira/browse/HIVE-18378
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
>Priority: Major
>
> It would be nice if the explain plan showed what {{MapWork}} / {{ReduceWork}} 
> objects are being cached by Spark.
> The {{CombineEquivalentWorkResolver}} is the only code that triggers Spark 
> caching, so we should be able to modify it so that it displays whether a work 
> object will be cached or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18378) Explain plan should show if a Map/Reduce Work is being cached

2018-01-25 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340647#comment-16340647
 ] 

liyunzhang commented on HIVE-18378:
---

[~stakiar]: can you assign this jira to me? I guess this feature is needed by 
HIVE-17486, and once HIVE-17486 is resolved, it would be better to show whether 
the work is cached in explain.

> Explain plan should show if a Map/Reduce Work is being cached
> -
>
> Key: HIVE-18378
> URL: https://issues.apache.org/jira/browse/HIVE-18378
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
>Priority: Major
>
> It would be nice if the explain plan showed what {{MapWork}} / {{ReduceWork}} 
> objects are being cached by Spark.
> The {{CombineEquivalentWorkResolver}} is the only code that triggers Spark 
> caching, so we should be able to modify it so that it displays whether a work 
> object will be cached or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-24 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338586#comment-16338586
 ] 

liyunzhang edited comment on HIVE-18301 at 1/25/18 4:05 AM:


[~lirui]:
{quote}does the proposed solution mean we need to cache the input path for each 
record of the table?
{quote}
Yes.

{quote}I wonder whether we can reuse the Text for same input paths. Besides, 
it's not efficient to check the job conf each time we process a row. You can 
just check it once and remember the value.
{quote}
OK, I will reuse it to reduce some computation.
[~lirui], [~xuefuz], [~csun], [~Ferd]:

My concerns about this solution are:
1. It may make the RDD bigger (I have not yet compared the sizes of the 
original tuple._1() and the inputPath; I will do that later).
2. I modify the {{HadoopRDD->CopyRDD->MT_11}} chain: CopyRDD no longer stores 
only the original data, it stores the inputPath together with the original 
data. That may be confusing.

Do you think this idea is OK to commit? If not, I may need to try another 
solution, or can you suggest a better one? Thanks!
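As a side note on the reviewer's two suggestions above, here is a minimal 
sketch (a hypothetical helper, not the attached patch; the config name is the 
one used elsewhere in this thread) of checking the conf once and reusing the 
Text while the input path is unchanged:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

// Minimal sketch: read the conf flag once, and reuse one Text instance while
// the input path stays the same, instead of checking/allocating per row.
class PathTagger {
  private final boolean enabled;   // conf checked once, in the constructor
  private String lastPath;         // last input path seen by this task
  private Text lastPathText;       // Text reused while the path is unchanged

  PathTagger(Configuration conf) {
    this.enabled = conf.getBoolean("hive.spark.optimize.shared.work", false);
  }

  /** Returns a Text for the current input path, reusing the previous instance
   *  when the path has not changed; null when the feature is disabled. */
  Text tag(String inputPath) {
    if (!enabled) {
      return null;
    }
    if (!inputPath.equals(lastPath)) {
      lastPath = inputPath;
      lastPathText = new Text(inputPath);
    }
    return lastPathText;
  }
}
{code}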


was (Author: kellyzly):
[~lirui]: {quote}
does the proposed solution mean we need to cache the input path for each record 
of the table?
{quote}
Yes.

{quote}
I wonder whether we can reuse the Text for same input paths. Besides, it's not 
efficient to check the job conf each time we process a row. You can just check 
it once and remember the value.
{quote}
OK, I will reuse it to reduce some computation.
[~lirui], [~xuefuz], [~csun]:

My concerns about this solution are:
1. It may make the RDD bigger (I have not yet compared the sizes of the 
original tuple._1() and the inputPath; I will do that later).
2. I modify the {{HadoopRDD->CopyRDD->MT_11}} chain: CopyRDD no longer stores 
only the original data, it stores the inputPath together with the original 
data. That may be confusing.

Do you think this idea is OK to commit? If not, I may need to try another 
solution, or can you suggest a better one? Thanks!

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.patch
>
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when RDD cache is enabled.

[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-24 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338586#comment-16338586
 ] 

liyunzhang commented on HIVE-18301:
---

[~lirui]: {quote}
does the proposed solution mean we need to cache the input path for each record 
of the table?
{quote}
Yes.

{quote}
I wonder whether we can reuse the Text for same input paths. Besides, it's not 
efficient to check the job conf each time we process a row. You can just check 
it once and remember the value.
{quote}
OK, I will reuse it to reduce some computation.
[~lirui], [~xuefuz], [~csun]:

My concerns about this solution are:
1. It may make the RDD bigger (I have not yet compared the sizes of the 
original tuple._1() and the inputPath; I will do that later).
2. I modify the {{HadoopRDD->CopyRDD->MT_11}} chain: CopyRDD no longer stores 
only the original data, it stores the inputPath together with the original 
data. That may be confusing.

Do you think this idea is OK to commit? If not, I may need to try another 
solution, or can you suggest a better one? Thanks!

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.patch
>
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-23 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336946#comment-16336946
 ] 

liyunzhang edited comment on HIVE-18301 at 1/24/18 5:47 AM:


HIVE-18301.patch provides one solution to transfer the 
{{IOContext::inputPath}}:
{code:java}
  inputRDD1            inputRDD2
      | CopyFunction       | CopyFunction
  CopyRDD1             CopyRDD2
      |                    |
    MT_11                MT_12
      |                    |
    RT_1                 RT_2
        \                /
             Union
{code}
MT_11 will call the following stack to initialize IOContext::inputPath:
{code:java}
 CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}
inputRDD1 and inputRDD2 are RDDs of the same table, so CopyRDD1 and CopyRDD2 
are the same RDD if RDD cache is enabled. In that case MT_12 will not call 
CombineHiveRecordReader#init to initialize {{IOContext::inputPath}}, but 
{{MapOperator#process(Writable value)}} still needs this value. IOContext is 
bound to a single thread, so its value is different in different threads: 
{{inputRDD1->CopyRDD1->MT_11->RT_1}} and {{inputRDD2->CopyRDD2->MT_12->RT_2}} 
are executed in different threads, so IOContext cannot be shared between these 
two threads.
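To make the thread binding concrete, here is a toy illustration (plain 
java.lang.ThreadLocal, not Hive's IOContextMap) of why a value set in one task 
thread is invisible in another:
{code:java}
public class ThreadBoundDemo {
  // Hive keeps one IOContext per thread; a plain ThreadLocal shows the
  // consequence for two task threads.
  private static final ThreadLocal<String> inputPath = new ThreadLocal<>();

  public static void main(String[] args) throws InterruptedException {
    Thread t1 = new Thread(() -> {
      inputPath.set("/warehouse/t/part-00000"); // set by the record reader
      System.out.println("t1 sees " + inputPath.get());
    });
    // t2 never ran a record reader, so its thread-bound value is null;
    // this is exactly the cached-record situation described above.
    Thread t2 = new Thread(() ->
        System.out.println("t2 sees " + inputPath.get()));
    t1.start(); t1.join();
    t2.start(); t2.join();
  }
}
{code}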

For this issue, I propose the following solution:
We save the inputPath in CopyRDD1 when {{inputRDD1->CopyRDD1->MT_11->RT_1}} is 
executed. CopyRDD2 gets the cached value and inputPath from CopyRDD1, which is 
stored in the Spark cache manager. We then reinitialize {{IOContext::inputPath}} 
in {{MapOperator#process(Writable value)}} in MT_12.
*Where to setInputPath?*
In MapInput#CopyFunction#call, save the inputPath in the first element of the 
returned tuple:
{code:java}
 public MapInput(SparkPlan sparkPlan,
     JavaPairRDD<WritableComparable, Writable> hadoopRDD) {
   this(sparkPlan, hadoopRDD, false);
@@ -79,10 +83,19 @@ public void setToCache(boolean toCache) {
 call(Tuple2<WritableComparable, Writable> tuple) throws Exception {
   if (conf == null) {
     conf = new Configuration();
+    conf.set("hive.execution.engine", "spark");
   }
-
-  return new Tuple2<WritableComparable, Writable>(tuple._1(),
-      WritableUtils.clone(tuple._2(), conf));
+  //             CopyFunction      MapFunction
+  //  HADOOPRDD ------------> RDD1 -----------> RDD2
+  // These transformations are in one stage and executed by one Spark task
+  // (thread), so IOContextMap.get(conf).getInputPath() will not be null here.
+  String inputPath = IOContextMap.get(conf).getInputPath().toString();
+  Text inputPathText = new Text(inputPath);
+  // Save the inputPath in the first element of the returned tuple; tuple._1()
+  // was not used in SparkMapRecordHandler#processRow before, so we can
+  // replace tuple._1() with inputPathText.
+  return new Tuple2<WritableComparable, Writable>(inputPathText,
+      WritableUtils.clone(tuple._2(), conf));
 }

   }
{code}
*Where to getInputPath?*
In SparkMapRecordHandler#processRow:
{code:java}
public void processRow(Object key, Object value) throws IOException {

+  if (HiveConf.getBoolVar(jc,
+      HiveConf.ConfVars.HIVE_SPARK_SHARED_WORK_OPTIMIZATION)) {
+    Path inputPath = IOContextMap.get(jc).getInputPath();
+    // When inputPath is null, the record comes from the cache.
+    if (inputPath == null) {
+      Text pathText = (Text) key;
+      IOContextMap.get(jc).setInputPath(new Path(pathText.toString()));
+    }
+  }

{code}
[~lirui], [~xuefuz], [~stakiar], [~csun], please give me your suggestions, thanks!


was (Author: kellyzly):
HIVE-18301.patch provides one solution to transfer the 
{{IOContext::inputPath}}:
{code}
  inputRDD1            inputRDD2
      | CopyFunction       | CopyFunction
  CopyRDD1             CopyRDD2
      |                    |
    MT_11                MT_12
      |                    |
    RT_1                 RT_2
        \                /
             Union
{code}
MT_11 will call the following stack to initialize IOContext::inputPath:
{code}
 CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}
inputRDD1 and inputRDD2 are RDDs of the same table, so CopyRDD1 and CopyRDD2 
are the same RDD if RDD cache is enabled. In that case MT_12 will not call 
CombineHiveRecordReader#init to initialize {{IOContext::inputPath}}, but 
{{MapOperator#process(Writable value)}} still needs this value. IOContext is 
bound to a single thread, so its value is different in different threads: 
{{inputRDD1->CopyRDD1->MT_11->RT_1}} and {{inputRDD2->CopyRDD2->MT_12->RT_2}} 
are executed in different threads, so IOContext cannot be shared between these 
two threads.

For this issue, I propose the following solution:
We save the inputPath in CopyRDD1 

[jira] [Comment Edited] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-23 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336946#comment-16336946
 ] 

liyunzhang edited comment on HIVE-18301 at 1/24/18 5:39 AM:


HIVE-18301.patch provides one solution to transfer the 
{{IOContext::inputPath}}:
{code}
  inputRDD1            inputRDD2
      | CopyFunction       | CopyFunction
  CopyRDD1             CopyRDD2
      |                    |
    MT_11                MT_12
      |                    |
    RT_1                 RT_2
        \                /
             Union
{code}
MT_11 will call the following stack to initialize IOContext::inputPath:
{code}
 CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}
inputRDD1 and inputRDD2 are RDDs of the same table, so CopyRDD1 and CopyRDD2 
are the same RDD if RDD cache is enabled. In that case MT_12 will not call 
CombineHiveRecordReader#init to initialize {{IOContext::inputPath}}, but 
{{MapOperator#process(Writable value)}} still needs this value. IOContext is 
bound to a single thread, so its value is different in different threads: 
{{inputRDD1->CopyRDD1->MT_11->RT_1}} and {{inputRDD2->CopyRDD2->MT_12->RT_2}} 
are executed in different threads, so IOContext cannot be shared between these 
two threads.

For this issue, I propose the following solution:
We save the inputPath in CopyRDD1 when {{inputRDD1->CopyRDD1->MT_11->RT_1}} is 
executed. CopyRDD2 gets the cached value and inputPath from CopyRDD1, which is 
stored in the Spark cache manager. We then reinitialize {{IOContext::inputPath}} 
in {{MapOperator#process(Writable value)}} in MT_12.
*Where to setInputPath?*
In MapInput#CopyFunction#call, save the inputPath in the first element of the 
returned tuple:
{code}
 public MapInput(SparkPlan sparkPlan,
     JavaPairRDD<WritableComparable, Writable> hadoopRDD) {
   this(sparkPlan, hadoopRDD, false);
@@ -79,10 +83,19 @@ public void setToCache(boolean toCache) {
 call(Tuple2<WritableComparable, Writable> tuple) throws Exception {
   if (conf == null) {
     conf = new Configuration();
+    conf.set("hive.execution.engine", "spark");
   }
-
-  return new Tuple2<WritableComparable, Writable>(tuple._1(),
-      WritableUtils.clone(tuple._2(), conf));
+  //             CopyFunction      MapFunction
+  //  HADOOPRDD ------------> RDD1 -----------> RDD2
+  // These transformations are in one stage and executed by one Spark task
+  // (thread), so IOContextMap.get(conf).getInputPath() will not be null here.
+  String inputPath = IOContextMap.get(conf).getInputPath().toString();
+  Text inputPathText = new Text(inputPath);
+  // Save the inputPath in the first element of the returned tuple.
+  return new Tuple2<WritableComparable, Writable>(inputPathText,
+      WritableUtils.clone(tuple._2(), conf));
 }

   }
{code}
*Where to getInputPath?*
In SparkMapRecordHandler#processRow:
{code}
public void processRow(Object key, Object value) throws IOException {

+  if (HiveConf.getBoolVar(jc,
+      HiveConf.ConfVars.HIVE_SPARK_SHARED_WORK_OPTIMIZATION)) {
+    Path inputPath = IOContextMap.get(jc).getInputPath();
+    // When inputPath is null, the record comes from the cache.
+    if (inputPath == null) {
+      Text pathText = (Text) key;
+      IOContextMap.get(jc).setInputPath(new Path(pathText.toString()));
+    }
+  }

{code}
[~lirui], [~xuefuz], [~stakiar], [~csun], please give me your suggestions, thanks!


was (Author: kellyzly):
HIVE-18301.patch provides one solution to transfer the 
{{IOContext::inputPath}}:
{code}
  inputRDD1            inputRDD2
      | CopyFunction       | CopyFunction
  CopyRDD1             CopyRDD2
      |                    |
    MT_11                MT_12
      |                    |
    RT_1                 RT_2
        \                /
             Union
{code}
MT_11 will call the following stack to initialize IOContext::inputPath:
{code}
 CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}
inputRDD1 and inputRDD2 are RDDs of the same table, so CopyRDD1 and CopyRDD2 
are the same RDD if RDD cache is enabled. In that case MT_12 will not call 
CombineHiveRecordReader#init to initialize {{IOContext::inputPath}}, but 
{{MapOperator#process(Writable value)}} still needs this value. IOContext is 
bound to a single thread, so its value is different in different threads: 
{{inputRDD1->CopyRDD1->MT_11->RT_1}} and {{inputRDD2->CopyRDD2->MT_12->RT_2}} 
are executed in different threads, so IOContext cannot be shared between these 
two threads.

For this issue, I propose the following solution:
We save the inputPath in CopyRDD1 when {{inputRDD1->CopyRDD1->MT_11->RT_1}} is 
executed. CopyRDD2 gets the cached value and inputPath from CopyRDD1, which is 

[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-23 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336946#comment-16336946
 ] 

liyunzhang commented on HIVE-18301:
---

HIVE-18301.patch provides one solution to transfer the 
{{IOContext::inputPath}}:
{code}
  inputRDD1            inputRDD2
      | CopyFunction       | CopyFunction
  CopyRDD1             CopyRDD2
      |                    |
    MT_11                MT_12
      |                    |
    RT_1                 RT_2
        \                /
             Union
{code}
MT_11 will call the following stack to initialize IOContext::inputPath:
{code}
 CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}
inputRDD1 and inputRDD2 are RDDs of the same table, so CopyRDD1 and CopyRDD2 
are the same RDD if RDD cache is enabled. In that case MT_12 will not call 
CombineHiveRecordReader#init to initialize {{IOContext::inputPath}}, but 
{{MapOperator#process(Writable value)}} still needs this value. IOContext is 
bound to a single thread, so its value is different in different threads: 
{{inputRDD1->CopyRDD1->MT_11->RT_1}} and {{inputRDD2->CopyRDD2->MT_12->RT_2}} 
are executed in different threads, so IOContext cannot be shared between these 
two threads.

For this issue, I propose the following solution:
We save the inputPath in CopyRDD1 when {{inputRDD1->CopyRDD1->MT_11->RT_1}} is 
executed. CopyRDD2 gets the cached value and inputPath from CopyRDD1, which is 
stored in the Spark cache manager. We then reinitialize {{IOContext::inputPath}} 
in {{MapOperator#process(Writable value)}} in MT_12.
*Where to setInputPath?*
In MapInput#CopyFunction#call, save the inputPath in the first element of the 
returned tuple:
{code}
 public MapInput(SparkPlan sparkPlan,
     JavaPairRDD<WritableComparable, Writable> hadoopRDD) {
   this(sparkPlan, hadoopRDD, false);
@@ -79,10 +83,19 @@ public void setToCache(boolean toCache) {
 call(Tuple2<WritableComparable, Writable> tuple) throws Exception {
   if (conf == null) {
     conf = new Configuration();
+    conf.set("hive.execution.engine", "spark");
   }
-
-  return new Tuple2<WritableComparable, Writable>(tuple._1(),
-      WritableUtils.clone(tuple._2(), conf));
+  //             CopyFunction      MapFunction
+  //  HADOOPRDD ------------> RDD1 -----------> RDD2
+  // These transformations are in one stage and executed by one Spark task
+  // (thread), so IOContextMap.get(conf).getInputPath() will not be null here.
+  String inputPath = IOContextMap.get(conf).getInputPath().toString();
+  Text inputPathText = new Text(inputPath);
+  // Save the inputPath in the first element of the returned tuple.
+  return new Tuple2<WritableComparable, Writable>(inputPathText,
+      WritableUtils.clone(tuple._2(), conf));
 }

   }
{code}
*Where to getInputPath?*
In SparkMapRecordHandler#processRow:
{code}
public void processRow(Object key, Object value) throws IOException {

+  if (HiveConf.getBoolVar(jc,
+      HiveConf.ConfVars.HIVE_SPARK_SHARED_WORK_OPTIMIZATION)) {
+    Path inputPath = IOContextMap.get(jc).getInputPath();
+    // When inputPath is null, the record comes from the cache.
+    if (inputPath == null) {
+      Text pathText = (Text) key;
+      IOContextMap.get(jc).setInputPath(new Path(pathText.toString()));
+    }
+  }

{code}
[~lirui], [~xuefuz], [~stakiar], [~csun], please give me your suggestions, thanks!

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.patch
>
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> 

[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-23 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Attachment: HIVE-18301.patch

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.patch
>
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-23 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336941#comment-16336941
 ] 

liyunzhang commented on HIVE-18301:
---

[~lirui]: {quote}
And we won't be able to tell the file boundaries because they're cached.
{quote}
Yes. IOContext.getInputPath() == null only means the record is cached; we 
cannot tell file boundaries from it. So IOContext.getInputPath is necessary.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-8436) Modify SparkWork to split works with multiple child works [Spark Branch]

2018-01-23 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336431#comment-16336431
 ] 

liyunzhang commented on HIVE-8436:
--

[~csun]: thanks for the reply.

{quote}
without the copying function, the RDD cache will cache *references*
{quote}

I have not found this in the Spark 
[documentation|https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence],
 i.e. that the Spark RDD cache stores references rather than values. If you 
have time, can you provide a link which explains it?
{code:java}

private static class CopyFunction implements
    PairFunction<Tuple2<WritableComparable, Writable>,
        WritableComparable, Writable> {

  private transient Configuration conf;

  @Override
  public Tuple2<WritableComparable, Writable>
      call(Tuple2<WritableComparable, Writable> tuple) throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }

    return new Tuple2<WritableComparable, Writable>(tuple._1(),
        WritableUtils.clone(tuple._2(), conf));
  }
}
{code}

{{WritableUtils.clone(tuple._2(), conf)}} is used to clone tuple._2() into a 
new variable. This means tuple._2() must be an instance of a class which can 
be cloned. For the Text type this is OK; for the ORC/Parquet formats it is 
not, because of 
[HIVE-18289|https://issues.apache.org/jira/browse/HIVE-18289]: OrcStruct 
doesn't have an empty constructor when ReflectionUtils.newInstance is called, 
and there is a similar reason for the Parquet format. Is there any way to 
solve it?
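For reference, here is a simplified sketch of what {{WritableUtils.clone}} 
does (adapted from Hadoop, not Hive code), which shows why a missing no-arg 
constructor breaks it:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public final class CloneSketch {
  // Roughly what org.apache.hadoop.io.WritableUtils.clone does.
  @SuppressWarnings("unchecked")
  public static <T extends Writable> T clone(T orig, Configuration conf) {
    try {
      // This is the call that fails with NoSuchMethodException when the
      // Writable (e.g. OrcStruct) has no no-arg constructor.
      T copy = ReflectionUtils.newInstance((Class<T>) orig.getClass(), conf);
      // Serializes orig and deserializes it into the fresh copy.
      ReflectionUtils.copy(conf, orig, copy);
      return copy;
    } catch (IOException e) {
      throw new RuntimeException("Error writing/reading clone buffer", e);
    }
  }
}
{code}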

> Modify SparkWork to split works with multiple child works [Spark Branch]
> 
>
> Key: HIVE-8436
> URL: https://issues.apache.org/jira/browse/HIVE-8436
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao Sun
>Priority: Major
> Fix For: 1.1.0
>
> Attachments: HIVE-8436.1-spark.patch, HIVE-8436.10-spark.patch, 
> HIVE-8436.11-spark.patch, HIVE-8436.2-spark.patch, HIVE-8436.3-spark.patch, 
> HIVE-8436.4-spark.patch, HIVE-8436.5-spark.patch, HIVE-8436.6-spark.patch, 
> HIVE-8436.7-spark.patch, HIVE-8436.8-spark.patch, HIVE-8436.9-spark.patch
>
>
> Based on the design doc, we need to split the operator tree of a work in 
> SparkWork if the work is connected to multiple child works. The splitting of 
> the operator tree is performed by cloning the original work and removing 
> unwanted branches in the operator tree. Please refer to the design doc for 
> details.
> This process should be done right before we generate SparkPlan. We should 
> have a utility method that takes the original SparkWork and returns a 
> modified SparkWork.
> This process should also keep the information about the original work and its 
> clones. Such information will be needed during SparkPlan generation 
> (HIVE-8437).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18289) Support Parquet/Orc format when enable rdd cache in Hive on Spark

2018-01-23 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18289:
--
Summary: Support Parquet/Orc format when enable rdd cache in Hive on Spark  
(was: Fix jar dependency when enable rdd cache in Hive on Spark)

> Support Parquet/Orc format when enable rdd cache in Hive on Spark
> -
>
> Key: HIVE-18289
> URL: https://issues.apache.org/jira/browse/HIVE-18289
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
>
> Running DS/query28 with HIVE-17486's 4th patch enabled, on 
> tpcds_bin_partitioned_orc_10, in either Spark local or yarn mode.
> Command:
> {code}
> set spark.local=yarn-client;
> echo 'use tpcds_bin_partitioned_orc_10;source query28.sql;'|hive --hiveconf 
> spark.app.name=query28.sql  --hiveconf hive.spark.optimize.shared.work=true 
> -i testbench.settings -i query28.sql.setting
> {code}
> The exception:
> {code}
> java.lang.RuntimeException: java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.io.orc.OrcStruct.<init>()
> 748678 at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:134) 
> ~[hadoop-common-2.7.3.jar:?]
> 748679 at 
> org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:217) 
> ~[hadoop-common-2.7.3.jar:?]
> 748680 at 
> org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.   0-SNAPSHOT]
> 748681 at 
> org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:72)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.   0-SNAPSHOT]
> 748682 at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
>  ~[spark-core_2.11-2.   0.0.jar:2.0.0]
> 748683 at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
>  ~[spark-core_2.11-2.   0.0.jar:2.0.0]
> 748684 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
> ~[scala-library-2.11.8.jar:?]
> 748685 at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>  ~[spark-core_2.11-2.0.0.jar:2.   0.0]
> 748686 at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>  ~[spark-core_2.11-2.0.0.   jar:2.0.0]
> 748687 at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>  ~[spark-core_2.11-2.0.0.   jar:2.0.0]
> 748688 at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748689 at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748690 at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748691 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748692 at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748693 at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748694 at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748695 at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748696 at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 
> ~[spark-core_2.11-2
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-8436) Modify SparkWork to split works with multiple child works [Spark Branch]

2018-01-21 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333870#comment-16333870
 ] 

liyunzhang commented on HIVE-8436:
--

[~csun]:

Can you spend some time to explain why we need to add 
[MapInput::CopyFunction|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java#L72]?

The input is a Tuple2<WritableComparable, Writable> and the output is a 
Tuple2<WritableComparable, Writable>, so why do we need the 
HadoopRDD->CopyFunction step?
{code:java}

private static class CopyFunction implements
    PairFunction<Tuple2<WritableComparable, Writable>,
        WritableComparable, Writable> {

  private transient Configuration conf;

  @Override
  public Tuple2<WritableComparable, Writable>
      call(Tuple2<WritableComparable, Writable> tuple) throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }

    return new Tuple2<WritableComparable, Writable>(tuple._1(),
        WritableUtils.clone(tuple._2(), conf));
  }
}
{code}
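For context, [~csun]'s earlier point that the RDD cache caches *references* 
can be illustrated with a toy sketch (plain Java, not Spark; it assumes the 
standard Hadoop behavior that a record reader reuses one Writable instance per 
split):
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;

public class ReuseDemo {
  public static void main(String[] args) {
    Text reused = new Text();            // record readers reuse one object
    List<Text> cachedRefs = new ArrayList<>();
    for (String row : new String[] {"row1", "row2"}) {
      reused.set(row);
      cachedRefs.add(reused);            // caching only the reference...
    }
    // ...so every "cached" record now shows the last row that was read,
    // which is why CopyFunction clones the value before caching.
    System.out.println(cachedRefs);      // prints [row2, row2]
  }
}
{code}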

> Modify SparkWork to split works with multiple child works [Spark Branch]
> 
>
> Key: HIVE-8436
> URL: https://issues.apache.org/jira/browse/HIVE-8436
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao Sun
>Priority: Major
> Fix For: 1.1.0
>
> Attachments: HIVE-8436.1-spark.patch, HIVE-8436.10-spark.patch, 
> HIVE-8436.11-spark.patch, HIVE-8436.2-spark.patch, HIVE-8436.3-spark.patch, 
> HIVE-8436.4-spark.patch, HIVE-8436.5-spark.patch, HIVE-8436.6-spark.patch, 
> HIVE-8436.7-spark.patch, HIVE-8436.8-spark.patch, HIVE-8436.9-spark.patch
>
>
> Based on the design doc, we need to split the operator tree of a work in 
> SparkWork if the work is connected to multiple child works. The splitting of 
> the operator tree is performed by cloning the original work and removing 
> unwanted branches in the operator tree. Please refer to the design doc for 
> details.
> This process should be done right before we generate SparkPlan. We should 
> have a utility method that takes the original SparkWork and returns a 
> modified SparkWork.
> This process should also keep the information about the original work and its 
> clones. Such information will be needed during SparkPlan generation 
> (HIVE-8437).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-17 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330190#comment-16330190
 ] 

liyunzhang commented on HIVE-18301:
---

[~lirui]:
{quote}My understanding is if the HadoopRDD is cached, the records are not 
produced by record reader and IOContext is not populated. Therefore the 
information in IOContext will be unavailable, e.g. the input path. This may 
cause problem because some operators need to take certain actions when input 
file changes – {{Operator::cleanUpInputFileChanged}}.
 So basically my point is we have to figure out the scenarios where IOContext 
is necessary. Then decide whether we should disable caching in such cases.
{quote}
Yes, if HadoopRDD is cached, it will not call:
{code:java}
CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}
Instead, it will use the cached result to call MapOperator#process(Writable 
value), so an NPE is thrown because at that time IOContext.getInputPath 
returns null. For now I have just modified the code of 
MapOperator#process(Writable value) like this: 
[link|https://github.com/kellyzly/hive/commit/e81b7df572e2c543095f55dd160b428c355da2fb]

Here are my questions:
1. When {{context.getIoCxt().getInputPath() == null}}, I think the record 
comes from the cache, not from CombineHiveRecordReader. We need not call 
MapOperator#cleanUpInputFileChanged, because it is only designed for one 
Mapper scanning multiple files (like CombineFileInputFormat) or multiple 
partitions: the inputPath changes in those situations, and 
{{cleanUpInputFileChanged}} must be called to reinitialize [some 
variables|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L532],
 but we need not consider reinitialization for a cached record. Is my 
understanding right? If so, is there any other way to judge whether a record 
is cached, other than {{context.getIoCxt().getInputPath() == null}}?
2. How do we initialize IOContext#getInputPath in the cache situation? We need 
this variable to reinitialize MapOperator::currentCtxs in 
MapOperator#initializeContexts:
{code:java}
  public void initializeContexts() {
    Path fpath = getExecContext().getCurrentInputPath();
    String nominalPath = getNominalPath(fpath);
    Map<Operator<? extends OperatorDesc>, MapOpCtx> contexts = opCtxMap.get(nominalPath);
    currentCtxs = contexts.values().toArray(new MapOpCtx[contexts.size()]);
  }
{code}
In this code we store a MapOpCtx for every MapOperator in opCtxMap (a 
Map<String, Map<Operator<? extends OperatorDesc>, MapOpCtx>>). For a table 
with partitions there will be multiple elements in opCtxMap (opCtxMap.keySet() 
is a set containing partition names). Currently I test on a table without 
partitions and can directly use opCtxMap.values().iterator().next() to 
initialize the 
[context|https://github.com/kellyzly/hive/blob/HIVE-17486.4/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L713],
 and it runs successfully in yarn mode. But I guess this is not right for a 
partitioned table.
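To illustrate that last worry, here is a toy sketch (made-up paths and plain 
strings instead of Hive's MapOpCtx) of why picking an arbitrary entry is only 
safe when opCtxMap has a single key:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class OpCtxMapDemo {
  public static void main(String[] args) {
    // One entry per nominal path; a partitioned table has several.
    Map<String, String> opCtxMap = new HashMap<>();
    opCtxMap.put("/warehouse/t/ds=2018-01-16", "ctx for ds=2018-01-16");
    opCtxMap.put("/warehouse/t/ds=2018-01-17", "ctx for ds=2018-01-17");

    // With a single key, any element is the right one. With several keys,
    // the record's input path is needed to select the matching context.
    String arbitrary = opCtxMap.values().iterator().next();
    String correct = opCtxMap.get("/warehouse/t/ds=2018-01-17");
    System.out.println(arbitrary.equals(correct)); // may print false
  }
}
{code}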

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
>
> An IOContext problem was previously found in MapTran when Spark RDD cache was 
> enabled (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> 

[jira] [Commented] (HIVE-17573) LLAP: JDK9 support fixes

2018-01-04 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312596#comment-16312596
 ] 

liyunzhang commented on HIVE-17573:
---

[~gopalv]: thanks for your reply and the tool.
bq. JDK9 seems to wake up the producer-consumer pair on the same NUMA zone (the 
IO elevator allocates, passes the array to the executor thread and executor 
passes it back instead of throwing it to GC deref).
If I don't add {{-XX:+UseNUMA}}, I guess the NUMA-handling optimization will 
not benefit the query, is that right? UseNUMA is disabled by default.
bq. the IO elevator allocates, passes the array to the executor thread and 
executor passes it back instead of throwing it to GC deref
I guess this will result in less GC.

From my test results, what I found is that GC time is lower on JDK9 than on 
JDK8 for Hive on Spark in long queries 
([link|https://docs.google.com/presentation/d/1cK9ZfUliAggH3NJzSvexTPwkXpbsM7Dm0o0kdmuFQUU/edit#slide=id.p]). 
Maybe this is because G1GC is the default garbage collector in JDK9 and the 
purpose of 
[G1GC|https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector.htm#JSGCT-GUID-0394E76A-1A8F-425E-A0D0-B48A3DC82B42] 
is less GC time.

> LLAP: JDK9 support fixes
> 
>
> Key: HIVE-17573
> URL: https://issues.apache.org/jira/browse/HIVE-17573
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Gopal V
>
> The perf diff between JDK8 -> JDK9 seems to be significant.  
> TPC-H Q6 on JDK8 takes 32s on a single node + 1 Tb scale warehouse. 
> TPC-H Q6 on JDK9 takes 19s on the same host + same data.
> The performance difference seems to come from better JIT and better NUMA 
> handling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17573) LLAP: JDK9 support fixes

2018-01-03 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310797#comment-16310797
 ] 

liyunzhang commented on HIVE-17573:
---

[~gopalv]:
It seems that on tpch_text_10 (10g), there is no big difference between JDK9 
and JDK8 for 
[TPC-H Q6|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpch/tpch_query6.sql]
 on my machine (1 node with CPU E5-2690 v2, 40 cores, 62g memory):

jdk8: 54.189s

jdk9: 67.86s

{quote}
The performance difference seems to come from better JIT and better NUMA 
handling.
{quote}

From which tool did you get the above conclusion? Have you tested it on a 
machine with NUMA architecture?

> LLAP: JDK9 support fixes
> 
>
> Key: HIVE-17573
> URL: https://issues.apache.org/jira/browse/HIVE-17573
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Gopal V
>
> The perf diff between JDK8 -> JDK9 seems to be significant.  
> TPC-H Q6 on JDK8 takes 32s on a single node + 1 Tb scale warehouse. 
> TPC-H Q6 on JDK9 takes 19s on the same host + same data.
> The performance difference seems to come from better JIT and better NUMA 
> handling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-03 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309361#comment-16309361
 ] 

liyunzhang commented on HIVE-18301:
---

Here is some update about the NPE.
In the normal case, when RDD cache is enabled, the stack trace of 
initIOContext, which initializes ExecMapperContext#setCurrentInputPath, is:
{code}
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.initIOContext(HiveContextAwareRecordReader.java:175)
 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.initIOContext(HiveContextAwareRecordReader.java:211)
 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:101)
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:257)
 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.(HadoopShimsSecure.java:217)
 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:346)
 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:712)
 org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:246)
 org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
 org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
 org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
 org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
 org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
 org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
 org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
 org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
 org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
 org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
 org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
 org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
 org.apache.spark.scheduler.Task.run(Task.scala:85)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 java.lang.Thread.run(Thread.java:745)
{code}

The stack trace of ExecMapperContext#getCurrentInputPath is
{code}
org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext.getCurrentInputPath(ExecMapperContext.java:113)
 
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:512)
 
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:543)
 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:136)
 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
 scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
{code}

[jira] [Issue Comment Deleted] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2018-01-02 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Comment: was deleted

(was: [~stakiar]:
the original purpose of changing M->R to M->M->R was to let 
CombineEquivalentWorkResolver combine identical Maps. For example:
logical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1] -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}  
physical plan
{code}  
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps. In the above 
case, Map2 will be removed because TS\[0\] is the same as TS\[1\].  

But when I finished the code, I found that there is no need to use this 
approach to combine TS\[0\] and TS\[1\]. {{MapInput}} is responsible for the TS, 
and I only need to generate the same MapInput for TS\[0\] and TS\[1\]. See 
HIVE-17486.5.patch for more detail.
)

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, HIVE-17486.5.patch, 
> explain.28.share.false, explain.28.share.true, scanshare.after.svg, 
> scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2018-01-02 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: HIVE-17486.5.patch

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, HIVE-17486.5.patch, 
> explain.28.share.false, explain.28.share.true, scanshare.after.svg, 
> scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2018-01-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308995#comment-16308995
 ] 

liyunzhang edited comment on HIVE-17486 at 1/3/18 1:40 AM:
---

[~stakiar]:
the original purpose of changing M->R to M->M->R was to let 
CombineEquivalentWorkResolver combine identical Maps. For example:
logical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1] -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}  
physical plan
{code}  
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps. In the above 
case, Map2 will be removed because TS\[0\] is the same as TS\[1\].  

But when I finished the code, I found that there is no need to use this 
approach to combine TS\[0\] and TS\[1\]. {{MapInput}} is responsible for the TS, 
and I only need to generate the same MapInput for TS\[0\] and TS\[1\]. See 
HIVE-17486.5.patch for more detail.



was (Author: kellyzly):
[~stakiar]:
the original purpose of changing M->R to M->M->R was to let 
CombineEquivalentWorkResolver combine identical Maps. For example:
logical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1] -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}  
physical plan
{code}  
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps. In the above 
case, Map2 will be removed because TS\[0\] is the same as TS\[1\].  

But when I finished the code, I found that there is no need to use this 
approach to combine TS\[0\] and TS\[1\]. {{MapInput}} is responsible for the TS, 
and I only need to generate the same MapInput for TS\[0\] and TS\[1\]. See 
HIVE-17486.5.patch for more detail.


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2018-01-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308994#comment-16308994
 ] 

liyunzhang commented on HIVE-17486:
---

[~stakiar]:
the original purpose of changing M->R to M->M->R was to let 
CombineEquivalentWorkResolver combine identical Maps. For example:
logical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1] -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}  
physical plan
{code}  
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps. In the above 
case, Map2 will be removed because TS\[0\] is the same as TS\[1\].  

But when I finished the code, I found that there is no need to use this 
approach to combine TS\[0\] and TS\[1\]. {{MapInput}} is responsible for the TS, 
and I only need to generate the same MapInput for TS\[0\] and TS\[1\]. See 
HIVE-17486.5.patch for more detail.
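
To illustrate the idea at the Spark level, here is a minimal, self-contained 
sketch (plain Spark API; the path and filters are made up, and this is not 
Hive's actual MapInput/SparkPlanGenerator code): when two identical table scans 
share one persisted RDD, the input is scanned only once and the second child 
work reads the cached partitions.

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SharedScanSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("shared-scan-sketch").setMaster("local[*]"));

    // One RDD standing in for the merged table scan (TS[0] == TS[1]).
    JavaRDD<String> scan = sc.textFile("hdfs:///tmp/t1")
                             .persist(StorageLevel.MEMORY_ONLY());

    // First child work: triggers the real scan and populates the cache.
    long a = scan.filter(row -> row.contains("x")).count();

    // Second child work: served from the cached partitions, no second scan.
    long b = scan.filter(row -> row.length() > 10).count();

    System.out.println(a + " " + b);
    sc.stop();
  }
}
{code}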


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2018-01-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308995#comment-16308995
 ] 

liyunzhang commented on HIVE-17486:
---

[~stakiar]:
the original purpose of changing M->R to M->M->R was to let 
CombineEquivalentWorkResolver combine identical Maps. For example:
logical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1] -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}  
physical plan
{code}  
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps. In the above 
case, Map2 will be removed because TS\[0\] is the same as TS\[1\].  

But when I finished the code, I found that there is no need to use this 
approach to combine TS\[0\] and TS\[1\]. {{MapInput}} is responsible for the TS, 
and I only need to generate the same MapInput for TS\[0\] and TS\[1\]. See 
HIVE-17486.5.patch for more detail.


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-01 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307619#comment-16307619
 ] 

liyunzhang commented on HIVE-18301:
---

Thanks for Rui's comment; I will spend some time investigating the NPE problem. 
Here I want to point out that in 
[HIVE-8920|https://issues.apache.org/jira/browse/HIVE-8920], the exception is 
{code}
Caused by: java.lang.IllegalStateException: Invalid input path 
hdfs://localhost:8020/user/hive/warehouse/dec2/dec.txt
at 
org.apache.hadoop.hive.ql.exec.MapOperator.getNominalPath(MapOperator.java:406)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:442)

{code}

not an NPE, so these are two different problems. 

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-27 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16304359#comment-16304359
 ] 

liyunzhang commented on HIVE-18301:
---

[~lirui]:
{quote}
My understanding is these information will be lost if the HadoopRDD is cached.
{quote}

Do you mean that HadoopRDD will not store the spark plan? If so: actually, Hive 
stores the spark plan in a file on HDFS and serializes/deserializes it 
from that file. See the code in 
hive/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#getBaseWork.  If 
not, please explain in more detail.
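
As a reference for that pattern, here is a hedged, generic sketch (plain Java 
serialization over the Hadoop FileSystem API; the class and method names are 
illustrative, and this is not Hive's actual Utilities#getBaseWork code, which 
uses its own plan serialization) of writing a plan to an HDFS file and reading 
it back:

{code}
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PlanFileSketch {
  // Driver side: serialize the plan object to a file on HDFS.
  static void writePlan(Configuration conf, Path planPath, Serializable plan) throws Exception {
    FileSystem fs = planPath.getFileSystem(conf);
    try (ObjectOutputStream out = new ObjectOutputStream(fs.create(planPath, true))) {
      out.writeObject(plan);
    }
  }

  // Task side: read the plan back from the same file.
  static Object readPlan(Configuration conf, Path planPath) throws Exception {
    FileSystem fs = planPath.getFileSystem(conf);
    try (ObjectInputStream in = new ObjectInputStream(fs.open(planPath))) {
      return in.readObject();
    }
  }
}
{code}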

My question here is: is there any other reason to disable MapInput#cache 
besides avoiding multi-insert cases where there is a union operator after 
{{from}}, such as
{code}

from (select * from dec union all select * from dec2) s
insert overwrite table dec3 select s.name, sum(s.value) group by s.name
insert overwrite table dec4 select s.name, s.value order by s.value;

{code}

If there is no other reason to disable MapInput#cache, I guess that for 
HIVE-17486 we can enable the MapInput cache, because HIVE-17486 merges scans of 
the same single table.  There are few cases like the above ( from (select A union B) ).

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-22 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301713#comment-16301713
 ] 

liyunzhang commented on HIVE-18301:
---

[~xuefuz],[~csun]: I read the jiras about the MapInput IOContext problem and 
enabling the MapInput rdd cache, and found that the problem only happens in the 
case of the IOContext problem with multiple MapWorks cloned for multi-insert 
\[Spark Branch\] that HIVE-8920 mentioned.
In HIVE-8920, I found that the failing case is like
{code}
from (select * from dec union all select * from dec2) s
insert overwrite table dec3 select s.name, sum(s.value) group by s.name
insert overwrite table dec4 select s.name, s.value order by s.value;
{code}
I indeed saw an exception like the following in my hive.log:
{code}
Caused by: java.lang.IllegalStateException: Invalid input path 
hdfs://localhost:8020/user/hive/warehouse/dec2/dec.txt
at 
org.apache.hadoop.hive.ql.exec.MapOperator.getNominalPath(MapOperator.java:406)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:442)
{code}

Here the problem happens in the case where the cached MapInput is the union 
result of dec and dec2. But when I modify the case to
{code}
from (select * from dec ) s
insert overwrite table dec3 select s.name, sum(s.value) group by s.name
insert overwrite table dec4 select s.name, s.value order by s.value;
{code}
there is no such exception in either local or yarn mode.

Does the problem only happen in such a complicated case (where the rdd cache is 
the union result of two tables)? If it only happens in such a case, why not 
disable the MapInput rdd cache only in that case? Is there any other reason to 
disable the MapInput rdd cache? Please spend some time to review this, as both 
of you have experience with it. Thanks!

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-21 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301027#comment-16301027
 ] 

liyunzhang commented on HIVE-18301:
---

[~xuefuz],[~lirui]: Is there any way to store the IOContext somewhere and then 
reinitialize it in the current code when the rdd cache is enabled?

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-21 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300999#comment-16300999
 ] 

liyunzhang commented on HIVE-18301:
---

The reason for this NPE is that 
[org.apache.hadoop.hive.ql.io.IOContextMap#sparkThreadLocal|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/IOContextMap.java#L54
 ] is a ThreadLocal variable, and the value of the same ThreadLocal variable may 
be different in different threads.
We set the input path by
{code}
 CombineHiveRecordReader#init
->HiveContextAwareRecordReader.initIOContext
->IOContext.setInputPath
{code}

We get the input path by
{code}
SparkMapRecordHandler.processRow
->MapOperator.process
->MapOperator.cleanUpInputFileChangedOp
->ExecMapperContext.getCurrentInputPath
->IOContext#getInputPath
{code}

When the cache is enabled, there is no setInputPath call (the rows come from the 
cached RDD rather than from a record reader), so IOContext#getInputPath returns null.
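
A standalone sketch of this ThreadLocal pitfall (illustrative names only, not 
Hive's IOContextMap code): a value set while the record reader runs in one 
thread is invisible to another thread that later consumes the cached rows, so 
the getter returns null.

{code}
public class ThreadLocalPitfall {
  private static final ThreadLocal<String> inputPath = new ThreadLocal<>();

  public static void main(String[] args) throws Exception {
    // "Reader" thread: analogous to initIOContext -> IOContext.setInputPath.
    Thread reader = new Thread(() -> inputPath.set("hdfs:///warehouse/dec/dec.txt"));
    reader.start();
    reader.join();

    // "Processing" thread: analogous to rows served from the rdd cache, where
    // no record reader (and hence no setInputPath call) runs in this thread.
    Thread processor = new Thread(() ->
        System.out.println("input path = " + inputPath.get()));  // prints null
    processor.start();
    processor.join();
  }
}
{code}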

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (HIVE-17755) NPE exception when running TestAcidOnTez#testGetSplitsLocks with "

2017-12-20 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang resolved HIVE-17755.
---
Resolution: Not A Problem

> NPE exception when running TestAcidOnTez#testGetSplitsLocks with "
> --
>
> Key: HIVE-17755
> URL: https://issues.apache.org/jira/browse/HIVE-17755
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-20 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299384#comment-16299384
 ] 

liyunzhang commented on HIVE-18148:
---

+1

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18289) Fix jar dependency when enable rdd cache in Hive on Spark

2017-12-19 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297977#comment-16297977
 ] 

liyunzhang commented on HIVE-18289:
---

If running with the Parquet format, the exception is 
{code}
Job failed with java.lang.NoSuchMethodException: 
org.apache.hadoop.io.ArrayWritable.<init>()
FAILED: Execution Error, return code 3 from 
org.apache.hadoop.hive.ql.exec.spark.SparkTask. 
java.util.concurrent.ExecutionException: Exception thrown by job
at 
org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362)
at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 2 in stage 4.0 failed 4 times, most recent failure: Lost task 2.3 in stage 
4.0 (TID 59, bdpe38): java.lang.RuntimeException: 
java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.<init>()
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:134)
at org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:217)
at 
org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:85)
at 
org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:72)
at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodException: 
org.apache.hadoop.io.ArrayWritable.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getDeclaredConstructor(Class.java:2178)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:128)
... 24 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
{code}
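
The root cause shown in this trace is that WritableUtils.clone goes through 
ReflectionUtils.newInstance, which instantiates the class via its no-argument 
constructor, and org.apache.hadoop.io.ArrayWritable does not declare one. A 
minimal sketch reproducing just that reflective failure (assuming hadoop-common 
on the classpath; the class name is made up):

{code}
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

public class ArrayWritableCloneSketch {
  public static void main(String[] args) throws Exception {
    ArrayWritable w = new ArrayWritable(Text.class);
    // Mirrors the reflective instantiation inside ReflectionUtils.newInstance:
    // ArrayWritable has no no-arg constructor, so this throws
    // java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.<init>()
    w.getClass().getDeclaredConstructor().newInstance();
  }
}
{code}
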
[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-19 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297921#comment-16297921
 ] 

liyunzhang commented on HIVE-18148:
---

In the above case, it would be better to remove the DPP whose target Map's TS has 
the smaller statistics.  Just a suggestion.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-19 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297899#comment-16297899
 ] 

liyunzhang commented on HIVE-18148:
---

 I can understand that the reason you need to remove DPP1 is to fix the NPE. But 
in the above example, if the leftmost table is a big table, is it suitable 
to remove DPP1 in the common join case?  

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-19 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297897#comment-16297897
 ] 

liyunzhang edited comment on HIVE-18301 at 12/20/17 5:23 AM:
-

Yes.  In HIVE-17486 I can merge the same tables into 1 MapInput and run such a 
case (DS/query28) successfully in spark local mode.  This 
exception only happens in yarn mode. 


was (Author: kellyzly):
Yes.  In HIVE-17486 I can merge the same tables into 1 MapInput and run such a 
case (DS/query28) successfully in spark local mode.  This 
exception only happens in yarn mode.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-19 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297897#comment-16297897
 ] 

liyunzhang commented on HIVE-18301:
---

Yes.  In HIVE-17486 I can merge the same tables into 1 MapInput and run such a 
case (DS/query28) successfully in spark local mode.  This 
exception only happens in yarn mode.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Previously, an IOContext problem was found in MapTran when the spark rdd cache was 
> enabled (HIVE-8920),
> so we disabled the rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that the IOContext does not seem to be initialized correctly in 
> spark yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when the rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-19 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297875#comment-16297875
 ] 

liyunzhang commented on HIVE-18148:
---

Sorry for the late reply. I still have one question about the code:

{code}
/** For DPP sinks w/ common join, we'll split the tree and what's above the branching
 * operator is computed multiple times. Therefore it may not be good for performance to support
 * nested DPP sinks, i.e. one DPP sink depends on other DPP sinks.
 * The following is an example:
 *
 *            TS          TS
 *             |           |
 *            ...         FIL
 *             |           |  \
 *            RS          RS  SEL
 *              \        /     |
 *   TS          JOIN         GBY
 *    |         /    \         |
 *    RS       RS    SEL      DPP2
 *      \     /       |
 *       JOIN        GBY
 *                    |
 *                   DPP1
 *
 * where DPP1 depends on DPP2.
 *
 * To avoid such case, we'll visit all the branching operators. If a branching operator has any
 * further away DPP branches w/ common join in its sub-tree, such branches will be removed.
 * In the above example, the branch of DPP1 will be removed.
 */
{code}

This function will first collect the branching operators (FIL and JOIN in the above 
example) and then remove the nested DPPs in the branches.  If it traverses FIL first, 
DPP1 is removed; if it traverses JOIN first, DPP2 is removed.   So this function 
removes one of the nested DPPs more or less at random.  Here I am confused about how 
to judge which DPP needs to be removed.  If my understanding is not right, please tell me. 

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-18 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Description: 
An IOContext problem was previously found in MapTran when the Spark RDD cache is 
enabled; see HIVE-8920.
So we disabled the RDD cache in MapTran at 
[SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
The problem is that IOContext seems not to be initialized correctly in Spark yarn 
client/cluster mode, causing an exception like:
{code}
Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
java.lang.RuntimeException: Error processing row: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
at 
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
... 12 more

Driver stacktrace:
{code}
in yarn client/cluster mode, sometimes 
[ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
 is null when rdd cache is enabled.

  was:
An IOContext problem was previously found in MapTran when the Spark RDD cache is 
enabled; see HIVE-8920.
So we disabled the RDD cache in MapTran at 
[SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
The problem is that IOContext seems not to be initialized properly in Spark yarn 
client/cluster mode, causing an exception like:
{code}
Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
java.lang.RuntimeException: Error processing row: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
at 

[jira] [Updated] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-18 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18301:
--
Description: 
An IOContext problem was previously found in MapTran when the Spark RDD cache is 
enabled; see HIVE-8920.
So we disabled the RDD cache in MapTran at 
[SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
The problem is that IOContext seems not to be initialized correctly in Spark yarn 
client/cluster mode, causing an exception like:
{code}
Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
java.lang.RuntimeException: Error processing row: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
at 
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
... 12 more

Driver stacktrace:
{code}
in yarn client/cluster mode, sometimes 
[ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
 is null when rdd cache is enabled.

  was:
An IOContext problem was previously found in MapTran when the Spark RDD cache is 
enabled; see HIVE-8920.
So we disabled the RDD cache in MapTran at 
[SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
The problem is that IOContext seems not to be initialized correctly in Spark yarn 
client/cluster mode, causing an exception like:
{code}
Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
java.lang.RuntimeException: Error processing row: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
at 

[jira] [Assigned] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-18 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang reassigned HIVE-18301:
-


> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> An IOContext problem was previously found in MapTran when the Spark RDD cache is 
> enabled; see HIVE-8920.
> So we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext seems not to be initialized properly in Spark yarn 
> client/cluster mode, causing an exception like:
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-18289) Fix jar dependency when enable rdd cache in Hive on Spark

2017-12-17 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang reassigned HIVE-18289:
-


> Fix jar dependency when enable rdd cache in Hive on Spark
> -
>
> Key: HIVE-18289
> URL: https://issues.apache.org/jira/browse/HIVE-18289
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Running DS/query28 with HIVE-17486's 4th patch enabled, on 
> tpcds_bin_partitioned_orc_10, in either Spark local or yarn mode.
> The command:
> {code}
> set spark.local=yarn-client;
> echo 'use tpcds_bin_partitioned_orc_10;source query28.sql;'|hive --hiveconf 
> spark.app.name=query28.sql  --hiveconf hive.spark.optimize.shared.work=true 
> -i testbench.settings -i query28.sql.setting
> {code}
> The exception:
> {code}
> java.lang.RuntimeException: java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.io.orc.OrcStruct.<init>()
> 748678 at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:134) 
> ~[hadoop-common-2.7.3.jar:?]
> 748679 at 
> org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:217) 
> ~[hadoop-common-2.7.3.jar:?]
> 748680 at 
> org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.   0-SNAPSHOT]
> 748681 at 
> org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:72)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.   0-SNAPSHOT]
> 748682 at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
>  ~[spark-core_2.11-2.   0.0.jar:2.0.0]
> 748683 at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
>  ~[spark-core_2.11-2.   0.0.jar:2.0.0]
> 748684 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
> ~[scala-library-2.11.8.jar:?]
> 748685 at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>  ~[spark-core_2.11-2.0.0.jar:2.   0.0]
> 748686 at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>  ~[spark-core_2.11-2.0.0.   jar:2.0.0]
> 748687 at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>  ~[spark-core_2.11-2.0.0.   jar:2.0.0]
> 748688 at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748689 at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748690 at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748691 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748692 at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748693 at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748694 at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748695 at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748696 at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 
> ~[spark-core_2.11-2
> {code}
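
As background on this failure mode, a minimal illustration of why reflective 
instantiation fails without a no-arg constructor; plain JDK reflection is used 
here as a stand-in for the Hadoop ReflectionUtils.newInstance call that 
WritableUtils.clone relies on, and the class names are hypothetical:
{code}
// Minimal reproduction of the NoSuchMethodException pattern above:
// reflective instantiation needs an accessible no-arg constructor.
public class NoDefaultCtorDemo {
  static class OnlyArgCtor {
    OnlyArgCtor(int x) { }   // no no-arg constructor is declared
  }

  public static void main(String[] args) {
    try {
      OnlyArgCtor.class.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      // prints: java.lang.NoSuchMethodException: ...OnlyArgCtor.<init>()
      System.out.println(e);
    }
  }
}
{code}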



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-16 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: HIVE-17486.4.patch

Updated HIVE-17486.4.patch to fix the compilation error.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-15 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: HIVE-17486.3.patch

Added spark_optimize_shared_work.q.out and updated HIVE-17486.3.patch. Triggering 
QA tests to see whether the patch affects the current code.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, explain.28.share.false, explain.28.share.true, 
> scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-15 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Status: Patch Available  (was: Open)

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, explain.28.share.false, explain.28.share.true, 
> scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-15 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292286#comment-16292286
 ] 

liyunzhang commented on HIVE-17486:
---

[~xuefuz]: you mentioned before that the reason for disabling caching for MapInput 
is the [IOContext initialization 
problem|https://issues.apache.org/jira/browse/HIVE-8920?focusedCommentId=14260846=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14260846].
 Reading HIVE-9041, it gives an example of the IOContext initialization 
problem:
{code}
I just found another bug regarding IOContext, when caching is turned on.
Taking the sample query above as an example, right now I have this result plan:

   MW 1 (table0)   MW 2 (table1)   MW 3 (table0)   MW 4 (table1)
        \               /               \               /
         \             /                 \             /
          \           /                   \           /
           \         /                     \         /
              RW 1                            RW 2
Suppose MapWorks are executed from left to right, also suppose we are just 
running with a single thread.
Then, the following will happen:
1. executing MW 1: since this is the first time we access table0, initialize 
IOContext and make input path point to table0;
2. executing MW 2: since this is the first time we access table1, initialize 
IOContext and make input path point to table1;
3. executing MW 3: since this is the second time we access table0, do not 
initialize IOContext, and use the copy saved in step 2), which is table1.

Step 3 will then fail.
How can we make MW 3 know that it needs to get the saved IOContext from MW 1, 
and not from MW 2?
{code}
If the problem exists in the MapInput RDD cache because IOContext is a 
static variable that is stored in the cache and updated by different MapWorks, 
why is caching only disabled for the [MapInput rdd 
cache|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202]?
 
It seems this should be disabled in all MapTrans. Please explain more if you have time.
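
To make the hazard concrete, here is a toy sketch of the static-variable problem 
described in the quoted example; ToyIOContext and the method names are illustrative 
stand-ins, not Hive's real IOContext machinery:
{code}
// Toy illustration of why a single shared IOContext slot breaks once a
// cached MapWork skips re-initialization.
class ToyIOContext {
  static String inputPath;              // one shared slot for all MapWorks
}

public class IOContextHazard {
  static void runMapWork(String table, boolean firstAccess) {
    if (firstAccess) {
      ToyIOContext.inputPath = table;   // initialized only on first access
    }
    System.out.println("processing " + table
        + " with input path " + ToyIOContext.inputPath);
  }

  public static void main(String[] args) {
    runMapWork("table0", true);   // MW 1: inputPath = table0
    runMapWork("table1", true);   // MW 2: inputPath = table1
    runMapWork("table0", false);  // MW 3: skips init, sees table1 -> fails
  }
}
{code}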

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> explain.28.share.false, explain.28.share.true, scanshare.after.svg, 
> scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-15 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292203#comment-16292203
 ] 

liyunzhang edited comment on HIVE-17486 at 12/15/17 9:15 AM:
-

[~lirui] and [~xuefuz]: I have updated HIVE-17486.2.patch and redesigned it. 
A similar case to DS/query28.sql now runs successfully.
This proves that the current design (M-M-R) can use the RDD cache to reduce the 
table scans.
The latest design doc contains a simple case. Although I have added that simple 
case (spark_optimize_shared_work.q), there is some exception when running the 
qtest in my local env. Once I fix the problem, I will trigger Hive QA to test.


was (Author: kellyzly):
[~lirui] and [~xuefuz]: I have updated HIVE-17486.2.patch and redesigned it 
because a similar case to DS/query28.sql runs successfully.
This proves that the current design (M-M-R) can use the RDD cache to reduce the 
table scans.
The latest design doc contains a simple case. Although I have added that simple 
case (spark_optimize_shared_work.q), there is some exception when running the 
qtest in my local env. Once I fix the problem, I will trigger Hive QA to test.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> explain.28.share.false, explain.28.share.true, scanshare.after.svg, 
> scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-15 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: HIVE-17486.2.patch

[~lirui] and [~xuefuz]: I have updated HIVE-17486.2.patch and redesigned it 
because a similar case to DS/query28.sql runs successfully.
This proves that the current design (M-M-R) can use the RDD cache to reduce the 
table scans.
The latest design doc contains a simple case. Although I have added that simple 
case (spark_optimize_shared_work.q), there is some exception when running the 
qtest in my local env. Once I fix the problem, I will trigger Hive QA to test.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> explain.28.share.false, explain.28.share.true, scanshare.after.svg, 
> scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18282) Spark tar is downloaded every time for itest

2017-12-14 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292052#comment-16292052
 ] 

liyunzhang commented on HIVE-18282:
---

[~stakiar]: 
http://d3jw87u4immizc.cloudfront.net/spark-tarball/spark-2.0.0-bin-hadoop2-without-hive.tgz.md5sum
 exists, but 
http://d3jw87u4immizc.cloudfront.net/spark-tarball/spark-2.2.0-bin-hadoop2-without-hive.tgz.md5sum
 does not. Can you help upload the md5sum for the spark-2.2.0 package? Otherwise 
the spark tar is re-downloaded even when the package is already present. See 
$HIVE_SOURCE/itests/pom.xml, around line 82:
{code}
download() {
  url=$1
  finalName=$2
  tarName=$(basename $url)
  rm -rf $BASE_DIR/$finalName
  if [[ ! -f $DOWNLOAD_DIR/$tarName ]]
  then
    curl -Sso $DOWNLOAD_DIR/$tarName $url
  else
    local md5File="$tarName".md5sum
    curl -Sso $DOWNLOAD_DIR/$md5File "$url".md5sum
    cd $DOWNLOAD_DIR
    # use md5sum to check whether the package needs to be re-downloaded
    if type md5sum >/dev/null && ! md5sum -c $md5File; then
      curl -Sso $DOWNLOAD_DIR/$tarName $url || return 1
    fi
    cd -
  fi
  tar -zxf $DOWNLOAD_DIR/$tarName -C $BASE_DIR
  mv $BASE_DIR/spark-${spark.version}-bin-hadoop2-without-hive $BASE_DIR/$finalName
}


{code}

> Spark tar is downloaded every time for itest
> 
>
> Key: HIVE-18282
> URL: https://issues.apache.org/jira/browse/HIVE-18282
> Project: Hive
>  Issue Type: Test
>Reporter: Rui Li
>
> Seems we missed the md5 file for spark-2.2.0?
> cc [~kellyzly], [~stakiar]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-13 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290256#comment-16290256
 ] 

liyunzhang commented on HIVE-18148:
---

{code} 
grep -C2 "hive.auto.convert.join" 
$HIVE_SOURCE/itests/qtest/target/testconf/spark/yarn-client/hive-site.xml
160-
161-
162:  hive.auto.convert.join
163-  false
164-  Whether Hive enable the optimization about converting common 
join into mapjoin based on the input file size
{code}
When running spark_dynamic_partition_pruning_5.q, the above hive-site.xml is 
used, where {{hive.auto.convert.join}} is false.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-13 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290243#comment-16290243
 ] 

liyunzhang commented on HIVE-18148:
---

[~lirui]: If this is only found with common join, please add comments describing 
that, and modify some existing comments. I see the following in the patch:
{code}
// If a branch is of pattern "RS - MAPJOIN", it means we're on the "small table" side of a
// map join. Since there will be a job boundary, we shouldn't look for DPPs beyond this.
private boolean stopAtMJ(Operator<?> op) {
}
{code}
I guess there is no possibility of seeing MAPJOIN in this situation.



> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-13 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288899#comment-16288899
 ] 

liyunzhang commented on HIVE-18148:
---

Sorry, I have not run the qtest; I just copied the query from 
{{spark_dynamic_partition_pruning_5.q}}, and there is no 
{{hive.auto.convert.join=false}} in {{spark_dynamic_partition_pruning_5.q}}.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-13 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288864#comment-16288864
 ] 

liyunzhang commented on HIVE-18148:
---

{quote}
you can just run the new test with master branch and it should fail 
{quote}
The new unit test is executed with map join because 
{{hive.auto.convert.join.noconditionaltask}} is true by default. If my 
understanding is wrong, please tell me.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-12 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288814#comment-16288814
 ] 

liyunzhang commented on HIVE-18148:
---

[~lirui]: I tried the example you provided.
{code}

set hive.spark.dynamic.partition.pruning=true;
explain select * from src join part1 on src.key=part1.p join part2 on 
src.value=part2.q;

{code}

But in my env (latest build: 095e6bf8988a03875bc9340b2ab82d5d13c4cb3c), the 
physical plan before SparkCompiler#removeNestedDPP is 
{code}
TS[0]-FIL[22]-SEL[2]-RS[9]-MAPJOIN[32]-MAPJOIN[31]-SEL[15]-FS[16]
TS[3]-FIL[23]-SEL[5]-MAPJOIN[32]
TS[6]-FIL[24]-SEL[8]-RS[13]-MAPJOIN[31]
{code}
There is no DPP operator for removeNestedDPP to traverse, because of HIVE-17087: 
SparkMapJoinOptimizer#convertJoinMapJoin removes the DPP operator when there is 
a mapjoin operator.

So did you reproduce the NPE by setting 
{{hive.auto.convert.join.noconditionaltask}} to false? BTW, how does Hive on 
Tez deal with this kind of nested DPP case?


> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285558#comment-16285558
 ] 

liyunzhang edited comment on HIVE-17486 at 12/11/17 6:41 AM:
-

[~xuefuz]
{quote}  It seems that we can do so whenever an TS is connected to multiple 
RSs. The split point should happen at the fork. {quote}  I do not quite understand 
this. Currently the split is on the TS. For example:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
     -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}

->
{code}  
Map1: TS[0]
Map2:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map3:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}


was (Author: kellyzly):
[~xuefuz]
{quote}  It seems that we can do so whenever an TS is connected to multiple 
RSs. The split point should happen at the fork. {quote}  I do not quite understand 
this. Please explain more, thanks!

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285558#comment-16285558
 ] 

liyunzhang commented on HIVE-17486:
---

[~xuefuz]
{quote}  It seems that we can do so whenever an TS is connected to multiple 
RSs. The split point should happen at the fork. {quote}  I do not quite understand 
this. Please explain more, thanks!

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285551#comment-16285551
 ] 

liyunzhang commented on HIVE-17486:
---

[~lirui]:
{quote}
could you provide more design and implementation details in your doc?
{quote}
OK, I will add more to the design doc.
{quote}
For the first problem, can the new feature be done in a separate pass, so that 
you don't need to deal with existing rules or graph walker?
{quote}
Thanks for your suggestion; I will try to avoid the current problem.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285549#comment-16285549
 ] 

liyunzhang commented on HIVE-18150:
---

[~Ferd]:  [~stakiar] has uploaded the spark-2.2.0 package. So we need not bump 
up the version of Spark for QTest.

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285473#comment-16285473
 ] 

liyunzhang commented on HIVE-18150:
---

[~ferd]: as we have finished the review of the patch, please help commit it, 
thanks!

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285470#comment-16285470
 ] 

liyunzhang commented on HIVE-17486:
---

[~lirui] and [~xuefuz]: Please help review the [First 
Problem|https://docs.google.com/document/d/1f4f0oMhN2vKSTCtXbnd3FBYOV02H4QflX1BbkglnC30/edit#heading=h.d0ptagvbv8k3]
 I mentioned in the design doc, thanks!

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285468#comment-16285468
 ] 

liyunzhang commented on HIVE-18148:
---

[~lirui]: I cannot reproduce it because you did not provide the full script.
But I guess the following script matches what you mentioned: 
[spark_dynamic_partition_pruning.q|https://github.com/kellyzly/hive/blob/master/ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning.q#L29]
{code}
EXPLAIN select count(*) from srcpart join srcpart_date on (srcpart.ds = 
srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr)
where srcpart_date.`date` = '2008-04-08' and srcpart_hour.hour = 11;
{code}
srcpart is similar to src, srcpart_date is similar to part1, and srcpart_hour 
is similar to part2 in your example. But the operator tree is like
{code}
TS[0]-SEL[2]-MAPJOIN[36]-MAPJOIN[35]-GBY[16]-RS[17]-GBY[18]-FS[20]
TS[3]-FIL[27]-SEL[5]-RS[10]-MAPJOIN[36]
                    -SEL[29]-GBY[30]-SPARKPRUNINGSINK[31]
TS[6]-FIL[28]-SEL[8]-RS[13]-MAPJOIN[35]
                    -SEL[32]-GBY[33]-SPARKPRUNINGSINK[34]
{code}
Here TS\[3\] is srcpart_date, TS\[6\] is srcpart_hour, and TS\[0\] is src. But 
there is no nested DPP problem here. So what is wrong?

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-08 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283241#comment-16283241
 ] 

liyunzhang commented on HIVE-17486:
---

[~xuefuz] and [~lirui]: the second problem I found is the modification of 
MapOperator after changing from M->R to M->M->R.
MapOperator is responsible for deserializing and for [initObjectInspector| 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L170].
 For the second M in M->M->R, deserializing is not necessary but 
initObjectInspector is. Currently I am investigating how to make this 
work. If there is something wrong in my understanding, please tell me!
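
A hedged sketch of the distinction being drawn here, using simplified stand-in 
types rather than Hive's real MapOperator machinery: the second M consumes rows 
that are already deserialized, so it should skip deserialization while still 
setting up its ObjectInspector:
{code}
// Simplified stand-ins for the MapOperator machinery discussed above.
interface ToyDeserializer {
  Object deserialize(byte[] raw);
}

class ToyMapOperator {
  private final boolean readsRawInput;  // true for the first M, false for the second
  private Object objectInspector;       // placeholder for a real ObjectInspector

  ToyMapOperator(boolean readsRawInput) {
    this.readsRawInput = readsRawInput;
  }

  void initObjectInspector() {
    // needed in BOTH map stages, so child operators can interpret the rows
    this.objectInspector = new Object();
  }

  Object process(Object input, ToyDeserializer d) {
    // only the first M sees serialized bytes; the second M gets rows as-is
    return readsRawInput ? d.deserialize((byte[]) input) : input;
  }
}
{code}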

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-07 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283012#comment-16283012
 ] 

liyunzhang commented on HIVE-17486:
---

[~xuefuz]: I have uploaded the [design 
doc|https://docs.google.com/document/d/1f4f0oMhN2vKSTCtXbnd3FBYOV02H4QflX1BbkglnC30/edit?usp=sharing].
 I described the problems I met in the [Problem 
Section|https://docs.google.com/document/d/1f4f0oMhN2vKSTCtXbnd3FBYOV02H4QflX1BbkglnC30/edit#heading=h.d0ptagvbv8k3];
 please help review them if you have time, thanks!

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-07 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282966#comment-16282966
 ] 

liyunzhang commented on HIVE-17486:
---

[~xuefuz]: OK, will upload the doc soon.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-06 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281238#comment-16281238
 ] 

liyunzhang commented on HIVE-17486:
---

[~lirui], [~xuefuz]: can you spend some time looking at the problem I met with 
DefaultRuleDispatcher? Thanks!

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if the 
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work, so we need not repeat the same computation thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-06 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281237#comment-16281237
 ] 

liyunzhang commented on HIVE-18150:
---

[~ferd], [~lirui], [~xuefuz] and [~stakiar]: 4 unit tests failed, but the 
failures seem unrelated to this patch. 
[Jenkins|https://builds.apache.org/job/PreCommit-HIVE-Build/8137/#showFailuresLink]
 still fails on these 4 unit tests when other patches are applied. Can you 
help review?

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-05 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18150:
--
Attachment: (was: HIVE-18150.1.patch)

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-05 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18150:
--
Attachment: HIVE-18150.1.patch

Reattached the patch to trigger Hive QA.

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278167#comment-16278167
 ] 

liyunzhang commented on HIVE-17486:
---

Here I record the problems I have met so far.
1. I want to change M->R to M->M->R and split the operator tree when 
encountering a TS. I created [SparkRuleDispatcher| 
https://github.com/kellyzly/hive/blob/HIVE-17486.3/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkRuleDispatcher.java]
 to apply rules to the operator tree. The reason I don't use 
DefaultRuleDispatcher is that there is already a rule called [Handle Analyze 
Command|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L432]
 that splits operator trees once a TS is encountered. The original 
[SparkCompiler#opRules|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L417]
 is a LinkedHashMap which stores one value per key, so it cannot deal with the 
case where one key has two values. The current solution is to change 
SparkCompiler#opRules to a Multimap (a sketch follows) and create 
SparkRuleDispatcher. But I am afraid that once a TS is encountered, only 1 
rule will be applied in {{SparkRuleDispatcher#dispatch}}.

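A minimal sketch of the Multimap change, assuming Guava's LinkedHashMultimap; 
analyzeTableProcessor and splitWorkProcessor below are placeholders, not the 
real Hive processor classes:
{code}
import com.google.common.collect.LinkedHashMultimap;
import com.google.common.collect.Multimap;
import java.util.Collection;
import org.apache.hadoop.hive.ql.exec.TableScanOperator;
import org.apache.hadoop.hive.ql.lib.NodeProcessor;
import org.apache.hadoop.hive.ql.lib.Rule;
import org.apache.hadoop.hive.ql.lib.RuleRegExp;

// Sketch: opRules as a Multimap so one TS pattern can carry two processors.
Multimap<Rule, NodeProcessor> opRules = LinkedHashMultimap.create();
Rule tsRule = new RuleRegExp("TS", TableScanOperator.getOperatorName() + "%");
opRules.put(tsRule, analyzeTableProcessor); // the existing "Handle Analyze Command" processor
opRules.put(tsRule, splitWorkProcessor);    // the new processor that splits the operator tree
// Multimap#get returns every processor registered for the rule, which the
// one-value-per-key LinkedHashMap in SparkCompiler cannot express.
Collection<NodeProcessor> tsProcessors = opRules.get(tsRule);
{code}
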
SparkRuleDispatcher#dispatch
{code}
@Override
public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
    throws SemanticException {

  // find the firing rule
  // find the rule from the stack specified
  Rule rule = null;
  int minCost = Integer.MAX_VALUE;
  for (Rule r : procRules.keySet()) {
    int cost = r.cost(ndStack);
    if ((cost >= 0) && (cost < minCost)) {
      minCost = cost;
      // Here I am afraid only 1 rule will be applied even if there are
      // two rules for TS
      rule = r;
    }
  }

  Collection<NodeProcessor> procSet;

  if (rule == null) {
    procSet = defaultProcSet;
  } else {
    procSet = procRules.get(rule);
  }

  // Do nothing in case proc is null
  Object ret = null;
  for (NodeProcessor proc : procSet) {
    if (proc != null) {
      // Call the process function
      ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
    }
  }
  return ret;
}
{code}

I can change the above code as follows, but I don't know which rule's result 
should be returned if there is more than 1 rule for TS.
{code}
@Override
public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
    throws SemanticException {

  // find the firing rules
  // find the rules from the stack specified
  ArrayList<Rule> ruleList = new ArrayList<Rule>();
  int minCost = Integer.MAX_VALUE;
  for (Rule r : procRules.keySet()) {
    int cost = r.cost(ndStack);
    if ((cost >= 0) && (cost < minCost)) {
      minCost = cost;
      ruleList.add(r);
    }
  }

  // Start from the default processors so procSet is always initialized.
  Collection<NodeProcessor> procSet = defaultProcSet;

  if (!ruleList.isEmpty()) {
    for (Rule r : ruleList) {
      // Question: Here I don't know which rule I should use if there is
      // more than 1 rule in the ruleList
      procSet = procRules.get(r);
    }
  }

  // Do nothing in case proc is null
  Object ret = null;
  for (NodeProcessor proc : procSet) {
    if (proc != null) {
      // Call the process function
      ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
    }
  }
  return ret;
}
{code}
[~lirui], [~xuefuz]: can you give your suggestions on this problem?
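
One possible direction, just a sketch of one policy rather than a decided 
design (it reuses the variables of the dispatch method above), is to fire 
every matching rule and keep the last non-null result:
{code}
// Sketch: fire all matching rules and keep the last non-null result.
// Whether "last rule wins" is the right policy is exactly the open
// question above.
Object ret = null;
for (Rule r : ruleList) {
  for (NodeProcessor proc : procRules.get(r)) {
    if (proc != null) {
      Object out = proc.process(nd, ndStack, procCtx, nodeOutputs);
      if (out != null) {
        ret = out;
      }
    }
  }
}
return ret;
{code}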


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: HIVE-17486.1.patch

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-04 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278065#comment-16278065
 ] 

liyunzhang commented on HIVE-18150:
---

Tested two qfiles with TestMiniSparkOnYarnCliDriver; there seem to be no 
failures.

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276547#comment-16276547
 ] 

liyunzhang commented on HIVE-17486:
---

Will upload the patch and describe the problems I encountered later.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: explain.28.share.false, explain.28.share.true, 
> scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: explain.28.share.false
explain.28.share.true

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: explain.28.share.false, explain.28.share.true, 
> scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276526#comment-16276526
 ] 

liyunzhang commented on HIVE-17486:
---

In explain.28.scan.share.true we can see that there is only one operator (TS) 
in Map 1, and the operators from the child of the TS down to the RS belong to 
other Maps (Map 12, Map 15, Map 18, Map 2, Map 6, Map 9). So the current 
{{M-R}} in 1 SparkTask is changed to {{M-M-R}}:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Map 12 <- Map 1 (NONE, 1000)
Map 15 <- Map 1 (NONE, 1000)
Map 18 <- Map 1 (NONE, 1000)
Map 2 <- Map 1 (NONE, 1000)
Map 6 <- Map 1 (NONE, 1000)
Map 9 <- Map 1 (NONE, 1000)
Reducer 10 <- Map 9 (GROUP PARTITION-LEVEL SORT, 1)
Reducer 13 <- Map 12 (GROUP PARTITION-LEVEL SORT, 1)
Reducer 16 <- Map 15 (GROUP PARTITION-LEVEL SORT, 1)
Reducer 19 <- Map 18 (GROUP PARTITION-LEVEL SORT, 1)
Reducer 3 <- Map 2 (GROUP PARTITION-LEVEL SORT, 1)
Reducer 4 <- Reducer 10 (PARTITION-LEVEL SORT, 1), Reducer 13 
(PARTITION-LEVEL SORT, 1), Reducer 16 (PARTITION-LEVEL SORT, 1), Reducer 19 
(PARTITION-LEVEL SORT, 1), Reducer 3 (PARTITION-LEVEL SORT, 1), Reducer 7 
(PARTITION-LEVEL SORT, 1)
Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1)
  DagName: root_20171204042631_0435ff7e-3f10-4c84-a5fc-dc5b607497ba:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: ((ss_quantity BETWEEN 0 AND 5 and (ss_list_price 
BETWEEN 11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or ss_wholesale_cost 
BETWEEN 14 AND 34)) or (ss_quantity BETWEEN 6 AND 10 and (ss_list_price BETWEEN 
91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or ss_wholesale_cost BETWEEN 
32 AND 52)) or (ss_quantity BETWEEN 11 AND 15 and (ss_list_price BETWEEN 66 AND 
76 or ss_coupon_amt BETWEEN 920 AND 1920 or ss_wholesale_cost BETWEEN 4 AND 
24)) or (ss_quantity BETWEEN 16 AND 20 and (ss_list_price BETWEEN 142 AND 152 
or ss_coupon_amt BETWEEN 3054 AND 4054 or ss_wholesale_cost BETWEEN 80 AND 
100)) or (ss_quantity BETWEEN 21 AND 25 and (ss_list_price BETWEEN 135 AND 145 
or ss_coupon_amt BETWEEN 14180 AND 15180 or ss_wholesale_cost BETWEEN 38 AND 
58)) or (ss_quantity BETWEEN 26 AND 30 and (ss_list_price BETWEEN 28 AND 38 or 
ss_coupon_amt BETWEEN 2513 AND 3513 or ss_wholesale_cost BETWEEN 42 AND 62))) 
(type: boolean)
  Statistics: Num rows: 28800991 Data size: 4751513940 Basic 
stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Map 12 
Map Operator Tree:
Filter Operator
  predicate: (ss_quantity BETWEEN 16 AND 20 and (ss_list_price 
BETWEEN 142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or ss_wholesale_cost 
BETWEEN 80 AND 100)) (type: boolean)
  Statistics: Num rows: 1066701 Data size: 175981606 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: ss_list_price (type: double)
outputColumnNames: ss_list_price
Statistics: Num rows: 1066701 Data size: 175981606 Basic 
stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: avg(ss_list_price), count(ss_list_price), 
count(DISTINCT ss_list_price)
  keys: ss_list_price (type: double)
  mode: hash
  outputColumnNames: _col0, _col1, _col2, _col3
  Statistics: Num rows: 1066701 Data size: 175981606 Basic 
stats: COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: double)
sort order: +
Statistics: Num rows: 1066701 Data size: 175981606 
Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: 
struct), _col2 (type: bigint)
Map 15 
Map Operator Tree:
Filter Operator
  predicate: (ss_quantity BETWEEN 21 AND 25 and (ss_list_price 
BETWEEN 135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or 
ss_wholesale_cost BETWEEN 38 AND 58)) (type: boolean)
  Statistics: Num rows: 1066701 Data size: 175981606 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: ss_list_price (type: double)
outputColumnNames: ss_list_price
Statistics: Num rows: 1066701 Data size: 175981606 Basic 
stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: avg(ss_list_price), count(ss_list_price), 
count(DISTINCT 
{code}

[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: (was: explain.28.scan.share.true)

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: (was: explain.28.scan.share.false)

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-04 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276515#comment-16276515
 ] 

liyunzhang commented on HIVE-18150:
---

Trigger Hive QA to find the failing test cases.

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18150:
--
Attachment: HIVE-18150.1.patch

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang reassigned HIVE-18150:
-

Assignee: liyunzhang  (was: Sahil Takiar)

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18150:
--
Status: Patch Available  (was: Open)

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang
> Attachments: HIVE-18150.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-04 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-17486:
--
Attachment: explain.28.scan.share.false
explain.28.scan.share.true

I set the flag {{hive.spark.optimize.shared.work}} to enable the 
SharedWorkOptimizer in Hive on Spark. The attached explain.28.scan.share.true 
is the explain output with the flag enabled, and explain.28.scan.share.false 
is the explain output with the flag disabled, for 
[DS/query28.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query28.sql].
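
For reference, a minimal sketch of how such a flag could gate the optimizer; 
only the flag name comes from this comment, and the gating code is an 
assumption:
{code}
import org.apache.hadoop.hive.conf.HiveConf;

// Sketch only: reads the flag via HiveConf (a Hadoop Configuration) and
// gates the optimizer pass; the surrounding compiler code is hypothetical.
HiveConf conf = new HiveConf();
conf.setBoolean("hive.spark.optimize.shared.work", true);
if (conf.getBoolean("hive.spark.optimize.shared.work", false)) {
  // run the SharedWorkOptimizer over the physical plan here
}
{code}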

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: explain.28.scan.share.false, explain.28.scan.share.true, 
> scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a spark work is cached if 
> the spark work is used by more than 1 child spark work. After 
> SharedWorkOptimizer is enabled in the physical plan in HoS, identical table 
> scans are merged into 1 table scan whose result is used by more than 1 child 
> spark work, so the same computation need not be repeated thanks to the cache 
> mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18150) Upgrade Spark Version to 2.2.0

2017-12-03 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276289#comment-16276289
 ] 

liyunzhang commented on HIVE-18150:
---

[~stakiar]: any progress on this jira? If you have no time for it, can I take 
it over? I am currently working on compiling and running spark with scala-2.12 
and jdk9 (SPARK-22660), so I need to upgrade hive on spark to Spark 2.2.0 to 
enable Hive on Spark on JDK9.

> Upgrade Spark Version to 2.2.0
> --
>
> Key: HIVE-18150
> URL: https://issues.apache.org/jira/browse/HIVE-18150
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (HIVE-18080) Performance degradation on VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled

2017-11-23 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18080:
--
Comment: was deleted

(was: 
It seems the If expression is vectorized with AVX2 instructions. It only 
consumes 2.8% of all instructions and very little cpu time (23 ms) when the 
warm-up iteration number is 5000 and the iteration number is 5, so the 
variance of this test is too big. Later I will enlarge the iteration counts of 
the jmh test to see whether the degradation persists.
{code}
 To specify different parameters, use:
 * - This command will use 10 warm-up iterations, 5 test iterations, and 2
 *   forks. And it will display the Average Time (avgt) in Microseconds (us)
 * - Benchmark mode. Available modes are:
 *   [Throughput/thrpt, AverageTime/avgt, SampleTime/sample,
 *   SingleShotTime/ss, All/all]
 * - Output time unit. Available time units are: [m, s, ms, us, ns].
 *
 * $ java -jar target/benchmarks.jar
 *   org.apache.hive.benchmark.vectorization.VectorizedLogicBench
 *   -wi 10 -i 5 -f 2 -bm avgt -tu us
 */
{code})

> Performance degradation on 
> VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled
> --
>
> Key: HIVE-18080
> URL: https://issues.apache.org/jira/browse/HIVE-18080
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
> Attachments: IFExpression_AVX2_Instruction.png, 
> log.logic.avx1.single.0, log_logic.avx1.part
>
>
> Use  Xeon(R) Platinum 8180 CPU to test the performance of 
> [AVX512|https://en.wikipedia.org/wiki/AVX-512].
> {code}
> #cat /proc/cpuinfo |grep "model name"|head -n 1
> model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
> {code}
> Before that, I compiled hive with JDK9, as JDK9 enables AVX512.
> Use the hive microbenchmark (HIVE-10189) to evaluate the performance 
> improvement. It seems performance improves (20%+) in the cases in 
> {{VectorizedArithmeticBench}}, {{VectorizedComparisonBench}}, 
> {{VectorizedLikeBench}} and {{VectorizedLogicBench}}, except 
> {{VectorizedLogicBench#IfExprLongColumnLongColumnBench}}, 
> {{VectorizedLogicBench#IfExprRepeatingLongColumnLongColumnBench}} and 
> {{VectorizedLogicBench#IfExprLongColumnRepeatingLongColumnBench}}.
> When I use a Skylake CPU to evaluate the performance improvement of AVX512, 
> I find the performance in VectorizedLogicBench is as follows:
> || ||AVX2 us/op||AVX512 us/op ||  (AVX2-AVX512)/AVX2||
> |ColAndColBench|122510| 87014| 28.9%|
> |IfExprLongColumnLongColumnBench | 1325759| 1436073| -8.3% |
> |IfExprLongColumnRepeatingLongColumnBench|1397447|1480450|  -5.9%|
> |IfExprRepeatingLongColumnLongColumnBench|1401164|1483062|  -5.9% |
> |NotColBench|77042.83|51513.28|  33%|
> There are degradations in IfExprLongColumnLongColumnBench, 
> IfExprLongColumnRepeatingLongColumnBench and 
> IfExprRepeatingLongColumnLongColumnBench; I am very confused why there is 
> degradation in these IfExpr cases.
> Here we use {{taskset -cp 1 $pid}} to run the benchmark on single core to 
> avoid the impact of dynamic CPU frequency scaling.
> my script
> {code}
> export JAVA_HOME=/home/zly/jdk-9.0.1/
> export PATH=$JAVA_HOME/bin:$PATH
> export LD_LIBRARY_PATH=/home/zly/jdk-9.0.1/mylib
> for i in 0 1 2; do
> java -server -XX:UseAVX=3 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx3.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> for i in 0 1 2; do
> java -server -XX:UseAVX=2 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx2.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18080) Performance degradation on VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled

2017-11-22 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263873#comment-16263873
 ] 

liyunzhang commented on HIVE-18080:
---


It seems the If expression is vectorized with AVX2 instructions. It only 
consumes 2.8% of all instructions and very little cpu time (23 ms) when the 
warm-up iteration number is 5000 and the iteration number is 5, so the 
variance of this test is too big. Later I will enlarge the iteration counts of 
the jmh test to see whether the degradation persists.
{code}
 To specify different parameters, use:
 * - This command will use 10 warm-up iterations, 5 test iterations, and 2
 *   forks. And it will display the Average Time (avgt) in Microseconds (us)
 * - Benchmark mode. Available modes are:
 *   [Throughput/thrpt, AverageTime/avgt, SampleTime/sample,
 *   SingleShotTime/ss, All/all]
 * - Output time unit. Available time units are: [m, s, ms, us, ns].
 *
 * $ java -jar target/benchmarks.jar
 *   org.apache.hive.benchmark.vectorization.VectorizedLogicBench
 *   -wi 10 -i 5 -f 2 -bm avgt -tu us
 */
{code}
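
One way to enlarge the iterations is via the JMH annotations; a sketch only, 
where the iteration counts are arbitrary and the benchmark body is a 
simplified stand-in for the real if expression:
{code}
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@Warmup(iterations = 40)       // raised from the counts used so far
@Measurement(iterations = 40)
@Fork(2)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class IfExprBenchSketch {
  @Benchmark
  public long ifExpr() {
    long a = System.nanoTime();
    // simplified stand-in for (in1[i1] - 1L) > 0 ? in2[i1] : in3[i1]
    return (a - 1L) > 0 ? a : -a;
  }
}
{code}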

> Performance degradation on 
> VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled
> --
>
> Key: HIVE-18080
> URL: https://issues.apache.org/jira/browse/HIVE-18080
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
> Attachments: IFExpression_AVX2_Instruction.png, 
> log.logic.avx1.single.0, log_logic.avx1.part
>
>
> Use  Xeon(R) Platinum 8180 CPU to test the performance of 
> [AVX512|https://en.wikipedia.org/wiki/AVX-512].
> {code}
> #cat /proc/cpuinfo |grep "model name"|head -n 1
> model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
> {code}
> Before that, I compiled hive with JDK9, as JDK9 enables AVX512.
> Use the hive microbenchmark (HIVE-10189) to evaluate the performance 
> improvement. It seems performance improves (20%+) in the cases in 
> {{VectorizedArithmeticBench}}, {{VectorizedComparisonBench}}, 
> {{VectorizedLikeBench}} and {{VectorizedLogicBench}}, except 
> {{VectorizedLogicBench#IfExprLongColumnLongColumnBench}}, 
> {{VectorizedLogicBench#IfExprRepeatingLongColumnLongColumnBench}} and 
> {{VectorizedLogicBench#IfExprLongColumnRepeatingLongColumnBench}}.
> When I use a Skylake CPU to evaluate the performance improvement of AVX512, 
> I find the performance in VectorizedLogicBench is as follows:
> || ||AVX2 us/op||AVX512 us/op ||  (AVX2-AVX512)/AVX2||
> |ColAndColBench|122510| 87014| 28.9%|
> |IfExprLongColumnLongColumnBench | 1325759| 1436073| -8.3% |
> |IfExprLongColumnRepeatingLongColumnBench|1397447|1480450|  -5.9%|
> |IfExprRepeatingLongColumnLongColumnBench|1401164|1483062|  -5.9% |
> |NotColBench|77042.83|51513.28|  33%|
> There are degradations in IfExprLongColumnLongColumnBench, 
> IfExprLongColumnRepeatingLongColumnBench and 
> IfExprRepeatingLongColumnLongColumnBench; I am very confused why there is 
> degradation in these IfExpr cases.
> Here we use {{taskset -cp 1 $pid}} to run the benchmark on single core to 
> avoid the impact of dynamic CPU frequency scaling.
> my script
> {code}
> export JAVA_HOME=/home/zly/jdk-9.0.1/
> export PATH=$JAVA_HOME/bin:$PATH
> export LD_LIBRARY_PATH=/home/zly/jdk-9.0.1/mylib
> for i in 0 1 2; do
> java -server -XX:UseAVX=3 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx3.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> for i in 0 1 2; do
> java -server -XX:UseAVX=2 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx2.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18080) Performance degradation on VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled

2017-11-22 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated HIVE-18080:
--
Attachment: IFExpression_AVX2_Instruction.png

> Performance degradation on 
> VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled
> --
>
> Key: HIVE-18080
> URL: https://issues.apache.org/jira/browse/HIVE-18080
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
> Attachments: IFExpression_AVX2_Instruction.png, 
> log.logic.avx1.single.0, log_logic.avx1.part
>
>
> Use  Xeon(R) Platinum 8180 CPU to test the performance of 
> [AVX512|https://en.wikipedia.org/wiki/AVX-512].
> {code}
> #cat /proc/cpuinfo |grep "model name"|head -n 1
> model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
> {code}
> Before that, I compiled hive with JDK9, as JDK9 enables AVX512.
> Use the hive microbenchmark (HIVE-10189) to evaluate the performance 
> improvement. It seems performance improves (20%+) in the cases in 
> {{VectorizedArithmeticBench}}, {{VectorizedComparisonBench}}, 
> {{VectorizedLikeBench}} and {{VectorizedLogicBench}}, except 
> {{VectorizedLogicBench#IfExprLongColumnLongColumnBench}}, 
> {{VectorizedLogicBench#IfExprRepeatingLongColumnLongColumnBench}} and 
> {{VectorizedLogicBench#IfExprLongColumnRepeatingLongColumnBench}}.
> When I use a Skylake CPU to evaluate the performance improvement of AVX512, 
> I find the performance in VectorizedLogicBench is as follows:
> || ||AVX2 us/op||AVX512 us/op ||  (AVX2-AVX512)/AVX2||
> |ColAndColBench|122510| 87014| 28.9%|
> |IfExprLongColumnLongColumnBench | 1325759| 1436073| -8.3% |
> |IfExprLongColumnRepeatingLongColumnBench|1397447|1480450|  -5.9%|
> |IfExprRepeatingLongColumnLongColumnBench|1401164|1483062|  -5.9% |
> |NotColBench|77042.83|51513.28|  33%|
> There are degradations in IfExprLongColumnLongColumnBench, 
> IfExprLongColumnRepeatingLongColumnBench and 
> IfExprRepeatingLongColumnLongColumnBench; I am very confused why there is 
> degradation in these IfExpr cases.
> Here we use {{taskset -cp 1 $pid}} to run the benchmark on single core to 
> avoid the impact of dynamic CPU frequency scaling.
> my script
> {code}
> export JAVA_HOME=/home/zly/jdk-9.0.1/
> export PATH=$JAVA_HOME/bin:$PATH
> export LD_LIBRARY_PATH=/home/zly/jdk-9.0.1/mylib
> for i in 0 1 2; do
> java -server -XX:UseAVX=3 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx3.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> for i in 0 1 2; do
> java -server -XX:UseAVX=2 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx2.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18080) Performance degradation on VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled

2017-11-22 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263861#comment-16263861
 ] 

liyunzhang commented on HIVE-18080:
---

I use vtune to see the assembly code of the following code (the If expression):
{code}
import java.util.Random;

/**
 * Created by lzhang66 on 11/13/2017.
 */
public class If {
  static long[] in1, in2, in3, out;

  public static void main(String[] args) {
    if (args.length == 3) {
      long warmIter = Long.parseLong(args[0]);
      System.out.println("warmIter num:" + warmIter);
      long iter = Long.parseLong(args[1]);
      System.out.println("iter num:" + iter);
      boolean enableHive10238 = Boolean.parseBoolean(args[2]);
      long startTime = System.currentTimeMillis();
      calc(warmIter, iter, enableHive10238);
      long endTime = System.currentTimeMillis();
      long totalTime = endTime - startTime;
      System.out.println("Total time:" + totalTime);
    } else {
      System.out.println("3 parameters needed. Like: java If [warmIter] [iter] [enableHive10238]");
      System.exit(0);
    }
  }

  public static void calc(long warmIter, long iter, boolean enableHive10238) {
    in1 = new long[1026];
    in2 = new long[1026];
    in3 = new long[1026];
    out = new long[1026];

    Random rand = new Random(435437646);
    // The middle of this method was mangled in the mail archive (everything
    // between a '<' and the next '>' was stripped); the loops below
    // reconstruct the obvious intent: fill the inputs, then evaluate the if
    // expression warmIter + iter times. How enableHive10238 switched the
    // code path was lost in the mangling.
    for (int i = 0; i < in1.length; i++) {
      in1[i] = rand.nextLong();
      in2[i] = rand.nextLong();
      in3[i] = rand.nextLong();
    }
    for (long it = 0; it < warmIter + iter; it++) {
      for (int i1 = 0; i1 < in1.length; i1++) {
        out[i1] = (in1[i1] - 1L) > 0 ? in2[i1] : in3[i1];
      }
    }
  }
}
{code}

Run {{java If 5000 5 true}} to enable the patch of HIVE-10238 and run 
{{java If 5000 5 false}} to disable it. There are three parameters: para1 is 
the warm-up iteration number, para2 is the iteration number, and para3 means 
with/without HIVE-10238's patch. In the attached picture, I saw AVX2 
instructions for the if expression {{(in1[i1] - 1L) > 0 ? in2[i1] : in3[i1]}}.

> Performance degradation on 
> VectorizedLogicBench#IfExprLongColumnLongColumnBench when AVX512 is enabled
> --
>
> Key: HIVE-18080
> URL: https://issues.apache.org/jira/browse/HIVE-18080
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
> Attachments: log.logic.avx1.single.0, log_logic.avx1.part
>
>
> Use  Xeon(R) Platinum 8180 CPU to test the performance of 
> [AVX512|https://en.wikipedia.org/wiki/AVX-512].
> {code}
> #cat /proc/cpuinfo |grep "model name"|head -n 1
> model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
> {code}
> Before that, I compiled hive with JDK9, as JDK9 enables AVX512.
> Use the hive microbenchmark (HIVE-10189) to evaluate the performance 
> improvement. It seems performance improves (20%+) in the cases in 
> {{VectorizedArithmeticBench}}, {{VectorizedComparisonBench}}, 
> {{VectorizedLikeBench}} and {{VectorizedLogicBench}}, except 
> {{VectorizedLogicBench#IfExprLongColumnLongColumnBench}}, 
> {{VectorizedLogicBench#IfExprRepeatingLongColumnLongColumnBench}} and 
> {{VectorizedLogicBench#IfExprLongColumnRepeatingLongColumnBench}}.
> When I use a Skylake CPU to evaluate the performance improvement of AVX512, 
> I find the performance in VectorizedLogicBench is as follows:
> || ||AVX2 us/op||AVX512 us/op ||  (AVX2-AVX512)/AVX2||
> |ColAndColBench|122510| 87014| 28.9%|
> |IfExprLongColumnLongColumnBench | 1325759| 1436073| -8.3% |
> |IfExprLongColumnRepeatingLongColumnBench|1397447|1480450|  -5.9%|
> |IfExprRepeatingLongColumnLongColumnBench|1401164|1483062|  -5.9% |
> |NotColBench|77042.83|51513.28|  33%|
> There are degradations in IfExprLongColumnLongColumnBench, 
> IfExprLongColumnRepeatingLongColumnBench and 
> IfExprRepeatingLongColumnLongColumnBench; I am very confused why there is 
> degradation in these IfExpr cases.
> Here we use {{taskset -cp 1 $pid}} to run the benchmark on single core to 
> avoid the impact of dynamic CPU frequency scaling.
> my script
> {code}
> export JAVA_HOME=/home/zly/jdk-9.0.1/
> export PATH=$JAVA_HOME/bin:$PATH
> export LD_LIBRARY_PATH=/home/zly/jdk-9.0.1/mylib
> for i in 0 1 2; do
> java -server -XX:UseAVX=3 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx3.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> for i in 0 1 2; do
> java -server -XX:UseAVX=2 -jar benchmarks.jar 
> org.apache.hive.benchmark.vectorization.VectorizedLogicBench * -wi 10 -i 20 
> -f 1 -bm avgt -tu us >log.logic.avx2.single.$i & export pid=$!
> taskset -cp 1 $pid
> wait $pid
> done
> {code}