[jira] [Comment Edited] (HIVE-16958) Setting hive.merge.sparkfiles=true will return an error when generating parquet databases

2017-06-27 Thread Liu Chunxiao (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064419#comment-16064419 ]

Liu Chunxiao edited comment on HIVE-16958 at 6/27/17 7:55 AM:
--

Hi [~lirui], I have attached the DataGen Maven project files (Main.java, 
RowGenerator.java, pom.xml) and hive-site.xml. The code makes it easy to 
generate test data on HDFS: the default text file is about 500 MB, the Parquet 
output is about 200 MB, and the data is written into 10 dynamically generated 
partitions. Each file in a partition is about 10 MB, so a partition needs at 
least 2 files for them to be merged.
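To make the reproduction concrete, the scenario above can be sketched in HiveQL 
roughly as follows. This is only an illustration: the table names (sale_text, 
sale_parquet), the columns, and the HDFS path are invented for the example and 
are not taken from sale.sql or the DataGen code; the sizes correspond to the 
DataGen defaults mentioned above.

SET hive.execution.engine=spark;
SET hive.merge.sparkfiles=true;                 -- the setting that triggers the reported failure
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical source table holding the ~500 MB text file produced by DataGen.
CREATE TABLE sale_text (id BIGINT, amount DOUBLE, region STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/datagen/sale.txt' INTO TABLE sale_text;

-- Hypothetical Parquet target, partitioned so the insert creates ~10 dynamic partitions.
CREATE TABLE sale_parquet (id BIGINT, amount DOUBLE)
  PARTITIONED BY (region STRING) STORED AS PARQUET;

-- Each partition ends up with several ~10 MB files (at least 2 per partition are needed),
-- and with hive.merge.sparkfiles=true the merge stage fails as in the stack trace below.
INSERT OVERWRITE TABLE sale_parquet PARTITION (region)
SELECT id, amount, region FROM sale_text;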


was (Author: liu765940375):
Hi @Rui Li, I have added the DataGen mvn project file(Main.java, 
RowGenerator.java, pom.xml) and hive-site.xml in attach files. It is easy to 
generate data on hdfs by the code(Default textfile is 500M. Parquet file is 
about 200M, and it generates 10 partitions dynamically. One file in partition 
is about 10M, so you need at least 2 files in one partition to merge them).

> Setting hive.merge.sparkfiles=true will return an error when generating parquet databases
> --
>
> Key: HIVE-16958
> URL: https://issues.apache.org/jira/browse/HIVE-16958
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.2.0, 2.3.0
> Environment: CentOS 7, Hadoop 2.7.3, Spark 2.0.0
> Reporter: Liu Chunxiao
> Priority: Minor
> Attachments: hive-site.xml, Main.java, parquet-hivemergesparkfiles.txt, pom.xml, RowGenerator.java, sale.sql
>
>
> The process returns the following error:
> Job failed with java.lang.NullPointerException
> FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. java.util.concurrent.ExecutionException: Exception thrown by job
>   at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
>   at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
>   at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362)
>   at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 31, bdpe822n1): java.io.IOException: java.lang.reflect.InvocationTargetException
>   at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
>   at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:271)
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:217)
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:345)
>   at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:695)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
