Hello Terry,

How many columns does pqt_rdt_snappy have?
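If it's easier, the output of something like the following (a rough sketch, reusing the HiveContext hc from your transcript below) would also tell us:

scala> hc.sql("DESCRIBE pqt_rdt_snappy").collect().foreach(println)

scala> hc.sql("select * from pqt_rdt_snappy limit 1").printSchema()

The first should list one row per column (partition columns included); the second prints the schema that Spark SQL resolves for the table.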
Thanks,
Yin

On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <terry....@smartfocus.com> wrote:

> Hi Michael,
>
> That worked for me. At least I'm now further than I was. Thanks for the tip!
>
> -Terry
>
> From: Michael Armbrust <mich...@databricks.com>
> Date: Monday, October 13, 2014 at 5:05 PM
> To: Terry Siu <terry....@smartfocus.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
> There are some known bugs with the Parquet SerDe and Spark 1.1.
>
> You can try setting spark.sql.hive.convertMetastoreParquet=true to cause
> Spark SQL to use its built-in Parquet support when the SerDe looks like
> Parquet.
>
> On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu <terry....@smartfocus.com> wrote:
>
>> I am currently using Spark 1.1.0 compiled against Hadoop 2.3. Our cluster
>> is CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables that
>> point to Parquet (compressed with Snappy), which were converted over from
>> Avro, if that matters.
>>
>> I am trying to perform a join with these two Hive tables, but am
>> encountering an exception. In a nutshell, I launch a spark shell, create my
>> HiveContext (pointing to the correct metastore on our cluster), and then
>> proceed to do the following:
>>
>> scala> val hc = new HiveContext(sc)
>>
>> scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 1325376000000 and translate <= 1340063999999")
>>
>> scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'")
>>
>> scala> txn.registerAsTable("segTxns")
>>
>> scala> segcust.registerAsTable("segCusts")
>>
>> scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")
>>
>> Straightforward enough, but I get the following exception:
>>
>> 14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
>>
>> java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
>>
>>     at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>>     at java.util.ArrayList.get(ArrayList.java:411)
>>     at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
>>     at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> My table, pqt_segcust_snappy, has 21 columns and two partitions defined.
>> Does this error look familiar to anyone? Could my usage of SparkSQL with
>> Hive be incorrect, or is support for Hive/Parquet/partitioning still buggy
>> at this point in Spark 1.1.0?
>>
>> Thanks,
>>
>> -Terry
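For reference, Michael's suggested setting can be applied directly on the HiveContext before re-running the join; a minimal sketch, assuming the same hc and registered tables as in the transcript above:

scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")

scala> // retry the failing join, now using Spark SQL's built-in Parquet support
scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")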