BTW, I knew this because the top line of the stack trace was "<console>:21". Anytime you see "<console>" in a stack trace, it means the code in question is something you typed into the REPL.
On Wed, Jul 23, 2014 at 11:55 AM, Michael Armbrust <mich...@databricks.com> wrote:

> Looks like a bug in your lambda function. Some of the lines you are
> processing must have less than 6 elements, so doing p(5) is failing.
>
>
> On Wed, Jul 23, 2014 at 11:44 AM, buntu <buntu...@gmail.com> wrote:
>
>> Thanks Michael.
>>
>> If I read in multiple files and attempt to saveAsParquetFile() I get the
>> ArrayIndexOutOfBoundsException. I don't see this if I try the same with a
>> single file:
>>
>> > case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String)
>> >
>> > val point = sc.textFile("data/raw_data_*").map(_.split("\t")).map(p =>
>> >   Point(df.format(new Date(p(0).trim.toLong*1000L)), p(1), p(2),
>> >     p(3).trim.toInt, p(4).trim.toInt, p(5)))
>> >
>> > point.saveAsParquetFile("point.parquet")
>>
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>> 14/07/23 11:30:54 ERROR Executor: Exception in task ID 18
>> java.lang.ArrayIndexOutOfBoundsException: 1
>>     at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>>     at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:248)
>>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Is this due to the amount of data (about 5M rows) being processed? I've set
>> the SPARK_DRIVER_MEMORY to 8g.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data-files-to-Parquet-format-tp10526p10536.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
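
[Editor's note: a minimal sketch of the guard Michael's diagnosis implies, not an official fix from the thread. It drops tab-split rows with fewer than 6 fields before building Point, so p(5) can't go out of bounds. The date pattern for df is an assumption (its original definition isn't quoted), and the snippet assumes the same spark-shell session as above, with sc and the SQLContext implicits already in scope.]

    import java.util.Date
    import java.text.SimpleDateFormat

    case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String)

    // Assumed pattern -- the thread never shows how df was defined.
    val df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    val point = sc.textFile("data/raw_data_*")
      .map(_.split("\t", -1))   // limit -1 keeps trailing empty fields
      .filter(_.length >= 6)    // skip malformed/short rows instead of throwing on p(5)
      .map(p => Point(df.format(new Date(p(0).trim.toLong * 1000L)),
                      p(1), p(2), p(3).trim.toInt, p(4).trim.toInt, p(5)))

    point.saveAsParquetFile("point.parquet")

[Rows where the tz or success columns aren't parseable integers would still fail on toInt; wrapping the whole row parse in a Try inside a flatMap would cover those cases too.]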