[ https://issues.apache.org/jira/browse/SPARK-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342094#comment-14342094 ]

Apache Spark commented on SPARK-6082:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4842

> SparkSQL should fail gracefully when input data format doesn't match expectations
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6082
>                 URL: https://issues.apache.org/jira/browse/SPARK-6082
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>            Reporter: Kay Ousterhout
>
> I have a UDF that creates a tab-delimited table. If any of the column values
> contain a tab, SQL fails with an ArrayIndexOutOfBoundsException (pasted
> below). It would be great if SQL failed gracefully here, with a helpful
> exception (something like "One row contained too many values").
> It looks like this can be fixed quite easily, by checking at the line below
> whether i > columnBuilders.size and, if so, throwing a nicer exception:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala#L124
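>
> A minimal sketch of that check, in Scala against the Spark 1.2 codebase (the
> exception type and message here are illustrative, not necessarily what the
> actual fix uses):
>
>   // Hypothetical guard around the row-append loop in InMemoryColumnarTableScan;
>   // `row` and `columnBuilders` are the existing locals in that file. In 1.2,
>   // Row extends Seq[Any], so row.length is the number of fields in the row.
>   if (row.length > columnBuilders.length) {
>     throw new org.apache.spark.SparkException(
>       s"Row has ${row.length} fields, but the schema only expects " +
>       s"${columnBuilders.length} columns: one row contained too many values")
>   }
>   var i = 0
>   while (i < row.length) {
>     columnBuilders(i).appendFrom(row, i)
>     i += 1
>   }
>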
> One thing that makes this problem especially annoying to debug is that if you
> do "CREATE table foo as select transform(..." and then "CACHE table foo", it
> works fine. It only fails if you do "CACHE table foo as select
> transform(...". Because of this, it would be great if the problem were more
> transparent to users.
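>
> For reference, a hedged sketch of the two command orderings from a Spark 1.2
> shell (the concrete TRANSFORM query is hypothetical: '/bin/cat' stands in for
> the actual UDF/script, and `src` for the source table):
>
>   import org.apache.spark.sql.hive.HiveContext
>   val hiveContext = new HiveContext(sc)
>
>   // Works: the transform output is materialized into a table first,
>   // then the stored table is cached.
>   hiveContext.sql(
>     "CREATE TABLE foo AS SELECT TRANSFORM (k, v) USING '/bin/cat' AS (k, v) FROM src")
>   hiveContext.sql("CACHE TABLE foo")
>
>   // Fails while building the in-memory columnar cache if any transformed
>   // value contains an embedded tab: TRANSFORM delimits fields with tabs, so
>   // the echoed row splits into more fields than the declared (k, v) schema.
>   hiveContext.sql(
>     "CACHE TABLE bar AS SELECT TRANSFORM (k, v) USING '/bin/cat' AS (k, v) FROM src")
>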
> Stack trace:
> java.lang.ArrayIndexOutOfBoundsException: 3
>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:125)
>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:112)
>   at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)


