sql("SELECT * FROM <your-table>").foreach(println) can be executed successfully. So the problem may still be in UDF code. How can i print the the line with ArrayIndexOutOfBoundsException in catalyst?
2015-03-23 17:04 GMT+08:00 lonely Feb <lonely8...@gmail.com>:

> OK, I'll try ASAP.
>
> 2015-03-23 17:00 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:
>
>> I suspect there is a malformed row in your input dataset. Could you try
>> something like this to confirm:
>>
>> sql("SELECT * FROM <your-table>").foreach(println)
>>
>> If there does exist a malformed line, you should see a similar exception,
>> and you can catch it with the help of the output. Notice that the messages
>> are printed to stdout on the executor side.
>>
>> On 3/23/15 4:36 PM, lonely Feb wrote:
>>
>> I caught exceptions in the Python UDF code, flushed the exceptions into a
>> single file, and made sure the column count of the output lines is the
>> same as the SQL schema.
>>
>> Something interesting: each output line of my UDF code has exactly 10
>> columns, yet the exception above is
>> java.lang.ArrayIndexOutOfBoundsException: 9. Does that suggest anything?
>>
>> 2015-03-23 16:24 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:
>>
>>> Could you elaborate on the UDF code?
>>>
>>> On 3/23/15 3:43 PM, lonely Feb wrote:
>>>
>>>> Hi all, I tried to migrate some Hive jobs to Spark SQL. When I ran a
>>>> SQL job with a Python UDF, I got an exception:
>>>>
>>>> java.lang.ArrayIndexOutOfBoundsException: 9
>>>>     at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>>>>     at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>>>>     at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
>>>>     at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>>>     at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>>>     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>>>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>     at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
>>>>     at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> I suspected there was an odd line in the input file, but the input file
>>>> is so large that I could not find any abnormal lines even after several
>>>> checking jobs. How can I get the abnormal line here?
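P.S. For the original question at the bottom of the thread (locating the abnormal input line directly), a sketch that scans the raw file, assuming a tab-delimited text layout; the path, delimiter, and expected column count are all placeholders to adjust:

  val expected = 10
  sc.textFile("<path-to-input>")
    .filter(_.split("\t", -1).length != expected)  // limit -1 keeps trailing empty fields
    .take(10)                                      // just a sample of bad lines
    .foreach(println)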