Re: Spark Sql with python udf fail
sql("SELECT * FROM ").foreach(println) can be executed successfully, so the problem may still be in the UDF code. How can I print the line that triggers the ArrayIndexOutOfBoundsException in Catalyst?

2015-03-23 17:04 GMT+08:00 lonely Feb:
> ok i'll try asap
>
> 2015-03-23 17:00 GMT+08:00 Cheng Lian:
>> I suspect there is a malformed row in your input dataset. Could you try
>> something like this to confirm:
>>
>> sql("SELECT * FROM ").foreach(println)
>>
>> If there does exist a malformed line, you should see a similar exception,
>> and you can catch it with the help of the output. Note that the messages
>> are printed to stdout on the executor side.
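One way to act on Cheng Lian's suggestion is to guard the UDF itself. The sketch below is hypothetical (the `process_fields` stand-in, the tab separator, and the 10-column width are assumptions, not the poster's actual job): rows with the wrong column count are reported to stdout, which lands in the executor log, instead of producing a short row that later crashes Catalyst.

```python
# Hedged sketch: a defensive wrapper for a line-oriented Python UDF.
# `expected_cols` and the tab separator are assumptions about the job.
def safe_udf(fn, expected_cols, sep="\t"):
    def wrapped(line):
        fields = line.rstrip("\n").split(sep)
        if len(fields) != expected_cols:
            # Printed on the executor, so it shows up in the executor log.
            print("malformed line (%d columns): %r" % (len(fields), line))
            return None
        return fn(fields)
    return wrapped

# Hypothetical usage: `process_fields` stands in for the real UDF body.
process_fields = lambda fields: fields
udf = safe_udf(process_fields, expected_cols=10)
```

The returned None rows can then be filtered out (or counted) before the result feeds the aggregation, so the job keeps running while the bad lines are logged.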
Re: Spark Sql with python udf fail
OK, I'll try ASAP.

2015-03-23 17:00 GMT+08:00 Cheng Lian:
> I suspect there is a malformed row in your input dataset. Could you try
> something like this to confirm:
>
> sql("SELECT * FROM ").foreach(println)
>
> If there does exist a malformed line, you should see a similar exception,
> and you can catch it with the help of the output. Note that the messages
> are printed to stdout on the executor side.
Re: Spark Sql with python udf fail
I caught exceptions in the Python UDF code, flushed the exceptions into a single file, and made sure the column count of the output lines matches the SQL schema.

Something interesting: each output line of my UDF has exactly 10 columns, yet the exception above is java.lang.ArrayIndexOutOfBoundsException: 9. Any ideas?

2015-03-23 16:24 GMT+08:00 Cheng Lian:
> Could you elaborate on the UDF code?
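For reference, ArrayIndexOutOfBoundsException: 9 does not contradict a 10-column UDF output: it means some row that reached the expression had at most 9 fields, and the failing access is to index 9, i.e. the 10th column. A minimal Python analogue of that off-by-one (an illustration, not Catalyst's code):

```python
# A row that is one field short fails exactly at index 9 (the 10th column).
row = "a\tb\tc\td\te\tf\tg\th\ti".split("\t")  # only 9 fields
assert len(row) == 9
try:
    _ = row[9]          # analogous to the BoundReference access in the trace
    reached = False
except IndexError:      # Python's analogue of ArrayIndexOutOfBoundsException
    reached = True
assert reached
```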
Spark Sql with python udf fail
Hi all,

I tried to migrate some Hive jobs to Spark SQL. When I ran a SQL job with a Python UDF I got an exception:

java.lang.ArrayIndexOutOfBoundsException: 9
        at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
        at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
        at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
        at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
        at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
        at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
        at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
        at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I suspect there is an odd line in the input file, but the input file is so large that I could not find any abnormal lines even with several checking jobs. How can I find the abnormal line here?
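A sketch of one way to locate the abnormal line without eyeballing the whole file: stream it and report every line whose field count differs from the schema width. The tab separator and the 10-column width are assumptions about the input format.

```python
def abnormal_lines(lines, expected_cols, sep="\t"):
    """Yield (line_number, line) for every line with the wrong field count."""
    for lineno, line in enumerate(lines, 1):
        if len(line.rstrip("\n").split(sep)) != expected_cols:
            yield lineno, line

# Hypothetical usage against a local copy of the input:
# with open("input.tsv") as f:
#     for lineno, line in abnormal_lines(f, expected_cols=10):
#         print(lineno, repr(line))
```

Because it streams, the same check also runs comfortably on a very large file, or can be applied per-partition as a mapPartitions-style pass on the cluster.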
Fwd: Problems with TeraValidate
+spark-user

---------- Forwarded message ----------
From: lonely Feb
Date: 2015-01-16 19:09 GMT+08:00
Subject: Re: Problems with TeraValidate
To: Ewan Higgs

Thanks a lot. BTW, here is my output:

1. When the dataset is 1000g:

num records: 100
checksum: 12aa5028310ea763e
part 0
lastMax ArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
part 1
lastMax ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
Exception in thread "main" java.lang.AssertionError: assertion failed: current partition min < last partition max
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:117)
        at org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:111)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at org.apache.spark.examples.terasort.TeraValidate$.validate(TeraValidate.scala:111)
        at org.apache.spark.examples.terasort.TeraValidate$.main(TeraValidate.scala:59)
        at org.apache.spark.examples.terasort.TeraValidate.main(TeraValidate.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2. When the dataset is 200m:

num records: 200
checksum: ca93e5d2fad40
part 0
lastMax ArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(82, 24, 27, 218, 62, 68, 174, 208, 69, 78)
max ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
part 1
lastMax ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
min ArrayBuffer(82, 24, 27, 218, 62, 68, 174, 208, 69, 78)
max ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
Exception in thread "main" java.lang.AssertionError: assertion failed: current partition min < last partition max
        (same stack trace as above)

I suspect sth. is wrong with the function "clone".

2015-01-16 19:02 GMT+08:00 Ewan Higgs:
> Hi lonely,
> I am looking at this now. If you need to validate a terasort benchmark as
> soon as possible, I would use Hadoop's TeraValidate.
>
> I'll let you know when I have a fix.
>
> Yours,
> Ewan Higgs
>
> On 16/01/15 09:47, lonely Feb wrote:
>> Hi, I ran your terasort program on my spark cluster. When the dataset is
>> small (below 1000g) everything goes fine, but when the dataset is over
>> 1000g, TeraValidate always fails the assertion with:
>> current partition min < last partition max
>>
>> e.g. the output is:
>> num records: 100
>> checksum: 12aa5028310ea763e
>> part 0
>> lastMax ArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
>> min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
>> max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
>> part 1
>> lastMax ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
>> min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
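The suspicion about "clone" is plausible: if a record reader reuses one mutable buffer, every min/max reference captured without copying ends up pointing at the same bytes, which would explain why min and max look identical across partitions in the output above. A small Python illustration of the pitfall (not TeraValidate's actual code):

```python
# Keeping references to a reused buffer: both entries show the last record.
buf = bytearray(3)
kept = []
for rec in (b"abc", b"xyz"):
    buf[:] = rec
    kept.append(buf)                 # reference, not a copy
assert kept[0] == kept[1] == b"xyz"  # the earlier value is lost

# Copying ("cloning") the buffer preserves each record.
buf = bytearray(3)
copied = []
for rec in (b"abc", b"xyz"):
    buf[:] = rec
    copied.append(bytes(buf))        # an explicit copy
assert copied == [b"abc", b"xyz"]
```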
terasort on spark
Hi all,

I tried to run a terasort benchmark on my spark cluster, but I found it is hard to find a standard spark terasort program, except a PR from rxin and a branch from Ewan Higgs:

https://github.com/apache/spark/pull/1242
https://github.com/ehiggs/spark/tree/terasort

The example rxin provided comes without a validate step, so I tried Higgs's example, but sadly I always get an error during validation:

assertion failed: current partition min < last partition max

It seems to require that the min array of partition 2 be bigger than the max array of partition 1, but the code here is confusing:

println(s"lastMax" + lastMax.toSeq.map(x => if (x < 0) 256 + x else x))
println(s"min " + min.toSeq.map(x => if (x < 0) 256 + x else x))
println(s"max " + max.toSeq.map(x => if (x < 0) 256 + x else x))

Has anyone run the terasort example successfully? Or where can I get a standard terasort application?
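The `if (x < 0) 256 + x else x` mapping in those printlns converts Java's signed bytes to unsigned values for display, which hints at a classic terasort pitfall: comparing keys as signed bytes gives a different order than the unsigned byte order terasort requires. A Python sketch of the disagreement (an illustration, not the example's actual comparator):

```python
def unsigned(b):
    # Map a Java signed byte (-128..127) to its unsigned value (0..255).
    return 256 + b if b < 0 else b

# The bytes 0x90 and 0x10 as Java signed values:
a, b = -112, 16
assert a < b                          # signed order says a sorts first
assert unsigned(a) > unsigned(b)      # unsigned (terasort) order disagrees
```

If the sort and the validator disagree on which of these two orders to use, a "current partition min < last partition max" failure is exactly the symptom you would see.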
View executor logs on YARN mode
Hi all,

Sadly, on YARN mode I cannot view executor logs on the YARN web UI or on the Spark history web UI. On the YARN web UI I can only view the AppMaster logs, and on the Spark history web UI I can only view application metric information. If I want to see whether an executor is in a full GC, I can only use the "yarn logs" command, which is very unfriendly. BTW, "yarn logs" requires a containerID option which I could not find on the YARN web UI either; I had to grep for it in the AppMaster log. I just wonder how you handle this situation?
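One sketch of a workaround for the missing container ID: container IDs follow a fixed `container_<clusterTimestamp>_<appId>_<attempt>_<container>` pattern, so they can be extracted from the AppMaster log mechanically rather than hunted down by hand. The sample log line below is made up; the `yarn logs` options named in the comment (`-applicationId`, and `-containerId` with `-nodeAddress` in recent Hadoop releases) are the documented ones.

```python
import re

# Extract container IDs from AppMaster log text (the sample line is made up).
line = "Launching container container_1421383800000_0007_01_000002 on host foo"
ids = re.findall(r"container_\d+_\d+_\d+_\d+", line)
assert ids == ["container_1421383800000_0007_01_000002"]

# Each ID can then be passed to the CLI, e.g.:
#   yarn logs -applicationId <appId> -containerId <id> -nodeAddress <host:port>
```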