Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
sql("SELECT * FROM ").foreach(println)

can be executed successfully, so the problem may still be in the UDF code. How
can I print the line that triggers the ArrayIndexOutOfBoundsException in
Catalyst?
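
One idea I have for narrowing it down (a rough sketch in spark-shell; the path,
delimiter and column count below are placeholders, not the real job's values)
is to scan the raw input for lines whose field count does not match the schema:

    val raw = sc.textFile("/path/to/input")            // placeholder path
    val expectedCols = 10                              // columns in the SQL schema
    val bad = raw.zipWithIndex().filter { case (line, _) =>
      line.split("\t", -1).length != expectedCols      // -1 keeps trailing empty fields
    }
    bad.take(20).foreach { case (line, idx) => println(s"line $idx: $line") }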

2015-03-23 17:04 GMT+08:00 lonely Feb :

> OK, I'll try it ASAP.
>
> 2015-03-23 17:00 GMT+08:00 Cheng Lian :
>
>>  I suspect there is a malformed row in your input dataset. Could you try
>> something like this to confirm:
>>
>> sql("SELECT * FROM ").foreach(println)
>>
>> If there is a malformed line, you should see a similar exception, and the
>> output will help you track it down. Note that the messages are printed to
>> stdout on the executor side.
>>
>> On 3/23/15 4:36 PM, lonely Feb wrote:
>>
>> I caught exceptions in the Python UDF code, flushed them into a single
>> file, and made sure the number of columns in each output line matches the
>> SQL schema.
>>
>> Something interesting: each output line of the UDF has exactly 10 columns,
>> yet the exception above is java.lang.ArrayIndexOutOfBoundsException: 9. Any
>> ideas?
>>
>> 2015-03-23 16:24 GMT+08:00 Cheng Lian :
>>
>>> Could you elaborate on the UDF code?
>>>
>>>
>>> On 3/23/15 3:43 PM, lonely Feb wrote:
>>>
>>>> Hi all, I tried to migrate some Hive jobs to Spark SQL. When I ran a SQL
>>>> job with a Python UDF I got an exception:
>>>>
>>>> java.lang.ArrayIndexOutOfBoundsException: 9
>>>> at
>>>> org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>>>> at
>>>> org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>>>> at
>>>> org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
>>>> at
>>>> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>>> at
>>>> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>> at
>>>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
>>>> at
>>>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>>> at
>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>>> at
>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>>> at
>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> I suspected there was an odd line in the input file, but the file is so
>>>> large that I could not find any abnormal lines even after several checking
>>>> jobs. How can I get hold of the abnormal line here?
>>>>
>>>
>>>
>>
>>
>
>


Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
OK, I'll try it ASAP.

2015-03-23 17:00 GMT+08:00 Cheng Lian :

>  I suspect there is a malformed row in your input dataset. Could you try
> something like this to confirm:
>
> sql("SELECT * FROM ").foreach(println)
>
> If there is a malformed line, you should see a similar exception, and the
> output will help you track it down. Note that the messages are printed to
> stdout on the executor side.
>
> On 3/23/15 4:36 PM, lonely Feb wrote:
>
> I caught exceptions in the Python UDF code, flushed them into a single
> file, and made sure the number of columns in each output line matches the
> SQL schema.
>
> Something interesting: each output line of the UDF has exactly 10 columns,
> yet the exception above is java.lang.ArrayIndexOutOfBoundsException: 9. Any
> ideas?
>
> 2015-03-23 16:24 GMT+08:00 Cheng Lian :
>
>> Could you elaborate on the UDF code?
>>
>>
>> On 3/23/15 3:43 PM, lonely Feb wrote:
>>
>>> Hi all, I tried to migrate some Hive jobs to Spark SQL. When I ran a SQL
>>> job with a Python UDF I got an exception:
>>>
>>> java.lang.ArrayIndexOutOfBoundsException: 9
>>> at
>>> org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at
>>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
>>> at
>>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>>
>>> I suspected there was an odd line in the input file, but the file is so
>>> large that I could not find any abnormal lines even after several checking
>>> jobs. How can I get hold of the abnormal line here?
>>>
>>
>>
>
>


Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
I caught exceptions in the Python UDF code, flushed them into a single file,
and made sure the number of columns in each output line matches the SQL
schema.

Something interesting: each output line of the UDF has exactly 10 columns, yet
the exception above is java.lang.ArrayIndexOutOfBoundsException: 9. Any ideas?
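
One check I am planning (a rough sketch; the path and delimiter are
placeholders for the real ones) is to look at the distribution of field counts
per input line, since a failure at index 9 suggests some row carries fewer
than 10 fields:

    // histogram of fields-per-line; anything other than 10 would be suspect
    sc.textFile("/path/to/input")
      .map(_.split("\t", -1).length)
      .countByValue()
      .foreach(println)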

2015-03-23 16:24 GMT+08:00 Cheng Lian :

> Could you elaborate on the UDF code?
>
>
> On 3/23/15 3:43 PM, lonely Feb wrote:
>
>> Hi all, I tried to migrate some Hive jobs to Spark SQL. When I ran a SQL
>> job with a Python UDF I got an exception:
>>
>> java.lang.ArrayIndexOutOfBoundsException: 9
>> at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>> at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>> at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
>> at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>> at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
>> at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>>
>> I suspected there was an odd line in the input file, but the file is so
>> large that I could not find any abnormal lines even after several checking
>> jobs. How can I get hold of the abnormal line here?
>>
>
>


Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
Hi all, I tried to migrate some Hive jobs to Spark SQL. When I ran a SQL job
with a Python UDF I got an exception:

java.lang.ArrayIndexOutOfBoundsException: 9
at
org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
at
org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
at
org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
at
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:156)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

I suspected there was an odd line in the input file, but the file is so large
that I could not find any abnormal lines even after several checking jobs. How
can I get hold of the abnormal line here?


Fwd: Problems with TeraValidate

2015-01-16 Thread lonely Feb
+spark-user

-- Forwarded message --
From: lonely Feb 
Date: 2015-01-16 19:09 GMT+08:00
Subject: Re: Problems with TeraValidate
To: Ewan Higgs 


Thanks a lot.
BTW, here is my output:

1. When the dataset is 1000g:
num records: 100
checksum: 12aa5028310ea763e
part 0
lastMaxArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
part 1
lastMaxArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
Exception in thread "main" java.lang.AssertionError: assertion failed:
current partition min < last partition max
at scala.Predef$.assert(Predef.scala:179)
at
org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:117)
at
org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:111)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.examples.terasort.TeraValidate$.validate(TeraValidate.scala:111)
at
org.apache.spark.examples.terasort.TeraValidate$.main(TeraValidate.scala:59)
at
org.apache.spark.examples.terasort.TeraValidate.main(TeraValidate.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2. When the dataset is 200m:
num records: 200
checksum: ca93e5d2fad40
part 0
lastMaxArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(82, 24, 27, 218, 62, 68, 174, 208, 69, 78)
max ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
part 1
lastMaxArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
min ArrayBuffer(82, 24, 27, 218, 62, 68, 174, 208, 69, 78)
max ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
Exception in thread "main" java.lang.AssertionError: assertion failed:
current partition min < last partition max
at scala.Predef$.assert(Predef.scala:179)
at
org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:117)
at
org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:111)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.examples.terasort.TeraValidate$.validate(TeraValidate.scala:111)
at
org.apache.spark.examples.terasort.TeraValidate$.main(TeraValidate.scala:59)
at
org.apache.spark.examples.terasort.TeraValidate.main(TeraValidate.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I suspect something is wrong with the "clone" function.
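
If the keys come straight out of a Hadoop InputFormat, another possibility is
that the record reader reuses the same buffer for every record, so min/max end
up referencing whatever was read last unless the bytes are copied. A rough
sketch of the defensive copy I have in mind (the dataset and variable names
here are made up, not the ones from the example):

    // clone the key/value bytes before caching or aggregating them, so a later
    // read of the reused buffer cannot overwrite the values already collected
    val safeDataset = dataset.map { case (key, value) =>
      (key.clone(), value.clone())   // Array[Byte].clone() gives independent copies
    }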

2015-01-16 19:02 GMT+08:00 Ewan Higgs :

> Hi lonely,
> I am looking at this now. If you need to validate a terasort benchmark as
> soon as possible, I would use Hadoop's TeraValidate.
>
> I'll let you know when I have a fix.
>
> Yours,
> Ewan Higgs
>
>
> On 16/01/15 09:47, lonely Feb wrote:
>
>> Hi, I ran your terasort program on my Spark cluster. When the dataset is
>> small (below 1000g) everything goes fine, but when the dataset is over
>> 1000g, TeraValidate always fails the assertion with:
>> current partition min < last partition max
>>
>> e.g. the output is:
>> num records: 100
>> checksum: 12aa5028310ea763e
>> part 0
>> lastMaxArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
>> min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
>> max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
>> part 1
>> lastMaxArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
>> min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
>>

terasort on spark

2015-01-16 Thread lonely Feb
Hi all, I tried to run a terasort benchmark on my Spark cluster, but I found
it hard to find a standard Spark terasort program, apart from a PR by rxin and
Ewan Higgs:

https://github.com/apache/spark/pull/1242
https://github.com/ehiggs/spark/tree/terasort

The example rxin provided comes without a validation step, so I tried Higgs's
example, but sadly I always get an error during validation:

assertion failed: current partition min < last partition max

It seems to require that the min array of partition 2 be bigger than the max
array of partition 1, but the code here is confusing:

println(s"lastMax" + lastMax.toSeq.map(x => if (x < 0) 256 + x else
x))
println(s"min " + min.toSeq.map(x => if (x < 0) 256 + x else x))
println(s"max " + max.toSeq.map(x => if (x < 0) 256 + x else x))

Has anyone ever run the terasort example successfully? Or where can I get a
standard terasort application?


View executor logs on YARN mode

2015-01-14 Thread lonely Feb
Hi all, I am sad to find that in YARN mode I cannot view executor logs on the
YARN web UI or on the Spark history web UI. On the YARN web UI I can only view
the AppMaster logs, and on the Spark history web UI I can only view application
metric information. If I want to see whether an executor is doing full GCs, I
can only use the "yarn logs" command, which is very unfriendly. Besides, "yarn
logs" requires a container ID option, which I could not find on the YARN web UI
either; I have to grep it out of the AppMaster log.
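
For reference, the workaround I use for now looks roughly like this (IDs below
are placeholders; the container ID is the one grepped out of the AppMaster log,
and depending on the Hadoop version the NodeManager address may also be
required):

    yarn logs -applicationId application_1421212345678_0001 \
              -containerId container_1421212345678_0001_01_000002 \
              -nodeAddress nodemanager-host:45454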

I just wonder how do you handle this situation?