unsubscribe

2017-04-13 Thread tian zhang



Re: Spark streaming checkpoint against s3

2015-10-15 Thread Tian Zhang
So as long as the jar is kept on S3 and available across different runs, the
S3 checkpoint works.







Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
Hi, I am trying to set the Spark Streaming checkpoint to S3. Here is what I
did, basically:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.streaming.StreamingContext

val checkpointDir = "s3://myBucket/checkpoint"
val ssc = StreamingContext.getOrCreate(checkpointDir,
  () => getStreamingContext(sparkJobName,
                            batchDurationSec,
                            classOf[MyClassKryoRegistrator],
                            checkpointDir),
  getHadoopConfiguration)

def getHadoopConfiguration: Configuration = {
  val hadoopConf = new Configuration()
  hadoopConf.set("fs.defaultFS", "s3://" + myBucket + "/")
  hadoopConf.set("fs.s3.awsAccessKeyId", "myAccessKey")
  hadoopConf.set("fs.s3.awsSecretAccessKey", "mySecretKey")
  hadoopConf.set("fs.s3n.awsAccessKeyId", "myAccessKey")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "mySecretKey")
  hadoopConf
}

It is working, in the sense that I can see it try to retrieve the checkpoint
from S3.

However, it did more than I intended. I saw the following in the log:

15/10/14 19:58:47 ERROR spark.SparkContext: Jar not found at
file:/media/ephemeral0/oncue/mesos-slave/slaves/20151007-172900-436893194-5050-2984-S9/frameworks/20150825-180042-604730890-5050-4268-0003/executors/tian-act-reg.47368a1a-71f9-11e5-ad61-de5fb3a867da/runs/dfc28a6c-48a0-464b-bdb1-d6dd057acd51/artifacts/rna-spark-streaming.jar

So SparkContext is now trying to look at the following path instead of the
local one:

file:/media/ephemeral0/oncue/mesos-slave/slaves/20151007-172900-436893194-5050-2984-S9/frameworks/20150825-180042-604730890-5050-4268-0003/executors/tian-act-reg.47368a1a-71f9-11e5-ad61-de5fb3a867da/runs/dfc28a6c-48a0-464b-bdb1-d6dd057acd51/artifacts/rna-spark-streaming.jar

How do I get SparkContext to look at just
/media/ephemeral0/oncue/mesos-slave/slaves/20151007-172900-436893194-5050-2984-S9/frameworks/20150825-180042-604730890-5050-4268-0003/executors/tian-act-reg.47368a1a-71f9-11e5-ad61-de5fb3a867da/runs/dfc28a6c-48a0-464b-bdb1-d6dd057acd51/artifacts/rna-spark-streaming.jar?









Re: Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
It looks like the reconstruction of the SparkContext from checkpoint data is
trying to look for the jar file from previous failed runs. It cannot find the
jar files because they live on local machines and were cleaned up after each
failed run.
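
A workaround consistent with the conclusion in the 2015-10-15 reply above (a
sketch, untested; the S3 jar path is an assumption): register the application
jar at a stable S3 location in the SparkConf, so the jar path recorded in the
checkpoint stays valid when the context is rebuilt.

import org.apache.spark.SparkConf

// Record a jar location that survives across runs, instead of a per-run
// Mesos executor scratch directory that gets cleaned up.
val conf = new SparkConf()
  .setAppName(sparkJobName)  // sparkJobName as in the original post
  .setJars(Seq("s3://myBucket/jars/rna-spark-streaming.jar"))  // hypothetical stable path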











Re: updateStateByKey and stack overflow

2015-10-13 Thread Tian Zhang
It turns out that our HDFS checkpoint failed, while Spark Streaming kept
running and building up a long lineage ...
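
For context, the mechanism at work (a sketch; the checkpoint directory is
hypothetical): updateStateByKey needs periodic checkpointing, which truncates
the state RDD's ever-growing lineage; when checkpoint writes fail silently,
the dependency chain keeps growing until deserializing it overflows the stack.

// Enable checkpointing so the state RDD's lineage is cut periodically.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/myApp")  // hypothetical path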






Re: "Too many open files" exception on reduceByKey

2015-10-11 Thread Tian Zhang
It turns out that Mesos can override the OS ulimit -n setting, so we have
increased the ulimit -n setting on the Mesos slaves.
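
For reference, the kind of change involved (a sketch; the path and value are
assumptions, not from this thread): raise the file-descriptor limit in
whatever script launches the mesos-slave process, so the executors it forks
inherit the higher limit.

# In the script that launches the slave (location varies by packaging):
ulimit -n 65536                    # raise the fd limit for the slave and its children
exec /usr/sbin/mesos-slave "$@"    # hypothetical launch line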






updateStateByKey and stack overflow

2015-10-10 Thread Tian Zhang
Hi, I am following the Spark Streaming stateful application example and wrote
a simple counting application with updateStateByKey:

val keyStateStream = actRegBatchCountStream.updateStateByKey(update,
  new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initKeyStateRDD)
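
For completeness, a sketch of the shape of the update function referenced
above (it is not shown in the original message, and the counting semantics
are an assumption): updateStateByKey expects a (Seq[V], Option[S]) => Option[S]
function that folds each batch's values into the running state.

// Hypothetical update function: add this batch's counts to the running total.
val update = (newCounts: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newCounts.sum)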
   
This ran for a few hours and then hit the following stack overflow issue.
Any idea?

15/10/10 18:30:08 INFO BlockManagerInfo: Added broadcast_5249_piece0 in
memory on ip-10-112-11-64.ec2.internal:60489 (size: 16.5 KB, free: 4.1 GB)
15/10/10 18:30:08 WARN TaskSetManager: Lost task 4.0 in stage 129045.0 (TID
175432, ip-10-112-11-64.ec2.internal): java.lang.StackOverflowError
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1982)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)








Re: "Too many open files" exception on reduceByKey

2015-10-09 Thread tian zhang
You are right; I did find that Mesos overwrites this with a smaller number.
So we will modify that and try to run again. Thanks!

Tian


On Thursday, October 8, 2015 4:18 PM, DB Tsai <dbt...@dbtsai.com> wrote:

Try running the following to see the actual ulimit. We found that Mesos
overrides the ulimit, which causes the issue.

import sys.process._

val p = 1 to 100
val rdd = sc.parallelize(p, 100)
val a = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect



Sincerely,

DB Tsai
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Oct 8, 2015 at 3:22 PM, Tian Zhang <tzhang...@yahoo.com> wrote:

I hit this issue with a Spark 1.3.0 stateful application (using
updateStateByKey) on Mesos. It fails after running fine for about 24 hours.
I checked ulimit -n and we have very large numbers set on the machines.
What else could be wrong?






Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread Tian Zhang
I hit this issue with a Spark 1.3.0 stateful application (using
updateStateByKey) on Mesos. It fails after running fine for about 24 hours.
The error stack trace is below; I checked ulimit -n and we have very large
numbers set on the machines. What else could be wrong?
15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
113727.0 (TID 833758, ip-10-112-10-221.ec2.internal):
java.io.FileNotFoundException:
/media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index
(Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
at org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)








how to pass configuration properties from driver to executor?

2015-04-30 Thread Tian Zhang
Hi,

We have a scenario as below and would like your suggestion.
We have an app.conf file with propX=A as the default, built into the fat jar
that is provided to spark-submit.
We have an env.conf file with propX=B that we would like spark-submit to take
as input, overriding the default and propagating to both the driver and the
executors.
Note that in the executor we are using a package that reads configuration
properties via Typesafe Config.

How do we do that?

Thanks.

Tian
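
One pattern that fits this scenario (a sketch, untested; file and property
names as above): ship env.conf with --files and point Typesafe Config at it
with -Dconfig.file on both sides, e.g.

spark-submit --files env.conf \
  --conf "spark.driver.extraJavaOptions=-Dconfig.file=env.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=env.conf" \
  my-app-fat.jar

import com.typesafe.config.ConfigFactory

// ConfigFactory.load() honors -Dconfig.file first and falls back to the
// defaults bundled in the fat jar for keys env.conf does not override.
val config = ConfigFactory.load()
val propX = config.getString("propX")  // "B" when env.conf is supplied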






Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread tian zhang
I have found this paper, which seems to answer most of the questions about
RDD lifetime:
https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Tian

On Tuesday, November 25, 2014 4:02 AM, Mukesh Jha <me.mukesh@gmail.com> wrote:

Hey Experts,

I wanted to understand in detail the lifecycle of RDDs in a streaming app.

From my current understanding:
- An RDD gets created out of the realtime input stream.
- Transform functions are applied in a lazy fashion on the RDD to turn it
into other RDDs.
- Actions are taken on the final transformed RDDs to get the data out of the
system.

Also, RDDs are stored in the cluster's RAM (disk if so configured) and are
cleaned up in LRU fashion.

So I have the following questions:
- How does Spark (Streaming) guarantee that all the actions are taken on each
input RDD/batch?
- How does Spark determine that the lifecycle of an RDD is complete? Is there
any chance that an RDD will be cleaned out of RAM before all actions are
taken on it?

Thanks in advance for all your help. Also, I'm relatively new to Scala &
Spark, so pardon me in case these are naive questions/assumptions.

-- 
Thanks & Regards,
Mukesh Jha

   

2 spark streaming questions

2014-11-23 Thread tian zhang

Hi, Dear Spark Streaming Developers and Users,

We are prototyping with Spark Streaming and hit the following 2 issues that I
would like to seek your expertise on.

1) We have a Spark Streaming application in Scala that reads data from Kafka
into a DStream, does some processing, and outputs a transformed DStream. If
for some reason the Kafka connection is not available or times out, the Spark
Streaming job will start to emit empty RDDs afterwards. The log is clean,
without any ERROR indicator. I googled around and this seems to be a known
issue. We believe the Spark Streaming infrastructure should either retry or
return an error/exception. Can you share how you handle this case?

2) We would like to implement a Spark Streaming job that joins a 1-minute
duration DStream of real-time events with a metadata RDD read from a
database. The metadata changes only slightly each day in the database. So
what is the best practice for refreshing the RDD daily while keeping the
streaming join job running? Is this do-able as of Spark 1.1.0?
Thanks.
Tian
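
On question 2, one pattern worth sketching (untested; loadMetadata is a
hypothetical helper that reads the table into an RDD): reload the metadata on
the driver when it goes stale, and do the join inside transform(), whose
closure is re-evaluated on every batch.

// events: DStream[(K, V)]; loadMetadata returns RDD[(K, M)] keyed for the join.
@volatile var metadata = loadMetadata(sc)
@volatile var loadedAt = System.currentTimeMillis

val joined = events.transform { rdd =>
  if (System.currentTimeMillis - loadedAt > 24L * 60 * 60 * 1000) {
    metadata = loadMetadata(rdd.sparkContext)  // refresh roughly once a day
    loadedAt = System.currentTimeMillis
  }
  rdd.join(metadata)  // joins this batch against the current metadata snapshot
}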



Re: spark streaming and the spark shell

2014-11-19 Thread Tian Zhang
I am hitting the same issue: after running for some time, if the Spark
Streaming job loses or times out its Kafka connection, it just starts to
return empty RDDs.
Is there a timeline for when this issue will be fixed, so that I can plan
accordingly?

Thanks.

Tian






Re: spark 1.1.0/yarn hang

2014-10-22 Thread Tian Zhang
We have narrowed this hanging issue down to the Calliope package that we used
to create an RDD by reading a Cassandra table.
The Calliope native RDD interface seems to hang, so I have decided to switch
to the Calliope CQL3 RDD interface.






spark 1.1.0 RDD and Calliope 1.1.0-CTP-U2-H2

2014-10-21 Thread Tian Zhang
Hi, I am using the latest Calliope library from tuplejump.com to create an
RDD from a Cassandra table. I am on a 3-node Spark 1.1.0 cluster with YARN.

My Cassandra table is defined as below, and I have about 2000 rows of data
inserted:
CREATE TABLE top_shows (
  program_id varchar,
  view_minute timestamp,
  view_count counter,
  PRIMARY KEY (view_minute, program_id)  // note that view_minute is the partition key
);

Here are the simple steps I ran from spark-shell on the master node:

spark-shell --master yarn-client --jars
rna/rna-spark-streaming-assembly-1.0-SNAPSHOT.jar --driver-memory 512m
--executor-memory 512m --num-executors 3 --executor-cores 1

// Import the necessary classes
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import com.tuplejump.calliope.utils.RichByteBuffer._
import com.tuplejump.calliope.Implicits._
import com.tuplejump.calliope.CasBuilder
import com.tuplejump.calliope.Types.{CQLRowKeyMap, CQLRowMap}

// Define my class and the implicit conversion
case class ProgramViewCount(viewMinute: Long, program: String, viewCount: Long)
implicit def keyValtoProgramViewCount(key: CQLRowKeyMap,
    values: CQLRowMap): ProgramViewCount =
  ProgramViewCount(key.get("view_minute").get.getLong,
    key.get("program_id").toString, values.get("view_count").get.getLong)

// Use the CQL3 interface to read from the table with a WHERE predicate.
val cas = CasBuilder.cql3.withColumnFamily("streaming_qa", "top_shows")
  .onHost("23.22.120.96")
  .where("view_minute = 141386178")
val allPrograms = sc.cql3Cassandra[ProgramViewCount](cas)

// Lazy evaluation until this point
val rowCount = allPrograms.count

I hit the following exception. It seems that it does not like my WHERE
clause. If I do not have the WHERE clause, it works fine. But with the WHERE
clause, no matter whether the predicate is on the partition key or not, it
fails with the following exception.

Can anyone else using the Calliope package share some light? Thanks a lot.

Tian

scala> val rowCount = allPrograms.count

14/10/21 23:26:07 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0
(TID 2, ip-10-187-51-136.ec2.internal): java.lang.RuntimeException:
    com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665)
    com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301)
    com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167)
    com.tuplejump.calliope.cql3.Cql3CassandraRDD$$anon$1.<init>(Cql3CassandraRDD.scala:75)
    com.tuplejump.calliope.cql3.Cql3CassandraRDD.compute(Cql3CassandraRDD.scala:64)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)









spark 1.1.0/yarn hang

2014-10-14 Thread tian zhang
Hi, I have a Spark 1.1.0 on YARN installation, and I am using spark-submit to
run a simple application.
From the console output, I have 769 partitions, and after task 768 in stage 0
(count) finished, it hangs. I used jstack to dump the stack, and it shows the
main thread is waiting ...

Any suggestions on what might go wrong and how to debug this kind of hang?

Thanks.

Tian


"main" prio=10 tid=0x7f6058009000 nid=0x7ecd in Object.wait() [0x7f605e4d9000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xfdd30500> (a org.apache.spark.scheduler.JobWaiter)
at java.lang.Object.wait(Object.java:503)
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
- locked <0xfdd30500> (a org.apache.spark.scheduler.JobWaiter)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:511)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1088)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1107)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1121)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
at org.apache.spark.rdd.RDD.count(RDD.scala:904)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.getTopShows(TopShowsToKafkaJob.scala:29)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.getTopShows(TopShowsToKafkaJob.scala:45)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$$anonfun$5.apply(TopShowsToKafkaJob.scala:79)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$$anonfun$5.apply(TopShowsToKafkaJob.scala:76)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.processRecentWindow(TopShowsToKafkaJob.scala:76)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$.processRecentWindow(TopShowsToKafkaJob.scala:98)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$.main(TopShowsToKafkaJob.scala:112)
at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.main(TopShowsToKafkaJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

   Locked ownable synchronizers:
- None


Re: Spark Streaming : Could not compute split, block not found

2014-10-09 Thread Tian Zhang
I have figured out why I am getting this error:
We have a lot of data in Kafka, and the DStream from Kafka used
MEMORY_ONLY_SER, so once memory ran low, Spark started to discard blocks
that were needed later ...
Once I changed to MEMORY_AND_DISK_SER, the error was gone.

Tian
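
For reference, the storage level is set where the Kafka stream is created (a
sketch; zkQuorum, groupId, and topicMap are assumed defined elsewhere):
KafkaUtils.createStream takes an explicit StorageLevel, and
MEMORY_AND_DISK_SER lets Spark spill received blocks to disk instead of
discarding them under memory pressure.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap,
  StorageLevel.MEMORY_AND_DISK_SER)  // spill to disk rather than drop blocks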








Re: [ANN] SparkSQL support for Cassandra with Calliope

2014-10-06 Thread tian zhang
Rohit,

Thank you very much for releasing the H2 version; now my app compiles fine,
and there are no more runtime errors w.r.t. Hadoop 1.x classes or interfaces.

Tian


On Saturday, October 4, 2014 9:47 AM, Rohit Rai <ro...@tuplejump.com> wrote:


Hi Tian,

We have published a build against Hadoop 2.0 with version 1.1.0-CTP-U2-H2.

Let us know how your testing goes.

Regards,
Rohit



Founder & CEO, Tuplejump, Inc.
www.tuplejump.com | The Data Engineering Platform

On Sat, Oct 4, 2014 at 3:49 AM, tian zhang <tzhang...@yahoo.com> wrote:

Hi, Rohit,

Thank you for sharing this good news.

I have a related issue that I would like to ask your help with.
I am using Spark 1.1.0 and I have a Spark application using
"com.tuplejump" % "calliope-core_2.10" % "1.1.0-CTP-U2".

At runtime there are the following errors, which seem to indicate that the
Calliope package was compiled against Hadoop 1.x while Spark is running on
Hadoop 2.x. Can you release a new version of Calliope so that it will be
compatible with Spark 1.1.0?

Thanks. Here are the error details:

java.lang.IncompatibleClassChangeError:
Found interface org.apache.hadoop.mapreduce.TaskAttemptContext (Hadoop 2.x),
but class (Hadoop 1.x) was expected
    com.tuplejump.calliope.hadoop.cql3.CqlRecordReader.initialize(CqlRecordReader.java:82)

Tian



On Friday, October 3, 2014 11:15 AM, Rohit Rai <ro...@tuplejump.com> wrote:


Hi All,

A year ago we started this journey and laid the path for the Spark +
Cassandra stack. We established the groundwork and direction for Spark
Cassandra connectors, and we have been happy seeing the results.

With the Spark 1.1.0 and SparkSQL release, it's time to take Calliope to the
logical next level, also paving the way for much more advanced functionality
to come.

Yesterday we released the Calliope 1.1.0 Community Tech Preview, which brings
native SparkSQL support for Cassandra. Further details are available here.

This release showcases in-core spark-sql, hiveql, and HiveThriftServer
support.

I differentiate it as native spark-sql integration because it doesn't rely on
Cassandra's Hive connectors (like Cash or DSE) and saves a level of
indirection through Hive.

It also allows us to harness Spark's analyzer and optimizer in the future to
work out the best execution plan, targeting a balance between Cassandra's
querying restrictions and Spark's in-memory processing.

As far as we know, this is the first and only third-party datastore connector
for SparkSQL. This is a CTP release because it relies on Spark internals that
still don't have a stabilized developer API, and we will work with the Spark
community on documenting the requirements and working towards a standard and
stable API for third-party data store integration.

On another note, we no longer require you to sign up to access the
early-access code repository.

Inviting all of you to try it and give us your valuable feedback.

Regards,

Rohit
Founder & CEO, Tuplejump, Inc.
www.tuplejump.com | The Data Engineering Platform



Spark 1.1.0 (w/ hadoop 2.4) versus aws-java-sdk-1.7.2.jar

2014-09-19 Thread tian zhang


Hi, Spark experts,

I have the following issue when using the AWS Java SDK in my Spark
application. Here I narrowed it down to the following steps to reproduce
the problem.

1) I have Spark 1.1.0 with Hadoop 2.4 installed on a 3-node cluster.
2) From the master node, I did the following steps:

spark-shell --jars aws-java-sdk-1.7.2.jar

import com.amazonaws.{Protocol, ClientConfiguration}
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client

val clientConfiguration = new ClientConfiguration()
val s3accessKey = "X"
val s3secretKey = "Y"
val credentials = new BasicAWSCredentials(s3accessKey, s3secretKey)

println("CLASSPATH=" + System.getenv("CLASSPATH"))
CLASSPATH=::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar:/home/hadoop/conf:/home/hadoop/conf

println("java.class.path=" + System.getProperty("java.class.path"))
java.class.path=::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar:/home/hadoop/conf:/home/hadoop/conf

So far all looks good and normal. But the following step fails, and it looks
like the class loader can't resolve the right class. Any suggestion for a
Spark application that requires the AWS SDK?

scala> val s3Client = new AmazonS3Client(credentials, clientConfiguration)
java.lang.NoClassDefFoundError: org/apache/http/impl/conn/PoolingClientConnectionManager
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:155)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:334)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21)
at $iwC$$iwC$$iwC.<init>(<console>:26)
at $iwC$$iwC.<init>(<console>:28)
at $iwC.<init>(<console>:30)
at <init>(<console>:32)
at .<init>(<console>:36)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.http.impl.conn.PoolingClientConnectionManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 46 more

Thanks.

Tian
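
One remedy worth trying (a sketch; the jar versions are assumptions, not from
this thread): PoolingClientConnectionManager lives in Apache HttpClient 4.2+,
so supplying a matching httpclient/httpcore pair alongside the SDK puts the
class on the classpath that the Spark assembly's older HttpClient lacks:

spark-shell --jars aws-java-sdk-1.7.2.jar,httpclient-4.2.5.jar,httpcore-4.2.5.jar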