Re: Spark caching questions

2014-09-10 Thread Mayur Rustagi
Cached RDDs do not survive SparkContext deletion (they are scoped on a per-SparkContext basis). I am not sure what you mean by disk-based cache eviction; if you cache more RDDs than you have disk space for, the result will not be very pretty :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com
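A minimal sketch of that scoping (the data path and conf are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("cache-scope")
    val sc1 = new SparkContext(conf)
    val data = sc1.textFile("hdfs:///some/data").cache()
    data.count()      // materializes the cached blocks in sc1's executors
    sc1.stop()        // executors go away, and the cached blocks with them

    val sc2 = new SparkContext(conf)
    // sc2 starts with an empty cache; the RDD has to be rebuilt and re-cached here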

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-10 Thread Mayur Rustagi
I think she is checking for blanks? But if the RDD is empty then nothing will happen, no DB connections etc. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Sep 8, 2014 at 1:32 PM, Tobias Pfeiffer t...@preferred.jp
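For reference, a common shape for this kind of sink (a sketch only; the JDBC URL, table, and insert logic are placeholders) skips empty RDDs before opening anything and uses one connection per partition rather than per record:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {                 // nothing to do for an empty batch
        rdd.foreachPartition { records =>
          val conn = DriverManager.getConnection("jdbc:mysql://host/db")  // one connection per partition
          try {
            records.foreach(r => () /* insert r via conn */)
          } finally {
            conn.close()
          }
        }
      }
    }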

Re: RDD memory questions

2014-09-10 Thread Boxian Dong
Thank you very much for your kind help. I have some further questions: - If the RDD is stored in serialized format, does that mean that whenever the RDD is processed, it will be unpacked and packed again, from and back to the JVM, even when they are located on the same machine?
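For context, serialized in-memory storage is selected with a StorageLevel; a minimal sketch (the path is illustrative):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///some/data")
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // blocks are held as serialized bytes
    rdd.count()                                 // each pass deserializes the blocks to process them,
                                                // even when the task runs on the machine holding them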

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Luis Ángel Vicente Sánchez
I somehow missed that parameter when I was reviewing the documentation; that should do the trick! Thank you! 2014-09-10 2:10 GMT+01:00 Shao, Saisai saisai.s...@intel.com: Hi Luis, The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can be used to remove useless timeout
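For anyone following along, a sketch of where those two settings go (the values are arbitrary examples, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.cleaner.ttl", "3600")          // seconds; periodically forget old metadata and RDDs
      .set("spark.streaming.unpersist", "true")  // let streaming unpersist generated RDDs automatically
    val ssc = new StreamingContext(conf, Seconds(10))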

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, I am using Spark 1.0.0. The bug is fixed in 1.0.1. Hao -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark SQL -- more than two tables for join

2014-09-10 Thread boyingk...@163.com
Hi: I have a question about Spark SQL. First, I use a left join on two tables, like this: sql(SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey)).collect().foreach(println) The result is what I expect. But when I use a left join on three tables or more, like this:
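For illustration, the three-table form of the query would look something like this (a sketch; the join key for the third table is assumed to be the same rowkey, following the later message in this thread):

    sql("""SELECT *
           FROM youhao_data
           LEFT JOIN youhao_age ON (youhao_data.rowkey = youhao_age.rowkey)
           LEFT JOIN youhao_totalKiloMeter ON (youhao_data.rowkey = youhao_totalKiloMeter.rowkey)
        """).collect().foreach(println)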

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I believe many people don't subscribe to the incubator mailing list now. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote: Hi, I am using

Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread Thorsten Bergler
Hello, I am writing a Spark app which is already working so far. Now I have also started to build some unit tests, but I am running into some dependency problems and I cannot find a solution right now. Perhaps someone could help me. I build my Spark project with SBT and it seems to be configured
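Without seeing the build, a minimal build.sbt along these lines is the usual starting point (a sketch; the versions are illustrative, and marking spark-core as "provided" is common for spark-submit packaging but keeps it off the classpath of sbt run):

    name := "spark-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.2" % "provided",
      "org.scalatest"    %% "scalatest"  % "2.2.1" % "test"
    )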

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, What is your SBT command and the parameters? Arthur On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote: Hello, I am writing a Spark app which is already working so far. Now I have also started to build some unit tests, but I am running into some dependency problems and

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread Thorsten Bergler
Hi, I just called: test or run Thorsten Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com: Hi, What is your SBT command and the parameters? Arthur On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote: Hello, I am writing a Spark App which is already working

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Ah, thank you. I did not notice that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13871.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

nested rdd operation

2014-09-10 Thread Pavlos Katsogridakis
Hi, I have a question on Spark. This program in spark-shell: val filerdd = sc.textFile("NOTICE", 2); val maprdd = filerdd.map(word => filerdd.map(word2 => word2 + word)); maprdd.collect() throws a NullPointerException. Can somebody explain why I cannot have a nested RDD operation? --pavlos

JavaPairRDD&lt;String, Integer&gt; to JavaPairRDD&lt;String, String&gt; based on key

2014-09-10 Thread Tom
Is it possible to generate a JavaPairRDD&lt;String, Integer&gt; from a JavaPairRDD&lt;String, String&gt;, where I can also use the key values? I have looked at for instance mapToPair, but this generates a new K/V pair based on the original value, and does not give me information about the key. I need this in the

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arunshell87
Hi, I too had tried SQL queries with joins, MINUS, subqueries, etc., but they did not work in Spark SQL. I did not find any documentation on what queries work and what do not work in Spark SQL; maybe we have to wait for the Spark book to be released in Feb-2015. I believe you can try HiveQL in

Re: nested rdd operation

2014-09-10 Thread Sean Owen
You can't use an RDD inside an operation on an RDD. Here you have filerdd in your map function. It sort of looks like you want a cartesian product of the RDD with itself, so look at the cartesian() method. It may not be a good idea to compute such a thing. On Wed, Sep 10, 2014 at 1:57 PM, Pavlos
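A sketch of the cartesian() alternative Sean mentions, using the same illustrative RDD as the original post:

    val filerdd = sc.textFile("NOTICE", 2)
    // Instead of referencing filerdd inside its own map(), pair every line with every other line:
    val pairs = filerdd.cartesian(filerdd).map { case (word, word2) => word2 + word }
    pairs.collect()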

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
| Do the two mailing lists share messages? I don't think so. I didn't receive this message from the user list. I am not at Databricks, so I can't answer your other questions. Maybe Davies Liu dav...@databricks.com can answer you? -- Ye Xianjin Sent with Sparrow

Re: JavaPairRDD&lt;String, Integer&gt; to JavaPairRDD&lt;String, String&gt; based on key

2014-09-10 Thread Sean Owen
So each key-value pair gets a new value for the original key? You want mapValues(). On Wed, Sep 10, 2014 at 2:01 PM, Tom thubregt...@gmail.com wrote: Is it possible to generate a JavaPairRDD&lt;String, Integer&gt; from a JavaPairRDD&lt;String, String&gt;, where I can also use the key values? I have looked at
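A Scala sketch of the idea (JavaPairRDD has an equivalent mapValues method in the Java API; the data and functions are made up):

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))
    // mapValues transforms only the value; the key passes through unchanged:
    val asInts = pairs.mapValues(_.toInt)
    // If the key itself is needed to compute the new value, map over the whole pair instead:
    val withKey = pairs.map { case (k, v) => (k, k.length + v.toInt) }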

Global Variables in Spark Streaming

2014-09-10 Thread Ravi Sharma
Hi Friends, I'm using Spark Streaming as a Kafka consumer. I want to do CEP with Spark, so I need to store my sequence of events so that I can detect some patterns. My question is: how can I save my events in a Java collection temporarily, so that I can detect a pattern by

Some techniques for improving application performance

2014-09-10 Thread Will Benton
Spark friends, I recently wrote up a blog post with examples of some of the standard techniques for improving Spark application performance: http://chapeau.freevariable.com/2014/09/improving-spark-application-performance.html The idea is that we start with readable but poorly-performing

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, Maybe you can take a look at the following: http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html Good luck. Arthur On 10 Sep, 2014, at 9:09 pm, arunshell87 shell.a...@gmail.com wrote: Hi, I too had tried SQL queries with joins, MINUS,

Re: Task not serializable

2014-09-10 Thread Sean Owen
You mention that you are creating a UserGroupInformation inside your function, but something is still serializing it. You should show your code since it may not be doing what you think. If you instantiate an object, it happens every time your function is called. map() is called once per data

Spark NLP

2014-09-10 Thread Paolo Platter
Hi all, What is your preferred Scala NLP library, and why? Are there any items on Spark’s roadmap to integrate NLP features? I basically need to perform NER line by line, so I don’t need a deep integration with the distributed engine. I only want simple dependencies and the chance to build a

Re: Task not serializable

2014-09-10 Thread Sarath Chandra
Thanks Sean. Please find attached my code. Let me know your suggestions/ideas. Regards, *Sarath* On Wed, Sep 10, 2014 at 8:05 PM, Sean Owen so...@cloudera.com wrote: You mention that you are creating a UserGroupInformation inside your function, but something is still serializing it. You

Re: flattening a list in spark sql

2014-09-10 Thread gtinside
Hi, Thanks, it worked; I really appreciate your help. I have also been trying to do multiple Lateral Views, but it doesn't seem to be working. Query: hiveContext.sql(Select t2 from fav LATERAL VIEW explode(TABS) tabs1 as t1 LATERAL VIEW explode(t1) tabs2 as t2) Exception

Re: pyspark and cassandra

2014-09-10 Thread Oleg Ruchovets
Hi, I am trying to evaluate different options for Spark + Cassandra and I have a couple of additional questions. My aim is to use Cassandra only, without Hadoop: 1) Is it possible to use only Cassandra as the input/output for PySpark? 2) In case I use Spark (Java, Scala), is it possible to

Cassandra connector

2014-09-10 Thread wwilkins
Hi, I am having difficulty getting the Cassandra connector running within the spark shell. My jars look like: [wwilkins@phel-spark-001 logs]$ ls -altr /opt/connector/ total 14588 drwxr-xr-x. 5 root root 4096 Sep 9 22:15 .. -rw-r--r-- 1 root root 242809 Sep 9 22:20

Re: Cassandra connector

2014-09-10 Thread gtinside
Are you using Spark 1.1? If yes, you would have to update the DataStax Cassandra connector code and remove references to the log methods from CassandraConnector.scala. Regards, Gaurav -- View this message in context:

Re: Low Level Kafka Consumer for Spark

2014-09-10 Thread Dibyendu Bhattacharya
Hi, The latest changes, with Kafka message replay by manipulating the ZK offset, seem to be working fine for us. This gives us some relief until the actual issue is fixed in Spark 1.2. I have some questions on how Spark processes the received data. The logic I used is basically to pull messages from

Re: groupBy gives non deterministic results

2014-09-10 Thread Davies Liu
I think the mails to spark.incubator.apache.org will be forwarded to spark.apache.org. Here is the header of the first mail: from: redocpot julien19890...@gmail.com to: u...@spark.incubator.apache.org date: Mon, Sep 8, 2014 at 7:29 AM subject: groupBy gives non deterministic results mailing

Re: Global Variables in Spark Streaming

2014-09-10 Thread Akhil Das
Yes, your understanding is correct. In that case the easiest option would be to serialize the object and dump it somewhere in HDFS so that you will be able to recreate/update the object from the file. We have something similar, which you can find in BroadCastServer

Re: flattening a list in spark sql

2014-09-10 Thread gtinside
My bad, please ignore, it works !!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/flattening-a-list-in-spark-sql-tp13300p13901.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread Michael Armbrust
What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as
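A minimal sketch of the HiveContext route being recommended (in 1.0.x the HiveQL entry point is hql(); in 1.1 sql() on a HiveContext parses HiveQL by default; the tables are the ones from this thread):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // HiveQL supports the multi-way joins and subqueries the simple SQL parser struggles with:
    hiveContext.hql(
      """SELECT *
        |FROM youhao_data
        |LEFT JOIN youhao_age ON (youhao_data.rowkey = youhao_age.rowkey)""".stripMargin
    ).collect().foreach(println)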

Re: Spark HiveQL support plan

2014-09-10 Thread Michael Armbrust
HiveQL is the default language for the JDBC server which will be available as part of the 1.1 release (coming very soon!). Adding support for calling MLlib and other spark libraries is on the roadmap, but not possible at this moment. On Tue, Sep 9, 2014 at 1:45 PM, XUE, Xiaohui

RE: Cassandra connector

2014-09-10 Thread Wade Wilkins
Thank you! I am running Spark 1.0, but your suggestion worked for me. I commented out all the logDebug calls in both CassandraConnector.scala and Schema.scala, and I am moving again. Regards, Wade -Original Message- From: gtinside [mailto:gtins...@gmail.com] Sent: Wednesday, September 10, 2014 8:49

Re: how to run python examples in spark 1.1?

2014-09-10 Thread freedafeng
Just want to provide more information on how I ran the examples. Environment: Cloudera QuickStart VM 5.1.0 (HBase 0.98.1 installed). I created a table called 'data1' and 'put' two records in it. I can see the table and data are fine in the HBase shell. I cloned the Spark repo and checked out 1.1

Re: Global Variables in Spark Streaming

2014-09-10 Thread Santiago Mola
Hi Ravi, 2014-09-10 15:55 GMT+02:00 Ravi Sharma raviprincesha...@gmail.com: I'm using Spark Streaming as a Kafka consumer. I want to do CEP with Spark, so I need to store my sequence of events so that I can detect some patterns. Depending on what you're trying to accomplish,

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as sent to the user list. Maybe it's because I also subscribe to the incubator mailing list? I do see mails sent to the incubator mailing list that no one replies to. I thought it was because people don't subscribe to the incubator list now. -- Ye Xianjin Sent

Re: Task not serializable

2014-09-10 Thread Marcelo Vanzin
You're using hadoopConf, a Configuration object, in your closure. That type is not serializable. You can use -Dsun.io.serialization.extendedDebugInfo=true to debug serialization issues. On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Thanks Sean.
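One common workaround (a sketch, not necessarily the right fix for Sarath's code): don't capture the driver-side Configuration in the closure at all, and build what you need inside the task, e.g. once per partition:

    rdd.mapPartitions { iter =>
      // Constructed on the executor, so nothing non-serializable is captured from the driver:
      val conf = new org.apache.hadoop.conf.Configuration()
      val ugi = org.apache.hadoop.security.UserGroupInformation.createRemoteUser("someUser")  // placeholder user
      iter.map(record => process(record, conf))   // process() stands in for the real per-record logic
    }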

Re: How to scale more consumer to Kafka stream

2014-09-10 Thread Tim Smith
How are you creating your Kafka streams in Spark? If you have 10 partitions for a topic, you can call createStream ten times to create 10 parallel receivers/executors and then use union to combine all the DStreams. On Wed, Sep 10, 2014 at 7:16 AM, richiesgr richie...@gmail.com wrote: Hi (my
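A sketch of that pattern (the ZooKeeper quorum, consumer group, and topic name are placeholders), assuming the standard receiver-based KafkaUtils API:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 10
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    // Union the 10 receiver streams into a single DStream for downstream processing:
    val unified = ssc.union(kafkaStreams)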

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case, the receivers die within an hour because Yarn kills the containers for high memory usage. I set spark.cleaner.ttl to 30 seconds but that didn't help, so I don't think stale RDDs are an issue here. I did a jmap -histo on a couple

Re: how to choose right DStream batch interval

2014-09-10 Thread Tim Smith
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617 Slide 39 covers it. On Tue, Sep 9, 2014 at 9:23 PM, qihong qc...@pivotal.io wrote: Hi Mayur, Thanks for your response. I did write a simple test that set up a DStream with 5 batches; The

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Yana Kadiyska
Tim, I asked a similar question twice: here http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-executors-to-stay-alive-tt12940.html and here http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Executor-OOM-tt12383.html and have not yet received any responses. I

Re: spark-streaming Could not compute split exception

2014-09-10 Thread Tim Smith
I had a similar issue and many others - all were basically symptoms for yarn killing the container for high memory usage. Haven't gotten to root cause yet. On Tue, Sep 9, 2014 at 3:18 PM, Marcelo Vanzin van...@cloudera.com wrote: Your executor is exiting or crashing unexpectedly: On Tue, Sep

Hadoop Distributed Cache

2014-09-10 Thread Maximo Gurmendez
Hi, As part of SparkContext.newAPIHadoopRDD(), would Spark support an InputFormat that uses Hadoop’s distributed cache? Thanks, Máximo

[spark upgrade] Error communicating with MapOutputTracker when running test cases in latest spark

2014-09-10 Thread Adrian Mocanu
I use https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala to help me with testing. In Spark 0.9.1 my tests depending on TestSuiteBase worked fine. As soon as I switched to the latest (1.0.1) all tests fail. My sbt import is:

Re: pyspark and cassandra

2014-09-10 Thread Kan Zhang
Thanks for the clarification, Yadid. By Hadoop jobs, I meant Spark jobs that use Hadoop inputformats (as shown in the cassandra_inputformat.py example). A future possibility of accessing Cassandra from PySpark is when SparkSQL supports Cassandra as a data source. On Wed, Sep 10, 2014 at 11:37

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread alexandria1101
I used the hiveContext to register the tables and the tables are still not being found by the thrift server. Do I have to pass the hiveContext to JDBC somehow? -- View this message in context:

Accumulo and Spark

2014-09-10 Thread Megavolt
I've been doing some Googling and haven't found much info on how to incorporate Spark and Accumulo. Does anyone know of some examples of how to tie Spark to Accumulo (for both fetching data and dumping results)? -- View this message in context:

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread Denny Lee
Actually, when you register the table, it is only available within the context (sc) you are running it in. For Spark 1.1, the method name is changed to registerTempTable to better reflect that. The Thrift server runs as a different process, meaning that it cannot see any of the

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread Du Li
Hi Denny, There is a related question by the way. I have a program that reads in a stream of RDDs, each of which is to be loaded into a Hive table as one partition. Currently I do this by first writing the RDDs to HDFS and then loading them into Hive, which requires multiple passes of HDFS I/O
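For context, the current two-step approach looks roughly like this (a sketch with a made-up table, path, and partition column; whether the intermediate HDFS pass can be avoided is the open question):

    // Step 1: write the batch to an HDFS directory that will back the new partition
    rdd.saveAsTextFile("hdfs:///warehouse/mytable/dt=2014-09-10")

    // Step 2: register that directory as a partition of the Hive table
    hiveContext.hql(
      "ALTER TABLE mytable ADD PARTITION (dt='2014-09-10') " +
      "LOCATION 'hdfs:///warehouse/mytable/dt=2014-09-10'")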

java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2

2014-09-10 Thread Jeffrey Picard
Hey guys, After rebuilding from the master branch this morning, I’ve started to see these errors that I’ve never gotten before while running connected components. Anyone seen this before? 14/09/10 20:38:53 INFO collection.ExternalSorter: Thread 87 spilling in-memory batch of 1020 MB to disk

RE: how to setup steady state stream partitions

2014-09-10 Thread Anton Brazhnyk
Just a guess: updateStateByKey has overloaded variants that take a partitioner as a parameter. Can that help? -Original Message- From: qihong [mailto:qc...@pivotal.io] Sent: Tuesday, September 09, 2014 9:13 PM To: u...@spark.incubator.apache.org Subject: Re: how to setup steady state stream

Re: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-10 Thread Andrew Or
Great, good to know it's not just me... 2014-09-10 11:25 GMT-07:00 Marcelo Vanzin van...@cloudera.com: Yes please pretty please. This is really annoying. On Sun, Sep 7, 2014 at 6:31 AM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: I keep getting below reply every time I send a

Re: Accumulo and Spark

2014-09-10 Thread Russ Weeks
It's very straightforward to set up a Hadoop RDD to use AccumuloInputFormat. Something like this will do the trick: private JavaPairRDD&lt;Key,Value&gt; newAccumuloRDD(JavaSparkContext sc, AgileConf agileConf, String appName, Authorizations auths) throws IOException, AccumuloSecurityException {
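A rough Scala equivalent of that setup for reference (a sketch only; the connection details are placeholders, and the static configuration methods follow the Accumulo 1.5/1.6 mapreduce client API, which may differ in other versions):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("pass"))
    AccumuloInputFormat.setZooKeeperInstance(job, "instanceName", "zk1:2181")
    AccumuloInputFormat.setInputTableName(job, "mytable")
    AccumuloInputFormat.setScanAuthorizations(job, auths)

    val accumuloRDD = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])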

Re: Spark + AccumuloInputFormat

2014-09-10 Thread Russ Weeks
To answer my own question... I didn't realize that I was responsible for telling Spark how much parallelism I wanted for my job. I figured that between Spark and Yarn they'd figure it out for themselves. Adding --executor-memory 3G --num-executors 24 to my spark-submit command took the query time

RE: how to setup steady state stream partitions

2014-09-10 Thread qihong
Thanks for your response! I found that too, and it does the trick! Here's the refined code:

    val inputDStream = ...
    val keyedDStream = inputDStream.map(...) // use sensorId as key
    val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new MyPartitioner(...)))
    val stateDStream =
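For completeness, a sketch of the overload Anton pointed at, which passes the same partitioner straight into the state stream (the Int state type, update function, and MyPartitioner arguments are illustrative):

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits in Spark 1.x

    val updateFn = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))
    // Keeps the state co-partitioned with the keyed input from batch to batch:
    val stateDStream = keyedDStream.updateStateByKey(updateFn, new MyPartitioner(numPartitions))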

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen so...@cloudera.com wrote: This structure is not specific to Hadoop, but in theory works in any JAR file. You can put JARs in JARs and refer to them with Class-Path entries in META-INF/MANIFEST.MF. Funny that you mention that, since someone internally

Re: RDD memory questions

2014-09-10 Thread Davies Liu
On Wed, Sep 10, 2014 at 1:05 AM, Boxian Dong box...@indoo.rs wrote: Thank you very much for your kind help. I have some further questions: - If the RDD is stored in serialized format, does that mean that whenever the RDD is processed, it will be unpacked and packed again from and back to

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Sean Owen
Hm, so it is: http://docs.oracle.com/javase/tutorial/deployment/jar/downman.html I'm sure I've done this before though and thought it was this mechanism. It must be something custom. What's the Hadoop jar structure in question then? Is it something special like a WAR file? I confess I had never

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Steve Lewis
In modern projects there are a bazillion dependencies. When I use Hadoop, I just put them in a lib directory in the jar. If I have a project that depends on 50 jars, I need a way to deliver them to Spark. Maybe wordcount can be written without dependencies, but real projects need to deliver

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen so...@cloudera.com wrote: What's the Hadoop jar structure in question then? Is it something special like a WAR file? I confess I had never heard of this so thought this was about generic JAR stuff. What I've been told (and Steve's e-mail alludes to)

Re: PrintWriter error in foreach

2014-09-10 Thread Daniil Osipov
Try providing full path to the file you want to write, and make sure the directory exists and is writable by the Spark process. On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com wrote: I have a spark program that worked in local mode, but throws an error in yarn-client mode on

GraphX : AssertionError

2014-09-10 Thread Vipul Pandey
Hi, I have a small graph with about 3.3M vertices and close to 7.5M edges. It's a pretty innocent graph with a max degree of 8. Unfortunately, graph.triangleCount is failing on me with the exception below. I'm running a spark-shell on CDH5.1 with the following params: SPARK_DRIVER_MEM=10g

Re: PrintWriter error in foreach

2014-09-10 Thread Arun Luthra
Ok, so I don't think the workers on the data nodes will be able to see my output directory on the edge node. I don't think stdout will work either, so I'll write to HDFS via rdd.saveAsTextFile(...) On Wed, Sep 10, 2014 at 3:51 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Try providing full

Re: Re: Spark SQL -- more than two tables for join

2014-09-10 Thread boyingk...@163.com
Hi Michael: I think Arthur.hk.chan isn't here now, so I can show something: 1) My Spark version is 1.0.1. 2) When I use a multi-way join, like this: sql(SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on

Re: How to scale more consumer to Kafka stream

2014-09-10 Thread Dibyendu Bhattacharya
Hi, You can use this Kafka Spark Consumer: https://github.com/dibbhatt/kafka-spark-consumer This does exactly that. It creates parallel receivers for every Kafka topic partition. You can look at Consumer.java under the consumer.kafka.client package for an example of how to use it. There is

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
I switched from Yarn to standalone mode and haven't had the OOM issue yet. However, now I have Akka issues killing the executor: 2014-09-11 02:43:34,543 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from

Setting up jvm in pyspark from shell

2014-09-10 Thread Mohit Singh
Hi, I am using the pyspark shell and am trying to create an RDD from a numpy matrix: rdd = sc.parallelize(matrix) I am getting the following error: JVMDUMP039I Processing dump event systhrow, detail java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait. JVMDUMP032I JVM requested Heap dump