Re: Submitting Spark application through code
I tried running it but it didn't work:

    public static final SparkConf batchConf = new SparkConf();
    String master = "spark://sivarani:7077";
    String spark_home = "/home/sivarani/spark-1.0.2-bin-hadoop2/";
    String jar = "/home/sivarani/build/Test.jar";
    public static final JavaSparkContext batchSparkContext =
        new JavaSparkContext(master, "SparkTest", spark_home, new String[] {jar});

    public static void main(String args[]) {
        runSpark(0, "TestSubmit");
    }

    public static void runSpark(int crit, String dataFile) {
        JavaRDD<String> logData = batchSparkContext.textFile(input, 10);
        // flatMap, mapToPair, reduceByKey ...
        List<Tuple2<String, Integer>> output1 = counts.collect();
    }

This works fine with spark-submit, but when I tried to submit through code with LeadBatchProcessing.runSpark(0, "TestSubmit.csv"); I get the following error:

    HTTP Status 500 - javax.servlet.ServletException: org.apache.spark.SparkException:
    Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure:
    TID 29 on host 172.18.152.36 failed for unknown reason
    Driver stacktrace:

Any advice on this?
Re: Scaladoc
In IntelliJ, Tools > Generate Scaladoc.

Kamal

On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta alexbare...@gmail.com wrote: How do I build the scaladoc html files from the Spark source distribution?

Alex Baretta
Re: Submitting Spark application through code
What do your worker logs say?

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Fri, Oct 31, 2014 at 11:44 AM, sivarani whitefeathers...@gmail.com wrote: I tried running it but it didn't work:

    public static final SparkConf batchConf = new SparkConf();
    String master = "spark://sivarani:7077";
    String spark_home = "/home/sivarani/spark-1.0.2-bin-hadoop2/";
    String jar = "/home/sivarani/build/Test.jar";
    public static final JavaSparkContext batchSparkContext =
        new JavaSparkContext(master, "SparkTest", spark_home, new String[] {jar});

    public static void main(String args[]) {
        runSpark(0, "TestSubmit");
    }

    public static void runSpark(int crit, String dataFile) {
        JavaRDD<String> logData = batchSparkContext.textFile(input, 10);
        // flatMap, mapToPair, reduceByKey ...
        List<Tuple2<String, Integer>> output1 = counts.collect();
    }

This works fine with spark-submit, but when I tried to submit through code with LeadBatchProcessing.runSpark(0, "TestSubmit.csv"); I get the following error:

    HTTP Status 500 - javax.servlet.ServletException: org.apache.spark.SparkException:
    Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure:
    TID 29 on host 172.18.152.36 failed for unknown reason
    Driver stacktrace:

Any advice on this?
Re: Manipulating RDDs within a DStream
Hi,

Since the Cassandra connection object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation, and then close the connection. For example:

    wordCounts.foreachRDD(rdd => {
      val arr = rdd.toArray
      // open Cassandra connection
      // store arr
      // close Cassandra connection
    })

Thanks
Lalit Yadav
la...@sigmoidanalytics.com
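A slightly fuller sketch of the pattern described above, opening one connection per partition rather than per RDD or per record. This is only an illustration under assumptions: the host, keyspace, and table names are placeholders, the stream is assumed to be a DStream[(String, Int)], and it assumes the DataStax Java driver (2.x era) is on the classpath:

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection per partition, created on the worker where it is used
        val cluster = com.datastax.driver.core.Cluster.builder()
          .addContactPoint("127.0.0.1").build()
        val session = cluster.connect("my_keyspace")
        records.foreach { case (word, count) =>
          session.execute(
            s"INSERT INTO word_counts (word, count) VALUES ('$word', $count)")
        }
        session.close()
        cluster.close()
      }
    }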
Re: NonSerializable Exception in foreachRDD
Are you expecting something like this?

    val data = ssc.textFileStream("hdfs://akhldz:9000/input/")
    val rdd = ssc.sparkContext.parallelize(Seq("foo", "bar"))
    val sample = data.foreachRDD(x => {
      val new_rdd = x.union(rdd)
      new_rdd.saveAsTextFile("hdfs://akhldz:9000/output/")
    })

Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:46 AM, Tobias Pfeiffer t...@preferred.jp wrote: Harold, just mentioning it in case you run into it: if you are in a separate thread, there are apparently stricter limits to what you can and cannot serialize:

    val someVal
    future {
      // be very careful with defining RDD operations using someVal here
      val myLocalVal = someVal
      // use myLocalVal instead
    }

On Thu, Oct 30, 2014 at 4:55 PM, Harold Nguyen har...@nexgate.com wrote: In Spark Streaming, when I do foreachRDD on my DStreams, I get a NonSerializable exception when I try to do something like:

    DStream.foreachRDD( rdd => {
      var sc.parallelize(Seq(("test", "blah")))
    })

Is this the code you are actually using? "var sc.parallelize(...)" doesn't really look like valid Scala to me.

Tobias
Re: Spark Streaming Issue not running 24/7
It says "478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException". Can you try putting a try { ... } catch { ... } block around all those operations that you are doing on the DStream? That way it will not stop the entire application due to corrupt data/fields etc.

Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:09 AM, sivarani whitefeathers...@gmail.com wrote: The problem is simple. I want to stream data 24/7, do some calculations, and save the result in a csv/json file so that I could use it for visualization with dc.js/d3.js. I opted for Spark Streaming on a YARN cluster with Kafka and tried running it 24/7, using groupByKey and updateStateByKey to keep the computed historical data. Initially streaming works fine, but after a few hours I am getting:

    14/10/30 23:48:49 ERROR TaskSetManager: Task 2485162.0:3 failed 4 times; aborting job
    14/10/30 23:48:50 ERROR JobScheduler: Error running job streaming job 141469227 ms.1
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 2485162.0:3 failed 4 times,
    most recent failure: Exception failure in TID 478548 on host 172.18.152.36:
    java.lang.ArrayIndexOutOfBoundsException
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)

I guess it's due to the groupByKey and updateStateByKey; I tried groupByKey(100) and increased partitions. Also, when data is in state — say at the 10th second 1,000 records are in state, and at the 100th second 20,000 records are in state, out of which 19,000 records are no longer being updated — how do I remove them from state? With updateStateByKey returning None: how and when do I do that? How will we know when to send None, and how do we save the data before setting None?

I also tried not sending any data for a few hours, but checking the web UI I see the task FINISHED:

    app-20141030203943-  NewApp  0  6.0 GB  2014/10/30 20:39:43  hadoop  FINISHED  4.2 h

This makes me confused. The code says awaitTermination, but it did not terminate the task. Will streaming stop if no data is received for a significant amount of time? Is there any doc available on how long Spark will run when no data is streamed?
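On the state-eviction question above: updateStateByKey drops a key when the update function returns None, so one common pattern is to timestamp each entry and return None once it has been idle too long, after persisting it elsewhere. A hedged sketch; the (count, lastSeen) state shape, the one-hour threshold, and the input DStream pairs of type DStream[(String, Int)] are all assumptions for illustration:

    def update(newValues: Seq[Int], state: Option[(Int, Long)]): Option[(Int, Long)] = {
      val now = System.currentTimeMillis()
      if (newValues.nonEmpty) {
        // fold the new values into the running count and refresh the timestamp
        Some((state.map(_._1).getOrElse(0) + newValues.sum, now))
      } else state.flatMap { case (count, lastSeen) =>
        if (now - lastSeen > 60 * 60 * 1000) None  // save the record externally first, then evict it
        else Some((count, lastSeen))
      }
    }
    val cumulative = pairs.updateStateByKey[(Int, Long)](update _)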
different behaviour of the same code
I am trying to write some sample code under IntelliJ IDEA. I start with a non-sbt Scala project. In order for the program to compile, I add spark-assembly-1.1.0-hadoop2.4.0.jar from the spark/lib directory as an external library of the IDEA project. http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/proj.jpg The code is here: LogReg.scala http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/LogReg.scala Then I click the Run button of the IDEA and I get the following error message: errlog.txt http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/errlog.txt But when I export the jar file and use spark-submit --class net.yanl.spark.LogReg log_reg.jar 15, the program works fine. This is quite annoying. Can anyone resolve this issue? You may need the following files to reproduce the error: out5_training.log / out5_testing.log http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/small01.log
Re: Using a Database to persist and load data from
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile, which takes an OutputFormat class. You will have to write a custom class that extends OutputFormat and implement getRecordWriter, which returns a custom RecordWriter. So you will also have to write a custom RecordWriter, extending RecordWriter, with a write method that actually writes to the DB.

On Fri, Oct 31, 2014 at 11:25 AM, Yanbo Liang yanboha...@gmail.com wrote: AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD also has some restrictions, such as requiring the SQL to contain a WHERE clause. For writing to a DB, you can implement it with mapPartitions or foreachPartition. You can refer to this example: http://stackoverflow.com/questions/24916852/how-can-i-connect-to-a-postgresql-database-into-apache-spark-using-scala

2014-10-30 23:01 GMT+08:00 Asaf Lahav asaf.la...@gmail.com: Hi Ladies and Gents, I would like to know what options I have if I would like to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB can essentially be used. Do I need to develop a context? If yes, how? Where can I get a good example?

Thank you,
Asaf
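To illustrate the mapPartitions/foreachPartition route mentioned above, here is a minimal, hedged sketch using plain JDBC: one connection and one batched statement per partition. The connection URL, credentials, and table name are placeholders (Vertica ships a JDBC driver, so the same shape should apply there), and the RDD is assumed to be an RDD[(String, Int)]:

    rdd.foreachPartition { rows =>
      val conn = java.sql.DriverManager.getConnection(
        "jdbc:vertica://host:5433/db", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO my_table (k, v) VALUES (?, ?)")
      try {
        rows.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setInt(2, v)
          stmt.addBatch()          // batch inserts instead of one round-trip per row
        }
        stmt.executeBatch()
      } finally {
        stmt.close()
        conn.close()
      }
    }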
about aggregateByKey and standard deviation
Hi, everyone

I have an RDD filled with data like

    (k1, v11)
    (k1, v12)
    (k1, v13)
    (k2, v21)
    (k2, v22)
    (k2, v23)
    ...

I want to calculate the average and standard deviation of (v11, v12, v13) and (v21, v22, v23), grouped by their keys. For the moment, I have done that using groupByKey and map. I notice that groupByKey is very expensive, but I cannot figure out how to do it using aggregateByKey, so I wonder: is there any better way to do this? Thanks!

qinwei
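One way to avoid groupByKey here is to aggregate (count, sum, sum of squares) per key and derive both statistics from that single pass. A sketch, assuming the input is an RDD[(String, Double)]:

    val stats = rdd.aggregateByKey((0L, 0.0, 0.0))(
      (acc, v) => (acc._1 + 1, acc._2 + v, acc._3 + v * v),  // fold a value into the partition-local accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))   // merge accumulators across partitions
    val meanAndStdDev = stats.mapValues { case (n, sum, sumSq) =>
      val mean = sum / n
      (mean, math.sqrt(sumSq / n - mean * mean))             // population standard deviation
    }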
Re: Using a Database to persist and load data from
I think you can try to use the Hadoop DBOutputFormat.

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Fri, Oct 31, 2014 at 1:00 PM, Kamal Banga ka...@sigmoidanalytics.com wrote: You can also use PairRDDFunctions' saveAsNewAPIHadoopFile, which takes an OutputFormat class. You will have to write a custom class that extends OutputFormat and implement getRecordWriter, which returns a custom RecordWriter. So you will also have to write a custom RecordWriter, extending RecordWriter, with a write method that actually writes to the DB.

On Fri, Oct 31, 2014 at 11:25 AM, Yanbo Liang yanboha...@gmail.com wrote: AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD also has some restrictions, such as requiring the SQL to contain a WHERE clause. For writing to a DB, you can implement it with mapPartitions or foreachPartition. You can refer to this example: http://stackoverflow.com/questions/24916852/how-can-i-connect-to-a-postgresql-database-into-apache-spark-using-scala

2014-10-30 23:01 GMT+08:00 Asaf Lahav asaf.la...@gmail.com: Hi Ladies and Gents, I would like to know what options I have if I would like to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB can essentially be used. Do I need to develop a context? If yes, how? Where can I get a good example?

Thank you,
Asaf
Re: SparkContext UI
No, empty parens do not matter when calling no-arg methods in Scala. This invocation should work as-is and should result in the RDD showing in Storage; I see that when I run it right now. Since it really does/should work, I'd look at other possibilities -- is it maybe taking a short time to start caching? Looking at a different/old Storage tab?

On Fri, Oct 31, 2014 at 1:17 AM, Sameer Farooqui same...@databricks.com wrote: Hi Stuart, You're close! Just add a () after the cache, like:

    data.cache()

...and then run the .count() action on it and you should be good to see it in the Storage UI!

- Sameer

On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Sorry, too quick to pull the trigger on my original email. I should have added that I tried using persist() and cache() but no joy. I'm doing this:

    data = sc.textFile("somedata")
    data.cache
    data.count()

but I still can't see anything in the storage?

On 31 October 2014 10:42, Sameer Farooqui same...@databricks.com wrote: Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is purely on disk, then Spark will read it from disk, compute against it, and then write the results back to disk or show you the results at the scala/python shells. But when you run Spark workloads against purely on-disk files, the RDD won't show up in Spark's Storage UI. Hope that makes sense...

- Sameer

On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Hi All, When I load an RDD with:

    data = sc.textFile("somefile")

I don't see the resulting RDD in the SparkContext gui on localhost:4040 in /storage. Is there something special I need to do to allow me to view this? I tried both the scala and python shells but got the same result.

Thanks
Stuart
Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?
cache() won't speed up a single operation on an RDD, since it is computed the same way before it is persisted.

On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui same...@databricks.com wrote: By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it, as that could make a big speed improvement.

On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark in Local, Standalone, YARN or Mesos mode? If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors. When you run a .count() on an RDD, it actually distributes tasks to different executors to each do a local count on a local partition, and then all the tasks send their sub-counts back to the driver for final aggregation. This sounds like the kind of behavior you're looking for. However, in Local mode, everything runs in a single JVM (the driver + executor), so there's no parallelization across Executors.

On Thu, Oct 30, 2014 at 10:25 AM, shahab shahab.mok...@gmail.com wrote: Hi, I noticed that the count (of an RDD) in many of my queries is the most time-consuming part, as it runs in the driver process rather than being done by parallel worker nodes. Is there any way to perform count in parallel, or at least parallelize it as much as possible?

best,
/Shahab
ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found
Hi, all

My job failed and there are a lot of "ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found" messages in the log. Can anyone tell me what's wrong and how to fix it?

Best Regards,
Kevin.
Spark SQL on Cassandra
I am dealing with a Lambda Architecture. This means I have Hadoop on the batch layer, Storm on the speed layer, and I'm storing the precomputed views from both layers in Cassandra. I understand that Spark is a substitute for Hadoop, but at the moment I would like not to change the batch layer. I would like to execute SQL queries on Cassandra using Spark SQL. Is it possible to get just Spark SQL to run on top of Cassandra, without Spark? My goal is to access Cassandra data with BI tools. Spark SQL looks like the perfect tool for this.
Repartitioning by partition size, not by number of partitions.
Hi,

I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do:

    sc.textFile(files).coalesce(100)

The problem is that I have to guess the number of partitions in such a way that it's as fast as possible while I still stay on the safe side with RAM memory, so this is quite difficult. For this reason I would like to ask if there is some way to replace coalesce(100) with something that creates N partitions of a given size? I went through the documentation, but I was not able to find a way to do that.

Thank you in advance for any help or advice.
Re: how idf is calculated
I found my problem. I assumed, based on TF-IDF on Wikipedia, that log base 10 is used, but as I found in this discussion https://groups.google.com/forum/#!topic/scala-language/K5tbYSYqQc8, in Scala it is actually ln (the natural logarithm).

Regards,
Andrejs

On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab as...@live.com wrote: Hi Andrejs, The calculations are a bit different to what I've come across in Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here: http://www.mmds.org/ Their calculation of IDF is as follows:

    IDFi = log2(N / ni)

where N is the number of documents and ni is the number of documents in which the word appears. This looks different to your IDF function. For TF, they use:

    TFij = fij / maxk fkj

That is: for document j, the term frequency of term i in j is the number of times i appears in j, divided by the maximum number of times any term appears in j (stop words are usually excluded when considering the maximum). So, in your case:

    TFa1 = 2 / 2 = 1
    TFb1 = 1 / 2 = 0.5
    TFc1 = 1 / 2 = 0.5
    TFm1 = 2 / 2 = 1
    ...
    IDFa = log2(3 / 2) = 0.585

So TFa1 * IDFa = 0.585. Wikipedia mentions an adjustment to overcome biases for long documents, by calculating TFij = 0.5 + (0.5 * fij) / maxk fkj, but that doesn't change anything for TFa1, as the value remains 1. In other words, my calculations don't agree with yours, and neither seem to agree with Spark :)

Regards,
Ashic.

--
Date: Thu, 30 Oct 2014 22:13:49 +0000
Subject: how idf is calculated
From: andr...@sindicetech.com
To: u...@spark.incubator.apache.org

Hi,

I'm writing a paper and I need to calculate tf-idf. With your help I managed to get the results I needed, but the problem is that I need to be able to explain how each number was gotten. So I tried to understand how idf was calculated, and the numbers I get don't correspond to those I should get. I have 3 documents (each line a document):

    a a b c m m
    e a c d e e
    d j k l m m c

When I calculate tf, I get this:

    (1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
    (1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
    (1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])

idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where m is the number of documents (3 in my case) and d(t) is the number of documents the term is present in:

    a: log(4/3) = 0.1249387366
    b: log(4/2) = 0.3010299957
    c: log(4/4) = 0
    d: log(4/3) = 0.1249387366
    e: log(4/2) = 0.3010299957
    l: log(4/2) = 0.3010299957
    m: log(4/3) = 0.1249387366

When I output the idf vector with

    idf.idf.toArray.filter(_ > 0).distinct.foreach(println(_))

I get:

    1.3862943611198906
    0.28768207245178085
    0.6931471805599453

I understand why there are only 3 numbers, because only 3 are unique: log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf were calculated.

Best regards,
Andrejs
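A quick check confirms the natural-log reading: in Scala, math.log is ln, and it reproduces the printed values exactly. (The third value, 1.3862..., is ln(4/1), i.e. a document frequency of zero, presumably the hash buckets in which no term ever appeared.)

    math.log(4.0 / 2)  // term in 1 of 3 docs: 0.6931471805599453
    math.log(4.0 / 3)  // term in 2 of 3 docs: 0.28768207245178085
    math.log(4.0 / 1)  // term in 0 docs:      1.3862943611198906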
Re: how idf is calculated
Yes, here the base doesn't matter, as it just multiplies all results by a constant factor. Math libraries tend to have ln, not log10 or log2. ln is often the more, er, natural base for several computations. So I would assume that log = ln in the context of ML.

On Fri, Oct 31, 2014 at 11:31 AM, Andrejs Abele andr...@sindicetech.com wrote: I found my problem. I assumed, based on TF-IDF on Wikipedia, that log base 10 is used, but as I found in this discussion, in Scala it is actually ln (the natural logarithm).
Re: Executor and BlockManager memory size
Hi,

I meet the same problem in the context of Spark and YARN. When I open pyspark with the following command:

    spark/bin/pyspark --master yarn-client --num-executors 1 --executor-memory 2500m

it turns out:

    INFO storage.BlockManagerMasterActor: Registering block manager
    ip-10-0-6-171.us-west-2.compute.internal:38770 with 1294.1 MB RAM

So, according to the documentation, just 2156.83m is allocated to the executor. Moreover, according to YARN, 3072m of memory is used for this container. Do you have any ideas about this?

Thanks a lot
Cheers
Gen

Boromir Widas wrote: Hey Larry, I have been trying to figure this out for standalone clusters as well. http://apache-spark-user-list.1001560.n3.nabble.com/What-is-a-Block-Manager-td12833.html has an answer as to what the block manager is for. From the documentation, what I understood was that if you assign X GB to each executor, spark.storage.memoryFraction (default 0.6) * X is assigned to the BlockManager and the rest to the JVM itself(?). However, as you see, 26.8G is assigned to the BM, and assuming 0.6 memoryFraction, this means the executor sees ~44.7G of memory; I am not sure what happens to the difference (5.3G).

On Thu, Oct 9, 2014 at 9:40 PM, Larry Xiao <xiaodi@...edu> wrote: Hi all, I'm confused about Executor and BlockManager: why do they have different memory?

    14/10/10 08:50:02 INFO AppClient$ClientActor: Executor added: app-20141010085001-/2 on
    worker-20141010004933-brick6-35657 (brick6:35657) with 6 cores
    14/10/10 08:50:02 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141010085001-/2
    on hostPort brick6:35657 with 6 cores, 50.0 GB RAM
    14/10/10 08:50:07 INFO BlockManagerMasterActor: Registering block manager brick6:53296 with 26.8 GB RAM

and on the WebUI:

    Executor ID  Address       RDD Blocks  Memory Used       Disk Used  Active  Failed  Complete  Total  Task Time  Input  Shuffle Read  Shuffle Write
    0            brick3:37607  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    1            brick1:59493  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    2            brick6:53296  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    3            brick5:38543  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    4            brick2:44937  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    5            brick4:46798  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    driver       brick0:57692  0           0.0 B / 274.6 MB  0.0 B      0       0       0         0      0 ms       0.0 B  0.0 B         0.0 B

As I understand it, a worker consists of a daemon and an executor, and the executor takes charge of both execution and storage. So does it mean that 26.8 GB is reserved for storage and the rest is for execution? Another question: throughout execution, it seems that the BlockManager is always almost free.

    14/10/05 14:33:44 INFO BlockManagerInfo: Added broadcast_21_piece0 in memory on brick2:57501
    (size: 1669.0 B, free: 21.2 GB)

I don't know what I'm missing here.

Best regards,
Larry
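For what it's worth, the 1294.1 MB figure is consistent with how Spark 1.x sizes the BlockManager from the JVM heap. A back-of-envelope check, assuming the defaults spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9:

    // Runtime.getRuntime.maxMemory for -Xmx2500m is a bit below 2500 MB
    // (roughly 2396 MB here, since one survivor space is excluded)
    val maxMemoryMb = 2396.5
    val blockManagerMb = maxMemoryMb * 0.6 * 0.9  // memoryFraction * safetyFraction
    // blockManagerMb is about 1294.1, matching the registration log line above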
Re: sbt/sbt compile error [FATAL]
Hi,

Thanks. Branch 1.1 did not work, but 1.0 worked. Why could that be?

Regards
Hans-Peter
SQL COUNT DISTINCT
While testing Spark SQL I noticed that COUNT DISTINCT works really slowly. The map-partitions phase finishes fast, but the collect phase is slow: it only runs on a single executor. Should it run this way? Here is the simple code I use for testing:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
    parquetFile.registerTempTable("parquetFile")
    val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
    count.map(t => t(0)).collect().foreach(println)

I guess it's because the distinct processing must happen on a single node. But I wonder: can I add some parallelism to the collect process?
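One workaround that may be worth trying (a hedged sketch, not a guaranteed fix) is to rewrite the query so that the DISTINCT runs as its own distributed step before the final count:

    val count = sqlContext.sql(
      "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) t")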
RE: Repartitioning by partition size, not by number of partitions.
Hi Jan. I've actually written a function recently to do precisely that, using the RDD.randomSplit function. You just need to calculate how big each element of your data is, then how many of each can fit in each RDD, to populate the input to randomSplit. Unfortunately, in my case I wind up with GC errors on large data doing this and am still debugging :)

-----Original Message-----
From: jan.zi...@centrum.cz [jan.zi...@centrum.cz]
Sent: Friday, October 31, 2014 06:27 AM Eastern Standard Time
To: user@spark.apache.org
Subject: Repartitioning by partition size, not by number of partitions.

Hi,

I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do:

    sc.textFile(files).coalesce(100)

The problem is that I have to guess the number of partitions in such a way that it's as fast as possible while I still stay on the safe side with RAM memory, so this is quite difficult. For this reason I would like to ask if there is some way to replace coalesce(100) with something that creates N partitions of a given size? I went through the documentation, but I was not able to find a way to do that.

Thank you in advance for any help or advice.
unsubscribe
Apology for having to send to all. I am highly interested in Spark and would like to stay on this mailing list, but the email address I signed up with is not the right one. The link below to unsubscribe seems not to be working: https://spark.apache.org/community.html Can anyone help?
Re: unsubscribe
Hongbin,

Please send an email to user-unsubscr...@spark.apache.org in order to unsubscribe from the user list.

On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote: Apology for having to send to all. I am highly interested in Spark and would like to stay on this mailing list, but the email address I signed up with is not the right one. The link below to unsubscribe seems not to be working: https://spark.apache.org/community.html Can anyone help?
Too many files open with Spark 1.1 and CDH 5.1
Hi,

I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
Re: CANNOT FIND ADDRESS
Thanks for the pointers! I did try them, but they didn't seem to help... In my latest try, I am doing spark-submit local, but I still see the same message in the Spark App UI (4040):

    localhost CANNOT FIND ADDRESS

In the logs, I see a lot of in-memory map spilling to disk. I don't understand why that is the case; there should be over 35 GB of RAM available for input that is not significantly large. It may be linked to the performance issues that I am seeing (I have another post seeking advice on that). It seems I am not able to tune Spark sufficiently to execute my process successfully.

    14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 1777 MB to disk (2 times so far)
    14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 393 MB to disk (1 time so far)
    14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (2 times so far)
    14/10/31 13:45:14 INFO ExternalAppendOnlyMap: Thread 230 spilling in-memory map of 554 MB to disk (2 times so far)
    14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 235 spilling in-memory map of 3990 MB to disk (1 time so far)
    14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 236 spilling in-memory map of 2667 MB to disk (1 time so far)
    14/10/31 13:45:17 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 825 MB to disk (3 times so far)
    14/10/31 13:45:24 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 4618 MB to disk (2 times so far)
    14/10/31 13:45:26 INFO ExternalAppendOnlyMap: Thread 233 spilling in-memory map of 15869 MB to disk (1 time so far)
    14/10/31 13:45:37 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 3026 MB to disk (3 times so far)
    14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 401 MB to disk (3 times so far)
    14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (4 times so far)

My Spark properties are:

    spark.akka.frameSize                    50
    spark.akka.timeout                      900
    spark.app.name                          Simple File Merge Application
    spark.core.connection.ack.wait.timeout  900
    spark.default.parallelism               10
    spark.driver.host                       spark-single.c.fi-mdd-poc.internal
    spark.driver.memory                     35g
    spark.driver.port                       40255
    spark.eventLog.enabled                  true
    spark.fileserver.uri                    http://10.240.106.135:59255
    spark.jars                              file:/home/ami_khandeshi_gmail_com/SparkExample-1.0.jar
    spark.master                            local[16]
    spark.scheduler.mode                    FIFO
    spark.shuffle.consolidateFiles          true
    spark.storage.memoryFraction            0.3
    spark.tachyonStore.baseDir              /tmp
    spark.tachyonStore.folderName           spark-21ad0fd2-2177-48ce-9242-8dbb33f2a1f1
    spark.tachyonStore.url                  tachyon://mdd:19998
Re: Too many files open with Spark 1.1 and CDH 5.1
It's almost surely the workers, not the driver (shell), that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q bill.q@gmail.com wrote: Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
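For raising the limit on the workers, one hedged example (the user name and limit are placeholders) is to set the nofile limit for the account running the Spark daemons in /etc/security/limits.conf and restart the workers:

    # /etc/security/limits.conf -- assumes the workers run as user "spark"
    spark  soft  nofile  65535
    spark  hard  nofile  65535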
SparkContext.stop() ?
what is it for? when do we call it? thanks!
Re: Too many files open with Spark 1.1 and CDH 5.1
Hi Sean,

Thanks for the reply. I think both the driver and the workers have the problem. You are right that the ulimit fixed the driver-side "too many files open" error. And there is a very big shuffle. My maybe naive thought is to migrate the HQL scripts directly from Hive to Spark SQL and make them work. It seems that it won't be that easy; is that correct? It also seems that I had done that with Shark and it worked pretty well in the old days. Any suggestions if we are planning to migrate a large code base from Hive to Spark SQL with minimum code rewriting?

Many thanks.
Cao

On Friday, October 31, 2014, Sean Owen so...@cloudera.com wrote: It's almost surely the workers, not the driver (shell), that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q bill.q@gmail.com wrote: Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
Re: does updateStateByKey accept a state that is a tuple?
Based on execution on small test cases, it appears that the construction below does what I intend. (Yes, all those Tuple1()s were superfluous.)

    var lines = ssc.textFileStream(dirArg)
    var linesArray = lines.map(line => line.split("\t"))
    var newState = linesArray.map(lineArray =>
      (lineArray(4),
       (1,
        Time((lineArray(0).toDouble * 1000).toLong),
        Time((lineArray(0).toDouble * 1000).toLong))))

    val updateDhcpState = (newValues: Seq[(Int, Time, Time)], state: Option[(Int, Time, Time)]) =>
      Option[(Int, Time, Time)] {
        val newCount   = newValues.map(x => x._1).sum
        val newMinTime = newValues.map(x => x._2).min
        val newMaxTime = newValues.map(x => x._3).max
        val (count, minTime, maxTime) = state.getOrElse((0, Time(Int.MaxValue), Time(Int.MinValue)))
        (count + newCount, Seq(minTime, newMinTime).min, Seq(maxTime, newMaxTime).max)
      }

    var DhcpSvrCum = newState.updateStateByKey[(Int, Time, Time)](updateDhcpState)
Re: Out of memory with Spark Streaming
Thanks Chris for looking at this. I was putting data at roughly the same 50 records per batch max. This issue was purely because of a bug in my persistence logic that was leaking memory. Overall, I haven't seen a lot of lag with the Kinesis + Spark setup, and I am able to process records at roughly the same rate as data is fed into Kinesis, with acceptable latency.

Thanks,
Aniket

On Oct 31, 2014 1:15 AM, Chris Fregly ch...@fregly.com wrote: Curious about why you're only seeing 50 records max per batch. How many receivers are you running? What is the rate at which you're putting data onto the stream? Per the default AWS Kinesis configuration, the producer can do 1000 PUTs per second with max 50k bytes per PUT and max 1mb per second per shard. On the consumer side, you can only do 5 GETs per second and 2mb per second per shard. My hunch is that the 5 GETs per second is what's limiting your consumption rate. Can you verify that these numbers match what you're seeing? If so, you may want to increase your shards and therefore the number of Kinesis receivers. Otherwise, this may require some further investigation on my part. I wanna stay on top of this if it's an issue. Thanks for posting this, Aniket!

-chris

On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all, Sorry, but this was totally my mistake. In my persistence logic, I was creating an async http client instance in RDD foreach but was never closing it, leading to memory leaks. Apologies for wasting everyone's time.

Thanks,
Aniket

On 12 September 2014 02:20, Tathagata Das tathagata.das1...@gmail.com wrote: Which version of Spark are you running? If you are running the latest one, could you try running not a window but a simple event count on every 2-second batch, and see if you are still running out of memory?

TD

On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I did change it to 1 GB. It still ran out of memory, but a little later. The streaming job isn't handling a lot of data. In every 2 seconds, it doesn't get more than 50 records, and each record's size is not more than 500 bytes.

On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote: You could set spark.executor.memory to something bigger than the default (512mb).

On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists it to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps, but I think Spark isn't releasing memory after processing is complete. I have even tried changing the storage level to disk-only. Help!

Thanks,
Aniket
RE: Repartitioning by partition size, not by number of partitions.
Hi Ilya,

This seems to me quite a complicated solution. I'm thinking that an easier (though not optimal) approach might be, for example, to heuristically use something like RDD.coalesce(RDD.getNumPartitions() / N), but it keeps me wondering that Spark does not have something like RDD.coalesce(partition_size).

__________

Hi Jan. I've actually written a function recently to do precisely that, using the RDD.randomSplit function. You just need to calculate how big each element of your data is, then how many of each can fit in each RDD, to populate the input to randomSplit. Unfortunately, in my case I wind up with GC errors on large data doing this and am still debugging :)

-----Original Message-----
From: jan.zi...@centrum.cz [jan.zi...@centrum.cz]
Sent: Friday, October 31, 2014 06:27 AM Eastern Standard Time
To: user@spark.apache.org
Subject: Repartitioning by partition size, not by number of partitions.

Hi,

I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do:

    sc.textFile(files).coalesce(100)

The problem is that I have to guess the number of partitions in such a way that it's as fast as possible while I still stay on the safe side with RAM memory, so this is quite difficult. For this reason I would like to ask if there is some way to replace coalesce(100) with something that creates N partitions of a given size? I went through the documentation, but I was not able to find a way to do that.

Thank you in advance for any help or advice.
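As far as I know there is no built-in coalesce(partition_size), but the heuristic can be made concrete. A hedged sketch, in Scala, assuming the inputs are locally listable file paths and a target partition size picked purely for illustration:

    val targetPartitionBytes = 128L * 1024 * 1024  // assumed target size per partition
    val totalBytes = fileList.map(f => new java.io.File(f).length).sum  // fileList: Seq[String] of local paths
    val numPartitions = math.max(1, (totalBytes / targetPartitionBytes).toInt)
    val rdd = sc.textFile(fileList.mkString(",")).coalesce(numPartitions)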
Re: How to set Spark to perform only one map at once at each cluster node
Yes, I would expect, as you say, that setting executor-cores to 1 would work, but it seems to me that when I use executor-cores=1 it actually performs more than one task on each of the machines at any one moment (at least based on what top says).

__________

CC: user@spark.apache.org

It's not very difficult to achieve by properly setting the application's parameters. Some basic knowledge you should know: an application can have only one executor on each machine or container (YARN). So if you just set executor-cores to 1, each executor will run only one task at once.

2014-10-28 19:00 GMT+08:00 jan.zi...@centrum.cz jan.zi...@centrum.cz: But I guess that this makes only one task over all the cluster's nodes. I would like to run several tasks, but I would like Spark not to run more than one map on each of my nodes at one time. That means I would like to, let's say, have 4 different tasks and 2 nodes where each node has 2 cores. Currently Hadoop runs 2 maps in parallel on each node (all 4 tasks in parallel), but I would like to somehow force it to run only 1 task on each node and to give it another task after the first task finishes.

__________

The number of tasks is decided by the number of input partitions. If you want only one map or flatMap at once, just call coalesce() or repartition() to associate the data into one partition. However, this is not recommended, because it does not execute efficiently in parallel.

2014-10-28 17:27 GMT+08:00 jan.zi...@centrum.cz jan.zi...@centrum.cz: Hi, I am currently struggling with how to properly set Spark to perform only one map, flatMap, etc. at once. In other words, my map uses a multi-core algorithm, so I would like to have only one map running at a time to be able to use all the machine's cores.

Thank you in advance for advice and replies.

Jan
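One related knob that may help here, sketched with assumed values: if spark.task.cpus is set equal to the number of cores each executor owns, every task reserves the whole executor, so at most one task runs per executor at a time. Whether this fits depends on the cluster manager and version:

    // a hedged sketch, assuming 2 cores per executor
    val conf = new SparkConf()
      .set("spark.executor.cores", "2")
      .set("spark.task.cpus", "2")  // each task claims 2 CPUs => one concurrent task per executor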
Re: spark streaming - saving kafka DStream into hadoop throws exception
Hm, now I am also seeing this problem. The essence of my code is:

    final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS));
      }
    };
    streamingContext = JavaStreamingContext.getOrCreate(
        checkpointDirString, sparkContext.hadoopConfiguration(), streamingContextFactory, false);
    streamingContext.checkpoint(checkpointDirString);

which yields:

    2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration
    - field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9,
      name: conf$2, type: class org.apache.hadoop.conf.Configuration)
    - object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, <function2>)
    - field (class org.apache.spark.streaming.dstream.ForEachDStream,
      name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2)
    - object (class org.apache.spark.streaming.dstream.ForEachDStream,
      org.apache.spark.streaming.dstream.ForEachDStream@cb8016a)
    ...

This looks like it's due to PairDStreamFunctions, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9:

    def saveAsNewAPIHadoopFiles(
        prefix: String,
        suffix: String,
        keyClass: Class[_],
        valueClass: Class[_],
        outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
        conf: Configuration = new Configuration) {
      val saveFunc = (rdd: RDD[(K, V)], time: Time) => {
        val file = rddToFileName(prefix, suffix, time)
        rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf)
      }
      self.foreachRDD(saveFunc)
    }

conf is indeed serialized to make it part of saveFunc, no? But it can't be serialized. But surely this doesn't fail all the time, or someone would have noticed by now... It could be a particular closure problem again. Any ideas on whether this is a problem, or if there's a workaround? Checkpointing does not work at all for me as a result.

On Fri, Aug 15, 2014 at 10:37 PM, salemi alireza.sal...@udo.edu wrote: Hi All, I am just trying to save the Kafka dstream to Hadoop as follows:

    val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
    dStream.saveAsHadoopFiles(hdfsDataUrl, "data")

It throws the following exception. What am I doing wrong?
    14/08/15 14:30:09 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf
    java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
        at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
A Spark Design Problem
The original problem is in biology, but the following captures the CS issues. Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks, and a key that fits a lock will have a high score. There may be many keys fitting one lock, and a key may fit no locks well. The objective is to find the best-fitting lock for each key.

Assume that the number of keys and locks is high enough that taking the cartesian product of the two is computationally impractical. Also assume that keys and locks have an attached location which is accurate within an error (say 1 km); only keys and locks within 1 km of each other need be compared.

Now assume I can create a JavaRDD<Key> and a JavaRDD<Lock>. I could divide the locations into 1 km square bins and look only within a few bins. Assume that it is practical to take a cartesian product for all elements in a bin, but not to keep all elements in memory. I could map my RDDs into PairRDDs where the key is the bin assigned by location. I know how to take the cartesian product of two JavaRDDs, but not how to take a cartesian product of sets of elements sharing a common key (bin). Any suggestions? Assume that in the worst case the number of elements in a bin is too large to keep in memory, although if a bin were subdivided into, say, 100 sub-bins, the elements would fit in memory. Any thoughts on how to attack the problem?
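In case it helps, a rough sketch of the bin-join idea (in Scala for brevity; Key, Lock, binOf, score, and the id field stand in for the real types and functions): key both datasets by bin, join within bins so only co-located pairs are scored, and reduce to the best lock per key. To compare across neighbouring bins, each key could be emitted into the surrounding bins with flatMap instead of map.

    val keysByBin  = keys.map(k => (binOf(k.location), k))    // or flatMap into the ~9 neighbouring bins
    val locksByBin = locks.map(l => (binOf(l.location), l))
    val bestLockPerKey = keysByBin.join(locksByBin)           // all (key, lock) pairs sharing a bin
      .map { case (_, (k, l)) => (k.id, (l, score(k, l))) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)      // keep the highest-scoring lock per key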
Re: Too many files open with Spark 1.1 and CDH 5.1
As Sean suggested, try out the new sort-based shuffle in 1.1 if you know you're triggering large shuffles. That should help a lot.

On Friday, October 31, 2014, Bill Q bill.q@gmail.com wrote: Hi Sean, Thanks for the reply. I think both the driver and the workers have the problem. You are right that the ulimit fixed the driver-side "too many files open" error. And there is a very big shuffle. My maybe naive thought is to migrate the HQL scripts directly from Hive to Spark SQL and make them work. It seems that it won't be that easy; is that correct? It also seems that I had done that with Shark and it worked pretty well in the old days. Any suggestions if we are planning to migrate a large code base from Hive to Spark SQL with minimum code rewriting?

Many thanks.
Cao

On Friday, October 31, 2014, Sean Owen so...@cloudera.com wrote: It's almost surely the workers, not the driver (shell), that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q bill.q@gmail.com wrote: Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
Re: Measuring Performance in Spark
Are there any tools like Ganglia that I can use to get performance metrics on Spark, or do I need to do it myself? Thanks!
Re: Measuring Performance in Spark
Hi Mahsa, Use SPM http://sematext.com/spm/. See http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ . Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Fri, Oct 31, 2014 at 1:00 PM, mahsa mahsa.han...@gmail.com wrote: Is there any tools like Ganglia that I can use to get performance on Spark or I need to do it myself? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Performance-in-Spark-tp17376p17836.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
properties file on a spark cluster
Hi Guys,

I'm trying to execute a Spark job using Python, running on a YARN cluster (managed by Cloudera Manager). The Python script uses a set of Python programs installed on each member of the cluster. This set of programs needs a property file found at a local filesystem path. My problem is: when this script is sent using spark-submit, the programs can't find the properties file. Running locally as a stand-alone job, there is no problem: the properties file is found.

My questions are:

1 - What is the problem here?
2 - In this scenario (a script running on a Spark YARN cluster that uses Python programs sharing the same properties file), what is the best approach?

Thank's
taka
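One approach that may help here (hedged; the paths and file names are placeholders): ship the properties file with the job via spark-submit's --files option, which copies it into each executor's working directory on the cluster, so the programs can resolve it by its bare file name:

    spark-submit --master yarn-client --files /local/path/app.properties my_script.py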
Re: Measuring Performance in Spark
Oh this is Awesome! Exactly what I needed! Thank you Otis!
Re: SparkContext.stop() ?
It is used to shut down the context when you're done with it, but if you're using a context for the lifetime of your application, I don't think it matters. I use this in my unit tests, because they start up local contexts, and you can't have multiple local contexts open, so each test must stop its context when it's done.

On Fri, Oct 31, 2014 at 11:12 AM, ll duy.huynh@gmail.com wrote: what is it for? when do we call it? thanks!

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
Re: SQL COUNT DISTINCT
The only thing in your code that cannot be parallelized is the collect(), because -- by definition -- it collects all the results to the driver node. This has nothing to do with the DISTINCT in your query. What do you want to do with the results after you collect them? How many results do you have in the output of collect()? Perhaps it makes more sense to continue operating on the RDDs you have, or to save them using one of the RDD methods, because that preserves the cluster's ability to parallelize work.

Nick

On Friday, October 31, 2014, Bojan Kostic blood9ra...@gmail.com wrote: While testing Spark SQL I noticed that COUNT DISTINCT works really slowly. The map-partitions phase finishes fast, but the collect phase is slow: it only runs on a single executor. Should it run this way? Here is the simple code I use for testing:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
    parquetFile.registerTempTable("parquetFile")
    val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
    count.map(t => t(0)).collect().foreach(println)

I guess it's because the distinct processing must happen on a single node. But I wonder: can I add some parallelism to the collect process?
Accessing Cassandra with SparkSQL, Does not work?
Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect to Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.catalyst.expressions._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra.CassandraSQLContext

val conf = new SparkConf().setAppName("SomethingElse")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")
val sc: SparkContext = new SparkContext(conf)

val rdd = sc.cassandraTable("mydb", "mytable") // this works

But:

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("mydb")
val srdd: SchemaRDD = cc.sql("select * from mydb.mytable")
println("count : " + srdd.count) // does not work

Exception is thrown:

Exception in thread "main" com.google.common.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: key not found: mydb3.inverseeventtype
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
at com.google.common.cache.LocalCache.get(LocalCache.java:3934)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)

In fact mydb3 is another keyspace which I did not even try to connect to! Any idea? best, /Shahab

Here is what my SBT looks like:

libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" withSources() withJavadoc(),
  "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),
  "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),
  "net.jpountz.lz4" % "lz4" % "1.2.0",
  "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-api") exclude("javax.servlet", "servlet-api"),
  "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" intransitive(),
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
  "com.github.nscala-time" %% "nscala-time" % "1.0.0",
  "org.scalatest" %% "scalatest" % "1.9.1" % "test",
  "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
  "org.json4s" %% "json4s-jackson" % "3.2.5",
  "junit" % "junit" % "4.8.1" % "test",
  "org.slf4j" % "slf4j-api" % "1.7.7",
  "org.slf4j" % "slf4j-simple" % "1.7.7",
  "org.clapper" %% "grizzled-slf4j" % "1.0.2",
  "log4j" % "log4j" % "1.2.17")
Re: Manipulating RDDs within a DStream
Hi Harold, Can you include the versions of spark and spark-cassandra-connector you are using? Thanks! Helena @helenaedelson On Oct 30, 2014, at 12:58 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me. I'd like to be able to do the following: wordCounts.foreachRDD( rdd => { val arr = record.toArray ... }) I would like to use the arr to send back to Cassandra, for instance, like this: val collection = sc.parallelize(Seq(a.head._1, a.head._2)) collection.saveToCassandra() Or something like that, but as you know, I can't do this within the foreachRDD but only at the driver level. How do I use the arr variable to do something like that? Thanks for any help, Harold
Re: Manipulating RDDs within a DStream
Thanks Lalit, and Helena, What I'd like to do is manipulate the values within a DStream like this: DStream.foreachRDD( rdd => { val arr = record.toArray }) I'd then like to be able to insert results from the arr back into Cassandra, after I've manipulated the arr array. However, for all the examples I've seen, inserting into Cassandra is something like: val collection = sc.parallelize(Seq("foo", "bar")) Where "foo" and "bar" could be elements in the arr array. So I would like to know how to insert into Cassandra at the worker level. Best wishes, Harold On Thu, Oct 30, 2014 at 11:48 PM, lalit1303 la...@sigmoidanalytics.com wrote: Hi, Since the cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation and then close the connection. For example: wordCounts.foreachRDD( rdd => { val arr = rdd.toArray OPEN cassandra connection store arr CLOSE cassandra connection }) Thanks - Lalit Yadav la...@sigmoidanalytics.com
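A concrete sketch of that pattern using the connector's own writer, so both the per-element manipulation and the write happen on the workers (the keyspace, table, and column names are hypothetical):

import com.datastax.spark.connector._ // adds saveToCassandra and SomeColumns

wordCounts.foreachRDD { rdd =>
  rdd
    .map { case (word, count) => (word.toLowerCase, count) } // any per-element manipulation runs on the workers
    .saveToCassandra("my_keyspace", "word_counts", SomeColumns("word", "count"))
}

The connector manages its Cassandra sessions per executor, so nothing needs to be opened or closed by hand inside the loop.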
Re: Accessing Cassandra with SparkSQL, Does not work?
Hi Shahab, I'm just curious, are you explicitly needing to use thrift? Just using the connector with spark does not require any thrift dependencies. Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" But to your question, you declare the keyspace but also unnecessarily repeat the keyspace.table in your select. Try this instead: val cc = new CassandraSQLContext(sc) cc.setKeyspace("keyspaceName") val result = cc.sql("SELECT * FROM tableName") etc - Helena @helenaedelson On Oct 31, 2014, at 1:25 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect to Cassandra using CassandraRDD, the similar SparkSQL approach does not work. [...]
Re: A Spark Design Problem
Hi Steve, Are you talking about sequence alignment? — FG On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2...@gmail.com wrote: The original problem is in biology but the following captures the CS issues. Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks, and a key that fits a lock will have a high score. There may be many keys fitting one lock, and a key may fit no locks well. The object is to find the best fitting lock for each key. Assume that the number of keys and locks is high enough that taking the cartesian product of the two is computationally impractical. Also assume that keys and locks have an attached location which is accurate within an error (say 1 km). Only keys and locks within 1 km need be compared. Now assume I can create a JavaRDD<Key> and a JavaRDD<Lock>. I could divide the locations into 1 km squared bins and look only within a few bins. Assume that it is practical to take a cartesian product for all elements in a bin but not to keep all elements in memory. I could map my RDDs into PairRDDs where the key is the bin assigned by location. I know how to take the cartesian product of two JavaRDDs but not how to take a cartesian product of sets of elements sharing a common key (bin). Any suggestions? Assume that in the worst cases the number of elements in a bin is too large to keep in memory, although if a bin were subdivided into, say, 100 subbins the elements would fit in memory. Any thoughts as to how to attack the problem?
Re: Manipulating RDDs within a DStream
Hi Harold, Yes, that is the problem :) Sorry for the confusion, I will make this clear in the docs ;) since master is work for the next version. All you need to do is use spark 1.1.0 as you have it already, "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1", and assembly - not from master; checkout branch b1.1 and sbt ;clean ;reload ;assembly Cheers, - Helena @helenaedelson On Oct 31, 2014, at 1:35 PM, Harold Nguyen har...@nexgate.com wrote: Hi Helena, Thanks very much! I'm using Spark 1.1.0, and spark-cassandra-connector-assembly-1.2.0-SNAPSHOT Best wishes, Harold [...]
Re: Manipulating RDDs within a DStream
Hi Harold, This is a great use case, and here is how you could do it, for example, with Spark Streaming: Using a Kafka stream: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L50 Save raw data to Cassandra from that stream: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L56 Do n-computations on that streaming data: reading from Kafka, computing in Spark, and writing to Cassandra: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L69-L71 I hope that helps, and if not I’ll dig up another. - Helena @helenaedelson On Oct 31, 2014, at 1:37 PM, Harold Nguyen har...@nexgate.com wrote: Thanks Lalit, and Helena [...]
Re: Accessing Cassandra with SparkSQL, Does not work?
Thanks Helena. I tried setting the KeySpace, but I got the same result. I also removed other Cassandra dependencies, but still the same exception! I also tried to see if this setting appears in the CassandraSQLContext or not, so I printed out the configuration: val cc = new CassandraSQLContext(sc) cc.setKeyspace("mydb") cc.conf.getAll.foreach(f => println(f._1 + " : " + f._2)) printout: spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2 spark.driver.host : 192.168.1.111 spark.cassandra.connection.host : localhost spark.cassandra.input.split.size : 1 spark.app.name : SomethingElse spark.fileserver.uri : http://192.168.1.111:51463 spark.driver.port : 51461 spark.master : local Does it have anything to do with the version of Apache Cassandra that I use?? I use apache-cassandra-2.1.0 best, /Shahab The shortened SBT:

"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" withSources() withJavadoc(),
"net.jpountz.lz4" % "lz4" % "1.2.0",
"org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
"org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
"org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
"com.github.nscala-time" %% "nscala-time" % "1.0.0",
"org.scalatest" %% "scalatest" % "1.9.1" % "test",
"org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
"org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
"org.json4s" %% "json4s-jackson" % "3.2.5",
"junit" % "junit" % "4.8.1" % "test",
"org.slf4j" % "slf4j-api" % "1.7.7",
"org.slf4j" % "slf4j-simple" % "1.7.7",
"org.clapper" %% "grizzled-slf4j" % "1.0.2",
"log4j" % "log4j" % "1.2.17"

On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson helena.edel...@datastax.com wrote: Hi Shahab, I'm just curious, are you explicitly needing to use thrift? [...]
Re: SparkContext.stop() ?
You don't have to call it if you just exit your application, but it's useful for example in unit tests if you want to create and shut down a separate SparkContext for each test. Matei On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote: In cluster settings, if you don't explicitly call sc.stop() your application may hang. Like closing files, network connections, etc. when you're done with them, it's a good idea to call sc.stop(), which lets the spark master know that your application is finished consuming resources. On Fri, Oct 31, 2014 at 10:13 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: It is used to shut down the context when you're done with it, but if you're using a context for the lifetime of your application I don't think it matters. [...]
Re: Accessing Cassandra with SparkSQL, Does not work?
Hi Shahab, The apache cassandra version looks great. I think that doing cc.setKeyspace("mydb") cc.sql("SELECT * FROM mytable") versus cc.setKeyspace("mydb") cc.sql("select * from mydb.mytable") is the problem? And if not, would you mind creating a ticket off-list for us to help further? You can create one here: https://github.com/datastax/spark-cassandra-connector/issues with tag: help wanted :) Cheers, - Helena @helenaedelson On Oct 31, 2014, at 1:59 PM, shahab shahab.mok...@gmail.com wrote: Thanks Helena. I tried setting the KeySpace, but I got the same result. [...]
LinearRegression and model prediction threshold
Hi All, I am using LinearRegression and have a question about the details of the model.predict method. Basically it is predicting variable y given an input vector x. However, can someone point me to the documentation about what threshold is used in the predict method? Can that be changed? I am assuming that the i/p vector essentially gets mapped to a number and is compared against a threshold value, and then y is either set to 0 or 1 based on those two numbers. Another question I have is: if I want to save the model to hdfs for later reuse, is there a recommended way of doing that? // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) // Evaluate model on training examples and compute training error val valuesAndPreds = parsedData.map { point => val prediction = model.predict(point.features) (point.label, prediction) }
Re: Accessing Cassandra with SparkSQL, Does not work?
OK, I created an issue. Hopefully it will be resolved soon. Again thanks, best, /Shahab On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson helena.edel...@datastax.com wrote: Hi Shahab, The apache cassandra version looks great. [...]
Re: SparkContext.stop() ?
Actually, if you don't call SparkContext.stop(), the event log information that is used by the history server will be incomplete, and your application will never show up in the history server's UI. If you don't use that functionality, then you're probably ok not calling it, as long as your application exits after it's done using the context. On Fri, Oct 31, 2014 at 8:12 AM, ll duy.huynh@gmail.com wrote: what is it for? when do we call it? thanks! -- Marcelo
Re: LinearRegression and model prediction threshold
It sounds like you are asking about logistic regression, not linear regression. If so, yes, that's just what it does. The default would be 0.5 in logistic regression. If you 'clear' the threshold you get the raw margin out of this and other linear classifiers. On Fri, Oct 31, 2014 at 7:18 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I am using LinearRegression and have a question about the details of the model.predict method. [...]
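A minimal sketch of the logistic-regression case (assuming parsedData is the RDD[LabeledPoint] from the original post):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val model = LogisticRegressionWithSGD.train(parsedData, 100)
val labels = parsedData.map(p => model.predict(p.features)) // 0.0 or 1.0, using the default 0.5 threshold

model.clearThreshold() // from here on, predict returns the raw score instead of a 0/1 label
val scores = parsedData.map(p => model.predict(p.features))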
Re: A Spark Design Problem
Does the following help? JavaPairRDD<bin, key> join with JavaPairRDD<bin, lock>. If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 11:19 PM, francois.garil...@typesafe.com wrote: Hi Steve, Are you talking about sequence alignment? [...]
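A sketch of that idea in Scala (Key, Lock, binOf, and score are hypothetical stand-ins for the poster's types and functions):

import org.apache.spark.SparkContext._ // pair-RDD functions
import org.apache.spark.rdd.RDD

val keysByBin: RDD[(Long, Key)] = keys.map(k => (binOf(k.location), k))
val locksByBin: RDD[(Long, Lock)] = locks.map(l => (binOf(l.location), l))

val bestLockPerKey = keysByBin.join(locksByBin) // every key/lock pair that shares a bin
  .map { case (_, (k, l)) => (k.id, (l, score(k, l))) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b) // keep the highest-scoring lock per key

Keys near a bin boundary would also need to be compared against the neighbouring bins, for example by emitting each key under the adjacent bin ids as well before the join.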
Re: LinearRegression and model prediction threshold
You can serialize the model to a local/hdfs file system and use it later when you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Nov 1, 2014 at 12:02 AM, Sean Owen so...@cloudera.com wrote: It sounds like you are asking about logistic regression, not linear regression. [...]
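One way to do that in Spark 1.1, before dedicated model save/load APIs, is to write the model out as a one-element object file; a sketch, with a hypothetical HDFS path:

import org.apache.spark.mllib.regression.LinearRegressionModel

sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///models/lr-model")

// later, possibly in a different job:
val restored = sc.objectFile[LinearRegressionModel]("hdfs:///models/lr-model").first()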
Example of Fold
I want to fold an RDD into a smaller RDD with max elements. I have simple bean objects with 4 properties. I want to group by 3 of the properties and then select the max of the 4th. So I believe fold is the appropriate method for this. My question is: is there a good fold example out there? Additionally, what is the zero value used for as the first argument? Thanks.
Spark Build
I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch successfully previously and am trying to rebuild again to take advantage of the new Hive 0.13.1 profile. I execute the following command: $ mvn -DskipTests -Phive-0.13-1 -Phadoop-2.4 -Pyarn clean package The build fails at the following stage: INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 5 Scala sources to /home/terrys/Applications/spark/yarn/stable/target/scala-2.10/test-classes... [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:20: object MemLimitLogger is not a member of package org.apache.spark.deploy.yarn [ERROR] import org.apache.spark.deploy.yarn.MemLimitLogger._ [ERROR] ^ [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:29: not found: value memLimitExceededLogMessage [ERROR] val vmemMsg = memLimitExceededLogMessage(diagnostics, VMEM_EXCEEDED_PATTERN) [ERROR] ^ [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:30: not found: value memLimitExceededLogMessage [ERROR] val pmemMsg = memLimitExceededLogMessage(diagnostics, PMEM_EXCEEDED_PATTERN) [ERROR] ^ [ERROR] three errors found [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.758s] [INFO] Spark Project Common Network Code . SUCCESS [6.716s] [INFO] Spark Project Core SUCCESS [2:46.610s] [INFO] Spark Project Bagel ... SUCCESS [16.776s] [INFO] Spark Project GraphX .. SUCCESS [52.159s] [INFO] Spark Project Streaming ... SUCCESS [1:09.883s] [INFO] Spark Project ML Library .. SUCCESS [1:18.932s] [INFO] Spark Project Tools ... SUCCESS [10.210s] [INFO] Spark Project Catalyst SUCCESS [1:12.499s] [INFO] Spark Project SQL . SUCCESS [1:10.561s] [INFO] Spark Project Hive SUCCESS [1:08.571s] [INFO] Spark Project REPL SUCCESS [32.377s] [INFO] Spark Project YARN Parent POM . SUCCESS [1.317s] [INFO] Spark Project YARN Stable API . FAILURE [25.918s] [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter SKIPPED [INFO] Spark Project External Kafka .. SKIPPED [INFO] Spark Project External Flume Sink . SKIPPED [INFO] Spark Project External Flume .. SKIPPED [INFO] Spark Project External ZeroMQ . SKIPPED [INFO] Spark Project External MQTT ... SKIPPED [INFO] Spark Project Examples SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 11:15.889s [INFO] Finished at: Fri Oct 31 12:08:55 PDT 2014 [INFO] Final Memory: 73M/829M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-yarn_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-yarn_2.10 I could not find MemLimitLogger anywhere in the Spark code. 
Anybody else seen/encountered this? Thanks, -Terry
Re: Spark Build
Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon. On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote: I am synced up to the Spark master branch as of commit 23468e7e96. [...]
Re: Spark Build
Thanks for the update, Shivaram. -Terry On 10/31/14, 12:37 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon. [...]
Spark Standalone on cluster stops
Hi, I have an issue with running Spark in standalone mode on a cluster. Everything seems to run fine for a couple of minutes until Spark stops executing the tasks. Any idea? Would appreciate some help. Thanks in advance, Tassilo I get errors like this at the end: 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor updated: app-20141031161538-0003/7 is now EXITED (Command exited with code 1) 14/10/31 16:16:59 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141031161538-0003/7 removed: Command exited with code 1 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor added: app-20141031161538-0003/11 on worker-20141031142207-localhost-36911 (localhost:36911) with 12 cores 14/10/31 16:16:59 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141031161538-0003/11 on hostPort localhost:36911 with 12 cores, 16.0 GB RAM 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor updated: app-20141031161538-0003/11 is now RUNNING 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor updated: app-20141031161538-0003/8 is now EXITED (Command exited with code 1) 14/10/31 16:16:59 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141031161538-0003/8 removed: Command exited with code 1 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor added: app-20141031161538-0003/12 on worker-20141031142207-localhost-41750 (localhost:41750) with 12 cores
Re: Example of Fold
You should look at how fold is used in Scala in general to help. Here is a blog post that may also give some guidance: http://blog.madhukaraphatak.com/spark-rdd-fold The zero value should be your bean, with the 4th parameter set to the minimum value. Your fold function should compare the 4th param in the incoming records and choose the larger one. If you want to group by the 3 parameters prior to folding, you're probably better off using a reduce function. On Fri, Oct 31, 2014 at 12:01 PM, Ron Ayoub ronalday...@live.com wrote: I want to fold an RDD into a smaller RDD with max elements. [...]
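A sketch of the reduceByKey approach (Bean and its fields are hypothetical stand-ins for the poster's class; beans is an RDD[Bean]):

import org.apache.spark.SparkContext._ // pair-RDD functions

case class Bean(a: String, b: String, c: String, value: Double)

val maxPerGroup = beans
  .keyBy(b => (b.a, b.b, b.c)) // group by the first three properties
  .reduceByKey((x, y) => if (x.value >= y.value) x else y) // keep the record with the larger 4th property
  .values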
Re: Help with error initializing SparkR.
Try running this command: sudo -E R CMD javareconf and then start Spark. Basically, it syncs R's Java configuration with your system's Java configuration. Good luck!
hadoop_conf_dir when running spark on yarn
How do I set up hadoop_conf_dir correctly when I'm running my spark job on Yarn? My Yarn environment has the correct hadoop_conf_dir settings, but the configuration that I pull from sc.hadoopConfiguration() is incorrect.
SparkSQL performance
I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is. I would be very interested to hear what others in the group have to say about this. Thanks -Soumya
Re: SparkSQL performance
We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are. Ideally it’s critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-) Du From: Soumya Simanta soumya.sima...@gmail.com Date: Friday, October 31, 2014 at 4:04 PM To: user@spark.apache.org Subject: SparkSQL performance I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style [...]
Spark Meetup in Singapore
Dear Sir/Madam, I would like to become an organiser of a Singapore meetup to promote the regional Spark and big data community in the ASEAN area. My name is Songtao; I am a big data consultant in Singapore and have a great passion for Spark technologies. Thanks, Songtao
Re: SparkContext UI
Hi Sean/Sameer, It seems you're both right. In the python shell I need to explicitly call the empty parens data.cache(), then run an action and it appears in the storage tab. Using the scala shell I can just call data.cache without the parens, run an action, and that works. Thanks for your help. Stu On 31 October 2014 19:19, Sean Owen so...@cloudera.com wrote: No, empty parens do not matter when calling no-arg methods in Scala. This invocation should work as-is and should result in the RDD showing in Storage. I see that when I run it right now. Since it really does/should work, I'd look at other possibilities -- is it maybe taking a short time to start caching? looking at a different/old Storage tab? On Fri, Oct 31, 2014 at 1:17 AM, Sameer Farooqui same...@databricks.com wrote: Hi Stuart, You're close! Just add a () after the cache, like: data.cache() ...and then run the .count() action on it and you should be good to see it in the Storage UI! - Sameer On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Sorry, too quick to pull the trigger on my original email. I should have added that I've tried using persist() and cache() but no joy. I'm doing this: data = sc.textFile("somedata") data.cache data.count() but I still can't see anything in the storage? On 31 October 2014 10:42, Sameer Farooqui same...@databricks.com wrote: Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is purely on disk, then Spark will read it from disk, compute against it and then write the results back to disk or show you the results at the scala/python shells. But when you run Spark workloads against purely on-disk files, the RDD won't show up in Spark's Storage UI. Hope that makes sense... - Sameer On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Hi All, When I load an RDD with: data = sc.textFile("somefile") I don't see the resulting RDD in the SparkContext gui on localhost:4040 in /storage. Is there something special I need to do to allow me to view this? I tried both scala and python shells but same result. Thanks Stuart
Re: SparkSQL performance
I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct? Has anyone done any comparison of SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe. A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting record results. Thanks -Soumya On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote: We have seen all kinds of results published that often contradict each other. [...]