Re: How SparkStreaming output messages to Kafka?

2015-03-29 Thread Akhil Das
Do you have enough messages in Kafka to consume? Can you make sure your Kafka setup is working with the console consumer? Also try this example. Thanks

Re: How SparkStreaming output messages to Kafka?

2015-03-29 Thread Saisai Shao
Hi Hui, did you try the direct Kafka stream example under Spark Streaming's examples? Does it still fail to receive the messages? Also, would you please check the whole setup, including Kafka; test with the Kafka console consumer to see if Kafka is OK. Besides, looking at your code, there are some problems
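
For reference, a minimal sketch of the direct Kafka stream approach mentioned above, written against the Spark 1.3-era API; the broker address and topic name are placeholders.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker list and topic set
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("test-topic")

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Print a few records of each batch to verify that messages actually arrive
    messages.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```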

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Akhil Das
What happens when you do: sc.textFile("hdfs://path/to/the_file.txt") Thanks Best Regards On Mon, Mar 30, 2015 at 11:04 AM, Nick Travers wrote: > Hi List, > > I'm following this example here > <https://github.com/databricks/learning-spark/tree/master/mini-complete-example> > with the fol
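
A minimal sketch of the suggested check, assuming a placeholder namenode host, port and file path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsReadCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsReadCheck"))
    // Fully qualified HDFS URI (host, port and path are placeholders)
    val lines = sc.textFile("hdfs://namenode-host:8020/path/to/the_file.txt")
    println(s"line count: ${lines.count()}")
    sc.stop()
  }
}
```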

java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Nick Travers
Hi List, I'm following this example here with the following: $SPARK_HOME/bin/spark-submit \ --deploy-mode cluster \ --master spark://host.domain.ex:7077 \ --class com.oreilly.learningsparkexamples.mini.scal

Fwd: How SparkStreaming output messages to Kafka?

2015-03-29 Thread luohui20001
Hi guys, I am using Spark Streaming to receive messages from Kafka, process them, and then send them back to Kafka. However, the Kafka consumer cannot receive any messages. Can anyone share ideas? Here is my code: object SparkStreamingSampleDirectApproach { def main(args: Array[String]): Unit = {
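
A common pattern for the "send back to Kafka" part is to create a producer per partition inside foreachRDD. A rough sketch against the Kafka 0.8 producer API; the broker list, topic name, and the `processed` DStream[String] are placeholders for the already-transformed stream.

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(processed: DStream[String]): Unit = {
  processed.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One producer per partition; a producer created on the driver is not serializable
      val props = new Properties()
      props.put("metadata.broker.list", "broker1:9092")
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      partition.foreach { msg =>
        producer.send(new KeyedMessage[String, String]("output-topic", msg))
      }
      producer.close()
    }
  }
}
```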

RE: Port configuration for BlockManagerId

2015-03-29 Thread Manish Gupta 8
Has anyone else faced this issue of running spark-shell (yarn client mode) in an environment with strict firewall rules (on fixed allowed incoming ports)? How can this be rectified? Thanks, Manish From: Manish Gupta 8 Sent: Thursday, March 26, 2015 4:09 PM To: user@spark.apache.org Subject: Por
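
One way to cope with a strict firewall is to pin the ports Spark would otherwise pick at random. A sketch of the relevant Spark 1.x properties; the port numbers are placeholders and the exact set of configurable ports depends on the Spark version.

```scala
import org.apache.spark.SparkConf

// Fix the listening ports so the firewall can allow them explicitly
val conf = new SparkConf()
  .set("spark.driver.port", "40000")        // driver <-> executor RPC
  .set("spark.blockManager.port", "40010")  // block manager on driver and executors
  .set("spark.fileserver.port", "40020")    // driver HTTP file server (Spark 1.x)
  .set("spark.broadcast.port", "40030")     // HTTP broadcast server (Spark 1.x)
```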

Re: What is the meaning of 'STATE' in a worker/an executor?

2015-03-29 Thread Mark Hamstra
A LOADING Executor is on the way to RUNNING, but hasn't yet been registered with the Master, so it isn't quite ready to do useful work. > On Mar 29, 2015, at 9:09 PM, Niranda Perera wrote: > > Hi, > > I have noticed in the Spark UI, workers and executors run on several states, > ALIVE, LOAD

What is the meaning of 'STATE' in a worker/an executor?

2015-03-29 Thread Niranda Perera
Hi, I have noticed in the Spark UI that workers and executors are in one of several states: ALIVE, LOADING, RUNNING, DEAD, etc. What exactly do these states mean, and what effect do they have on working with those executors? E.g. whether an executor cannot be used in the LOADING state, etc. cheers -- Niran

Pregel API Abstraction for GraphX

2015-03-29 Thread Kenny Bastani
Hi all, I have been working hard to make it easier for developers to make community contributions to the Spark GraphX algorithm library. At the core of this I found that the Pregel API is a difficult concept to understand and I think I can help make it better. Can you please review https://gist
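
For readers unfamiliar with it, the Pregel API being discussed is the GraphX primitive behind most of the built-in algorithms. The standard single-source-shortest-paths example from the GraphX documentation illustrates its three pieces (vertex program, send-message, merge-message); the graph and source vertex here are placeholders.

```scala
import org.apache.spark.graphx._

// Assumes an existing `graph: Graph[Long, Double]` whose edge attributes are distances
def shortestPaths(graph: Graph[Long, Double], sourceId: VertexId): Graph[Double, Double] = {
  // Start every vertex at infinity except the source
  val initialGraph = graph.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)

  initialGraph.pregel(Double.PositiveInfinity)(
    (id, dist, newDist) => math.min(dist, newDist),   // vertex program: keep the best distance
    triplet => {                                      // send messages along improving edges
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      else
        Iterator.empty
    },
    (a, b) => math.min(a, b)                          // merge messages: take the minimum
  )
}
```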

Re: How to specify the port for AM Actor ...

2015-03-29 Thread Shixiong Zhu
LGTM. Could you open a JIRA and send a PR? Thanks. Best Regards, Shixiong Zhu 2015-03-28 7:14 GMT+08:00 Manoj Samel : > I looked @ the 1.3.0 code and figured where this can be added > > In org.apache.spark.deploy.yarn ApplicationMaster.scala:282 is > > actorSystem = AkkaUtils.createActorSyst

Re: Running Spark in Local Mode

2015-03-29 Thread Saisai Shao
Hi, I think for local mode the number N (N threads) basically equals N available cores in ONE executor (worker), not N workers. You could imagine local[N] as having one worker with N cores. I'm not sure you can set the memory usage for each thread; for Spark the memory is share
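
A small sketch of what local[N] means in practice; the app name and thread count are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[4] = a single JVM acting as driver and executor, running up to 4 task threads.
// All threads share that one JVM's heap, so memory is sized for the whole process
// (e.g. via --driver-memory when launching), not per thread.
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("LocalProfiling")
val sc = new SparkContext(conf)
```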

Task result in Spark Worker Node

2015-03-29 Thread raggy
I am a PhD student working on a research project related to Apache Spark. I am trying to modify some of the spark source code such that instead of sending the final result RDD from the worker nodes to a master node, I want to send the final result RDDs to some different node. In order to do this, I

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread seglo
Thanks for the response. I'll admit I'm rather new to Mesos. Due to the nature of my setup I can't use the Mesos web portal effectively because I'm not connected by VPN, so the local network links from the mesos-master dashboard I SSH tunnelled aren't working. Anyway, I was able to dig up some l

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread Ted Yu
For Gradle, there are: https://github.com/musketyr/gradle-fatjar-plugin https://github.com/johnrengelman/shadow FYI On Sun, Mar 29, 2015 at 4:29 PM, jay vyas wrote: > thanks for posting this! Ive ran into similar issues before, and generally > its a bad idea to swap the libraries out and "pray

Re: Spark-submit not working when application jar is in hdfs

2015-03-29 Thread dilm
Made it work by using yarn-cluster as master instead of local. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22281.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread jay vyas
Thanks for posting this! I've run into similar issues before, and generally it's a bad idea to swap the libraries out and "pray for the best", so the shade functionality is probably the best feature. Unfortunately, I'm not sure how well SBT and Gradle support shading... how do folks using next gen bu
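
On the SBT side, a sketch of what a shading rule can look like with sbt-assembly; this assumes a plugin version recent enough to support shading, and the relocated package prefix is arbitrary.

```scala
// build.sbt fragment: relocate the conflicting httpclient packages inside the fat jar
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.http.**" -> "shaded.apachehttp.@1").inAll
)
```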

Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-29 Thread Eran Medan
Thanks Michael! Can you please point me to the docs / source location for that automatic casting? I'm just using it to extract the data and put it in a Map[String, Any] (long story on the reason...) so I think the casting rules won't "know" what to cast it to... right? I guess I can have the JSON /

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-29 Thread Eran Medan
Hi Sean, I think your point about the ETL costs is the winning argument here, but I would like to see more research on the topic. What I would like to see researched is the ability to run a specialized set of common algorithms in a "fast-local-mode", just like a compiler optimizer can decide to inline

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread hbogert
works/20150329-232522-84118794-5050-18181-/executors/5/runs/latest/stderr -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-run-spark-submit-with-an-application-jar-on-a-Mesos-cluster-tp22277p22280.html Sent from the Apache Spark User List mailing list ar

Running Spark in Local Mode

2015-03-29 Thread FreePeter
Hi, I am trying to use Spark for my own applications, and I am currently profiling the performance with local mode, and I have a couple of questions: 1. When I set spark.master local[N], it means the will use up to N worker *threads* on the single machine. Is this equivalent to say there are N wo

Re: Understanding Spark Memory distribution

2015-03-29 Thread Ankur Srivastava
Hi Wisely, I am running on Amazon EC2 instances, so I doubt the hardware is at fault. Moreover, my other pipelines run successfully except for this one, which involves broadcasting a large object. My spark-env.sh settings are: SPARK_MASTER_IP= SPARK_LOCAL_IP= SPARK_DRIVER_MEMORY=24g SPARK_WORKER_MEMORY=28g

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread Timothy Chen
I left a comment on your Stack Overflow question earlier. Can you share the output of the stderr log from your Mesos task? It can be found in the Mesos UI by going to the task's sandbox. Tim Sent from my iPhone > On Mar 29, 2015, at 12:14 PM, seglo wrote: > > The latter part of this question where I

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread seglo
The latter part of this question where I try to submit the application by referring to it on HDFS is very similar to the recent question Spark-submit not working when application jar is in hdfs http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-

Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread seglo
Mesosphere did a great job on simplifying the process of running Spark on Mesos. I am using this guide to setup a development Mesos cluster on Google Cloud Compute. https://mesosphere.com/docs/tutorials/run-spark-on-mesos/ I can run the example that's in the guide by using spark-shell (finding nu

Re: Build fails on 1.3 Branch

2015-03-29 Thread Reynold Xin
I pushed a hotfix to the branch. Should work now. On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower wrote: > Yes, that worked - thank you very much. > > > > On Sun, Mar 29, 2015 at 9:05 AM Ted Yu wrote: > >> Jenkins build failed too: >> >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Sp

Arguments/parameters in Spark shell scripts?

2015-03-29 Thread Minnow Noir
How does one consume parameters passed to a Scala script via spark-shell -i? 1. If I use an object with a main() method, the println outputs nothing as if not called: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object Test { de
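
One common workaround (a sketch, not the only approach) is to pass values as environment variables or JVM system properties when launching the shell and read them back inside the script; the variable and property names here are illustrative.

```scala
// Launched, for example, with:
//   MY_PARAM=foo spark-shell -i test.scala
// or
//   spark-shell --driver-java-options "-Dmy.param=foo" -i test.scala
val fromEnv  = sys.env.getOrElse("MY_PARAM", "default")
val fromProp = sys.props.getOrElse("my.param", "default")
println(s"param via env: $fromEnv, via system property: $fromProp")
```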

Re: Build fails on 1.3 Branch

2015-03-29 Thread Marty Bower
Yes, that worked - thank you very much. On Sun, Mar 29, 2015 at 9:05 AM Ted Yu wrote: > Jenkins build failed too: > > > https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull > > For the moment, you can apply the f

Re: Build fails on 1.3 Branch

2015-03-29 Thread Ted Yu
Jenkins build failed too: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull For the moment, you can apply the following change: diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.sca

Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-29 Thread Ted Yu
Nathan: Please look in log files for any of the following: doCleanupRDD(): case e: Exception => logError("Error cleaning RDD " + rddId, e) doCleanupShuffle(): case e: Exception => logError("Error cleaning shuffle " + shuffleId, e) doCleanupBroadcast(): case e: Exception => logErro

Build fails on 1.3 Branch

2015-03-29 Thread mjhb
I tried pulling the source and building for the first time, but cannot get past the "object NoRelation is not a member of package org.apache.spark.sql.catalyst.plans.logical" error below on the 1.3 branch. I can build the 1.2 branch. I have tried with both -Dscala-2.11 and 2.10 (after running the

Re: RDD collect hangs on large input data

2015-03-29 Thread Akhil Das
Don't call .collect() if your data size is huge; you can simply do a count() to trigger the execution. Can you paste your exception stack trace so that we'll know what's happening? Thanks Best Regards On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth wrote: > Hi, > > I have a simple Spark application: it
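
A tiny sketch of the suggestion, assuming `rdd` is the large RDD in question:

```scala
// Risky for large inputs: pulls every element back to the driver and can hang or OOM it
// val everything = rdd.collect()

// Triggers the full computation but returns only a Long to the driver
val n = rdd.count()

// If a peek at the data is needed on the driver, fetch a bounded sample instead
val firstFew = rdd.take(10)
```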

Re: Shuffle Read and Write

2015-03-29 Thread Akhil Das
It depends on the task you are doing. I think what is displayed on the web UI (attached pic) is for your complete application. Thanks Best Regards On Fri, Mar 27, 2015 at 8:28 PM, Laeeq Ahmed wrote: > Hi, > > Is shuffle read and write is for the complete streaming application or > just the cur

Re: input size too large | Performance issues with Spark

2015-03-29 Thread Akhil Das
Go through this once, if you haven't read it already. https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Sat, Mar 28, 2015 at 7:33 PM, nsareen wrote: > Hi All, > > I'm facing performance issues with spark implementation, and was briefly > investigating on WebUI logs, i noti

Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-29 Thread Akhil Das
Try these: - Disable shuffle : spark.shuffle.spill=false (It might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") Also see, wh

Re: Unable to run NetworkWordCount.java

2015-03-29 Thread Akhil Das
You need to set your master to local[2]; 2 is the minimum number of threads (in local mode) needed to consume and process data in a Streaming application. Thanks Best Regards On Sun, Mar 29, 2015 at 12:56 PM, mehak.soni wrote: > I am trying to run the NetworkWordCount.java in Spark streaming examples. I >
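
The fix being described, sketched out; the host, port and batch interval are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountLocal {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, at least one more to process batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```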

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Cheng Lian
I mean JavaSparkContext has a field named "sc", whose type is SparkContext. You may pass this "sc" to HiveContext. On 3/29/15 9:59 PM, Vincent He wrote: thanks . It does not work, and can not pass compile as HiveContext constructor does not accept JaveSparkContext and JaveSparkContext is not su

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Vincent He
Thanks. It does not work and does not compile, as the HiveContext constructor does not accept JavaSparkContext and JavaSparkContext is not a subclass of SparkContext. Does anyone else have any idea? I suspect this is not supported now. On Sun, Mar 29, 2015 at 8:54 AM, Cheng Lian wrote: > You may simply pa

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-29 Thread Vincent He
No luck, it does not work. Does anyone know whether there is some special setting for the spark-sql CLI so we do not need to write code to use Spark SQL? Does anyone have a simple example of this? I appreciate any help. Thanks in advance. On Sat, Mar 28, 2015 at 9:05 AM, Ted Yu wrote: > See > https://databricks

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Cheng Lian
You may simply pass in JavaSparkContext.sc On 3/29/15 9:25 PM, Vincent He wrote: All, I try Spark SQL with Java, I find HiveContext does not accept JavaSparkContext, is this true? Or any special build of Spark I need to do (I build with Hive and thrift server)? Can we use HiveContext in Java

Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Vincent He
All, I am trying Spark SQL with Java and find that HiveContext does not accept JavaSparkContext; is this true? Or is there any special build of Spark I need to do (I build with Hive and the thrift server)? Can we use HiveContext in Java? Thanks in advance.

Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-29 Thread Alexey Zinoviev
I figured out that Logging is a DeveloperApi and it should not be used outside Spark code, so everything is fine now. Thanks again, Marcelo. > On 24 Mar 2015, at 20:06, Marcelo Vanzin wrote: > > From the exception it seems like your app is also repackaging Scala > classes somehow. Can you doubl

Re: RDD Persistance synchronization

2015-03-29 Thread Sean Owen
I don't think you can guarantee that there is no recomputation. Even if you persist(), you might lose the block and have to recompute it. You can persist your UUIDs to storage like HDFS. They won't change then of course. I suppose you still face a much narrower problem, that the act of computing t
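
A sketch of both points, assuming an existing `records: RDD[String]`; the function name and the HDFS output path are placeholders.

```scala
import java.util.UUID
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def assignStableIds(records: RDD[String]): RDD[(String, String)] = {
  val withIds = records.map(r => (UUID.randomUUID().toString, r))
  withIds.persist(StorageLevel.MEMORY_AND_DISK)
  withIds.count()  // materialize, so later actions normally reuse the cached blocks

  // persist() is only best-effort: if a cached block is lost it is recomputed and the
  // UUIDs change. Writing the ids out once makes them truly stable across actions.
  withIds.map { case (id, r) => s"$id\t$r" }.saveAsTextFile("hdfs:///tmp/records-with-ids")
  withIds
}
```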

Re: RDD Persistance synchronization

2015-03-29 Thread Harut Martirosyan
Thanks to you again, Sean. The thing is that we persist and count that RDD in the hope that later actions on it won't trigger recomputation; it's not really about performance here, it's because the recomputation includes UUID generation, which should stay the same for all further actions. I und

Re: RDD Persistance synchronization

2015-03-29 Thread Sean Owen
persist() completes immediately since it only marks the RDD for persistence. count() triggers computation of rdd, and as rdd is computed it will be persisted. The following transform should therefore only start after count() and therefore after the persistence completes. I think there might be corn

Re: FW: converting DStream[String] into RDD[String] in spark streaming [I]

2015-03-29 Thread Sean Owen
You're just describing the operation of Spark Streaming at its simplest, without windowing. You get non-overlapping RDDs of the most recent data each time. On Sun, Mar 29, 2015 at 8:44 AM, Deenar Toraskar wrote: > Sean > > Thank you very much for your response. I have a requirement run a function
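
A sketch of that default behaviour, with no window() call; the batch interval and socket source are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("LatestBatchOnly")
val ssc = new StreamingContext(conf, Seconds(30))

val events = ssc.socketTextStream("localhost", 9999)

// Without window(), each invocation sees only the RDD for the newest 30-second batch,
// so the batches are non-overlapping by construction.
events.foreachRDD { (rdd, time) =>
  println(s"batch at $time: ${rdd.count()} new events")
}

ssc.start()
ssc.awaitTermination()
```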

kmeans|| in Spark is not real paralleled?

2015-03-29 Thread Xi Shen
Hi, I have opened a couple of threads asking about the k-means performance problem in Spark. I think I made a little progress. Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It uses the "kmeans||" initialization algorithm, which is supposedly a faster version of kmeans++ an
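
For comparison, a sketch of switching the initializer to plain random instead of the default k-means||; it assumes an existing `vectors: RDD[Vector]` of features, and k and the iteration count are placeholders.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainWithRandomInit(vectors: RDD[Vector]): KMeansModel = {
  // KMeans.train(rdd, k, maxIterations) defaults to "k-means||" initialization;
  // the builder API exposes the initialization mode explicitly.
  new KMeans()
    .setK(10)
    .setMaxIterations(20)
    .setInitializationMode("random")  // instead of the default "k-means||"
    .run(vectors)
}
```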

Re: converting DStream[String] into RDD[String] in spark streaming [I]

2015-03-29 Thread Deenar Toraskar
Sean, thank you very much for your response. I have a requirement to run a function only over the new inputs in a Spark Streaming sliding window, i.e. the latest batch of events only. Do I just get a new DStream using a slide duration equal to the window duration? Such as: val sparkConf = new

RDD Persistance synchronization

2015-03-29 Thread Harut Martirosyan
Hi. rdd.persist() rdd.count() rdd.transform()... is there a chance transform() runs before persist() is complete? -- RGRDZ Harut

Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen
Hi Burak, Unfortunately, I am expected to do my work in the HDInsight environment, which only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply replace it with Spark 1.3. I think the problem I am observing is caused by the kmeans|| initialization step. I will open another thread to discuss it.

Unable to run NetworkWordCount.java

2015-03-29 Thread mehak.soni
I am trying to run the NetworkWordCount.java in Spark streaming examples. I was able to run it using run-example. I was now trying to run the same code from an app I created. This is the code- it looks pretty much similar to the existing code: import scala.Tuple2; import com.google.common.collect.

Untangling dependency issues in spark streaming

2015-03-29 Thread Neelesh
Hi, my streaming app uses org.apache.httpcomponents:httpclient:4.3.6, but Spark uses 4.2.6, and I believe that's what's causing the following error. I've tried setting spark.executor.userClassPathFirst & spark.driver.userClassPathFirst to true in the config, but that does not solve it either. Fina