RE: [SPARK-3638] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Shuai Zheng
And it is a NoSuchMethodError, not a ClassNotFoundException. And by default I think Spark is only compiled against Hadoop 2.2? For this issue itself, I just checked the latest Spark (1.3.0), and its version can work (because it packages a newer version of httpclient, and I can see the method is

Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Kelly, Jonathan
See https://issues.apache.org/jira/browse/SPARK-6351 ~ Jonathan From: Shuai Zheng szheng.c...@gmail.com Date: Monday, March 16, 2015 at 11:46 AM To: user@spark.apache.org Subject:

problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Kelly, Jonathan
I'm attempting to use the Spark Kinesis Connector, so I've added the following dependency in my build.sbt: libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" My app works fine with sbt run, but I can't seem to get sbt assembly to work without failing with different

Re: Any way to find out feature importance in Spark SVM?

2015-03-16 Thread Xiangrui Meng
You can compute the standard deviations of the training data using Statistics.colStats and then compare them with model coefficients to compute feature importance. -Xiangrui On Fri, Mar 13, 2015 at 11:35 AM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Hello, While running an
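A minimal sketch of that suggestion, assuming a trained linear model (an SVMModel here) and an RDD of LabeledPoints; the function name and the |weight| * stddev score are illustrative, not part of the original reply:

```scala
import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Rough per-feature importance: |weight_j| * stddev_j for each feature j.
def featureImportance(model: SVMModel, training: RDD[LabeledPoint]): Array[Double] = {
  val stddev = Statistics.colStats(training.map(_.features))
    .variance.toArray.map(math.sqrt)
  model.weights.toArray.zip(stddev).map { case (w, s) => math.abs(w) * s }
}
```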

partitionBy not working w HashPartitioner

2015-03-16 Thread Adrian Mocanu
Here's my use case: I read an array into an RDD and I use a hash partitioner to partition the RDD. This is the array type: Array[(String, Iterable[(Long, Int)])] topK: Array[(String, Iterable[(Long, Int)])] = ... import org.apache.spark.HashPartitioner val hashPartitioner = new HashPartitioner(10)
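A minimal sketch of this use case, assuming an existing SparkContext and hypothetical data of the same shape; the key point is that partitionBy returns a new RDD, which must be kept to see the partitioner:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

def partitionTopK(sc: SparkContext): Unit = {
  // Hypothetical data with the same shape as topK above.
  val topK: Array[(String, Iterable[(Long, Int)])] =
    Array(("a", Seq((1L, 1), (2L, 2))), ("b", Seq((3L, 3))))

  // partitionBy is only available on pair RDDs and returns a NEW RDD;
  // the original RDD is left untouched.
  val partitioned = sc.parallelize(topK).partitionBy(new HashPartitioner(10))
  println(partitioned.partitioner) // Some(org.apache.spark.HashPartitioner@...)
}
```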

Re: Logistic Regression displays ERRORs

2015-03-16 Thread Xiangrui Meng
Actually, they should be INFO or DEBUG. Line search steps are expected. You can configure log4j.properties to ignore those. A better solution would be reporting this at https://github.com/scalanlp/breeze/issues -Xiangrui On Thu, Mar 12, 2015 at 5:46 PM, cjwang c...@cjwang.us wrote: I am running
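If you prefer to do it from code rather than editing log4j.properties, a hedged sketch: this assumes the breeze optimizer messages are routed through log4j and logged under the `breeze.optimize` package, and the exact logger name may differ in your setup.

```scala
import org.apache.log4j.{Level, Logger}

// Suppress the (harmless) line-search messages coming from breeze's optimizers.
Logger.getLogger("breeze.optimize").setLevel(Level.OFF)
```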

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread Xiangrui Meng
Try this: val ratings = purchase.map { line => line.split(',') match { case Array(user, item, rate) => (user.toInt, item.toInt, rate.toFloat) } }.toDF("user", "item", "rate") Doc for DataFrames: http://spark.apache.org/docs/latest/sql-programming-guide.html -Xiangrui On Mon, Mar 16, 2015 at 9:08 AM,

sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Shuai Zheng
Hi All, I just upgraded the system to use version 1.3.0, but then sqlContext.parquetFile doesn't work with s3n. I have tested the same code with 1.2.1 and it works. A simple test running in spark-shell: val parquetFile = sqlContext.parquetFile("s3n:///test/2.parq")

RE: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Shuai Zheng
I see, but this is really a big issue. Is there any way for me to work around it? I tried to set fs.default.name = s3n, but it looks like it doesn't work. I must upgrade to 1.3.0 because I face a package incompatibility issue in 1.2.1, and if I must patch something, I would rather go with the latest version.

Re: Basic GraphX deployment and usage question

2015-03-16 Thread Takeshi Yamamuro
Hi, You're right; that is, GraphX is already included in the default Spark package. As a first step, 'Analytics' seems to be suitable for your objective. # ./bin/run-example graphx.Analytics pagerank graph-file On Tue, Mar 17, 2015 at 2:21 AM, Khaled Ammar khaled.am...@gmail.com wrote:

Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Michael Armbrust
We will be including this fix in Spark 1.3.1 which we hope to make in the next week or so. On Mon, Mar 16, 2015 at 12:01 PM, Shuai Zheng szheng.c...@gmail.com wrote: I see, but this is really a… big issue. anyway for me to work around? I try to set the fs.default.name = s3n, but looks like it

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Tathagata Das
If you are creating an assembly, make sure spark-streaming is marked as provided. spark-streaming is already part of the Spark installation, so it will be present at run time. That might solve some of these, maybe!? TD On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com wrote:

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-16 Thread Chang-Jia Wang
I just used random numbers. (My ML lib was spark-mllib_2.10-1.2.1.) Please see the attached log. In the middle of the log, I dumped the data set before feeding it into LogisticRegressionWithLBFGS. The first column false/true was the label (attribute “a”), and columns 2-5 (attributes “x”, “y”, “z”, and

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Tathagata Das
Can you give us your SBT project? Minus the source code, if you don't wish to expose it. TD On Mon, Mar 16, 2015 at 12:54 PM, Kelly, Jonathan jonat...@amazon.com wrote: Yes, I do have the following dependencies marked as provided: libraryDependencies += org.apache.spark %% spark-core %

Re: What is best way to run spark job in yarn-cluster mode from java program(servlet container) and NOT using spark-submit command.

2015-03-16 Thread rrussell25
Hi, were you ever able to determine a satisfactory approach for this problem? I have a similar situation and would prefer to execute the job directly from Java code within my JMS listener and/or servlet container.

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Kelly, Jonathan
Yes, I do have the following dependencies marked as provided: libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided" libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided" libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Kelly, Jonathan
Here's build.sbt, minus blank lines for brevity, and without any of the exclude/excludeAll options that I've attempted: name := "spark-sandbox" version := "1.0" scalaVersion := "2.10.4" resolvers += "Akka Repository" at "http://repo.akka.io/releases/" run in Compile <<= Defaults.runTask(fullClasspath in
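For reference, a hedged sketch of how "different file contents found" conflicts are usually resolved with sbt-assembly. This assumes the sbt-assembly 0.12+ syntax (older plugin versions use `mergeStrategy in assembly`), and the exact cases needed depend on which files the error actually reports:

```scala
// build.sbt fragment (sketch): tell sbt-assembly how to resolve duplicate files
// contributed by multiple dependencies when building the fat jar.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf"              => MergeStrategy.concat
  case _                             => MergeStrategy.first
}
```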

unable to access spark @ spark://debian:7077

2015-03-16 Thread Ralph Bergmann
Hi, I am trying my first steps with Spark, but I have problems accessing Spark running on my Linux server from my Mac. I start Spark with sbin/start-all.sh. When I open the web UI at port 8080, I see that everything is running and that Spark should be accessible at port 7077, but this doesn't work. I scanned the

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Dibyendu Bhattacharya
Yes, auto-restart is enabled in my low-level consumer when some unhandled exception comes. If you look at KafkaConsumer.java, for some cases (like broker failure, Kafka leader changes, etc.) it can even refresh the Consumer (the Coordinator which talks to a Leader), which will

Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
I used local[*]. The CPU hits about 80% when there are active jobs, then it drops to about 13% and hangs for a very long time. Thanks, David On Mon, 16 Mar 2015 17:46 Akhil Das ak...@sigmoidanalytics.com wrote: How many threads are you allocating while creating the sparkContext? like local[4]

MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, The current implementation of the map function in Spark Streaming looks as below: def map[U: ClassTag](mapFunc: T => U): DStream[U] = { new MappedDStream(this, context.sparkContext.clean(mapFunc)) } It creates an instance of MappedDStream, which is a subclass of DStream. The same function can
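To illustrate the question, a sketch of what the same operation looks like when expressed through `transform` instead of the dedicated MappedDStream; this is an illustration only, not the actual DStream source, and the helper name is made up:

```scala
import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

// map expressed via transform: apply the same per-element function to every batch RDD.
def mapViaTransform[T, U: ClassTag](stream: DStream[T])(f: T => U): DStream[U] =
  stream.transform(rdd => rdd.map(f))
```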

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Akhil Das
Try setting SPARK_MASTER_IP, and you need to use the Spark URI (spark://yourlinuxhost:7077) as displayed in the top-left corner of the Spark UI (running on port 8080). Also, when you are connecting from your Mac, make sure your network/firewall isn't blocking any ports between the two machines. Thanks
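On the client side, a minimal sketch assuming the hostname in the spark:// URI resolves from the Mac; the hostname and app name below are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Use the exact spark:// URI shown in the top-left corner of the master UI (port 8080).
val conf = new SparkConf()
  .setAppName("connect-test")
  .setMaster("spark://yourlinuxhost:7077")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).reduce(_ + _)) // quick sanity check: 55
```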

Re: start-slave.sh failed with ssh port other than 22

2015-03-16 Thread Akhil Das
Open sbin/slaves.sh and sbin/spark-daemon.sh, look for the ssh command, and pass the port argument to that command (in your case *-p 58518*). Save those files and do a start-all.sh :) Thanks Best Regards On Mon, Mar 16, 2015 at 1:37 PM, ZhuGe t...@outlook.com wrote: Hi all: I am new to spark

Processing of text file in large gzip archive

2015-03-16 Thread sergunok
I have a 30GB gzip file (originally a text file where each line represents a text document) in HDFS, and Spark 1.2.0 under a YARN cluster with 3 worker nodes with 64GB RAM and 4 cores on each node. The replication factor for my file is 3. I tried to implement a simple pyspark script to parse this file
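A minimal Scala sketch of the usual caveat in this situation (the original poster used PySpark): a single .gz file is not splittable, so the initial read happens in one task, and repartitioning after the read spreads the downstream work across the cluster. Paths, partition counts, and the parsing step are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Master and deploy mode supplied by spark-submit on the YARN cluster.
val sc = new SparkContext(new SparkConf().setAppName("gzip-parse"))

// One .gz file == one input split == one task for the initial read.
val lines = sc.textFile("hdfs:///data/docs.txt.gz")

// Spread the parsed records over the cluster before the heavy work.
val docs = lines.repartition(24).map(line => line.split('\t'))
println(docs.count())
```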

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
I have checked Dibyendu's code; it looks like his implementation has an auto-restart mechanism:

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set it in code, not by configuration. I submit my jar file in local mode. I am working in my development environment. On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote: How are you setting it? And how are you submitting the job? Thanks Best Regards On Mon, Mar 16, 2015 at

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
By default spark.executor.memory is set to 512m. I'm assuming that since you are submitting the job using spark-submit, it is not able to override the value because you are running in local mode. Can you try it without using spark-submit, as a standalone project? Thanks Best Regards On Mon, Mar 16,

RE: MappedStream vs Transform API

2015-03-16 Thread Shao, Saisai
I think these two ways are both OK for writing a streaming job. `transform` is a more general way to transform one DStream into another when there is no related DStream API (but there is a related RDD API). But using map may be more straightforward and easier to understand. Thanks Jerry

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi Akhil, Yes, you are right. If I run the program from the IDE as a normal Java program, the executor's memory is increased... but not to 2048m; it is set to 6.7GB... Looks like there's some formula to calculate this value. Thanks, David On Mon, Mar 16, 2015 at 7:36 PM Akhil Das

Re: Processing of text file in large gzip archive

2015-03-16 Thread Akhil Das
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoopFile for this. 2. Instead of map, do a mapPartitions. 3. You need to open the driver UI and see what's really taking time. If that is running on a remote machine and you are not able to access

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
How much memory do you have on your machine? I think the default value is 0.6 of spark.executor.memory, as you can see here: http://spark.apache.org/docs/1.2.1/configuration.html#execution-behavior. Thanks Best Regards On Mon, Mar 16, 2015 at 2:26 PM, Xi Shen davidshe...@gmail.com wrote:

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set spark.executor.memory to 2048m. If the executor storage memory is 0.6 of executor memory, it should be 2g * 0.6 = 1.2g. My machine has 56GB memory, and 0.6 of that should be 33.6G...I hate math xD On Mon, Mar 16, 2015 at 7:59 PM Akhil Das ak...@sigmoidanalytics.com wrote: How much

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
Strange, even I'm having it while running in local mode. [image: Inline image 1] I set it as .set("spark.executor.memory", "1g") Thanks Best Regards On Mon, Mar 16, 2015 at 2:43 PM, Xi Shen davidshe...@gmail.com wrote: I set spark.executor.memory to 2048m. If the executor storage memory is 0.6
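For context, a hedged sketch of why setting spark.executor.memory in code often appears to have no effect in local mode: the executor runs inside the same JVM as the driver, and that JVM has already started by the time the SparkConf is built, so the usable heap is governed by the JVM options (spark-submit --driver-memory or the IDE's -Xmx), not by this property. The values and app name below are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Setting executor memory in code takes effect when executors are launched
// as separate JVMs (e.g. on a standalone or YARN cluster).
val conf = new SparkConf()
  .setAppName("memory-demo")
  .setMaster("local[*]")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)

// In local mode the executor shares the driver JVM, so the heap actually
// available is whatever the JVM was started with, regardless of this value.
println(sc.getConf.get("spark.executor.memory"))
```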

Re: Which OutputCommitter to use for S3?

2015-03-16 Thread Pei-Lun Lee
Hi, I created a JIRA and PR for supporting an s3-friendly output committer for saveAsParquetFile: https://issues.apache.org/jira/browse/SPARK-6352 https://github.com/apache/spark/pull/5042 My approach is to add a DirectParquetOutputCommitter class in the spark-sql package and use a boolean config
