Reply: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
To Saisai: it works after I corrected some of it following your advice, as below. Furthermore, I am not quite clear about which code runs on the driver and which runs on the executors, so I wrote my understanding in comments. Would you help check? Thank you. To Akhil:

Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
Also you could use a Producer singleton to improve performance: since right now you have to create a Producer for each partition in each batch duration, you could create a singleton object and reuse it (the Producer is thread safe as far as I know). -Jerry 2015-03-30 15:13 GMT+08:00 Saisai Shao
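
A rough sketch of that suggestion (my own illustration, not from the thread; it assumes the old Kafka 0.8 producer API, `dstream` is an existing DStream[String], and the object name, broker address and topic are made up):

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    // One lazily created producer per executor JVM, reused across batches.
    object KafkaProducerSingleton {
      private var producer: Producer[String, String] = null

      def getOrCreate(brokers: String): Producer[String, String] = synchronized {
        if (producer == null) {
          val props = new Properties()
          props.put("metadata.broker.list", brokers)
          props.put("serializer.class", "kafka.serializer.StringEncoder")
          producer = new Producer[String, String](new ProducerConfig(props))
        }
        producer
      }
    }

    // Inside the streaming job: one lookup per partition instead of one new Producer.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val producer = KafkaProducerSingleton.getOrCreate("broker1:9092")
        records.foreach(msg => producer.send(new KeyedMessage[String, String]("topic1", msg)))
      }
    }

Because the object is initialized lazily on the executor, no Producer has to be serialized from the driver and the connection cost is amortized across batches.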

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread Akhil Das
What happens when you do: sc.textFile("hdfs://path/to/the_file.txt")? Thanks Best Regards On Mon, Mar 30, 2015 at 11:04 AM, Nick Travers n.e.trav...@gmail.com wrote: Hi List, I'm following this example here https://github.com/databricks/learning-spark/tree/master/mini-complete-example

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread ๏̯͡๏
I am able to connect to MySQL Hive metastore from the client cluster machine. -sh-4.1$ mysql --user=hiveuser --password=pass --host= hostname.vip.company.com Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 9417286 Server version: 5.5.12-eb-5.5.12-log MySQL-eb

Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Akhil Das
Do you have enough messages in Kafka to consume? Can you make sure your Kafka setup is working with your console consumer? Also try this example https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala Thanks

Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
Yeah, after reviewing your code again: the reason you cannot receive any data is that your previous code lacks an ACTION on the DStream, so the code never actually executes; after you change to the style I mentioned, `foreachRDD` will trigger and run the jobs as you wrote them. Yes,

Spark 1.3 build with hive support fails on JLine

2015-03-30 Thread Night Wolf
Hey, I'm trying to build Spark 1.3 with Scala 2.11 supporting YARN and Hive (with the Thrift server). Running: *mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive -Phive-thriftserver clean install* The build fails with: [INFO] Compiling 9 Scala sources to

Re: Reply: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
This warning is not related to --from-beginning. It means there is no new data for the current partition in the current batch duration, which is acceptable. If you push data into Kafka again, this warning will disappear. Thanks Saisai 2015-03-30 16:58 GMT+08:00 luohui20...@sina.com: BTW,

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread Saisai Shao
Shuffle write will eventually spill the data to the file system as a bunch of files. If you want to avoid disk writes, you can mount a ramdisk and point spark.local.dir to it, so the shuffle output is written to a memory-based FS and introduces no disk IO. Thanks Jerry 2015-03-30 17:15
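
As a minimal sketch of that setting (the mount point below is an assumption, and the ramdisk itself has to be created at the OS level first, e.g. as a tmpfs mount):

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark's scratch space at a memory-backed mount so shuffle spill
    // files never touch the physical disk.
    val conf = new SparkConf()
      .setAppName("shuffle-on-ramdisk")
      .set("spark.local.dir", "/mnt/ramdisk/spark")   // assumed tmpfs mount point
    val sc = new SparkContext(conf)

The same property can also be set in spark-defaults.conf or via SPARK_LOCAL_DIRS; note the shuffle output still has to fit in the ramdisk, so this trades disk IO for memory.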

Re: Using ORC input for mllib algorithms

2015-03-30 Thread Zsolt Tóth
Thanks for your answer! Unfortunately I can't use Spark SQL for some reason. If anyone has experience in using ORC as hadoopFile, I'd be happy to read some hints/thoughts about my issues. Zsolt 2015-03-27 19:07 GMT+01:00 Xiangrui Meng men...@gmail.com: This is a PR in review to support ORC

why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Hi, I was looking at the Spark UI's Executors tab and noticed that I have 597 MB of Shuffle Write while I am using a cached temp table and Spark had 2 GB of free memory (the number under Memory Used is 597 MB / 2.6 GB)?! Shouldn't Shuffle Write be zero, with all (map/reduce) tasks done in

Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-30 Thread Nathan Marin
Hi, thanks for your quick answers. I looked at what was being written on disk and a folder called blockmgr-d0236c76-7f7c-4a60-a6ae-ffc622b2db84 was growing every second. This folder contained shuffle data and was not being cleaned (after 30 minutes of my application running it contained the

Reply: Reply: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
BTW, what is the matter with the warning below? I am not quite clear about KafkaRDD. WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping topic1 0. Does this warning occur because I started the consumer without the --from-beginning param?

Reply: Re: Reply: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
Got it. Thank you. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Saisai Shao sai.sai.s...@gmail.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: Reply: Re: Re: How SparkStreaming output messages to Kafka? Date: 2015-03-30 17:05

Re: Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Sean Owen
The behavior is the same. I am not sure it's a problem as much as a design decision. It does not require everything to stay in memory, only the values for one key at a time. Have a look at how the preceding shuffle works. Consider repartitionAndSortWithinPartitions to *partition* by hour and then
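
A rough sketch of that idea (my own illustration, not from the thread; it assumes records keyed by a millisecond timestamp with a String payload, and HourPartitioner/writeByHour are made-up names):

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Route records of the same hour to the same partition, using only the
    // hour component of the composite (hour, timestamp) key.
    class HourPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = key match {
        case (hour: Long, _) => (hour % partitions).toInt
      }
    }

    def writeByHour(records: RDD[(Long, String)], numPartitions: Int): Unit = {
      val keyed = records.map { case (ts, payload) =>
        ((ts / 3600000L, ts), payload)            // composite key: (hour bucket, ts)
      }
      keyed
        .repartitionAndSortWithinPartitions(new HourPartitioner(numPartitions))
        .foreachPartition { iter =>
          iter.foreach { case ((hour, ts), payload) =>
            // append to the output file for `hour`; records arrive sorted by ts,
            // so no single group ever has to be materialized in memory
          }
        }
    }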

Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
Hi all, I am trying to understand how Spark lazy evaluation works, and I need some help. I have noticed that creating an RDD once and using it many times won't trigger recomputation of it every time it gets used, whereas creating a new RDD every time a new operation is performed will trigger

Re: Spark caching

2015-03-30 Thread Sean Owen
I think that you get a sort of silent caching after shuffles, in some cases, since the shuffle files are not immediately removed and can be reused. (This is the flip side to the frequent question/complaint that the shuffle files aren't removed straight away.) On Mon, Mar 30, 2015 at 9:43 AM,

Receive on driver program (without serializing)

2015-03-30 Thread MartijnD
We are building a wrapper that makes it possible to use reactive streams (i.e. Observable, see reactivex.io) as input to Spark Streaming. We therefore tried to create a custom receiver for Spark. However, the Observable lives at the driver program and is generally not serializable. Is it possible

Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Mario Pastorelli
We are experiencing some problems with the groupBy operation when it is used to group together data that will be written to the same file. The operation that we want to do is the following: given some data with a timestamp, we want to sort it by timestamp, group it by hour and write one file per

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran
Note that even the Facebook four degrees of separation paper went down to a single machine running WebGraph (http://webgraph.di.unimi.it/) for the final steps, after running jobs in their Hadoop cluster to build the dataset for that final operation. The computations were performed on a

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-30 Thread Cheng Lian
Try this in the Spark shell: import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.hive.HiveContext; val jsc = new JavaSparkContext(sc); val hc = new HiveContext(jsc.sc) (I never mentioned that JavaSparkContext extends SparkContext…) Cheng On 3/30/15 8:28 PM,

Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API. On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread Cheng Lian
The mysql command line doesn't use JDBC to talk to MySQL server, so this doesn't verify anything. I think this Hive metastore installation guide from Cloudera may be helpful. Although this document is for CDH4, the general steps are the same, and should help you to figure out the

Re: Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Mario Pastorelli
It worked, thank you. On 30.03.2015 11:58, Sean Owen wrote: The behavior is the same. I am not sure it's a problem as much as a design decision. It does not require everything to stay in memory, only the values for one key at a time. Have a look at how the preceding shuffle works. Consider

Spark Streaming/Flume display all events

2015-03-30 Thread Chong Zhang
Hi, I am new to Spark/Streaming, and tried to run a modified FlumeEventCount.scala example that displays all events by adding the call: stream.map(e => "Event:header: " + e.event.get(0).toString + " body: " + new String(e.event.getBody.array)).print() The spark-submit runs fine with --master local[4],

Re: Too many open files

2015-03-30 Thread Ted Yu
bq. In /etc/security/limits.conf set the next values: Have you done the above modification on all the machines in your Spark cluster? If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains the following line: session required pam_limits.so On Mon, Mar 30, 2015 at 5:08

Re: python : Out of memory: Kill process

2015-03-30 Thread Eduardo Cusa
Hi, I changed my process flow. Now I am processing a file per hour instead of processing at the end of the day. This decreased the memory consumption. Regards Eduardo On Thu, Mar 26, 2015 at 3:16 PM, Davies Liu dav...@databricks.com wrote: Could you narrow down to a step which causes the

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread ๏̯͡๏
Hello Lian, can you share the URL? On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian lian.cs@gmail.com wrote: The mysql command line doesn't use JDBC to talk to the MySQL server, so this doesn't verify anything. I think this Hive metastore installation guide from Cloudera may be helpful.

Re: Too many open files

2015-03-30 Thread Masf
I'm executing my application in local mode (with --master local[*]). I'm using Ubuntu and I've put session required pam_limits.so into /etc/pam.d/common-session, but it doesn't work. On Mon, Mar 30, 2015 at 4:08 PM, Ted Yu yuzhih...@gmail.com wrote: bq. In /etc/security/limits.conf set the next

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-30 Thread Doug Balog
The “best” solution to spark-shell’s problem is creating a file $SPARK_HOME/conf/java-opts with “-Dhdp.version=2.2.0.0-2014” Cheers, Doug On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote: I've also been having trouble running 1.3.0 on HDP. The

Re: Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
Thanks Sean! Do you know if there is a way (even manually) to delete these intermediate shuffle results? I just want to test the expected behaviour. I know that re-caching might be a positive action most of the time, but I want to try it without it. Renato M. 2015-03-30 12:15 GMT+02:00 Sean

Too many open files

2015-03-30 Thread Masf
Hi, I have a problem with temp data in Spark. I have fixed spark.shuffle.manager to SORT. In /etc/security/limits.conf I set the following values: * soft nofile 100 * hard nofile 100 In spark-env.sh I set ulimit -n 100 I've restarted the Spark service and it

Re: SparkSQL Timestamp query failure

2015-03-30 Thread anu
Hi Alessandro, could you specify which query you were able to run successfully? 1. sqlContext.sql("SELECT * FROM Logs as l where l.timestamp = '2012-10-08 16:10:36'").collect OR 2. sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as string) = '2012-10-08 16:10:36.0'").collect I am

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Thanks Saisai. I will try your solution, but I still don't understand why the filesystem should be used when there is plenty of memory available! On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com wrote: Shuffle write will eventually spill the data to the file system as a bunch of

Re: RDD collect hangs on large input data

2015-03-30 Thread Zsolt Tóth
Thanks for your answer! I don't call .collect to trigger the execution; I call it because I need the RDD on the driver. This is not a huge RDD and it's not larger than the one returned with 50GB of input data. The end of the stack trace: The two IPs are the two worker nodes, I think

Re: Too many open files

2015-03-30 Thread Akhil Das
Mostly, you will have to restart the machines (or re-login) for the ulimit change to take effect. What operation are you doing? Are you doing too many repartitions? Thanks Best Regards On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote: Hi I have a problem with temp data in Spark. I have

Re: Too many open files

2015-03-30 Thread Masf
Hi. I've re-logged in; in fact, running 'ulimit -n' returns 100, but it still crashes. I'm doing reduceByKey and Spark SQL mixed over 17 files (250MB-500MB/file). Regards. Miguel Angel. On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Mostly, you will have to restart

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just as Spark disrupted the Hadoop ecosystem by changing the assumption that you can't rely on memory in distributed analytics... now maybe we are challenging the assumption that big data analytics needs to be distributed? I've been asking the same question lately and seen similarly that

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-30 Thread Vincent He
Thanks. That is what I have tried. JavaSparkContext does not extend SparkContext, so it cannot be used here. Does anyone else know whether we can use HiveContext with JavaSparkContext? From the API documents, it seems this is not supported. Thanks. On Sun, Mar 29, 2015 at 9:24 AM, Cheng Lian

Online Realtime Recommendation System

2015-03-30 Thread dvpe
Hi, I would like to have an online realtime recommendation system. I have an ALS model but I want to add new data in realtime. Is it possible? Any guidelines? -- View this message in context:

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-03-30 Thread dvpe
Hi, do you have any updates? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942p22296.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread java8964
I think the jar file has to be local; a jar in HDFS is not supported yet in Spark. See this answer: http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs Date: Sun, 29 Mar 2015 22:34:46 -0700 From: n.e.trav...@gmail.com To: user@spark.apache.org

Re: Spark Streaming/Flume display all events

2015-03-30 Thread Nathan Marin
Hi, DStream.print() only prints the first 10 elements contained in the Stream. You can call DStream.print(x) to print the first x elements but if you don’t know the exact count you can call DStream.foreachRDD and apply a function to display the content of every RDD. For example:
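
The example itself is cut off in the archive; a minimal version of such a function could look like this (assuming `stream` is a DStream[String], and keeping in mind that collect() pulls each batch to the driver, so it is only suitable for small test streams):

    stream.foreachRDD { rdd =>
      // print every element of this batch on the driver
      rdd.collect().foreach(println)
    }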

Re: Job Opportunity in London

2015-03-30 Thread Chitturi Padma
Hi, I am interested in this opportunity. I am working as a Research Engineer at Impetus Technologies, Bangalore, India. In fact we implemented Distributed Deep Learning on Spark. I will share my CV if you are interested. Please visit the link below:

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Jaonary Rabarisoa
Dear all, I'm still struggling to make a pre-trained Caffe model transformer for dataframes work. The main problem is that creating a Caffe model inside the UDF is very slow and consumes memory. Some of you suggested broadcasting the model. The problem with broadcasting is that I use a JNI

Re: How to avoid being killed by YARN node manager ?

2015-03-30 Thread Y. Sakamoto
Thank you for your reply, and I'm sorry for the slow confirmation. I'll try tuning 'spark.yarn.executor.memoryOverhead'. Thanks, Yuichiro Sakamoto On 2015/03/25 0:56, Sandy Ryza wrote: Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have

actorStream woes

2015-03-30 Thread Marius Soutier
Hi there, I'm using Spark Streaming 1.2.1 with actorStreams. Initially, all goes well. 15/03/30 15:37:00 INFO spark.storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.2 KB, free 1589.8 MB) 15/03/30 15:37:00 INFO spark.storage.BlockManagerInfo: Added

Re: Job Opportunity in London

2015-03-30 Thread Akhil Das
Maybe you should mail him directly on j.bo...@ucl.ac.uk Thanks Best Regards On Mon, Mar 30, 2015 at 8:47 PM, Chitturi Padma learnings.chitt...@gmail.com wrote: Hi, I am interested in this opportunity. I am working as Research Engineer in Impetus Technologies, Bangalore, India. In fact we

Re: Understanding Spark Memory distribution

2015-03-30 Thread giive chen
Hi Ankur, if you are using standalone mode, your config is wrong. You should use export SPARK_DAEMON_MEMORY=xxx in conf/spark-env.sh. At least it works on my Spark 1.3.0 standalone-mode machine. BTW, SPARK_DRIVER_MEMORY is used in YARN mode and it looks like standalone mode doesn't use this

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread Cheng Lian
Ah, sorry, my bad... http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html On 3/30/15 10:24 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: Hello Lian Can you share the URL ? On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian lian.cs@gmail.com

RE: How to get rdd count() without double evaluation of the RDD?

2015-03-30 Thread Wang, Ningjun (LNG-NPV)
Sean, yes I know that I can use persist() to persist to disk, but persisting a huge RDD to disk is still a big extra cost. I hope I can get the count in the same pass as rdd.saveAsObjectFile(file2), but I don't know how. Maybe use an accumulator to count the total? Ningjun From:
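
A sketch of the accumulator idea (my own illustration, not from the thread; it assumes `sc` and `rdd` as in the message, and the output path is made up):

    // Count records in the same pass that writes them, instead of a second count() job.
    val countAcc = sc.accumulator(0L, "record count")

    val counted = rdd.map { x =>
      countAcc += 1L           // tally each record as it flows through the save job
      x
    }
    counted.saveAsObjectFile("hdfs:///tmp/file2")   // hypothetical output path

    val total = countAcc.value  // read on the driver after the action has run

One caveat: if a task is retried or speculatively re-executed, the accumulator can over-count, so this gives an exact total only when every task runs exactly once.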

Re: Cannot run spark-shell command not found.

2015-03-30 Thread roni
I think you must have downloaded the Spark source code gz file. It is a little confusing. You also have to select the Hadoop version, and the actual tgz file will have the Spark version and Hadoop version in its name. -R On Mon, Mar 30, 2015 at 10:34 AM, vance46 wang2...@purdue.edu wrote: Hi all, I'm a

Re: java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-30 Thread Gaspar Muñoz
Hello, Thank you for your contribution. We have tried to reproduce your error but we need more information: - Which Spark version are you using? The Stratio Spark-MongoDB connector supports the 1.2.x Spark SQL version. - Which jars are you adding while launching the spark-shell? Best regards,

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote: Hi Burak, After I added .repartition(sc.defaultParallelism), I can see from the log the partition number is set

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Ted Yu
Nicolas: See if there was an occurrence of the following exception in the log: errs => throw new SparkException(s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " + errs.mkString("\n")), Cheers On Mon, Mar 30, 2015 at 9:40 AM, Cody Koeninger

Cannot run spark-shell command not found.

2015-03-30 Thread vance46
Hi all, I'm a newbie trying to set up Spark for my research project on a RedHat system. I've downloaded spark-1.3.0.tgz and untarred it, and installed Python, Java and Scala. I've set JAVA_HOME and SCALA_HOME and then tried to use sudo sbt/sbt assembly according to

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Akhil Das
Did you try this example? https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala I think you need to create a topic set with # partitions to consume. Thanks Best Regards On Mon, Mar 30, 2015 at 9:35 PM, Nicolas
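
For reference, a minimal sketch of the receiver-based call that example uses (assuming an existing StreamingContext `ssc`; the ZooKeeper address, group id, topic names and thread counts below are made up):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Map of topic -> number of consumer threads for that topic.
    val topics = Map("topic1" -> 2, "topic2" -> 2)
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", topics)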

Re: Cannot run spark-shell command not found.

2015-03-30 Thread Manas Kar
If you are only interested in getting hands-on with Spark, and not in building it with a specific version of Hadoop, use one of the bundle providers like Cloudera. It will give you a very easy way to install and monitor your services. (I find installing via Cloudera Manager

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran
On 30 Mar 2015, at 13:27, jay vyas jayunit100.apa...@gmail.com wrote: Just as Spark disrupted the Hadoop ecosystem by changing the assumption that you can't rely on memory in distributed analytics... now maybe we are challenging the assumption

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Shivaram Venkataraman
One workaround could be to convert the DataFrame into an RDD inside the transform function, then use mapPartitions/broadcast to work with the JNI calls, and then convert back to an RDD. Thanks Shivaram On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Cody Koeninger
This line at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close( KafkaRDD.scala:158) is the attempt to close the underlying kafka simple consumer. We can add a null pointer check, but the underlying issue of the consumer being null probably indicates a problem earlier. Do you see

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-30 Thread Xiangrui Meng
Okay, I didn't realize that I changed the behavior of lambda in 1.3 to make it scale-invariant, but it is worth discussing whether this is a good change. In 1.2, we multiply lambda by the number of ratings in each sub-problem. This makes it scale-invariant for explicit feedback. However, in implicit

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-30 Thread Zhan Zhang
Hi Folks, just to summarize how to run Spark on the HDP distribution: 1. The Spark version has to be 1.3.0 or above if you are using the upstream distribution. This configuration is mainly for HDP rolling-upgrade purposes, and the patch only went into Spark upstream from 1.3.0. 2. In

log4j.properties in jar

2015-03-30 Thread Udit Mehta
Hi, Is it possible to put the log4j.properties in the application jar such that the driver and the executors use this log4j file. Do I need to specify anything while submitting my app so that this file is used? Thanks, Udit

Spark and OpenJDK - jar: No such file or directory

2015-03-30 Thread Kelly, Jonathan
I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the compute-classpath.sh script is not adding the datanucleus jars to the classpath because compute-classpath.sh is assuming to find the jar command in $JAVA_HOME/bin/jar, which does not exist for OpenJDK. Is this an issue anybody

Spark 1.3.0 Build Failure

2015-03-30 Thread ARose
So, I am trying to build Spark 1.3.0 (standalone mode) on Windows 7 using Maven, but I'm getting a build failure. java -version java version "1.8.0_31" Java(TM) SE Runtime Environment (build 1.8.0_31-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode) Here is the command I am

Re: Actor not found

2015-03-30 Thread sparkdi
I have the same problem, i.e. an exception with the same call stack when I start either pyspark or spark-shell. I use spark-1.3.0-bin-hadoop2.4 on Ubuntu 14.10. bin/pyspark prints a bunch of INFO messages, then an ActorInitializationException. The shell starts, and I can do this: rd = sc.parallelize([1,2])

When will 1.3.1 release?

2015-03-30 Thread Shuai Zheng
Hi All, I am waiting for Spark 1.3.1 to fix the bug so it works with the S3 file system. Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because there is a jar compatibility issue with the AWS SDK. Regards, Shuai

Re: When will 1.3.1 release?

2015-03-30 Thread Kelly, Jonathan
Are you referring to SPARK-6330 (https://issues.apache.org/jira/browse/SPARK-6330)? If you are able to build Spark from source yourself, I believe you should just need to cherry-pick the following commits in order to backport the fix: 67fa6d1f830dee37244b5a30684d797093c7c134 [SPARK-6330] Fix

Re: Spark and OpenJDK - jar: No such file or directory

2015-03-30 Thread Kelly, Jonathan
Ah, never mind, I found the jar command in the java-1.7.0-openjdk-devel package. I only had java-1.7.0-openjdk installed. Looks like I just need to install java-1.7.0-openjdk-devel then set JAVA_HOME to /usr/lib/jvm/java instead of /usr/lib/jvm/jre. ~ Jonathan Kelly From: Kelly, Jonathan

Why is a Spark job faster through Eclipse than Standalone Cluster

2015-03-30 Thread rival95
When I run my code in Eclipse with the following parameters: VM Args: -Xmx4g, OS: Windows, Time: 4.4 minutes, it is faster than submitting to a cluster with these parameters: SPARK_EXECUTOR_MEMORY=4G, OS: Ubuntu, Time: 5.2 minutes. They are equivalent options, are they not? Both environments run on

Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, which was included in 1.3.0. It should fix kmean|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset to produce this

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xi Shen
For the same amount of data, if I set the k=500, the job finished in about 3 hrs. I wonder if I set k=5000, the job could finish in 30 hrs...the longest time I waited was 12 hrs... If I use kmeans-random, same amount of data, k=5000, the job finished in less than 2 hrs. I think current kmeans||

Re: Spark-submit not working when application jar is in hdfs

2015-03-30 Thread nsalian
Client mode would not support HDFS jar extraction. I tried this: sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10 And it worked. -- View this message in context:

Re: k-means can only run on one executor with one thread?

2015-03-30 Thread Xiangrui Meng
Hey Xi, have you tried Spark 1.3.0? The initialization happens on the driver node and we fixed an issue with the initialization in 1.3.0. Again, please start with a smaller k and increase it gradually. Let us know at what k the problem happens. Best, Xiangrui On Sat, Mar 28, 2015 at 3:11 AM,

Java and Kryo Serialization, Java.io.OptionalDataException

2015-03-30 Thread zia_kayani
I have set the Kryo serializer as the default serializer in SparkConf, and the Spark UI confirms it, but in the Spark logs I'm getting this exception: java.io.OptionalDataException at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1370) at

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
Updated spark-defaults and spark-env: Log directory /home/hduser/spark/spark-events does not exist. (It also did not work with the default /tmp/spark-events.) On 30 March 2015 at 18:03, Marcelo Vanzin van...@cloudera.com wrote: Are those config values in spark-defaults.conf? I don't think

Re: Setting a custom loss function for GradientDescent

2015-03-30 Thread Xiangrui Meng
You can extend Gradient, e.g., https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L266, and use it in GradientDescent:
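
As a rough illustration of what extending Gradient can look like (my own sketch, not Xiangrui's; it implements a least-absolute-deviation loss for a linear model and assumes the abstract method to implement is the four-argument compute that accumulates into cumGradient, as in the linked source):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.optimization.Gradient

    // Loss = |w.x - y|, gradient = sign(w.x - y) * x (least absolute deviation).
    class AbsoluteErrorGradient extends Gradient {
      override def compute(data: Vector, label: Double, weights: Vector,
                           cumGradient: Vector): Double = {
        val d = data.toArray
        val w = weights.toArray
        var margin = 0.0
        var i = 0
        while (i < d.length) { margin += d(i) * w(i); i += 1 }
        margin -= label
        val sign = if (margin >= 0) 1.0 else -1.0
        // assumes cumGradient is a dense vector whose toArray exposes the backing
        // array (as MLlib's DenseVector does), so the per-example gradient is
        // accumulated in place, matching the built-in Gradient implementations
        val cum = cumGradient.toArray
        i = 0
        while (i < d.length) { cum(i) += sign * d(i); i += 1 }
        math.abs(margin)
      }
    }

The resulting instance can then be passed wherever the optimizer accepts a Gradient (for example GradientDescent.runMiniBatchSGD); the exact entry point depends on the Spark version, so check the class linked above.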

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Marcelo Vanzin
Are those config values in spark-defaults.conf? I don't think you can use ~ there - IIRC it does not do any kind of variable expansion. On Mon, Mar 30, 2015 at 3:50 PM, Tom thubregt...@gmail.com wrote: I have set spark.eventLog.enabled true as I try to preserve log files. When I run, I get

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
We test large feature dimension but not very large k (https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525). Again, please create a JIRA and post your test code and a link to your test dataset, we can work on it. It is hard to track the issue with multiple threads in

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of terabytes is not that challenging (depending on the algorithm) these days, whereas 5 years ago it was a big challenge. We have a bit over a petabyte (not using Spark) and using a distributed system is the only viable way

Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom
I have set spark.eventLog.enabled true as I try to preserve log files. When I run, I get Log directory /tmp/spark-events does not exist. I set spark.local.dir ~/spark spark.eventLog.dir ~/spark/spark-events and SPARK_LOCAL_DIRS=~/spark Now I get: Log directory ~/spark/spark-events does not

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread nsalian
Try running it like this: sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10 Caveats: 1) Make sure the permissions of /user/nick are 775 or 777. 2) No need for

Registering classes with KryoSerializer

2015-03-30 Thread Arun Lists
I am trying to register classes with KryoSerializer. I get the following error message; how do I find out what class is being referred to by OpenHashMap$mcI$sp? com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered:
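
A sketch of how such a registration can be written (my own illustration; MyRecord is a stand-in for the application's own classes, and the OpenHashMap class name is looked up reflectively because the $mcI$sp suffix is the Int-specialized variant generated by Scala, which is an assumption about what the error refers to):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Int, name: String)   // stand-in for application classes

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array[Class[_]](
        classOf[MyRecord],
        // internal Spark class reported in the exception (assumed package)
        Class.forName("org.apache.spark.util.collection.OpenHashMap$mcI$sp")
      ))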

Re: Why is a Spark job faster through Eclipse than Standalone Cluster

2015-03-30 Thread rival95
I re-ran my application through Eclipse on Ubuntu and got slower-than-expected results of 6.1 minutes. So the question now is: why would there be such a difference in run times between Windows 7 and Ubuntu 14.04? -- View this message in context:

Re: Understanding Spark Memory distribution

2015-03-30 Thread Ankur Srivastava
Hi Wisely, I am running Spark 1.2.1, and I have checked the process heap: it is running with all the heap that I am assigning, and as I mentioned earlier I get OOM on the workers, not the driver or master. Thanks Ankur On Mon, Mar 30, 2015 at 9:24 AM, giive chen thegi...@gmail.com wrote: Hi Ankur

Re: Spark 1.3.0 Build Failure

2015-03-30 Thread Marcelo Vanzin
This sounds like SPARK-6532. On Mon, Mar 30, 2015 at 1:34 PM, ARose ashley.r...@telarix.com wrote: So, I am trying to build Spark 1.3.0 (standalone mode) on Windows 7 using Maven, but I'm getting a build failure. java -version java version 1.8.0_31 Java(TM) SE Runtime Environment (build

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Marcelo Vanzin
Are you running Spark in cluster mode by any chance? (It always helps to show the command line you're actually running, and if there's an exception, the first few frames of the stack trace.) On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Updated spark-defaults and

Spark Streaming on YARN with loss of application master

2015-03-30 Thread Matt Narrell
I’m looking at various HA scenarios with Spark streaming. We’re currently running a Spark streaming job that is intended to be long-lived, 24/7. We see that if we kill node managers that are hosting Spark workers, new node managers assume execution of the jobs that were running on the

Re: WordCount example

2015-03-30 Thread Mohit Anchlia
I tried to file a bug in the git repo, however I don't see a link to open issues. On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I checked the ports using netstat and don't see any connections established on that port. Logs show only this: 15/03/27 13:50:48 INFO

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent, this may be a case where you're missing a semicolon after your CREATE TEMPORARY TABLE statement. I ran your original statement (missing the semicolon) and got the same error as you did. As soon as I added it in, I was good to go again: CREATE TEMPORARY TABLE jsonTable USING

Re: MLlib Spam example gets stuck in Stage X

2015-03-30 Thread Su She
Thank you for updating the files, Holden! I was actually using that same text in my files located on HDFS. Could the files being located on HDFS be the reason why the example gets stuck? I copy/pasted the code provided on GitHub; the only things I changed were: a) file paths to: val spam =

data frame API, change groupBy result column name

2015-03-30 Thread Neal Yin
I ran a line like the following: tb2.groupBy("city", "state").avg("price").show and got this result: city | state | AVG(price): Charlestown | New South Wales | 1200.0; Newton ... | MA | 1200.0; Coral Gables ... | FL | 1200.0; Castricum

Re: When will 1.3.1 release?

2015-03-30 Thread Michael Armbrust
I'm hoping to cut an RC this week. We are just waiting for a few other critical fixes. On Mon, Mar 30, 2015 at 12:54 PM, Kelly, Jonathan jonat...@amazon.com wrote: Are you referring to SPARK-6330 https://issues.apache.org/jira/browse/SPARK-6330? If you are able to build Spark from source

Re: data frame API, change groupBy result column name

2015-03-30 Thread Michael Armbrust
You'll need to use the longer form for aggregation: tb2.groupBy("city", "state").agg(avg("price").as("newName")).show Depending on the language, you'll need to import: Scala: import org.apache.spark.sql.functions._ Python: from pyspark.sql.functions import * On Mon, Mar 30, 2015 at 5:49 PM, Neal Yin

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
I run Spark in local mode. Command line (added some debug info): hduser@hadoop7:~/spark-terasort$ ./bin/run-example SparkPi 10 Jar: /home/hduser/spark-terasort/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop2.4.0.jar /home/hduser/spark-terasort/bin/spark-submit --master local[*]

Task size is large when CombineTextInputFormat is used

2015-03-30 Thread Taeyun Kim
Hi, I used CombineTextInputFormat to read many small files. The Java code is as follows (I've written it as a utility function): public static JavaRDD<String> combineTextFile(JavaSparkContext sc, String path, long maxSplitSize, boolean recursive) { Configuration conf =

Re: Spark 1.3 build with hive support fails

2015-03-30 Thread nightwolf
I am having the same problems. Did you find a fix? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-build-with-hive-support-fails-tp22215p22309.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
The stack trace for the first scenario and your suggested improvement is similar; the only difference is the first line (sorry for not including this): Log directory /home/hduser/spark/spark-events does not exist. To verify your premise, I cd'ed into the directory by copy-pasting the path

Re: Spark Streaming - Subroutine not being executed more than once

2015-03-30 Thread jhakku
Hey all, I am trying to figure out if I can use Spark Streaming for building loosely coupled distributed data pipelines. This is part of a pitch that I am trying to come up with. I'd really appreciate it if someone could comment on whether this is possible or not. Many Thanks -- View this message in context:
