Re: Is Branch 1.0 build broken ?

2014-04-10 Thread Sean Owen
The error is not about the build but an external repo. This almost always means you have some trouble accessing all the repos from your environment. Do you need proxy settings? Any other errors in the log about why you can't access it? On Apr 11, 2014 12:32 AM, "Chester Chen" wrote: > I just upda
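For context, a hypothetical build.sbt fragment along these lines would point sbt at the Eclipse Paho repository; the repository URL is an assumption about where the unresolved mqtt-client artifact (see the original report further down) is published, not a confirmed fix:

    // Sketch: declare the external repo the MQTT artifact lives in, so sbt
    // can resolve it even when it is not in the default repositories.
    resolvers += "mqtt-repo" at "https://repo.eclipse.org/content/repositories/paho-releases/"
    libraryDependencies += "org.eclipse.paho" % "mqtt-client" % "0.4.0"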

Re: programmatic way to tell Spark version

2014-04-10 Thread Shixiong Zhu
Hi Patrick, You should use classOf[SparkContext].getPackage.getImplementationVersion. classOf[SparkContext].getClass.getPackage.getImplementationVersion is used to get the version of java.lang.Class; that's the JVM version. Best Regards, Shixiong Zhu 2014-04-11 9:06 GMT+08:00 Nicholas Chammas
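A minimal spark-shell sketch of the distinction (variable names are illustrative; note getImplementationVersion can return null when classes are not loaded from a jar with a manifest):

    import org.apache.spark.SparkContext

    // classOf[SparkContext] is the SparkContext class itself, so its package
    // manifest carries the Spark version when run from the Spark assembly jar.
    val sparkVersion = classOf[SparkContext].getPackage.getImplementationVersion

    // Adding .getClass yields java.lang.Class, whose package is java.lang,
    // so the version reported is the JVM's, not Spark's.
    val jvmVersion = classOf[SparkContext].getClass.getPackage.getImplementationVersion

    println(s"Spark: $sparkVersion, JVM: $jvmVersion")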

Re: programmatic way to tell Spark version

2014-04-10 Thread Nicholas Chammas
Looks like it. I'm guessing this didn't make the cut for 0.9.1, and will instead be included with 1.0.0. So would you access it just by calling sc.version from the shell? And will this automatically make it into the Python API? I'll mark the JIRA issue as resolved. On Thu, Apr 10, 2014 at 5:05

Re: Spark 0.9.1 PySpark ImportError

2014-04-10 Thread Matei Zaharia
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package you added on the PYTHONPATH? How did you set the path, was it in conf/spark-env.sh? Matei On Apr 10, 2014, at 7:39 AM, aazout wrote: > I am getting a python ImportError on Spark standalone cluster. I have set the

Is Branch 1.0 build broken ?

2014-04-10 Thread Chester Chen
I just updated and got the following:
[error] (external-mqtt/*:update) sbt.ResolveException: unresolved dependency: org.eclipse.paho#mqtt-client;0.4.0: not found
[error] Total time: 7 s, completed Apr 10, 2014 4:27:09 PM
Chesters-MacBook-Pro:spark chester$ git branch
* branch-1.0
  master
Look

Re: SparkR with Sequence Files

2014-04-10 Thread Shivaram Venkataraman
SparkR doesn't support reading in SequenceFiles yet. It is an often-requested feature, though, and we are working on it. Note that it is tricky to support sequence files; though there were discussions[1], this isn't supported in PySpark either. Thanks Shivaram [1] http://mail-archives.apache.org

SparkR with Sequence Files

2014-04-10 Thread Gary Malouf
Has anyone been using SparkR to work with data from sequence files? We use protobuf throughout our system and are considering whether to try out SparkR.

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Aaron Davidson
This is likely because hdfs's core-site.xml (or something similar) provides an "fs.default.name" which changes the default FileSystem, and Spark uses the Hadoop FileSystem API to resolve paths. Anyway, your solution is definitely a good one -- another would be to remove hdfs from Spark's classpath i

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
Pierre - I'm not sure that would work. I just opened a Spark shell and did this:
scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
res4: String = 1.7.0_25
It looks like this is the JVM version. - Patrick On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans < pierre.borckm.

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread didata.us
Hi: I believe I figured out the behavior here. A file path passed to SparkContext like '/path/to/some/file':
* Will be interpreted as 'hdfs://path/to/some/file' when settings for HDFS are present in '/etc/hadoop/conf/*-site.xml'.
* Will be interpreted as 'file:///pa
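A short Scala sketch of the resolution rules described above (paths and host names are placeholders):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "scheme-demo")
    // A bare path is resolved against fs.default.name from core-site.xml,
    // so it may silently point at HDFS. Pinning the scheme avoids ambiguity:
    val localRdd = sc.textFile("file:///tmp/some-local-file.txt")      // always local FS
    val hdfsRdd  = sc.textFile("hdfs://namenode:8020/some/hdfs/file")  // always HDFS
    println(localRdd.count())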

Re: programmatic way to tell Spark version

2014-04-10 Thread Pierre Borckmans
I see that this was fixed using a fixed string in SparkContext.scala. Wouldn’t it be better to use something like: getClass.getPackage.getImplementationVersion to get the version from the jar manifest (and thus from the sbt definition)? The same holds for SparkILoopInit.scala in the welcome mess

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
I think this was solved in a recent merge: https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779 Is that what you are looking for? If so, mind marking the JIRA as resolved? On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas wrote: > Hey Patrick, > > I've creat

Re: hbase scan performance

2014-04-10 Thread Patrick Wendell
This job might still be faster... in MapReduce there will be other overheads in addition to the fact that doing sequential reads from HBase is slow. But it's possible the bottleneck is the HBase scan performance. - Patrick On Wed, Apr 9, 2014 at 10:10 AM, Jerry Lam wrote: > Hi Dave, > > This i

Error specifying Kafka params from Java

2014-04-10 Thread Paul Mogren
Hi all, I get the following exception when trying to build a Kafka input DStream with custom properties from Java. I am wondering if it's a problem with the Java to Scala binding - I am at a loss for what I could be doing wrong. 14/04/10 16:46:28 ERROR NetworkInputTracker: De-registered receive
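For reference, a hedged sketch of the equivalent call from Scala with the 0.9-era API, where the Java-to-Scala binding issue does not arise (hosts, topic, and group id are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext("local[2]", "kafka-demo", Seconds(2))
    // Custom consumer properties go into kafkaParams.
    val kafkaParams = Map(
      "zookeeper.connect" -> "zkhost:2181",
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "smallest")
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
    stream.print()
    ssc.start()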

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread DiData
Hi Alton: Thanks for the reply. I just wanted to build/use it from scratch to get a better intuition of what's happening. Btw, using the binaries provided by Cloudera/CDH5 yielded the same issue as my compiled version (i.e. it, too, tried to access the HDFS NameNode; same exact error).

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-10 Thread Patrick Wendell
Okay, so I think the issue here is just a conflict between your application code and the Hadoop code. Hadoop 2.0.0 depends on protobuf 2.4.0a: https://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.0-alpha/hadoop-project/pom.xml Your code depends on protobuf 2.5.X. The protobuf libra
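One common workaround for this kind of clash, sketched as a hypothetical build.sbt fragment (whether it is safe depends on which protobuf classes your Hadoop version actually needs at runtime):

    // Keep our own protobuf 2.5 and exclude the 2.4.0a that Hadoop pulls in
    // transitively via spark-core.
    libraryDependencies += ("org.apache.spark" %% "spark-core" % "0.9.1")
      .excludeAll(ExclusionRule(organization = "com.google.protobuf"))
    libraryDependencies += "com.google.protobuf" % "protobuf-java" % "2.5.0"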

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Alton Alexander
I am doing the exact same thing for the purpose of learning. I also don't have a hadoop cluster and plan to scale on ec2 as soon as I get it working locally. I am having good success just using the binaries and not compiling from source... Is there a reason why you aren't just using the binarie

Re: Spark - ready for prime time?

2014-04-10 Thread Matei Zaharia
To add onto the discussion about memory working space, 0.9 introduced the ability to spill data within a task to disk, and in 1.0 we’re also changing the interface to allow spilling data within the same *group* to disk (e.g. when you do groupBy and get a key with lots of values). The main reason

Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread DiData
Hello friends: I recently compiled and installed Spark v0.9 from the Apache distribution. Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the entire big-data suite from CDH is installed), but for the moment I'm using my manually built Apache Spark for 'ground-up' lea

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
> 4. Shuffle on disk > Is it true - I couldn't find it in official docs, but did see this mentioned > in various threads - that shuffle _always_ hits disk? (Disregarding OS > caches.) Why is this the case? Are you planning to add a function to do > shuffle in memory or are there some intrinsic reas

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Or
Here are answers to a subset of your questions: > 1. Memory management > The general direction of these questions is whether it's possible to take RDD caching related memory management more into our own hands as LRU eviction is nice most of the time but can be very suboptimal in some of our use ca

Re: /bin/java not found: JAVA_HOME ignored launching shark executor

2014-04-10 Thread Ken Ellinwood
Sorry, I forgot to mention this is spark-0.9.1 and shark-0.9.1. Ken On Thursday, April 10, 2014 9:02 AM, Ken Ellinwood wrote:
14/04/10 08:00:42 INFO AppClient$ClientActor: Executor added: app-20140410080041-0017/9 on worker-20140409145028-ken-VirtualBox-39159 (ken-VirtualBox:39159) with 4

Re: Spark - ready for prime time?

2014-04-10 Thread Roger Hoover
Can anyone comment on their experience running Spark Streaming in production? On Thu, Apr 10, 2014 at 10:33 AM, Dmitriy Lyubimov wrote: > > > > On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash wrote: > >> The biggest issue I've come across is that the cluster is somewhat >> unstable when under memor

/bin/java not found: JAVA_HOME ignored launching shark executor

2014-04-10 Thread Ken Ellinwood
14/04/10 08:00:42 INFO AppClient$ClientActor: Executor added: app-20140410080041-0017/9 on worker-20140409145028-ken-VirtualBox-39159 (ken-VirtualBox:39159) with 4 cores
14/04/10 08:00:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140410080041-0017/9 on hostPort ken-VirtualBo

Re: Spark - ready for prime time?

2014-04-10 Thread Dmitriy Lyubimov
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash wrote: > The biggest issue I've come across is that the cluster is somewhat > unstable when under memory pressure. Meaning that if you attempt to > persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll > often still get OOMs. I h

Re: Spark - ready for prime time?

2014-04-10 Thread Debasish Das
I agree with Andrew... Every time I underestimate the RAM requirement, my hand calculations are always way less than what the JVM actually allocates... But I guess I will understand the Scala JVM optimizations as I get more pain On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash wrote: > The bigge

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
I would echo much of what Andrew has said. I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk space dedicated to spark, data storage in separate HDFS shares). I've been using spark since 0.7, and as with Andrew I've observed significant and consistent improvements in stability (

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Ash
The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs. I had to carefully modify some of the space tuning parameters an
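As a point of reference, a minimal sketch of the persistence setting under discussion (input path is a placeholder):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[2]", "persist-demo")
    val big = sc.textFile("file:///tmp/big-input.txt")
    // Ask Spark to spill partitions that don't fit in memory to disk rather
    // than recompute them; per the report above, OOMs can still occur.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    println(big.count())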

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
There was a closure over the config object lurking around - but in any case upgrading to 1.2.0 for config did the trick, as it seems to have been a bug in Typesafe config. Thanks Matei! On Thu, Apr 10, 2014 at 8:46 AM, Nick Pentreath wrote: > Ok I thought it may be closing over the config option

Re: Spark operators on Objects

2014-04-10 Thread Flavio Pompermaier
Probably for the XML case the best resources I found are http://stevenskelton.ca/real-time-data-mining-spark/ and http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/ . And what about JSON? If I have to work with JSON and I want to use the fasterxml implementation?

Re: Spark on YARN performance

2014-04-10 Thread Flavio Pompermaier
Thank you for the reply Mayur, it would be nice to have a comparison about that. I hope one day it will be available, or that I'll have the time to test it myself :) So you're using Mesos for the moment, right? Which are the main differences in your experience? YARN seems to be more flexible and interopera

Behaviour of caching when dataset does not fit into memory

2014-04-10 Thread Pierre Borckmans
Hi there, Just playing around in the Spark shell, I am now a bit confused by the performance I observe when the dataset does not fit into memory:
- I load a dataset with roughly 500 million rows
- I do a count, it takes about 20 seconds
- now if I cache the RDD and do a count again (which will
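A sketch of the experiment being described, as run in the Spark shell (dataset path is a placeholder; with the default MEMORY_ONLY level, partitions that don't fit are dropped and recomputed on later passes, which can explain counter-intuitive timings):

    val data = sc.textFile("hdfs://namenode:8020/big/dataset") // ~500M rows
    println(data.count())   // first count: reads from the source

    data.cache()            // request MEMORY_ONLY caching
    println(data.count())   // materializes the cache while counting
    println(data.count())   // mixes cached partitions with recomputed ones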

Re: Spark on YARN performance

2014-04-10 Thread Mayur Rustagi
I've had better luck with standalone in terms of speed & latency. I think there is an impact, but not a really high one. The bigger impact is towards being able to manage resources & share the cluster. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: is it possible to initiate Spark jobs from Oozie?

2014-04-10 Thread Mayur Rustagi
I don't think it'll do failure detection etc. of the Spark job in Oozie as of yet. You should be able to trigger it from Oozie (worst case as a shell script). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Apr 10, 2014 at

Re: Pig on Spark

2014-04-10 Thread Mayur Rustagi
Bam !!! http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Apr 10, 2014 at 3:07 AM, Konstantin Kudryavtsev < kudryavtsev.konstan...@gmail.com

Re: Spark - ready for prime time?

2014-04-10 Thread Alex Boisvert
I'll provide answers from our own experience at Bizo. We've been using Spark for 1+ year now and have found it generally better than previous approaches (Hadoop + Hive mostly). On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth < andras.nem...@lynxanalytics.com> wrote: > I. Is it too much magic? Lo

Re: Spark - ready for prime time?

2014-04-10 Thread Sean Owen
Mike Olson's comment: http://vision.cloudera.com/mapreduce-spark/ Here's the partnership announcement: http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html > On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira > wrote: >> >> Do you have the link to the Clouder

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Here are several good ones: https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8 On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira wrote: > Do you have the link to the Cloudera comment? > > Sent from

Re: Spark - ready for prime time?

2014-04-10 Thread Ian Ferreira
Do you have the link to the Cloudera comment? Sent from Windows Mail From: Dean Wampler Sent: Thursday, April 10, 2014 7:39 AM To: Spark Users Cc: Daniel Darabos, Andras Barjak Spark has been endorsed by Cloudera as the successor to MapReduce. That says a lot... On

Spark 0.9.1 PySpark ImportError

2014-04-10 Thread aazout
I am getting a python ImportError on Spark standalone cluster. I have set the PYTHONPATH on both worker and slave and the package imports properly when I run PySpark command line on both machines. This only happens with Master - Slave communication. Here is the error below: 14/04/10 13:40:19 INFO

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Spark has been endorsed by Cloudera as the successor to MapReduce. That says a lot... On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth < andras.nem...@lynxanalytics.com> wrote: > Hello Spark Users, > > With the recent graduation of Spark to a top level project (grats, btw!), > maybe a well timed

RE: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Shao, Saisai
Hi Asaf, The user who runs SparkContext is decided by the code below from SparkContext; normally this user.name is the user who started the JVM. You can start your application with -Duser.name=xxx to specify the username you want, and this specified username will be the user that communicates with HDFS. val
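A tiny sketch of the mechanism described (reading the property is standard JVM behavior; how Spark and HDFS consume it is as described above):

    object WhoAmI {
      def main(args: Array[String]): Unit = {
        // e.g. launched with: java -Duser.name=hdfs -cp ... WhoAmI
        // Prints the identity that will be presented to HDFS.
        println("user.name = " + System.getProperty("user.name"))
      }
    }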

Re: Spark - ready for prime time?

2014-04-10 Thread Debasish Das
When you say "Spark is one of the forerunners for our technology choice", what are the other options you are looking into ? I start cross validation runs on a 40 core, 160 GB spark job using a script...I woke up in the morning, none of the jobs crashed ! and the project just came out of incubation

Fwd: Spark - ready for prime time?

2014-04-10 Thread Andras Nemeth
Hello Spark Users, With the recent graduation of Spark to a top level project (grats, btw!), maybe a well timed question. :) We are at the very beginning of a large scale big data project and after two months of exploration work we'd like to settle on the technologies to use, roll up our sleeves

Re: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Adnan
Then the problem is not on the Spark side. You have three options; choose any one of them:
1. Change permissions on the /tmp/Iris folder from the shell on the NameNode with the "hdfs dfs -chmod" command.
2. Run your hadoop service as the hdfs user.
3. Disable dfs.permissions in conf/hdfs-site.xml.
Regards, Adnan avito w

Re: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Adnan
You need to use a proper HDFS URI with saveAsTextFile. For example: rdd.saveAsTextFile("hdfs://NameNode:Port/tmp/Iris/output.tmp") Regards, Adnan Asaf Lahav wrote > Hi, > > We are using Spark with data files on HDFS. The files are stored as files > for predefined hadoop user ("hdfs"). > > The
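Sketched side by side from the shell (host and port are placeholders):

    val rdd = sc.parallelize(Seq("a", "b", "c"))
    rdd.saveAsTextFile("/tmp/Iris/output.tmp")                     // resolved against fs.default.name
    rdd.saveAsTextFile("hdfs://namenode:8020/tmp/Iris/output.tmp") // explicit HDFS URI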

Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Asaf Lahav
Hi, We are using Spark with data files on HDFS. The files are stored as files for a predefined hadoop user ("hdfs"). The folder is permitted with:
· read, write, executable and read permission for the hdfs user
· executable and read permission for users in the group
· just

Re: How does Spark handle RDD via HDFS ?

2014-04-10 Thread gtanguy
Yes, that helps to better understand how Spark works. But that was also what I was afraid of; I think the network communications will take too much time for my job. I will continue to look for a trick in order to avoid network communications. I saw on the hadoop website that: "To minimize global ba

Re: Pig on Spark

2014-04-10 Thread Konstantin Kudryavtsev
Hi Mayur, I wondered if you could share your findings in some way (github, blog post, etc). I guess your experience will be very interesting/useful for many people. sent from Lenovo YogaTablet On Apr 8, 2014 8:48 PM, "Mayur Rustagi" wrote: > Hi Ankit, > Thanx for all the work on Pig. > Finally g

Re: is it possible to initiate Spark jobs from Oozie?

2014-04-10 Thread Konstantin Kudryavtsev
I believe you need to write a custom action or use the java action. On Apr 10, 2014 12:11 AM, "Segerlind, Nathan L" < nathan.l.segerl...@intel.com> wrote: > Howdy. > > > > Is it possible to initiate Spark jobs from Oozie (presumably as a java > action)? If so, are there known limitations to this? An

Re: Shark CDH5 Final Release

2014-04-10 Thread chutium
hi, you can take a look here: http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Shark-CDH5-Final-Release-tp3826p4055.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: Where does println output go?

2014-04-10 Thread wxhsdp
rdd.foreach(p => { print(p) }) The above closure gets executed on the workers; you need to look at the logs of the workers to see the output. But if I'm in local mode, where are the logs of the local driver? There are no logs/ and work/ dirs in SPARK_HOME, which are present in standalone mode. -- View
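A sketch of what this looks like in practice (in local mode the "workers" run inside the driver JVM, so the output lands on the shell's own stdout rather than in a worker log):

    val rdd = sc.parallelize(1 to 10)
    rdd.foreach(p => print(p))     // local mode: appears in the shell console
    rdd.collect().foreach(println) // always prints at the driver, in any mode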