Spark SQL JDBC Connectivity

2014-05-29 Thread Venkat Subramanian
We are planning to use the latest Spark SQL on RDDs. If a third-party application wants to connect to Spark via JDBC, does Spark SQL have support? (We want to avoid going through the Shark/Hive JDBC layer, as we need good performance.) BTW, we also want to do the same for Spark Streaming - With Spark

Re: Use mvn run Spark program occur problem

2014-05-29 Thread jaranda
That was it, thanks!

Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Hi, I used 1g of memory for the driver Java process and got an OOM error on the driver side before reduceByKey. After analyzing the heap dump, the biggest object is org.apache.spark.MapStatus, which occupied over 900MB of memory. Here's my question: 1. Are there any optimization switches that I can tune

Re: A Standalone App in Scala: Standalone mode issues

2014-05-29 Thread jaranda
I finally got it working. Main points: - I had to add the hadoop-client dependency to avoid a strange EOFException. - I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f instead of hostname, since Akka seems not to work properly with host names / IPs; it requires fully qualified domain

How can I dispose an Accumulator?

2014-05-29 Thread innowireless TaeYun Kim
Hi, How can I dispose of an Accumulator? It has no method like 'unpersist()', which Broadcast provides. Thanks.

Re: Python, Spark and HBase

2014-05-29 Thread Nick Pentreath
Hi Tommer, I'm working on updating and improving the PR, and will work on getting an HBase example working with it. Will feed back as soon as I have had the chance to work on this a bit more. N On Thu, May 29, 2014 at 3:27 AM, twizansk twiza...@gmail.com wrote: The code which causes the

Selecting first ten values in a RDD/partition

2014-05-29 Thread nilmish
I have a DStream which consists of RDDs partitioned every 2 sec. I have sorted each RDD and want to retain only the top 10 values and discard the rest. How can I retain only the top 10 values? I am trying to get the top 10 hashtags. Instead of sorting the entire 5-minute counts (thereby incurring

Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Andrei
I'm using Spark 1.0 and sbt assembly plugin to create uberjar of my application. However, when I run assembly command, I get a number of errors like this: java.lang.RuntimeException: deduplicate: different file contents found in the following:

Re: problem about broadcast variable in iteration

2014-05-29 Thread randylu
Hi Andrew Ash, thanks for your reply. In fact, I have already used unpersist(), but it doesn't take effect. One reason I selected version 1.0.0 is precisely that it provides the unpersist() interface.

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread jaranda
Hi Andrei, I think the preferred way to deploy Spark jobs is by using the sbt package task instead of the sbt assembly plugin. In any case, as you comment, the mergeStrategy in combination with some dependency exclusions should fix your problems. Have a look at this gist
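
A minimal sketch of the kind of mergeStrategy block being referred to, assuming the sbt-assembly 0.11.x plugin that was current at the time (the keys were renamed in later plugin versions); the case patterns depend on which of your dependencies actually collide:

    // build.sbt fragment -- a sketch, not the gist mentioned above.
    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop jar signatures
        case "reference.conf"              => MergeStrategy.concat  // Typesafe configs must be concatenated
        case x                             => old(x)                // default for everything else
      }
    }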

ClassCastExceptions when using Spark shell

2014-05-29 Thread Sebastian Schelter
Hi, I have trouble running some custom code on Spark 0.9.1 in standalone mode on a cluster. I built a fat jar (excluding Spark) that I'm adding to the classpath with ADD_JARS=... When I start the Spark shell, I can instantiate classes, but when I run Spark code, I get strange

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here? If you want to take only the top 10 of each RDD, why not sort followed by take(10) on every RDD? Or do you want to take the top 10 of five minutes? Cheers, On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote: I have a DSTREAM
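
For the per-RDD variant, a sketch of that sort-and-take, assuming a DStream[(String, Int)] of (hashtag, count) pairs named counts (the name is made up for illustration):

    // Keep only the top 10 pairs by count from each RDD in the stream.
    implicit val byCount: Ordering[(String, Int)] = Ordering.by(_._2)

    val top10 = counts.transform { rdd =>
      // top() avoids a full sort: it keeps a bounded heap per partition.
      rdd.context.parallelize(rdd.top(10))
    }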

Re: Comprehensive Port Configuration reference?

2014-05-29 Thread Jacob Eisinger
Howdy Andrew, This is a standalone cluster. And, yes, if my understanding of Spark terminology is correct, you are correct about the port ownerships. Jacob Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 From: Andrew Ash and...@andrewash.com To:

Re: Spark SQL JDBC Connectivity

2014-05-29 Thread Michael Armbrust
On Wed, May 28, 2014 at 11:39 PM, Venkat Subramanian vsubr...@gmail.com wrote: We are planning to use the latest Spark SQL on RDDs. If a third-party application wants to connect to Spark via JDBC, does Spark SQL have support? (We want to avoid going through the Shark/Hive JDBC layer as we need good

Re: ClassCastExceptions when using Spark shell

2014-05-29 Thread Marcelo Vanzin
Hi Sebastian, That exception generally means you have the class loaded by two different class loaders, and some code is trying to mix instances created by the two differently loaded classes. Do you happen to have that class both in the Spark jars and in your app's uber-jar? That might explain the
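
A quick way to test that hypothesis from the Spark shell (the class name is hypothetical; substitute one from your uber-jar). If the two printed loaders differ, the class really was loaded twice:

    // com.example.MyRecord is a placeholder for a class from your jar.
    println(classOf[com.example.MyRecord].getClassLoader)
    println(Class.forName("com.example.MyRecord", false,
      Thread.currentThread.getContextClassLoader).getClassLoader)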

Spark hook to create external process

2014-05-29 Thread ansriniv
I have a requirement where, for every Spark executor threadpool thread, I need to launch an associated external process. My job will consist of some processing in the Spark executor thread and some processing by its associated external process, with the two communicating via some IPC mechanism. Is

Re: Spark hook to create external process

2014-05-29 Thread Matei Zaharia
Hi Anand, This is probably already handled by the RDD.pipe() operation. It will spawn a process and let you feed data to it through its stdin and read data through stdout. Matei On May 29, 2014, at 9:39 AM, ansriniv ansri...@gmail.com wrote: I have a requirement where for every Spark
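
A minimal sketch of what that looks like, with tr standing in for the external process (any executable on the workers' PATH works):

    // RDD.pipe spawns one process per partition, writes each element to its
    // stdin (one per line), and returns its stdout lines as an RDD[String].
    val input  = sc.parallelize(Seq("spark", "pipe", "demo"))
    val output = input.pipe(Seq("tr", "a-z", "A-Z"))
    output.collect() // Array(SPARK, PIPE, DEMO)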

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Andrei
Thanks, Jordi, your gist looks pretty much like what I have in my project currently (with a few exceptions that I'm going to borrow). I like the idea of using sbt package, since it doesn't require third-party plugins and, most importantly, doesn't create a mess of classes and resources. But in this

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Stephen Boesch
The MergeStrategy combined with sbt assembly did work for me. This is not painless: it takes some trial and error, and the assembly may take multiple minutes to build. You will likely want to filter out some additional classes from the generated jar file. Here is a Stack Overflow answer that explains this, with IMHO the

Re: Driver OOM while using reduceByKey

2014-05-29 Thread Matei Zaharia
That hash map is just a list of where each task ran; it's not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks). Matei On May 29, 2014, at 2:03 AM, haitao
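
Spelled out, the two knobs being suggested here (the RDD name pairs is assumed; on Spark 1.0 the driver heap can also be raised with spark-submit --driver-memory 2g):

    // `pairs` is an assumed RDD[(String, Int)]. Fewer reduce tasks means
    // fewer MapStatus entries for the driver to track.
    val counts = pairs.reduceByKey(_ + _, 100) // 100 reduce tasks instead of the default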

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions() method implemented for RDD[T] objects. It will give you direct access to an iterator containing the member objects of each partition for doing the kind of within-partition hashtag counts you're describing.
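
A sketch of that approach, assuming an RDD[(String, Int)] of (hashtag, count) pairs named counts:

    // Keep only each partition's top 10 by count, then take the global
    // top 10 from the much smaller remainder.
    val perPartition = counts.mapPartitions { iter =>
      iter.toSeq.sortBy(-_._2).take(10).iterator
    }
    val globalTop10 = perPartition.top(10)(Ordering.by[(String, Int), Int](_._2))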

Why Scala?

2014-05-29 Thread Nick Chammas
I recently discovered Hacker News and started reading through older posts about Scala (https://hn.algolia.com/?q=scala#!/story/forever/0/scala). It looks like the language is fairly controversial on there, and it got me thinking. Scala appears to be the preferred language to work with in Spark, and

Re: Why Scala?

2014-05-29 Thread Matei Zaharia
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data

Re: Shuffle file consolidation

2014-05-29 Thread Nathan Kronenfeld
Thanks, I missed that. One thing that's still unclear to me, even looking at that, is: does this parameter have to be set when starting up the cluster, on each of the workers, or can it be set by an individual client job? On Fri, May 23, 2014 at 10:13 AM, Han JU ju.han.fe...@gmail.com wrote:

Re: Shuffle file consolidation

2014-05-29 Thread Matei Zaharia
It can be set in an individual application. Consolidation had some issues on ext3 as mentioned there, though we might enable it by default in the future because other optimizations have now made it perform on par with the non-consolidation version. It also had some bugs in 0.9.0, so I'd suggest at
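
The per-application setting looks like this; spark.shuffle.consolidateFiles was the flag in the 0.9/1.0 line (the app name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // Set per application -- no cluster restart needed.
    val conf = new SparkConf()
      .setAppName("shuffle-consolidation-demo")
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)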

Re: Why Scala?

2014-05-29 Thread Dmitriy Lyubimov
There were a few known concerns about Scala, and some still remain, but having been doing Scala professionally for over two years now, I have learned to master and appreciate the advantages. The major concern IMO is Scala in a less-than-scrupulous corporate environment. First, Scala requires significantly more

Re: Spark SQL JDBC Connectivity and more

2014-05-29 Thread Venkat Subramanian
Thanks Michael. OK, will try SharkServer2. But I have some basic questions on a related area: 1) If I have a standalone Spark application that has already built an RDD, how can SharkServer2, or for that matter Shark, access 'that' RDD and run queries on it? All the examples I have seen for Shark,

access hdfs file name in map()

2014-05-29 Thread Xu (Simon) Chen
Hello, A quick question about using Spark to parse text-format CSV files stored on HDFS. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => ("XXX", p(0), p(2))) Here, I want to replace "XXX" with a string, which is the current CSV filename for the line.

Re: Why Scala?

2014-05-29 Thread Krishna Sankar
Nicholas, Good question. A couple of thoughts from my practical experience: - Coming from R, Scala feels more natural than other languages. The functional succinctness of Scala is better suited to Data Science than that of other languages. In short, Scala-Spark makes sense for Data Science,

getPreferredLocations

2014-05-29 Thread ansriniv
I am building my own custom RDD class. 1) Is there a guarantee that a partition will only be processed on a node which is in the getPreferredLocations set of nodes returned by the RDD? 2) I am implementing this custom RDD in Java and plan to extend JavaRDD. However, I don't see a
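
On (1), getPreferredLocations is a locality hint rather than a guarantee: the scheduler prefers those hosts but will run the task elsewhere once the locality wait expires. A minimal sketch of the Scala-side hook (the host name is hypothetical):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class MyRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] =
        Array(new Partition { override val index: Int = 0 })

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(1, 2, 3)

      // A locality *hint* for the scheduler; hypothetical host name.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq("worker1.example.com")
    }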

Re: Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Thanks, it worked. 2014-05-30 1:53 GMT+08:00 Matei Zaharia matei.zaha...@gmail.com: That hash map is just a list of where each task ran; it's not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do

Re: access hdfs file name in map()

2014-05-29 Thread Aaron Davidson
Currently there is not a way to do this using textFile(). However, you could pretty straightforwardly define your own subclass of HadoopRDD [1] in order to get access to this information (likely using mapPartitionsWithIndex to look up the InputSplit for a particular partition). Note that
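
A sketch of that HadoopRDD route: the path comes from the earlier question, the cast is an implementation detail (hadoopFile is backed by a HadoopRDD), and mapPartitionsWithInputSplit is a developer API available on HadoopRDD in later releases:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")

    val withFileNames = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit { (split, iter) =>
        // Each partition of a HadoopRDD corresponds to one InputSplit.
        val file = split.asInstanceOf[FileSplit].getPath.getName
        iter.map { case (_, line) =>
          val p = line.toString.split(",")
          (file, p(0), p(2)) // filename where the question's XXX placeholder was
        }
      }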