We are planning to use the latest Spark SQL on RDDs. If a third party
application wants to connect to Spark via JDBC, does Spark SQL have support?
(We want to avoid going through the Shark/Hive JDBC layer as we need good
performance.)
BTW, we also want to do the same for Spark Streaming - With Spark
That was it, thanks!
Hi,
I used 1g of memory for the driver Java process and got an OOM error on the
driver side before reduceByKey. After analyzing the heap dump, the biggest
object is org.apache.spark.MapStatus, which occupied over 900MB of memory.
Here's my question:
1. Are there any optimization switches that I can tune
I finally got it working. Main points:
- I had to add hadoop-client dependency to avoid a strange EOFException.
- I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f
instead of hostname, since Akka seems not to work properly with short host
names or IPs; it requires fully qualified domain names.
Hi,
How can I dispose of an Accumulator?
It has no method like 'unpersist()', which Broadcast provides.
Thanks.
Hi Tommer,
I'm working on updating and improving the PR, and will work on getting an
HBase example working with it. Will feed back as soon as I have had the
chance to work on this a bit more.
N
On Thu, May 29, 2014 at 3:27 AM, twizansk twiza...@gmail.com wrote:
The code which causes the
I have a DStream which consists of RDDs batched every 2 sec. I have sorted
each RDD and want to retain only the top 10 values and discard the rest.
How can I retain only the top 10 values?
I am trying to get the top 10 hashtags. Instead of sorting the entire set of
5-minute counts (thereby incurring
I'm using Spark 1.0 and sbt assembly plugin to create uberjar of my
application. However, when I run assembly command, I get a number of errors
like this:
java.lang.RuntimeException: deduplicate: different file contents found in
the following:
Hi Andrew Ash, thanks for your reply.
In fact, I have already used unpersist(), but it doesn't take effect.
One reason I selected version 1.0.0 is precisely that it provides the
unpersist() interface.
Hi Andrei,
I think the preferred way to deploy Spark jobs is by using the sbt package
task instead of using the sbt assembly plugin. In any case, as you comment,
the mergeStrategy in combination with some dependency exclusions should fix
your problems. Have a look at this gist
Hi,
I have trouble running some custom code on Spark 0.9.1 in standalone
mode on a cluster. I built a fat jar (excluding Spark) that I'm adding
to the classpath with ADD_JARS=... When I start the Spark shell, I can
instantiate classes, but when I run Spark code, I get strange
Can you clarify what you're trying to achieve here?
If you want to take only the top 10 of each RDD, why not sort followed by
take(10) on every RDD?
Or do you want the top 10 over five minutes?
Cheers,
On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote:
I have a DSTREAM
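
For illustration, a sketch of the per-batch approach suggested above
(assumes counts is a DStream[(String, Long)] of (hashtag, count) pairs):

    val top10 = counts.transform { rdd =>
      // top() brings only the 10 largest elements (by count) to the driver,
      // avoiding a full sort; re-parallelize so the result stays a DStream.
      rdd.context.parallelize(
        rdd.top(10)(Ordering.by[(String, Long), Long](_._2)))
    }
    top10.print()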
Howdy Andrew,
This is a standalone cluster. And, yes, if my understanding of Spark
terminology is correct, you are correct about the port ownerships.
Jacob
Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075
From: Andrew Ash and...@andrewash.com
To:
On Wed, May 28, 2014 at 11:39 PM, Venkat Subramanian vsubr...@gmail.comwrote:
We are planning to use the latest Spark SQL on RDDs. If a third party
application wants to connect to Spark via JDBC, does Spark SQL have
support?
(We want to avoid going through the Shark/Hive JDBC layer as we need good
Hi Sebastian,
That exception generally means you have the class loaded by two
different class loaders, and some code is trying to mix instances
created by the two different loaded classes.
Do you happen to have that class both in the spark jars and in your
app's uber-jar? That might explain the
I have a requirement where for every Spark executor threadpool thread, I need
to launch an associated external process.
My job will consist of some processing in the Spark executor thread and some
processing by its associated external process, with the two communicating via
some IPC mechanism.
Is
Hi Anand,
This is probably already handled by the RDD.pipe() operation. It will spawn a
process and let you feed data to it through its stdin and read data through
stdout.
Matei
On May 29, 2014, at 9:39 AM, ansriniv ansri...@gmail.com wrote:
I have a requirement where for every Spark
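
For illustration, a minimal sketch of the pipe() pattern Matei describes
(the script path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pipe-example"))
    // Each RDD element is written as one line to the script's stdin; each
    // line the script prints to stdout becomes one element of the result.
    val input = sc.parallelize(Seq("a", "b", "c"))
    val piped = input.pipe("/path/to/my_processor.sh")
    piped.collect().foreach(println)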
Thanks, Jordi, your gist looks pretty much like what I have in my project
currently (with a few exceptions that I'm going to borrow).
I like the idea of using sbt package, since it doesn't require third-party
plugins and, most importantly, doesn't create a mess of classes and
resources. But in this
The MergeStrategy combined with sbt assembly did work for me. This is not
painless: it takes some trial and error, and the assembly may take multiple
minutes. You will likely want to filter out some additional classes from the
generated jar file. Here is an SOF answer that explains that and has, IMHO,
the
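
For reference, a typical build.sbt fragment for this (assumes a recent
sbt-assembly; older plugin versions spell the key as mergeStrategy in
assembly):

    assemblyMergeStrategy in assembly := {
      // META-INF manifests and signature files are the usual deduplicate
      // offenders; discarding them is generally safe for an uberjar.
      case PathList("META-INF", _*) => MergeStrategy.discard
      case _                        => MergeStrategy.first
    }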
That hash map is just a list of where each task ran; it's not the actual data.
How many map and reduce tasks do you have? Maybe you need to give the driver a
bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use
only 100 tasks).
Matei
On May 29, 2014, at 2:03 AM, haitao
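
For illustration, a minimal sketch of that suggestion (input path and pair
construction hypothetical):

    val pairs = sc.textFile("hdfs://input/path")
      .flatMap(_.split(" "))
      .map(w => (w, 1L))
    // An explicit numPartitions caps the reduce side at 100 tasks, which
    // shrinks the map-output bookkeeping the driver has to hold.
    val counts = pairs.reduceByKey(_ + _, 100)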
Try looking at the .mapPartitions( ) method implemented for RDD[T] objects.
It will give you direct access to an iterator containing the member objects
of each partition for doing the kind of within-partition hashtag counts
you're describing.
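
A sketch of that pattern (assumes tags is an RDD[String] of hashtags):

    val partialCounts = tags.mapPartitions { iter =>
      // Count within this partition only, using a local mutable map.
      val counts = scala.collection.mutable.Map
        .empty[String, Long].withDefaultValue(0L)
      iter.foreach(tag => counts(tag) += 1)
      counts.iterator
    }
    // Merge the per-partition counts, then take the global top 10.
    val top10 = partialCounts.reduceByKey(_ + _)
      .top(10)(Ordering.by[(String, Long), Long](_._2))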
I recently discovered Hacker News and started reading through older posts
about Scala (https://hn.algolia.com/?q=scala#!/story/forever/0/scala). It
looks like the language is fairly controversial on there, and it got me
thinking.
Scala appears to be the preferred language to work with in Spark, and
Quite a few people ask this question and the answer is pretty simple. When we
started Spark, we had two goals — we wanted to work with the Hadoop ecosystem,
which is JVM-based, and we wanted a concise programming interface similar to
Microsoft’s DryadLINQ (the first language-integrated big data
Thanks, I missed that.
One thing that's still unclear to me, even looking at that, is - does this
parameter have to be set when starting up the cluster, on each of the
workers, or can it be set by an individual client job?
On Fri, May 23, 2014 at 10:13 AM, Han JU ju.han.fe...@gmail.com wrote:
It can be set in an individual application.
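
For illustration, a per-application setting in code (the property discussed
in this thread is spark.shuffle.consolidateFiles; the app name is a
placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)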
Consolidation had some issues on ext3 as mentioned there, though we might
enable it by default in the future because other optimizations now made it
perform on par with the non-consolidation version. It also had some bugs in
0.9.0 so I’d suggest at
There were a few known concerns about Scala, and some still remain, but
having been doing Scala professionally for over two years now, I have
learned to master and appreciate the advantages.
The major concern, IMO, is Scala in a less-than-scrupulous corporate
environment. First, Scala requires significantly more
Thanks Michael.
OK, will try SharkServer2.
But I have some basic questions on a related area:
1) If I have a standalone Spark application that has already built an RDD,
how can SharkServer2, or for that matter Shark, access 'that' RDD and run
queries on it? All the examples I have seen for Shark,
Hello,
A quick question about using Spark to parse text-format CSV files stored on
HDFS.
I have something very simple:
sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
("XXX", p(0), p(2)))
Here, I want to replace "XXX" with a string, which is the current CSV
filename for the line.
Nicholas,
Good question. A couple of thoughts from my practical experience:
- Coming from R, Scala feels more natural than other languages. The
functional succinctness of Scala is more suited to Data Science than
other languages. In short, Scala-Spark makes sense for Data Science,
I am building my own custom RDD class.
1) Is there a guarantee that a partition will only be processed on a node
which is in the getPreferredLocations set of nodes returned by the RDD?
2) I am implementing this custom RDD in Java and plan to extend JavaRDD.
However, I don't see a
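
On question 1, for what it's worth: getPreferredLocations is a scheduling
hint, not a guarantee; the scheduler can fall back to other nodes once the
locality wait expires. A minimal Scala sketch (class name and partition
semantics hypothetical):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class HostPinnedRDD(sc: SparkContext, hosts: Seq[String])
        extends RDD[Int](sc, Nil) {
      // One partition per host in the supplied list.
      override def getPartitions: Array[Partition] =
        hosts.indices.map(i => new Partition { override def index = i }).toArray
      // Preferred, not guaranteed, placement for each partition.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(hosts(split.index))
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)
    }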
Thanks, it worked.
2014-05-30 1:53 GMT+08:00 Matei Zaharia matei.zaha...@gmail.com:
That hash map is just a list of where each task ran, it’s not the actual
data. How many map and reduce tasks do you have? Maybe you need to give the
driver a bit more memory, or use fewer tasks (e.g. do
Currently there is not a way to do this using textFile(). However, you
could pretty straightforwardly define your own subclass of HadoopRDD [1] in
order to get access to this information (likely using
mapPartitionsWithIndex to look up the InputSplit for a particular
partition).
Note that
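
A hedged sketch of the filename lookup (newer Spark releases also expose
mapPartitionsWithInputSplit directly on HadoopRDD, which the snippet below
assumes; the path is a placeholder):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // sc.hadoopFile returns a HadoopRDD under the hood.
    val raw = sc.hadoopFile("hdfs://test/path/*", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])
    val withFileName = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit { (split, iter) =>
        // For text input, each partition maps to one FileSplit, which
        // carries the path of the file it came from.
        val file = split.asInstanceOf[FileSplit].getPath.toString
        iter.map { case (_, line) => (file, line.toString) }
      }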