Ganglia does give you cluster-wide and per-machine utilization of
resources, but I don't think it gives you per-Spark-job figures. If you want
to build something from scratch, you can follow these steps:
1. Login to the machine
2. Get the PIDs
3. For network IO per process, you can have a look at
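For step 2, a minimal way to find the executor PIDs on a worker node (assuming a JDK with jps on the PATH; the class name is the standard Spark executor backend):
jps -ml | grep CoarseGrainedExecutorBackend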
PipelinedRDD is an RDD generated by a Python mapper/reducer; for example,
rdd.map(func) will produce a PipelinedRDD.
PipelinedRDD is a subclass of RDD, so it should have all the APIs that
RDD has.
sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count()
10
I'm wondering how you can
It also appears in the streaming HDFS fileStream.
Can you dump out a small piece of data while doing rdd.collect and
rdd.foreach(println)?
Thanks
Best Regards
On Wed, Sep 17, 2014 at 12:26 PM, vasiliy zadonsk...@gmail.com wrote:
It also appears in the streaming HDFS fileStream.
Yes, that is how it is supposed to work. Apps run as the 'yarn' user and do not
generally expect to depend on local file state that is created externally.
This directory should be owned by yarn though, right? Your error does not
show 'permission denied'. It looks like you are unable to list a YARN dir as
your
full code example:
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("ErrorExample").setMaster("local[8]")
  .set("spark.serializer", classOf[KryoSerializer].getName)
val sc = new SparkContext(conf)
val rdd = sc.hadoopFile(
  "hdfs://./user.avro",
I'm supposing that there's no good solution to having heterogeneous hardware
in a cluster. What are the prospects of having something like this in the
future? Am I missing an architectural detail that precludes this
possibility?
Thanks,
Victor
On Fri, Sep 12, 2014 at 12:10 PM, Victor Tso-Guillen
I thought I answered this ... you can easily accomplish this with YARN
by just telling YARN how much memory / CPU each machine has. This can
be configured in groups too rather than per machine. I don't think you
actually want differently-sized executors, and so don't need ratios.
But you can have
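For reference, a hedged sketch of how per-node resources are declared in yarn-site.xml (the values below are placeholders for a larger node, not recommendations):
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>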
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
It sounds like you perhaps deployed a custom build of Spark that did
not include YARN support? You need -Pyarn in your build.
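For example, a YARN-enabled build looks roughly like this (the Hadoop version here is an assumption; match it to your cluster):
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package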
On Wed, Sep 17, 2014 at 4:47 AM, Barrington
Hmm, interesting. I'm using standalone mode but I could consider YARN. I'll
have to simmer on that one. Thanks as always, Sean!
On Wed, Sep 17, 2014 at 12:40 AM, Sean Owen so...@cloudera.com wrote:
I thought I answered this ... you can easily accomplish this with YARN
by just telling YARN how
Hi Nicolas,
I've had suspicions about speculation causing problems on my cluster but
don't have any hard evidence of it yet.
I'm also interested in why it's turned off by default.
On Tue, Sep 16, 2014 at 3:01 PM, Nicolas Mai nicolas@gmail.com wrote:
Hi, guys
My current project is using
Hi,
I'm using Spark Streaming 1.0. I create a DStream with KafkaUtils and
apply some operations on it. There's a reduceByWindow operation at the
end, so I suppose the checkpoint interval should be automatically set
to more than 10 seconds. But what I see is that it still checkpoints every 2
seconds (my batch
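For what it's worth, a minimal sketch of setting the checkpoint interval explicitly on the windowed stream (names and the 10-second value are placeholders):
import org.apache.spark.streaming.Seconds
// reduced is the DStream returned by reduceByWindow(...)
reduced.checkpoint(Seconds(10))  // checkpoint this DStream every 10 seconds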
Hi all:
The Scala API here: http://spark.apache.org/docs/latest/api/scala/#package
I found that it doesn't specify the exceptions of each function or class.
How can I know whether a function throws an exception or not without
reading the source code?
Hi,
I want to make the following change to an RDD (create a new RDD from the
existing one to reflect some transformation):
In an RDD of key-value pairs, I want to get the keys for which the values
are 1.
How do I do this using map()?
Thank You
You don't. That's what filter or the partial-function version of collect
is for:
val transformedRDD = yourRDD.collect { case (k, v) if v == 1 => k }
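An equivalent using filter (a sketch, assuming yourRDD is an RDD of (K, Int) pairs):
val keysWithOne = yourRDD.filter { case (_, v) => v == 1 }.keys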
On Wed, Sep 17, 2014 at 3:24 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I want to make the following changes in the RDD (create new
Hi
I need the same through Java.
Doesn't the Spark API support this?
On Wed, Sep 17, 2014 at 2:48 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Ganglia does give you cluster-wide and per-machine utilization of
resources, but I don't think it gives you per-Spark-job figures. If you want to
build
Hi,
I am executing PySpark on YARN.
I had successfully processed the initial dataset, but now I have grown it to 10
times the size.
During execution I keep getting this error:
14/09/17 19:28:50 ERROR cluster.YarnClientClusterScheduler: Lost executor
68 on UCS-NODE1.sms1.local: remote Akka client
Cloudera had a blog post about this in August 2013:
http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
Has anyone been using this in production? Curious as to whether it made a
significant difference from a Spark perspective.
Hi everyone,
Is it possible to fix the number of tasks related to a saveAsTextFile in
Pyspark?
I am loading several files from HDFS, fixing the number of partitions to X
(let's say 40, for instance). Then some transformations, like joins and
filters, are carried out. The weird thing here is that
How many partitions do you have in your input rdd? Are you specifying
numPartitions in subsequent calls to groupByKey/reduceByKey?
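For example (Scala syntax; the PySpark calls take the same optional argument, and 200 is just a placeholder):
val counts = pairs.reduceByKey((a, b) => a + b, 200)  // 200 output partitions
val grouped = pairs.groupByKey(200)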
On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I am execution pyspark on yarn.
I have successfully executed initial dataset
Hello
We are building an adjacency list to represent a graph. Vertices, edges and
weights for it have been extracted from HDFS files by a Spark job.
Further, we expect the size of the adjacency list (hash map) could grow over
20 GB.
How can we represent this in an RDD, so that it will be distributed in
Hi,
For the last few days I have been working on an exercise where I want to
understand the sentiment of a set of articles.
As input I have an XML file with articles and the AFINN-111.txt file defining
the sentiment of a few hundred words.
What I am able to do without any problem is loading of the data,
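For reference, a hedged sketch of scoring against the AFINN list (articleText is a hypothetical RDD[String] of article bodies; AFINN-111.txt is tab-separated word/valence pairs):
val afinn = sc.textFile("AFINN-111.txt")
  .map(_.split("\t"))
  .map(fields => (fields(0), fields(1).toInt))  // (word, valence)
val words = articleText.flatMap(_.toLowerCase.split("\\W+")).map(w => (w, 1))
val totalSentiment = words.join(afinn)
  .map { case (_, (count, valence)) => count * valence }
  .reduce(_ + _)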
Hi Davis,
Thank you for your answer. This is my code. I think it is very similar to the
word count example in Spark:
lines = sc.textFile(sys.argv[2])
sie = lines.map(lambda l: (l.strip().split(',')[4],1)).reduceByKey(lambda
a, b: a + b)
sort_sie = sie.sortByKey(False)
Thanks again.
Hi Davis,
When I run your code in pyspark, I still get the same error:
sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelinedRDD' object has no attribute 'sortByKey'
Is it the
Hello everyone.
The problem is that Spark writes data to disk very heavily, even though the
application has a lot of free memory (about 3.8 GB).
I've noticed that a folder with a name like
spark-local-20140917165839-f58c contains a lot of other folders with
files like shuffle_446_0_1. The total size of
Sure, I'll post to the mail list.
groupByKey(self, numPartitions=None)
http://spark.apache.org/docs/1.0.2/api/python/pyspark.rdd-pysrc.html#RDD.groupByKey
Group the values for each key in the RDD into a single sequence.
Hash-partitions the resulting RDD with numPartitions
Hi,
The files you mentioned are temporary files written by Spark during shuffling.
ALS will write a LOT of those files, as it is a shuffle-heavy algorithm.
Those files are only deleted after your program completes, because Spark looks for
them in case a fault occurs. Having those files ready
However, in my application there is no logic to access local files.
So I think that Spark is internally using the local file system to cache
RDDs.
From the log, it looks like the error occurred in Spark's internal logic
rather than in my business logic:
it is trying to delete local directories.
Hmm, no response to this thread!
Adding to it, please share experiences of building an enterprise grade product
based on Spark Streaming.
I am exploring Spark Streaming for enterprise software and am cautiously
optimistic about it. I see huge potential to improve debuggability of Spark.
-
Hi Group,
I am quite fresh in the Spark world. There is a particular use case that I just
cannot understand how to accomplish in Spark. I am using Cloudera
CDH5/YARN/Java 7.
I have a dataset that has the following characteristics -
A JavaPairRDD that represents the following -
Key = {int ID}
Maybe the Python worker uses too much memory during groupByKey();
groupByKey() with a larger numPartitions can help.
Also, can you upgrade your cluster to 1.1? It can spill the data
to disk if the memory cannot hold all of it during groupByKey().
Also, if there is a hot key with dozens of
On Wed, Sep 17, 2014 at 5:21 AM, Luis Guerra luispelay...@gmail.com wrote:
Hi everyone,
Is it possible to fix the number of tasks related to a saveAsTextFile in
Pyspark?
I am loading several files from HDFS, fixing the number of partitions to X
(let's say 40 for instance). Then some
You just need to call mapValues() to change your Iterable of things
into a sorted Iterable of things for each key-value pair. In that
function you write, it's no different from any other Java program. I
imagine you'll need to copy the input Iterable into an ArrayList
(unfortunately), sort it with
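A sketch of the same idea in Scala syntax (the Java version is the same mapValues call with an ArrayList and Collections.sort inside; the timestamp field is a placeholder):
val sortedPerKey = grouped.mapValues { vs =>
  vs.toList.sortBy(_.timestamp)  // copy the Iterable, then sort it
}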
Thanks Sean,
Makes total sense. I guess I was so caught up with RDDs and all the wonderful
transformations they can do that I did not think about plain old Java
Collections.sort(list, comparator).
Thanks,
__
Abraham
-Original Message-
From: Sean Owen
Hi Harsha,
You could look through the GraphX source to see the approach taken there,
for ideas for your own. I'd recommend starting at
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385
to see the storage technique.
Why do you want to avoid
Hi Burak,
Most discussion of checkpointing in the docs relates to Spark
Streaming. Are you talking about sparkContext.setCheckpointDir()?
What effect does that have?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
On Wed, Sep 17, 2014 at 7:44 AM,
Hi Andrew,
Yes, I'm referring to sparkContext.setCheckpointDir(). It has the same effect
as in Spark Streaming.
For example, in an algorithm like ALS, the RDDs go through many transformations
and the lineage of the RDD starts to grow drastically, just like
the lineage of DStreams does in Spark
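A minimal sketch (the HDFS path is a placeholder):
sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")
rdd.checkpoint()  // mark the RDD for checkpointing
rdd.count()       // the next action materializes the checkpoint and truncates the lineage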
Thanks for the info!
Are there performance impacts with writing to HDFS instead of local disk?
I'm assuming that's why ALS checkpoints every third iteration instead of
every iteration.
Also I can imagine that checkpointing should be done every N shuffles
instead of every N operations (counting
I'm pretty sure it does help, though I don't have any numbers for it. In any
case, Spark will automatically benefit from this if you link it to a version of
HDFS that contains this.
Matei
On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
Cloudera had a blog post
I have a library written in Cython and C. I'm wondering if it can be shipped to
workers which don't have Cython installed. Maybe create an egg package
from this library? How?
Sorry if this is in the docs someplace and I'm missing it.
I'm trying to implement label propagation in GraphX. The core step of that
algorithm is
- for each vertex, find the most frequent label among its neighbors and set
its label to that.
(I think) I see how to get the input from all the
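For reference, a hedged sketch of one propagation step using mapReduceTriplets (GraphX 1.x), assuming the vertex attribute is the Long label:
// count neighbor labels, then keep the most frequent one per vertex
val neighborLabels = graph.mapReduceTriplets[Map[Long, Int]](
  triplet => Iterator(
    (triplet.dstId, Map(triplet.srcAttr -> 1)),
    (triplet.srcId, Map(triplet.dstAttr -> 1))),
  (a, b) => (a.keySet ++ b.keySet)
    .map(label => label -> (a.getOrElse(label, 0) + b.getOrElse(label, 0))).toMap)
val relabeled = graph.joinVertices(neighborLabels) {
  (id, oldLabel, counts) => counts.maxBy(_._2)._1
}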
Hi, esteemed Spark users,
I have some doubts about the StreamingContext batchDuration parameter.
I've already read the excellent explanation by Tathagata Das about the difference
between batch and window duration.
But there is still some confusion; for example, without any window,
Spark Streaming acts as an implicit
At 2014-09-17 11:39:19 -0700, spr s...@yarcdata.com wrote:
I'm trying to implement label propagation in GraphX. The core step of that
algorithm is
- for each vertex, find the most frequent label among its neighbors and set
its label to that.
[...]
It seems on the broken line above, I
Yes, writing to HDFS is more expensive, but I feel it is still a small price to
pay when compared to having a 'Disk Space Full' error three hours in
and having to start from scratch.
The main goal of checkpointing is to truncate the lineage. Clearing up shuffle
writes comes as a bonus to
Not sure if you resolved this but I had a similar issue and resolved it. In
my case, the problem was the ids of my items were of type Long and could be
very large (even though there are only a small number of distinct ids...
maybe a few hundred of them). KMeans will create a dense vector for the
I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that both RowMatrix.computePrincipalComponents()
and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion?
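For reference, a hedged sketch of projecting the rows onto the principal components to get back an RDD[Vector] (k = 20 is a placeholder):
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val mat = new RowMatrix(data)                       // data: RDD[Vector]
val pc = mat.computePrincipalComponents(20)         // principal components as a local matrix
val projected: RDD[Vector] = mat.multiply(pc).rows  // rows projected into the 20-dim space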
Patrick,
If I understand this correctly, I won't be able to do this in the closure
provided to mapPartitions() because that's going to be stateless, in the
sense that a hash map that I create within the closure would only be useful
for one call of MapPartitionsRDD.compute(). I guess I would need
I don't have anything in production yet but I now at least have a
stable (running for more than 24 hours) streaming app. Earlier, the
app would crash for all sorts of reasons. Caveats/setup:
- Spark 1.0.0 (I have no input flow control unlike Spark 1.1)
- Yarn for RM
- Input and Output to Kafka
-
Thanks Tim for the detailed email, would help me a lot.
I have:
- a 3-node CDH5.1 cluster with 16 GB memory,
- the majority of the code in Scala, some parts in Java (was using Cascading earlier),
- inputs that are a bunch of textFileStream directories,
- every batch output going to Parquet files, and to HBase
If you want the result as an RDD of (key, 1):
val new_rdd = rdd.filter(x => x._2 == 1)
If you want the result as an RDD of keys (since you know the values are 1), then:
val new_rdd = rdd.filter(x => x._2 == 1).map(x => x._1)
x._1 and x._2 are the Scala way to access the key and value of a
key/value pair.
What are you trying to do? Generate time series from your data in HDFS, or
do some transformation and/or aggregation on your time-series data in HDFS?
Hi,
We are running aggregation on a huge data set (a few billion rows).
While running, the task got the following error (see below). Any ideas?
We are running Spark 1.1.0 on the CDH distribution.
...
14/09/17 13:33:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0).
2083 bytes result sent to driver
Nice write-up... very helpful!
-Original Message-
From: Tim Smith [mailto:secs...@gmail.com]
Sent: Wednesday, September 17, 2014 1:11 PM
Cc: spark users
Subject: Re: Stable spark streaming app
I don't have anything in production yet but I now at least have a stable
(running for more
If you'd like to re-use the resulting inverted map, you can persist the result:
x = myRdd.mapPartitions(create inverted map).persist()
Your function would create the reverse map and then return an iterator
over the keys in that map.
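For instance, a hedged sketch (assuming an RDD of (key, value) pairs and that each partition's inverted map fits in memory):
val inverted = myRdd.mapPartitions { iter =>
  // build the reverse map for this partition, then return an iterator over it
  val reverse = iter.map { case (k, v) => (v, k) }.toMap
  reverse.iterator
}.persist()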
On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya
Hi,
Could you try repartitioning the data with .repartition(# of cores on the machine), or,
while reading the data, supplying the minimum number of partitions, as in
sc.textFile(path, # of cores on the machine)?
It may be that the whole data set is stored in one block. If it is billions of
rows, then the indexing
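Concretely (16 is a placeholder; match it to the number of cores):
val data = sc.textFile(path, 16)  // ask for at least 16 input partitions
// or, after loading:
val repartitioned = data.repartition(16)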
Here's the official Spark documentation about batch size/interval:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-size
Spark is batch-oriented processing. As you mentioned, streaming
is a continuous flow, and core Spark cannot handle it.
Spark Streaming
I am trying to benchmark Spark in a Hadoop cluster.
I need to design a sample Spark job to test the CPU utilization, RAM usage,
input throughput, output throughput and duration of execution in the
cluster.
I need to test the state of the cluster for:
A Spark job which uses high CPU
A Spark
Thanks :)
On Wed, Sep 17, 2014 at 2:10 PM, Paul Wais pw...@yelp.com wrote:
Thanks Tim, this is super helpful!
Question about jars and spark-submit: why do you provide
myawesomeapp.jar as the program jar but then include other jars via
the --jars argument? Have you tried building one uber
Hi, I am new to Spark.
I am trying to write a simple Java program that processes tweets that were
collected and stored in a file. I figured the simplest thing to do would be
to convert the JSON string into a Java Map. When I submit my jar file I keep
getting the following error:
Hi,
I wonder if anybody has had a similar experience or has any suggestions here.
I have an Akka Actor that processes database requests as high-level messages.
Inside this Actor, it creates a HiveContext object that does the actual DB
work. The main thread creates the needed SparkContext and passes it in to the
To be more clear, that YARN cluster is managed by another team.
That means I do not have any permission to change the system.
If required, I can make a request to them,
but as of now, I only have permission to manage my Spark application.
So if there is any way to solve this problem by changing
It works for me:
export
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
export
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
export
Hi Reza,
In similarColumns, it seems that with cosine similarity I also need other
numbers such as intersection, Jaccard and other measures...
Right now I modified the code to generate Jaccard, but I had to run it twice
due to the design of RowMatrix / CoordinateMatrix... I feel we should modify
Hi, Du,
I am not sure what you mean by 'triggers the HiveContext to create a database';
do you create a subclass of HiveContext? Just be sure you call
HiveContext.sessionState eagerly, since it will set the proper hiveconf
into the SessionState; otherwise the HiveDriver will always get the
When I run `sbt test-only SparkTest` or `sbt test-only SparkTest1`, it
passes. But when I run `sbt test` to test both SparkTest and SparkTest1, it
fails.
If I merge all the cases into one file, the tests pass.
Hi All, we have a fairly large amount of sparse data. I was following these
instructions in the manual:
Sparse data: It is very common in practice to have sparse training data. MLlib
supports reading training examples stored in LIBSVM format, which is the
default format used by LIBSVM and
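For reference, the MLlib helper for loading that format (the path is a placeholder):
import org.apache.spark.mllib.util.MLUtils

val examples = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")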
- dev
Is it possible that you are constructing more than one HiveContext in a
single JVM? Due to global state in Hive code this is not allowed.
Michael
On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao hao.ch...@intel.com wrote:
Hi, Du
I am not sure what you mean “triggers the HiveContext to
Hi there,
My production environment is AWS EMR with Hadoop 2.4.0 and Spark 1.0.2. I moved the
Spark configuration in SPARK_CLASSPATH to spark-defaults.conf, and then the
HiveContext went wrong.
I also found this WARN info: "WARN DataNucleus.General: Plugin (Bundle)
org.datanucleus.store.rdbms is already
Hi,
The spacing between the inputs should be a single space, not a tab. I feel like
your inputs have tabs between them instead of a single space. Therefore the
parser
cannot parse the input.
Best,
Burak
- Original Message -
From: Sameer Tilak ssti...@live.com
To: user@spark.apache.org
Hi,
We built Spark 1.1.0 with Maven using:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive clean package
The Thrift server has been configured to use the Hive metastore.
When a SchemaRDD is registered as a table, where does the metadata of this table
get stored? Can it be
The registered table is stored within the Spark context itself. To have the
table available for the Thrift server to access, you can save the table
into the Hive context so that the Thrift server process can see it.
If you are using Derby as your metastore, then the
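A hedged sketch of the two options (table names are placeholders; assumes Spark 1.1 with a HiveContext):
// visible only within this SparkContext / HiveContext
schemaRDD.registerTempTable("my_temp_table")
// persisted through the Hive metastore, so the Thrift server can also see it
schemaRDD.saveAsTable("my_persisted_table")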
Thanks for your reply.
Michael: No, I only create one HiveContext in the code.
Hao: Yes, I subclass HiveContext and define my own function to create the database,
and then subclass an Akka Actor to call that function in response to an abstract
message. Per your suggestion, I called
Hi all,
I need the kmeans code written against PySpark for some testing purposes.
Can somebody tell me the difference between these two files?
spark-1.0.1/examples/src/main/python/kmeans.py and
spark-1.0.1/python/pyspark/mllib/clustering.py
Thanks Regards,
Meethu M
Is there a shell available for Spark SQL, similar to the way the Shark
or Hive shells work?
From my reading up on Spark SQL, it seems like one can execute SQL
queries in the Spark shell, but only from within code in a programming
language such as Scala. There does not seem to be any way to
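For reference, a minimal sketch of that in-code approach from the Spark shell (the table name is a placeholder and assumes a table has already been registered):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql("SELECT COUNT(*) FROM my_table").collect().foreach(println)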