We don't have any documentation on running SparkR on YARN, and I think there
might be some issues that need to be fixed (the recent PySpark-on-YARN PRs
are an example).
SparkR has only been tested to work with Spark standalone mode so far.
Thanks
Shivaram
On Tue, Apr 29, 2014 at 7:56 PM,
Hi, any suggestions for the following issue?
I have a replication factor of 3 in my HDFS.
I ran my experiments with 3 datanodes. Then I added another node to the
cluster with no data on it.
When I ran again, Spark launched non-local tasks on the new node, and the job
took longer than it did on the 3-node cluster.
I just tried to use a serializer to write an object directly in local mode with this code:

import java.io._

val datasize = args(1).toInt
val dataset = (0 until datasize).map(i => ("asmallstring", i))
val out: OutputStream = {
  new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100)
}
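For reference, one possible continuation, as a sketch of my own; it assumes plain Java serialization is the serializer meant above:

// hypothetical continuation: write the whole dataset as one serialized object
val oos = new java.io.ObjectOutputStream(out)
oos.writeObject(dataset)
oos.close()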
Yes, that’s what I meant. Sure, the numbers might not actually be sorted,
but the order of rows is semantically kept throughout non-shuffling
transforms. I’m on board with you on union as well.
Back to the original question, then, why is it important to coalesce to a
single partition? When you
I fixed it.
I made my sbt project depend on
spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
and it works.
Whoops, you are right. Sorry for the misinformation. Indeed reduceByKey
just calls combineByKey:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
(I think I confused reduceByKey with groupByKey.)
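To make the equivalence concrete, here is a small sketch of my own (pairs stands for an assumed RDD[(String, Int)]); both results below are the same counts:

// reduceByKey(_ + _) is just combineByKey with an identity createCombiner
val counts1 = pairs.reduceByKey(_ + _)
val counts2 = pairs.combineByKey(
  (v: Int) => v,                  // createCombiner
  (c: Int, v: Int) => c + v,      // mergeValue
  (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners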
On Wed, Apr
Hi,
when I configured Spark and ran the shell instruction ./spark-shell, it told me:
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
Then, when it connects to the ResourceManager, it stops. What should I do?
Awaiting your reply
That's the approach I finally used.
Thanks for your help :-)
Hi
The reason you saw that warning is that the native Hadoop library
$HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was compiled for 32-bit.
Anyway, it's just a warning, and it won't impact Hadoop's functionality.
If you do want to eliminate this warning, download the
source code
Hi, all! For those in the Washington DC area (DC/MD/VA), we just started a
new Spark Meetup. We'd love for you to join! -d
Here's the link: http://www.meetup.com/Washington-DC-Area-Spark-Interactive/
Description:
This is an interactive meetup for Washington DC, Virginia and Maryland
users,
I'm guessing your shell stopping when it attempts to connect to the RM is
not related to that warning. You'll get that message out of the box from
Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0
installed in default locations, and got rid of the warning by setting
Okay, that makes sense. It’d be great if this could be better documented at
some point, because right now the only way to find out about the resulting RDD
row order is by looking at the code.
Thanks for the discussion!
Mingyu
On 4/29/14, 11:59 PM, Patrick Wendell pwend...@gmail.com wrote:
I don't think
Hi
Playing around with Spark and S3, I'm opening multiple objects (CSV files) with:
val hfile = sc.textFile("s3n://bucket/2014-04-28/")
so hfile is an RDD representing the 10 objects that were underneath 2014-04-28.
After I've sorted and otherwise transformed the content, I'm trying to write it
Ah, looks like RDD.coalesce(1) solves one part of the problem.
On Wednesday, April 30, 2014 11:15 AM, Peter thenephili...@yahoo.com wrote:
I agree with you in general that, as an API user, I shouldn’t have to rely on
the code. However, without looking at the code, there is no way for me to find
out even whether map() keeps the row order. Without that knowledge,
I’d need to do a “sort” every time I need certain things in a certain order.
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
coalesce(1), you move everything in the RDD to a single partition, which
then gives you 1 output file.
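For example (a sketch of my own; rdd stands for any RDD of strings):

// with N partitions this would write N part files (part-00000, part-00001, ...);
// after coalesce(1) there is exactly one part file
rdd.coalesce(1).saveAsTextFile("/tmp/single-output")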
It will still be called part-00000 (or something like that) because that’s
the naming defined by the Hadoop API that Spark uses for
S is the previous count, if any. Seq[V] are potentially many new
counts. All of them have to be added together to keep an accurate
total. It's as if the count were 3, and I tell you I've just observed
2, 5, and 1 additional occurrences -- the new count is 3 + (2+5+1) not
1 + 1.
I butted in
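For context, the kind of update function being described looks roughly like this (a sketch of my own, assuming the thread concerns updateStateByKey in Spark Streaming):

// values: the new counts observed for a key in this batch
// state: the running total so far (None the first time the key is seen)
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + values.sum)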
Thanks Nicholas, this is a bit of a shame; it's not very practical for log
roll-up, for example, where every output needs to be in its own directory.
On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Yeah, I remember changing fold to sum in a few places, probably in the
test suites, but I missed this example, I guess.
On Wed, Apr 30, 2014 at 1:29 PM, Sean Owen so...@cloudera.com wrote:
Hi,
One thing you can do is set the Spark version your project depends on
to 1.0.0-SNAPSHOT (make sure it matches the version of Spark you're
building); then, before building your project, run sbt publishLocal
in the Spark tree.
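For example, the dependency line in your build.sbt would look something like this (a sketch of my own; it assumes you depend on spark-core under Scala 2.10):

// resolved from the local ivy repository after sbt publishLocal
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"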
On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wxh...@gmail.com wrote:
Hi,
This is not related to Spark, but I thought you might be interested: the
second SF Scala conference is coming this August. The SF Scala conference was
called the Silicon Valley Scala Symposium last year; from now on, it will be
known as Scala By The Bay.
I meant to post this last week, but this is a talk I gave at the Philly ETE
conf. last week:
http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model
Also here:
http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf
dean
--
Dean Wampler, Ph.D.
Typesafe
Thanks for your reply. Sorry for the late response; I wanted to do some tests
before writing back.
The counting part works similarly to your advice: I specify a minimum interval,
like 1 minute, and for each hour, day, etc. it sums all the counters of the
current child intervals.
However when I want to
In our application, we need distributed RDDs containing key-value maps. We
have operations that update the RDDs by adding entries to the map, deleting
entries from the map, and updating the value part of map entries.
We also have map-reduce functions that operate on the RDDs. The questions are
the
Dear Sparkers,
Has anyone got any insight on this? I am really stuck.
Yadid
On 4/28/14, 11:28 AM, Yadid Ayzenberg wrote:
Thanks for your answer.
I tried running on a single machine - master and worker on one host. I
get exactly the same results.
Very little CPU activity on the machine in
Hello,
So I was unable to run the following commands from the Spark shell with CDH
5.0 and Spark 0.9.0; see below.
Once I removed the property
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
  <final>true</final>
</property>
from the core-site.xml on the
Hi,
I'm just reviewing advanced Spark features; it's about the PageRank
example.
It said any shuffle operation on two RDDs will take on the partitioner of
one of them, if one is set.
So first we partition the links by a HashPartitioner, then we join the links
and Ranks0. Ranks0 will take
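The pattern being described looks roughly like this (a sketch of my own, following the usual PageRank example; links is an assumed pair RDD of page -> neighbor lists):

import org.apache.spark.HashPartitioner

val links0 = links.partitionBy(new HashPartitioner(8)).cache()
val ranks0 = links0.mapValues(_ => 1.0)
// the join adopts links0's HashPartitioner, so links0 itself is not re-shuffled
val contribs = links0.join(ranks0)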
This is a consequence of the way the Hadoop files API works. However,
you can (fairly easily) add code to just rename the file because it
will always produce the same filename.
(roughly, in Scala)
import java.io.File

val dir = "/some/dir"
rdd.coalesce(1).saveAsTextFile(dir)
// the single output part file is always named part-00000
val f = new File(dir, "part-00000")
f.renameTo(new File(dir, "output.txt"))  // "output.txt" is a hypothetical final name
Hi there,
I was wondering if somebody could give me some suggestions on how to
handle this situation:
I have a Spark program that first reads a 6GB file locally (not as an RDD)
and then does the map/reduce tasks. This 6GB file contains
information that will be shared by all the map tasks.