Hi, I am trying to filter a large table with 3 columns. Spark SQL might be a
good choice, but I want to do it without SQL. The goal is to filter bigtable
with multiple clauses. I filtered bigtable 3 times; the first filtering takes
about 50 seconds, but the second and third filter transformations took about
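For reference, here is a minimal sketch of the kind of filtering I mean (the path, separator, and predicates are placeholders); as far as I understand, the first action also pays the cost of reading the data, so caching the base RDD should make the later filters cheaper:

// placeholder path; assuming a text file with 3 tab-separated columns
val bigtable = sc.textFile("hdfs://namenode:8020/path/to/bigtable")
  .map(_.split("\t"))
  .cache()                                        // keep the parsed rows in memory after the first action

// three separate filter clauses; all of them are lazy until an action runs
val f1 = bigtable.filter(row => row(0) == "someValue")
val f2 = bigtable.filter(row => row(1).toInt > 100)
val f3 = bigtable.filter(row => row(2).nonEmpty)

// the first count reads and caches the data, the later ones reuse the cache
println(s"${f1.count()} ${f2.count()} ${f3.count()}")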
Hi,
When I run the following piece of code, it throws a ClassNotFoundException.
Any suggestion would be appreciated.
Wanted to group an RDD by key:
val t = rdd.groupByKey()
Error message:
java.lang.ClassNotFoundException:
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$
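For what it's worth, groupByKey is only defined on RDDs of key/value pairs, so a minimal self-contained version of the call looks like the sketch below (the sample data is made up); the ClassNotFoundException itself usually points at a mismatch between the Spark version on the driver and on the cluster rather than at this code:

// groupByKey needs an RDD[(K, V)]; here a toy RDD[(String, Int)]
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val t = rdd.groupByKey()                  // RDD[(String, Iterable[Int])]
t.collect().foreach(println)              // e.g. (a,CompactBuffer(1, 3)) and (b,CompactBuffer(2))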
I need to partition my data into equally weighted partitions. Suppose I have
20 GB of data and I want 4 partitions, where each partition holds 5 GB of the data.
Thanks
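One way I know of to get roughly equal-sized partitions is repartition, which does a full shuffle to rebalance the data; a rough sketch (the path is a placeholder):

val data = sc.textFile("hdfs://namenode:8020/path/to/20gb-input")   // placeholder path
val balanced = data.repartition(4)        // full shuffle into 4 roughly equal partitions (~5 GB each)
println(balanced.partitions.length)       // 4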
Hi, I am getting the following error but I don't understand what the problem
is.
14/05/27 17:44:29 INFO TaskSetManager: Loss was due to java.io.IOException:
Map failed [duplicate 15]
14/05/27 17:44:30 INFO TaskSetManager: Starting task 47.0:43 as TID 60281 on
executor 0: cm07 (PROCESS_LOCAL)
Is there any way to get Facebook data into Spark and filter its content?
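Assuming the data has already been exported from Facebook (for example as JSON lines pulled from the Graph API), a hedged sketch of loading and filtering it in Spark (the path and the matched strings are made up):

val posts = sc.textFile("hdfs://namenode:8020/data/facebook_posts.json")          // placeholder path
val matching = posts.filter(line => line.contains("\"message\"") && line.contains("spark"))
matching.take(10).foreach(println)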
Hi, I am getting the following error. How could I fix this problem?
Joe
14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1)
14/05/02 03:51:48 INFO TaskSetManager: Loss was due to
java.lang.ClassNotFoundException:
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4
I am getting this error; please help me fix it.
14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor
app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run
program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in
directory .): error=13,
I have just two questions:
sc.textFile("hdfs://host:port/user/matei/whatever.txt")
Is host the master node?
What port should we use?
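As far as I know, host:port is the HDFS NameNode address configured as fs.defaultFS (or fs.default.name) in core-site.xml, commonly port 8020 or 9000, not the Spark master; a sketch with placeholder values:

// "namenode-host" and 9000 are placeholders; use the value of fs.defaultFS from core-site.xml
val lines = sc.textFile("hdfs://namenode-host:9000/user/matei/whatever.txt")
println(lines.count())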
I need someone's help, please. I am getting the following error.
[error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor
app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run
program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in
directory
Hi, thank you for your reply, but I could not find it. It says that there is no such
file or directory.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png
I got the following performance; is it normal for Spark to be like this? Sometimes
Spark switches from PROCESS_LOCAL into NODE_LOCAL mode and it becomes
10x faster. I am very confused.
scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
scala> a.count()
Long = 137805557
took 130.809661618 s
val g1 = pairs1.groupByKey().count()
val grouped1 = pairs1.groupByKey(g1.toInt).cache()
val g2 = triples.groupByKey().count()
val grouped2 = pairs2.groupByKey(g2.toInt)
val pairs = grouped2.join(grouped1)
Hi, I want to implement a hash-partitioned join as shown above. But
somehow it is taking very long to perform. As I
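For comparison, the way I understand a hash-partitioned join: partition both pair RDDs with the same HashPartitioner and cache them before joining, so the join can reuse the existing partitioning instead of shuffling again; a rough sketch (pairs1 and pairs2 are assumed to be RDD[(K, V)], and 100 partitions is an arbitrary choice):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(100)
val p1 = pairs1.partitionBy(partitioner).cache()   // co-partition and keep in memory
val p2 = pairs2.partitionBy(partitioner).cache()
val joined = p1.join(p2)                           // same partitioner on both sides, so no extra shuffle
println(joined.count())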
I want to evaluate Spark performance by measuring the running time of
transformation operations such as map and join. To do so, do I need to
materialize them with merely a count action? Because as far as I know, transformations
are lazy operations and don't do any computation until we call an action on them, but
when
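As far as I know, yes: some action has to run to force the computation, and it is the action plus the transformations feeding it that gets timed; a minimal sketch (data and function are made up):

val data = sc.parallelize(1 to 1000000)
val mapped = data.map(x => x * 2)                 // lazy, nothing executes yet

val start = System.nanoTime()
mapped.count()                                    // the action forces the map to run
val elapsedSec = (System.nanoTime() - start) / 1e9
println(s"map + count took $elapsedSec s")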
I want to know the following:
What is a partition? How does it work?
How is it different from a Hadoop partition?
For example:
sc.parallelize([1, 2, 3, 4]).map(lambda x: (x, x)).partitionBy(2).glom().collect()
[[(2, 2), (4, 4)], [(1, 1), (3, 3)]]
From this, we will get 2 partitions, but what does that mean? How
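The way I understand it, a partition is simply the chunk of the RDD's elements that one task processes, and glom() shows which elements landed in which partition; the same example in Scala (sample data as above):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(1, 2, 3, 4)).map(x => (x, x))
val byPartition = pairs.partitionBy(new HashPartitioner(2)).glom().collect()
byPartition.zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")   // each inner array is one partition's contents
}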
I want to apply the following transformations to 60 GB of data on 7 nodes with
10 GB of memory. I am wondering whether the groupByKey() function returns an RDD
with a single partition for each key. If so, what will happen if the size of
the partition doesn't fit on that particular node?
rdd =
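As far as I understand, groupByKey does not create one partition per key: keys are hashed across numPartitions partitions, but all values of a single key still end up in the same partition, so one very large key can exhaust the memory of one node. If the aggregation can be done incrementally, reduceByKey is much lighter; a rough sketch (assuming an RDD[(String, Int)] named rdd, with 200 partitions chosen arbitrarily):

// groupByKey: all values for a key are collected into one Iterable on one node
val grouped = rdd.groupByKey(200)

// reduceByKey: values are combined map-side first, so far less data is held per key
val counts = rdd.reduceByKey(_ + _, 200)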
I was wondering whether groupByKey returns 2 partitions in the example below.
x = sc.parallelize([('a', 1), ('b', 1), ('a', 1)])
sorted(x.groupByKey().collect())
[('a', [1, 1]), ('b', [1])]
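One can check the resulting partition count directly; a small Scala sketch of the same example:

val x = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val grouped = x.groupByKey()
println(grouped.partitions.length)   // as far as I know this follows the parent's partition count (or spark.default.parallelism), not the number of distinct keys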
Hi, I am trying to cache 2 GB of data and implement the following procedure.
In order to cache it I did the following. Is it necessary to cache rdd2 since
rdd1 is already cached?
rdd1 = sc.textFile("hdfs://...").cache()
rdd2 = rdd1.filter(userDefinedFunc1).cache()
rdd3 =
I was wondering whether partitioning RDDs into fewer partitions could help Spark performance and
reduce shuffling. Is that true?
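For what it's worth, fewer partitions mean fewer shuffle blocks but also less parallelism, so it is a trade-off; coalesce can lower the partition count without a full shuffle, while repartition rebalances with one (the path and numbers below are placeholders):

val rdd = sc.textFile("hdfs://namenode:8020/path/to/input")   // placeholder path
println(rdd.partitions.length)

val fewer = rdd.coalesce(10)          // merges existing partitions, avoids a shuffle
val rebalanced = rdd.repartition(10)  // full shuffle, but evenly sized partitions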
Hi, I have multiple filters as shown below. Should I use a single combined
filter instead of them? Can these filters degrade the performance of Spark?
http://apache-spark-user-list.1001560.n3.nabble.com/file/n4185/Capture.png
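For what it's worth, chained filters on the same RDD are pipelined and evaluated in a single pass over each partition, so combining them is mostly a readability choice; a small sketch of both forms (the path and predicates are made up):

val rdd = sc.textFile("hdfs://namenode:8020/path/to/input")   // placeholder path

// chained filters: still a single pass over each partition
val chained = rdd.filter(_.nonEmpty).filter(_.startsWith("a")).filter(_.length < 100)

// one combined predicate: same result, one filter call
val combined = rdd.filter(l => l.nonEmpty && l.startsWith("a") && l.length < 100)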