GC problem while filtering large data

2014-12-16 Thread Joe L
Hi, I am trying to filter a large table with 3 columns. Spark SQL might be a good choice, but I want to do it without SQL. The goal is to filter the big table with multiple clauses. I filtered the big table 3 times; the first filtering takes about 50 seconds, but the second and third filter transformations took about …
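
A minimal sketch of one way to approach this in the spark-shell (not the poster's actual job): parse and cache the table once so the repeated filters do not re-read the input, and fold the multiple clauses into a single predicate. The path, delimiter, and column values below are assumptions for illustration.

val table = sc.textFile("hdfs:///path/to/bigtable").map(_.split("\t")).cache() // path assumed, tab-separated 3-column rows
table.count() // materialize the cache once

// each query is one filter, with several clauses folded into one predicate
val r1 = table.filter(c => c(0) == "A").count()
val r2 = table.filter(c => c(0) == "A" && c(1) == "B").count()
val r3 = table.filter(c => c(0) == "A" && c(2) == "C").count()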

classnotfound error due to groupByKey

2014-07-04 Thread Joe L
Hi, when I run the following piece of code, it throws a ClassNotFoundException. Any suggestion would be appreciated. I wanted to group an RDD by key:
val t = rdd.groupByKey()
Error message: java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$…

Need equallyWeightedPartitioner Algorithm

2014-06-03 Thread Joe L
I need to partition my data into equally weighted partitions. Suppose I have 20 GB of data and I want 4 partitions, where each partition holds 5 GB of the data. Thanks
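
A rough sketch in the spark-shell, assuming "equally weighted" means roughly the same amount of data per partition: repartition(4) shuffles the data into 4 partitions of approximately equal size. An exact 5 GB split would need a custom Partitioner that knows the record sizes. The input path is an assumption.

val data = sc.textFile("hdfs:///path/to/20gb-input") // path assumed
val balanced = data.repartition(4)                   // full shuffle into 4 roughly equal partitions
balanced.mapPartitions(it => Iterator(it.size)).collect().foreach(println) // records per partition, to check the balance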

Map failed [duplicate 1] error

2014-05-27 Thread Joe L
Hi, I am getting the following error, but I don't understand what the problem is.
14/05/27 17:44:29 INFO TaskSetManager: Loss was due to java.io.IOException: Map failed [duplicate 15]
14/05/27 17:44:30 INFO TaskSetManager: Starting task 47.0:43 as TID 60281 on executor 0: cm07 (PROCESS_LOCAL)

Facebook data mining with Spark

2014-05-19 Thread Joe L
Is there any way to get Facebook data into Spark and filter its content?
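
Spark itself has no Facebook connector, so this sketch rests on a large assumption: the data has already been pulled down (for example via the Graph API or an export job) as newline-delimited JSON files, which Spark can then read and filter like any other text. The path and field names are placeholders.

val posts = sc.textFile("hdfs:///data/facebook/posts.json") // path assumed
val hits = posts.filter(l => l.contains("\"message\"") && l.toLowerCase.contains("spark"))
hits.take(10).foreach(println)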

ClassNotFoundException

2014-05-01 Thread Joe L
Hi, I am getting the following error. How could I fix this problem? Joe
14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1)
14/05/02 03:51:48 INFO TaskSetManager: Loss was due to java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4

help

2014-04-27 Thread Joe L
I am getting this error; please help me fix it.
14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory .): error=13, …

read file from hdfs

2014-04-25 Thread Joe L
I have just two questions about the following:
sc.textFile("hdfs://host:port/user/matei/whatever.txt")
Is host the master node? What port should we use?
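
For reference, the host and port in an hdfs:// URL refer to the HDFS NameNode (the value of fs.default.name / fs.defaultFS in core-site.xml), not to the Spark master. A small sketch, assuming the common NameNode port 9000 (many clusters use 8020 instead):

val lines = sc.textFile("hdfs://namenode-host:9000/user/matei/whatever.txt") // host and port assumed, taken from core-site.xml
println(lines.count())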

help

2014-04-25 Thread Joe L
I need someone's help, please. I am getting the following error.
[error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory …

Re: help

2014-04-25 Thread Joe L
Hi, thank you for your reply, but I could not find it. It says "no such file or directory". http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png

help me

2014-04-22 Thread Joe L
I got the following performance; is it normal for Spark to behave like this? Sometimes Spark switches into NODE_LOCAL mode from PROCESS_LOCAL and becomes 10x faster. I am very confused.
scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
scala> f.count()
Long = 137805557, took 130.809661618 s

Re: Spark is slow

2014-04-21 Thread Joe L
g1 = pairs1.groupByKey().count()
pairs1 = pairs1.groupByKey(g1).cache()
g2 = triples.groupByKey().count()
pairs2 = pairs2.groupByKey(g2)
pairs = pairs2.join(pairs1)
Hi, I want to implement a hash-partitioned join as shown above, but somehow it is taking very long to perform. As I …
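
A sketch of a hash-partitioned join in the spark-shell (in Scala, unlike the snippet quoted above): partition both sides with the same HashPartitioner and cache the reused side, so join() can match keys without re-shuffling it. The input paths, key extraction, and partition count are assumptions.

import org.apache.spark.HashPartitioner

val pairs1 = sc.textFile("hdfs:///path/to/pairs1").map(l => (l.split("\t")(0), l)) // paths and key field assumed
val pairs2 = sc.textFile("hdfs:///path/to/pairs2").map(l => (l.split("\t")(0), l))

val part = new HashPartitioner(100)       // partition count is a placeholder
val p1 = pairs1.partitionBy(part).cache()
val p2 = pairs2.partitionBy(part)
val joined = p2.join(p1)                  // both sides share the partitioner, so the join itself adds no extra shuffle
println(joined.count())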

evaluate spark

2014-04-20 Thread Joe L
I want to evaluate Spark performance by measuring the running time of transformation operations such as map and join. To do so, do I merely need to materialize them with a count action? Because, as far as I know, transformations are lazy operations and don't do any computation until we call an action on them, but when …
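
A minimal timing sketch in the spark-shell: since transformations are lazy, an action such as count() is what actually forces the work, so the timing has to bracket the action. Forcing the cached input first keeps the initial read out of the measurement. The input path is an assumption.

val input = sc.textFile("hdfs:///path/to/input").cache() // path assumed
input.count()                                            // force the read and the cache first

val t0 = System.nanoTime()
val n = input.map(l => (l.length, l)).count()            // count() forces the map to run
println("map + count over " + n + " records took " + (System.nanoTime() - t0) / 1e9 + " s")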

what is a partition? how does it work?

2014-04-16 Thread Joe L
I want to know the following: what is a partition? How does it work? How is it different from a Hadoop partition? For example:
sc.parallelize([1,2,3,4]).map(lambda x: (x,x)).partitionBy(2).glom().collect()
[[(2,2), (4,4)], [(1,1), (3,3)]]
From this we get 2 partitions, but what does that mean? How …
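
The same experiment in Scala, as a sketch: each inner array printed by glom() is one partition, i.e. one chunk of the RDD that a single task processes. partitionBy hashes every key into one of the 2 partitions (the Scala API takes a Partitioner rather than a bare number).

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(1, 2, 3, 4)).map(x => (x, x))
val parted = pairs.partitionBy(new HashPartitioner(2))
parted.glom().collect().foreach(p => println(p.mkString(", "))) // one printed line per partition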

groupByKey returns a single partition in a RDD?

2014-04-15 Thread Joe L
I want to apply the following transformations to 60 GB of data on 7 nodes with 10 GB of memory. I am wondering whether the groupByKey() function returns an RDD with a single partition for each key. If so, what will happen if the size of the partition doesn't fit into that particular node? rdd = …
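
A sketch for checking the premise in the spark-shell: groupByKey does not create one partition per key; it hashes the keys into numPartitions partitions (although all values of a given key do land in the same partition). The input path, key extraction, and partition count (200) are assumptions.

val rdd = sc.textFile("hdfs:///path/to/60gb-input").map(l => (l.split("\t")(0), l)) // path and key field assumed
val grouped = rdd.groupByKey(200)
println(grouped.partitions.length) // 200, regardless of how many distinct keys there are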

what is the difference between an element and a partition?

2014-04-15 Thread Joe L

groupByKey(None) returns partitions according to the keys?

2014-04-15 Thread Joe L
I was wondering if groupByKey returns 2 partitions in the example below.
x = sc.parallelize([('a', 1), ('b', 1), ('a', 1)])
sorted(x.groupByKey().collect())
[('a', [1, 1]), ('b', [1])]

Proper caching method

2014-04-14 Thread Joe L
Hi, I am trying to cache 2 GB of data and to implement the following procedure. In order to cache it I did the following. Is it necessary to cache rdd2 since rdd1 is already cached?
rdd1 = textFile(hdfs...).cache()
rdd2 = rdd1.filter(userDefinedFunc1).cache()
rdd3 = …
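
A sketch of the usual rule of thumb, with a placeholder path and predicates: cache only the RDDs that are read more than once. If the later steps reuse rdd2 rather than rdd1, caching rdd2 is what matters, and caching rdd1 as well only pays off if rdd1 itself is used again.

val rdd1 = sc.textFile("hdfs:///path/to/2gb-input") // path assumed
val rdd2 = rdd1.filter(_.nonEmpty).cache()          // reused twice below, so cache it
val rdd3 = rdd2.filter(_.contains("a"))
val rdd4 = rdd2.filter(_.contains("b"))
println(rdd3.count() + rdd4.count())                // rdd2 is computed once, then served from the cache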

shuffle vs performance

2014-04-14 Thread Joe L
I was wondering whether partitioning RDDs into fewer partitions could help Spark performance and reduce shuffling. Is that true?
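
A sketch of one relevant distinction in the spark-shell: coalesce(n) merges existing partitions through a narrow dependency (no shuffle by default), while repartition(n) always shuffles. The input path and target partition count are assumptions.

val rdd = sc.textFile("hdfs:///path/to/input") // path assumed
val fewer = rdd.coalesce(8)                    // no shuffle
val reshuffled = rdd.repartition(8)            // same as coalesce(8, shuffle = true)
println(fewer.partitions.length + " / " + reshuffled.partitions.length)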

how to use a single filter instead of multiple filters

2014-04-13 Thread Joe L
Hi, I have multiple filters as shown below. Should I use a single combined filter instead of them? Can these filters degrade the performance of Spark? http://apache-spark-user-list.1001560.n3.nabble.com/file/n4185/Capture.png
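
A sketch of both forms, with placeholder predicates and path: chained filters over the same RDD are pipelined within one stage (no extra shuffle or extra pass over the data), so folding them into a single predicate mainly saves per-element call overhead rather than changing the execution plan.

val rdd = sc.textFile("hdfs:///path/to/input") // path assumed

val chained  = rdd.filter(_.nonEmpty).filter(_.startsWith("a")).filter(_.length < 100)
val combined = rdd.filter(l => l.nonEmpty && l.startsWith("a") && l.length < 100)
println(chained.count() == combined.count())   // same result either way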

how to count maps without shuffling too much data?

2014-04-13 Thread Joe L