Split RDD along columns

2015-01-29 Thread Schein, Sagi
Hi, I have the following use case. Assume my data is in e.g. HDFS, in a single sequence file containing rows of CSV entries that I can split to build an RDD of arrays of (smaller) strings. What I want to do is to build two RDDs where the first RDD contains a subset of columns and t
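A minimal PySpark-style sketch of the column split described above. The message is truncated, so this assumes the second RDD should hold the remaining columns. The names `project` and `split_columns` and the column positions are hypothetical and not from the original post; the only RDD method relied on is `map`.

```python
def project(row, indices):
    """Return the fields of `row` at the given column positions, as a list."""
    return [row[i] for i in indices]


def split_columns(rows, left_indices, n_columns):
    """Split an RDD of rows into two RDDs by column position.

    `rows` is assumed to be an RDD whose elements are lists of strings,
    for example lines.map(lambda line: line.split(",")).
    The first result keeps `left_indices`; the second keeps every other
    column (an assumption, since the original message is cut off).
    """
    left = list(left_indices)
    left_set = set(left)
    right = [i for i in range(n_columns) if i not in left_set]
    left_rdd = rows.map(lambda r: project(r, left))
    right_rdd = rows.map(lambda r: project(r, right))
    return left_rdd, right_rdd
```

Both results are derived from the same parent RDD, so calling `rows.cache()` before the split may avoid reading and parsing the file twice. Note that the two RDDs are only aligned row by row as long as neither is filtered or reshuffled afterwards.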

python worker crash in spark 1.0

2014-06-18 Thread Schein, Sagi
Hi, I am trying to upgrade from Spark v0.9.1 to v1.0.0 and running into some weird behavior. When, in pyspark, I invoke sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1").take(1) the call crashes with the stack trace below. The file resides in Hadoop 2.2, it is a large event data,

moving SparkContext around

2014-04-13 Thread Schein, Sagi
A few questions about the resilience of the client side of Spark. What would happen if the client process crashes? Can it reconstruct its state? Suppose I just want to serialize it and reload it back, is this possible? A more advanced use case: is there a way to move SparkContext between JVMs/mac

RE: Error when I use spark-streaming

2014-04-11 Thread Schein, Sagi
I would check the DNS settings. Akka seems to pick up its configuration from the FQDN on my system. Sagi From: Hahn Jiang [mailto:hahn.jiang@gmail.com] Sent: Friday, April 11, 2014 10:56 AM To: user Subject: Error when I use spark-streaming Hi all, When I run spark-streaming using NetworkWordCount in
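One way to act on the DNS suggestion above is to see what the machine reports as its own name and whether that name resolves. This is only a generic standard-library sketch; the helper name `describe_host` is hypothetical, and the reply does not confirm exactly which value Akka reads.

```python
import socket


def describe_host():
    """Report the local hostname, its FQDN, and the address the FQDN resolves to."""
    hostname = socket.gethostname()
    fqdn = socket.getfqdn()
    try:
        address = socket.gethostbyname(fqdn)
    except OSError:
        # The FQDN does not resolve, which points at a DNS or hosts-file issue.
        address = None
    return {"hostname": hostname, "fqdn": fqdn, "address": address}
```

Running `print(describe_host())` on the driver and on each worker shows whether the nodes agree on their names; an `address` of `None` or a loopback address is worth investigating first.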