Re: updateStateByKey performance & API

2015-03-23 Thread Andre Schumacher
something like an IndexRDD). But in your case you mention serialization overhead to be the bottleneck, so maybe you could try filtering out unchanged keys before persisting the data? Just an idea.. Andre On 22/03/15 10:43, "Andre Schumacher" wrote:
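The suggestion to filter out unchanged keys before persisting can be illustrated with a minimal sketch. It uses plain Python dictionaries as a stand-in for keyed streaming state and is not Spark's API; the function name `apply_batch` and the integer-count state are assumptions made only for illustration.

```python
from typing import Dict, Tuple


def apply_batch(state: Dict[str, int], batch: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Merge per-key increments from `batch` into `state`.

    Returns (new_state, changed), where `changed` holds only the keys that
    are new or whose value differs from the previous state. Hypothetical
    helper, not part of any Spark API.
    """
    new_state = dict(state)
    changed: Dict[str, int] = {}
    for key, delta in batch.items():
        updated = new_state.get(key, 0) + delta
        # A key counts as changed if it is new or its value moved.
        if key not in new_state or updated != new_state[key]:
            changed[key] = updated
        new_state[key] = updated
    return new_state, changed


if __name__ == "__main__":
    state = {"a": 1, "b": 2}
    batch = {"a": 3, "b": 0, "c": 1}
    new_state, changed = apply_batch(state, batch)
    # Only `changed` would need to be serialized and persisted for this batch.
    print(new_state, changed)
```

Under these assumptions, only the `changed` subset is written out per batch, so keys that receive no effective update add no serialization cost.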

Re: Using Spark Streaming with Kafka 0.7.2

2014-07-29 Thread Andre Schumacher
Hi, For testing you could also just use the Kafka 0.7.2 console consumer and pipe its output to netcat (nc) and process that as in the example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala That worked for me. Back
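The linked NetworkWordCount example reads newline-delimited text from a TCP socket and counts words. As a rough way to check that the piped consumer output is actually arriving on the socket before involving Spark, a minimal Python sketch is given below. It is not the Scala example itself; the host `localhost`, the port `9999`, and the helper names `read_lines` and `count_words` are assumptions chosen for illustration.

```python
import socket
from collections import Counter
from typing import Iterable, Iterator


def count_words(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated words across all given lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def read_lines(host: str, port: int) -> Iterator[str]:
    """Yield text lines from a TCP server until it closes the connection."""
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("r", encoding="utf-8") as stream:
            for line in stream:
                yield line.rstrip("\n")


if __name__ == "__main__":
    # Assumes a netcat process is listening on localhost:9999 and being fed
    # by the console consumer; both values are placeholders.
    print(count_words(read_lines("localhost", 9999)))
```

The script blocks until the listening side closes the connection and then prints the totals, so it suits a quick sanity check rather than continuous processing.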

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-22 Thread Andre Schumacher
Hi, I don't think anybody has been testing importing of Impala tables directly. Is there any chance to export these first, say as unpartitioned Hive tables and import these? Just an idea.. Andre On 07/21/2014 11:46 PM, chutium wrote: > no, something like this > > 14/07/20 00:19:29 ERROR cluste

Re: spark with docker: errors with akka, NAT?

2014-06-16 Thread Andre Schumacher
Hi, are you using the amplab/spark-1.0.0 images from the global registry? Andre On 06/17/2014 01:36 AM, Mohit Jaggi wrote: > Hi Folks, > > I am having trouble getting spark driver running in docker. If I run a > pyspark example on my mac it works but the same example on a docker image > (Via b

Re: initial basic question from new user

2014-06-12 Thread Andre Schumacher
Hi, On 06/12/2014 05:47 PM, Toby Douglass wrote: > In these future jobs, when I come to load the aggregated RDD, will Spark > load and only load the columns being accessed by the query? or will Spark > load everything, to convert it into an internal representation, and then > execute the query?