Re: Constant Spark execution time with different # of slaves

2015-10-10 Thread Robineast
Do you have enough partitions of your RDDs to spread across all your processing cores? Are all executors actually processing tasks? - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action
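For context, a minimal sketch of the kind of check being suggested, with an illustrative input path and app name (none of the names come from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-check"))

    // Illustrative input; stands in for whatever RDD the job is built on.
    val data = sc.textFile("hdfs:///tmp/input")

    // If the partition count is lower than the total executor cores,
    // some cores sit idle no matter how many slaves are added.
    println(s"partitions = ${data.partitions.length}")
    println(s"default parallelism = ${sc.defaultParallelism}")

    // Spread the work; 2-3 tasks per core is a common rule of thumb.
    val spread = data.repartition(sc.defaultParallelism * 3)
    println(s"partitions after repartition = ${spread.partitions.length}")

    sc.stop()
  }
}
```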

DataFrame Explode for ArrayBuffer[Any]

2015-10-10 Thread Eugene Morozov
Hi, I have a DataFrame with several columns I'd like to explode. All of the columns I have to explode have an ArrayBuffer of some other type inside. I'd say that the following code is a legitimate explode function for any given ArrayBuffer - my assumption is that for any given
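As a rough illustration of the approach being discussed, here is a hedged sketch using the Spark 1.5-era DataFrame.explode with a concretely typed ArrayBuffer column; the data and column names are made up, and the poster's harder case of ArrayBuffer[Any] (no concrete element type) is exactly what this simple form does not cover:

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object ExplodeArrayBufferSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explode-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up data: the "values" column holds an ArrayBuffer per row.
    val df = Seq((1, ArrayBuffer("a", "b", "c")), (2, ArrayBuffer("d"))).toDF("id", "values")

    // Spark 1.5-era DataFrame.explode: one output row per element of the buffer.
    val exploded = df.explode("values", "value") { vs: Seq[String] => vs }
    exploded.show()

    sc.stop()
  }
}
```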

Re: Create hashmap using two RDD's

2015-10-10 Thread Kali
Hi Richard, the requirement is to get the latest records using a key; I think a hash map is a good choice for this task. As of now the data comes from a third party and we are not sure which record is the latest, so a hash map was chosen. If there is anything better than a hash map please let me know. Thanks Sri Sent

updateStateByKey and stack overflow

2015-10-10 Thread Tian Zhang
Hi, I am following the Spark Streaming stateful application example and wrote a simple counting application with updateStateByKey. val keyStateStream = actRegBatchCountStream.updateStateByKey(update, new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initKeyStateRDD) This runs
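For reference, a minimal sketch of a stateful counting stream with checkpointing enabled, which is the usual way to keep the updateStateByKey lineage (and hence the stack) from growing without bound. The socket source, batch interval and checkpoint path are illustrative, and the simpler two-argument updateStateByKey overload is used rather than the four-argument form from the post:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCountSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("stateful-count"), Seconds(10))

    // Checkpointing is required for stateful DStreams; it also truncates the
    // RDD lineage that otherwise grows every batch and can overflow the stack.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

    // Illustrative input: one key per line, counted as (key, 1).
    val pairs = ssc.socketTextStream("localhost", 9999).map(k => (k, 1L))

    // Fold new counts for a key into its previous state.
    val update: (Seq[Long], Option[Long]) => Option[Long] =
      (values, state) => Some(values.sum + state.getOrElse(0L))

    val counts = pairs.updateStateByKey(
      update,
      new HashPartitioner(ssc.sparkContext.defaultParallelism))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```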

Re: Spark GraphaX

2015-10-10 Thread Robineast
Well, it depends on exactly what algorithms are involved in Network Root Cause analysis (not something I'm familiar with). GraphX provides a number of out-of-the-box algorithms like PageRank, connected components, strongly connected components, and label propagation, as well as an implementation of the

Re: Checkpointing in Iterative Graph Computation

2015-10-10 Thread Robineast
One other thought - you need to call SparkContext.setCheckpointDir otherwise nothing will happen - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action
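A minimal sketch of that call, with an illustrative checkpoint directory:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDirSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-dir"))

    // Set the checkpoint directory first (path is illustrative); per the
    // reply above, checkpointing does nothing useful without it.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()  // mark the RDD for checkpointing
    rdd.count()       // the checkpoint is written when the RDD is materialized

    sc.stop()
  }
}
```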

Re: Question about GraphX connected-components

2015-10-10 Thread Igor Berman
Let's start from some basics: maybe you need to split your data into more partitions? Spilling depends on your configuration when you create the graph (look for the storage level param) and on your global configuration. In addition, your assumption of 64GB/100M is probably wrong, since Spark divides memory
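The storage level parameters mentioned here are arguments to the Graph constructor; a hedged sketch with made-up vertex and edge data, assuming the Spark 1.x GraphX API:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object GraphStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graph-storage"))

    // Illustrative vertex and edge inputs.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

    // Using a disk-backed storage level lets partitions spill to disk
    // instead of failing when they don't fit in memory.
    val graph = Graph(
      vertices,
      edges,
      defaultVertexAttr = "missing",
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    println(graph.connectedComponents().vertices.count())
    sc.stop()
  }
}
```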

Re: Create hashmap using two RDD's

2015-10-10 Thread kali.tumm...@gmail.com
Got it ... I created a hashmap and saved it to a file; please follow the steps below. val QuoteRDD=quotefile.map(x => x.split("\\|")). filter(line => line(0).contains("1017")). map(x => ((x(5)+x(4)) , (x(5),x(4),x(1) , if (x(15) =="B") ( {if (x(25) == "") x(9) else

How StorageLevel, CacheManager and checkpointing influence computing RDD partitions?

2015-10-10 Thread Jacek Laskowski
Hi, I've been reviewing the Spark code and noticed that the `iterator` method of RDD [1] checks whether the RDD has a non-NONE storage level and calls the private `computeOrReadCheckpoint` method [2], which checks RDD checkpointing. Is there a doc on how StorageLevel, CacheManager and checkpointing influence

Re: Checkpointing in Iterative Graph Computation

2015-10-10 Thread Robineast
You need to checkpoint before you materialize. You'll find you probably only want to checkpoint every 100 or so iterations, otherwise the checkpointing will slow down your application excessively - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co.
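A hedged sketch of that pattern - checkpointing only every N iterations of an iterative job, with the checkpoint marked before the RDD is materialized. The loop body and paths are illustrative stand-ins, not the poster's algorithm:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-checkpoint"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Illustrative iterative computation; the loop body stands in for one
    // superstep of a Pregel-style graph algorithm.
    var current: RDD[(Long, Double)] =
      sc.parallelize((1L to 1000L).map(i => (i, 1.0)))

    val iterations = 1000
    val checkpointInterval = 100  // per the reply: don't checkpoint every step

    for (i <- 1 to iterations) {
      current = current.mapValues(_ * 0.85 + 0.15)
      current.persist(StorageLevel.MEMORY_AND_DISK)

      if (i % checkpointInterval == 0) {
        // Mark for checkpointing *before* materializing, then materialize
        // so the lineage is actually truncated here.
        current.checkpoint()
        current.count()
      }
    }

    println(current.values.sum())
    sc.stop()
  }
}
```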

Re: SQLcontext changing String field to Long

2015-10-10 Thread Yana Kadiyska
Can you show the output of df.printSchema? Just a guess, but I think I ran into something similar with a column that was part of a path in Parquet. E.g. we had an account_id in the Parquet file data itself which was of type string, but we also named the files in the following manner

Re: spark-submit hive connection through spark Initial job has not accepted any resources

2015-10-10 Thread Yana Kadiyska
"Job has not accepted resources" is a well-known error message -- you can search the Internet. 2 common causes come to mind: 1) you already have an application connected to the master -- by default a driver will grab all resources so unless that application disconnects, nothing else is allowed to

Re: Create hashmap using two RDD's

2015-10-10 Thread Richard Eggert
Do you need the HashMap for anything else besides writing out to a file? If not, there is really no need to create one at all. You could just keep everything as RDDs. On Oct 10, 2015 11:31 AM, "kali.tumm...@gmail.com" wrote: > Got it ..., created hashmap and saved it to

Re: How to compile Spark with customized Hadoop?

2015-10-10 Thread Raghavendra Pandey
There is a "Spark without Hadoop" version. You can use that to link with any custom Hadoop version. Raghav On Oct 10, 2015 5:34 PM, "Steve Loughran" wrote: > > During development, I'd recommend giving Hadoop a version ending with > -SNAPSHOT, and building spark with maven,

Re: Create hashmap using two RDD's

2015-10-10 Thread Richard Eggert
You should be able to achieve what you're looking for by using foldByKey to find the latest record for each key. If you're relying on the order of elements within the file to determine which ones are the "latest" (rather than sorting by some field within the file itself), call zipWithIndex first to
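A hedged sketch of the zipWithIndex + foldByKey combination described here, assuming a pipe-delimited file whose first field is the key; the paths and field positions are illustrative, not taken from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LatestRecordSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("latest-record"))

    // Illustrative pipe-delimited file where field 0 is the key.
    val lines = sc.textFile("hdfs:///tmp/quotes.txt")

    // zipWithIndex tags each line with its position in the file, so "latest"
    // can mean "appears last" when there is no timestamp field to sort on.
    val keyed = lines.zipWithIndex().map { case (line, idx) =>
      val fields = line.split("\\|")
      (fields(0), (idx, line))
    }

    // foldByKey keeps, per key, the record with the highest index.
    val latest = keyed
      .foldByKey((-1L, ""))((a, b) => if (a._1 >= b._1) a else b)
      .mapValues(_._2)

    latest.saveAsTextFile("hdfs:///tmp/latest-records")
    sc.stop()
  }
}
```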

Re: How to calculate percentile of a column of DataFrame?

2015-10-10 Thread Umesh Kacha
Hi, any idea how I call percentile_approx using callUDF()? Please guide. On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha wrote: > I have a doubt Michael I tried to use callUDF in the following code it > does not work. > >
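For reference, a hedged sketch of one way this is commonly done in Spark 1.5: since percentile_approx is a Hive UDAF, it is invoked by name through callUDF on a HiveContext-backed DataFrame. The data and column names are made up, and this is not necessarily the exact form suggested earlier in the thread:

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object PercentileApproxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("percentile-approx"))
    // percentile_approx is a Hive UDAF, so a HiveContext is assumed here.
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Illustrative data: (group, value) pairs.
    val df = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("a", 9.0), ("b", 4.0)))
      .toDF("grp", "value")

    // callUDF looks the function up by name in the Hive function registry;
    // lit(0.95) is the requested percentile.
    val result = df.groupBy($"grp")
      .agg(callUDF("percentile_approx", col("value"), lit(0.95)).as("p95"))

    result.show()
    sc.stop()
  }
}
```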

Re: Create hashmap using two RDD's

2015-10-10 Thread Sri
Thanks Richard, will give it a try tomorrow... Thanks Sri Sent from my iPhone > On 10 Oct 2015, at 19:15, Richard Eggert wrote: > > You should be able to achieve what you're looking for by using foldByKey to > find the latest record for each key. If you're relying on

Join Order Optimization

2015-10-10 Thread Raajay
Hello, Does Spark SQL support join order optimization as of the 1.5.1 release? From the release notes, I did not see support for this feature, but figured I would ask the users list to be sure. Thanks Raajay

Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Ted Yu
bq. all sort of optimizations like Tungsten For Tungsten, please use the 1.5.1 release. On Sat, Oct 10, 2015 at 6:24 PM, Alex Rovner wrote: > How many executors are you running with? How many nodes in your cluster? > > > On Thursday, October 8, 2015, unk1102

Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Alex Rovner
How many executors are you running with? How many nodes in your cluster? On Thursday, October 8, 2015, unk1102 wrote: > Hi as recommended I am caching my Spark job dataframe as > dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) but what I see in > Spark > job UI is
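As background for the question, a minimal sketch of the persist call from the subject line (StorageLevel.MEMORY_AND_DISK_SER is the Scala equivalent of the StorageLevels constant used there). persist only marks the DataFrame, so the long-running step seen in the UI is the first action that actually computes and serializes the partitions; the input path is illustrative:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-sketch"))
    val sqlContext = new SQLContext(sc)

    // Illustrative input; stands in for the job's DataFrame.
    val df = sqlContext.read.parquet("hdfs:///tmp/some-table")

    // persist() only records the storage level; nothing is cached yet.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action computes, serializes and stores the partitions,
    // which is where most of the time goes.
    println(df.count())

    sc.stop()
  }
}
```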

Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Umesh Kacha
Hi Alex, thanks for the response. I am using 40 executors with 30 GB each, including 5 GB memoryOverhead, and 4 cores. My cluster has around 100 nodes with 30 GB and 8 cores each. On Oct 11, 2015 06:54, "Alex Rovner" wrote: > How many executors are you running with? How many nodes

Re: SQLcontext changing String field to Long

2015-10-10 Thread shobhit gupta
Here is what df.schema.toString() prints. DF Schema is ::StructType(StructField(batch_id,StringType,true)) I think you nailed the problem; this field is part of our HDFS file path. We have partitioned our data on the basis of batch_id folders. How did you get around it? Thanks
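One workaround that I believe exists in Spark 1.5 (an assumption on my part, not confirmed by this thread) is to disable partition column type inference, so values parsed from directory names such as batch_id=1017 stay strings instead of being promoted to a numeric type; a minimal sketch with an illustrative path:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PartitionColumnTypeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-type"))
    val sqlContext = new SQLContext(sc)

    // With inference disabled, partition directory values keep StringType.
    sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    val df = sqlContext.read.parquet("hdfs:///data/table")  // illustrative path
    df.printSchema()

    sc.stop()
  }
}
```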

Re: Datastore or DB for spark

2015-10-10 Thread Deenar Toraskar
The choice of datastore is driven by your use case. In fact, Spark can work with multiple datastores too. Each datastore is optimised for certain kinds of data, e.g. HDFS is great for analytics and large data sets at rest. It is scalable and very performant, but immutable. NoSQL databases

Re: Create hashmap using two RDD's

2015-10-10 Thread kali.tumm...@gmail.com
Hi All, I changed my approach; now I am able to load data into a Map and get data out using the get command. val QuoteRDD=quotefile.map(x => x.split("\\|")). filter(line => line(0).contains("1017")). map(x => ((x(5)+x(4)) , (x(5),x(4),x(1) , if (x(15) =="B") if

Re: How to compile Spark with customized Hadoop?

2015-10-10 Thread Steve Loughran
During development, I'd recommend giving Hadoop a version ending with -SNAPSHOT, and building Spark with Maven, as mvn knows to refresh the snapshot every day. You can do this in Hadoop with mvn versions:set 2.7.0.stevel-SNAPSHOT. If you are working on hadoop branch-2 or trunk directly, they

Re: Streaming Performance w/ UpdateStateByKey

2015-10-10 Thread Adrian Tanase
How are you determining how much time serialization is taking? I made this change in a streaming app that relies heavily on updateStateByKey. The memory consumption went up 3x on the executors, but I can't see any perf improvement. Task execution time is the same and the serialization state