Re: How do we control output part files created by Spark job?

2015-07-11 Thread Umesh Kacha
Hi Srikanth, thanks much, it worked when I set spark.sql.shuffle.partitions=10. Will reducing shuffle partitions slow down my group-by query in hiveContext, or won't it? Please guide. On Sat, Jul 11, 2015 at 7:41 AM, Srikanth srikanth...@gmail.com wrote: Is there a join involved in
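A minimal PySpark sketch of the setting under discussion, assuming Spark 1.3-era APIs and a hypothetical Hive table my_table; fewer shuffle partitions means fewer part files from any shuffle-producing query such as a GROUP BY:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="ShufflePartitionsDemo")
    hiveContext = HiveContext(sc)

    # Each shuffle now produces 10 partitions, so a saved result
    # yields at most 10 part files.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    df = hiveContext.sql("SELECT key, COUNT(*) AS cnt FROM my_table GROUP BY key")
    df.saveAsParquetFile("/tmp/out")  # hypothetical output path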

spark streaming doubt

2015-07-11 Thread Shushant Arora
1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka partitions in the topic. Say I have 300 partitions in the topic and 10 executors, each with 3 cores; does that mean only 10*3=30 partitions are processed at a time, and then the next 30, and so on, since executors launch tasks per RDD
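A sketch of the arithmetic being asked about, assuming the direct Kafka stream (exposed in PySpark from Spark 1.4; broker and topic names are hypothetical). Each Kafka partition maps to one RDD partition, so a 300-partition topic yields 300 tasks per batch, of which 10 executors x 3 cores can run 30 concurrently while the scheduler queues the rest:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka artifact

    sc = SparkContext(appName="KafkaParallelismDemo")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"})

    # One RDD partition per Kafka partition: expect 300 here.
    def log_partitions(rdd):
        print(rdd.getNumPartitions())
    stream.foreachRDD(log_partitions)

    ssc.start()
    ssc.awaitTermination()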

Re: Spark Streaming and using Swift object store for checkpointing

2015-07-11 Thread algermissen1971
On 10 Jul 2015, at 23:10, algermissen1971 algermissen1...@icloud.com wrote: Hi, when moving my streaming application to the cluster for the first time today, I initially ran into the newbie error of using a local file system for checkpointing, and the RDD partition count differences (see
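For context, a short sketch of the fix for the local-filesystem mistake mentioned here: the checkpoint directory must live on storage reachable from the driver and every executor. The swift:// URL assumes the hadoop-openstack connector is configured; container and provider names are hypothetical:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointDemo")
    ssc = StreamingContext(sc, 10)

    # A driver-local path like /tmp/checkpoints fails on a cluster;
    # use HDFS, S3, or a Swift container visible to all nodes.
    ssc.checkpoint("swift://checkpoints.myprovider/streaming")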

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move? On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani rrav...@gmail.com wrote: Hi everyone, I have planned to move from MSSQL Server to Spark. I am using around 50,000 to 1 lakh (100,000) records. Spark performance is slow when compared to MSSQL Server. What is

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and

RE: Spark performance

2015-07-11 Thread Roman Sokolov
Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 at 03:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired

Re: How do we control output part files created by Spark job?

2015-07-11 Thread Srikanth
Reducing the number of partitions may have an impact on memory consumption, especially if there is an uneven distribution of the key used in groupBy. It depends on your dataset. On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Srikanth, thanks much, it worked when I set
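An alternative worth noting (a sketch, not from the thread): keep the shuffle wide so per-task memory stays low, then shrink only the output with coalesce, which merges partitions without a full extra shuffle:

    from pyspark import SparkContext

    sc = SparkContext(appName="CoalesceDemo")
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 200)

    # Aggregate at full parallelism (200 partitions here)...
    counts = pairs.reduceByKey(lambda x, y: x + y)

    # ...then merge down only for writing: ~10 part files instead of 200.
    counts.coalesce(10).saveAsTextFile("/tmp/out")  # hypothetical path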

Re: SparkDriverExecutionException when using actorStream

2015-07-11 Thread Juan Rodríguez Hortalá
Hi, I've finally fixed this. The problem was that I wasn't providing a type parameter for the DStream in ssc.actorStream: with an untyped inputDStream : ReceiverInputDStream[Nothing] we get SparkDriverExecutionException: Execution error, caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;

RE: Spark performance

2015-07-11 Thread Mohammed Guller
Hi Roman, Yes, Spark SQL will be a better solution than a standard RDBMS for querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful analytics solution. Mohammed From: David Mitchell [mailto:jdavidmitch...@gmail.com] Sent: Saturday, July 11, 2015 7:10 AM To:
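A minimal sketch of the pairing Mohammed describes, using Spark 1.3-era PySpark APIs; the HDFS path and column names are hypothetical. Parquet's columnar layout lets Spark SQL scan only the columns a query touches, which is what makes multi-TB analytic queries practical:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="ParquetAnalytics")
    sqlContext = SQLContext(sc)

    # Load Parquet data from HDFS and expose it to SQL.
    df = sqlContext.parquetFile("hdfs:///data/events")
    df.registerTempTable("events")

    sqlContext.sql(
        "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type"
    ).show()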

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread leonida.gianfagna
Thanks a lot oubrik, I got your point; my view is that sum() should already be a built-in function for iterators in Python. Anyway, I tried your approach: def mysum(iter): count = sum = 0 for item in iter: count += 1 sum += item return sum wordCountsGrouped =
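The snippet arrives flattened by the archive; here is a reconstruction as a runnable sketch, renaming the accumulator so it no longer shadows Python's built-in sum (which, as the poster suspects, already works on iterators). The sample data and variable names are assumptions:

    from pyspark import SparkContext

    def mysum(iterator):
        count = 0
        total = 0  # the original's 'sum' name shadowed the built-in
        for item in iterator:
            count += 1
            total += item
        return total

    sc = SparkContext(appName="IteratorSumDemo")
    wordPairs = sc.parallelize([("spark", 1), ("spark", 1), ("rdd", 1)])

    # groupByKey yields (key, iterable-of-counts); mapValues(mysum),
    # or simply mapValues(sum), totals each iterable.
    wordCountsGrouped = wordPairs.groupByKey().mapValues(mysum)
    print(wordCountsGrouped.collect())  # e.g. [('spark', 2), ('rdd', 1)]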

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes one block, which allows for multiple concurrent readers. One would typically use a fixed part size that aligns with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM,
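A hedged sketch of acting on this advice, assuming the s3a connector from hadoop-aws (fs.s3a.multipart.size is its part-size setting; the bucket name is hypothetical). Setting the part size to 64 MB keeps each uploaded part readable as one aligned block:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("S3MultipartDemo")
            # 64 MB parts, matching the HDFS default block size Aaron mentions.
            .set("spark.hadoop.fs.s3a.multipart.size", str(64 * 1024 * 1024)))
    sc = SparkContext(conf=conf)

    # Each 64 MB part of a multi-part upload can then be read concurrently.
    data = sc.textFile("s3a://my-bucket/big-input")
    print(data.getNumPartitions())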

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers k/ On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna leonida.gianfa...@gmail.com wrote: Thanks a lot oubrik, I got your point; my view is that sum() should already be a built-in function for iterators in Python. Anyway I tried
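A short sketch of the suggestion, which sidesteps the iterator question entirely: reduceByKey combines values map-side before the shuffle, so no per-key iterator is ever materialized the way it is with groupByKey. Sample data is assumed:

    from pyspark import SparkContext

    sc = SparkContext(appName="ReduceByKeyDemo")
    wordPairs = sc.parallelize([("spark", 1), ("spark", 1), ("rdd", 1)])

    # Partial sums are computed within each partition, then merged.
    wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
    print(wordCounts.collect())  # e.g. [('spark', 2), ('rdd', 1)]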

Worker dies with java.io.IOException: Stream closed

2015-07-11 Thread gaurav sharma
Hi All, I am facing this issue in my production environment. My worker dies by throwing this exception, but I see that space is available on all the partitions of my disk. I did NOT see any abrupt increase in disk IO which might have choked the executor writing to the stderr file. But still

Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly, you are addressing this wrongly: you do not seem to have a business case for changing, so why do you want to switch? On Sat, 11 Jul 2015 at 3:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines,

Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov ole...@gmail.com wrote: Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 at 03:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor