Re: How do we control output part files created by Spark job?

2015-07-11 Thread Umesh Kacha
Hi Srikanth, thanks much, it worked when I set spark.sql.shuffle.partitions=10. Will reducing shuffle partitions slow down my group-by query in hiveContext, or won't it? Please guide. On Sat, Jul 11, 2015 at 7:41 AM, Srikanth srikanth...@gmail.com wrote: Is there a join involved in
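A minimal PySpark sketch of the setting under discussion, assuming Spark 1.3-era APIs and a hypothetical Hive table my_table; fewer shuffle partitions means fewer part files from any shuffle-producing query such as a GROUP BY:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="ShufflePartitionsDemo")
    hiveContext = HiveContext(sc)

    # Each shuffle now produces 10 partitions, so a saved result
    # yields at most 10 part files.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    df = hiveContext.sql("SELECT key, COUNT(*) AS cnt FROM my_table GROUP BY key")
    df.saveAsParquetFile("/tmp/out")  # hypothetical output path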

spark streaming doubt

2015-07-11 Thread Shushant Arora
1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka partitions in the topic. Say I have 300 partitions in the topic and 10 executors, each with 3 cores; does that mean only 10*3=30 partitions are processed at a time, and then the next 30, and so on, since executors launch tasks per RDD
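A sketch of the arithmetic being asked about, assuming the direct Kafka stream (exposed in PySpark from Spark 1.4; broker and topic names are hypothetical). Each Kafka partition maps to one RDD partition, so a 300-partition topic yields 300 tasks per batch, of which 10 executors x 3 cores can run 30 concurrently while the scheduler queues the rest:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka artifact

    sc = SparkContext(appName="KafkaParallelismDemo")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"})

    # One RDD partition per Kafka partition: expect 300 here.
    def log_partitions(rdd):
        print(rdd.getNumPartitions())
    stream.foreachRDD(log_partitions)

    ssc.start()
    ssc.awaitTermination()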

Re: Spark Streaming and using Swift object store for checkpointing

2015-07-11 Thread algermissen1971
On 10 Jul 2015, at 23:10, algermissen1971 algermissen1...@icloud.com wrote: Hi, when moving my streaming application to the cluster for the first time today, I initially ran into the newbie error of using a local file system for checkpointing, and the RDD partition count differences (see
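For context, a short sketch of the fix for the local-filesystem mistake mentioned here: the checkpoint directory must live on storage reachable from the driver and every executor. The swift:// URL assumes the hadoop-openstack connector is configured; container and provider names are hypothetical:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointDemo")
    ssc = StreamingContext(sc, 10)

    # A driver-local path like /tmp/checkpoints fails on a cluster;
    # use HDFS, S3, or a Swift container visible to all nodes.
    ssc.checkpoint("swift://checkpoints.myprovider/streaming")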

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move? On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani rrav...@gmail.com wrote: Hi everyone, I have planned to move from MSSQL Server to Spark. I am using around 50,000 to 1 lakh (100,000) records. Spark performance is slow when compared to MSSQL Server. What is

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and

RE: Spark performance

2015-07-11 Thread Roman Sokolov
Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 at 03:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired

Re: How do we control output part files created by Spark job?

2015-07-11 Thread Srikanth
Reducing the number of partitions may have an impact on memory consumption, especially if there is an uneven distribution of the key used in groupBy. It depends on your dataset. On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Srikanth, thanks much, it worked when I set
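An alternative worth noting (a sketch, not from the thread): keep the shuffle wide so per-task memory stays low, then shrink only the output with coalesce, which merges partitions without a full extra shuffle:

    from pyspark import SparkContext

    sc = SparkContext(appName="CoalesceDemo")
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 200)

    # Aggregate at full parallelism (200 partitions here)...
    counts = pairs.reduceByKey(lambda x, y: x + y)

    # ...then merge down only for writing: ~10 part files instead of 200.
    counts.coalesce(10).saveAsTextFile("/tmp/out")  # hypothetical path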

Re: SparkDriverExecutionException when using actorStream

2015-07-11 Thread Juan Rodríguez Hortalá
Hi, I've finally fixed this. The problem was that I wasn't providing a type parameter for the DStream in ssc.actorStream: with an untyped inputDStream : ReceiverInputDStream[Nothing] we get SparkDriverExecutionException: Execution error, caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;

RE: Spark performance

2015-07-11 Thread Mohammed Guller
Hi Roman, Yes, Spark SQL will be a better solution than a standard RDBMS for querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful analytics solution. Mohammed From: David Mitchell [mailto:jdavidmitch...@gmail.com] Sent: Saturday, July 11, 2015 7:10 AM To:
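A minimal sketch of the pairing Mohammed describes, using Spark 1.3-era PySpark APIs; the HDFS path and column names are hypothetical. Parquet's columnar layout lets Spark SQL scan only the columns a query touches, which is what makes multi-TB analytic queries practical:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="ParquetAnalytics")
    sqlContext = SQLContext(sc)

    # Load Parquet data from HDFS and expose it to SQL.
    df = sqlContext.parquetFile("hdfs:///data/events")
    df.registerTempTable("events")

    sqlContext.sql(
        "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type"
    ).show()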

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread leonida.gianfagna
Thanks a lot oubrik, I got your point; my view is that sum() should already be a built-in function for iterators in Python. Anyway, I tried your approach: def mysum(iter): count = sum = 0 for item in iter: count += 1 sum += item return sum wordCountsGrouped =
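The snippet arrives flattened by the archive; here is a reconstruction as a runnable sketch, renaming the accumulator so it no longer shadows Python's built-in sum (which, as the poster suspects, already works on iterators). The sample data and variable names are assumptions:

    from pyspark import SparkContext

    def mysum(iterator):
        count = 0
        total = 0  # the original's 'sum' name shadowed the built-in
        for item in iterator:
            count += 1
            total += item
        return total

    sc = SparkContext(appName="IteratorSumDemo")
    wordPairs = sc.parallelize([("spark", 1), ("spark", 1), ("rdd", 1)])

    # groupByKey yields (key, iterable-of-counts); mapValues(mysum),
    # or simply mapValues(sum), totals each iterable.
    wordCountsGrouped = wordPairs.groupByKey().mapValues(mysum)
    print(wordCountsGrouped.collect())  # e.g. [('spark', 2), ('rdd', 1)]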

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes one block, which allows for multiple concurrent readers. One would typically use a fixed part size that aligns with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM,
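A hedged sketch of acting on this advice, assuming the s3a connector from hadoop-aws (fs.s3a.multipart.size is its part-size setting; the bucket name is hypothetical). Setting the part size to 64 MB keeps each uploaded part readable as one aligned block:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("S3MultipartDemo")
            # 64 MB parts, matching the HDFS default block size Aaron mentions.
            .set("spark.hadoop.fs.s3a.multipart.size", str(64 * 1024 * 1024)))
    sc = SparkContext(conf=conf)

    # Each 64 MB part of a multi-part upload can then be read concurrently.
    data = sc.textFile("s3a://my-bucket/big-input")
    print(data.getNumPartitions())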

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers k/ On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna leonida.gianfa...@gmail.com wrote: Thanks a lot oubrik, I got your point; my view is that sum() should already be a built-in function for iterators in Python. Anyway I tried
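A short sketch of the suggestion, which sidesteps the iterator question entirely: reduceByKey combines values map-side before the shuffle, so no per-key iterator is ever materialized the way it is with groupByKey. Sample data is assumed:

    from pyspark import SparkContext

    sc = SparkContext(appName="ReduceByKeyDemo")
    wordPairs = sc.parallelize([("spark", 1), ("spark", 1), ("rdd", 1)])

    # Partial sums are computed within each partition, then merged.
    wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
    print(wordCounts.collect())  # e.g. [('spark', 2), ('rdd', 1)]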

Worker dies with java.io.IOException: Stream closed

2015-07-11 Thread gaurav sharma
Hi All, I am facing this issue in my production environment. My worker dies by throwing this exception, but I see that space is available on all the partitions of my disk. I did NOT see any abrupt increase in disk IO which might have choked the executor writing to the stderr file. But still

Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly, you are addressing this wrongly: you do not seem to have a business case for changing, so why do you want to switch? On Sat, 11 Jul 2015 at 3:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines,

Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov ole...@gmail.com wrote: Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 at 03:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor