FW: Pyspark: set Orc Stripe.size on dataframe writer issue

2018-10-17 Thread Somasundara, Ashwin
Hello Group, I am having issues setting the stripe size, index stride, and index on an ORC file using PySpark. I am getting approximately 2000 stripes for the 1.2 GB file when I am expecting only 5 stripes for the 256 MB setting. Tried the below options: 1. Set the .options on the data frame writer. The
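For reference, a minimal PySpark sketch of that first approach (passing ORC properties through the DataFrame writer); the option keys, input, and output path here are assumptions, and whether the writer honors them depends on the Spark/ORC versions in play:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-stripe-test").getOrCreate()
    df = spark.read.parquet("/tmp/input")              # hypothetical 1.2 GB input

    # Attempt 1 from the thread: pass ORC properties as writer options.
    (df.write
        .option("orc.stripe.size", 268435456)          # 256 MB stripes
        .option("orc.row.index.stride", 10000)
        .option("orc.create.index", "true")
        .orc("/tmp/output_orc"))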

Spark dataset to byte array over grpc

2018-04-23 Thread Ashwin Sai Shankar
rray? Also is there a better way to send this output to client? Thanks, Ashwin

Re: Why python cluster mode is not supported in standalone cluster?

2018-02-14 Thread Ashwin Sai Shankar
+dev mailing list (since I didn't get a response from the user DL) On Tue, Feb 13, 2018 at 12:20 PM, Ashwin Sai Shankar <ashan...@netflix.com> wrote: > Hi Spark users! > I noticed that Spark doesn't allow Python apps to run in cluster mode in > a Spark standalone cluster. Does anyone kno

Why python cluster mode is not supported in standalone cluster?

2018-02-13 Thread Ashwin Sai Shankar
Hi Spark users! I noticed that Spark doesn't allow Python apps to run in cluster mode in a Spark standalone cluster. Does anyone know the reason? I checked JIRA but couldn't find anything relevant. Thanks, Ashwin

Recompute Spark outputs intelligently

2017-12-15 Thread Ashwin Raju
out which columns need to be recomputed and which can be left as is. Is there a best practice in the Spark ecosystem for this problem? Perhaps some metadata system/data lineage system we can use? I'm curious if this is a common problem that has already been addressed. Thanks, Ashwin

Re: Spark 2.2 streaming with append mode: empty output

2017-08-15 Thread Ashwin Raju
be updated any more. > See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking > > On Mon, Aug 14, 2017 at 4:09 PM, Ashwin Raju <ther...@gmail.com> wrote: > >> Hi, >> >> I am running Spa
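For context, a minimal sketch of append mode with a watermark (the rate source, column names, and thresholds are placeholders): in append mode a windowed aggregate is only emitted once the watermark passes the end of its window, so the output can legitimately look empty for a while.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, col

    spark = SparkSession.builder.appName("append-mode-demo").getOrCreate()

    # Hypothetical rate source standing in for the real input stream.
    events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
              .withColumnRenamed("timestamp", "event_time"))

    counts = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(window(col("event_time"), "5 minutes"))
              .count())

    # Rows appear only after the watermark moves past each window's end.
    query = counts.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()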

Spark 2.2 streaming with append mode: empty output

2017-08-14 Thread Ashwin Raju
ith outputMode("append") however, the output only has the column names, no rows. I was originally trying to output to parquet, which only supports append mode. I was seeing no data in my parquet files, so I switched to console output to debug, then noticed this issue. Am I misunderstanding something about how append mode works? Thanks, Ashwin

Reusing dataframes for streaming (spark 1.6)

2017-08-08 Thread Ashwin Raju
would like to do instead:

    def process(time, rdd):
        # create dataframe from RDD - input_df
        # output_df = dataframe_pipeline_fn(input_df)

-ashwin
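A fuller sketch of that pattern (Spark 1.6-era API; dataframe_pipeline_fn, the source DStream, and the output path are hypothetical):

    from pyspark.sql import SQLContext

    def process(time, rdd):
        if rdd.isEmpty():
            return
        sql_context = SQLContext.getOrCreate(rdd.context)
        input_df = sql_context.createDataFrame(rdd)        # create dataframe from the batch RDD
        output_df = dataframe_pipeline_fn(input_df)        # reuse the same pipeline as the batch job
        output_df.write.mode("append").parquet("/tmp/stream_out")

    dstream.foreachRDD(process)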

Re: Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
ter/core/src/main/scala/org/apache/spark/ContextCleaner.scala > > On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <ashan...@netflix.com.invalid> wrote: > >> Hi! >> >> In spark on yarn, when are shuffle files on local disk removed? (Is it >>

Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
Hi! In spark on yarn, when are shuffle files on local disk removed? (Is it when the app completes or once all the shuffle files are fetched or end of the stage?) Thanks, Ashwin

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Thanks. I'll try that. Hopefully that should work. On Mon, Jul 4, 2016 at 9:12 PM, Mathieu Longtin <math...@closetwork.org> wrote: > I started with a download of 1.6.0. These days, we use a self compiled > 1.6.2. > > On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav <ashraag.

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Longtin <math...@closetwork.org> wrote: > 1.6.1. > > I have no idea. SPARK_WORKER_CORES should do the same. > > On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav <ashraag...@gmail.com> > wrote: > >> Which version of Spark are you using? 1.6.1? >> >

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Which version of Spark are you using? 1.6.1? Any ideas as to why it is not working in ours? On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin <math...@closetwork.org> wrote: > 16. > > On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav <ashraag...@gmail.com> > wrote: >

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
se more than 1 core per server. However, it seems it will > start as many pyspark as there are cores, but maybe not use them. > > On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav <ashraag...@gmail.com> > wrote: > >> Hi Mathieu, >> >> Isn't that the same as setting &

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
node to 1. But the number of >> pyspark.daemons process is still not coming down. It looks like initially >> there is one Pyspark.daemons process and this in turn spawns as many >> pyspark.daemons processes as the number of cores in the machine. >> >> Any help is apprecia
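For reference, a hedged sketch of the cap being discussed in this thread: limit the cores a worker/executor exposes so that pyspark.daemon forks fewer Python workers per node (the values are placeholders).

    from pyspark import SparkConf, SparkContext

    # pyspark.daemon forks roughly one Python worker per concurrently used core,
    # so capping cores per executor also caps the Python processes on each node.
    conf = (SparkConf()
            .set("spark.executor.cores", "1")    # cores per executor
            .set("spark.cores.max", "16"))       # total cores for the app (standalone mode)
    sc = SparkContext(conf=conf)

    # Worker-side equivalent in standalone mode: export SPARK_WORKER_CORES=1 in conf/spark-env.sh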

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Ashwin Raaghav
-- Regards, Ashwin Raaghav

Re: Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Ashwin Giridharan
Hi Vishnu, A partition will either be in memory or on disk. -Ashwin On Feb 28, 2016 15:09, "Vishnu Viswanath" <vishnu.viswanat...@gmail.com> wrote: > Hi All, > > I have a question regarding Persistence (MEMORY_AND_DISK) > > Suppose I am trying to persist an RDD wh
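As a small illustration (the input path is hypothetical), MEMORY_AND_DISK spills whole partitions to disk when they do not fit in memory, rather than splitting a partition across the two:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-demo")
    rdd = sc.textFile("/tmp/big_input.txt")           # hypothetical input
    rdd.persist(StorageLevel.MEMORY_AND_DISK)         # partitions that don't fit in memory go to disk whole
    print(rdd.count())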

Spark streaming: Consistency of multiple streams in Spark

2015-12-17 Thread Ashwin
synchronize these multiple streams. What am I missing? Thanks, Ashwin [1] http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf

Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Ashwin Sai Shankar
Hi Bryan, I see the same issue with 1.5.2, can you please let me know what the resolution was? Thanks, Ashwin On Fri, Nov 20, 2015 at 12:07 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Never mind. I had a library dependency that still had the old Spark version. > > On Fr

Re: Spark on YARN multitenancy

2015-12-15 Thread Ashwin Sai Shankar
We run large multi-tenant clusters with Spark/Hadoop workloads, and we use YARN preemption plus Spark's dynamic allocation to achieve multitenancy. See the following link on how to enable/configure preemption using the fair scheduler:
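A minimal sketch of the dynamic-allocation side of that setup (executor bounds are placeholders; YARN fair-scheduler preemption itself is cluster-side configuration):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true")      # external shuffle service is required
            .set("spark.dynamicAllocation.minExecutors", "1")
            .set("spark.dynamicAllocation.maxExecutors", "50"))
    sc = SparkContext(conf=conf)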

How to display column names in spark-sql output

2015-12-11 Thread Ashwin Shankar
Hi, When we run spark-sql, is there a way to get column names/headers with the result? -- Thanks, Ashwin

Re: How to display column names in spark-sql output

2015-12-11 Thread Ashwin Sai Shankar
Never mind, it's *set hive.cli.print.header=true* Thanks! On Fri, Dec 11, 2015 at 5:16 PM, Ashwin Shankar <ashwinshanka...@gmail.com> wrote: > Hi, > When we run spark-sql, is there a way to get column names/headers with the > result? > > -- > Thanks, > Ashwin

Re: What happens when you create more DStreams then nodes in the cluster?

2015-07-31 Thread Ashwin Giridharan
, Ashwin On Fri, Jul 31, 2015 at 4:52 PM, Brandon White bwwintheho...@gmail.com wrote: Since one input dstream creates one receiver and one receiver uses one executor / node. What happens if you create more Dstreams than nodes in the cluster? Say I have 30 Dstreams on a 15 node cluster

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Ashwin Giridharan
are creating 500 Dstreams based off 500 textfile directories, do we need at least 500 executors / nodes to be receivers for each one of the streams? On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das t...@databricks.com wrote: @Ashwin: You could append the topic in the data. val kafkaStreams
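A PySpark sketch of the fan-in approach under discussion: create one receiver-based stream per topic and union them so downstream processing runs on a single DStream. The ZooKeeper quorum, group id, and topic names are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="multi-stream-union")
    ssc = StreamingContext(sc, 10)
    topics = ["topic-%d" % i for i in range(5)]        # placeholder topic names

    # One receiver (and one core) per input stream; union them into a single DStream.
    streams = [KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", {t: 1}) for t in topics]
    unioned = ssc.union(*streams)
    unioned.count().pprint()

    ssc.start()
    ssc.awaitTermination()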

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-30 Thread Ashwin Giridharan
, then an optimal configuration would be: --num-executors 8 --executor-cores 2 --executor-memory 2G Thanks, Ashwin On Thu, Jul 30, 2015 at 12:08 PM, unk1102 umesh.ka...@gmail.com wrote: Hi I have one Spark job which runs fine locally with less data but when I schedule it on YARN to execute I keep
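The same sizing expressed as configuration properties, a hedged equivalent of the --num-executors/--executor-cores/--executor-memory flags above (values taken from the reply; these are normally passed at submit time):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.instances", "8")
            .set("spark.executor.cores", "2")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)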

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Ashwin Giridharan
? What is the best way to parallelize this? Any other ideas on design? -- Thanks Regards, Ashwin Giridharan

Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
to hostmachine's ip/port. So the AM can then talk to the host machine's ip/port, which would be mapped to the container. Thoughts? -- Thanks, Ashwin

How to pass system properties in spark ?

2015-06-03 Thread Ashwin Shankar
see the following: log4j: Setting property [file] to []. log4j: setFile called: , true log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: (No such file or directory) at java.io.FileOutputStream.open(Native Method) -- Thanks, Ashwin
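For reference, a hedged sketch of one common way to pass -D system properties to the driver and executors (the log4j.properties path is hypothetical); an empty or missing file property is consistent with the setFile(null,true) error above:

    from pyspark import SparkConf, SparkContext

    java_opts = "-Dlog4j.configuration=file:/path/to/log4j.properties"   # hypothetical path
    conf = (SparkConf()
            .set("spark.executor.extraJavaOptions", java_opts)
            .set("spark.driver.extraJavaOptions", java_opts))   # driver options usually need to be set at submit time
    sc = SparkContext(conf=conf)

    # Command-line equivalent: spark-submit --driver-java-options "-Dlog4j.configuration=..." ...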

Spark on Yarn : Map outputs lifetime ?

2015-05-12 Thread Ashwin Shankar
Hi, In Spark on YARN, when running spark_shuffle as an auxiliary service on the node manager, do map spills of a stage get cleaned up once the next stage completes, OR are they preserved till the app completes (i.e., until all the stages complete)? -- Thanks, Ashwin

Building spark targz

2014-11-12 Thread Ashwin Shankar
, Ashwin

Re: Building spark targz

2014-11-12 Thread Ashwin Shankar
making sure but are you looking for the tar in assembly/target dir ? On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: Hi, I just cloned spark from the github and I'm trying to build to generate a tar ball. I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version

Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
isolation ? I know I'm asking a lot of questions. Thanks in advance :) ! -- Thanks, Ashwin Netflix

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
, will the application progress with the remaining resources/fair share? I'm new to Spark, sorry if I'm asking something very obvious :). Thanks, Ashwin On Wed, Oct 22, 2014 at 12:07 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Ashwin, Let me try to answer to the best of my knowledge. On Wed