Re: EXT: Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-28 Thread Vibhor Gupta
You can also go through this Medium article, which describes a similar problem to yours: https://blog.clairvoyantsoft.com/bucketing-in-spark-878d2e02140f Regards, Vibhor Gupta
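For context, a minimal Java sketch of the bucketing idea from that article. The table and column names (orders, customers, customer_id) are hypothetical, and a SparkSession `spark` with warehouse support is assumed; writing both join sides bucketed and sorted on the same key lets the join reuse the bucket layout instead of shuffling:

    Dataset<Row> orders = spark.table("orders");        // hypothetical source tables
    Dataset<Row> customers = spark.table("customers");

    // Bucket and sort both sides on the join key, with the same bucket count.
    orders.write().bucketBy(16, "customer_id").sortBy("customer_id")
          .mode("overwrite").saveAsTable("orders_bucketed");
    customers.write().bucketBy(16, "customer_id").sortBy("customer_id")
          .mode("overwrite").saveAsTable("customers_bucketed");

    // A join between the bucketed tables on customer_id can then use the
    // bucket layout; the plan should show no Exchange on that key.
    spark.table("orders_bucketed")
         .join(spark.table("customers_bucketed"), "customer_id")
         .explain();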

Re: EXT: Driver takes long time to finish once job ends

2022-11-22 Thread Vibhor Gupta
Hi Nikhil, You might be using the v1 file output commit protocol. See "What is the difference between mapreduce.fileoutputcommitter.algorithm.version=1 and 2" (Open Knowledge Base): http://www.openkb.info/2019/04/what-is-difference-between.html
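A sketch of where that setting would be applied if you want to opt in to the v2 algorithm at session build time (the app name is a placeholder):

    SparkSession spark = SparkSession.builder()
        .appName("commit-v2-example")
        // v2 moves task output files to their final location during task
        // commit, so the driver does far less renaming at job end than v1.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate();

Note the trade-off: v2 is faster at job commit, but it is not atomic, so partially written output can be visible if a job fails midway.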

Offline elastic index creation

2022-11-09 Thread Vibhor Gupta
/Azure, but is there a way to generate this snapshot offline using Spark? Thanks, Vibhor Gupta

Re: EXT: Re: Spark SQL

2022-09-15 Thread Vibhor Gupta
Hi Mayur, In Java, you can call Future.get with a timeout and then cancel the future in the catch block once the timeout has been reached. There should be something similar in Scala as well. E.g.: https://stackoverflow.com/a/16231834
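A minimal self-contained Java sketch of that pattern; the sleeping task body and the 2-second timeout are placeholders for a long-running query:

    import java.util.concurrent.*;

    public class TimeoutExample {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<String> future = pool.submit(() -> {
                Thread.sleep(10_000); // stand-in for a long-running query
                return "done";
            });
            try {
                // Wait at most 2 seconds for the result.
                System.out.println(future.get(2, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                // Cancel the task once the timeout is reached.
                future.cancel(true); // true => interrupt the running thread
                System.out.println("timed out, task cancelled");
            } finally {
                pool.shutdown();
            }
        }
    }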

Re: EXT: Network time out property is not getting set in Spark

2022-09-13 Thread Vibhor Gupta
Hi Sachit, Check the migration guide ("Migration Guide: SQL, Datasets and DataFrame", Spark 3.3.0 documentation): https://spark.apache.org/docs/latest/sql-migration-guide.html#:~:text=Spark%202.4%20and%20below%3A%20the,legacy.setCommandRejectsSparkCoreConfs%20to%20false.
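The relevant point in that guide: since Spark 3.0 the SQL SET command rejects keys that belong to SparkConf (such as spark.network.timeout). A sketch of the two options it describes, using spark.network.timeout as the example property:

    // Option 1: set the core conf before the session is created,
    // e.g. via the builder (or spark-submit --conf).
    SparkSession spark = SparkSession.builder()
        .config("spark.network.timeout", "600s")
        .getOrCreate();

    // Option 2 (legacy escape hatch from the migration guide):
    //   spark.sql.legacy.setCommandRejectsSparkCoreConfs=false
    // This restores the pre-3.0 behaviour where SET silently accepts such
    // keys, but the value still has no effect, since SET never updated
    // SparkConf in the first place.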

Dynamic shuffle partitions in a single job

2022-09-08 Thread Vibhor Gupta
Hi Community, Is it possible to set the number of shuffle partitions per exchange? My Spark query contains a lot of joins/aggregations involving big tables and small tables, so keeping a high value of spark.sql.shuffle.partitions helps with the big tables, but for the small tables it creates a lot of overhead on
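Not from this thread, but the closest built-in answer I know of is adaptive query execution in Spark 3.x, which re-sizes shuffle partitions per exchange at runtime; a sketch of the relevant configs (the partition count is a placeholder):

    SparkSession spark = SparkSession.builder()
        // Start with a high partition count for the big joins...
        .config("spark.sql.shuffle.partitions", "2000")
        // ...and let AQE coalesce small exchanges down at runtime.
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .getOrCreate();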

Re: EXT: Re: Create Dataframe from a single String in Java

2021-11-18 Thread Vibhor Gupta
In case you are specifically looking for a createDataFrame method, you can use sparkSession.createDataFrame( Arrays.asList("apple","orange","banana").stream().map(RowFactory::create).collect(Collectors.toList()), new StructType().add("fruits", "string") ).show();
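For reference, a self-contained version of that snippet with the imports it needs; the string column type for the truncated second argument is an assumption, inferred from the toDF("fruits") variant in the sibling reply:

    import java.util.Arrays;
    import java.util.stream.Collectors;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class CreateDataFrameExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("create-dataframe").master("local[*]").getOrCreate();

            spark.createDataFrame(
                Arrays.asList("apple", "orange", "banana").stream()
                      .map(RowFactory::create)           // wrap each value in a Row
                      .collect(Collectors.toList()),
                new StructType().add("fruits", DataTypes.StringType)
            ).show();

            spark.stop();
        }
    }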

Re: EXT: Re: Create Dataframe from a single String in Java

2021-11-18 Thread Vibhor Gupta
You can try something like below. It creates a Dataset and then converts it into a DataFrame. sparkSession.createDataset( Arrays.asList("apple","orange","banana"), Encoders.STRING() ).toDF("fruits").show(); Regards, Vibhor Gupta.