Re: [Spark CORE][Spark SQL][Advanced]: Why dynamic partition pruning optimization does not work in this scenario?

2021-12-04 Thread Mohamadreza Rostami
...er than the default (10 MB) and trigger dynamic partition pruning, although I can see it may be beneficial to implement dynamic partition pruning for broadcast joins as well...

> On Dec 4, 2021, at 8:41 AM, Mohamadreza Rostami <mohamadrezarosta...@gmail.com> wrote:
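
For context, a minimal sketch of the configuration under discussion, assuming Spark 3.x (where dynamic partition pruning is on by default); the table names, paths, and the `is_holiday` column are hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dpp-sketch")
             # dynamic partition pruning is enabled by default in Spark 3.x
             .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
             # the broadcast threshold referenced above; 10 MB is the default
             .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
             .getOrCreate())

    # hypothetical tables: a fact table partitioned by `date` and a small
    # dimension table that fits under the broadcast threshold
    sales = spark.read.parquet("hdfs://test/sales")
    dates = spark.read.parquet("hdfs://test/dates")

    # a selective filter on the dimension side lets the optimizer prune
    # `sales` partitions at runtime instead of scanning all of them
    result = sales.join(dates.where(dates["is_holiday"] == True), "date")
    result.explain()  # check the plan for a dynamic pruning subquery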

Re: Scheduling Time > Processing Time

2021-06-20 Thread Mohamadreza Rostami
Hi, I think it's because of the locality timeout. In streaming jobs you must decrease the locality timeout. Sent from my iPhone

> On Jun 20, 2021, at 11:55 PM, Siva Tarun Ponnada wrote:
> Hi Team,
> I have a Spark Streaming job which I am running in a single-node cluster. I
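
A minimal sketch of the tuning being suggested; the 0s value is illustrative and deliberately trades data locality for lower scheduling latency:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("streaming-locality-sketch")
             # spark.locality.wait defaults to 3s; in a streaming job with
             # short micro-batches, waiting for a data-local executor can
             # dominate scheduling time, so the wait is lowered here
             .config("spark.locality.wait", "0s")
             .getOrCreate())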

Re: Benchmarks for Many-to-Many Joins

2021-04-22 Thread Mohamadreza Rostami
What kind of benchmark do you need to run? I mean, do you want to benchmark Spark's many-to-many joins, or another aspect of Spark or the cluster (such as network or disk)? If you only want to benchmark a many-to-many join, you can use a cross join or repartition the data with another
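
A minimal sketch of that suggestion, assuming Spark 3.x (for the noop sink) and synthetic data; row counts and partition counts are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("m2m-join-bench").getOrCreate()

    # synthetic inputs with a low-cardinality key, so each key matches
    # many rows on both sides and the join is many-to-many
    left = spark.range(1_000_000).withColumn("key", F.col("id") % 1000)
    right = spark.range(1_000_000).withColumn("key", F.col("id") % 1000)

    # repartitioning both sides by the join key makes the shuffle explicit
    joined = (left.repartition(200, "key")
                  .join(right.repartition(200, "key"), "key"))

    # alternatively, a cross join is the extreme many-to-many case:
    # joined = left.crossJoin(right)

    # the noop sink forces full execution without any output I/O
    joined.write.format("noop").mode("overwrite").save()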

[Spark Core][Advanced]: Wrong memory allocation on standalone mode cluster

2021-04-18 Thread Mohamadreza Rostami
I see a bug in executor memory allocation on standalone clusters, but I can't find which part of the Spark code causes this problem. That's why I decided to raise this issue here. Assume you have 3 workers, each with 10 CPU cores and 10 gigabytes of memory. Assume also that you have 2 Spark jobs that run
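
To make the scenario concrete, a minimal sketch of how one of the two jobs might request resources on such a cluster, assuming standalone mode; the master URL is hypothetical:

    from pyspark.sql import SparkSession

    # job 1 of 2: request half of each worker's resources so that the
    # second job can be scheduled alongside it
    spark = (SparkSession.builder
             .appName("job-1")
             .master("spark://master:7077")          # hypothetical master URL
             .config("spark.executor.cores", "5")    # cores per executor
             .config("spark.executor.memory", "5g")  # memory per executor
             .config("spark.cores.max", "15")        # total-core cap for this app
             .getOrCreate())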

[Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Mohamadreza Rostami
I have a Hadoop cluster that uses Apache Spark to query Parquet files stored on HDFS. For example, I use the following PySpark code to find a word in the Parquet files:

    df = spark.read.parquet("hdfs://test/parquets/*")
    df.filter(df['word'] == "jhon").show()

After running this code, I go to
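
For anyone reproducing this, a self-contained version of the snippet above, assuming the same hdfs://test path is reachable from the driver:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("locality-test").getOrCreate()

    # read all Parquet files under the directory, then filter on one column
    df = spark.read.parquet("hdfs://test/parquets/*")
    df.filter(df["word"] == "jhon").show()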