Re: Implementing Upsert logic Through Streaming

2019-06-29 Thread Chris Teoh
Not sure what your needs are here. If you can afford to wait, increase your micro-batch window to a long period of time, aggregate your data by key in every micro-batch, and then apply those changes to the Oracle database. Since you're using a text file to stream, there's no way to pre-partition your
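
A minimal sketch of the "aggregate by key per micro-batch, then apply" idea described above. It is plain Python rather than Spark, and it uses an in-memory SQLite table as a stand-in for the Oracle target; the table name `target`, the helper names `reduce_batch` and `apply_upserts`, and the sample batches are all hypothetical. "Aggregate" is assumed here to mean "keep the last value per key". On Oracle the write step would more likely be a `MERGE` statement than SQLite's `INSERT OR REPLACE`.

```python
import sqlite3


def reduce_batch(records):
    """Collapse one micro-batch to the last value seen for each key."""
    latest = {}
    for key, value in records:
        latest[key] = value  # later records overwrite earlier ones
    return latest


def apply_upserts(conn, latest):
    """Write one row per key: insert new keys, overwrite existing ones."""
    conn.executemany(
        "INSERT OR REPLACE INTO target (id, value) VALUES (?, ?)",
        list(latest.items()),
    )
    conn.commit()


# Stand-in for the real database (assumption: SQLite instead of Oracle).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id TEXT PRIMARY KEY, value INTEGER)")

# Two hypothetical micro-batches of (key, value) records.
batches = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 5), ("c", 7)],
]
for batch in batches:
    apply_upserts(conn, reduce_batch(batch))

rows = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
```

Reducing each batch first means the database sees at most one write per key per batch, which is the reason a longer micro-batch window lowers the load on the target.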

Re: Map side join without broadcast

2019-06-29 Thread Chris Teoh
The closest thing I can think of here is if you have both dataframes written out using buckets. Hive uses this technique for join optimisation such that both datasets of the same bucket are read by the same mapper to achieve map side joins. On Sat., 29 Jun. 2019, 9:10 pm jelmer, wrote: > I have
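
To illustrate why bucketing enables a join without a shuffle or broadcast, here is a small plain-Python sketch, not Spark or Hive code. Both inputs are split with the same hash function and bucket count, so rows with equal keys always land in the same bucket number and each bucket pair can be joined independently. The names `bucket_of`, `write_buckets`, and `bucketed_join` and the sample rows are hypothetical. In Spark, the corresponding write-side call would be `DataFrameWriter.bucketBy(...)` followed by `saveAsTable(...)`, used with the same bucket count and column on both tables.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    """Deterministic bucket number for a string key."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_buckets(rows, num_buckets):
    """Group (key, value) rows by bucket number."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets=4):
    """Inner-join two datasets one bucket at a time."""
    left_b = write_buckets(left, num_buckets)
    right_b = write_buckets(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only rows from bucket b of each side are ever compared.
        lookup = defaultdict(list)
        for key, value in left_b.get(b, []):
            lookup[key].append(value)
        for key, value in right_b.get(b, []):
            for left_value in lookup.get(key, []):
                joined.append((key, left_value, value))
    return joined


# Hypothetical sample data.
left = [("k1", "A1"), ("k2", "A2")]
right = [("k1", 10), ("k2", 20), ("k1", 30), ("k3", 40)]
result = sorted(bucketed_join(left, right))
```

Each loop iteration over `b` corresponds to the work one mapper or task would do on its own pair of bucket files.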

Re: Map side join without broadcast

2019-06-29 Thread Arbab Khalil
You can use coalesce(1) or repartition on B, but it would be better to put A in cache so that it becomes available on all executors as well as in memory, because it contains only one row. On Sat, Jun 29, 2019 at 4:10 PM jelmer wrote: > I have 2 dataframes, > > Dataframe A which contains 1

Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Rishi Shah
Thanks Abdeali! Please find details below:
df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063
col1 is a string
Spark version 2.4.0
datasize:

Map side join without broadcast

2019-06-29 Thread jelmer
I have 2 dataframes: Dataframe A, which contains 1 element per partition that is gigabytes big (an index), and Dataframe B, which is made up of millions of small rows. I want to join B on A, but I want all the work to be done on the executors holding the partitions of dataframe A. Is there a way to

Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Abdeali Kothari
How large is the data frame, and what data type are you counting distinct for? I use count distinct quite a bit and haven't noticed anything peculiar. Also, which exact version in 2.3.x? And are you performing any operations on the DF before the countDistinct? I recall there was a bug when I did