Re: Filter out 20% of rows

2023-09-16 Thread ashok34...@yahoo.com.INVALID
Thank you Bjorn and Mich. Appreciated. Best. On Saturday, 16 September 2023 at 16:50:04 BST, Mich Talebzadeh wrote: Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
EDIT: I don't think that the question asker wants only the top 25 percent returned. On Sat, 16 Sep 2023 at 21:54, Bjørn Jørgensen wrote: > percentile_approx returns the approximate percentile(s) > The memory consumption is > bounded. The

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
percentile_approx returns the approximate percentile(s). The memory consumption is bounded. The larger the accuracy parameter we choose, the smaller the error we get. The default accuracy value is 10000, to match the Hive default setting. Choose a smaller value

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have modified your code to use percentile_approx rather than manually computing it. It would be interesting to hear ideas on this.
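
To make the idea in this message concrete, here is a minimal plain-Python sketch of filtering by a percentile threshold. It is not the code from the thread, which is not shown here. The function names, the `score` field, and the sample rows are all hypothetical, and the percentile is computed exactly by sorting. In Spark the threshold would instead come from percentile_approx, which is approximate, so the cut-off could differ slightly.

```python
def percentile_threshold(values, keep_percent):
    """Return the smallest value v such that at least keep_percent% of values are <= v.

    Exact nearest-rank percentile computed by sorting; this is an illustrative
    stand-in for Spark's approximate percentile_approx, not its algorithm.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in (0, 100]")
    ordered = sorted(values)
    # 1-based nearest rank, using integer arithmetic to avoid float rounding.
    rank = (keep_percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def drop_top_percent(rows, key, drop_percent):
    """Keep the rows whose `key` value is at or below the (100 - drop_percent)th percentile."""
    threshold = percentile_threshold([row[key] for row in rows], 100 - drop_percent)
    # Rows tied with the threshold are kept, so ties can leave more than
    # (100 - drop_percent)% of the rows in the result.
    return [row for row in rows if row[key] <= threshold]


# Hypothetical sample data: ten rows with scores 10, 20, ..., 100.
rows = [{"id": i, "score": i * 10} for i in range(1, 11)]
kept = drop_top_percent(rows, "score", 20)

print(len(rows))  # row count before filtering
print(len(kept))  # row count after dropping the top 20%
```

Comparing the two counts at the end mirrors the df.count() check discussed in this thread. Whether the asker wants to drop the top 20%, the bottom 20%, or a random 20% is not settled in the snippets above, so the direction of the comparison is an assumption.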

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Happy Saturday coding  Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
Ah, yes, that's right. I did have to spend some time on this one and I was having some issues with the code. I have now restarted the notebook kernel and rerun it, and I get the same result. On Sat, 16 Sep 2023 at 11:41, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > Splendid code. A minor error

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Splendid code. I spotted a minor error glancing at it: print(df.count()) print(result_df.count()) You have not defined result_df. I gather you meant "result"? print(result.count()) That should fix it. HTH