Hi Tzahi,

I think that Spark broadcasts small tables automatically in recent versions,
but you may have to check for your version. Did you try filtering first and
then doing the LEFT JOIN?
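
Roughly what I mean, in case it helps (the 100 MB threshold below is only an
example value, and the /*+ BROADCAST */ hint needs Spark 2.2 or later, so on
2.1 the session property is the one to try):

-- Tables whose estimated size is below this many bytes are broadcast
-- automatically; 104857600 (~100 MB) is just an illustrative value.
SET spark.sql.autoBroadcastJoinThreshold=104857600;

-- On Spark 2.2+ the broadcast can also be requested per query, e.g.:
-- SELECT /*+ BROADCAST(c) */ ...
-- FROM ... LEFT JOIN campaigns AS c ON c.campaign_id = re.campaign_id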

Regards,
Gourav Sengupta

On Sun, Jan 13, 2019 at 9:20 AM Tzahi File <tzahi.f...@ironsrc.com> wrote:

> Hi Gourav,
>
> I just wanted to attach an example of my query, so I replaced my field
> names with "select *"; I do have aggregate fields in my query.
>
> What about improving performance on the Spark side - broadcasting or
> something like that?
>
> Thanks,
> Tzahi
>
> On Thu, Jan 10, 2019 at 7:23 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Tzahi,
>>
>> By using GROUP BY without any aggregate columns, are you just trying to
>> find the DISTINCT of the columns?
>>
>> Also, it may help (I do not know whether the SQL optimiser automatically
>> takes care of this) to run the LEFT JOIN against a smaller data set, by
>> doing the join on device_id first as a subquery or separate query, and,
>> when writing the output of the JOIN between csv_file and raw_e, to ORDER
>> BY that output on campaign_id.
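>>
>> Something along these lines, perhaps (just a rough sketch based on the
>> example query -- the DISTINCT is my assumption about what the GROUP BY
>> over all the columns is meant to do):
>>
>> SELECT DISTINCT re.*, c.*
>> FROM (SELECT re.*
>>       FROM parquet_files.raw_e AS re
>>       JOIN csv_file AS g
>>         ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
>>       WHERE re.event_day BETWEEN '2018-11-28' AND '2018-12-28') AS re
>> LEFT JOIN campaigns AS c
>>   ON c.campaign_id = re.campaign_id
>> ORDER BY re.campaign_id
>>
>> That way the LEFT JOIN to campaigns only sees the rows that survive the
>> device_id join.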
>>
>> Thanks and Regards,
>> Gourav Sengupta
>>
>>
>> On Thu, Jan 10, 2019 at 1:13 PM Tzahi File <tzahi.f...@ironsrc.com>
>> wrote:
>>
>>> Hi Gourav,
>>>
>>> My version of Spark is 2.1.
>>>
>>> The data is stored in an S3 directory in parquet format.
>>>
>>> I sent you an example of the query I would like to run (the raw_e table
>>> is stored as parquet files and event_day is the partition field):
>>>
>>> SELECT *
>>> FROM (SELECT *
>>>       FROM parquet_files.raw_e
>>>       WHERE event_day >= '2018-11-28' AND event_day <= '2018-12-28') AS re
>>> JOIN csv_file AS g
>>>   ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
>>> LEFT JOIN campaigns AS c
>>>   ON c.campaign_id = re.campaign_id
>>> GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
>>> 19, 20, 21
>>>
>>> Looking forward to any insights.
>>>
>>>
>>> Thanks.
>>>
>>> On Wed, Jan 9, 2019 at 8:21 AM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Can you please let us know the SPARK version, and the query, and
>>>> whether the data is in parquet format or not, and where is it stored?
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Wed, Jan 9, 2019 at 1:53 AM 大啊 <belie...@163.com> wrote:
>>>>
>>>>> What is your performance issue?
>>>>>
>>>>> At 2019-01-08 22:09:24, "Tzahi File" <tzahi.f...@ironsrc.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I have a performance issue running a SQL query on Spark.
>>>>>
>>>>> The query joins one partitioned parquet table (partitioned by date),
>>>>> where each partition is about 200 GB, with a simple table of about 100
>>>>> records. The Spark cluster is of type m5.2xlarge - 8 cores. I'm using
>>>>> the Qubole interface to run the SQL query.
>>>>>
>>>>> After searching for ways to improve my query, I added the following
>>>>> settings to the configuration:
>>>>> spark.sql.shuffle.partitions=1000
>>>>> spark.dynamicAllocation.maxExecutors=200
>>>>>
>>>>> There wasn't any significant improvement. I'm looking for any ideas
>>>>> to improve my running time.
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Tzahi
>>>>>
>>>>
>>>
>>
>
