Re: FullOuterJoin on Spark

2016-06-22 Thread Gourav Sengupta
+1 for the guidance from Nirav. It would also be better to repartition and
store the data in Parquet format in case you are planning to do the joins
more than once or with other data sources. Parquet with Spark works like a
charm. Over S3 I have seen its performance come quite close to that of
cached data over a few million records.
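
For example, something like this (a rough sketch; the path, the partition
count, and the "domain" column are placeholders for your own):

  import org.apache.spark.sql.functions.col

  // repartition by the join key before writing, so that later joins
  // against this data start from a sensible layout (200 is a placeholder)
  pages.repartition(200, col("domain"))
    .write
    .mode("overwrite")
    .parquet("s3://your-bucket/pages.parquet")

  // read it back for the join(s)
  val pagesFromParquet = sqlContext.read.parquet("s3://your-bucket/pages.parquet")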

Regards,
Gourav


Re: FullOuterJoin on Spark

2016-06-22 Thread Nirav Patel
Can your domain list fit in the memory of one executor? If so, you can use a
broadcast join.
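
Something like this with the DataFrame API (a sketch; pages, domainRules and
the "domain" column are placeholders for your own names):

  import org.apache.spark.sql.functions.broadcast

  // broadcast() ships the small rules table to every executor, so the
  // large pages table never gets shuffled; note that a full outer join
  // can't be broadcast, hence the inner join here (derive the rest from
  // the original sets, as below)
  val joined = pages.join(broadcast(domainRules), Seq("domain"), "inner")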

You can always narrow it down to an inner join and derive the rest from the
original sets if memory is an issue there. If you are just concerned about
shuffle memory, you can do the following to reduce the amount of shuffle
(sketch below):
1) partition both RDDs (or DataFrames) with the same partitioner and the same
partition count, so that corresponding data at least ends up on the same node
2) increase spark.shuffle.memoryFraction
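
Roughly like this with the RDD API (a sketch; the .domain field and the
partition count are placeholders):

  import org.apache.spark.HashPartitioner

  // same partitioner and count on both sides, so matching keys land in
  // the same partition and the join needs no further shuffle
  val partitioner = new HashPartitioner(200)
  val pagesByDomain = pages.map(p => (p.domain, p)).partitionBy(partitioner)
  val rulesByDomain = rules.map(r => (r.domain, r)).partitionBy(partitioner)

  val joined = pagesByDomain.fullOuterJoin(rulesByDomain)

  // and in the job config (legacy memory model, pre-1.6):
  //   spark.shuffle.memoryFraction=0.4   (the default is 0.2)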

You can use DataFrames with Spark 1.6 or greater to further reduce the memory
footprint. I haven't tested that, though.
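
E.g. (same placeholder names as above):

  // DataFrames keep rows in Tungsten's compact binary format, which
  // should cut per-row memory compared to Java objects in plain RDDs
  val joined = pagesDF.join(rulesDF, Seq("domain"), "full_outer")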


FullOuterJoin on Spark

2016-06-21 Thread Rychnovsky, Dusan
Hi,

Can somebody please explain how fullOuterJoin works in Spark? Does each
intersection get fully loaded into memory?

My problem is as follows. I have two large data-sets:

* a list of web pages,
* a list of domain names with specific rules for processing pages from that
domain.

I am joining these web pages with the processing rules. For certain domains
there are millions of web pages.

Judging by the memory demands of the join, it looks like each whole
intersection (i.e. a domain + all corresponding pages) is kept in memory
while processing. What I really need in this case, though, is to hold just
the domain and iterate over all corresponding pages, one at a time.

What would be the best way to do this in Spark?

Thank you,

Dusan Rychnovsky