Re: can spark take advantage of ordered data?

2017-03-10 Thread Jonathan Coveney
While I was at Two Sigma I ended up implementing something similar to what
Koert described... you can check it out here:
https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OrderedRDD.scala.
They've built a lot more on top of this (including support for DataFrames,
etc.).

2017-03-10 9:45 GMT-05:00 Koert Kuipers :

> this shouldn't be too hard. adding something to spark-sorted or to the
> dataframe/dataset logical plan that says "trust me, i am already
> partitioned and sorted" seems doable. however you most likely need a custom
> hash partitioner, and you have to be careful to read the data in without
> file splitting.
>
> On Mar 10, 2017 9:10 AM, "sourabh chaki"  wrote:
>
>> My use case is also quite similar. I have 2 feeds, one 3TB and another
>> 100GB. Both feeds are generated by a Hadoop reduce operation and
>> partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions
>> whereas the 100GB feed has 200 partitions.
>>
>> Now when I do a join between these two feeds using Spark, Spark shuffles
>> both RDDs and it takes a long time to complete. Can we do something so
>> that Spark can recognise the existing partitions of the 3TB feed and
>> shuffle only the 100GB feed?
>> Could it be a map-side scan for the bigger RDD and a shuffle read from the
>> smaller RDD?
>>
>> I have looked at the spark-sorted project, but that project does not
>> utilise the pre-existing partitions in the feed.
>> Any pointers would be helpful.
>>
>> Thanks
>> Sourabh
>>
>> On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid 
>> wrote:
>>
>>> Hi Jonathan,
>>>
>>> you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
>>> (not yet available) and https://github.com/tresata/spark-sorted (not part
>>> of Spark, but it is available right now). Hopefully that's what you are
>>> looking for. To the best of my knowledge that covers what is available now
>>> / what is being worked on.
>>>
>>> Imran
>>>
>>> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney 
>>> wrote:
>>>
 Hello all,

 I am wondering if spark already has support for optimizations on sorted
 data and/or if such support could be added (I am comfortable dropping to a
 lower level if necessary to implement this, but I'm not sure if it is
 possible at all).

 Context: we have a number of data sets which are essentially already
 sorted on a key. With our current systems, we can take advantage of this to
 do a lot of analysis in a very efficient fashion...merges and joins, for
 example, can be done very efficiently, as can folds on a secondary key and
 so on.

 I was wondering if spark would be a fit for implementing these sorts of
 optimizations? Obviously it is sort of a niche case, but would this be
 achievable? Any pointers on where I should look?

>>>
>>>
>>


Re: can spark take advantage of ordered data?

2017-03-10 Thread Yong Zhang
I think it is an interesting requirement, but I am not familiar enough with
Spark to say whether it can be done with the latest Spark version or not.


From my understanding, you are looking for some API in Spark to read the
source directly into a ShuffledRDD, which indeed needs (K, V) records and a
Partitioner instance.


I don't think Spark provides that directly as of now, but in your case it
makes sense to create a JIRA asking Spark to support it in the future.


For right now, maybe there are ways to use the Spark developer API to do what
you need; I will leave that to other Spark experts to confirm.


Yong



From: sourabh chaki <chaki.sour...@gmail.com>
Sent: Friday, March 10, 2017 9:03 AM
To: Imran Rashid
Cc: Jonathan Coveney; user@spark.apache.org
Subject: Re: can spark take advantage of ordered data?

My use case is also quite similar. I have 2 feeds, one 3TB and another 100GB.
Both feeds are generated by a Hadoop reduce operation and partitioned by the
Hadoop HashPartitioner. The 3TB feed has 10K partitions whereas the 100GB feed
has 200 partitions.

Now when I do a join between these two feeds using Spark, Spark shuffles both
RDDs and it takes a long time to complete. Can we do something so that Spark
can recognise the existing partitions of the 3TB feed and shuffle only the
100GB feed?
Could it be a map-side scan for the bigger RDD and a shuffle read from the
smaller RDD?

I have looked at the spark-sorted project, but that project does not utilise
the pre-existing partitions in the feed.
Any pointers would be helpful.

Thanks
Sourabh

On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid <iras...@cloudera.com> wrote:
Hi Jonathan,

you might be interested in https://issues.apache.org/jira/browse/SPARK-3655 
(not yet available) and https://github.com/tresata/spark-sorted (not part of 
Spark, but it is available right now). Hopefully that's what you are looking
for. To the best of my knowledge that covers what is available now / what is
being worked on.

Imran

On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
Hello all,

I am wondering if spark already has support for optimizations on sorted data 
and/or if such support could be added (I am comfortable dropping to a lower 
level if necessary to implement this, but I'm not sure if it is possible at 
all).

Context: we have a number of data sets which are essentially already sorted on 
a key. With our current systems, we can take advantage of this to do a lot of 
analysis in a very efficient fashion...merges and joins, for example, can be 
done very efficiently, as can folds on a secondary key and so on.

I was wondering if spark would be a fit for implementing these sorts of 
optimizations? Obviously it is sort of a niche case, but would this be 
achievable? Any pointers on where I should look?




Re: can spark take advantage of ordered data?

2017-03-10 Thread Koert Kuipers
this shouldn't be too hard. adding something to spark-sorted or to the
dataframe/dataset logical plan that says "trust me, i am already
partitioned and sorted" seems doable. however you most likely need a custom
hash partitioner, and you have to be careful to read the data in without
file splitting.
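
a minimal, untested sketch of those two pieces, assuming the feeds were
written by Hadoop's default HashPartitioner and that each part file is read
back as exactly one split; the class names below are illustrative, not an
existing Spark or spark-sorted API:

  import org.apache.spark.{Partition, Partitioner, TaskContext}
  import org.apache.spark.rdd.RDD

  // A Spark partitioner that routes keys the same way as Hadoop's default
  // HashPartitioner, so Spark partition i lines up with reduce output file i.
  class HadoopCompatiblePartitioner(override val numPartitions: Int)
      extends Partitioner {
    override def getPartition(key: Any): Int =
      (key.hashCode & Integer.MAX_VALUE) % numPartitions
  }

  // A "trust me, I am already partitioned" wrapper: it passes the parent's
  // data through unchanged and merely asserts the given partitioner. It is
  // only correct if the files really are laid out that way and each file is
  // read as a single split (e.g. by pushing the input split size above the
  // largest file size).
  class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
      extends RDD[(K, V)](prev) {
    override val partitioner: Option[Partitioner] = Some(part)
    override protected def getPartitions: Array[Partition] = prev.partitions
    override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
      prev.iterator(split, context)
  }

something like new AssumePartitionedRDD(pairs, new HadoopCompatiblePartitioner(10000))
would then let later joins and cogroups treat the feed as already partitioned,
provided the layout assumption actually holds.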

On Mar 10, 2017 9:10 AM, "sourabh chaki"  wrote:

> My use case is also quite similar. I have 2 feeds, one 3TB and another
> 100GB. Both feeds are generated by a Hadoop reduce operation and
> partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions
> whereas the 100GB feed has 200 partitions.
>
> Now when I do a join between these two feeds using Spark, Spark shuffles
> both RDDs and it takes a long time to complete. Can we do something so
> that Spark can recognise the existing partitions of the 3TB feed and
> shuffle only the 100GB feed?
> Could it be a map-side scan for the bigger RDD and a shuffle read from the
> smaller RDD?
>
> I have looked at the spark-sorted project, but that project does not
> utilise the pre-existing partitions in the feed.
> Any pointers would be helpful.
>
> Thanks
> Sourabh
>
> On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid 
> wrote:
>
>> Hi Jonathan,
>>
>> you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
>> (not yet available) and https://github.com/tresata/spark-sorted (not part
>> of Spark, but it is available right now). Hopefully that's what you are
>> looking for. To the best of my knowledge that covers what is available now
>> / what is being worked on.
>>
>> Imran
>>
>> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney 
>> wrote:
>>
>>> Hello all,
>>>
>>> I am wondering if spark already has support for optimizations on sorted
>>> data and/or if such support could be added (I am comfortable dropping to a
>>> lower level if necessary to implement this, but I'm not sure if it is
>>> possible at all).
>>>
>>> Context: we have a number of data sets which are essentially already
>>> sorted on a key. With our current systems, we can take advantage of this to
>>> do a lot of analysis in a very efficient fashion...merges and joins, for
>>> example, can be done very efficiently, as can folds on a secondary key and
>>> so on.
>>>
>>> I was wondering if spark would be a fit for implementing these sorts of
>>> optimizations? Obviously it is sort of a niche case, but would this be
>>> achievable? Any pointers on where I should look?
>>>
>>
>>
>


Re: can spark take advantage of ordered data?

2017-03-10 Thread sourabh chaki
My use case is also quite similar. I have 2 feeds, one 3TB and another
100GB. Both feeds are generated by a Hadoop reduce operation and
partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions
whereas the 100GB feed has 200 partitions.

Now when I do a join between these two feeds using Spark, Spark shuffles
both RDDs and it takes a long time to complete. Can we do something so
that Spark can recognise the existing partitions of the 3TB feed and
shuffle only the 100GB feed?
Could it be a map-side scan for the bigger RDD and a shuffle read from the
smaller RDD?
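
Core Spark already behaves that way once the bigger RDD carries a known
partitioner: a join only shuffles the side whose partitioner does not match.
A rough, untested sketch of that pattern is below; as things stand it still
costs one up-front shuffle of the big feed via partitionBy, unless Spark can
be told to trust the layout the Hadoop job already produced. The names and
types are placeholders.

  import org.apache.spark.HashPartitioner
  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  // bigFeed / smallFeed stand in for the 3TB and 100GB pair RDDs.
  def joinWithOneSidedShuffle(
      bigFeed: RDD[(String, String)],
      smallFeed: RDD[(String, String)]): RDD[(String, (String, String))] = {
    // Give the big feed an explicit partitioner and keep it materialized.
    val partitioned = bigFeed
      .partitionBy(new HashPartitioner(10000))
      .persist(StorageLevel.DISK_ONLY)

    // Because `partitioned` already has a matching partitioner, the join
    // plans a narrow dependency on it and shuffles only the small feed.
    partitioned.join(smallFeed)
  }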

I have looked at the spark-sorted project, but that project does not utilise
the pre-existing partitions in the feed.
Any pointers would be helpful.

Thanks
Sourabh

On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid  wrote:

> Hi Jonathan,
>
> you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
> (not yet available) and https://github.com/tresata/spark-sorted (not part
> of Spark, but it is available right now). Hopefully that's what you are
> looking for. To the best of my knowledge that covers what is available now
> / what is being worked on.
>
> Imran
>
> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney 
> wrote:
>
>> Hello all,
>>
>> I am wondering if spark already has support for optimizations on sorted
>> data and/or if such support could be added (I am comfortable dropping to a
>> lower level if necessary to implement this, but I'm not sure if it is
>> possible at all).
>>
>> Context: we have a number of data sets which are essentially already
>> sorted on a key. With our current systems, we can take advantage of this to
>> do a lot of analysis in a very efficient fashion...merges and joins, for
>> example, can be done very efficiently, as can folds on a secondary key and
>> so on.
>>
>> I was wondering if spark would be a fit for implementing these sorts of
>> optimizations? Obviously it is sort of a niche case, but would this be
>> achievable? Any pointers on where I should look?
>>
>
>


RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner?
At least for a join, you can implement your own partitioner to utilize the
sorted data.
Just my 2 cents.
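
A rough, untested sketch of that idea, assuming string keys and that the split
points of the already-sorted data (say, the first key of each sorted part file)
have been collected beforehand; nothing below beyond Partitioner itself is an
existing Spark API, and reusing the same instance when partitioning both sides
of the join is what lets Spark co-partition them:

  import org.apache.spark.Partitioner

  // Routes keys by known split points of already-sorted data, so matching key
  // ranges from both inputs land in the same task.
  class KnownRangePartitioner(splitPoints: Array[String]) extends Partitioner {
    override def numPartitions: Int = splitPoints.length + 1

    override def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[String]
      // Index of the first split point greater than the key (binary search).
      var lo = 0
      var hi = splitPoints.length
      while (lo < hi) {
        val mid = (lo + hi) >>> 1
        if (splitPoints(mid) <= k) lo = mid + 1 else hi = mid
      }
      lo
    }
  }
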
Date: Wed, 11 Mar 2015 17:38:04 -0400
Subject: can spark take advantage of ordered data?
From: jcove...@gmail.com
To: User@spark.apache.org

Hello all,

I am wondering if spark already has support for optimizations on sorted data
and/or if such support could be added (I am comfortable dropping to a lower
level if necessary to implement this, but I'm not sure if it is possible at
all).

Context: we have a number of data sets which are essentially already sorted on
a key. With our current systems, we can take advantage of this to do a lot of
analysis in a very efficient fashion...merges and joins, for example, can be
done very efficiently, as can folds on a secondary key and so on.

I was wondering if spark would be a fit for implementing these sorts of
optimizations? Obviously it is sort of a niche case, but would this be
achievable? Any pointers on where I should look?

Re: can spark take advantage of ordered data?

2015-03-11 Thread Imran Rashid
Hi Jonathan,

you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
(not yet available) and https://github.com/tresata/spark-sorted (not part
of Spark, but it is available right now). Hopefully that's what you are
looking for. To the best of my knowledge that covers what is available now
/ what is being worked on.

Imran
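
Since SPARK-3655 was still open at the time, the closest thing in core Spark
is the classic composite-key trick with repartitionAndSortWithinPartitions:
partition on the primary key only, sort within partitions on (primary,
secondary), then fold per group with mapPartitions. A hedged sketch with
illustrative names and types follows; it approximates, rather than reproduces,
what spark-sorted provides.

  import org.apache.spark.Partitioner
  import org.apache.spark.rdd.RDD

  // Partition on the primary key only, so all secondary keys of a given
  // primary key end up in the same partition.
  class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = {
      val (primary, _) = key.asInstanceOf[(String, Long)]
      (primary.hashCode & Integer.MAX_VALUE) % numPartitions
    }
  }

  // Each partition comes back sorted by (primary, secondary), so a follow-up
  // mapPartitions can stream over each key's values in secondary-key order
  // without materializing a whole group in memory.
  def sortWithinKeys(events: RDD[((String, Long), Double)]): RDD[((String, Long), Double)] =
    events.repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(200))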

On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney jcove...@gmail.com
wrote:

 Hello all,

 I am wondering if spark already has support for optimizations on sorted
 data and/or if such support could be added (I am comfortable dropping to a
 lower level if necessary to implement this, but I'm not sure if it is
 possible at all).

 Context: we have a number of data sets which are essentially already
 sorted on a key. With our current systems, we can take advantage of this to
 do a lot of analysis in a very efficient fashion...merges and joins, for
 example, can be done very efficiently, as can folds on a secondary key and
 so on.

 I was wondering if spark would be a fit for implementing these sorts of
 optimizations? Obviously it is sort of a niche case, but would this be
 achievable? Any pointers on where I should look?