Re: IndexedRDD

Jem Tucker Tue, 13 Jan 2015 10:17:45 -0800

Hi,

Thanks for the replies. I guess I was hoping for a bit better than linear
scaling. The test was performing IndexedRDD.join(RDD)((id, a, b) => (a, b)).
In each join every row in the smaller table is joined to one in the lookup. I
ran the same test with standard RDD joins and there was barely any time
increase at all until the small table was within 1 order of magnitude of
the larger. I agree though, the performance is not bad at all! The same
join with normal RDDs takes an order of magnitude longer, I found. I can
share the results tomorrow.
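
To make the expectation concrete, here is a minimal sketch in plain Scala
(no Spark) of the kind of lookup join that would scale with the small table
rather than the big one. It is only an illustration of the cost model being
hoped for, not how IndexedRDD is actually implemented; the names
LookupJoinSketch and lookupJoin are hypothetical, and the in-memory Map
stands in for whatever index is really used.

```scala
// Hypothetical sketch: a lookup join against an in-memory hash index.
// This is NOT the IndexedRDD implementation; all names are illustrative.
object LookupJoinSketch {
  // Joins each row of `small` against `index`, combining matches with `f`.
  // One hash lookup is done per row of `small`, so the work grows with
  // small.size and not with index.size. Keys with no match are dropped.
  def lookupJoin[A, B, C](index: Map[Long, A], small: Seq[(Long, B)])(
      f: (Long, A, B) => C): Seq[(Long, C)] =
    small.flatMap { case (id, b) =>
      index.get(id).map(a => (id, f(id, a, b)))
    }

  def main(args: Array[String]): Unit = {
    // Stand-in for the big lookup table: Long key, String value.
    val big = (0L until 1000L).map(id => id -> s"big-$id").toMap
    // Stand-in for the small table; key 5000 has no match in `big`.
    val small = Seq(3L -> "x", 42L -> "y", 5000L -> "z")
    // Same combine function shape as in the join call above.
    val joined = lookupJoin(big, small)((id, a, b) => (a, b))
    joined.foreach(println)
  }
}
```

Under this model the join time would stay roughly flat as the big table
grows, which is why near-linear growth with the big table's size was a
surprise.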


I am unsure exactly how IndexedRDDs are indexed and whether they can be
sorted by key, I'm afraid. I would be interested to know. My plan is to use
one as a historical data table which can be updated by batches in Spark
Streaming. Has anyone got experience of trying to implement anything
similar to this?

Kindest Regards,

Jem


On 13 January 2015 at 17:05, Jerry Lam <chiling...@gmail.com> wrote:

> Hi guys,
>
> I'm interested in the IndexedRDD too.
> How many rows in the big table match the small table in every run?
> If the number of rows stays constant, then I think Jem wants the runtime to
> stay about constant (i.e. ~ 0.6 second for all cases). However, I agree
> with Andrew. The performance wasn't that bad at all. If it were not indexed,
> I would expect it to take much longer.
>
> Can IndexedRDD be sorted by keys as well?
>
> Best Regards,
>
> Jerry
>
> On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi Jem,
>>
>> Linear time in scaling on the big table doesn't seem that surprising to
>> me.  What were you expecting?
>>
>> I assume you're doing normalRDD.join(indexedRDD).  If you were to replace
>> the indexedRDD with a normal RDD, what times do you get?
>>
>> On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker <jem.tuc...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have been playing around with the indexedRDD (
>>> https://issues.apache.org/jira/browse/SPARK-2365,
>>> https://github.com/amplab/spark-indexedrdd) and have been very
>>> impressed with its performance. Some performance testing has revealed worse
>>> than expected scaling of the join performance*, and I was just wondering if
>>> anyone else has any experience using it and what they have found?
>>>
>>> Thanks,
>>>
>>> Jem
>>>
>>> *The table below shows some of my results when joining a small RDD to a
>>> large IndexedRDD. Each table consisted of a Long key and a 15-character
>>> String value. It shows an almost linear time increase with the number of
>>> rows in the bigger table.
>>>
>>> Small Table Rows | Big Table Rows | Time (s)
>>> -----------------+----------------+---------
>>> 50000            | 10000000       | 0.6
>>> 50000            | 50000000       | 0.8
>>> 50000            | 100000000      | 1.5
>>> 50000            | 150000000      | 2.1
>>> 50000            | 200000000      | 2.8
>>> 50000            | 500000000      | 7.2
>>> 50000            | 1000000000     | 12.2
>>>
>>
>>
>
