Hi Ted,

So unfortunately, after looking into the cluster manager that I will be
using for my testing (I'm using a super-computer called XSEDE rather than
AWS), it looks like the cluster does not actually come with HBase
installed (this cluster is becoming somewhat problematic, as it is
essentially AWS except that you have to write your own virtualization
scripts). Do you have any other thoughts on how I could go about dealing
with this purely using Spark and HDFS?
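
For reference, the closest thing I have sketched out so far is to
co-partition the two RDDs with the same partitioner and then walk the
partitions together, roughly like this (untested; the partition count is
arbitrary, and vectors/invertedIndexes/SparseVector/InvIndex are the names
from my original mail below):

import org.apache.spark.HashPartitioner

// Put matching keys into matching partitions of both RDDs, then combine
// the partitions locally so the big index never has to be broadcast.
val part = new HashPartitioner(16)
val vectorsPart = vectors.partitionBy(part)
val indexesPart = invertedIndexes.partitionBy(part)

val similarities = vectorsPart.zipPartitions(indexesPart) { (vecIter, idxIter) =>
  val indexByKey = idxIter.toMap   // one inverted index per key in this partition
  vecIter.map { case (key, vec) => (key, indexByKey(key).calculateSimilarity(vec)) }
}

The idea is that after the partitionBy, each inverted index only travels to
the one partition that needs it. Not sure whether that is a reasonable
direction, so any pointers would help.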

Thank you

On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> Thank you Ted! That sounds like it would probably be the most efficient
> (with the least overhead) way of handling this situation.
>
> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Another approach is to store the objects in a NoSQL store such as HBase.
>>
>> Looking up an object should be very fast.
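>>
>> For concreteness, the lookup inside mapPartitions could look roughly like
>> this (an illustrative sketch only; the table/column names and the
>> deserialize helper are made up, and a real version would close the
>> connection once the partition's iterator is exhausted):
>>
>> import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
>> import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
>> import org.apache.hadoop.hbase.util.Bytes
>>
>> vectors.mapPartitions { iter =>
>>   // One connection per partition, not per record.
>>   val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
>>   val table = conn.getTable(TableName.valueOf("inv_index"))
>>   iter.map { case (key, vec) =>
>>     val result   = table.get(new Get(Bytes.toBytes(key)))
>>     val invIndex = deserialize(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("idx")))
>>     (key, invIndex.calculateSimilarity(vec))
>>   }
>> }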
>>
>> Cheers
>>
>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> I'm looking for a way to send structures to pre-determined partitions so
>>> that they can be used from within a mapPartitions over another RDD.
>>>
>>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>>> indexes. The inverted index objects are quite large.
>>>
>>> My hope is to do a mapPartitions over the RDD of vectors where I can
>>> compare each vector to the inverted index. The issue is that I only NEED
>>> one inverted index object per partition (the one whose key matches the
>>> values within that partition).
>>>
>>>
>>> val vectors: RDD[(Int, SparseVector)]
>>>
>>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>>   a.reduceByKey(generateInvertedIndex)
>>>
>>> vectors.mapPartitions { iter =>
>>>   // pseudocode: get the single inverted index whose key matches
>>>   // the keys of the vectors in this partition
>>>   val invIndex = invertedIndexes(samePartitionKey)
>>>   iter.map { case (_, vec) => invIndex.calculateSimilarity(vec) }
>>> }
>>>
>>> How could I go about setting up the partitions such that the specific data
>>> structure I need is present for the mapPartitions, without the extra
>>> overhead of sending every inverted index to every node (which is what
>>> would happen if I made it a broadcast variable)?
>>>
>>> One thought I have been having is to store the objects in HDFS, but I'm
>>> not sure whether that would be suboptimal (it seems like it could slow
>>> down the process a lot).
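>>>
>>> Concretely, I imagine something along these lines (a rough, untested
>>> sketch; the per-key HDFS layout and the deserializeInvIndex helper are
>>> made up, and it assumes each partition only holds values for one key):
>>>
>>> import org.apache.hadoop.conf.Configuration
>>> import org.apache.hadoop.fs.{FileSystem, Path}
>>>
>>> vectors.mapPartitions { iter =>
>>>   val fs = FileSystem.get(new Configuration())  // assumes HDFS is the default FS
>>>   var cached: Option[(Int, InvIndex)] = None    // load each index at most once
>>>   iter.map { case (key, vec) =>
>>>     val invIndex = cached match {
>>>       case Some((k, idx)) if k == key => idx
>>>       case _ =>
>>>         val in  = fs.open(new Path(s"/indexes/$key"))  // made-up layout
>>>         val idx = deserializeInvIndex(in)              // hypothetical helper
>>>         in.close()
>>>         cached = Some((key, idx))
>>>         idx
>>>     }
>>>     (key, invIndex.calculateSimilarity(vec))
>>>   }
>>> }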
>>>
>>> Another thought I am currently exploring is whether there is some way I
>>> can create a custom Partition or Partitioner that could hold the data
>>> structure (although that might get too complicated and become
>>> problematic).
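>>>
>>> For what it's worth, the Partitioner half of that idea would look roughly
>>> like this (an untested sketch; the modulo key-to-partition mapping is just
>>> illustrative, and it only routes keys, it doesn't hold the index itself):
>>>
>>> import org.apache.spark.Partitioner
>>>
>>> // Pins each key to a fixed partition so the vectors and the inverted
>>> // index that share a key end up in the same place.
>>> class KeyPinnedPartitioner(override val numPartitions: Int) extends Partitioner {
>>>   override def getPartition(key: Any): Int = key match {
>>>     case k: Int => ((k % numPartitions) + numPartitions) % numPartitions
>>>     case _      => 0
>>>   }
>>> }
>>>
>>> // e.g. partition both RDDs the same way so matching keys are co-located:
>>> // val p = new KeyPinnedPartitioner(16)
>>> // vectors.partitionBy(p); invertedIndexes.partitionBy(p)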
>>>
>>> Any thoughts on how I could attack this issue would be highly
>>> appreciated.
>>>
>>> Thank you for your help!
>>>
>>>
>>>
>>>
>>
