Hi Constantin,
thank you for your reply. I do know how Spark works in general and how
it stores data in memory; I've been working with it for almost 2 years,
so I'm definitely not collecting the whole RDD in memory ;)
Our "maintenance use case" is the following:
Copying the whole content of a table to another table while applying a
simple transformation (e.g. aggregating some columns). We tried with an
UPSERT SELECT, but we ran into some memory issues on the Phoenix side.
Do you have any suggestion for performing something like that?
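In code, the idea is roughly the following (a minimal sketch using the
phoenix-spark RDD API; the table names, columns, aggregation, and
ZooKeeper URL are placeholders for our actual setup):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.phoenix.spark._

val sc = new SparkContext(new SparkConf().setAppName("phoenix-copy"))

// Read the source table as an RDD of column-name -> value maps.
val rows = sc.phoenixTableAsRDD(
  "SOURCE_TABLE",
  Seq("ID", "COL1"),
  zkUrl = Some("zk-host:2181"))

// Example transformation: aggregate COL1 per ID
// (assumes COL1 is a BIGINT, i.e. a java.lang.Long).
val aggregated = rows
  .map(row => (row("ID"), row("COL1").asInstanceOf[Long]))
  .reduceByKey(_ + _)

// Write the result back through Phoenix upserts.
aggregated.saveToPhoenix(
  "TARGET_TABLE",
  Seq("ID", "SUM_COL1"),
  zkUrl = Some("zk-host:2181"))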
Thank you in advance
#A.M.
On 10/14/2016 08:10 AM, Ciureanu Constantin wrote:
Hi Antonio,
Reading the whole table is not a good use case for Phoenix / HBase or
any DB.
You should never store the whole content read from the DB / disk in
memory; that's definitely wrong. Spark doesn't do that by itself, no
matter what you were told it does to be faster. Review your algorithm
and see what can be improved. After all, I hope you are just using
collect(), so the OOM is on the driver (that's easier to fix :p, by not
using it).
Back to the OOM: after reading an RDD you can easily shuffle /
repartition it into any number of partitions (but that sends data
through the network, so it's expensive):
repartition(numPartitions)
http://spark.apache.org/docs/latest/programming-guide.html
I recommend reading this plus a few articles on Spark best practices.
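For example, something along these lines (just a sketch, given an
existing SparkContext sc; the partition count of 200 is an illustrative
value you would tune to your data size and executor memory):

import org.apache.phoenix.spark._

// Read the table, then immediately spread the rows across more
// (and therefore smaller) partitions before any heavy processing.
val rows = sc
  .phoenixTableAsRDD("MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))
  .repartition(200) // full shuffle: expensive, but it bounds per-partition size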
Kind regards,
Constantin
On Thu, Oct 13, 2016 at 18:16, Antonio Murgia <antonio.mur...@eng.it
<mailto:antonio.mur...@eng.it>> wrote:
Hello everyone,
I'm trying to read data from a Phoenix table using Apache Spark. I
actually use the suggested method, sc.phoenixTableAsRDD, without
issuing any query (i.e. reading the whole table), and I noticed that
the number of partitions that Spark creates is equal to the number of
regionServers. Is there a way to use a custom number of partitions?
The problem we actually face is that if a region is bigger than the
available memory of the Spark executor, it goes OOM. Being able to tune
the number of partitions, we could use a higher one, reducing the
memory footprint of the processing (and also slowing it down, I
know :( ).
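For reference, this is roughly how we read the table (a sketch; the
table name, columns, and ZooKeeper URL are placeholders):

import org.apache.phoenix.spark._

// Full-table read, no predicate. The resulting RDD comes back with one
// partition per input split; in our case that matched the number of
// region servers.
val rows = sc.phoenixTableAsRDD(
  "MY_TABLE",
  Seq("ID", "COL1", "COL2"),
  zkUrl = Some("zk-host:2181"))

println(rows.partitions.length) // == number of regionServers in our cluster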
Thank you in advance
#A.M.