I know the Spark doc is really comprehensive; I have read it many times
over the last 2 years, and I know how to check how Spark uses its memory
and how to tweak it (e.g. giving more or less memory to caching). I'll
try telling Spark not to reserve any memory for caching the RDD, since
I'm not caching at all.
Please don't reply with general Spark knowledge, because I already know
fairly well how Spark works.
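Concretely, what I'll try is something along these lines (just a sketch,
assuming the unified memory manager introduced in Spark 1.6; the property
names differ on older releases):

    import org.apache.spark.{SparkConf, SparkContext}

    // We never call cache()/persist(), so don't protect any of the unified
    // region for storage; execution can then use (almost) all of it.
    // spark.memory.storageFraction only sets the portion immune to eviction,
    // so this is a hint rather than a hard split.
    val conf = new SparkConf()
      .setAppName("phoenix-maintenance-job")       // app name is a placeholder
      .set("spark.memory.storageFraction", "0.0")
    val sc = new SparkContext(conf)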
Thank you in advance.
#A.M.
On 10/14/2016 09:54 AM, Mich Talebzadeh wrote:
"I do know how Spark in general works, and how it stores data in
memory etc. It's been almost 2 years that I work on it. So I'm
definetely not collecting the whole rdd in memory ;)"
The Spark doc is a good start.
To see how Spark memory is utilised, look at the Spark UI (on <HOST>:4040
by default) under the Storage tab. It will tell you what is stored.
Spark uses execution memory for the result sets of operations (on RDDs and
DataFrames) and storage memory for anything cached with cache() or
persist(). You can verify all of this in the Spark UI.
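For instance (a rough sketch; the input path is made up):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///some/input")   // hypothetical path
    lines.persist(StorageLevel.MEMORY_ONLY)         // uses storage memory
    lines.count()                                   // materialises the cache, which now
                                                    // shows up under the Storage tab
    // A shuffle like this uses execution memory instead and never appears there:
    lines.map(l => (l.length, 1)).reduceByKey(_ + _).count()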
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
On 14 October 2016 at 08:37, Antonio Murgia <antonio.mur...@eng.it> wrote:
Hi Constantin,
thank you for your reply. I do know how Spark works in general, and
how it stores data in memory etc. I've been working on it for almost
2 years, so I'm definitely not collecting the whole RDD in memory ;)
Our "mantainance use case" is the following:
Copying the whole content of a table to another table applying a
simple transformation (e.g. aggregating some columns). We tried
with an Upsert from select, but we ran into some memory issue from
the phoenix side.
Do you have any suggestion to perform something like that?
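For clarity, the Spark job we run for that copy looks roughly like this
(a sketch with made-up table names, columns and zkUrl, using the
phoenix-spark RDD API):

    import org.apache.phoenix.spark._

    // Full scan of the source table, one simple transformation, write to the target.
    val rows = sc.phoenixTableAsRDD(
      "SOURCE_TABLE",
      Seq("ID", "COL_A", "COL_B"),
      zkUrl = Some("zk-host:2181"))

    val transformed = rows.map { r =>
      // values come back as AnyRef; actual types depend on the Phoenix schema
      val id  = r("ID")
      val sum = r("COL_A").asInstanceOf[Long] + r("COL_B").asInstanceOf[Long]
      (id, sum)                                   // e.g. aggregate two columns into one
    }

    transformed.saveToPhoenix(
      "TARGET_TABLE",
      Seq("ID", "COL_SUM"),
      zkUrl = Some("zk-host:2181"))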
Thank you in advance
#A.M.
On 10/14/2016 08:10 AM, Ciureanu Constantin wrote:
Hi Antonio,
Reading the whole table is not a good use case for Phoenix / HBase or
any DB.
You should never ever store the whole content read from the DB / disk
into memory; that's definitely wrong.
Spark doesn't do that by itself, no matter what "they" told you it would
do in order to be faster, bla bla. Review your algorithm and see what can
be improved. After all, I hope you're just using collect(), so the OOM is
on the driver (that's easier to fix :p, by not using it).
Back to the OOM: after reading an RDD you can easily shuffle /
repartition it into any number of partitions (but that sends data through
the network, so it's expensive):
repartition(numPartitions)
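For example (a sketch; 200 is an arbitrary target):

    // Spread the same data over more, smaller partitions before the heavy
    // processing; repartition() triggers a full shuffle, so the data crosses
    // the network once.
    val rdd = sc.parallelize(1 to 1000000, 4)   // 4 initial partitions
    val spread = rdd.repartition(200)
    println(spread.getNumPartitions)            // 200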
http://spark.apache.org/docs/latest/programming-guide.html
I recommend reading this plus a few articles on Spark best practices.
Kind regards,
Constantin
On Thu, 13 Oct 2016, 18:16 Antonio Murgia <antonio.mur...@eng.it> wrote:
Hello everyone,
I'm trying to read data from a Phoenix table using Apache Spark. I
actually use the suggested method, sc.phoenixTableAsRDD, without issuing
any query (i.e. reading the whole table), and I noticed that the number
of partitions that Spark creates is equal to the number of region
servers. Is there a way to use a custom number of regions?
The problem we actually face is that if a region is bigger than the
available memory of the Spark executor, it goes OOM. If we were able to
tune the number of regions, we could use a higher number of partitions,
reducing the memory footprint of the processing (and also slowing it
down, I know :( ).
Thank you in advance
#A.M.
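For reference, the read described above is essentially this (a sketch;
table name, columns and zkUrl are made up):

    import org.apache.phoenix.spark._

    // No predicate, so the whole table is scanned; we currently get back as
    // many partitions as there are region servers, and a single big one can
    // exceed the executor's memory.
    val rdd = sc.phoenixTableAsRDD(
      "MY_TABLE",
      Seq("ID", "COL1", "COL2"),
      predicate = None,
      zkUrl = Some("zk-host:2181"))

    println(rdd.getNumPartitions)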