Reading the whole table is not a good use-case for Phoenix / HBase or any
You should never store the whole content read from a DB / disk in
memory; that's definitely wrong.
Spark doesn't do that by itself, no matter what "they" told you it does
in order to be faster, bla bla. Review your algorithm and see what can be
improved. After all, I hope you are just using collect(), so that the OOM
is on the driver (that's easier to fix :p, by not using it).
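
To illustrate the collect() point, here is a rough sketch, not your actual
job: the RDD below is a hypothetical stand-in built with sc.parallelize
instead of a real Phoenix read, and the object name is made up. It only
shows the idea of bringing a small result back to the driver instead of
every row.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AvoidCollectSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption, only so the sketch runs standalone.
    val sc = new SparkContext(
      new SparkConf().setAppName("avoid-collect-sketch").setMaster("local[2]"))

    // Hypothetical stand-in for the rows read from the table.
    val rows = sc.parallelize(1 to 1000000)

    // rows.collect() would copy every row into the driver's heap.
    // Aggregating on the executors sends back a single value instead:
    val total: Long = rows.map(_.toLong).reduce(_ + _)

    // If you only need to look at a few rows, take() fetches just those.
    val sample: Array[Int] = rows.take(5)

    println(s"total = $total, sample size = ${sample.length}")
    sc.stop()
  }
}
```

The same applies to writes: save from the executors (for example with
saveAsTextFile) rather than collecting and writing from the driver.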
Back to the OOM: after reading an RDD you can shuffle / repartition it
yourself into any number of partitions easily (but that sends data over
the network, so it's expensive).
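
A minimal sketch of that repartitioning, under some assumptions: the input
RDD is again a hypothetical stand-in created with sc.parallelize (4
partitions, playing the role of the RDD returned by sc.phoenixTableAsRDD),
and the target of 32 partitions is an arbitrary number you would tune
yourself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption, only so the sketch runs standalone.
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-sketch").setMaster("local[2]"))

    // Hypothetical stand-in for the table RDD: 4 coarse partitions.
    val coarse = sc.parallelize(1 to 1000, numSlices = 4)

    // repartition() does a full shuffle and spreads the rows over
    // 32 smaller partitions, so each task handles less data.
    val finer = coarse.repartition(32)

    // Count rows per partition on the executors; only the 32 small
    // counts come back to the driver, not the rows themselves.
    val sizes = finer.mapPartitions(it => Iterator(it.size)).collect()

    println(s"before: ${coarse.getNumPartitions} partitions, " +
      s"after: ${finer.getNumPartitions} partitions, " +
      s"largest partition: ${sizes.max} rows")
    sc.stop()
  }
}
```

Keep in mind that the shuffle happens after the read, so the initial read
tasks still see one partition each as created by the connector.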
I recommend reading this, plus a few articles on Spark best practices.
On Thu, 13 Oct 2016, 18:16 Antonio Murgia, <antonio.mur...@eng.it> wrote:
> Hello everyone,
> I'm trying to read data from a Phoenix table using Apache Spark. I
> actually use the suggested method: sc.phoenixTableAsRDD without issuing
> any query (i.e. reading the whole table) and I noticed that the number
> of partitions that Spark creates is equal to the number of
> regionServers. Is there a way to use a custom number of regions?
> The problem we actually face is that if a region is bigger than the
> available memory of the Spark executor, it goes OOM. Being able to
> tune the number of regions, we might use a higher number of partitions,
> reducing the memory footprint of the processing (and also slowing it
> down, I know :( ).
> Thank you in advance