1. I'd also consider how you're structuring the data before applying the
join. Doing the join naively could be expensive, so a bit of data
preparation (for example, repartitioning or bucketing on the join key) may
be necessary to improve join performance - there's a rough sketch of that
below. Try to get a baseline as well. Arrow would help improve it.

2. Try storing it back as Parquet, but written so that the next application
can take advantage of predicate pushdown - see the second sketch below.
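
For the first point, here's a rough PySpark sketch of what I mean by
preparing the data before the join. The paths, the "id" join column and the
partition counts are placeholders for illustration, not anything from your
setup:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-prep").getOrCreate()

    # Placeholder inputs - substitute your own tables.
    left = spark.read.parquet("hdfs:///data/left")
    right = spark.read.parquet("hdfs:///data/right")

    # Baseline: plain join, Spark shuffles both sides on the fly.
    baseline = left.join(right, on="id")

    # Preparation: repartition both sides on the join key so rows with the
    # same key land in the same partitions before the join runs.
    prepared = (
        left.repartition(2000, "id")
            .join(right.repartition(2000, "id"), on="id")
    )

    # Or bucket the tables once; later joins on "id" can then skip the shuffle.
    left.write.bucketBy(2000, "id").sortBy("id") \
        .mode("overwrite").saveAsTable("left_bucketed")
    right.write.bucketBy(2000, "id").sortBy("id") \
        .mode("overwrite").saveAsTable("right_bucketed")

Comparing the baseline plan with the prepared one (via explain()) is an easy
way to see whether the preparation actually pays off at your scale.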
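
And for the second point, a minimal sketch of writing the result back as
Parquet so the next application benefits from partition pruning and predicate
pushdown. Again, the "event_date" partition column, the "id" filter and the
HDFS paths are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-pushdown").getOrCreate()

    # Placeholder for the joined output from the previous step.
    result = spark.read.parquet("hdfs:///data/joined_raw")

    # Writer side: partition on a column the downstream job filters by.
    (result.write.mode("overwrite")
           .partitionBy("event_date")
           .parquet("hdfs:///data/joined"))

    # Reader side (the next application, Java or Python): filters on the
    # partition column and on columns covered by Parquet row-group statistics
    # are pushed down to the scan instead of being applied after a full read.
    joined = spark.read.parquet("hdfs:///data/joined")
    subset = joined.filter("event_date = '2020-02-17' AND id > 1000")
    subset.explain()  # look for PartitionFilters / PushedFilters in the plan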



On Mon, 17 Feb 2020, 6:41 pm Subash Prabakar, <subashpraba...@gmail.com>
wrote:

> Hi Team,
>
> I have two questions regarding Arrow and Spark integration,
>
> 1. I am joining two huge tables (1 PB each) - will there be a big
> performance gain if I use the Arrow format before shuffling? Will the
> serialization/deserialization cost improve significantly?
>
> 2. Can we store the final data in Arrow format on HDFS and read it back
> in another Spark application? If so, how could I do that?
> Note: The dataset is transient - the separation of responsibility is for
> easier management. Though Spark gives us resiliency within a single job,
> we use different languages (in our case Java and Python).
>
> Thanks,
> Subash
>
>
