Hi all, I am in the process of learning big data. Right now, I am pulling huge database tables into Spark through JDBC (a 250-million-row table can take around 3 hours) and then re-saving the data as JSON, which is fast, simple, distributed, fail-safe and preserves data types, although without any compression.
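For reference, a stripped-down sketch of what the pipeline looks like. The connection URL, table name, paths and partitioning options are placeholders, and the partition settings assume a numeric key column (without them the JDBC read runs in a single partition):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

// Pull the table over JDBC -- this is the ~3 hour step
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder URL
  .option("dbtable", "schema.big_table")                // placeholder table
  .option("partitionColumn", "id")                      // assumes a numeric key column
  .option("lowerBound", "1")
  .option("upperBound", "250000000")
  .option("numPartitions", "64")
  .load()

// Re-save as distributed JSON for later runs (keeps types, no compression)
df.write.json("hdfs:///data/big_table_json")

// Later: reading the JSON copy back takes around 2-3 minutes
val fromJson = sqlContext.read.json("hdfs:///data/big_table_json")

// The Parquet variant I have not tried yet would simply be:
// df.write.parquet("hdfs:///data/big_table_parquet")
// val fromParquet = sqlContext.read.parquet("hdfs:///data/big_table_parquet")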
Reading back from the distributed JSON takes around 2-3 minutes for this amount of data, which works well enough for me. But do you suggest or prefer any other format for intermediate storage, one that reads fast and preserves types properly? Not only as an intermediate step between the network database and Spark, but also for persisting intermediate DataFrame transformations so the data is ready for processing. I have tried CSV, but its type inference does not usually fit my needs and takes a long time. I haven't tried Parquet since it was fixed in 1.5, but that is another option. What do you think of HBase, Hive or any other store? Looking for insights!

Saif