Hi all,

I am in the process of learning big data.
Right now, I am pulling huge databases into Spark through JDBC (a 250-million-row 
table can take around 3 hours), and then re-saving the data as JSON, which is 
fast, simple, distributed, fail-safe and preserves data types, although without 
any compression.
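
For context, the pull-and-stage step looks roughly like this in spark-shell 
(just a sketch; the JDBC URL, table, partition column and output path below 
are made up, and the JDBC driver jar is assumed to be on the classpath):

    // Parallel JDBC read: split the table on a numeric column so the
    // pull is not a single-threaded scan.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // made-up URL
      .option("dbtable", "big_table")                        // made-up table
      .option("partitionColumn", "id")                       // made-up numeric column
      .option("lowerBound", "1")
      .option("upperBound", "250000000")
      .option("numPartitions", "64")
      .load()

    // Re-save as line-delimited JSON files in HDFS.
    df.write.json("hdfs:///staging/big_table_json")          // made-up path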

Reading that amount of data back from distributed JSON takes around 2-3 minutes, 
which works well enough for me. Still, do you suggest or prefer any other format 
for intermediate storage, one that reads quickly and keeps the proper types?
Not only as a staging area between the network database and Spark, but also for 
intermediate dataframe transformations, so the data is ready for processing.
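
To be concrete, the round trip I mean is basically this (same made-up path as 
above; "amount" is a made-up column):

    // Read the staged JSON back; Spark infers the schema from the values,
    // so columns come back typed without me declaring anything.
    val staged = sqlContext.read.json("hdfs:///staging/big_table_json")

    // An intermediate transformation, written back out so later jobs
    // can start from data that is already prepared.
    val prepared = staged.filter(staged("amount") > 0)
    prepared.write.json("hdfs:///staging/big_table_prepared")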

I have tried CSV, but its type inference does not usually fit my needs and takes 
a long time. I haven't tried Parquet since they fixed it in 1.5, but that is 
another option (rough sketch of what I imagine below).
What do you think of HBase, Hive or any other alternative?
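
For Parquet specifically, I guess the swap would be as simple as this (reusing 
the df from the first snippet; paths are still made up):

    // Columnar, compressed staging; the schema travels with the files,
    // so there is no inference step on read.
    df.write.parquet("hdfs:///staging/big_table_parquet")

    val back = sqlContext.read.parquet("hdfs:///staging/big_table_parquet")
    back.printSchema()   // column types come back exactly as written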

Looking for insights!
Saif
