One option would be to put the CSV on HDFS and load it from there, and to use
JDBC to load the tables from Postgres.
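
A rough sketch of what I mean, in Python. The column layout (user_id, url,
timestamp), the function names, and the path are all made up for illustration;
`sc` is assumed to be an existing SparkContext. Reading a file from HDFS with
`textFile` gives you an RDD that is already split into partitions, so the load
is parallel without splitting the CSV by hand, and `cache()` keeps the parsed
rows in memory for later interactive queries. The last helper only illustrates
the idea behind a range-partitioned JDBC read (the Scala API's JdbcRDD takes a
bounded query of this kind, as far as I know): each partition pulls its own
slice of the primary-key range.

```python
import csv


def parse_click_line(line):
    """Parse one CSV line into a (user_id, url, timestamp) tuple.

    The three-column layout is a hypothetical example, not the real schema.
    """
    user_id, url, timestamp = next(csv.reader([line]))
    return (int(user_id), url, timestamp)


def load_clicks(sc, path):
    """Load the daily click CSV as a cached RDD of parsed tuples.

    `sc` is an existing SparkContext; `path` would be something like
    "hdfs:///data/clicks/2013-10-18.csv" (hypothetical location).
    """
    return sc.textFile(path).map(parse_click_line).cache()


def split_id_range(lower, upper, num_partitions):
    """Split the inclusive id range [lower, upper] into contiguous,
    inclusive (start, end) sub-ranges, one per partition.

    Empty sub-ranges are dropped when there are more partitions than ids.
    """
    total = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * total) // num_partitions
        end = lower + ((i + 1) * total) // num_partitions - 1
        if start <= end:
            bounds.append((start, end))
    return bounds
```

Each (start, end) pair would then be bound into a query such as
"SELECT ... WHERE id >= ? AND id <= ?" and run by a separate task against
Postgres.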


Regards,
Vinay

> On Oct 18, 2013, at 1:24 AM, Victor Hooi <[email protected]> wrote:
> 
> Hi,
> 
> NB: I originally posted this to the Google Group, before I saw the message 
> about how we're moving to the Apache Incubator mailing list.
> 
> I'm new to Spark, and I wanted to get some advice on the best way to load our 
> data into it:
> - A CSV file generated each day, which contains user click data
> - A Django app, which is running on top of PostgreSQL, containing user and 
>   transaction data
> 
> We do want the data load to be fairly quick, but we'd also want interactive 
> queries to be fast, so if anybody can explain any tradeoffs in Spark we'd 
> need to make on either, that would be good as well. I'd be leaning towards 
> sacrificing load speed to speed up queries, for our use cases.
> 
> I'm guessing we'd be looking at loading this data in once a day (or perhaps a 
> few times throughout the day). Unless there's a good way to stream in the 
> above types of sources?
> 
> My question is - what are the current recommended practices for loading in 
> the above?
> 
> With the CSV file, could we split it up, to parallelise the load? How would 
> we do this in Spark?
> 
> And with the Django app - I'm guessing I can either use Django's in-built 
> ORM, or we could query the PostgreSQL database directly? Any pros/cons of 
> either approach? Or should I be investigating something like Sqoop (or 
> whatever the Spark equivalent tool is?).
> 
> Cheers,
> Victor
