An option would be to use HDFS for loading the CSV, and Spark's JDBC support to load the tables from Postgres.
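For concreteness, here is a rough sketch of both approaches against the Scala RDD API. Every concrete detail is an assumption for illustration: the master URL, the HDFS path, the Postgres connection string and credentials, the auth_user table and its id/email columns, the key bounds, and the partition counts. JdbcRDD's availability also depends on your Spark release, so check that it exists in yours before relying on it.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext("local[4]", "daily-load")

// 1) Daily click CSV from HDFS. textFile() splits the input by HDFS block,
//    so the load is already parallelised across the cluster; the second
//    argument requests a minimum number of partitions.
val clicks = sc.textFile("hdfs://namenode:9000/data/clicks/2013-10-18.csv", 8)
  .map(_.split(","))   // naive split; does not handle quoted fields
  .cache()             // keep in memory so interactive queries stay fast

// 2) User/transaction tables from Postgres via JdbcRDD. The query must
//    contain exactly two '?' placeholders, which each partition fills with
//    its slice of the numeric key range, so the table is read in parallel.
val users = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:postgresql://dbhost:5432/app", "dbuser", "dbpass"),
  "SELECT id, email FROM auth_user WHERE id >= ? AND id <= ?",
  1L, 1000000L,  // assumed bounds on the id column
  4,             // partitions == concurrent Postgres connections
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

println("clicks: " + clicks.count() + ", users: " + users.count())

Each JdbcRDD partition opens its own connection and scans its own slice of the id range, so the partition count also caps the concurrent load on Postgres (the Postgres JDBC driver jar needs to be on the classpath). That also answers the parallel-load question below: once the file is in HDFS, textFile() does the splitting for free. And querying Postgres directly over JDBC, rather than going through Django's ORM, keeps the extraction inside Spark's workers instead of funnelling everything through a single Python process.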
Regards,
Vinay

> On Oct 18, 2013, at 1:24 AM, Victor Hooi <[email protected]> wrote:
>
> Hi,
>
> NB: I originally posted this to the Google Group, before I saw the message about how we're moving to the Apache Incubator mailing list.
>
> I'm new to Spark, and I wanted to get some advice on the best way to load our data into it:
>
> - A CSV file generated each day, which contains user click data
> - A Django app, running on top of PostgreSQL, containing user and transaction data
>
> We do want the data load to be fairly quick, but we'd also want interactive queries to be fast, so if anybody can explain any tradeoffs we'd need to make in Spark on either front, that would be good as well. For our use cases, I'd lean towards sacrificing load speed to speed up queries.
>
> I'm guessing we'd be looking at loading this data once a day (or perhaps a few times throughout the day), unless there's a good way to stream in the above types of sources?
>
> My question is: what are the current recommended practices for loading in the above?
>
> With the CSV file, could we split it up to parallelise the load? How would we do this in Spark?
>
> And with the Django app, I'm guessing I can either use Django's built-in ORM, or we could query the PostgreSQL database directly? Any pros/cons of either approach? Or should I be investigating something like Sqoop (or whatever the Spark equivalent tool is)?
>
> Cheers,
> Victor
