There is an HCatalog project which provides an abstraction layer over different file formats, including CSV, as well as SQL sources. I don't know whether it works well with Spark or not. I posted a question about HCatalog a few days ago, but did not get any response.
Chester

Sent from my iPad

On Oct 18, 2013, at 4:18 AM, Vinay <[email protected]> wrote:

> An option would be to use HDFS for loading CSV, and JDBC support to load
> tables from Postgres.
>
> Regards,
> Vinay
>
> On Oct 18, 2013, at 1:24 AM, Victor Hooi <[email protected]> wrote:
>
>> Hi,
>>
>> NB: I originally posted this to the Google Group, before I saw the message
>> about how we're moving to the Apache Incubator mailing list.
>>
>> I'm new to Spark, and I wanted to get some advice on the best way to load
>> our data into it:
>>
>> * A CSV file generated each day, which contains user click data
>> * A Django app, running on top of PostgreSQL, containing user and
>>   transaction data
>>
>> We do want the data load to be fairly quick, but we'd also want interactive
>> queries to be fast, so if anybody can explain any tradeoffs we'd need to
>> make in Spark on either side, that would be good as well. For our use
>> cases, I'd lean towards sacrificing load speed to speed up queries.
>>
>> I'm guessing we'd be looking at loading this data in once a day (or perhaps
>> a few times throughout the day), unless there's a good way to stream in the
>> above types of sources?
>>
>> My question is: what are the current recommended practices for loading in
>> the above?
>>
>> With the CSV file, could we split it up to parallelise the load? How would
>> we do this in Spark?
>>
>> And with the Django app, I'm guessing we can either use Django's built-in
>> ORM, or query the PostgreSQL database directly? Any pros/cons of either
>> approach? Or should I be investigating something like Sqoop (or whatever
>> the Spark equivalent tool is)?
>>
>> Cheers,
>> Victor
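On the CSV question above: you generally don't need to split the file yourself. Spark's `sc.textFile()` already splits an HDFS file by block and parses the partitions in parallel; you just supply a per-line parse function to `map`. Below is a minimal sketch of such a parse function in Python. The column layout (user_id, timestamp, url) is an assumed schema for illustration only, and the `hdfs://` path and the `parse_click` name are hypothetical; substitute your real fields and path.

```python
import csv
import io

def parse_click(line):
    """Parse one line of the (assumed) click-data CSV into a tuple.

    The three-column schema (user_id, timestamp, url) is an assumption
    for illustration; adjust to the real layout of your daily file.
    Using csv.reader per line handles quoted fields correctly, unlike
    a bare line.split(",").
    """
    user_id, ts, url = next(csv.reader(io.StringIO(line)))
    return (user_id, ts, url)

# In Spark you would apply this per line. sc.textFile() already splits
# the file by HDFS block, so the parse runs in parallel with no manual
# file splitting:
#
#   clicks = sc.textFile("hdfs:///data/clicks/2013-10-18.csv") \
#              .map(parse_click)
#   clicks.cache()   # keep the parsed RDD in memory so that repeated
#                    # interactive queries stay fast after the one-time
#                    # load cost, matching the "slow load, fast query"
#                    # preference above
```

For the Postgres side, the analogous pattern in this era of Spark would be to export the tables to HDFS first (e.g. with Sqoop, or `COPY ... TO` plus an HDFS put) and load them the same way, rather than querying through Django's ORM, which adds a layer without helping parallelism.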
