Hi,

So to clarify - HDFS is the recommended way of getting the CSV and PostgreSQL data into Spark?
And I should be using something like Apache Sqoop to load the data into HDFS? For the CSV files - are there any efficient bulk methods that can split them up to parallelise the import?

Cheers,
Victor

On Wed, Oct 23, 2013 at 2:20 AM, Ryan Weald <[email protected]> wrote:

> If you are using HDFS you also have the option of using Apache Sqoop to
> load data from your SQL database into HDFS in TSV or CSV format. Once it
> is on HDFS, including it in a Spark job would be trivial.
>
> -Ryan
>
>
> On Fri, Oct 18, 2013 at 6:14 AM, Chester <[email protected]> wrote:
>
>> There is an HCatalog project which provides an abstraction layer for
>> different file formats, including CSV, as well as SQL. I don't know
>> whether it works well with Spark or not. I posted a question about
>> HCatalog a few days ago, but did not get any response.
>>
>> Chester
>>
>> Sent from my iPad
>>
>> On Oct 18, 2013, at 4:18 AM, Vinay <[email protected]> wrote:
>>
>> An option would be to use HDFS for loading the CSV, and JDBC support to
>> load tables from Postgres.
>>
>>
>> Regards,
>> Vinay
>>
>> On Oct 18, 2013, at 1:24 AM, Victor Hooi <[email protected]> wrote:
>>
>> Hi,
>>
>> *NB: I originally posted this to the Google Group, before I saw the
>> message about how we're moving to the Apache Incubator mailing list.*
>>
>> I'm new to Spark, and I wanted to get some advice on the best way to
>> load our data into it:
>>
>> 1. A CSV file generated each day, which contains user click data
>> 2. A Django app, running on top of PostgreSQL, containing user and
>> transaction data
>>
>> We do want the data load to be fairly quick, but we'd also want
>> interactive queries to be fast, so if anybody can explain any tradeoffs
>> we'd need to make in Spark on either front, that would be good as well.
>> For our use cases, I'd lean towards sacrificing load speed to speed up
>> queries.
>>
>> I'm guessing we'd be looking at loading this data in once a day (or
>> perhaps a few times throughout the day), unless there's a good way to
>> stream in the above types of sources?
>>
>> My question is - what are the current recommended practices for loading
>> in the above?
>>
>> With the CSV file, could we split it up to parallelise the load? How
>> would we do this in Spark?
>>
>> And with the Django app - I'm guessing I can either use Django's
>> built-in ORM, or we could query the PostgreSQL database directly? Any
>> pros/cons of either approach? Or should I be investigating something
>> like Sqoop (or whatever the Spark equivalent tool is)?
>>
>> Cheers,
>> Victor
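[Editor's note: for reference, a Sqoop import along the lines Ryan describes might look like the sketch below. The host, database, table name, and target directory are hypothetical placeholders; the flags are Sqoop's standard import options.]

```shell
# Sketch: pull one PostgreSQL table into HDFS as CSV, using 4 parallel mappers.
# Connection details, table, and paths below are made-up placeholders.
sqoop import \
  --connect jdbc:postgresql://db-host:5432/appdb \
  --username app_user -P \
  --table transactions \
  --fields-terminated-by ',' \
  --target-dir /data/transactions \
  --num-mappers 4
```

Sqoop parallelises the import by splitting the table on its primary key by default; if the table has no suitable key, `--split-by` lets you pick another column.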

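[Editor's note: on the question of splitting a CSV to parallelise the load - this is essentially what HDFS and Spark do under the hood: the file is divided at byte offsets, and each reader scans forward to the next newline so no record is cut in half. A minimal pure-Python sketch of that offset logic follows (illustrative only; it does not handle quoted fields that contain embedded newlines).]

```python
import csv
import io
import os

def split_points(path, num_chunks):
    """Divide a file into byte ranges, aligning each boundary to the
    start of the next full line so no record straddles two chunks."""
    size = os.path.getsize(path)
    step = max(1, size // num_chunks)
    points = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(i * step)
            f.readline()  # skip the partial line at the raw boundary
            pos = f.tell()
            if pos >= size:
                break
            if pos > points[-1]:
                points.append(pos)
    points.append(size)
    return list(zip(points[:-1], points[1:]))

def read_chunk(path, start, end):
    """Parse only the CSV rows whose lines begin inside [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start).decode("utf-8")
    return list(csv.reader(io.StringIO(data)))
```

Each `(start, end)` range can then be handed to a separate worker; concatenating the workers' rows reproduces the whole file in order.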