There is an HCatalog project that provides an abstraction layer over different
types of file formats, including CSV, as well as SQL. I don't know whether it
works well with Spark. I posted a question about HCatalog a few days ago, but
did not get any response.

Chester

On Oct 18, 2013, at 4:18 AM, Vinay <[email protected]> wrote:

> One option would be to use HDFS for loading the CSV, and JDBC support to
> load tables from Postgres.
> 
> 
> Regards,
> Vinay
> 
> On Oct 18, 2013, at 1:24 AM, Victor Hooi <[email protected]> wrote:
> 
>> Hi,
>> 
>> NB: I originally posted this to the Google Group, before I saw the message 
>> about how we're moving to the Apache Incubator mailing list.
>> 
>> I'm new to Spark, and I wanted to get some advice on the best way to load 
>> our data into it:
>> - A CSV file generated each day, which contains user click data
>> - A Django app, running on top of PostgreSQL, containing user and
>>   transaction data
>>
>> We do want the data load to be fairly quick, but we'd also want interactive 
>> queries to be fast, so if anybody can explain any tradeoffs in Spark we'd 
>> need to make on either, that would be good as well. I'd be leaning towards 
>> sacrificing load speed to speed up queries, for our use cases.
>> 
>> I'm guessing we'd be looking at loading this data in once a day (or perhaps 
>> a few times throughout the day). Unless there's a good way to stream in the 
>> above types of sources?
>> 
>> My question is - what are the current recommended practices for loading in 
>> the above?
>> 
>> With the CSV file, could we split it up, to parallelise the load? How would 
>> we do this in Spark?
>> 
>> And with the Django app - I'm guessing I can either use Django's in-built 
>> ORM, or we could query the PostgreSQL database directly? Any pros/cons of 
>> either approach? Or should I be investigating something like Sqoop (or
>> whatever the Spark equivalent is)?
>> 
>> Cheers,
>> Victor
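
On the CSV part of the question above, a minimal sketch (not a definitive
implementation) of the per-line parsing step may help. It assumes a
hypothetical three-column layout of user_id, url and timestamp, since the
thread does not describe the real file; the function name and the HDFS path
in the comments are likewise made up for illustration.

```python
import csv
from datetime import datetime

# Assumed (hypothetical) column layout: user_id,url,timestamp
# The real click file is not described in the thread.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_click_line(line):
    """Parse one CSV line into (user_id, url, clicked_at).

    Returns None for anything that does not fit the assumed layout
    (header row, blank line, wrong field count, bad number or date).
    """
    try:
        # csv.reader handles quoted fields that contain commas.
        fields = next(csv.reader([line]))
        user_id, url, raw_ts = fields
        return (int(user_id), url, datetime.strptime(raw_ts, TIMESTAMP_FORMAT))
    except (StopIteration, ValueError):
        return None


def parse_lines(lines):
    """Parse many lines, silently dropping the malformed ones."""
    parsed = (parse_click_line(line) for line in lines)
    return [record for record in parsed if record is not None]


# With PySpark the same function could be applied per line, for example:
#   clicks = (sc.textFile("hdfs:///path/to/clicks.csv")   # hypothetical path
#               .map(parse_click_line)
#               .filter(lambda r: r is not None)
#               .cache())
# textFile() already splits a file on HDFS into partitions, so the parse
# runs in parallel without splitting the file by hand, and cache() keeps
# the parsed records in memory for repeated interactive queries.
```

Parsing once and caching the result fits the stated preference for trading
some load time for faster queries: the cost of parsing is paid at load, and
later queries read the in-memory records.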
