Hi,

So to clarify - HDFS is the recommended way of getting the CSV and
PostgreSQL data into Spark?

And I should be using something like Apache Sqoop to load into HDFS?

For loading the CSV files - are there any efficient bulk methods that can
split them up to parallelise the import?
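
To make the CSV question concrete, here is a minimal sketch of the per-line
parsing step a parallel import would need. It is plain Python with no Spark
dependency. The column names in FIELDS are an assumption (the real layout of
the click file is not given in this thread), and parse_click_line and
parse_partition are hypothetical helper names. It also assumes one record per
line, so quoted fields containing embedded newlines would not survive.

```python
import csv

# Hypothetical column layout for the daily click CSV; the real
# columns are not stated in this thread.
FIELDS = ("user_id", "url", "timestamp")


def parse_click_line(line):
    """Parse one CSV line into a dict, or return None if it is malformed."""
    # csv.reader handles quoted fields that contain commas.
    row = next(csv.reader([line]), None)
    if row is None or len(row) != len(FIELDS):
        return None
    return dict(zip(FIELDS, row))


def parse_partition(lines):
    """Parse an iterable of lines, silently skipping malformed ones."""
    for line in lines:
        record = parse_click_line(line)
        if record is not None:
            yield record


if __name__ == "__main__":
    sample = [
        '42,/home,2013-10-22T10:00:00',
        'broken line',
        '7,"/search?q=a,b",2013-10-22T10:00:05',
    ]
    for rec in parse_partition(sample):
        print(rec)
```

As far as I understand, once the file is on HDFS, sc.textFile(...) already
splits it into partitions, so a function like parse_partition could be handed
to mapPartitions and each partition would be parsed in parallel without
splitting the file by hand.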

Cheers,
Victor


On Wed, Oct 23, 2013 at 2:20 AM, Ryan Weald <[email protected]> wrote:

> If you are using HDFS you also have the option of using Apache Sqoop to
> load data from your SQL database into HDFS in TSV or CSV format. Once it is
> on HDFS, including it in a Spark job would be trivial.
>
> -Ryan
>
>
> On Fri, Oct 18, 2013 at 6:14 AM, Chester <[email protected]> wrote:
>
>> There is an HCatalog project which provides an abstraction layer for
>> different types of file formats, including CSV, as well as SQL. I don't know
>> whether this works well with Spark. I posted a question about HCatalog a
>> few days ago, but did not get any response.
>>
>> Chester
>>
>> On Oct 18, 2013, at 4:18 AM, Vinay <[email protected]> wrote:
>>
>> One option would be to use HDFS for loading the CSV, and JDBC support to
>> load tables from Postgres.
>>
>>
>> Regards,
>> Vinay
>>
>> On Oct 18, 2013, at 1:24 AM, Victor Hooi <[email protected]> wrote:
>>
>> Hi,
>>
>>  *NB: I originally posted this to the Google Group, before I saw the
>> message about how we're moving to the Apache Incubator mailing list.*
>>
>> I'm new to Spark, and I wanted to get some advice on the best way to load
>> our data into it:
>>
>>    1. A CSV file generated each day, which contains user click data
>>    2. A Django app, which is running on top of PostgreSQL, containing
>>    user and transaction data
>>
>> We do want the data load to be fairly quick, but we'd also want
>> interactive queries to be fast, so if anybody can explain any tradeoffs in
>> Spark we'd need to make on either, that would be good as well. I'd be
>> leaning towards sacrificing load speed to speed up queries, for our use
>> cases.
>>
>> I'm guessing we'd be looking at loading this data in once a day (or
>> perhaps a few times throughout the day). Unless there's a good way to
>> stream in the above types of sources?
>>
>> My question is - what are the current recommended practices for loading
>> in the above?
>>
>> With the CSV file, could we split it up, to parallelise the load? How
>> would we do this in Spark?
>>
>> And with the Django app - I'm guessing I can either use Django's in-built
>> ORM, or we could query the PostgreSQL database directly? Any pros/cons of
>> either approach? Or should I be investigating something like Sqoop (or
>> whatever the Spark equivalent tool is)?
>>
>> Cheers,
>> Victor
>>
>>
>
