I suggest using cassandra-loader: https://github.com/brianmhess/cassandra-loader
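A minimal invocation, roughly following the project README (double-check the exact flags there, and substitute your own hosts, keyspace/table and columns), looks something like:

    cassandra-loader -f /data/csv -host 10.0.0.1 -numThreads 8 \
        -schema "myks.mytable(id, payload)"

It reads the CSV files with multiple threads and writes through the regular Java driver, so it parallelizes much better than a single CQLSSTableWriter instance and doesn't need a separate sstableloader step.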
On Mar 9, 2017 5:30 PM, "Artur R" <ar...@gpnxgroup.com> wrote:

> Hello all!
>
> There are ~500 GB of CSV files and I am trying to find a way to upload
> them to a C* table (a new, empty C* cluster of 3 nodes, replication
> factor 2) within a reasonable time (say, 10 hours using 3-4 c3.8xlarge
> EC2 instances).
>
> My first impulse was to use CQLSSTableWriter, but a single instance is
> too slow and I can't efficiently parallelize it (by just creating Java
> threads), because after some point it always "hangs" (it looks like the
> GC is overstressed) and eats all available memory.
>
> So the questions are:
>
> 1. What is the best way to bulk-load a huge amount of data into a new
> C* cluster?
>
> This comment on https://issues.apache.org/jira/browse/CASSANDRA-9323:
>
>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>> tickets
>
> is confusing, because I have read that CQLSSTableWriter + sstableloader
> is much faster than COPY. Which is right?
>
> 2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
> Maybe a ready-to-use library like https://github.com/spotify/hdfs2cass?
>
> 3. sstableloader is slow too. Given that I have a new, empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some
> other settings while streaming and then turn them back on?
>
> Thanks!
> Artur.
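If you do stick with CQLSSTableWriter for question 2, the pattern that has worked for people is one writer per thread, each writing into its own output directory, and streaming all of those directories with sstableloader afterwards. Below is a rough, untested sketch of that idea; the keyspace/table (ks.events), column names, paths and thread count are placeholders for your setup, and depending on your Cassandra version you may also need the cassandra-all jar on the classpath and Config.setClientMode(true) before building the writers.

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.BufferedReader;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class CsvToSSTables {
    // Placeholder schema -- replace with your real table definition.
    static final String SCHEMA =
        "CREATE TABLE ks.events (id text PRIMARY KEY, payload text)";
    static final String INSERT =
        "INSERT INTO ks.events (id, payload) VALUES (?, ?)";

    public static void main(String[] args) throws Exception {
        // args[0] = directory containing the pre-split CSV chunks
        List<Path> csvFiles =
            Files.list(Paths.get(args[0])).collect(Collectors.toList());

        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to your box
        for (Path csv : csvFiles) {
            pool.submit(() -> writeOne(csv));
        }
        pool.shutdown();
    }

    // One writer and one output directory per CSV chunk; sstableloader infers
    // keyspace/table from the last two path components, hence the /ks/events suffix.
    static void writeOne(Path csv) {
        File outDir = new File("/tmp/sstables/" + csv.getFileName() + "/ks/events");
        outDir.mkdirs();
        try (BufferedReader in = Files.newBufferedReader(csv);
             CQLSSTableWriter writer = CQLSSTableWriter.builder()
                     .inDirectory(outDir)
                     .forTable(SCHEMA)
                     .using(INSERT)
                     .build()) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", 2); // naive CSV parsing, placeholder
                writer.addRow(cols[0], cols[1]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Once the writers finish, stream each directory in with something like "sstableloader -d <host1,host2> /tmp/sstables/<chunk>/ks/events". As far as I know CQLSSTableWriter isn't thread-safe, so never share one writer across threads; that, plus keeping each writer's buffer modest, should avoid the GC pressure and memory blow-up you hit.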