I suggest using cassandra-loader: https://github.com/brianmhess/cassandra-loader
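A minimal invocation, roughly following the project README (double-check the exact flags there, and substitute your own hosts, keyspace/table and columns), looks something like:

    cassandra-loader -f /data/csv -host 10.0.0.1 -numThreads 8 \
        -schema "myks.mytable(id, payload)"

It reads the CSV files with multiple threads and writes through the regular Java driver, so it parallelizes much better than a single CQLSSTableWriter instance and doesn't need a separate sstableloader step.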
On Mar 9, 2017 5:30 PM, "Artur R" <ar...@gpnxgroup.com> wrote:

> Hello all!
>
> There are ~500 GB of CSV files and I am trying to find a way to upload
> them to a C* table (a new, empty C* cluster of 3 nodes, replication
> factor 2) within a reasonable time (say, 10 hours using 3-4 c3.8xlarge
> EC2 instances).
>
> My first impulse was to use CQLSSTableWriter, but a single instance is
> too slow and I can't efficiently parallelize it (by just creating Java
> threads), because after some point it always "hangs" (it looks like the
> GC is overstressed) and eats all available memory.
>
> So the questions are:
>
> 1. What is the best way to bulk-load a huge amount of data into a new
> C* cluster?
>
> This comment on https://issues.apache.org/jira/browse/CASSANDRA-9323:
>
>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>> tickets
>
> is confusing, because I have read that CQLSSTableWriter + sstableloader
> is much faster than COPY. Which is right?
>
> 2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
> Maybe a ready-to-use library like https://github.com/spotify/hdfs2cass?
>
> 3. sstableloader is slow too. Given that I have a new, empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some
> other settings while streaming and then turn them back on?
>
> Thanks!
> Artur.
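If you do stick with CQLSSTableWriter for question 2, the pattern that has worked for people is one writer per thread, each writing into its own output directory, and streaming all of those directories with sstableloader afterwards. Below is a rough, untested sketch of that idea; the keyspace/table (ks.events), column names, paths and thread count are placeholders for your setup, and depending on your Cassandra version you may also need the cassandra-all jar on the classpath and Config.setClientMode(true) before building the writers.

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.BufferedReader;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class CsvToSSTables {
    // Placeholder schema -- replace with your real table definition.
    static final String SCHEMA =
        "CREATE TABLE ks.events (id text PRIMARY KEY, payload text)";
    static final String INSERT =
        "INSERT INTO ks.events (id, payload) VALUES (?, ?)";

    public static void main(String[] args) throws Exception {
        // args[0] = directory containing the pre-split CSV chunks
        List<Path> csvFiles =
            Files.list(Paths.get(args[0])).collect(Collectors.toList());

        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to your box
        for (Path csv : csvFiles) {
            pool.submit(() -> writeOne(csv));
        }
        pool.shutdown();
    }

    // One writer and one output directory per CSV chunk; sstableloader infers
    // keyspace/table from the last two path components, hence the /ks/events suffix.
    static void writeOne(Path csv) {
        File outDir = new File("/tmp/sstables/" + csv.getFileName() + "/ks/events");
        outDir.mkdirs();
        try (BufferedReader in = Files.newBufferedReader(csv);
             CQLSSTableWriter writer = CQLSSTableWriter.builder()
                     .inDirectory(outDir)
                     .forTable(SCHEMA)
                     .using(INSERT)
                     .build()) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", 2); // naive CSV parsing, placeholder
                writer.addRow(cols[0], cols[1]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Once the writers finish, stream each directory in with something like "sstableloader -d <host1,host2> /tmp/sstables/<chunk>/ks/events". As far as I know CQLSSTableWriter isn't thread-safe, so never share one writer across threads; that, plus keeping each writer's buffer modest, should avoid the GC pressure and memory blow-up you hit.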