Re: [EXTERNAL] Re: loading big amount of data to Cassandra
With the DataStax bulk loader you can only export from a Cassandra table, not import into Cassandra (it only loads into a DSE cluster). And +1 on the confusing name of batches ... yes, it's for writes, but not for loading data.

Amanda

> On Aug 5, 2019, at 8:14 AM, Durity, Sean R wrote:
>
> DataStax has a very fast bulk load tool - dsbulk. Not sure if it is
> available for open source or not. In my experience so far, I am very
> impressed with it.
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> -----Original Message-----
> From: p...@xvalheru.org
> Sent: Saturday, August 3, 2019 6:06 AM
> To: user@cassandra.apache.org
> Cc: Dimo Velev
> Subject: [EXTERNAL] Re: loading big amount of data to Cassandra
>
> Thanks to all,
>
> I'll try the SSTables.
>
> Thanks
>
> Pat
>
>> On 2019-08-03 09:54, Dimo Velev wrote:
>> Check out the CQLSSTableWriter java class -
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
>> You use it to generate sstables - you need to write a small program
>> for that. You can then stream them over the network using
>> sstableloader (either use the utility, or use the underlying classes
>> to embed it in your program).
>>
>>> On 3. Aug 2019, at 07:17, Ayub M wrote:
>>>
>>> Dimo, how do you generate sstables? Do you mean load data locally on
>>> a cassandra node and use sstableloader?
>>>
>>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev wrote:
>>>
>>>> Hi,
>>>>
>>>> Batches will actually slow down the process, because they mean a
>>>> different thing in C* - as you read, they just group changes
>>>> together that you want executed atomically.
>>>>
>>>> Cassandra does not really have indices, so that is different from a
>>>> relational DB. However, as you write to Cassandra it generates many
>>>> smallish sstables of the data. These are then compacted together in
>>>> the background to improve read performance.
>>>>
>>>> You have two options, from my experience:
>>>>
>>>> Option 1: use the normal CQL API in async mode. This will create a
>>>> high CPU load on your cluster. Depending on whether that is fine
>>>> for you, this might be the easiest solution.
>>>>
>>>> Option 2: generate sstables locally and use sstableloader to
>>>> upload them into the cluster. The streaming does not generate high
>>>> CPU load, so it is a viable option for clusters with other
>>>> operational load.
>>>>
>>>> Option 2 scales with the number of cores of the machine generating
>>>> the sstables. If you can split your data, you can generate sstables
>>>> on multiple machines. In contrast, option 1 scales with your
>>>> cluster. If you have a large cluster that is idling, it would be
>>>> better to use option 1.
>>>>
>>>> With both options I was able to write about 50-100K rows/sec on my
>>>> laptop against a local Cassandra. The speed depends heavily on the
>>>> size of your rows.
>>>>
>>>> Back to your question — I guess option 2 is similar to what you
>>>> are used to from tools like SQL*Loader for relational DBMSes.
>>>>
>>>> I had a requirement to load a few hundred million rows per day into
>>>> an operational cluster, so I went with option 2 to offload the CPU
>>>> load and reduce the impact on the reading side during the loads.
>>>>
>>>> Cheers,
>>>> Dimo
>>>>
>>>> Sent from my iPad
>>>>
>>>>> On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I need to upload about 7 billion records to Cassandra. What is the
>>>>> best setup of Cassandra for this task? Will using batches speed up
>>>>> the upload (I've read somewhere that batch in Cassandra is meant
>>>>> for atomicity, not for speeding up communication)? How does
>>>>> Cassandra work internally with regard to indexing? In SQL
>>>>> databases, when uploading such an amount of data, it is suggested
>>>>> to turn indexing off and back on afterwards. Is something similar
>>>>> possible in Cassandra?
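For reference, the CQLSSTableWriter route Dimo describes boils down to: build a writer from the table's CREATE TABLE and INSERT statements, feed it rows, close it, then stream the output directory with sstableloader. The sketch below only mirrors the call shape of the real class (org.apache.cassandra.io.sstable.CQLSSTableWriter, from the cassandra-all jar); a minimal stand-in builder is included so it runs without a Cassandra dependency, and the keyspace, table, and paths are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Dimo's "Option 2". The real class is
// org.apache.cassandra.io.sstable.CQLSSTableWriter; this stand-in only
// mirrors its builder/addRow/close call shape.
public class SSTableGenSketch {
    static int lastRowCount;  // exposed so the sketch's result can be checked

    static class Writer {
        final List<Object[]> rows = new ArrayList<>();
        void addRow(Object... values) { rows.add(values); }
        void close() { /* the real writer flushes sstable files here */ }
    }
    static class Builder {
        Builder inDirectory(String dir) { return this; }     // output directory
        Builder forTable(String createStmt) { return this; } // table schema
        Builder using(String insertStmt) { return this; }    // bound insert
        Writer build() { return new Writer(); }
    }

    public static void main(String[] args) {
        Writer writer = new Builder()
                .inDirectory("/tmp/sstables/myks/mytable")  // illustrative path
                .forTable("CREATE TABLE myks.mytable (id bigint PRIMARY KEY, payload text)")
                .using("INSERT INTO myks.mytable (id, payload) VALUES (?, ?)")
                .build();
        for (long i = 0; i < 1000; i++) {
            writer.addRow(i, "row-" + i);  // values bound in INSERT order
        }
        writer.close();
        lastRowCount = writer.rows.size();
        // Then stream the directory into the cluster, e.g.:
        //   sstableloader -d <contact-point> /tmp/sstables/myks/mytable
    }
}
```

With the real class, the same loop writes sstable files under the output directory, which sstableloader then streams to the nodes owning each token range.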
Re: [EXTERNAL] Re: loading big amount of data to Cassandra
cassandra-loader is also useful because you don't need to create sstables. https://github.com/brianmhess/cassandra-loader

Hiro

On Tue, Aug 6, 2019 at 12:15 AM Durity, Sean R wrote:
>
> DataStax has a very fast bulk load tool - dsbulk. Not sure if it is
> available for open source or not. In my experience so far, I am very
> impressed with it.
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> [rest of quoted thread trimmed]
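Dimo's Option 1 (essentially what a tool like cassandra-loader automates) comes down to issuing async CQL writes while capping the number of in-flight requests so the client does not overrun the cluster. A self-contained sketch of that pattern follows; the driver call is stubbed out (the real one would be the session's executeAsync), and the limits and row counts are illustrative:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of bounded-concurrency async writes ("Option 1").
// writeAsync stands in for the driver's session.executeAsync(...).
public class AsyncLoadSketch {
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicLong written = new AtomicLong();

    static CompletableFuture<Void> writeAsync(String row) {
        // A real implementation would bind `row` to a prepared INSERT.
        return CompletableFuture.runAsync(written::incrementAndGet, pool);
    }

    public static void main(String[] args) throws Exception {
        int inFlightLimit = 256;               // cap on concurrent requests
        Semaphore inFlight = new Semaphore(inFlightLimit);
        int totalRows = 10_000;
        CountDownLatch done = new CountDownLatch(totalRows);

        for (int i = 0; i < totalRows; i++) {
            inFlight.acquire();                // block while too many pending
            writeAsync("row-" + i).whenComplete((v, err) -> {
                inFlight.release();            // on error, a real loader would retry here
                done.countDown();
            });
        }
        done.await();                          // wait for all writes to settle
        pool.shutdown();
        System.out.println("rows written: " + written.get());
    }
}
```

The semaphore is the important part: without a bound on in-flight requests, an async loader can queue requests faster than the cluster acknowledges them and time out under its own backlog.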
RE: [EXTERNAL] Re: loading big amount of data to Cassandra
DataStax has a very fast bulk load tool - dsbulk. Not sure if it is available for open source or not. In my experience so far, I am very impressed with it.

Sean Durity – Staff Systems Engineer, Cassandra

-----Original Message-----
From: p...@xvalheru.org
Sent: Saturday, August 3, 2019 6:06 AM
To: user@cassandra.apache.org
Cc: Dimo Velev
Subject: [EXTERNAL] Re: loading big amount of data to Cassandra

Thanks to all,

I'll try the SSTables.

Thanks

Pat

> [rest of quoted thread trimmed]