Re: [EXTERNAL] Re: loading big amount of data to Cassandra

2019-08-06 Thread Amanda Moran
With the DataStax Bulk Loader (dsbulk) you can only export from an Apache
Cassandra table; you cannot import into Apache Cassandra (loading only works
against a DSE cluster).

And +1 on the confusing name of batches: yes, they are for writes, but not for
loading data.

Amanda 


Re: [EXTERNAL] Re: loading big amount of data to Cassandra

2019-08-06 Thread Hiroyuki Yamada
cassandra-loader is also useful because you don't need to create sstables.
https://github.com/brianmhess/cassandra-loader

Hiro


RE: [EXTERNAL] Re: loading big amount of data to Cassandra

2019-08-05 Thread Durity, Sean R
DataStax has a very fast bulk load tool - dsbulk. Not sure if it is available
for open source or not. In my experience so far, I am very impressed with it.



Sean Durity – Staff Systems Engineer, Cassandra

-----Original Message-----
From: p...@xvalheru.org 
Sent: Saturday, August 3, 2019 6:06 AM
To: user@cassandra.apache.org
Cc: Dimo Velev 
Subject: [EXTERNAL] Re: loading big amount of data to Cassandra

Thanks to all,

I'll try the SSTables.

Thanks

Pat

On 2019-08-03 09:54, Dimo Velev wrote:
> Check out the CQLSSTableWriter Java class:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
> You use it to generate sstables - you need to write a small program
> for that. You can then stream them over the network using the
> sstableloader (either use the utility or use the underlying classes to
> embed it in your program).
>
> On 3. Aug 2019, at 07:17, Ayub M  wrote:
>
>> Dimo, how do you generate sstables? Do you mean load data locally on
>> a cassandra node and use sstableloader?
>>
>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev 
>> wrote:
>>
>>> Hi,
>>>
>>> Batches will actually slow down the process because they mean a
>>> different thing in C* - as you read they are just grouping changes
>>> together that you want executed atomically.
>>>
>>> Cassandra does not really have indices in the relational sense, so
>>> that is different from a relational DB. However, writing to Cassandra
>>> produces many smallish SSTables (immutable data files) on disk. These
>>> are then merged in the background (compaction) to improve read
>>> performance.
>>>
>>> You have two options from my experience:
>>>
>>> Option 1: use normal CQL api in async mode. This will create a
>>> high CPU load on your cluster. Depending on whether that is fine
>>> for you that might be the easiest solution.
>>>
>>> Option 2: generate sstables locally and use the sstableloader to
>>> upload them into the cluster. The streaming does not generate high
>>> cpu load so it is a viable option for clusters with other
>>> operational load.
>>>
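A minimal sketch of Option 1, assuming nothing about the actual driver: `write_row` is a hypothetical stand-in for whatever call sends one insert to the cluster (with the DataStax Python driver that would typically be built on `session.execute_async`). The point it illustrates is keeping a bounded number of individual writes in flight instead of grouping them into batches.

```python
import concurrent.futures
import threading


def load_rows(rows, write_row, max_in_flight=128, workers=16):
    """Submit one write per row, keeping at most max_in_flight pending.

    write_row is a hypothetical callable that sends a single row to the
    cluster and raises on failure. Returns the rows whose write failed.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    failed = []
    lock = threading.Lock()

    def on_done(row, future):
        # Runs when the write finishes, whether it succeeded or raised.
        if future.exception() is not None:
            with lock:
                failed.append(row)
        slots.release()  # free one in-flight slot

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for row in rows:
            slots.acquire()  # blocks while max_in_flight writes are pending
            future = pool.submit(write_row, row)
            future.add_done_callback(lambda f, r=row: on_done(r, f))
    # Leaving the with-block waits for all outstanding writes.
    return failed
```

The `max_in_flight` and `workers` values are placeholders to tune against the cluster; the returned list can be fed back into `load_rows` as a simple retry.
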
>>> Option 2 scales with the number of cores of the machine generating
>>> the sstables. If you can split your data you can generate sstables
>>> on multiple machines. In contrast, option 1 scales with your
>>> cluster. If you have a large cluster that is idling, it would be
>>> better to use option 1.
>>>
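One possible way to split the input for Option 2 across several generator machines, sketched under assumptions: rows are tuples whose first element is the partition key, and `shard_for` / `split_rows` are illustrative names, not part of any Cassandra tool. Hashing the key with CRC32 is a design choice made here so that every machine computes the same assignment and all rows of one partition end up on the same machine.

```python
import zlib


def shard_for(partition_key: str, num_machines: int) -> int:
    """Map a partition key to one of num_machines generator machines.

    CRC32 is stable across processes and hosts, unlike Python's hash().
    """
    return zlib.crc32(partition_key.encode("utf-8")) % num_machines


def split_rows(rows, num_machines, key_of=lambda row: row[0]):
    """Group rows into one list per machine, by partition key."""
    shards = [[] for _ in range(num_machines)]
    for row in rows:
        shards[shard_for(key_of(row), num_machines)].append(row)
    return shards
```

Each shard would then be turned into sstables on its own machine and streamed with sstableloader independently of the others.
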
>>> With both options I was able to write at about 50-100K rows / sec
>>> on my laptop and local Cassandra. The speed heavily depends on the
>>> size of your rows.
>>>
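For a rough sense of scale, the 50-100K rows / sec figure can be applied to the 7 billion records from the original question. This is back-of-the-envelope arithmetic only; `load_time_hours` is an illustrative helper, and the `parallel_loaders` parameter assumes throughput scales linearly with the number of loaders, which a real cluster may not deliver.

```python
def load_time_hours(total_rows: int, rows_per_sec: float,
                    parallel_loaders: int = 1) -> float:
    """Hours needed to load total_rows at a sustained per-loader rate."""
    return total_rows / (rows_per_sec * parallel_loaders) / 3600


TOTAL_ROWS = 7_000_000_000

slow = load_time_hours(TOTAL_ROWS, 50_000)    # about 38.9 hours
fast = load_time_hours(TOTAL_ROWS, 100_000)   # about 19.4 hours
print(f"single loader: {fast:.1f} to {slow:.1f} hours")
```

So a single loader at laptop-level throughput would need somewhere between most of a day and more than a day and a half, which is why splitting the work matters at this volume.
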
>>> Back to your question: I guess option 2 is similar to what you
>>> are used to from tools like SQL*Loader for relational DBMSes.
>>>
>>> I had a requirement to load a few hundred million rows per day into
>>> an operational cluster, so I went with option 2 to offload the CPU
>>> load and reduce the impact on reads during the loads.
>>>
>>> Cheers,
>>> Dimo
>>>
>>>> On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
>>>>
>>>> Hi,
>>>>
>>>> I need to upload about 7 billion records to Cassandra. What is the
>>>> best Cassandra setup for this task? Will using batches speed up the
>>>> upload (I've read somewhere that batches in Cassandra are meant for
>>>> atomicity, not for speeding up communication)? How does Cassandra
>>>> work internally with regard to indexing? In SQL databases, when
>>>> uploading such an amount of data, it is suggested to turn indexing
>>>> off and then back on afterwards. Is something similar possible in
>>>> Cassandra?
>>>>
>>>> Thanks for all suggestions.
>>>>
>>>> Pat