Re: Self-healing data integrity?

2017-09-14 Thread Carlos Rolo
Wouldn't it be easier for:

1) The CRC to be checked on the sender side, which then refuses to send if
it doesn't match?

2) And once the stream ends, to compare the two CRCs to see if
something went wrong during the transfer?

Also, you could implement this in two pieces instead of reviewing the
streaming architecture as a whole. I have no familiarity with the Cassandra
code behind these assumptions, so I'm just trying to contribute (and am
actually trying to implement at least the first part).
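A minimal sketch of what (1) and (2) could look like, assuming a simple in-memory byte stream and CRC-32 via Python's zlib (the helper names are hypothetical, not Cassandra's actual streaming code):

```python
import zlib


def crc32_of(data: bytes) -> int:
    # CRC-32 of the full payload, standing in for the SSTable's stored CRC
    return zlib.crc32(data) & 0xFFFFFFFF


def send(data: bytes, expected_crc: int, chunk_size: int = 4):
    """(1) Verify the CRC on the sender side; refuse to stream on mismatch."""
    if crc32_of(data) != expected_crc:
        raise IOError("local CRC mismatch - refusing to stream corrupt data")
    # stream the raw bytes in chunks, without deserializing them
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]


def receive(chunks, sender_crc: int) -> bytes:
    """(2) Recompute the CRC as chunks arrive; compare with the sender's CRC at the end."""
    crc, buf = 0, bytearray()
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # running CRC over the stream
        buf.extend(chunk)
    if (crc & 0xFFFFFFFF) != sender_crc:
        raise IOError("stream CRC mismatch - transfer got corrupted")
    return bytes(buf)


payload = b"sstable-data-block"
crc = crc32_of(payload)
received = receive(send(payload, crc), crc)
assert received == payload
```

Step (1) catches corruption that happened at rest on the sender; step (2) catches corruption introduced in transit, which is why both checks are worth doing.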

Regards,

Carlos Juzarte Rolo
Cassandra Consultant / Datastax Certified Architect / Cassandra MVP

Pythian - Love your data

rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin:
linkedin.com/in/carlosjuzarterolo
Mobile: +351 918 918 100
www.pythian.com

On Mon, Sep 11, 2017 at 9:12 AM, DuyHai Doan  wrote:

> Agree
>
>  A tricky detail about streaming is that:
>
> 1) On the sender side, the node just sends the SSTable data file (without
> any other components like CRC files, partition index, partition summary,
> etc.)
> 2) The sender does not even bother to deserialize the SSTable data; it
> just sends the stream of bytes, reading the SSTable content directly from
> disk
> 3) On the receiver side, the node receives the byte stream and needs to
> deserialize it in memory to rebuild all the SSTable components (CRC files,
> partition index, partition summary, ...)
>
> So the consequences are:
>
> a. there is a bottleneck on the receiving side because of deserialization
> b. if there is bit rot in an SSTable, then since the CRC files are not
> sent, there is no chance to detect it on the receiving side
> c. if we want to include CRC checks in the streaming path, it means a
> whole review of the streaming architecture, not just adding a feature
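As a toy illustration of point (b): with a digest stored next to the data (hypothetical helpers, loosely modeled on an SSTable's digest component), bit rot is detectable locally by recomputing and comparing; without shipping that digest, the receiver of a stream has no such check to run.

```python
import zlib


def write_with_digest(data: bytes):
    # store the data alongside its CRC-32 digest, like an SSTable digest component
    return data, zlib.crc32(data) & 0xFFFFFFFF


def verify(data: bytes, stored_digest: int) -> bool:
    # recompute and compare: a mismatch means the bytes rotted on disk
    return (zlib.crc32(data) & 0xFFFFFFFF) == stored_digest


data, digest = write_with_digest(b"partition-data")
assert verify(data, digest)                      # intact data passes
rotted = bytes([data[0] ^ 0x01]) + data[1:]      # flip a single bit
assert not verify(rotted, digest)                # bit rot is detected
```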
>
> On Sat, Sep 9, 2017 at 10:06 PM, Jeff Jirsa  wrote:
>
>> (Which isn't to say that someone shouldn't implement this; they should,
>> and there's probably a JIRA to do so already written, but it's a project of
>> volunteers, and nobody has volunteered to do the work yet)
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Sep 9, 2017, at 12:59 PM, Jeff Jirsa  wrote:
>>
>> There are, but they aren't consulted on the streaming paths (only on
>> normal reads)
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Sep 9, 2017, at 12:02 PM, DuyHai Doan  wrote:
>>
>> Jeff,
>>
>>  With default compression enabled on each table, aren't there CRC files
>> created alongside the SSTables that can help detect bit rot?
>>
>>
>> On Sat, Sep 9, 2017 at 7:50 PM, Jeff Jirsa  wrote:
>>
>>> Cassandra doesn't do that automatically - it can guarantee consistency
>>> on read or write via the ConsistencyLevel of each query, and it can run
>>> active (AntiEntropy) repairs. But active repairs must be scheduled (by a
>>> human, cron, or a third-party tool like http://cassandra-reaper.io/), and
>>> to be pedantic, repair only fixes consistency issues; there's some work to
>>> be done to properly address/support fixing corrupted replicas (for
>>> example, repair COULD send a bit flip from one node to all of the others)
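The per-query consistency path can be sketched as a last-write-wins read repair at quorum. This is a deliberate simplification (real Cassandra resolves per-cell using timestamps and tombstones, and these names are hypothetical), but it shows how a stale replica gets fixed on read:

```python
# replicas map node -> (value, write_timestamp)
replicas = {
    "node1": ("v2", 200),
    "node2": ("v1", 100),   # stale replica
    "node3": ("v2", 200),
}


def quorum_read_with_repair(replicas: dict, quorum: int = 2) -> str:
    # read from a quorum of replicas
    responses = dict(list(replicas.items())[:quorum])
    # newest write timestamp wins (last-write-wins resolution)
    winner_value, winner_ts = max(responses.values(), key=lambda vt: vt[1])
    # read repair: push the winning version back to stale responders
    for node, (_, ts) in responses.items():
        if ts < winner_ts:
            replicas[node] = (winner_value, winner_ts)
    return winner_value


assert quorum_read_with_repair(replicas) == "v2"
assert replicas["node2"] == ("v2", 200)   # stale replica was repaired on read
```

Note how this only heals data that is actually read, which is exactly why scheduled anti-entropy repair is still needed for cold data.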
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Sep 9, 2017, at 1:07 AM, Ralph Soika  wrote:
>>>
>>> Hi,
>>>
>>> I am searching for a big data storage solution for the Imixs-Workflow
>>> project. I started with Hadoop until I became aware of the
>>> 'small-file-problem'. So I am considering using Cassandra now.
>>>
>>> But Hadoop has one important feature for me. The replicator continuously
>>> examines whether data blocks are consistent across all datanodes. This will
>>> detect disk errors and automatically move data from defective blocks to
>>> working blocks. I think this is called a 'self-healing mechanism'.
>>>
>>> Is there a similar feature in Cassandra too?
>>>
>>>
>>> Thanks for your help
>>>
>>> Ralph
>>>
>>
>
