Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-16 Thread Derek Chen-Becker
My vote is B, but I think you should go ahead with the actual vote thread.

Cheers,

Derek

On Fri, Sep 16, 2022 at 4:05 AM Andrés de la Peña 
wrote:

> It's been 9 days since we started the poll, and we haven't had any new
> vote since Monday. So we are still on 5 votes for A and 2 votes for B.
>
> The poll results don't seem to oppose the CEP. If no one has anything
> else to add, I'll start the actual vote thread.
>
> On Tue, 13 Sept 2022 at 15:05, Andrés de la Peña 
> wrote:
>
>> That's 5 votes for A and 2 votes for B so far. None of these options
>> opposes the CEP, so I think we can probably start the vote, unless we
>> want to wait longer for the poll.
>>
>> On Mon, 12 Sept 2022 at 13:51, Benjamin Lerer  wrote:
>>
>>> A
>>>
>>> Le mer. 7 sept. 2022 à 17:02, Jeremiah D Jordan <
>>> [email protected]> a écrit :
>>>
 A

 On Sep 7, 2022, at 8:58 AM, Benedict  wrote:

 Well, I am not convinced these changes will materially impact the
 outcome, but at least we’ll have some extra fun collating the votes.


 On 7 Sep 2022, at 14:05, Andrés de la Peña 
 wrote:

 
 The poll makes sense to me. I would slightly change it to:

 A) We shouldn't prefer either approach, and I agree to the implementor
 selecting the table schema approach for this CEP
 B) We should prefer the view approach, but I am not opposed to the
 implementor selecting the table schema approach for this CEP
 C) We should NOT implement the table schema approach, and should
 implement the view approach
 D) We should NOT implement the table view approach, and should
 implement the schema approach
 E) We should NOT implement the table schema approach, and should
 implement some other scheme (or not implement this feature)

 Where my vote is for A.


 On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:

> I’m not convinced there’s been adequate resolution over which approach
> is adopted. I know you have expressed a preference for the table schema
> approach, but the weight of other opinion so far appears to be against 
> this
> approach - even if it is broadly adopted by other databases. I will note
> that Postgres does not adopt this approach, it has a more sophisticated
> security label approach that has not been proposed by anybody so far.
>
> I think extra weight should be given to the implementer’s preference,
> so while I personally do not like the table schema approach, I am happy to
> accept this is an industry norm, and leave the decision to you.
>
> However, we should ensure the community as a whole endorses this. I
> think an indicative poll should be undertaken first, eg:
>
> A) We should implement the table schema approach, as proposed
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should
> implement the view approach
> D) We should NOT implement the table schema approach, and should
> implement some other scheme (or not implement this feature)
>
> Where my vote is B
>
> On 7 Sep 2022, at 12:50, Andrés de la Peña 
> wrote:
>
> 
> If nobody has more concerns regarding the CEP I will start the vote
> tomorrow.
>
> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
> wrote:
>
>> Is there enough support here for VIEWS to be the implementation
>>> strategy for displaying masking functions?
>>
>>
>> I'm not sure that views should be "the" strategy for masking
>> functions. We have multiple approaches here:
>>
>> 1) CQL functions only. Users can decide to use the masking functions
>> of their own accord. I think most dbs allow this pattern of usage, which is
>> quite straightforward. Obviously, it doesn't allow admins to enforce that
>> users see only masked data. Nevertheless, it's still useful for trusted
>> database users generating masked data that will be consumed by the end
>> users of the application.
>>
>> 2) Masking functions attached to specific columns. This way the same
>> queries will see different data (masked or not) depending on the
>> permissions of the user running the query. It has the advantage of not
>> requiring changes to the queries that users with different permissions run.
>> The downside is that users would need to query the schema if they need to
>> know whether a column is masked, unless we change the names of the returned
>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
>> Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support
>> applying the masking function to columns on the base table, and some of
>> them also allow to appl

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-16 Thread Andrés de la Peña
It's been 9 days since we started the poll, and we haven't had any new vote
since Monday. So we are still on 5 votes for A and 2 votes for B.

The poll results don't seem to oppose the CEP. If no one has anything
else to add, I'll start the actual vote thread.

On Tue, 13 Sept 2022 at 15:05, Andrés de la Peña 
wrote:

> That's 5 votes for A and 2 votes for B so far. None of these options
> opposes the CEP, so I think we can probably start the vote, unless we
> want to wait longer for the poll.
>
> On Mon, 12 Sept 2022 at 13:51, Benjamin Lerer  wrote:
>
>> A
>>
>> Le mer. 7 sept. 2022 à 17:02, Jeremiah D Jordan <
>> [email protected]> a écrit :
>>
>>> A
>>>
>>> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
>>>
>>> Well, I am not convinced these changes will materially impact the
>>> outcome, but at least we’ll have some extra fun collating the votes.
>>>
>>>
>>> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>>>
>>> 
>>> The poll makes sense to me. I would slightly change it to:
>>>
>>> A) We shouldn't prefer either approach, and I agree to the implementor
>>> selecting the table schema approach for this CEP
>>> B) We should prefer the view approach, but I am not opposed to the
>>> implementor selecting the table schema approach for this CEP
>>> C) We should NOT implement the table schema approach, and should
>>> implement the view approach
>>> D) We should NOT implement the table view approach, and should implement
>>> the schema approach
>>> E) We should NOT implement the table schema approach, and should
>>> implement some other scheme (or not implement this feature)
>>>
>>> Where my vote is for A.
>>>
>>>
>>> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>>>
 I’m not convinced there’s been adequate resolution over which approach
 is adopted. I know you have expressed a preference for the table schema
 approach, but the weight of other opinion so far appears to be against this
 approach - even if it is broadly adopted by other databases. I will note
 that Postgres does not adopt this approach, it has a more sophisticated
 security label approach that has not been proposed by anybody so far.

 I think extra weight should be given to the implementer’s preference,
 so while I personally do not like the table schema approach, I am happy to
 accept this is an industry norm, and leave the decision to you.

 However, we should ensure the community as a whole endorses this. I
 think an indicative poll should be undertaken first, eg:

 A) We should implement the table schema approach, as proposed
 B) We should prefer the view approach, but I am not opposed to the
 implementor selecting the table schema approach for this CEP
 C) We should NOT implement the table schema approach, and should
 implement the view approach
 D) We should NOT implement the table schema approach, and should
 implement some other scheme (or not implement this feature)

 Where my vote is B

 On 7 Sep 2022, at 12:50, Andrés de la Peña 
 wrote:

 
 If nobody has more concerns regarding the CEP I will start the vote
 tomorrow.

 On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
 wrote:

> Is there enough support here for VIEWS to be the implementation
>> strategy for displaying masking functions?
>
>
> I'm not sure that views should be "the" strategy for masking
> functions. We have multiple approaches here:
>
> 1) CQL functions only. Users can decide to use the masking functions
> of their own accord. I think most dbs allow this pattern of usage, which is
> quite straightforward. Obviously, it doesn't allow admins to enforce that
> users see only masked data. Nevertheless, it's still useful for trusted
> database users generating masked data that will be consumed by the end
> users of the application.
>
> 2) Masking functions attached to specific columns. This way the same
> queries will see different data (masked or not) depending on the
> permissions of the user running the query. It has the advantage of not
> requiring changes to the queries that users with different permissions run.
> The downside is that users would need to query the schema if they need to
> know whether a column is masked, unless we change the names of the returned
> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
> Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support
> applying the masking function to columns on the base table, and some of
> them also allow applying masking to views.
>
> 3) Masking functions as part of projected views. This way users might
> need to query the view appropriate for their permissions instead of the
> base table. This might mean changing the queries if the masking policy is
> changed by the admin. MySQL recomme

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-13 Thread Andrés de la Peña
That's 5 votes for A and 2 votes for B so far. None of these options
opposes the CEP, so I think we can probably start the vote, unless we
want to wait longer for the poll.

On Mon, 12 Sept 2022 at 13:51, Benjamin Lerer  wrote:

> A
>
> Le mer. 7 sept. 2022 à 17:02, Jeremiah D Jordan 
> a écrit :
>
>> A
>>
>> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
>>
>> Well, I am not convinced these changes will materially impact the
>> outcome, but at least we’ll have some extra fun collating the votes.
>>
>>
>> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>>
>> 
>> The poll makes sense to me. I would slightly change it to:
>>
>> A) We shouldn't prefer either approach, and I agree to the implementor
>> selecting the table schema approach for this CEP
>> B) We should prefer the view approach, but I am not opposed to the
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should
>> implement the view approach
>> D) We should NOT implement the table view approach, and should implement
>> the schema approach
>> E) We should NOT implement the table schema approach, and should
>> implement some other scheme (or not implement this feature)
>>
>> Where my vote is for A.
>>
>>
>> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>>
>>> I’m not convinced there’s been adequate resolution over which approach
>>> is adopted. I know you have expressed a preference for the table schema
>>> approach, but the weight of other opinion so far appears to be against this
>>> approach - even if it is broadly adopted by other databases. I will note
>>> that Postgres does not adopt this approach, it has a more sophisticated
>>> security label approach that has not been proposed by anybody so far.
>>>
>>> I think extra weight should be given to the implementer’s preference, so
>>> while I personally do not like the table schema approach, I am happy to
>>> accept this is an industry norm, and leave the decision to you.
>>>
>>> However, we should ensure the community as a whole endorses this. I
>>> think an indicative poll should be undertaken first, eg:
>>>
>>> A) We should implement the table schema approach, as proposed
>>> B) We should prefer the view approach, but I am not opposed to the
>>> implementor selecting the table schema approach for this CEP
>>> C) We should NOT implement the table schema approach, and should
>>> implement the view approach
>>> D) We should NOT implement the table schema approach, and should
>>> implement some other scheme (or not implement this feature)
>>>
>>> Where my vote is B
>>>
>>> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
>>>
>>> 
>>> If nobody has more concerns regarding the CEP I will start the vote
>>> tomorrow.
>>>
>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
>>> wrote:
>>>
 Is there enough support here for VIEWS to be the implementation
> strategy for displaying masking functions?


 I'm not sure that views should be "the" strategy for masking functions.
 We have multiple approaches here:

 1) CQL functions only. Users can decide to use the masking functions of
 their own accord. I think most dbs allow this pattern of usage, which is
 quite straightforward. Obviously, it doesn't allow admins to enforce that
 users see only masked data. Nevertheless, it's still useful for trusted
 database users generating masked data that will be consumed by the end
 users of the application.

 2) Masking functions attached to specific columns. This way the same
 queries will see different data (masked or not) depending on the
 permissions of the user running the query. It has the advantage of not
 requiring changes to the queries that users with different permissions run.
 The downside is that users would need to query the schema if they need to
 know whether a column is masked, unless we change the names of the returned
 columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
 Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support
 applying the masking function to columns on the base table, and some of
 them also allow applying masking to views.

 3) Masking functions as part of projected views. This way users might
 need to query the view appropriate for their permissions instead of the
 base table. This might mean changing the queries if the masking policy is
 changed by the admin. MySQL recommends this approach in a blog entry,
 although it's not part of its main documentation for data masking, and the
 implementation has security issues. Some of the other databases offering
 approach 2) as their main option also support masking on view columns.

 Each approach has its own advantages and limitations, and I don't think
 we necessarily have to choose. The CEP proposes implementing 1) and 2), but
 nothing prevents us from also having 3) 

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-12 Thread Benjamin Lerer
A

Le mer. 7 sept. 2022 à 17:02, Jeremiah D Jordan 
a écrit :

> A
>
> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
>
> Well, I am not convinced these changes will materially impact the outcome,
> but at least we’ll have some extra fun collating the votes.
>
>
> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>
> 
> The poll makes sense to me. I would slightly change it to:
>
> A) We shouldn't prefer either approach, and I agree to the implementor
> selecting the table schema approach for this CEP
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement
> the view approach
> D) We should NOT implement the table view approach, and should implement
> the schema approach
> E) We should NOT implement the table schema approach, and should implement
> some other scheme (or not implement this feature)
>
> Where my vote is for A.
>
>
> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>
>> I’m not convinced there’s been adequate resolution over which approach is
>> adopted. I know you have expressed a preference for the table schema
>> approach, but the weight of other opinion so far appears to be against this
>> approach - even if it is broadly adopted by other databases. I will note
>> that Postgres does not adopt this approach, it has a more sophisticated
>> security label approach that has not been proposed by anybody so far.
>>
>> I think extra weight should be given to the implementer’s preference, so
>> while I personally do not like the table schema approach, I am happy to
>> accept this is an industry norm, and leave the decision to you.
>>
>> However, we should ensure the community as a whole endorses this. I think
>> an indicative poll should be undertaken first, eg:
>>
>> A) We should implement the table schema approach, as proposed
>> B) We should prefer the view approach, but I am not opposed to the
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should
>> implement the view approach
>> D) We should NOT implement the table schema approach, and should
>> implement some other scheme (or not implement this feature)
>>
>> Where my vote is B
>>
>> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
>>
>> 
>> If nobody has more concerns regarding the CEP I will start the vote
>> tomorrow.
>>
>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
>> wrote:
>>
>>> Is there enough support here for VIEWS to be the implementation strategy
 for displaying masking functions?
>>>
>>>
>>> I'm not sure that views should be "the" strategy for masking functions.
>>> We have multiple approaches here:
>>>
>>> 1) CQL functions only. Users can decide to use the masking functions of
>>> their own accord. I think most dbs allow this pattern of usage, which is
>>> quite straightforward. Obviously, it doesn't allow admins to enforce that
>>> users see only masked data. Nevertheless, it's still useful for trusted
>>> database users generating masked data that will be consumed by the end
>>> users of the application.
>>>
>>> 2) Masking functions attached to specific columns. This way the same
>>> queries will see different data (masked or not) depending on the
>>> permissions of the user running the query. It has the advantage of not
>>> requiring changes to the queries that users with different permissions run.
>>> The downside is that users would need to query the schema if they need to
>>> know whether a column is masked, unless we change the names of the returned
>>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
>>> Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support
>>> applying the masking function to columns on the base table, and some of
>>> them also allow applying masking to views.
>>>
>>> 3) Masking functions as part of projected views. This way users might
>>> need to query the view appropriate for their permissions instead of the
>>> base table. This might mean changing the queries if the masking policy is
>>> changed by the admin. MySQL recommends this approach in a blog entry,
>>> although it's not part of its main documentation for data masking, and the
>>> implementation has security issues. Some of the other databases offering
>>> approach 2) as their main option also support masking on view columns.
>>>
>>> Each approach has its own advantages and limitations, and I don't think
>>> we necessarily have to choose. The CEP proposes implementing 1) and 2), but
>>> nothing prevents us from also having 3) if we get to have projected views.
>>> However, I think that projected views are a new general-purpose feature with
>>> their own complexities, so they would deserve their own CEP, if someone is
>>> willing to work on the implementation.
>>>
>>>
>>>
>>> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
>>> [email protected]> wrote:

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Jeremiah D Jordan
A

> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
> 
> Well, I am not convinced these changes will materially impact the outcome, 
> but at least we’ll have some extra fun collating the votes.
> 
> 
>> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>> 
>> 
>> The poll makes sense to me. I would slightly change it to:
>> 
>> A) We shouldn't prefer either approach, and I agree to the implementor 
>> selecting the table schema approach for this CEP
>> B) We should prefer the view approach, but I am not opposed to the 
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should implement 
>> the view approach
>> D) We should NOT implement the table view approach, and should implement the 
>> schema approach
>> E) We should NOT implement the table schema approach, and should implement 
>> some other scheme (or not implement this feature)
>> 
>> Where my vote is for A.
>> 
>> 
>> On Wed, 7 Sept 2022 at 13:12, Benedict wrote:
>> I’m not convinced there’s been adequate resolution over which approach is 
>> adopted. I know you have expressed a preference for the table schema 
>> approach, but the weight of other opinion so far appears to be against this 
>> approach - even if it is broadly adopted by other databases. I will note 
>> that Postgres does not adopt this approach, it has a more sophisticated 
>> security label approach that has not been proposed by anybody so far.
>> 
>> I think extra weight should be given to the implementer’s preference, so 
>> while I personally do not like the table schema approach, I am happy to 
>> accept this is an industry norm, and leave the decision to you.
>> 
>> However, we should ensure the community as a whole endorses this. I think an 
>> indicative poll should be undertaken first, eg:
>> 
>> A) We should implement the table schema approach, as proposed
>> B) We should prefer the view approach, but I am not opposed to the 
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should implement 
>> the view approach
>> D) We should NOT implement the table schema approach, and should implement 
>> some other scheme (or not implement this feature)
>> 
>> Where my vote is B
>> 
>>> On 7 Sep 2022, at 12:50, Andrés de la Peña wrote:
>>> 
>>> 
>>> If nobody has more concerns regarding the CEP I will start the vote 
>>> tomorrow.
>>> 
>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña wrote:
>>> Is there enough support here for VIEWS to be the implementation strategy 
>>> for displaying masking functions?
>>> 
>>> I'm not sure that views should be "the" strategy for masking functions. We 
>>> have multiple approaches here:
>>> 
>>> 1) CQL functions only. Users can decide to use the masking functions of 
>>> their own accord. I think most dbs allow this pattern of usage, which is 
>>> quite straightforward. Obviously, it doesn't allow admins to enforce that 
>>> users see only masked data. Nevertheless, it's still useful for trusted 
>>> database users generating masked data that will be consumed by the end 
>>> users of the application.
>>> 
>>> 2) Masking functions attached to specific columns. This way the same 
>>> queries will see different data (masked or not) depending on the 
>>> permissions of the user running the query. It has the advantage of not 
>>> requiring changes to the queries that users with different permissions run. 
>>> The downside is that users would need to query the schema if they need to 
>>> know whether a column is masked, unless we change the names of the returned 
>>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM 
>>> Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support 
>>> applying the masking function to columns on the base table, and some of 
>>> them also allow applying masking to views.
>>> 
>>> 3) Masking functions as part of projected views. This way users might need 
>>> to query the view appropriate for their permissions instead of the base 
>>> table. This might mean changing the queries if the masking policy is 
>>> changed by the admin. MySQL recommends this approach in a blog entry, 
>>> although it's not part of its main documentation for data masking, and the 
>>> implementation has security issues. Some of the other databases offering 
>>> approach 2) as their main option also support masking on view columns.
>>> 
>>> Each approach has its own advantages and limitations, and I don't think we 
>>> necessarily have to choose. The CEP proposes implementing 1) and 2), but 
>>> nothing prevents us from also having 3) if we get to have projected views. 
>>> However, I think that projected views are a new general-purpose feature 
>>> with their own complexities, so they would deserve their own CEP, if 
>>> someone is willing to work on the implementatio

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Benedict
Well, I am not convinced these changes will materially impact the outcome, but 
at least we’ll have some extra fun collating the votes.


> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
> 
> 
> The poll makes sense to me. I would slightly change it to:
> 
> A) We shouldn't prefer either approach, and I agree to the implementor 
> selecting the table schema approach for this CEP
> B) We should prefer the view approach, but I am not opposed to the 
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement 
> the view approach
> D) We should NOT implement the table view approach, and should implement the 
> schema approach
> E) We should NOT implement the table schema approach, and should implement 
> some other scheme (or not implement this feature)
> 
> Where my vote is for A.
> 
> 
>> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>> I’m not convinced there’s been adequate resolution over which approach is 
>> adopted. I know you have expressed a preference for the table schema 
>> approach, but the weight of other opinion so far appears to be against this 
>> approach - even if it is broadly adopted by other databases. I will note 
>> that Postgres does not adopt this approach, it has a more sophisticated 
>> security label approach that has not been proposed by anybody so far.
>> 
>> I think extra weight should be given to the implementer’s preference, so 
>> while I personally do not like the table schema approach, I am happy to 
>> accept this is an industry norm, and leave the decision to you.
>> 
>> However, we should ensure the community as a whole endorses this. I think an 
>> indicative poll should be undertaken first, eg:
>> 
>> A) We should implement the table schema approach, as proposed
>> B) We should prefer the view approach, but I am not opposed to the 
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should implement 
>> the view approach
>> D) We should NOT implement the table schema approach, and should implement 
>> some other scheme (or not implement this feature)
>> 
>> Where my vote is B
>> 
 On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
 
>>> 
>>> If nobody has more concerns regarding the CEP I will start the vote 
>>> tomorrow.
>>> 
>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña  
>>> wrote:
> Is there enough support here for VIEWS to be the implementation strategy 
> for displaying masking functions?
 
 I'm not sure that views should be "the" strategy for masking functions. We 
 have multiple approaches here:
 
 1) CQL functions only. Users can decide to use the masking functions of 
 their own accord. I think most dbs allow this pattern of usage, which is 
 quite straightforward. Obviously, it doesn't allow admins to enforce 
 that users see only masked data. Nevertheless, it's still useful for 
 trusted database users generating masked data that will be consumed by the 
 end users of the application.
 
 2) Masking functions attached to specific columns. This way the same 
 queries will see different data (masked or not) depending on the 
 permissions of the user running the query. It has the advantage of not 
 requiring changes to the queries that users with different permissions run. 
 The downside is that users would need to query the schema if they need to 
 know whether a column is masked, unless we change the names of the 
 returned columns. This is the approach offered by Azure/SQL Server, 
 PostgreSQL, IBM Db2, Oracle, MariaDB/MaxScale and Snowflake. All these 
 databases support applying the masking function to columns on the base 
 table, and some of them also allow applying masking to views.
 
 3) Masking functions as part of projected views. This way users might 
 need to query the view appropriate for their permissions instead of the 
 base table. This might mean changing the queries if the masking policy is 
 changed by the admin. MySQL recommends this approach in a blog entry, 
 although it's not part of its main documentation for data masking, and the 
 implementation has security issues. Some of the other databases offering 
 approach 2) as their main option also support masking on view columns.
 
 Each approach has its own advantages and limitations, and I don't think we 
 necessarily have to choose. The CEP proposes implementing 1) and 2), but 
 nothing prevents us from also having 3) if we get to have projected views. 
 However, I think that projected views are a new general-purpose feature 
 with their own complexities, so they would deserve their own CEP, if 
 someone is willing to work on the implementation.
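
As a rough illustration of how the three approaches above differ, here is a
minimal Python sketch. It is not CQL and not the CEP's design: mask_inner,
Table, can_unmask and masked_view are hypothetical names made up for this
example, and the permission check is reduced to a single boolean.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]
MaskFn = Callable[[str], str]


def mask_inner(value: str, keep: int = 1, pad: str = "*") -> str:
    """Hypothetical masking function: keep `keep` characters at each end."""
    if len(value) <= 2 * keep:
        return pad * len(value)
    return value[:keep] + pad * (len(value) - 2 * keep) + value[-keep:]


@dataclass
class Table:
    rows: List[Row]
    # Approach 2: masks are attached to columns as part of the table schema.
    column_masks: Dict[str, MaskFn] = field(default_factory=dict)

    def select(self, columns: List[str], can_unmask: bool) -> List[Row]:
        """Same query for everyone; the result depends on the caller's permission."""
        out = []
        for row in self.rows:
            result = {}
            for col in columns:
                mask = self.column_masks.get(col)
                if mask is None or can_unmask:
                    result[col] = row[col]
                else:
                    result[col] = mask(row[col])
            out.append(result)
        return out


def masked_view(table: Table, projections: Dict[str, Optional[MaskFn]]) -> List[Row]:
    """Approach 3: a projected view; restricted users query it instead of the table."""
    return [
        {col: (fn(row[col]) if fn else row[col]) for col, fn in projections.items()}
        for row in table.rows
    ]


users = Table(rows=[{"id": "1", "name": "Alice"}], column_masks={"name": mask_inner})

# Approach 1: a trusted user calls the masking function explicitly.
explicit = [mask_inner(row["name"]) for row in users.rows]

# Approach 2: identical query, different results depending on permissions.
restricted = users.select(["id", "name"], can_unmask=False)
privileged = users.select(["id", "name"], can_unmask=True)

# Approach 3: the admin defines a view and restricted users query that instead.
view_rows = masked_view(users, {"id": None, "name": mask_inner})
```

In this sketch, approach 2 keeps the query unchanged and moves the decision
into the schema plus a permission check, while approach 3 requires the
restricted user to address a different object (the view).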
 
 
 
> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev 
>  wrote:
> Is there enough support here f

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Berenguer Blasi
A. I agree the implementor's preference is an important aspect to take 
into account.


On 7/9/22 15:23, Ekaterina Dimitrova wrote:

A

On Wed, 7 Sep 2022 at 9:05, Andrés de la Peña  
wrote:


The poll makes sense to me. I would slightly change it to:

A) We shouldn't prefer either approach, and I agree to the
implementor selecting the table schema approach for this CEP
B) We should prefer the view approach, but I am not opposed to the
implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should
implement the view approach
D) We should NOT implement the table view approach, and should
implement the schema approach
E) We should NOT implement the table schema approach, and should
implement some other scheme (or not implement this feature)

Where my vote is for A.


On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:

I’m not convinced there’s been adequate resolution over which
approach is adopted. I know you have expressed a preference
for the table schema approach, but the weight of other opinion
so far appears to be against this approach - even if it is
broadly adopted by other databases. I will note that Postgres
does not adopt this approach, it has a more sophisticated
security label approach that has not been proposed by anybody
so far.

I think extra weight should be given to the implementer’s
preference, so while I personally do not like the table schema
approach, I am happy to accept this is an industry norm, and
leave the decision to you.

However, we should ensure the community as a whole endorses
this. I think an indicative poll should be undertaken first, eg:

A) We should implement the table schema approach, as proposed
B) We should prefer the view approach, but I am not opposed to
the implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and
should implement the view approach
D) We should NOT implement the table schema approach, and
should implement some other scheme (or not implement this feature)

Where my vote is B


On 7 Sep 2022, at 12:50, Andrés de la Peña
 wrote:


If nobody has more concerns regarding the CEP I will start
the vote tomorrow.

On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña
 wrote:

Is there enough support here for VIEWS to be the
implementation strategy for displaying masking functions?


I'm not sure that views should be "the" strategy for
masking functions. We have multiple approaches here:

1) CQL functions only. Users can decide to use the
masking functions of their own accord. I think most dbs
allow this pattern of usage, which is quite
straightforward. Obviously, it doesn't allow admins to
enforce that users see only masked data.
Nevertheless, it's still useful for trusted database
users generating masked data that will be consumed by the
end users of the application.

2) Masking functions attached to specific columns. This
way the same queries will see different data (masked or
not) depending on the permissions of the user running the
query. It has the advantage of not requiring to change
the queries that users with different permissions run.
The downside is that users would need to query the schema
if they need to know whether a column is masked, unless
we change the names of the returned columns. This is the
approach offered by Azure/SQL Server, PostgreSQL, IBM
Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these
databases support applying the masking function to
columns on the base table, and some of them also allow to
apply masking to views.

3) Masking functions as part of projected views. This
ways users might need to query the view appropriate for
their permissions instead of the base table. This might
mean changing the queries if the masking policy is
changed by the admin. MySQL recommends this approach on a
blog entry, although it's not part of its main
documentation for data masking, and the implementation
has security issues. Some of the other databases offering
the approach 2) as their main option also support masking
on view columns.

Each approach has its own advantages and limitations, and
I don't think we necessarily have to choose. The CEP
proposes implementing 1) and 2), but no one impe

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Ekaterina Dimitrova
A

On Wed, 7 Sep 2022 at 9:05, Andrés de la Peña  wrote:

> The poll makes sense to me. I would slightly change it to:
>
> A) We shouldn't prefer either approach, and I agree to the implementor
> selecting the table schema approach for this CEP
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement
> the view approach
> D) We should NOT implement the table view approach, and should implement
> the schema approach
> E) We should NOT implement the table schema approach, and should implement
> some other scheme (or not implement this feature)
>
> Where my vote is for A.
>
>
> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>
>> I’m not convinced there’s been adequate resolution over which approach is
>> adopted. I know you have expressed a preference for the table schema
>> approach, but the weight of other opinion so far appears to be against this
>> approach - even if it is broadly adopted by other databases. I will note
>> that Postgres does not adopt this approach, it has a more sophisticated
>> security label approach that has not been proposed by anybody so far.
>>
>> I think extra weight should be given to the implementer’s preference, so
>> while I personally do not like the table schema approach, I am happy to
>> accept this is an industry norm, and leave the decision to you.
>>
>> However, we should ensure the community as a whole endorses this. I think
>> an indicative poll should be undertaken first, eg:
>>
>> A) We should implement the table schema approach, as proposed
>> B) We should prefer the view approach, but I am not opposed to the
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should
>> implement the view approach
>> D) We should NOT implement the table schema approach, and should
>> implement some other scheme (or not implement this feature)
>>
>> Where my vote is B
>>
>> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
>>
>> 
>> If nobody has more concerns regarding the CEP I will start the vote
>> tomorrow.
>>
>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
>> wrote:
>>
>>>> Is there enough support here for VIEWS to be the implementation strategy
>>>> for displaying masking functions?
>>>
>>>
>>> I'm not sure that views should be "the" strategy for masking functions.
>>> We have multiple approaches here:
>>>
>>> 1) CQL functions only. Users can decide to use the masking functions on
>>> their own will. I think most dbs allow this pattern of usage, which is
>>> quite straightforward. Obviously, it doesn't allow admins to decide enforce
>>> users seeing only masked data. Nevertheless, it's still useful for trusted
>>> database users generating masked data that will be consumed by the end
>>> users of the application.
>>>
>>> 2) Masking functions attached to specific columns. This way the same
>>> queries will see different data (masked or not) depending on the
>>> permissions of the user running the query. It has the advantage of not
>>> requiring to change the queries that users with different permissions run.
>>> The downside is that users would need to query the schema if they need to
>>> know whether a column is masked, unless we change the names of the returned
>>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
>>> Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
>>> applying the masking function to columns on the base table, and some of
>>> them also allow to apply masking to views.
>>>
>>> 3) Masking functions as part of projected views. This ways users might
>>> need to query the view appropriate for their permissions instead of the
>>> base table. This might mean changing the queries if the masking policy is
>>> changed by the admin. MySQL recommends this approach on a blog entry,
>>> although it's not part of its main documentation for data masking, and the
>>> implementation has security issues. Some of the other databases offering
>>> the approach 2) as their main option also support masking on view columns.
>>>
>>> Each approach has its own advantages and limitations, and I don't think
>>> we necessarily have to choose. The CEP proposes implementing 1) and 2), but
>>> no one impedes us to also have 3) if we get to have projected views.
>>> However, I think that projected views is a new general-purpose feature with
>>> its own complexities, so it would deserve its own CEP, if someone is
>>> willing to work on the implementation.
>>>
>>>
>>>
>>> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
>>> [email protected]> wrote:
>>>
>>>> Is there enough support here for VIEWS to be the implementation
>>>> strategy for displaying masking functions?
>>>>
>>>> It seems to me the view would have to store the query and apply a where
>>>> clause to it, so the same PK would be in play.
>>>>
>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Andrés de la Peña
The poll makes sense to me. I would slightly change it to:

A) We shouldn't prefer either approach, and I agree to the implementor
selecting the table schema approach for this CEP
B) We should prefer the view approach, but I am not opposed to the
implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should implement
the view approach
D) We should NOT implement the table view approach, and should implement
the schema approach
E) We should NOT implement the table schema approach, and should implement
some other scheme (or not implement this feature)

Where my vote is for A.


On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:

> I’m not convinced there’s been adequate resolution over which approach is
> adopted. I know you have expressed a preference for the table schema
> approach, but the weight of other opinion so far appears to be against this
> approach - even if it is broadly adopted by other databases. I will note
> that Postgres does not adopt this approach, it has a more sophisticated
> security label approach that has not been proposed by anybody so far.
>
> I think extra weight should be given to the implementer’s preference, so
> while I personally do not like the table schema approach, I am happy to
> accept this is an industry norm, and leave the decision to you.
>
> However, we should ensure the community as a whole endorses this. I think
> an indicative poll should be undertaken first, eg:
>
> A) We should implement the table schema approach, as proposed
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement
> the view approach
> D) We should NOT implement the table schema approach, and should implement
> some other scheme (or not implement this feature)
>
> Where my vote is B
>
> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
>
> 
> If nobody has more concerns regarding the CEP I will start the vote
> tomorrow.
>
> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
> wrote:
>
>>> Is there enough support here for VIEWS to be the implementation strategy
>>> for displaying masking functions?
>>
>>
>> I'm not sure that views should be "the" strategy for masking functions.
>> We have multiple approaches here:
>>
>> 1) CQL functions only. Users can decide to use the masking functions on
>> their own will. I think most dbs allow this pattern of usage, which is
>> quite straightforward. Obviously, it doesn't allow admins to decide enforce
>> users seeing only masked data. Nevertheless, it's still useful for trusted
>> database users generating masked data that will be consumed by the end
>> users of the application.
>>
>> 2) Masking functions attached to specific columns. This way the same
>> queries will see different data (masked or not) depending on the
>> permissions of the user running the query. It has the advantage of not
>> requiring to change the queries that users with different permissions run.
>> The downside is that users would need to query the schema if they need to
>> know whether a column is masked, unless we change the names of the returned
>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
>> Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
>> applying the masking function to columns on the base table, and some of
>> them also allow to apply masking to views.
>>
>> 3) Masking functions as part of projected views. This ways users might
>> need to query the view appropriate for their permissions instead of the
>> base table. This might mean changing the queries if the masking policy is
>> changed by the admin. MySQL recommends this approach on a blog entry,
>> although it's not part of its main documentation for data masking, and the
>> implementation has security issues. Some of the other databases offering
>> the approach 2) as their main option also support masking on view columns.
>>
>> Each approach has its own advantages and limitations, and I don't think
>> we necessarily have to choose. The CEP proposes implementing 1) and 2), but
>> no one impedes us to also have 3) if we get to have projected views.
>> However, I think that projected views is a new general-purpose feature with
>> its own complexities, so it would deserve its own CEP, if someone is
>> willing to work on the implementation.
>>
>>
>>
>> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
>> [email protected]> wrote:
>>
>>> Is there enough support here for VIEWS to be the implementation strategy
>>> for displaying masking functions?
>>>
>>> It seems to me the view would have to store the query and apply a where
>>> clause to it, so the same PK would be in play.
>>>
>>> It has data leaking properties.
>>>
>>> It has more use cases as it can be used to
>>>
>>>   - construct views that filter out sensitive columns
>>>   - apply transforms

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Claude Warren via dev

My vote is B

On 07/09/2022 13:12, Benedict wrote:
I’m not convinced there’s been adequate resolution over which approach 
is adopted. I know you have expressed a preference for the table 
schema approach, but the weight of other opinion so far appears to be 
against this approach - even if it is broadly adopted by other 
databases. I will note that Postgres does not adopt this approach, it 
has a more sophisticated security label approach that has not been 
proposed by anybody so far.


I think extra weight should be given to the implementer’s preference, 
so while I personally do not like the table schema approach, I am 
happy to accept this is an industry norm, and leave the decision to you.


However, we should ensure the community as a whole endorses this. I 
think an indicative poll should be undertaken first, eg:


A) We should implement the table schema approach, as proposed
B) We should prefer the view approach, but I am not opposed to the 
implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should 
implement the view approach
D) We should NOT implement the table schema approach, and should 
implement some other scheme (or not implement this feature)


Where my vote is B


On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:


If nobody has more concerns regarding the CEP I will start the vote 
tomorrow.


On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
 wrote:


Is there enough support here for VIEWS to be the
implementation strategy for displaying masking functions?


I'm not sure that views should be "the" strategy for masking
functions. We have multiple approaches here:

1) CQL functions only. Users can decide to use the masking
functions on their own will. I think most dbs allow this pattern
of usage, which is quite straightforward. Obviously, it doesn't
allow admins to decide enforce users seeing only masked data.
Nevertheless, it's still useful for trusted database users
generating masked data that will be consumed by the end users of
the application.

2) Masking functions attached to specific columns. This way the
same queries will see different data (masked or not) depending on
the permissions of the user running the query. It has the
advantage of not requiring to change the queries that users with
different permissions run. The downside is that users would need
to query the schema if they need to know whether a column is
masked, unless we change the names of the returned columns. This
is the approach offered by Azure/SQL Server, PostgreSQL, IBM Db2,
Oracle, MariaDB/MaxScale and SnowFlake. All these databases
support applying the masking function to columns on the base
table, and some of them also allow to apply masking to views.

3) Masking functions as part of projected views. This ways users
might need to query the view appropriate for their permissions
instead of the base table. This might mean changing the queries
if the masking policy is changed by the admin. MySQL recommends
this approach on a blog entry, although it's not part of its main
documentation for data masking, and the implementation has
security issues. Some of the other databases offering the
approach 2) as their main option also support masking on view
columns.

Each approach has its own advantages and limitations, and I don't
think we necessarily have to choose. The CEP proposes
implementing 1) and 2), but no one impedes us to also have 3) if
we get to have projected views. However, I think that projected
views is a new general-purpose feature with its own complexities,
so it would deserve its own CEP, if someone is willing to work on
the implementation.



On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev
 wrote:

Is there enough support here for VIEWS to be the
implementation strategy for displaying masking functions?

It seems to me the view would have to store the query and
apply a where clause to it, so the same PK would be in play.

It has data leaking properties.

It has more use cases as it can be used to

  * construct views that filter out sensitive columns
  * apply transforms to convert units of measure

Are there more thoughts along this line?


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Benedict
I’m not convinced there’s been adequate resolution over which approach is 
adopted. I know you have expressed a preference for the table schema approach, 
but the weight of other opinion so far appears to be against this approach - 
even if it is broadly adopted by other databases. I will note that Postgres
does not adopt this approach; it has a more sophisticated security label
approach that has not been proposed by anybody so far.

I think extra weight should be given to the implementer’s preference, so while 
I personally do not like the table schema approach, I am happy to accept this 
is an industry norm, and leave the decision to you.

However, we should ensure the community as a whole endorses this. I think an
indicative poll should be undertaken first, e.g.:

A) We should implement the table schema approach, as proposed
B) We should prefer the view approach, but I am not opposed to the implementor 
selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should implement the 
view approach
D) We should NOT implement the table schema approach, and should implement some 
other scheme (or not implement this feature)

Where my vote is B

> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
> 
> 
> If nobody has more concerns regarding the CEP I will start the vote tomorrow.
> 
> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña  wrote:
>>> Is there enough support here for VIEWS to be the implementation strategy 
>>> for displaying masking functions?
>> 
>> I'm not sure that views should be "the" strategy for masking functions. We 
>> have multiple approaches here:
>> 
>> 1) CQL functions only. Users can decide to use the masking functions on 
>> their own will. I think most dbs allow this pattern of usage, which is quite 
>> straightforward. Obviously, it doesn't allow admins to decide enforce users 
>> seeing only masked data. Nevertheless, it's still useful for trusted 
>> database users generating masked data that will be consumed by the end users 
>> of the application.
>> 
>> 2) Masking functions attached to specific columns. This way the same queries 
>> will see different data (masked or not) depending on the permissions of the 
>> user running the query. It has the advantage of not requiring to change the 
>> queries that users with different permissions run. The downside is that 
>> users would need to query the schema if they need to know whether a column 
>> is masked, unless we change the names of the returned columns. This is the 
>> approach offered by Azure/SQL Server, PostgreSQL, IBM Db2, Oracle, 
>> MariaDB/MaxScale and SnowFlake. All these databases support applying the 
>> masking function to columns on the base table, and some of them also allow 
>> to apply masking to views.
>> 
>> 3) Masking functions as part of projected views. This ways users might need 
>> to query the view appropriate for their permissions instead of the base 
>> table. This might mean changing the queries if the masking policy is changed 
>> by the admin. MySQL recommends this approach on a blog entry, although it's 
>> not part of its main documentation for data masking, and the implementation 
>> has security issues. Some of the other databases offering the approach 2) as 
>> their main option also support masking on view columns.
>> 
>> Each approach has its own advantages and limitations, and I don't think we 
>> necessarily have to choose. The CEP proposes implementing 1) and 2), but no 
>> one impedes us to also have 3) if we get to have projected views. However, I 
>> think that projected views is a new general-purpose feature with its own 
>> complexities, so it would deserve its own CEP, if someone is willing to work 
>> on the implementation.
>> 
>> 
>> 
>> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev 
>>  wrote:
>>> Is there enough support here for VIEWS to be the implementation strategy 
>>> for displaying masking functions?
>>> 
>>> It seems to me the view would have to store the query and apply a where 
>>> clause to it, so the same PK would be in play.
>>> 
>>> It has data leaking properties.
>>> 
>>> It has more use cases as it can be used to
>>> 
>>>   - construct views that filter out sensitive columns
>>>   - apply transforms to convert units of measure
>>>
>>> Are there more thoughts along this line?


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Andrés de la Peña
If nobody has more concerns regarding the CEP I will start the vote
tomorrow.

On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
wrote:

>> Is there enough support here for VIEWS to be the implementation strategy
>> for displaying masking functions?
>
>
> I'm not sure that views should be "the" strategy for masking functions. We
> have multiple approaches here:
>
> 1) CQL functions only. Users can decide to use the masking functions on
> their own will. I think most dbs allow this pattern of usage, which is
> quite straightforward. Obviously, it doesn't allow admins to decide enforce
> users seeing only masked data. Nevertheless, it's still useful for trusted
> database users generating masked data that will be consumed by the end
> users of the application.
>
> 2) Masking functions attached to specific columns. This way the same
> queries will see different data (masked or not) depending on the
> permissions of the user running the query. It has the advantage of not
> requiring to change the queries that users with different permissions run.
> The downside is that users would need to query the schema if they need to
> know whether a column is masked, unless we change the names of the returned
> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
> Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
> applying the masking function to columns on the base table, and some of
> them also allow to apply masking to views.
>
> 3) Masking functions as part of projected views. This ways users might
> need to query the view appropriate for their permissions instead of the
> base table. This might mean changing the queries if the masking policy is
> changed by the admin. MySQL recommends this approach on a blog entry,
> although it's not part of its main documentation for data masking, and the
> implementation has security issues. Some of the other databases offering
> the approach 2) as their main option also support masking on view columns.
>
> Each approach has its own advantages and limitations, and I don't think we
> necessarily have to choose. The CEP proposes implementing 1) and 2), but no
> one impedes us to also have 3) if we get to have projected views. However,
> I think that projected views is a new general-purpose feature with its own
> complexities, so it would deserve its own CEP, if someone is willing to
> work on the implementation.
>
>
>
> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
> [email protected]> wrote:
>
>> Is there enough support here for VIEWS to be the implementation strategy
>> for displaying masking functions?
>>
>> It seems to me the view would have to store the query and apply a where
>> clause to it, so the same PK would be in play.
>>
>> It has data leaking properties.
>>
>> It has more use cases as it can be used to
>>
>>   - construct views that filter out sensitive columns
>>   - apply transforms to convert units of measure
>>
>> Are there more thoughts along this line?
>>
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-31 Thread Andrés de la Peña
>
> Is there enough support here for VIEWS to be the implementation strategy
> for displaying masking functions?


I'm not sure that views should be "the" strategy for masking functions. We
have multiple approaches here:

1) CQL functions only. Users can decide to use the masking functions of
their own accord. I think most dbs allow this pattern of usage, which is
quite straightforward. Obviously, it doesn't allow admins to enforce that
users see only masked data. Nevertheless, it's still useful for trusted
database users generating masked data that will be consumed by the end
users of the application.

2) Masking functions attached to specific columns. This way the same
queries will see different data (masked or not) depending on the
permissions of the user running the query. It has the advantage of not
requiring changes to the queries that users with different permissions run.
The downside is that users would need to query the schema if they need to
know whether a column is masked, unless we change the names of the returned
columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support
applying the masking function to columns on the base table, and some of
them also allow applying masking to views.

3) Masking functions as part of projected views. This way users might need
to query the view appropriate for their permissions instead of the base
table. This might mean changing the queries if the masking policy is
changed by the admin. MySQL recommends this approach in a blog entry,
although it's not part of its main documentation for data masking, and the
implementation has security issues. Some of the other databases offering
approach 2) as their main option also support masking on view columns.

Each approach has its own advantages and limitations, and I don't think we
necessarily have to choose. The CEP proposes implementing 1) and 2), but
nothing prevents us from also having 3) if we get to have projected views.
However, I think that projected views are a new general-purpose feature
with their own complexities, so they would deserve their own CEP, if
someone is willing to work on the implementation.
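
To make the difference between approaches 1) and 2) concrete, here is a minimal Python sketch. It is only a toy model, not the CEP's design: the `mask_inner` helper, the `Table` class and the way permissions are passed as a set of strings are all assumptions made for illustration. The `UNMASK` permission name is borrowed from elsewhere in this thread.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


def mask_inner(value: str, keep_start: int, keep_end: int, pad: str = "*") -> str:
    """Hypothetical masking function: hide everything except the outer characters."""
    if keep_start + keep_end >= len(value):
        return value
    hidden = len(value) - keep_start - keep_end
    return value[:keep_start] + pad * hidden + value[len(value) - keep_end:]


@dataclass
class Table:
    """Toy table; `masks` models masking functions attached to columns (approach 2)."""
    rows: List[dict]
    masks: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def select(self, columns: List[str], permissions: Set[str]) -> List[dict]:
        # The query is the same for every user; only the permissions differ.
        unmasked = "UNMASK" in permissions
        result = []
        for row in self.rows:
            out = {}
            for col in columns:
                mask = self.masks.get(col)
                out[col] = row[col] if (unmasked or mask is None) else mask(row[col])
            result.append(out)
        return result


patients = Table(
    rows=[{"id": 1, "name": "Alice Smith"}],
    masks={"name": lambda v: mask_inner(v, 1, 0)},
)

# Approach 1: a trusted user calls the masking function explicitly.
explicit = mask_inner("Alice Smith", 1, 0)

# Approach 2: identical queries, different results per permission set.
masked_view = patients.select(["id", "name"], permissions={"SELECT"})
clear_view = patients.select(["id", "name"], permissions={"SELECT", "UNMASK"})
```

In this model the caller of `select` cannot tell from the result alone whether `name` was masked, which is the downside noted for approach 2).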



On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
[email protected]> wrote:

> Is there enough support here for VIEWS to be the implementation strategy
> for displaying masking functions?
>
> It seems to me the view would have to store the query and apply a where
> clause to it, so the same PK would be in play.
>
> It has data leaking properties.
>
> It has more use cases as it can be used to
>
>   - construct views that filter out sensitive columns
>   - apply transforms to convert units of measure
>
> Are there more thoughts along this line?
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-31 Thread Claude Warren via dev
Is there enough support here for VIEWS to be the implementation strategy 
for displaying masking functions?


It seems to me the view would have to store the query and apply a where 
clause to it, so the same PK would be in play.


It has data leaking properties.

It has more use cases as it can be used to

 * construct views that filter out sensitive columns
 * apply transforms to convert units of measure

Are there more thoughts along this line?


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-30 Thread Benedict
Not to push the point too strongly (I don’t have a very firm view of my own), 
but if we provide this via a view feature we’re just implementing one new 
feature and we get masking for free. I don’t think it is materially more 
complicated than redefining columns for users - it might even be less so, as we 
do not have to consider how applications interpret table metadata. 

Projection views are a very simple concept and pretty simple to implement I 
think, and conceptually very familiar to users. So let’s at least not prefer 
the table column modifier approach because it’s simpler or requires fewer new 
features, as I do not believe this to be the case.
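
As a rough illustration of what a projection view could look like, here is a small Python sketch. It is an assumption-laden toy, not a proposal for the actual implementation: `ProjectionView`, its `projection` mapping and the inline masking lambda are hypothetical names. The view keeps a reference to the base rows and applies a function per exposed column at read time, so nothing is materialized.

```python
from typing import Callable, Dict, List


class ProjectionView:
    """Toy non-materialized view: a subset of fields with functions applied."""

    def __init__(self, base_rows: List[dict], projection: Dict[str, Callable[[dict], object]]):
        self.base_rows = base_rows      # shared reference to the base table, not a copy
        self.projection = projection    # output column name -> function of the base row

    def select(self) -> List[dict]:
        # Columns that are not part of the projection are never exposed.
        return [
            {col: fn(row) for col, fn in self.projection.items()}
            for row in self.base_rows
        ]


base = [{"id": 1, "name": "Alice", "ssn": "123-45-6789"}]

view = ProjectionView(
    base,
    {
        "id": lambda r: r["id"],
        "name": lambda r: r["name"][0] + "***",  # hypothetical masking expression
    },
)

first = view.select()

# A later write to the base table is visible through the view on the next read.
base.append({"id": 2, "name": "Bob", "ssn": "987-65-4321"})
second = view.select()
```

A role granted access only to `view` would see `id` and the masked `name`, and would have no way to reach `ssn`.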

> On 30 Aug 2022, at 12:46, Andrés de la Peña  wrote:
> 
> 
>> GRANT SELECT ON foo.unmasked_name TO top_secret;
> 
> Note that Cassandra doesn't have support for column-level permissions. There 
> was an initiative to add them in 2016, CASSANDRA-12859. However, the ticket 
> has been inactive since 2017. The last comments seem some discussions about 
> design.
> 
> Also, generated columns in PostgreSQL are always stored, so if they were used 
> for masking they would constitute static data masking, not dynamic. 
> 
> The approach for dynamic data masking that PostgreSQL suggests on its 
> documentation doesn't seem based on generating a masked copy of the column, 
> neither on a generated column or on a view. Instead, it uses security labels 
> to associate columns to users and masking functions. That way, the same 
> column will be seen masked or unmasked depending on the user. 
> 
> I'd say that applying the masking rule to the base column itself, and not to 
> a copy, is the most common approach among the discussed databases so far. 
> Also, it has the advantage for us of not being based on other relatively 
> complex features that we miss, such as column-level permissions or 
> not-materialized views. If someday we add those features I think they would 
> play well with what is proposed on the CEP.
> 
>> On Tue, 30 Aug 2022 at 11:46, Avi Kivity via dev  
>> wrote:
>> Agree with views, or alternatively, column permissions together with 
>> computed columns:
>> 
>> 
>> 
>> CREATE TABLE foo (
>>   id int PRIMARY KEY,
>>   unmasked_name text,
>>   name text GENERATED ALWAYS AS (some_mask_function(unmasked_name, 'xxx', 7)) STORED
>> )
>> 
>> 
>> 
>> (syntax from postgresql)
>> 
>> 
>> 
>> GRANT SELECT ON foo.name TO general_use;
>> 
>> GRANT SELECT ON foo.unmasked_name TO top_secret;
>> 
>> 
>> 
>>> On 26/08/2022 00.10, Benedict wrote:
>>> I’m inclined to agree that this seems a more straightforward approach that 
>>> makes fewer implied promises.
>>> 
>>> Perhaps we could deliver simple views backed by virtual tables, and model 
>>> our approach on that of Postgres, MySQL et al?
>>> 
>>> Views in C* would be very simple, just offering a subset of fields with 
>>> some UDFs applied. It would allow users to define roles with access only to 
>>> the views, or for applications to use the views for presentation purposes.
>>> 
>>> It feels like a cleaner approach to me, and we’d get two features for the 
>>> price of one. BUT I don’t feel super strongly about this.
>>> 
>>>> On 25 Aug 2022, at 20:16, Derek Chen-Becker wrote:
>>>>
>>>>
>>>> To make sure I understand, if I wanted to use a masked column for a
>>>> conditional update, you're saying we would need SELECT_MASKED to use it in
>>>> the IF clause? I worry that this proposal is increasing in complexity; I
>>>> would actually be OK starting with something smaller in scope. Perhaps
>>>> just providing the masking functions and not tying masking to schema would
>>>> be sufficient for an initial goal? That wouldn't preclude additional
>>>> permissions, schema integration, or perhaps just plain Views in the future.
>>>>
>>>> Cheers,
>>>>
>>>> Derek
>>>>
>>>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña
>>>> wrote:
>>>>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>>>>> masked columns on WHERE/IF clauses would require having SELECT and either
>>>>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>>>>> query results would always require both SELECT and UNMASK.
>>>>>
>>>>> This way we can have the best of both worlds, allowing admins to decide
>>>>> whether they trust their immediate users or not. wdyt?
>>>>>
>>>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo
>>>>> wrote:
>>>>>> This is the difference between security and compliance I guess :-D
>>>>>>
>>>>>> The way I see this, the attacker or threat in this concept is not the
>>>>>> developer with access to the database. Rather a feature like this is
>>>>>> just a convenient way to apply some masking rule in a centralized way.
>>>>>> The protection is against an end user of the application, who should not
>>>>>> be able to see the personal data of someone else. Or themselves, even.
>>>>>> As long as the application end user doesn't have access to run arbitrary
>>>>>> CQL, then t

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-30 Thread Andrés de la Peña
>
> GRANT SELECT ON foo.unmasked_name TO top_secret;


Note that Cassandra doesn't have support for column-level permissions.
There was an initiative to add them in 2016, CASSANDRA-12859. However, the
ticket has been inactive since 2017. The last comments seem to be
discussions about design.

Also, generated columns in PostgreSQL are always stored, so if they were
used for masking they would constitute static data masking, not dynamic.

The approach for dynamic data masking that PostgreSQL suggests in its
documentation doesn't seem to be based on generating a masked copy of the
column, either in a generated column or in a view. Instead, it uses security
labels to associate columns with users and masking functions. That way, the
same column will be seen masked or unmasked depending on the user.
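
To make the "same column, masked or unmasked depending on the user" idea
concrete, here is a minimal Python sketch. It is not PostgreSQL or Cassandra
code: the role names, the UNMASK permission string and the mask_name function
are assumptions made purely for illustration. The stored value is never
rewritten; the mask is applied at read time depending on who is asking.

```python
# Illustrative sketch only: the roles, permission names and masking
# function below are hypothetical, not an API of any database.

def mask_name(value: str) -> str:
    """Keep the first character and replace the rest with '*'."""
    return value[:1] + "*" * (len(value) - 1)

# Masking rules are attached to the base column itself, not to a copy.
COLUMN_MASKS = {"name": mask_name}

# Hypothetical role -> permissions mapping.
ROLE_PERMISSIONS = {
    "general_use": {"SELECT"},
    "top_secret": {"SELECT", "UNMASK"},
}

def read_row(row: dict, role: str) -> dict:
    """Return the row as the given role would see it."""
    permissions = ROLE_PERMISSIONS[role]
    if "SELECT" not in permissions:
        raise PermissionError(f"{role} cannot read this table")
    if "UNMASK" in permissions:
        return dict(row)
    # Apply each column's mask on the way out; the stored row is untouched.
    return {
        col: COLUMN_MASKS.get(col, lambda v: v)(val)
        for col, val in row.items()
    }

stored = {"id": 1, "name": "Joseph"}
print(read_row(stored, "general_use"))  # name is masked
print(read_row(stored, "top_secret"))   # name is shown as stored
```

With static masking, by contrast, the masked value would be computed once
and stored alongside (or instead of) the original.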

I'd say that applying the masking rule to the base column itself, and not
to a copy, is the most common approach among the databases discussed so
far. Also, it has the advantage for us of not being based on other
relatively complex features that we lack, such as column-level permissions
or non-materialized views. If someday we add those features I think they
would play well with what is proposed in the CEP.

On Tue, 30 Aug 2022 at 11:46, Avi Kivity via dev 
wrote:

> Agree with views, or alternatively, column permissions together with
> computed columns:
>
>
> CREATE TABLE foo (
>
>   id int PRIMARY KEY,
>
>   unmasked_name text,
>
>   name text GENERATED ALWAYS AS some_mask_function(text, 'xxx', 7)
>
> )
>
>
> (syntax from postgresql)
>
>
> GRANT SELECT ON foo.name TO general_use;
>
> GRANT SELECT ON foo.unmasked_name TO top_secret;
>
>
> On 26/08/2022 00.10, Benedict wrote:
>
> I’m inclined to agree that this seems a more straightforward approach that
> makes fewer implied promises.
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?
>
> Views in C* would be very simple, just offering a subset of fields with
> some UDFs applied. It would allow users to define roles with access only to
> the views, or for applications to use the views for presentation purposes.
>
> It feels like a cleaner approach to me, and we’d get two features for the
> price of one. BUT I don’t feel super strongly about this.
>
> On 25 Aug 2022, at 20:16, Derek Chen-Becker 
>  wrote:
>
> 
> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
 Is it typical for a masking feature to make no effort to prevent
 unmasking? I’m just struggling to see the value of this without such
 mechanisms. Otherwise it’s just a default formatter, and we should consider
 renaming the feature IMO

 On 23 Aug 2022, at 21:27, Andrés de la Peña 
 wrote:

 
 As mentioned in the CEP document, dynamic data masking doesn't try to
 prevent malicious users with SELECT permissions to indirectly guess the
 real value of the masked value. This can easily be done by just trying
 values on the WHERE clause of SELECT queries. DDM would not be a
 replacement for proper column-level permissions.

 T

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-30 Thread Avi Kivity via dev
Agree with views, or alternatively, column permissions together with 
computed columns:



CREATE TABLE foo (
  id int PRIMARY KEY,
  unmasked_name text,
  name text GENERATED ALWAYS AS some_mask_function(text, 'xxx', 7)
)


(syntax from postgresql)


GRANT SELECT ON foo.name TO general_use;

GRANT SELECT ON foo.unmasked_name TO top_secret;


On 26/08/2022 00.10, Benedict wrote:
I’m inclined to agree that this seems a more straightforward approach 
that makes fewer implied promises.


Perhaps we could deliver simple views backed by virtual tables, and 
model our approach on that of Postgres, MySQL et al?


Views in C* would be very simple, just offering a subset of fields 
with some UDFs applied. It would allow users to define roles with 
access only to the views, or for applications to use the views for 
presentation purposes.


It feels like a cleaner approach to me, and we’d get two features for 
the price of one. BUT I don’t feel super strongly about this.


On 25 Aug 2022, at 20:16, Derek Chen-Becker  
wrote:



To make sure I understand, if I wanted to use a masked column for a 
conditional update, you're saying we would need SELECT_MASKED to use 
it in the IF clause? I worry that this proposal is increasing in 
complexity; I would actually be OK starting with something smaller in 
scope. Perhaps just providing the masking functions and not tying 
masking to schema would be sufficient for an initial goal? That 
wouldn't preclude additional permissions, schema integration, or 
perhaps just plain Views in the future.


Cheers,

Derek

On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
 wrote:


I have modified the proposal adding a new SELECT_MASKED
permission. Using masked columns on WHERE/IF clauses would
require having SELECT and either UNMASK or SELECT_MASKED
permissions. Seeing the unmasked values in the query results
would always require both SELECT and UNMASK.

This way we can have the best of both worlds, allowing admins to
decide whether they trust their immediate users or not. wdyt?

On Wed, 24 Aug 2022 at 16:06, Henrik Ingo
 wrote:

This is the difference between security and compliance I
guess :-D

The way I see this, the attacker or threat in this concept is
not the developer with access to the database. Rather a
feature like this is just a convenient way to apply some
masking rule in a centralized way. The protection is against
an end user of the application, who should not be able to see
the personal data of someone else. Or themselves, even. As
long as the application end user doesn't have access to run
arbitrary CQL, then these forms of masking prevent
accidental unauthorized use/leaking of personal data.

henrik



On Wed, Aug 24, 2022 at 10:40 AM Benedict
 wrote:

Is it typical for a masking feature to make no effort to
prevent unmasking? I’m just struggling to see the value
of this without such mechanisms. Otherwise it’s just a
default formatter, and we should consider renaming the
feature IMO


On 23 Aug 2022, at 21:27, Andrés de la Peña
 wrote:


As mentioned in the CEP document, dynamic data masking
doesn't try to prevent malicious users with SELECT
permissions to indirectly guess the real value of the
masked value. This can easily be done by just trying
values on the WHERE clause of SELECT queries. DDM would
not be a replacement for proper column-level permissions.

The data served by the database is usually consumed by
applications that present this data to end users. These
end users are not necessarily the users directly
connecting to the database. With DDM, it would be easy
for applications to mask sensitive data that is going to
be consumed by the end users. However, the users
directly connecting to the database should be trusted,
provided that they have the right SELECT permissions.

In other words, DDM doesn't directly protect the data,
but it eases the production of protected data.

Said that, we could later go one step ahead and add a
way to prevent untrusted users from inferring the masked
data. That could be done adding a new permission
required to use certain columns on WHERE clauses,
different to the current SELECT permission. That would
play especially well with column-level permissions,
which is something that we still have pending.

On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz
 wrote:

Applying this should prevent querying on a
field, else you could leak its contents, surely?


   

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-26 Thread Andrés de la Peña
>
> Yes, I was thinking that simple projection views (essentially a SELECT
> statement with application of transform functions) would complement masking
> functions, and from the discussion it sounds like this is basically what
> some of the other databases do.


I don't see that the mentioned databases in general suggest using views for
dynamic data masking. So far, I have only seen this blog post entry
suggesting to use MySQL's non-materialized views with masking functions,
probably because MySQL lacks the more sophisticated mechanisms for data
masking that other databases offer.

However, using MySQL views can allow malicious users to run queries to
infer the masked data, which is what we were trying to avoid. For example:

CREATE TABLE employees(
 id INT NOT NULL AUTO_INCREMENT,
 name VARCHAR(100) NOT NULL,
 PRIMARY KEY (id));

CREATE VIEW employee_mask AS SELECT
  id,
  mask_inner(name, 1, 0, _binary'*') AS name
  FROM employees;

INSERT INTO employees(name) SELECT "Joseph";
INSERT INTO employees(name) SELECT "Olivia";

SELECT * FROM employee_mask WHERE name="Joseph";
+----+--------+
| id | name   |
+----+--------+
|  1 | J***** |
+----+--------+
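
For anyone not familiar with mask_inner, this is roughly what the call in
the view above computes. The Python below is only my approximation of the
documented behaviour (keep margin1 characters on the left and margin2 on the
right, mask everything in between), not MySQL's implementation; the default
mask character 'X' is how I understand the MySQL default.

```python
def mask_inner(value: str, margin1: int, margin2: int, mask_char: str = "X") -> str:
    """Approximate MySQL's mask_inner(): mask the interior of a string.

    margin1 characters are left unmasked at the start and margin2 at the
    end. If the margins cover the whole string, it is returned unchanged.
    """
    if margin1 < 0 or margin2 < 0:
        raise ValueError("margins must be non-negative")
    if margin1 + margin2 >= len(value):
        return value
    masked_length = len(value) - margin1 - margin2
    # Slicing from len(value) - margin2 yields "" when margin2 is 0.
    return value[:margin1] + mask_char * masked_length + value[len(value) - margin2:]

print(mask_inner("Joseph", 1, 0, "*"))
print(mask_inner("Olivia", 1, 0, "*"))
```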

On Fri, 26 Aug 2022 at 02:45, Derek Chen-Becker 
wrote:

> Yes, I was thinking that simple projection views (essentially a SELECT
> statement with application of transform functions) would complement masking
> functions, and from the discussion it sounds like this is basically what
> some of the other databases do. Projection views seem like they would be
> useful in their own right, so would it be proper to write a separate CEP
> for that? I would be happy to help drive that document and discussion. I'm
> not sure if it's the best name, but I'm trying to distinguish views that
> expose a subset of an existing schema vs materialized views, which offer
> more complex capabilities.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022, 3:11 PM Benedict  wrote:
>
>> I’m inclined to agree that this seems a more straightforward approach
>> that makes fewer implied promises.
>>
>> Perhaps we could deliver simple views backed by virtual tables, and model
>> our approach on that of Postgres, MySQL et al?
>>
>> Views in C* would be very simple, just offering a subset of fields with
>> some UDFs applied. It would allow users to define roles with access only to
>> the views, or for applications to use the views for presentation purposes.
>>
>> It feels like a cleaner approach to me, and we’d get two features for the
>> price of one. BUT I don’t feel super strongly about this.
>>
>> On 25 Aug 2022, at 20:16, Derek Chen-Becker 
>> wrote:
>>
>> 
>> To make sure I understand, if I wanted to use a masked column for a
>> conditional update, you're saying we would need SELECT_MASKED to use it in
>> the IF clause? I worry that this proposal is increasing in complexity; I
>> would actually be OK starting with something smaller in scope. Perhaps just
>> providing the masking functions and not tying masking to schema would be
>> sufficient for an initial goal? That wouldn't preclude additional
>> permissions, schema integration, or perhaps just plain Views in the future.
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
>> wrote:
>>
>>> I have modified the proposal adding a new SELECT_MASKED permission.
>>> Using masked columns on WHERE/IF clauses would require having SELECT and
>>> either UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in
>>> the query results would always require both SELECT and UNMASK.
>>>
>>> This way we can have the best of both worlds, allowing admins to decide
>>> whether they trust their immediate users or not. wdyt?
>>>
>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>>> wrote:
>>>
 This is the difference between security and compliance I guess :-D

 The way I see this, the attacker or threat in this concept is not the
 developer with access to the database. Rather a feature like this is just a
 convenient way to apply some masking rule in a centralized way. The
 protection is against an end user of the application, who should not be
 able to see the personal data of someone else. Or themselves, even. As long
 as the application end user doesn't have access to run arbitrary CQL, then
 these forms of masking prevent accidental unauthorized use/leaking of
 personal data.

 henrik



 On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider
> renaming the feature IMO
>
> On 23 Aug 2022, at 21:27, Andrés de la Peña 
> wrote:
>
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to
> prevent malicious users w

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-26 Thread Benjamin Lerer
Views (even projection-only views) are a completely new feature with their
own set of complexities and limitations. My first feeling is that it might
not be as simple as it sounds. There is a significant number of use cases to
cover. It will definitely require its own CEP. :-)

I like Andrés' proposal. It offers some nice and easy to use safeguards.

On Fri, 26 Aug 2022 at 03:45, Derek Chen-Becker wrote:

> Yes, I was thinking that simple projection views (essentially a SELECT
> statement with application of transform functions) would complement masking
> functions, and from the discussion it sounds like this is basically what
> some of the other databases do. Projection views seem like they would be
> useful in their own right, so would it be proper to write a separate CEP
> for that? I would be happy to help drive that document and discussion. I'm
> not sure if it's the best name, but I'm trying to distinguish views that
> expose a subset of an existing schema vs materialized views, which offer
> more complex capabilities.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022, 3:11 PM Benedict  wrote:
>
>> I’m inclined to agree that this seems a more straightforward approach
>> that makes fewer implied promises.
>>
>> Perhaps we could deliver simple views backed by virtual tables, and model
>> our approach on that of Postgres, MySQL et al?
>>
>> Views in C* would be very simple, just offering a subset of fields with
>> some UDFs applied. It would allow users to define roles with access only to
>> the views, or for applications to use the views for presentation purposes.
>>
>> It feels like a cleaner approach to me, and we’d get two features for the
>> price of one. BUT I don’t feel super strongly about this.
>>
>> On 25 Aug 2022, at 20:16, Derek Chen-Becker 
>> wrote:
>>
>> 
>> To make sure I understand, if I wanted to use a masked column for a
>> conditional update, you're saying we would need SELECT_MASKED to use it in
>> the IF clause? I worry that this proposal is increasing in complexity; I
>> would actually be OK starting with something smaller in scope. Perhaps just
>> providing the masking functions and not tying masking to schema would be
>> sufficient for an initial goal? That wouldn't preclude additional
>> permissions, schema integration, or perhaps just plain Views in the future.
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
>> wrote:
>>
>>> I have modified the proposal adding a new SELECT_MASKED permission.
>>> Using masked columns on WHERE/IF clauses would require having SELECT and
>>> either UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in
>>> the query results would always require both SELECT and UNMASK.
>>>
>>> This way we can have the best of both worlds, allowing admins to decide
>>> whether they trust their immediate users or not. wdyt?
>>>
>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>>> wrote:
>>>
 This is the difference between security and compliance I guess :-D

 The way I see this, the attacker or threat in this concept is not the
 developer with access to the database. Rather a feature like this is just a
 convenient way to apply some masking rule in a centralized way. The
 protection is against an end user of the application, who should not be
 able to see the personal data of someone else. Or themselves, even. As long
 as the application end user doesn't have access to run arbitrary CQL, then
 these forms of masking prevent accidental unauthorized use/leaking of
 personal data.

 henrik



 On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider
> renaming the feature IMO
>
> On 23 Aug 2022, at 21:27, Andrés de la Peña 
> wrote:
>
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to
> prevent malicious users with SELECT permissions to indirectly guess the
> real value of the masked value. This can easily be done by just trying
> values on the WHERE clause of SELECT queries. DDM would not be a
> replacement for proper column-level permissions.
>
> The data served by the database is usually consumed by applications
> that present this data to end users. These end users are not necessarily
> the users directly connecting to the database. With DDM, it would be easy
> for applications to mask sensitive data that is going to be consumed by 
> the
> end users. However, the users directly connecting to the database should 
> be
> trusted, provided that they have the right SELECT permissions.
>
> In other words, DDM doesn't directly protect the data, but it eases
> the production of protected data.
>
> Said that, we co

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Derek Chen-Becker
Yes, I was thinking that simple projection views (essentially a SELECT
statement with application of transform functions) would complement masking
functions, and from the discussion it sounds like this is basically what
some of the other databases do. Projection views seem like they would be
useful in their own right, so would it be proper to write a separate CEP
for that? I would be happy to help drive that document and discussion. I'm
not sure if it's the best name, but I'm trying to distinguish views that
expose a subset of an existing schema vs materialized views, which offer
more complex capabilities.

Cheers,

Derek

On Thu, Aug 25, 2022, 3:11 PM Benedict  wrote:

> I’m inclined to agree that this seems a more straightforward approach that
> makes fewer implied promises.
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?
>
> Views in C* would be very simple, just offering a subset of fields with
> some UDFs applied. It would allow users to define roles with access only to
> the views, or for applications to use the views for presentation purposes.
>
> It feels like a cleaner approach to me, and we’d get two features for the
> price of one. BUT I don’t feel super strongly about this.
>
> On 25 Aug 2022, at 20:16, Derek Chen-Becker  wrote:
>
> 
> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
 Is it typical for a masking feature to make no effort to prevent
 unmasking? I’m just struggling to see the value of this without such
 mechanisms. Otherwise it’s just a default formatter, and we should consider
 renaming the feature IMO

 On 23 Aug 2022, at 21:27, Andrés de la Peña 
 wrote:

 
 As mentioned in the CEP document, dynamic data masking doesn't try to
 prevent malicious users with SELECT permissions to indirectly guess the
 real value of the masked value. This can easily be done by just trying
 values on the WHERE clause of SELECT queries. DDM would not be a
 replacement for proper column-level permissions.

 The data served by the database is usually consumed by applications
 that present this data to end users. These end users are not necessarily
 the users directly connecting to the database. With DDM, it would be easy
 for applications to mask sensitive data that is going to be consumed by the
 end users. However, the users directly connecting to the database should be
 trusted, provided that they have the right SELECT permissions.

 In other words, DDM doesn't directly protect the data, but it eases the
 production of protected data.

 Said that, we could later go one step ahead and add a way to prevent
 untrusted users from inferring the masked data. That could be done adding a
 new permission required to use certain columns on WHERE clauses, different
 to the current SELECT permission. That would play especially well with
 column-level permissions, which is something that we still have pending.

 On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
 wrote:

> Applying this should prevent querying on a field, else you could leak
>> its contents, surely?
>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Andrés de la Peña
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?


The approach of PostgreSQL allows attaching masking functions to columns and
users with commands such as:

SECURITY LABEL FOR anon ON COLUMN people.phone
IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$**$$,2)';

MySQL, however, only provides the masking functions, without the ability
to attach them to either columns or users, as far as I know.

The most similar to the proposed one is the approach of Azure/SQL Server,
which is almost identical except for the CEP trying to address the recent
concerns about querying masked columns.



On Thu, 25 Aug 2022 at 22:10, Benedict  wrote:

> I’m inclined to agree that this seems a more straightforward approach that
> makes fewer implied promises.
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?
>
> Views in C* would be very simple, just offering a subset of fields with
> some UDFs applied. It would allow users to define roles with access only to
> the views, or for applications to use the views for presentation purposes.
>
> It feels like a cleaner approach to me, and we’d get two features for the
> price of one. BUT I don’t feel super strongly about this.
>
> On 25 Aug 2022, at 20:16, Derek Chen-Becker  wrote:
>
> 
> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
 Is it typical for a masking feature to make no effort to prevent
 unmasking? I’m just struggling to see the value of this without such
 mechanisms. Otherwise it’s just a default formatter, and we should consider
 renaming the feature IMO

 On 23 Aug 2022, at 21:27, Andrés de la Peña 
 wrote:

 
 As mentioned in the CEP document, dynamic data masking doesn't try to
 prevent malicious users with SELECT permissions to indirectly guess the
 real value of the masked value. This can easily be done by just trying
 values on the WHERE clause of SELECT queries. DDM would not be a
 replacement for proper column-level permissions.

 The data served by the database is usually consumed by applications
 that present this data to end users. These end users are not necessarily
 the users directly connecting to the database. With DDM, it would be easy
 for applications to mask sensitive data that is going to be consumed by the
 end users. However, the users directly connecting to the database should be
 trusted, provided that they have the right SELECT permissions.

 In other words, DDM doesn't directly protect the data, but it eases the
 production of protected data.

 Said that, we could later go one step ahead and add a way to prevent
 untrusted users from inferring the masked data. That could be done adding a
 new permission required to use certain columns on WHERE clauses, different
 to the current SELECT permission. That would play especially well with
 column-level permissions, which is something that we still have pending.

 On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
>>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Benedict
I’m inclined to agree that this seems a more straightforward approach that 
makes fewer implied promises.

Perhaps we could deliver simple views backed by virtual tables, and model our 
approach on that of Postgres, MySQL et al?

Views in C* would be very simple, just offering a subset of fields with some 
UDFs applied. It would allow users to define roles with access only to the 
views, or for applications to use the views for presentation purposes.

It feels like a cleaner approach to me, and we’d get two features for the price 
of one. BUT I don’t feel super strongly about this.

> On 25 Aug 2022, at 20:16, Derek Chen-Becker  wrote:
> 
> 
> To make sure I understand, if I wanted to use a masked column for a 
> conditional update, you're saying we would need SELECT_MASKED to use it in 
> the IF clause? I worry that this proposal is increasing in complexity; I 
> would actually be OK starting with something smaller in scope. Perhaps just 
> providing the masking functions and not tying masking to schema would be 
> sufficient for an initial goal? That wouldn't preclude additional 
> permissions, schema integration, or perhaps just plain Views in the future.
> 
> Cheers,
> 
> Derek
> 
>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña  
>> wrote:
>> I have modified the proposal adding a new SELECT_MASKED permission. Using 
>> masked columns on WHERE/IF clauses would require having SELECT and either 
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the query 
>> results would always require both SELECT and UNMASK.
>> 
>> This way we can have the best of both worlds, allowing admins to decide 
>> whether they trust their immediate users or not. wdyt?
>> 
>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo  wrote:
>>> This is the difference between security and compliance I guess :-D
>>> 
>>> The way I see this, the attacker or threat in this concept is not the 
>>> developer with access to the database. Rather a feature like this is just a 
>>> convenient way to apply some masking rule in a centralized way. The 
>>> protection is against an end user of the application, who should not be 
>>> able to see the personal data of someone else. Or themselves, even. As long 
>>> as the application end user doesn't have access to run arbitrary CQL, then 
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>> 
>>> henrik
>>> 
>>> 
>>> 
 On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
 Is it typical for a masking feature to make no effort to prevent 
 unmasking? I’m just struggling to see the value of this without such 
 mechanisms. Otherwise it’s just a default formatter, and we should 
 consider renaming the feature IMO
 
>> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
>> 
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to 
> prevent malicious users with SELECT permissions to indirectly guess the 
> real value of the masked value. This can easily be done by just trying 
> values on the WHERE clause of SELECT queries. DDM would not be a 
> replacement for proper column-level permissions.
> 
> The data served by the database is usually consumed by applications that 
> present this data to end users. These end users are not necessarily the 
> users directly connecting to the database. With DDM, it would be easy for 
> applications to mask sensitive data that is going to be consumed by the 
> end users. However, the users directly connecting to the database should 
> be trusted, provided that they have the right SELECT permissions.
> 
> In other words, DDM doesn't directly protect the data, but it eases the 
> production of protected data.
> 
> That said, we could later go one step further and add a way to prevent 
> untrusted users from inferring the masked data. That could be done by adding 
> a new permission required to use certain columns on WHERE clauses, 
> different to the current SELECT permission. That would play especially 
> well with column-level permissions, which is something that we still have 
> pending. 
> 
> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:
>>> Applying this should prevent querying on a field, else you could leak 
>>> its contents, surely?
>> 
>> In theory, yes.  Although I could see folks doing something like this:
>> 
>> SELECT COUNT(*) FROM patients
>> WHERE year_of_birth = 2002
>> AND date_of_birth >= '2002-04-01'
>> AND date_of_birth < '2002-11-01';
>> 
>> In this case, the rows containing the masked key column(s) could be 
>> filtered on without revealing the actual data.  But again, that's 
>> probably better for a "phase 2" of the implementation.
>> 
>>> Agreed on not being a queryable field. That would also preclude 
>>> secondary indexing, right?
>> 
>> Yes, that's my thought as well.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Andrés de la Peña
Note that conditional updates return true or false to notify whether the
update has happened or not. That can also be exploited to infer the masked
data. Indeed, at the moment they also require SELECT permissions.
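
To illustrate the inference described above, here is a minimal sketch in Python. It is only a toy model, not Cassandra code: the in-memory table, the mask function and every name in it are hypothetical, and it assumes the IF condition is evaluated against the real (unmasked) value.

```python
# Hypothetical in-memory stand-in for a table with a masked column.
# None of these names come from the CEP.
ROWS = {1: {"name": "alice", "ssn": "123-45-6789"}}


def mask(value):
    """Toy masking function: replace every character with '*'."""
    return "*" * len(value)


def select_masked(pk):
    """What a user without UNMASK would see in query results."""
    row = ROWS[pk]
    return {"name": row["name"], "ssn": mask(row["ssn"])}


def conditional_update(pk, expected_ssn, new_name):
    """Mimics UPDATE ... IF ssn = ?; the condition sees the real value and
    the return value reports whether the update was applied."""
    row = ROWS[pk]
    if row["ssn"] == expected_ssn:
        row["name"] = new_name
        return True
    return False


def infer_last_digit(pk, known_prefix):
    """Probe the IF condition with each candidate until one is applied."""
    # Rewrite the name with its current value so the probe changes nothing.
    current_name = select_masked(pk)["name"]
    for digit in "0123456789":
        candidate = known_prefix + digit
        if conditional_update(pk, candidate, current_name):
            return candidate
    return None
```

The query results only ever show the masked value, but the applied/not-applied answer is enough to recover the real one, which is why the IF clause is covered by the same permissions as the WHERE clause.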

The masking functions can always be used on their own, as any other CQL
function and without necessarily associating them to the schema.

You would only need either UNMASK or SELECT_MASKED permissions for a
conditional update if the masking function is attached to the column
declaration in the schema of the table.

There is a timeline section of the CEP listing the planned development
steps. The first step is adding the functions on their own. The next steps
allow attaching those functions to columns, with the mentioned permissions.

On Thu, 25 Aug 2022 at 20:16, Derek Chen-Becker 
wrote:

> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
 Is it typical for a masking feature to make no effort to prevent
 unmasking? I’m just struggling to see the value of this without such
 mechanisms. Otherwise it’s just a default formatter, and we should consider
 renaming the feature IMO

 On 23 Aug 2022, at 21:27, Andrés de la Peña 
 wrote:

 
 As mentioned in the CEP document, dynamic data masking doesn't try to
 prevent malicious users with SELECT permissions from indirectly guessing the
 real value of the masked value. This can easily be done by just trying
 values on the WHERE clause of SELECT queries. DDM would not be a
 replacement for proper column-level permissions.

 The data served by the database is usually consumed by applications
 that present this data to end users. These end users are not necessarily
 the users directly connecting to the database. With DDM, it would be easy
 for applications to mask sensitive data that is going to be consumed by the
 end users. However, the users directly connecting to the database should be
 trusted, provided that they have the right SELECT permissions.

 In other words, DDM doesn't directly protect the data, but it eases the
 production of protected data.

 That said, we could later go one step further and add a way to prevent
 untrusted users from inferring the masked data. That could be done by adding a
 new permission required to use certain columns on WHERE clauses, different
 to the current SELECT permission. That would play especially well with
 column-level permissions, which is something that we still have pending.

 On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
 wrote:

> Applying this should prevent querying on a field, else you could leak
>> its contents, surely?
>>
>
> In theory, yes.  Although I could see folks doing something like this:
>
> SELECT COUNT(*) FROM patients
> WHERE year_of_birth = 2002
> AND date_of_birth >= '2002-04-01'
> AND date_of_birth < '2002-11-01';
>
> In this case, the rows containing the masked key column(s) could be
> filtered on without revealing the actual data.  But again, that's prob

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Derek Chen-Becker
To make sure I understand, if I wanted to use a masked column for a
conditional update, you're saying we would need SELECT_MASKED to use it in
the IF clause? I worry that this proposal is increasing in complexity; I
would actually be OK starting with something smaller in scope. Perhaps just
providing the masking functions and not tying masking to schema would be
sufficient for an initial goal? That wouldn't preclude additional
permissions, schema integration, or perhaps just plain Views in the future.

Cheers,

Derek

On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
wrote:

> I have modified the proposal adding a new SELECT_MASKED permission. Using
> masked columns on WHERE/IF clauses would require having SELECT and either
> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
> query results would always require both SELECT and UNMASK.
>
> This way we can have the best of both worlds, allowing admins to decide
> whether they trust their immediate users or not. wdyt?
>
> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
> wrote:
>
>> This is the difference between security and compliance I guess :-D
>>
>> The way I see this, the attacker or threat in this concept is not the
>> developer with access to the database. Rather a feature like this is just a
>> convenient way to apply some masking rule in a centralized way. The
>> protection is against an end user of the application, who should not be
>> able to see the personal data of someone else. Or themselves, even. As long
>> as the application end user doesn't have access to run arbitrary CQL, then
>> these forms of masking prevent accidental unauthorized use/leaking of
>> personal data.
>>
>> henrik
>>
>>
>>
>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>
>>> Is it typical for a masking feature to make no effort to prevent
>>> unmasking? I’m just struggling to see the value of this without such
>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>> renaming the feature IMO
>>>
>>> On 23 Aug 2022, at 21:27, Andrés de la Peña 
>>> wrote:
>>>
>>> 
>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>> prevent malicious users with SELECT permissions from indirectly guessing the
>>> real value of the masked value. This can easily be done by just trying
>>> values on the WHERE clause of SELECT queries. DDM would not be a
>>> replacement for proper column-level permissions.
>>>
>>> The data served by the database is usually consumed by applications that
>>> present this data to end users. These end users are not necessarily the
>>> users directly connecting to the database. With DDM, it would be easy for
>>> applications to mask sensitive data that is going to be consumed by the end
>>> users. However, the users directly connecting to the database should be
>>> trusted, provided that they have the right SELECT permissions.
>>>
>>> In other words, DDM doesn't directly protect the data, but it eases the
>>> production of protected data.
>>>
>>> That said, we could later go one step further and add a way to prevent
>>> untrusted users from inferring the masked data. That could be done by adding a
>>> new permission required to use certain columns on WHERE clauses, different
>>> to the current SELECT permission. That would play especially well with
>>> column-level permissions, which is something that we still have pending.
>>>
>>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
>>> wrote:
>>>
 Applying this should prevent querying on a field, else you could leak
> its contents, surely?
>

 In theory, yes.  Although I could see folks doing something like this:

 SELECT COUNT(*) FROM patients
 WHERE year_of_birth = 2002
 AND date_of_birth >= '2002-04-01'
 AND date_of_birth < '2002-11-01';

 In this case, the rows containing the masked key column(s) could be
 filtered on without revealing the actual data.  But again, that's probably
 better for a "phase 2" of the implementation.

 Agreed on not being a queryable field. That would also preclude
> secondary indexing, right?


 Yes, that's my thought as well.

 On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker <
 [email protected]> wrote:

> Agreed on not being a queryable field. That would also preclude
> secondary indexing, right?
>
> On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:
>
>> Applying this should prevent querying on a field, else you could leak
>> its contents, surely? This pretty much prohibits using it in a clustering
>> key, and a partition key with the ordered partitioner - but probably also a
>> hashed partitioner since we do not use a cryptographic hash and the hash
>> function is well defined.
>>
>> We probably also need to ensure that any ALLOW FILTERING queries on
>> such a field are disabled.
>>
>> Plausibly the data could be cryptographically jumbled before using it
>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Andrés de la Peña
I have modified the proposal adding a new SELECT_MASKED permission. Using
masked columns on WHERE/IF clauses would require having SELECT and either
UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
query results would always require both SELECT and UNMASK.

This way we can have the best of both worlds, allowing admins to decide
whether they trust their immediate users or not. wdyt?
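
For illustration only, the two rules above can be written down as a small Python sketch. This is just one reading of the proposal, not its implementation: the function names are hypothetical and a role's permissions are modelled as a plain set of strings.

```python
def can_filter_on_masked(perms):
    """Using a masked column in WHERE/IF needs SELECT plus either
    UNMASK or SELECT_MASKED."""
    return "SELECT" in perms and bool({"UNMASK", "SELECT_MASKED"} & perms)


def can_see_unmasked(perms):
    """Seeing unmasked values in query results needs both SELECT and UNMASK."""
    return {"SELECT", "UNMASK"} <= perms
```

Under this reading, a role with only SELECT can read masked results but cannot filter on the masked column, and a role with SELECT and SELECT_MASKED can filter but still only sees masked values.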

On Wed, 24 Aug 2022 at 16:06, Henrik Ingo  wrote:

> This is the difference between security and compliance I guess :-D
>
> The way I see this, the attacker or threat in this concept is not the
> developer with access to the database. Rather a feature like this is just a
> convenient way to apply some masking rule in a centralized way. The
> protection is against an end user of the application, who should not be
> able to see the personal data of someone else. Or themselves, even. As long
> as the application end user doesn't have access to run arbitrary CQL, then
> these forms of masking prevent accidental unauthorized use/leaking of
> personal data.
>
> henrik
>
>
>
> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>
>> Is it typical for a masking feature to make no effort to prevent
>> unmasking? I’m just struggling to see the value of this without such
>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>> renaming the feature IMO
>>
>> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
>>
>> 
>> As mentioned in the CEP document, dynamic data masking doesn't try to
>> prevent malicious users with SELECT permissions from indirectly guessing the
>> real value of the masked value. This can easily be done by just trying
>> values on the WHERE clause of SELECT queries. DDM would not be a
>> replacement for proper column-level permissions.
>>
>> The data served by the database is usually consumed by applications that
>> present this data to end users. These end users are not necessarily the
>> users directly connecting to the database. With DDM, it would be easy for
>> applications to mask sensitive data that is going to be consumed by the end
>> users. However, the users directly connecting to the database should be
>> trusted, provided that they have the right SELECT permissions.
>>
>> In other words, DDM doesn't directly protect the data, but it eases the
>> production of protected data.
>>
>> That said, we could later go one step further and add a way to prevent
>> untrusted users from inferring the masked data. That could be done by adding a
>> new permission required to use certain columns on WHERE clauses, different
>> to the current SELECT permission. That would play especially well with
>> column-level permissions, which is something that we still have pending.
>>
>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:
>>
>>> Applying this should prevent querying on a field, else you could leak
 its contents, surely?

>>>
>>> In theory, yes.  Although I could see folks doing something like this:
>>>
>>> SELECT COUNT(*) FROM patients
>>> WHERE year_of_birth = 2002
>>> AND date_of_birth >= '2002-04-01'
>>> AND date_of_birth < '2002-11-01';
>>>
>>> In this case, the rows containing the masked key column(s) could be
>>> filtered on without revealing the actual data.  But again, that's probably
>>> better for a "phase 2" of the implementation.
>>>
>>> Agreed on not being a queryable field. That would also preclude
 secondary indexing, right?
>>>
>>>
>>> Yes, that's my thought as well.
>>>
>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker <
>>> [email protected]> wrote:
>>>
 Agreed on not being a queryable field. That would also preclude
 secondary indexing, right?

 On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:

> Applying this should prevent querying on a field, else you could leak
> its contents, surely? This pretty much prohibits using it in a clustering
> key, and a partition key with the ordered partitioner - but probably also a
> hashed partitioner since we do not use a cryptographic hash and the hash
> function is well defined.
>
> We probably also need to ensure that any ALLOW FILTERING queries on
> such a field are disabled.
>
> Plausibly the data could be cryptographically jumbled before using it
> in a primary key component (or permitting filtering), but it is probably
> easier and safer to exclude for now…
>
> On 23 Aug 2022, at 18:13, Aaron Ploetz  wrote:
>
> 
> Some thoughts on this one:
>
> In a prior job, we'd give app teams access to a single keyspace, and
> two roles: a read-write role and a read-only role.  In some cases, a
> "privileged" application role was also requested.  Depending on the
> requirements, I could see the UNMASK permission being applied to the RW or
> privileged roles.  But if there's a problem on the table and the operators
> go in to investigate, they will likely use a SUPERUSER account, and 
> t

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Henrik Ingo
This is the difference between security and compliance I guess :-D

The way I see this, the attacker or threat in this concept is not the
developer with access to the database. Rather a feature like this is just a
convenient way to apply some masking rule in a centralized way. The
protection is against an end user of the application, who should not be
able to see the personal data of someone else. Or themselves, even. As long
as the application end user doesn't have access to run arbitrary CQL, then
these forms of masking prevent accidental unauthorized use/leaking of
personal data.

henrik



On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO
>
> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
>
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to
> prevent malicious users with SELECT permissions from indirectly guessing the
> real value of the masked value. This can easily be done by just trying
> values on the WHERE clause of SELECT queries. DDM would not be a
> replacement for proper column-level permissions.
>
> The data served by the database is usually consumed by applications that
> present this data to end users. These end users are not necessarily the
> users directly connecting to the database. With DDM, it would be easy for
> applications to mask sensitive data that is going to be consumed by the end
> users. However, the users directly connecting to the database should be
> trusted, provided that they have the right SELECT permissions.
>
> In other words, DDM doesn't directly protect the data, but it eases the
> production of protected data.
>
> That said, we could later go one step further and add a way to prevent
> untrusted users from inferring the masked data. That could be done by adding a
> new permission required to use certain columns on WHERE clauses, different
> to the current SELECT permission. That would play especially well with
> column-level permissions, which is something that we still have pending.
>
> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:
>
>> Applying this should prevent querying on a field, else you could leak its
>>> contents, surely?
>>>
>>
>> In theory, yes.  Although I could see folks doing something like this:
>>
>> SELECT COUNT(*) FROM patients
>> WHERE year_of_birth = 2002
>> AND date_of_birth >= '2002-04-01'
>> AND date_of_birth < '2002-11-01';
>>
>> In this case, the rows containing the masked key column(s) could be
>> filtered on without revealing the actual data.  But again, that's probably
>> better for a "phase 2" of the implementation.
>>
>> Agreed on not being a queryable field. That would also preclude secondary
>>> indexing, right?
>>
>>
>> Yes, that's my thought as well.
>>
>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker 
>> wrote:
>>
>>> Agreed on not being a queryable field. That would also preclude
>>> secondary indexing, right?
>>>
>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:
>>>
 Applying this should prevent querying on a field, else you could leak
 its contents, surely? This pretty much prohibits using it in a clustering
 key, and a partition key with the ordered partitioner - but probably also a
 hashed partitioner since we do not use a cryptographic hash and the hash
 function is well defined.

 We probably also need to ensure that any ALLOW FILTERING queries on
 such a field are disabled.

 Plausibly the data could be cryptographically jumbled before using it
 in a primary key component (or permitting filtering), but it is probably
 easier and safer to exclude for now…

 On 23 Aug 2022, at 18:13, Aaron Ploetz  wrote:

 
 Some thoughts on this one:

 In a prior job, we'd give app teams access to a single keyspace, and
 two roles: a read-write role and a read-only role.  In some cases, a
 "privileged" application role was also requested.  Depending on the
 requirements, I could see the UNMASK permission being applied to the RW or
 privileged roles.  But if there's a problem on the table and the operators
 go in to investigate, they will likely use a SUPERUSER account, and they'll
 see that data.

 How hard would it be for SUPERUSERs to *not* automatically get the
 UNMASK permission?

 I'll also echo the concerns around masking primary key components.
 It's highly likely that certain personal data properties would be used as a
 partition or clustering key (ex: range query for people born within a
 certain timeframe).  In addition to the "breaks existing" concern, I'm
 curious about the challenges around getting that to work with the current
 primary key implementation.

 Does this first implementation only apply to paylo

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
The MySQL feature is not equivalent to this proposal, it simply offers new 
transformation functions that implement this functionality, so it is up to the 
application to apply these functions to its own selects or, as most examples 
seem to use, to create a view on the data that applies the function. Of course, 
any joins or queries on such a view will operate over the result of the 
function, not its input. This permits the DBA to create roles that really do 
have no access to the unmasked data, and would have to infer information via 
other means (perhaps joins against other tables). So, it is perhaps a misnomer 
to say that “it does this” but the MySQL feature applies uniformly, and it is 
clear what access to the data a role is being granted, as there is no 
table-level masking.

Postgres appears to adopt the same approach.
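
As a rough illustration of the view-based approach described above, here is a toy Python model. It is not MySQL or Postgres code; the data, the mask function and all names are hypothetical. The point is only that predicates run over the output of the masking function, never its input.

```python
# Hypothetical base table; a restricted role never queries it directly.
RAW = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example.com"},
]


def mask_email(email):
    """Toy mask: hide the local part and keep the domain."""
    _, _, domain = email.partition("@")
    return "****@" + domain


def masked_view():
    """The 'view': every row with the masking function already applied."""
    return [{"id": r["id"], "email": mask_email(r["email"])} for r in RAW]


def query_view(predicate):
    """Filters are evaluated against the masked rows only."""
    return [r for r in masked_view() if predicate(r)]
```

Here a filter on the real address matches nothing, while a filter on the masked form matches every row that shares that mask, so guessing values in the filter tells the role nothing beyond what the mask already exposes.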


> On 24 Aug 2022, at 14:32, Andrés de la Peña  wrote:
> 
> 
> Where does MySQL suggest that? As far as I can tell, MySQL only offers a set of 
> functions for masking. I can't see a way to force users or tables to use 
> those functions, and it is up to the users to use those functions or not. I'm 
> reading this documentation.
> 
> As for broadening the scope of the proposal to prevent malicious users from 
> inferring the masked data, I guess that the additional rule would simply be 
> that a user with READ but not UNMASK permissions cannot use masked columns on 
> WHERE or IF clauses. That would include both SELECT and UPDATE statements. 
> That would differentiate us from many popular databases out there, where data 
> masking usually is a simpler thing.
> 
>> On Wed, 24 Aug 2022 at 14:08, Benedict  wrote:
>> I can’t tell for sure, but the documentation on Postgres’ feature suggests 
>> to me that it does apply the masking to all possible uses of the data, 
>> including joining and querying.
>> 
>> Snowflake’s documentation explicitly says that it does.
>> 
>> MySQL’s documentation suggests that it does this.
>> 
>> Oracle, AWS and MS SQL do not.
>> 
>> My inclination would be to - at least by default - forbid querying on 
>> columns that are masked, unless the mask permits it.
>> 
>> 
 On 24 Aug 2022, at 11:06, Andrés de la Peña  wrote:
 
>>> 
>>> Here are the names of the feature on some databases out there, errors and 
>>> omissions excepted:
>>> Microsoft SQL Server / Azure SQL: Dynamic data masking
>>> MySQL: Enterprise data masking and de-identification
>>> PostgreSQL: Dynamic masking
>>> MongoDB: Data masking
>>> IBM Db2: Masks
>>> Oracle: Redaction
>>> MariaDB/MaxScale: Data masking
>>> Snowflake: Dynamic data masking
>>> 
 On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:
 Right, but we get to decide how we offer such features and what we call 
 them. I can’t imagine a good reason to call this a masking feature, 
 especially one that applies differentially to certain users, when it is 
 trivial to unmask.
 
 I’m ok offering a feature called “default formatter” or something that 
 applies some UDF to a field before returning to the client, and if users 
 wish to “mask” their data in this way that’s fine. But calling it a data 
 mask when it is trivial to circumvent is IMO dangerous, and I’d at least 
 want to see evidence that all other equivalent features in the industry 
 are similarly poorly named and offer similarly poor protection.
 
>> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>> 
> 
>> The PCI DSS Standard v4_0 requires that credit card numbers stored on 
>> the system must be "rendered unreadable", thus this proposal is _NOT_ a 
>> good way to protect credit card numbers.
> 
> My point was simply that Dynamic Data Masking, like any 
> other feature, makes sense for some scenarios but not for others. I 
> apologise if my example was a bad one.
> 
>> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev 
>>  a écrit :
>> This change appears to be looking at two aspects:
>> Add metadata to columns
>> Add functionality based on the metadata.
>> If the system had a generic user defined metadata and the ability to 
>> define filter functions at the point where data are being returned to 
>> the client it would be possible for users to implement this filter, or any 
>> other filter on the data.
>> 
>> The concept of user defined metadata and filters could be applied to 
>> other parts of the system as well.  For example, if the metadata were 
>> accessible from UDFs the metadata could be used in low level filters to 
>> remove rows from queries before they were returned.
>> 
>> 
>> 
>> 
>>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>>>  wrote:
>>> The PCI DSS Standard v4_0 requires that credit card numbers stored on 
>>> the system must be "rendered unreadable", thus this proposal is _NOT_ a 
>>> good way to protect credit card numbers.  In fact, for any criti

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
Where does MySQL suggest that? As far as I can tell, MySQL only offers a set of
functions for masking. I can't see a way to force users or tables to use
those functions, and it is up to the users to use those functions or not. I'm
reading this documentation.

As for broadening the scope of the proposal to prevent malicious users from
inferring the masked data, I guess that the additional rule would simply be
that a user with READ but not UNMASK permissions cannot use masked columns
on WHERE or IF clauses. That would include both SELECT and UPDATE
statements. That would differentiate us from many popular databases out
there, where data masking usually is a simpler thing.

On Wed, 24 Aug 2022 at 14:08, Benedict  wrote:

> I can’t tell for sure, but the documentation on Postgres’ feature suggests
> to me that it does apply the masking to all possible uses of the data,
> including joining and querying.
>
> Snowflake’s documentation explicitly says that it does.
>
> MySQL’s documentation suggests that it does this.
>
> Oracle, AWS and MS SQL do not.
>
> My inclination would be to - at least by default - forbid querying on
> columns that are masked, unless the mask permits it.
>
>
> On 24 Aug 2022, at 11:06, Andrés de la Peña  wrote:
>
> 
> Here are the names of the feature on some databases out there, errors and
> omissions excepted:
>
>    - Microsoft SQL Server / Azure SQL: Dynamic data masking
>    - MySQL: Enterprise data masking and de-identification
>    - PostgreSQL: Dynamic masking
>    - MongoDB: Data masking
>    - IBM Db2: Masks
>    - Oracle: Redaction
>    - MariaDB/MaxScale: Data masking
>    - Snowflake: Dynamic data masking
>
>
> On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:
>
>> Right, but we get to decide how we offer such features and what we call
>> them. I can’t imagine a good reason to call this a masking feature,
>> especially one that applies differentially to certain users, when it is
>> trivial to unmask.
>>
>> I’m ok offering a feature called “default formatter” or something that
>> applies some UDF to a field before returning to the client, and if users
>> wish to “mask” their data in this way that’s fine. But calling it a data
>> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
>> want to see evidence that all other equivalent features in the industry are
>> similarly poorly named and offer similarly poor protection.
>>
>> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>>
>> 
>>
>>> The PCI DSS Standard v4_0 requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.
>>
>>
>> My point was simply that Dynamic Data Masking, like any
>> other feature, makes sense for some scenarios but not for others. I apologise
>> if my example was a bad one.
>>
>> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev <
>> [email protected]> a écrit :
>>
>>> This change appears to be looking at two aspects:
>>>
>>>    1. Add metadata to columns
>>>    2. Add functionality based on the metadata.
>>>
>>> If the system had a generic user defined metadata and the ability to
>>> define filter functions at the point where data are being returned to the
>>> client it would be possible for users to implement this filter, or any other
>>> filter on the data.
>>>
>>> The concept of user defined metadata and filters could be applied to
>>> other parts of the system as well.  For example, if the metadata were
>>> accessible from UDFs the metadata could be used in low level filters to
>>> remove rows from queries before they were returned.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr <
>>> [email protected]> wrote:
>>>
 The PCI DSS Standard v4_0 requires
 that credit card numbers stored on the system must be "rendered
 unreadable", thus this proposal is _NOT_ a good way to protect credit card
 numbers.  In fact, for any critically sensitive data this is not an
 appropriate solution.  However, there seems to be agreement that it is
 appropriate for obfuscating some data in some queries by some users.



 On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer 
 wrote:

> Is it typical for a masking feature to make no effort to prevent
>> unmasking? I’m just struggling to see the value of this without such
>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>> renaming the feature IMO
>
>
> The security that Dynamic Data Masking is bringing is related to how
> you make use of the feature. It is somehow the same with passwords. If you
> use a weak password it does not bring much security.
> Masking

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
I can’t tell for sure, but the documentation on Postgres’ feature suggests to 
me that it does apply the masking to all possible uses of the data, including 
joining and querying.

Snowflake’s documentation explicitly says that it does.

MySQL’s documentation suggests that it does this.

Oracle, AWS and MS SQL do not.

My inclination would be to - at least by default - forbid querying on columns 
that are masked, unless the mask permits it.


> On 24 Aug 2022, at 11:06, Andrés de la Peña  wrote:
> 
> 
> Here are the names of the feature on some databases out there, errors and 
> omissions excepted:
> Microsoft SQL Server / Azure SQL: Dynamic data masking
> MySQL: Enterprise data masking and de-identification
> PostgreSQL: Dynamic masking
> MongoDB: Data masking
> IBM Db2: Masks
> Oracle: Redaction
> MariaDB/MaxScale: Data masking
> Snowflake: Dynamic data masking
> 
>> On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:
>> Right, but we get to decide how we offer such features and what we call 
>> them. I can’t imagine a good reason to call this a masking feature, 
>> especially one that applies differentially to certain users, when it is 
>> trivial to unmask.
>> 
>> I’m ok offering a feature called “default formatter” or something that 
>> applies some UDF to a field before returning to the client, and if users 
>> wish to “mask” their data in this way that’s fine. But calling it a data 
>> mask when it is trivial to circumvent is IMO dangerous, and I’d at least 
>> want to see evidence that all other equivalent features in the industry are 
>> similarly poorly named and offer similarly poor protection.
>> 
 On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
 
>>> 
 The PCI DSS Standard v4_0 requires that credit card numbers stored on the 
 system must be "rendered unreadable", thus this proposal is _NOT_ a good 
 way to protect credit card numbers.
>>> 
>>> My point was simply about the fact that Dynamic Data Masking like any other 
>>> feature made sense for some scenario but not for others. I apologise if my 
>>> example was a bad one.
>>> 
 Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev 
  a écrit :
 This change appears to be looking at two aspects:
 Add metadata to columns
 Add functionality based on the metadata.
 If the system had a generic user defined metadata and the ability to 
 define filter functions at the point where data are being returned to the 
 client it would be possible for users implement this filter, or any other 
 filter on the data.
 
 The concept of user defined metadata and filters could be applied to other 
 parts of the system as well.  For example, if the metadata were accessible 
 from UDFs the metadata could be used in low level filters to remove rows 
 from queries before they were returned.
 
 
 
 
> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>  wrote:
> The PCI DSS Standard v4_0 requires that credit card numbers stored on the 
> system must be "rendered unreadable", thus this proposal is _NOT_ a good 
> way to protect credit card numbers.  In fact, for any critically 
> sensitive data this is not an appropriate solution.  However, there seems 
> to be agreement that it is appropriate for obfuscating some data in some 
> queries by some users.   
> 
> 
> 
> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:
>>> Is it typical for a masking feature to make no effort to prevent 
>>> unmasking? I’m just struggling to see the value of this without such 
>>> mechanisms. Otherwise it’s just a default formatter, and we should 
>>> consider renaming the feature IMO
>> 
>> The security that Dynamic Data Masking is bringing is related to how you 
>> make use of the feature. It is somehow the same with passwords. If you 
>> use a weak password it does not bring much security.
>> Masking a field like people's gender is useless because you will be able 
>> to determine its value in one query. On the other hand masking credit 
>> card numbers makes a lot of sense as it will complicate the life of the 
>> person trying to have access to it and the queries needed to reach the 
>> information will leave some clear traces in the audit log.
>> 
>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good 
>> way to protect sensitive data like credit card numbers or passwords. 
>> 
>> 
>>> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
>>> Is it typical for a masking feature to make no effort to prevent 
>>> unmasking? I’m just struggling to see the value of this without such 
>>> mechanisms. Otherwise it’s just a default formatter, and we should 
>>> consider renaming the feature IMO
>>> 
> On 23 Aug 2022, at 21:27, Andrés de la Peña  
> wrote:
> 
 
 As mentioned in the CEP document, dyn

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
Here are the names of the feature in some databases out there, errors and
omissions excepted:

   - Microsoft SQL Server / Azure SQL: Dynamic data masking
   - MySQL: Enterprise data masking and de-identification
   - PostgreSQL: Dynamic masking
   - MongoDB: Data masking
   - IBM Db2: Masks
   - Oracle: Redaction
   - MariaDB/MaxScale: Data masking
   - Snowflake: Dynamic data masking


On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:

> Right, but we get to decide how we offer such features and what we call
> them. I can’t imagine a good reason to call this a masking feature,
> especially one that applies differentially to certain users, when it is
> trivial to unmask.
>
> I’m ok offering a feature called “default formatter” or something that
> applies some UDF to a field before returning to the client, and if users
> wish to “mask” their data in this way that’s fine. But calling it a data
> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
> want to see evidence that all other equivalent features in the industry are
> similarly poorly named and offer similarly poor protection.
>
> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>
> 
>
>> The PCI DSS Standard v4_0
>> 
>>  requires
>> that credit card numbers stored on the system must be "rendered
>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>> numbers.
>
>
> My point was simply about the fact that Dynamic Data Masking like any
> other feature made sense for some scenario but not for others. I apologise
> if my example was a bad one.
>
> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev <
> [email protected]> a écrit :
>
>> This change appears to be looking at two aspects:
>>
>>1. Add metadata to columns
>>2. Add functionality based on the metadata.
>>
>> If the system had a generic user defined metadata and the ability to
>> define filter functions at the point where data are being returned to the
>> client it would be possible for users implement this filter, or any other
>> filter on the data.
>>
>> The concept of user defined metadata and filters could be applied to
>> other parts of the system as well.  For example, if the metadata were
>> accessible from UDFs the metadata could be used in low level filters to
>> remove rows from queries before they were returned.
>>
>>
>>
>>
>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>> wrote:
>>
>>> The PCI DSS Standard v4_0
>>> 
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.  In fact, for any critically sensitive data this is not an
>>> appropriate solution.  However, there seems to be agreement that it is
>>> appropriate for obfuscating some data in some queries by some users.
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer 
>>> wrote:
>>>
 Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider
> renaming the feature IMO


 The security that Dynamic Data Masking is bringing is related to how
 you make use of the feature. It is somehow the same with passwords. If you
 use a weak password it does not bring much security.
 Masking a field like people's gender is useless because you will be
 able to determine its value in one query. On the other hand masking credit
 card numbers makes a lot of sense as it will complicate the life of the
 person trying to have access to it and the queries needed to reach the
 information will leave some clear traces in the audit log.

 Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good
 way to protect sensitive data like credit card numbers or passwords.


 Le mer. 24 août 2022 à 09:40, Benedict  a écrit :

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider
> renaming the feature IMO
>
> On 23 Aug 2022, at 21:27, Andrés de la Peña 
> wrote:
>
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to
> prevent malicious users with SELECT permissions to indirectly guess the
> real value of the masked value. This can easily be done by just trying
> values on the WHERE clause of SELECT queries. DDM would not be a
> replacement for proper column-level permissions.
>
> The data served by the database is usually consumed by applications
> that present this data to end users. These end users are

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
>
> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO


I'd say it's a pretty standard feature. There are two parts in the
proposal; the CQL functions and the ability to link them to columns.

The CQL functions can indeed be seen as a formatter. You can see similar
functions for example in MySQL, in what they call "Enterprise Data
Masking and De-identification". Its doc says "MySQL provides
general-purpose masking functions that mask arbitrary strings, and
special-purpose masking functions that mask specific types of values.". As
far as I know, MySQL only offers these functions, without them being tied to
any permissions. Documentation is here
.
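
As an illustration of the "formatter" view, here is a small Python sketch of a general-purpose masking function. The name `mask_inner` and its parameters are assumptions made for this example; it is not the MySQL function nor the CQL function proposed in the CEP, only a function of the same flavour.

```python
def mask_inner(value: str, begin: int, end: int, pad: str = "*") -> str:
    """Replace the inner characters of value, keeping `begin` leading
    and `end` trailing characters in the clear (hypothetical helper)."""
    if begin < 0 or end < 0:
        raise ValueError("begin and end must be non-negative")
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    if begin + end >= len(value):
        # Assumption of this sketch: nothing is left to mask, return as is.
        return value
    masked_length = len(value) - begin - end
    return value[:begin] + pad * masked_length + value[len(value) - end:]
```

For example, `mask_inner("4111111111111111", 0, 4)` keeps only the last four digits. Note that the function knows nothing about users or permissions; it only formats a value.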

Associating masking functions with columns helps prevent the accidental
leakage of sensitive data by users that actually have access to the
data. So it can be seen as mandatory formatting; it does not prevent malicious
users with read permission from forcing their way to the clear data. You can
find disclaimers about it for example in the "Dynamic Data Masking" feature
of Azure SQL/SQL server: "The purpose of dynamic data masking is to limit
exposure of sensitive data, preventing users who shouldn't have access to
the data from viewing it. Dynamic data masking doesn't aim to prevent
database users from connecting directly to the database and running
exhaustive queries that expose pieces of the sensitive data.". Its doc even
has a specific section about this, here

.

As another example, IBM Db2 allows creating what they call masks. I don't
see any disclaimer about inferring the clear data, but its documentation
says "The application of enabled column masks does not interfere with the
operations of other clauses within the statement such as the WHERE, GROUP
BY, HAVING, SELECT DISTINCT, or ORDER BY. The rows that are returned in the
final result table remain the same, except that the values in the resulting
rows might have been masked by the column masks.", so I understand that
it's possible to infer the clear values unless one uses additional
permissions or security policies.
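
The behaviour described in that quote can be sketched with a toy in-memory model in Python. This is not how Db2 or Cassandra implement anything; `select` and `mask_default` are hypothetical names, and the only point is that the filter sees the clear value while the projection sees the masked one.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def mask_default(value: Any) -> str:
    """Hypothetical mask that hides the whole value."""
    return "****"


def select(rows: List[Row],
           columns: List[str],
           where: Callable[[Row], bool],
           masks: Dict[str, Callable[[Any], Any]]) -> List[Row]:
    """Evaluate the filter on clear values, then mask the projected columns."""
    matching = [row for row in rows if where(row)]
    return [
        {c: (masks[c](row[c]) if c in masks else row[c]) for c in columns}
        for row in matching
    ]


patients = [{"id": 1, "gender": "F"}, {"id": 2, "gender": "M"}]
result = select(patients, ["id", "gender"],
                lambda row: row["gender"] == "F",
                {"gender": mask_default})
# result == [{"id": 1, "gender": "****"}]
```

The returned value is masked, but the fact that row 1 matched the restriction already tells the caller what the clear value is.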



On Wed, 24 Aug 2022 at 09:48, Benjamin Lerer  wrote:

> The PCI DSS Standard v4_0
>> 
>>  requires
>> that credit card numbers stored on the system must be "rendered
>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>> numbers.
>
>
> My point was simply about the fact that Dynamic Data Masking like any
> other feature made sense for some scenario but not for others. I apologise
> if my example was a bad one.
>
> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev <
> [email protected]> a écrit :
>
>> This change appears to be looking at two aspects:
>>
>>1. Add metadata to columns
>>2. Add functionality based on the metadata.
>>
>> If the system had a generic user defined metadata and the ability to
>> define filter functions at the point where data are being returned to the
>> client it would be possible for users implement this filter, or any other
>> filter on the data.
>>
>> The concept of user defined metadata and filters could be applied to
>> other parts of the system as well.  For example, if the metadata were
>> accessible from UDFs the metadata could be used in low level filters to
>> remove rows from queries before they were returned.
>>
>>
>>
>>
>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>> wrote:
>>
>>> The PCI DSS Standard v4_0
>>> 
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.  In fact, for any critically sensitive data this is not an
>>> appropriate solution.  However, there seems to be agreement that it is
>>> appropriate for obfuscating some data in some queries by some users.
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer 
>>> wrote:
>>>
 Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider
> renaming the feature IMO


 The security that Dynamic Data Masking is bringing is related to how
 you make use of the feature. It is somehow the same with passwords. If you
 use a weak password it does not bring much security.
 Masking a field like people's gender is useless because you w

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
Right, but we get to decide how we offer such features and what we call them. I 
can’t imagine a good reason to call this a masking feature, especially one that 
applies differentially to certain users, when it is trivial to unmask.

I’m ok offering a feature called “default formatter” or something that applies 
some UDF to a field before returning to the client, and if users wish to “mask” 
their data in this way that’s fine. But calling it a data mask when it is 
trivial to circumvent is IMO dangerous, and I’d at least want to see evidence 
that all other equivalent features in the industry are similarly poorly named 
and offer similarly poor protection.

> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
> 
> 
>> The PCI DSS Standard v4_0 requires that credit card numbers stored on the 
>> system must be "rendered unreadable", thus this proposal is _NOT_ a good way 
>> to protect credit card numbers.
> 
> My point was simply about the fact that Dynamic Data Masking like any other 
> feature made sense for some scenario but not for others. I apologise if my 
> example was a bad one.
> 
>> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev 
>>  a écrit :
>> This change appears to be looking at two aspects:
>> Add metadata to columns
>> Add functionality based on the metadata.
>> If the system had a generic user defined metadata and the ability to define 
>> filter functions at the point where data are being returned to the client it 
>> would be possible for users implement this filter, or any other filter on 
>> the data.
>> 
>> The concept of user defined metadata and filters could be applied to other 
>> parts of the system as well.  For example, if the metadata were accessible 
>> from UDFs the metadata could be used in low level filters to remove rows 
>> from queries before they were returned.
>> 
>> 
>> 
>> 
>>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr  
>>> wrote:
>>> The PCI DSS Standard v4_0 requires that credit card numbers stored on the 
>>> system must be "rendered unreadable", thus this proposal is _NOT_ a good 
>>> way to protect credit card numbers.  In fact, for any critically sensitive 
>>> data this is not an appropriate solution.  However, there seems to be 
>>> agreement that it is appropriate for obfuscating some data in some queries 
>>> by some users.   
>>> 
>>> 
>>> 
>>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:
> Is it typical for a masking feature to make no effort to prevent 
> unmasking? I’m just struggling to see the value of this without such 
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider renaming the feature IMO
 
 The security that Dynamic Data Masking is bringing is related to how you 
 make use of the feature. It is somehow the same with passwords. If you use 
 a weak password it does not bring much security.
 Masking a field like people's gender is useless because you will be able 
 to determine its value in one query. On the other hand masking credit card 
 numbers makes a lot of sense as it will complicate the life of the person 
 trying to have access to it and the queries needed to reach the 
 information will leave some clear traces in the audit log.
 
 Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good way 
 to protect sensitive data like credit card numbers or passwords. 
 
 
> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
> Is it typical for a masking feature to make no effort to prevent 
> unmasking? I’m just struggling to see the value of this without such 
> mechanisms. Otherwise it’s just a default formatter, and we should 
> consider renaming the feature IMO
> 
>>> On 23 Aug 2022, at 21:27, Andrés de la Peña  
>>> wrote:
>>> 
>> 
>> As mentioned in the CEP document, dynamic data masking doesn't try to 
>> prevent malicious users with SELECT permissions to indirectly guess the 
>> real value of the masked value. This can easily be done by just trying 
>> values on the WHERE clause of SELECT queries. DDM would not be a 
>> replacement for proper column-level permissions.
>> 
>> The data served by the database is usually consumed by applications that 
>> present this data to end users. These end users are not necessarily the 
>> users directly connecting to the database. With DDM, it would be easy 
>> for applications to mask sensitive data that is going to be consumed by 
>> the end users. However, the users directly connecting to the database 
>> should be trusted, provided that they have the right SELECT permissions.
>> 
>> In other words, DDM doesn't directly protect the data, but it eases the 
>> production of protected data.
>> 
>> Said that, we could later go one step ahead and add a way to prevent 
>> untrusted users from inferring the masked data. That could be done 
>> adding a 

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benjamin Lerer
>
> The PCI DSS Standard v4_0 requires
> that credit card numbers stored on the system must be "rendered
> unreadable", thus this proposal is _NOT_ a good way to protect credit card
> numbers.


My point was simply that Dynamic Data Masking, like any other feature,
makes sense for some scenarios but not for others. I apologise if my
example was a bad one.

Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev <
[email protected]> a écrit :

> This change appears to be looking at two aspects:
>
>1. Add metadata to columns
>2. Add functionality based on the metadata.
>
> If the system had a generic user defined metadata and the ability to
> define filter functions at the point where data are being returned to the
> client it would be possible for users implement this filter, or any other
> filter on the data.
>
> The concept of user defined metadata and filters could be applied to
> other parts of the system as well.  For example, if the metadata were
> accessible from UDFs the metadata could be used in low level filters to
> remove rows from queries before they were returned.
>
>
>
>
> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
> wrote:
>
>> The PCI DSS Standard v4_0
>> 
>>  requires
>> that credit card numbers stored on the system must be "rendered
>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>> numbers.  In fact, for any critically sensitive data this is not an
>> appropriate solution.  However, there seems to be agreement that it is
>> appropriate for obfuscating some data in some queries by some users.
>>
>>
>>
>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:
>>
>>> Is it typical for a masking feature to make no effort to prevent
 unmasking? I’m just struggling to see the value of this without such
 mechanisms. Otherwise it’s just a default formatter, and we should consider
 renaming the feature IMO
>>>
>>>
>>> The security that Dynamic Data Masking is bringing is related to how you
>>> make use of the feature. It is somehow the same with passwords. If you use
>>> a weak password it does not bring much security.
>>> Masking a field like people's gender is useless because you will be able
>>> to determine its value in one query. On the other hand masking credit card
>>> numbers makes a lot of sense as it will complicate the life of the person
>>> trying to have access to it and the queries needed to reach the information
>>> will leave some clear traces in the audit log.
>>>
>>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good
>>> way to protect sensitive data like credit card numbers or passwords.
>>>
>>>
>>> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
>>>
 Is it typical for a masking feature to make no effort to prevent
 unmasking? I’m just struggling to see the value of this without such
 mechanisms. Otherwise it’s just a default formatter, and we should consider
 renaming the feature IMO

 On 23 Aug 2022, at 21:27, Andrés de la Peña 
 wrote:

 
 As mentioned in the CEP document, dynamic data masking doesn't try to
 prevent malicious users with SELECT permissions to indirectly guess the
 real value of the masked value. This can easily be done by just trying
 values on the WHERE clause of SELECT queries. DDM would not be a
 replacement for proper column-level permissions.

 The data served by the database is usually consumed by applications
 that present this data to end users. These end users are not necessarily
 the users directly connecting to the database. With DDM, it would be easy
 for applications to mask sensitive data that is going to be consumed by the
 end users. However, the users directly connecting to the database should be
 trusted, provided that they have the right SELECT permissions.

 In other words, DDM doesn't directly protect the data, but it eases the
 production of protected data.

 Said that, we could later go one step ahead and add a way to prevent
 untrusted users from inferring the masked data. That could be done adding a
 new permission required to use certain columns on WHERE clauses, different
 to the current SELECT permission. That would play especially well with
 column-level permissions, which is something that we still have pending.

 On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
 wrote:

> Applying this should prevent querying on a field, else you could leak
>> its contents, surely?
>>
>
> In theory, yes.  Although I could see folks doing something like this:
>
> SELECT COUNT(*) FROM patients
> WHERE year_of_birth = 2002
> AND date_of_birth >= '2002-04-01'
> AND date_of_birth < '2002-11-01';
>
> In this case, th

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Claude Warren, Jr via dev
This change appears to be looking at two aspects:

   1. Add metadata to columns
   2. Add functionality based on the metadata.

If the system had generic user-defined metadata and the ability to define
filter functions at the point where data are being returned to the client,
it would be possible for users to implement this filter, or any other filter
on the data.

The concept of user defined metadata and filters could be applied to
other parts of the system as well.  For example, if the metadata were
accessible from UDFs the metadata could be used in low level filters to
remove rows from queries before they were returned.
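
A rough Python sketch of that idea, under the assumption that metadata is a per-column dictionary of user-defined tags and that a filter is a function registered under a tag name. None of these names (`apply_column_filters`, `keep_last`) exist anywhere; they are only meant to show the shape of the mechanism.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Metadata = Dict[str, Dict[str, Any]]              # column -> {tag: setting}
FilterRegistry = Dict[str, Callable[[Any, Any], Any]]  # tag -> filter function


def apply_column_filters(rows: List[Row],
                         metadata: Metadata,
                         filters: FilterRegistry) -> List[Row]:
    """Apply the registered filter of every tag found on a column,
    at the point where rows are about to be returned to the client."""
    result = []
    for row in rows:
        filtered = {}
        for column, value in row.items():
            for tag, setting in metadata.get(column, {}).items():
                handler = filters.get(tag)
                if handler is not None:
                    value = handler(value, setting)
            filtered[column] = value
        result.append(filtered)
    return result


def keep_last(value: Any, count: int) -> str:
    """Example user-defined filter: hide all but the last `count` characters."""
    text = str(value)
    if count <= 0:
        return "*" * len(text)
    return "*" * max(len(text) - count, 0) + text[-count:]
```

Masking would then be just one filter among others: tagging a column with `{"keep_last": 4}` and registering `keep_last` gives a mask, and a different tag could drive a different filter.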




On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
wrote:

> The PCI DSS Standard v4_0
> 
>  requires
> that credit card numbers stored on the system must be "rendered
> unreadable", thus this proposal is _NOT_ a good way to protect credit card
> numbers.  In fact, for any critically sensitive data this is not an
> appropriate solution.  However, there seems to be agreement that it is
> appropriate for obfuscating some data in some queries by some users.
>
>
>
> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:
>
>> Is it typical for a masking feature to make no effort to prevent
>>> unmasking? I’m just struggling to see the value of this without such
>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>> renaming the feature IMO
>>
>>
>> The security that Dynamic Data Masking is bringing is related to how you
>> make use of the feature. It is somehow the same with passwords. If you use
>> a weak password it does not bring much security.
>> Masking a field like people's gender is useless because you will be able
>> to determine its value in one query. On the other hand masking credit card
>> numbers makes a lot of sense as it will complicate the life of the person
>> trying to have access to it and the queries needed to reach the information
>> will leave some clear traces in the audit log.
>>
>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good
>> way to protect sensitive data like credit card numbers or passwords.
>>
>>
>> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
>>
>>> Is it typical for a masking feature to make no effort to prevent
>>> unmasking? I’m just struggling to see the value of this without such
>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>> renaming the feature IMO
>>>
>>> On 23 Aug 2022, at 21:27, Andrés de la Peña 
>>> wrote:
>>>
>>> 
>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>> prevent malicious users with SELECT permissions to indirectly guess the
>>> real value of the masked value. This can easily be done by just trying
>>> values on the WHERE clause of SELECT queries. DDM would not be a
>>> replacement for proper column-level permissions.
>>>
>>> The data served by the database is usually consumed by applications that
>>> present this data to end users. These end users are not necessarily the
>>> users directly connecting to the database. With DDM, it would be easy for
>>> applications to mask sensitive data that is going to be consumed by the end
>>> users. However, the users directly connecting to the database should be
>>> trusted, provided that they have the right SELECT permissions.
>>>
>>> In other words, DDM doesn't directly protect the data, but it eases the
>>> production of protected data.
>>>
>>> Said that, we could later go one step ahead and add a way to prevent
>>> untrusted users from inferring the masked data. That could be done adding a
>>> new permission required to use certain columns on WHERE clauses, different
>>> to the current SELECT permission. That would play especially well with
>>> column-level permissions, which is something that we still have pending.
>>>
>>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
>>> wrote:
>>>
 Applying this should prevent querying on a field, else you could leak
> its contents, surely?
>

 In theory, yes.  Although I could see folks doing something like this:

 SELECT COUNT(*) FROM patients
 WHERE year_of_birth = 2002
 AND date_of_birth >= '2002-04-01'
 AND date_of_birth < '2002-11-01';

 In this case, the rows containing the masked key column(s) could be
 filtered on without revealing the actual data.  But again, that's probably
 better for a "phase 2" of the implementation.

 Agreed on not being a queryable field. That would also preclude
> secondary indexing, right?


 Yes, that's my thought as well.

 On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker <
 [email protected]> wrote:

> Agreed on not being a queryable field. That would also preclude
> secondary indexing, right?
>
> On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:
>
>> Applying this should prevent querying on a fie

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Claude Warren, Jr via dev
The PCI DSS Standard v4_0 requires
that credit card numbers stored on the system must be "rendered
unreadable", thus this proposal is _NOT_ a good way to protect credit card
numbers.  In fact, for any critically sensitive data this is not an
appropriate solution.  However, there seems to be agreement that it is
appropriate for obfuscating some data in some queries by some users.



On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:

> Is it typical for a masking feature to make no effort to prevent
>> unmasking? I’m just struggling to see the value of this without such
>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>> renaming the feature IMO
>
>
> The security that Dynamic Data Masking is bringing is related to how you
> make use of the feature. It is somehow the same with passwords. If you use
> a weak password it does not bring much security.
> Masking a field like people's gender is useless because you will be able
> to determine its value in one query. On the other hand masking credit card
> numbers makes a lot of sense as it will complicate the life of the person
> trying to have access to it and the queries needed to reach the information
> will leave some clear traces in the audit log.
>
> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good way
> to protect sensitive data like credit card numbers or passwords.
>
>
> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
>
>> Is it typical for a masking feature to make no effort to prevent
>> unmasking? I’m just struggling to see the value of this without such
>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>> renaming the feature IMO
>>
>> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
>>
>> 
>> As mentioned in the CEP document, dynamic data masking doesn't try to
>> prevent malicious users with SELECT permissions to indirectly guess the
>> real value of the masked value. This can easily be done by just trying
>> values on the WHERE clause of SELECT queries. DDM would not be a
>> replacement for proper column-level permissions.
>>
>> The data served by the database is usually consumed by applications that
>> present this data to end users. These end users are not necessarily the
>> users directly connecting to the database. With DDM, it would be easy for
>> applications to mask sensitive data that is going to be consumed by the end
>> users. However, the users directly connecting to the database should be
>> trusted, provided that they have the right SELECT permissions.
>>
>> In other words, DDM doesn't directly protect the data, but it eases the
>> production of protected data.
>>
>> Said that, we could later go one step ahead and add a way to prevent
>> untrusted users from inferring the masked data. That could be done adding a
>> new permission required to use certain columns on WHERE clauses, different
>> to the current SELECT permission. That would play especially well with
>> column-level permissions, which is something that we still have pending.
>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benjamin Lerer
>
> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO


The security that Dynamic Data Masking brings depends on how you use the
feature. It is somewhat like passwords: a weak password does not bring much
security.
Masking a field like a person's gender is useless, because you can determine
its value in one query. On the other hand, masking credit card numbers makes
a lot of sense: it complicates life for anyone trying to get at them, and
the queries needed to reach the information will leave clear traces in the
audit log.

Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good way
to protect sensitive data like credit card numbers or passwords.
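
To make the cardinality argument concrete, here is a minimal Python sketch.
It assumes a toy in-memory table and a hypothetical count_where helper that
stands in for a SELECT COUNT(*) ... WHERE query; none of these names come
from the CEP. It counts how many probing queries are needed to recover a
value and records each probe, the way an audit log would.

```python
# Hypothetical in-memory stand-in for a table; this is not a Cassandra API.
ROWS = [
    {"id": 1, "gender": "F", "card": "4929123456781234"},
    {"id": 2, "gender": "M", "card": "4485987654324321"},
]

audit_log = []  # every probing query is recorded here


def count_where(row_id, column, guess):
    """Mimic SELECT COUNT(*) ... WHERE id = ? AND <column> = ?."""
    audit_log.append((row_id, column, guess))
    return sum(1 for r in ROWS if r["id"] == row_id and r[column] == guess)


def probes_needed(row_id, column, candidates):
    """Try candidates in order; return the number of queries issued up to the first hit."""
    for n, guess in enumerate(candidates, start=1):
        if count_where(row_id, column, guess) == 1:
            return n
    return None  # value was not among the candidates
```

With two possible values, probes_needed(1, "gender", ["M", "F"]) needs at
most two queries. A 16-digit card number has up to 10^16 candidates, so the
same loop would have to issue a very large number of audited queries.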


On Wed, 24 Aug 2022 at 09:40, Benedict wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Claude Warren, Jr via dev
This seems to me to be a client display filter, applied at the last moment
as data is streamed back to the client.  It has no impact on keys, queries,
secondary indexes, or materialized views.  It simply prevents the display
from showing the complete value.  It does not preclude determining what some
values are by building carefully crafted queries.
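
A rough Python sketch of that "display filter" reading, under the assumption
that masking is applied only while result rows are built and that the WHERE
predicate still sees the real values. The mask_inner and select names are
hypothetical and only illustrate the idea; they are not the CEP's functions.

```python
def mask_inner(value, keep=2, pad="*"):
    """Hypothetical masking function: keep the first and last `keep` characters."""
    if len(value) <= 2 * keep:
        return pad * len(value)
    return value[:keep] + pad * (len(value) - 2 * keep) + value[-keep:]


def select(rows, predicate, masked_columns, can_unmask=False):
    """Filter on the real values, then mask only while building the result rows."""
    for row in rows:
        if not predicate(row):  # evaluated on the unmasked data
            continue
        if can_unmask:
            yield dict(row)
        else:
            yield {
                k: mask_inner(v) if k in masked_columns else v
                for k, v in row.items()
            }
```

Because the predicate runs before masking, a caller without the unmask right
can still learn whether a guessed value matches, which is the
"carefully crafted queries" caveat above.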





On Wed, Aug 24, 2022 at 8:40 AM Benedict  wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
Is it typical for a masking feature to make no effort to prevent unmasking? I’m 
just struggling to see the value of this without such mechanisms. Otherwise 
it’s just a default formatter, and we should consider renaming the feature IMO

> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
> 
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to prevent
> malicious users with SELECT permissions from indirectly guessing the real
> value of a masked column. This can easily be done by just trying values in the
> WHERE clause of SELECT queries. DDM would not be a replacement for proper
> column-level permissions.
> 
> The data served by the database is usually consumed by applications that 
> present this data to end users. These end users are not necessarily the users 
> directly connecting to the database. With DDM, it would be easy for 
> applications to mask sensitive data that is going to be consumed by the end 
> users. However, the users directly connecting to the database should be 
> trusted, provided that they have the right SELECT permissions.
> 
> In other words, DDM doesn't directly protect the data, but it eases the 
> production of protected data.
> 
> That said, we could later go one step further and add a way to prevent
> untrusted users from inferring the masked data. That could be done by adding a
> new permission required to use certain columns in WHERE clauses, distinct from
> the current SELECT permission. That would play especially well with
> column-level permissions, which is something that we still have pending.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Andrés de la Peña
As mentioned in the CEP document, dynamic data masking doesn't try to
prevent malicious users with SELECT permissions from indirectly guessing the
real value of a masked column. This can easily be done by just trying
values in the WHERE clause of SELECT queries. DDM would not be a
replacement for proper column-level permissions.

The data served by the database is usually consumed by applications that
present this data to end users. These end users are not necessarily the
users directly connecting to the database. With DDM, it would be easy for
applications to mask sensitive data that is going to be consumed by the end
users. However, the users directly connecting to the database should be
trusted, provided that they have the right SELECT permissions.

In other words, DDM doesn't directly protect the data, but it eases the
production of protected data.

That said, we could later go one step further and add a way to prevent
untrusted users from inferring the masked data. That could be done by adding
a new permission required to use certain columns in WHERE clauses, distinct
from the current SELECT permission. That would play especially well with
column-level permissions, which is something that we still have pending.
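
As a sketch of what such a check might look like (Python, purely
illustrative): the permission name FILTER_MASKED, the MASKED_COLUMNS set and
the check_select function are all assumptions made up for this example, not
names proposed in the CEP.

```python
MASKED_COLUMNS = {"card_number"}  # hypothetical schema metadata


def check_select(permissions, where_columns):
    """Raise PermissionError unless the role may run a SELECT restricting `where_columns`."""
    if "SELECT" not in permissions:
        raise PermissionError("SELECT permission required")
    restricted = MASKED_COLUMNS & set(where_columns)
    # FILTER_MASKED is a made-up name for the extra WHERE-clause permission.
    if restricted and "FILTER_MASKED" not in permissions:
        raise PermissionError(
            "restricting masked columns requires FILTER_MASKED: "
            + ", ".join(sorted(restricted))
        )
    return True
```

A role holding only SELECT could still read the (masked) column, but any
query naming it in the WHERE clause would be rejected, which removes the
guess-and-check path described above.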

On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:

>> Applying this should prevent querying on a field, else you could leak its
>> contents, surely?
>
> In theory, yes.  Although I could see folks doing something like this:
>
> SELECT COUNT(*) FROM patients
> WHERE year_of_birth = 2002
> AND date_of_birth >= '2002-04-01'
> AND date_of_birth < '2002-11-01';
>
> In this case, the rows containing the masked key column(s) could be
> filtered on without revealing the actual data.  But again, that's probably
> better for a "phase 2" of the implementation.
>
>> Agreed on not being a queryable field. That would also preclude secondary
>> indexing, right?
>
>
> Yes, that's my thought as well.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Aaron Ploetz
>
> Applying this should prevent querying on a field, else you could leak its
> contents, surely?
>

In theory, yes.  Although I could see folks doing something like this:

SELECT COUNT(*) FROM patients
WHERE year_of_birth = 2002
AND date_of_birth >= '2002-04-01'
AND date_of_birth < '2002-11-01';

In this case, the rows containing the masked key column(s) could be
filtered on without revealing the actual data.  But again, that's probably
better for a "phase 2" of the implementation.
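
For illustration, the same aggregate as a small Python sketch over made-up
rows (the PATIENTS data and the count_born_between helper are assumptions
for this example). The caller only ever gets an integer back, never a
date_of_birth value.

```python
from datetime import date

# Hypothetical rows; date_of_birth stands in for a masked clustering column.
PATIENTS = [
    {"year_of_birth": 2002, "date_of_birth": date(2002, 3, 15)},
    {"year_of_birth": 2002, "date_of_birth": date(2002, 4, 1)},
    {"year_of_birth": 2002, "date_of_birth": date(2002, 10, 31)},
    {"year_of_birth": 2002, "date_of_birth": date(2002, 11, 1)},
    {"year_of_birth": 2001, "date_of_birth": date(2001, 6, 1)},
]


def count_born_between(rows, year, start, end):
    """Count rows in partition `year` with start <= date_of_birth < end."""
    return sum(
        1
        for r in rows
        if r["year_of_birth"] == year and start <= r["date_of_birth"] < end
    )
```

Whether repeatedly narrowing the range would let someone pin down an
individual value is the open concern raised earlier in the thread; this
sketch only shows the single-query case.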

> Agreed on not being a queryable field. That would also preclude secondary
> indexing, right?


Yes, that's my thought as well.

On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker 
wrote:

> Agreed on not being a queryable field. That would also preclude secondary
> indexing, right?

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Derek Chen-Becker
Agreed on not being a queryable field. That would also preclude secondary
indexing, right?

On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:

> Applying this should prevent querying on a field, else you could leak its
> contents, surely? This pretty much prohibits using it in a clustering key,
> and a partition key with the ordered partitioner - but probably also a
> hashed partitioner since we do not use a cryptographic hash and the hash
> function is well defined.
>
> We probably also need to ensure that any ALLOW FILTERING queries on such a
> field are disabled.
>
> Plausibly the data could be cryptographically jumbled before using it in a
> primary key component (or permitting filtering), but it is probably easier
> and safer to exclude for now…

-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Benedict
Applying this should prevent querying on a field, else you could leak its 
contents, surely? This pretty much prohibits using it in a clustering key, and 
a partition key with the ordered partitioner - but probably also a hashed 
partitioner since we do not use a cryptographic hash and the hash function is 
well defined.

We probably also need to ensure that any ALLOW FILTERING queries on such a 
field are disabled.

Plausibly the data could be cryptographically jumbled before using it in a 
primary key component (or permitting filtering), but it is probably easier and 
safer to exclude for now…
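
A minimal Python sketch of the hashing concern, under stated assumptions:
CRC32 from the standard library is used as a stand-in for any public,
unkeyed hash (it is not the Murmur3 partitioner hash), and HMAC-SHA256 is
one possible reading of "cryptographically jumbled". The function names are
made up for this example.

```python
import hashlib
import hmac
import zlib


def plain_token(value):
    """Stand-in for a public, unkeyed partitioner hash (CRC32 here, not Murmur3)."""
    return zlib.crc32(value.encode())


def recover_by_enumeration(token, candidates):
    """Return every candidate whose public hash equals the observed token."""
    return [c for c in candidates if plain_token(c) == token]


def keyed_token(key, value):
    """One way to jumble a value: HMAC-SHA256 under a secret key."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
```

If the set of plausible values is small (say, all 4-digit strings), anyone
who can observe the token of a row can hash every candidate and compare.
With a keyed construction the same enumeration needs the secret key, at the
cost of key management and of losing the ordering of the original values.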

> On 23 Aug 2022, at 18:13, Aaron Ploetz  wrote:
> 
> 
> Some thoughts on this one:
> 
> In a prior job, we'd give app teams access to a single keyspace, and two 
> roles: a read-write role and a read-only role.  In some cases, a "privileged" 
> application role was also requested.  Depending on the requirements, I could 
> see the UNMASK permission being applied to the RW or privileged roles.  But 
> if there's a problem on the table and the operators go in to investigate, 
> they will likely use a SUPERUSER account, and they'll see that data.
> 
> How hard would it be for SUPERUSERs to *not* automatically get the UNMASK 
> permission?
> 
> I'll also echo the concerns around masking primary key components.  It's 
> highly likely that certain personal data properties would be used as a 
> partition or clustering key (ex: range query for people born within a certain 
> timeframe).  In addition to the "breaks existing" concern, I'm curious about 
> the challenges around getting that to work with the current primary key 
> implementation.
> 
> Does this first implementation only apply to payload (non-key) columns?  The 
> examples in the CEP currently do not show primary key components being 
> masked. 
> 
> Thanks,
> 
> Aaron
> 
> 
>> On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo  wrote:
>> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña  
>> wrote:
>>>> One thought: The way the CEP is currently written, it is only possible to
>>>> mask a column one way. You can only define one masking function for a
>>>> column, and since you use the original column name, you could only return
>>>> one version of it in the result set, even if you had a way to define
>>>> several functions.
>>> 
>>> Right, it's one single type of mapping per the column, declared on 
>>> CREATE/ALTER TABLE statements. Also, users can manually specify their own 
>>> masking function in SELECT statements if they have permissions for seeing 
>>> the clear data.
>>> 
>>> For those cases where the data is automatically masked for an unprivileged 
>>> user, I don't see the use of including different types of masking for the 
>>> same column into the same result set. Instead, we might be interested on 
>>> having different types of masking associated to different roles. We could 
>>> do so with dedicated CREATE/DROP/LIST MASK statements, instead of using the 
>>> CREATE/ALTER/DESCRIBE TABLE statements. That CREATE MASK statement would 
>>> associate a masking function to a column and role. However, I'm not sure we 
>>> need that type of granularity instead of the simplicity of attaching the 
>>> masking to the column declaration. wdyt?
>>> 
>>> 
>> 
>> My gut feeling likewise is that this adds complexity but little value.
 
>> 
>> 
>> -- 
>> Henrik Ingo
>> +358 40 569 7354
>>   


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Aaron Ploetz
Some thoughts on this one:

In a prior job, we'd give app teams access to a single keyspace, and two
roles: a read-write role and a read-only role.  In some cases, a
"privileged" application role was also requested.  Depending on the
requirements, I could see the UNMASK permission being applied to the RW or
privileged roles.  But if there's a problem on the table and the operators
go in to investigate, they will likely use a SUPERUSER account, and they'll
see that data.

How hard would it be for SUPERUSERs to *not* automatically get the UNMASK
permission?

I'll also echo the concerns around masking primary key components.  It's
highly likely that certain personal data properties would be used as a
partition or clustering key (ex: range query for people born within a
certain timeframe).  In addition to the "breaks existing" concern, I'm
curious about the challenges around getting that to work with the current
primary key implementation.

Does this first implementation only apply to payload (non-key) columns?
The examples in the CEP currently do not show primary key components being
masked.

Thanks,

Aaron


On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo 
wrote:

> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña 
> wrote:
>
>> One thought: The way the CEP is currently written, it is only possible to
>>> mask a column one way. You can only define one masking function for a
>>> column, and since you use the original column name, you could only return
>>> one version of it in the result set, even if you had a way to define
>>> several functions.
>>>
>>
>> Right, it's one single type of mapping per the column, declared on
>> CREATE/ALTER TABLE statements. Also, users can manually specify their own
>> masking function in SELECT statements if they have permissions for seeing
>> the clear data.
>>
>> For those cases where the data is automatically masked for an
>> unprivileged user, I don't see the use of including different types of
>> masking for the same column into the same result set. Instead, we might be
>> interested on having different types of masking associated to different
>> roles. We could do so with dedicated CREATE/DROP/LIST MASK statements,
>> instead of using the CREATE/ALTER/DESCRIBE TABLE statements. That CREATE
>> MASK statement would associate a masking function to a column and role.
>> However, I'm not sure we need that type of granularity instead of the
>> simplicity of attaching the masking to the column declaration. wdyt?
>>
>>
>>
> My gut feeling likewise is that this adds complexity but little value.
>
>>
>>>
>
> --
>
> Henrik Ingo
>
> +358 40 569 7354
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Henrik Ingo
On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña 
wrote:

> One thought: The way the CEP is currently written, it is only possible to
>> mask a column one way. You can only define one masking function for a
>> column, and since you use the original column name, you could only return
>> one version of it in the result set, even if you had a way to define
>> several functions.
>>
>
> Right, it's one single type of mapping per the column, declared on
> CREATE/ALTER TABLE statements. Also, users can manually specify their own
> masking function in SELECT statements if they have permissions for seeing
> the clear data.
>
> For those cases where the data is automatically masked for an unprivileged
> user, I don't see the use of including different types of masking for the
> same column into the same result set. Instead, we might be interested on
> having different types of masking associated to different roles. We could
> do so with dedicated CREATE/DROP/LIST MASK statements, instead of using the
> CREATE/ALTER/DESCRIBE TABLE statements. That CREATE MASK statement would
> associate a masking function to a column and role. However, I'm not sure we
> need that type of granularity instead of the simplicity of attaching the
> masking to the column declaration. wdyt?
>
>
>
My gut feeling likewise is that this adds complexity but little value.

>
>>

-- 

Henrik Ingo

+358 40 569 7354


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Andrés de la Peña
>
> One thought: The way the CEP is currently written, it is only possible to
> mask a column one way. You can only define one masking function for a
> column, and since you use the original column name, you could only return
> one version of it in the result set, even if you had a way to define
> several functions.
>

Right, it's a single type of masking per column, declared in
CREATE/ALTER TABLE statements. Also, users can manually specify their own
masking function in SELECT statements if they have permission to see
the clear data.

For those cases where the data is automatically masked for an unprivileged
user, I don't see the use of including different types of masking for the
same column in the same result set. Instead, we might be interested in
having different types of masking associated with different roles. We could
do so with dedicated CREATE/DROP/LIST MASK statements, instead of using the
CREATE/ALTER/DESCRIBE TABLE statements. That CREATE MASK statement would
associate a masking function with a column and a role. However, I'm not sure
we need that type of granularity instead of the simplicity of attaching the
masking to the column declaration. wdyt?
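
To make the trade-off concrete, here is a rough Python sketch (not CQL, and
not the CEP's implementation) of the two lookups: one mask per column versus
an optional mask per column and role. All names here (redact, keep_last_four,
COLUMN_MASKS, ROLE_MASKS, read_value) are hypothetical and only illustrate the
extra lookup that role-level granularity would imply.

```python
from typing import Callable, Dict, Optional, Tuple

MaskFn = Callable[[str], str]


def redact(value: str) -> str:
    """Replace the whole value with a fixed placeholder."""
    return "****"


def keep_last_four(value: str) -> str:
    """Hide everything except the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


# One mask per column, as attached to the column declaration.
COLUMN_MASKS: Dict[str, MaskFn] = {"card_number": redact}

# Hypothetical per-(column, role) masks, as a CREATE MASK statement might
# register them. This table is an assumption made for illustration.
ROLE_MASKS: Dict[Tuple[str, str], MaskFn] = {
    ("card_number", "support"): keep_last_four,
}


def read_value(column: str, value: str, role: str, can_unmask: bool) -> str:
    """Return the value as the given role would see it."""
    if can_unmask:
        return value
    # A role-specific mask wins over the column-level mask when present.
    mask: Optional[MaskFn] = ROLE_MASKS.get((column, role)) or COLUMN_MASKS.get(column)
    return mask(value) if mask else value
```

With only COLUMN_MASKS, every unprivileged role sees the same masked value;
ROLE_MASKS is the additional state that would have to be created, listed and
dropped separately from the table schema.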


On Mon, 22 Aug 2022 at 19:31, Henrik Ingo  wrote:

> One thought: The way the CEP is currently written, it is only possible to
> mask a column one way. You can only define one masking function for a
> column, and since you use the original column name, you could only return
> one version of it in the result set, even if you had a way to define
> several functions.
>
> I'm not proposing this should change, just calling it out.
>
> henrik
>
> On Fri, Aug 19, 2022 at 2:50 PM Andrés de la Peña 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to start a discussion about this proposal for dynamic data
>> masking:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>
>> Dynamic data masking allows to obscure sensitive information without
>> changing the stored data. It would be based on a set of native CQL
>> functions providing different types of masking, such as replacing the
>> column value by "". These functions could be used as regular functions
>> or attached to table columns with CREATE/ALTER table. There would be a new
>> UNMASK permission, so only the users with this permissions would be able to
>> see the unmasked column values. It would be possible to customize masking
>> by using UDFs as masking functions.
>>
>> Thanks,
>>
>
>
> --
>
> Henrik Ingo
>
> +358 40 569 7354
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Henrik Ingo
One thought: The way the CEP is currently written, it is only possible to
mask a column one way. You can only define one masking function for a
column, and since you use the original column name, you could only return
one version of it in the result set, even if you had a way to define
several functions.

I'm not proposing this should change, just calling it out.

henrik

On Fri, Aug 19, 2022 at 2:50 PM Andrés de la Peña 
wrote:

> Hi everyone,
>
> I'd like to start a discussion about this proposal for dynamic data
> masking:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>
> Dynamic data masking allows to obscure sensitive information without
> changing the stored data. It would be based on a set of native CQL
> functions providing different types of masking, such as replacing the
> column value by "". These functions could be used as regular functions
> or attached to table columns with CREATE/ALTER table. There would be a new
> UNMASK permission, so only the users with this permissions would be able to
> see the unmasked column values. It would be possible to customize masking
> by using UDFs as masking functions.
>
> Thanks,
>


-- 

Henrik Ingo

+358 40 569 7354


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Andrés de la Peña
>
> Isn't there an assumption here that encryption can not be used?  Would we
> not be better served to build in an encryption strategy that keeps the data
> encrypted until the user shows permissions to decrypt, like the unmask
> property?  An encryption strategy that can work within the Cassandra
> internals?
> I think that issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.


Data encryption, access permissions and data masking are different
solutions to different problems. We don't have to choose between them, and
indeed we should aim to support all three of them at some point. None of
these features impedes the implementation of the others. Actually, it is
quite common for popular databases to provide all of them.

Data encryption should protect the data files from anyone who has direct
access to them, such as sstables, commitlog, etc. It offers
protection outside the interfaces of the database. Of course there is also
encryption of communications.

Permissions should completely prevent the access of unauthorized users to
the data within the database interface. Currently we have permissions on
CQL at the keyspace and table level, but we are missing column-level
permissions.

Data masking obfuscates all or part of the data without totally forbidding
access to it. The key here is that the masked data can still contain parts
of the original information, or be representative enough. For example,
masking can obfuscate all the digits of a credit card number except the
last four, so the clear digits can be used for some degree of
identification. As another example, a masking function returning a hash
of the value would allow joining the masked data from different sources
without exposing it.
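
As a minimal illustration of those two examples, here is a Python sketch
(not the CQL functions proposed in the CEP). The function names
mask_all_but_last and mask_with_hash are made up for this message, and the
choice of SHA-256 is an assumption; the CEP may define different functions
and signatures.

```python
import hashlib


def mask_all_but_last(value: str, keep_last: int = 4, pad: str = "*") -> str:
    """Replace every character except the last `keep_last` with `pad`."""
    if keep_last <= 0:
        return pad * len(value)
    visible = value[-keep_last:]
    return pad * (len(value) - len(visible)) + visible


def mask_with_hash(value: str) -> str:
    """Return a hex digest of the value instead of the value itself."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Partial masking keeps enough of the value for a degree of identification.
masked_card = mask_all_but_last("4111111111111111")

# Hash masking is deterministic, so equal inputs from two sources still match.
same_person = mask_with_hash("123-45-6789") == mask_with_hash("123-45-6789")
```

Note that an unsalted hash of a small value space such as SSNs can be
brute-forced, which is one more reason masking is not a replacement for
encryption or permissions.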

An example of how data masking and permissions can be used together could
be a company storing the social security numbers (SSN) of its customers.
The accounting team might need full access to the stored SSNs. Employees
attending phone calls might need to ask for the last two digits of SSN for
identification purposes, so they would need masked access. The rest of the
organization would need no access at all.
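
The three access levels in that example could be sketched like this in
Python. This is only an illustration of how permissions and masking combine;
the team names, the ACCESS_BY_TEAM table and read_ssn are hypothetical and do
not correspond to anything in Cassandra.

```python
FULL, MASKED, NO_ACCESS = "full", "masked", "none"

# Hypothetical mapping of teams to access levels for the SSN column.
ACCESS_BY_TEAM = {
    "accounting": FULL,   # would hold both SELECT and UNMASK
    "support": MASKED,    # would hold SELECT but not UNMASK
}


def read_ssn(team: str, ssn: str) -> str:
    """Return the SSN as the given team is allowed to see it."""
    access = ACCESS_BY_TEAM.get(team, NO_ACCESS)
    if access == FULL:
        return ssn
    if access == MASKED:
        # Keep only the last two characters for identification over the phone.
        return "*" * max(len(ssn) - 2, 0) + ssn[-2:]
    # Everyone else has no SELECT permission at all.
    raise PermissionError(f"team {team!r} cannot read SSNs")
```

The point is that masking covers only the middle case; the first and last
cases are handled by permissions.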

This CEP focuses exclusively on data masking, but there is no reason not to
start parallel work on other related-but-different features like
column-level permissions or on-disk data encryption.




On Mon, 22 Aug 2022 at 07:05, Claude Warren, Jr via dev <
[email protected]> wrote:

> I am more interested in the motivation where it is stated:
>
> Many users have the need of masking sensitive data, such as contact info,
>> age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to
>> obscure sensitive information while still allowing access to the masked
>> columns, and without changing the stored data.
>
>
> There is an unspoken assumption that the stored data format can not be
> changed.  It feels like this solution is starting from a false premise.
> Throughout the document there are guard statements about how this does not
> replace encryption.  Isn't there an assumption here that encryption can not
> be used?  Would we not be better served to build in an encryption strategy
> that keeps the data encrypted until the user shows permissions to decrypt,
> like the unmask property?  An encryption strategy that can work within the
> Cassandra internals?
>
> I think that issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.
>
> Yes, encryption is more difficult to implement and will take longer, but
> this feels like a sticking plaster that distracts from that underlying
> issue.
>
> my 0.02
>
> On Mon, Aug 22, 2022 at 12:30 AM Andrés de la Peña 
> wrote:
>
>> > If the column names are the same for masked and unmasked data, it would
>>> impact existing applications. I am curious what the transition plan look
>>> like for applications that expect unmasked data?
>>
>> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>>> feature, let’s say the app user is not given the UNMASK permission. Now the
>>> app is receiving masked values for these columns. This is fine for most
>>> read only applications. However, a lot of times these columns may be used
>>> as primary keys or part of primary keys in other tables. This would break
>>> existing applications.
>>> How would this work in mixed mode when  ew nodes in the cluster are
>>> masking data and others aren’t? How would it imp

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Andrés de la Peña
>
> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.


I'm not sure I understand why that would be useful. Why would random
suffixes give us a more realistic redacted data distribution? If we want to
avoid always returning the same value, we could use a function that just
returns a random value, without the XXX part, so we can use any data
type. Microsoft's SQL Server and Azure SQL have this function among their
masking functions.

Nevertheless, it would be quite easy to keep adding new masking functions
when we need them.
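
To compare the two ideas, here is a small Python sketch. Both function names
are invented for this message, and the 1 to 1000 range is simply the one from
the suggestion; none of this is taken from the CEP.

```python
import random


def mask_redacted_suffix(rng: random.Random, prefix: str = "XXX") -> str:
    """Suggested form: a fixed prefix followed by a random number in 1..1000."""
    return f"{prefix}{rng.randint(1, 1000)}"


def mask_random_int(rng: random.Random, low: int = 1, high: int = 1000) -> int:
    """Alternative: just a random value of the column's own type (here, int)."""
    return rng.randint(low, high)


rng = random.Random()
text_masked = mask_redacted_suffix(rng)   # always a string, whatever the column type
int_masked = mask_random_int(rng)         # stays an int for an int column
```

In both cases the same stored value produces a different masked value on each
read, so neither variant preserves keys or cross-references.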

On Mon, 22 Aug 2022 at 06:52, Berenguer Blasi 
wrote:

> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.
>
> I am not sure either about it's value, as that would still break any key
> or other cross-referencing.
>
> My 2cts.
> On 22/8/22 1:30, Andrés de la Peña wrote:
>
> > If the column names are the same for masked and unmasked data, it would
>> impact existing applications. I am curious what the transition plan look
>> like for applications that expect unmasked data?
>
> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>> feature, let’s say the app user is not given the UNMASK permission. Now the
>> app is receiving masked values for these columns. This is fine for most
>> read only applications. However, a lot of times these columns may be used
>> as primary keys or part of primary keys in other tables. This would break
>> existing applications.
>> How would this work in mixed mode when  ew nodes in the cluster are
>> masking data and others aren’t? How would it impact the driver?
>> How would the application learn that the column values are masked? This
>> is important in case a user has UNMASK permission and then later taken
>> away. Again this would break a lot of applications.
>
>
> Changing the masking of a column is a schema change, and as such it can be
> risky for existing applications. However, differently to deleting a column
> or revoking a SELECT permission, suddenly activating masking might pass
> undetected for existing applications.
>
> Applications developed after the introduction of this feature can check
> the table schema to know if a column is masked or not. We can even add a
> specific system view to ease this, if we think it's worth it. However,
> administrators should not activate masking when there could be applications
> that are not aware of the feature. We should be clear about this in the
> documentation.
>
> This is the way data masking seems to work in the databases I've checked.
> I also though that we could just change the name of the column when it's
> masked to something as "masked(column_name)", as it is discussed in the CEP
> document. This would make it impossible to miss that a column is masked.
> However, applications should be prepared to use different column names when
> reading result sets, depending on whether the data is masked for them or
> not. None of the databases mentioned on the "other databases" section of
> the CEP does this kind of column renaming, so it might be a kind of exotic
> behaviour. wdyt?
>
> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
> wrote:
>
>> > This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> Good idea. I have added a section at the end of the document briefly
>> describing how some other databases deal with data masking, and with links
>> to their documentation for the topic. I am not an expert in none of those
>> databases, so please take my comments there with a grain of salt.
>>
>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>>
>>> This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>>>
>>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
>>> wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> I'd like to start a discussion about this proposal for dynamic data
>>> masking:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>>
>>> Dynamic data masking allows to obscure sensitive information without
>>> changing the stored data. It wou

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-21 Thread Claude Warren, Jr via dev
I am more interested in the motivation where it is stated:

Many users have the need of masking sensitive data, such as contact info,
> age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to
> obscure sensitive information while still allowing access to the masked
> columns, and without changing the stored data.


There is an unspoken assumption that the stored data format can not be
changed.  It feels like this solution is starting from a false premise.
Throughout the document there are guard statements about how this does not
replace encryption.  Isn't there an assumption here that encryption can not
be used?  Would we not be better served to build in an encryption strategy
that keeps the data encrypted until the user shows permissions to decrypt,
like the unmask property?  An encryption strategy that can work within the
Cassandra internals?

I think that issue is that there are some data fields that should not be
discoverable by unauthorized users/systems, and I think this solution masks
that issue.  I fear that this capability will be seized upon by pointy
haired managers as a cheaper alternative to encryption, regardless of the
warnings otherwise, and that as a whole will harm the Cassandra ecosystem.

Yes, encryption is more difficult to implement and will take longer, but
this feels like a sticking plaster that distracts from that underlying
issue.

my 0.02

On Mon, Aug 22, 2022 at 12:30 AM Andrés de la Peña 
wrote:

> > If the column names are the same for masked and unmasked data, it would
>> impact existing applications. I am curious what the transition plan look
>> like for applications that expect unmasked data?
>
> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>> feature, let’s say the app user is not given the UNMASK permission. Now the
>> app is receiving masked values for these columns. This is fine for most
>> read only applications. However, a lot of times these columns may be used
>> as primary keys or part of primary keys in other tables. This would break
>> existing applications.
>> How would this work in mixed mode when  ew nodes in the cluster are
>> masking data and others aren’t? How would it impact the driver?
>> How would the application learn that the column values are masked? This
>> is important in case a user has UNMASK permission and then later taken
>> away. Again this would break a lot of applications.
>
>
> Changing the masking of a column is a schema change, and as such it can be
> risky for existing applications. However, differently to deleting a column
> or revoking a SELECT permission, suddenly activating masking might pass
> undetected for existing applications.
>
> Applications developed after the introduction of this feature can check
> the table schema to know if a column is masked or not. We can even add a
> specific system view to ease this, if we think it's worth it. However,
> administrators should not activate masking when there could be applications
> that are not aware of the feature. We should be clear about this in the
> documentation.
>
> This is the way data masking seems to work in the databases I've checked.
> I also though that we could just change the name of the column when it's
> masked to something as "masked(column_name)", as it is discussed in the CEP
> document. This would make it impossible to miss that a column is masked.
> However, applications should be prepared to use different column names when
> reading result sets, depending on whether the data is masked for them or
> not. None of the databases mentioned on the "other databases" section of
> the CEP does this kind of column renaming, so it might be a kind of exotic
> behaviour. wdyt?
>
> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
> wrote:
>
>> > This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> Good idea. I have added a section at the end of the document briefly
>> describing how some other databases deal with data masking, and with links
>> to their documentation for the topic. I am not an expert in none of those
>> databases, so please take my comments there with a grain of salt.
>>
>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>>
>>> This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>>>
>>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
>>> wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> I'd like to start a discussion about this proposal for dynamic data
>>> masking:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynam

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-21 Thread Berenguer Blasi
Maybe a small improvement is the redacted value could be of the form 
`XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: 
XXX54, XXX998, XXX456,... Some randomness would prevent some apps 
flattening all rows to a single XXX'ed one, giving a more realistic 
redacted data distribution/structure.


I am not sure either about its value, as that would still break any key 
or other cross-referencing.


My 2cts.

On 22/8/22 1:30, Andrés de la Peña wrote:


> If the column names are the same for masked and unmasked data, it would
impact existing applications. I am curious what the transition plan look
like for applications that expect unmasked data?

For example, let’s say you store SSNs and Birth dates. Upon
enabling this feature, let’s say the app user is not given the
UNMASK permission. Now the app is receiving masked values for
these columns. This is fine for most read only applications.
However, a lot of times these columns may be used as primary keys
or part of primary keys in other tables. This would break existing
applications.
How would this work in mixed mode when  ew nodes in the cluster
are masking data and others aren’t? How would it impact the driver?
How would the application learn that the column values are masked?
This is important in case a user has UNMASK permission and then
later taken away. Again this would break a lot of applications.


Changing the masking of a column is a schema change, and as such it 
can be risky for existing applications. However, differently to 
deleting a column or revoking a SELECT permission, suddenly activating 
masking might pass undetected for existing applications.


Applications developed after the introduction of this feature can 
check the table schema to know if a column is masked or not. We can 
even add a specific system view to ease this, if we think it's worth 
it. However, administrators should not activate masking when there 
could be applications that are not aware of the feature. We should be 
clear about this in the documentation.


This is the way data masking seems to work in the databases I've 
checked. I also though that we could just change the name of the 
column when it's masked to something as "masked(column_name)", as it 
is discussed in the CEP document. This would make it impossible to 
miss that a column is masked. However, applications should be prepared 
to use different column names when reading result sets, depending on 
whether the data is masked for them or not. None of the databases 
mentioned on the "other databases" section of the CEP does this kind 
of column renaming, so it might be a kind of exotic behaviour. wdyt?


On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña  
wrote:


> This type of feature is very useful, but it may be easier to
analyze this proposal if it’s compared with other DDM
implementations from other databases? Would it be reasonable
to add a table to the proposal comparing syntax and output
from eg Azure SQL vs Cassandra vs whatever ? 



Good idea. I have added a section at the end of the document
briefly describing how some other databases deal with data
masking, and with links to their documentation for the topic. I am
not an expert in none of those databases, so please take my
comments there with a grain of salt.

On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:

This type of feature is very useful, but it may be easier to
analyze this proposal if it’s compared with other DDM
implementations from other databases? Would it be reasonable
to add a table to the proposal comparing syntax and output
from eg Azure SQL vs Cassandra vs whatever ?



On Aug 19, 2022, at 4:50 AM, Andrés de la Peña
 wrote:


Hi everyone,

I'd like to start a discussion about this proposal for
dynamic data masking:

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking

Dynamic data masking allows to obscure sensitive information
without changing the stored data. It would be based on a set
of native CQL functions providing different types of masking,
such as replacing the column value by "". These functions
could be used as regular functions or attached to table
columns with CREATE/ALTER table. There would be a new UNMASK
permission, so only the users with this permissions would be
able to see the unmasked column values. It would be possible
to customize masking by using UDFs as masking functions.

Thanks,


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-21 Thread Andrés de la Peña
>
> > If the column names are the same for masked and unmasked data, it would
> impact existing applications. I am curious what the transition plan look
> like for applications that expect unmasked data?

For example, let’s say you store SSNs and Birth dates. Upon enabling this
> feature, let’s say the app user is not given the UNMASK permission. Now the
> app is receiving masked values for these columns. This is fine for most
> read only applications. However, a lot of times these columns may be used
> as primary keys or part of primary keys in other tables. This would break
> existing applications.
> How would this work in mixed mode when  ew nodes in the cluster are
> masking data and others aren’t? How would it impact the driver?
> How would the application learn that the column values are masked? This is
> important in case a user has UNMASK permission and then later taken away.
> Again this would break a lot of applications.


Changing the masking of a column is a schema change, and as such it can be
risky for existing applications. However, unlike deleting a column
or revoking a SELECT permission, suddenly activating masking might go
undetected by existing applications.

Applications developed after the introduction of this feature can check the
table schema to know if a column is masked or not. We can even add a
specific system view to ease this, if we think it's worth it. However,
administrators should not activate masking when there could be applications
that are not aware of the feature. We should be clear about this in the
documentation.

This is the way data masking seems to work in the databases I've checked. I
also thought that we could just change the name of the column when it's
masked to something like "masked(column_name)", as discussed in the CEP
document. This would make it impossible to miss that a column is masked.
However, applications would have to be prepared to use different column
names when reading result sets, depending on whether the data is masked for
them or not. None of the databases mentioned in the "other databases"
section of the CEP does this kind of column renaming, so it might be a kind
of exotic behaviour. wdyt?
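
To show what the renaming alternative would mean for client code, here is a
small Python sketch. It assumes a result row is exposed as a dict keyed by
column name and that masked columns come back as "masked(column_name)"; the
helper read_column is hypothetical and not part of any driver.

```python
from typing import Any, Dict, Tuple


def read_column(row: Dict[str, Any], column: str) -> Tuple[Any, bool]:
    """Return (value, is_masked) for a column that may have been renamed."""
    masked_name = f"masked({column})"
    if masked_name in row:
        # The server renamed the column, so the value is masked for this user.
        return row[masked_name], True
    return row[column], False


# A user without UNMASK would receive the renamed column...
value, is_masked = read_column({"id": 1, "masked(ssn)": "****"}, "ssn")

# ...while a user with UNMASK would receive the original name.
clear_value, clear_is_masked = read_column({"id": 1, "ssn": "123-45-6789"}, "ssn")
```

Every read of a potentially masked column would need this kind of double
lookup, which is the cost of making the masking impossible to miss.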

On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
wrote:

> > This type of feature is very useful, but it may be easier to analyze
>> this proposal if it’s compared with other DDM implementations from other
>> databases? Would it be reasonable to add a table to the proposal comparing
>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>
>
> Good idea. I have added a section at the end of the document briefly
> describing how some other databases deal with data masking, and with links
> to their documentation for the topic. I am not an expert in none of those
> databases, so please take my comments there with a grain of salt.
>
> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>
>> This type of feature is very useful, but it may be easier to analyze this
>> proposal if it’s compared with other DDM implementations from other
>> databases? Would it be reasonable to add a table to the proposal comparing
>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
>> wrote:
>>
>> 
>> Hi everyone,
>>
>> I'd like to start a discussion about this proposal for dynamic data
>> masking:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>
>> Dynamic data masking allows to obscure sensitive information without
>> changing the stored data. It would be based on a set of native CQL
>> functions providing different types of masking, such as replacing the
>> column value by "". These functions could be used as regular functions
>> or attached to table columns with CREATE/ALTER table. There would be a new
>> UNMASK permission, so only the users with this permissions would be able to
>> see the unmasked column values. It would be possible to customize masking
>> by using UDFs as masking functions.
>>
>> Thanks,
>>
>>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Andrés de la Peña
>
> > This type of feature is very useful, but it may be easier to analyze
> this proposal if it’s compared with other DDM implementations from other
> databases? Would it be reasonable to add a table to the proposal comparing
> syntax and output from eg Azure SQL vs Cassandra vs whatever ?


Good idea. I have added a section at the end of the document briefly
describing how some other databases deal with data masking, with links
to their documentation on the topic. I am not an expert in any of those
databases, so please take my comments there with a grain of salt.

On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:

> This type of feature is very useful, but it may be easier to analyze this
> proposal if it’s compared with other DDM implementations from other
> databases? Would it be reasonable to add a table to the proposal comparing
> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>
>
> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
> wrote:
>
> 
> Hi everyone,
>
> I'd like to start a discussion about this proposal for dynamic data
> masking:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>
> Dynamic data masking allows to obscure sensitive information without
> changing the stored data. It would be based on a set of native CQL
> functions providing different types of masking, such as replacing the
> column value by "". These functions could be used as regular functions
> or attached to table columns with CREATE/ALTER table. There would be a new
> UNMASK permission, so only the users with this permissions would be able to
> see the unmasked column values. It would be possible to customize masking
> by using UDFs as masking functions.
>
> Thanks,
>
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Jeff Jirsa
This type of feature is very useful, but it may be easier to analyze this 
proposal if it’s compared with other DDM implementations from other databases. 
Would it be reasonable to add a table to the proposal comparing syntax and 
output from e.g. Azure SQL vs Cassandra vs whatever?


> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña  wrote:
> 
> 
> Hi everyone,
> 
> I'd like to start a discussion about this proposal for dynamic data masking: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
> 
> Dynamic data masking allows to obscure sensitive information without changing 
> the stored data. It would be based on a set of native CQL functions providing 
> different types of masking, such as replacing the column value by "". 
> These functions could be used as regular functions or attached to table 
> columns with CREATE/ALTER table. There would be a new UNMASK permission, so 
> only the users with this permissions would be able to see the unmasked column 
> values. It would be possible to customize masking by using UDFs as masking 
> functions.
> 
> Thanks,


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Dinesh Joshi
Sounds interesting. I would like to understand a couple of things here. If the 
column names are the same for masked and unmasked data, it would impact 
existing applications. I am curious what the transition plan would look like 
for applications that expect unmasked data.

For example, let’s say you store SSNs and Birth dates. Upon enabling this 
feature, let’s say the app user is not given the UNMASK permission. Now the app 
is receiving masked values for these columns. This is fine for most read only 
applications. However, a lot of times these columns may be used as primary keys 
or part of primary keys in other tables. This would break existing applications.
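
A small Python sketch of that failure mode, purely for illustration. The
in-memory tables, the `read_row` and `find_account` helpers and the "****"
mask are hypothetical stand-ins, not Cassandra behaviour or CEP syntax. The
only assumption carried over from the proposal is that users without UNMASK
see masked values in place of the stored ones.

```python
from typing import Any, Callable, Dict, Optional

MASK = "****"  # hypothetical masked value

# Hypothetical tables: users store an SSN, accounts are keyed by SSN.
users: Dict[str, Dict[str, Any]] = {
    "alice": {"name": "alice", "ssn": "123-45-6789"},
}
accounts_by_ssn: Dict[str, Dict[str, Any]] = {
    "123-45-6789": {"balance": 100},
}
# Masking functions attached to columns (here only "ssn" is masked).
masks: Dict[str, Callable[[Any], Any]] = {"ssn": lambda _value: MASK}


def read_row(row: Dict[str, Any], has_unmask: bool) -> Dict[str, Any]:
    """Return the row as a user would see it, masking columns if needed."""
    if has_unmask:
        return dict(row)
    return {c: masks[c](v) if c in masks else v for c, v in row.items()}


def find_account(user_name: str, has_unmask: bool) -> Optional[Dict[str, Any]]:
    """Read the user's SSN, then use it as the key into the other table."""
    user = read_row(users[user_name], has_unmask)
    return accounts_by_ssn.get(user["ssn"])


print(find_account("alice", has_unmask=True))
print(find_account("alice", has_unmask=False))
```

With UNMASK the second lookup finds the account; without it the app silently
looks up "****" and gets nothing back, with no error to signal why.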

How would this work in mixed mode when new nodes in the cluster are masking 
data and others aren’t? How would it impact the driver?

How would the application learn that the column values are masked? This is 
important in case a user has the UNMASK permission and it is later taken away. 
Again, this would break a lot of applications.

Dinesh

> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña  wrote:
> 
> 
> Hi everyone,
> 
> I'd like to start a discussion about this proposal for dynamic data masking: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
> 
> Dynamic data masking allows to obscure sensitive information without changing 
> the stored data. It would be based on a set of native CQL functions providing 
> different types of masking, such as replacing the column value by "". 
> These functions could be used as regular functions or attached to table 
> columns with CREATE/ALTER table. There would be a new UNMASK permission, so 
> only the users with this permissions would be able to see the unmasked column 
> values. It would be possible to customize masking by using UDFs as masking 
> functions.
> 
> Thanks,