[jira] [Commented] (PHOENIX-7163) Do Not Dependency Manage commons-configuration2 Version

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801688#comment-17801688
 ] 

ASF GitHub Bot commented on PHOENIX-7163:
-

stoty opened a new pull request, #1776:
URL: https://github.com/apache/phoenix/pull/1776

   (no comment)




> Do Not Dependency Manage commons-configuration2 Version
> ---
>
> Key: PHOENIX-7163
> URL: https://issues.apache.org/jira/browse/PHOENIX-7163
> Project: Phoenix
> Issue Type: Bug
> Components: core
> Affects Versions: 5.2.0, 5.1.4
> Reporter: Istvan Toth
> Assignee: Istvan Toth
> Priority: Major
>
> We are using commons-configuration2 for the Hadoop metrics code, because 
> that Hadoop API is badly broken.
> Because of this, I have added dependency management for that dependency.
> We are setting an old version, which is known to have CVEs.
> Remove the dependency management so that we can pick up any possible future 
> fixes from Hadoop instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] PHOENIX-7163 Do Not Dependency Manage commons-configuration2 Version [phoenix]

2024-01-01 Thread via GitHub


stoty opened a new pull request, #1776:
URL: https://github.com/apache/phoenix/pull/1776

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@phoenix.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] OMID-254 Upgrade to phoenix-thirdparty 2.1.0 [phoenix-omid]

2024-01-01 Thread via GitHub


stoty commented on PR #151:
URL: https://github.com/apache/phoenix-omid/pull/151#issuecomment-1873692027

   No need. If you have tested with thirdparty-2.1 in both Omid and Phoenix at 
the same time, that's fine.
   





Re: [PR] OMID-254 Upgrade to phoenix-thirdparty 2.1.0 [phoenix-omid]

2024-01-01 Thread via GitHub


NihalJain commented on PR #151:
URL: https://github.com/apache/phoenix-omid/pull/151#issuecomment-1873681962

   > Have you run the full Phoenix test suite with both Phoenix and Omid built 
with the new thirdparty, @NihalJain?
   
   Hi @stoty, I have run the following for Omid.
   
   > Built code locally and ran tests:
   > 
   > ```
   > mvn clean install -Dhbase.version=2.5.6-hadoop3 -DskipTests
   > mvn verify -Dsurefire.rerunFailingTestsCount=5 -Dhbase.version=2.5.6-hadoop3
   > ```
   
   I had also run the tests for Phoenix; see 
https://github.com/apache/phoenix-thirdparty/pull/8#issuecomment-1832165125





Re: [PR] OMID-254 Upgrade to phoenix-thirdparty 2.1.0 [phoenix-omid]

2024-01-01 Thread via GitHub


stoty commented on PR #151:
URL: https://github.com/apache/phoenix-omid/pull/151#issuecomment-1873673603

   Have you run the full Phoenix test suite with both Phoenix and Omid built 
with the new thirdparty, @NihalJain?





[jira] [Commented] (PHOENIX-7157) Upgrade to phoenix-thirdparty 2.1.0

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801675#comment-17801675
 ] 

ASF GitHub Bot commented on PHOENIX-7157:
-

stoty commented on PR #1771:
URL: https://github.com/apache/phoenix/pull/1771#issuecomment-1873671288

   The test results have aged out.
   I have kicked off another CI run.




> Upgrade to phoenix-thirdparty 2.1.0
> ---
>
> Key: PHOENIX-7157
> URL: https://issues.apache.org/jira/browse/PHOENIX-7157
> Project: Phoenix
> Issue Type: Improvement
> Components: core
> Reporter: Nihal Jain
> Assignee: Nihal Jain
> Priority: Major
>
> Phoenix-thirdparty 2.1.0 has been released, see 
> [https://www.mail-archive.com/user@phoenix.apache.org/msg08204.html]
> {quote}The recent release has upgraded Guava to version 32.1.3-jre from the 
> previous 31.0.1-android version. Initially, the 4.x branch maintained 
> compatibility with Java 7, necessitating the use of the Android variant of 
> Guava. However, with the end-of-life (EOL) status of the 4.x branch, the move 
> to the standard JRE version of Guava signifies a shift in compatibility 
> standards.
> {quote}
> It's time we bumped up the version. Also, PHOENIX-7116 is now in place, so we 
> can pull this into branch 5.1.





Re: [PR] PHOENIX-7157 Upgrade to phoenix-thirdparty 2.1.0 [phoenix]

2024-01-01 Thread via GitHub


stoty commented on PR #1771:
URL: https://github.com/apache/phoenix/pull/1771#issuecomment-1873671288

   The test results have aged out.
   I have kicked off another CI run.





[jira] [Commented] (PHOENIX-6721) CSV bulkload tool fails with FileNotFoundException if --output points to the S3 location

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801674#comment-17801674
 ] 

ASF GitHub Bot commented on PHOENIX-6721:
-

stoty closed pull request #1450: PHOENIX-6721 CSV bulkload tool fails with 
FileNotFoundException if --…
URL: https://github.com/apache/phoenix/pull/1450




> CSV bulkload tool fails with FileNotFoundException if --output points to the 
> S3 location
> 
>
> Key: PHOENIX-6721
> URL: https://issues.apache.org/jira/browse/PHOENIX-6721
> Project: Phoenix
> Issue Type: Bug
> Components: core
> Reporter: Sergey Soldatov
> Assignee: Istvan Toth
> Priority: Major
> Fix For: 5.2.0, 5.1.4
>
>
> We were trying to use the CSV bulkload tool with HBase/Phoenix running on top 
> of AWS S3 and found that when the --output parameter points to an S3 location, 
> the job fails with a FileNotFoundException (FNFE).





[jira] [Commented] (PHOENIX-6721) CSV bulkload tool fails with FileNotFoundException if --output points to the S3 location

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801673#comment-17801673
 ] 

ASF GitHub Bot commented on PHOENIX-6721:
-

stoty commented on PR #1450:
URL: https://github.com/apache/phoenix/pull/1450#issuecomment-1873669734

   A fixed version of this was committed from #1765.









Re: [PR] PHOENIX-6721 CSV bulkload tool fails with FileNotFoundException if --… [phoenix]

2024-01-01 Thread via GitHub


stoty closed pull request #1450: PHOENIX-6721 CSV bulkload tool fails with 
FileNotFoundException if --…
URL: https://github.com/apache/phoenix/pull/1450





Re: [PR] PHOENIX-6721 CSV bulkload tool fails with FileNotFoundException if --… [phoenix]

2024-01-01 Thread via GitHub


stoty commented on PR #1450:
URL: https://github.com/apache/phoenix/pull/1450#issuecomment-1873669734

   A fixed version of this was committed from #1765.





[jira] [Commented] (PHOENIX-6721) CSV bulkload tool fails with FileNotFoundException if --output points to the S3 location

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801578#comment-17801578
 ] 

ASF GitHub Bot commented on PHOENIX-6721:
-

steveloughran commented on PR #1765:
URL: https://github.com/apache/phoenix/pull/1765#issuecomment-1873439846

   > I don't quite remember why I didn't use getWorkPath in the first place,
   
   @ss77892 we only pulled it up from a FileOutputFormat into an interface with 
the s3a committer work









Re: [PR] PHOENIX-6721 CSV bulkload tool fails with FileNotFoundException if --… [phoenix]

2024-01-01 Thread via GitHub


steveloughran commented on PR #1765:
URL: https://github.com/apache/phoenix/pull/1765#issuecomment-1873439846

   > I don't quite remember why I didn't use getWorkPath in the first place,
   
   @ss77892 we only pulled it up from a FileOutputFormat into an interface with 
the s3a committer work





[jira] (PHOENIX-7001) Change Data Capture leveraging Max Lookback and Uncovered Indexes

2024-01-01 Thread Hari Krishna Dara (Jira)


[ https://issues.apache.org/jira/browse/PHOENIX-7001 ]


Hari Krishna Dara deleted comment on PHOENIX-7001:


was (Author: haridsv):
Resolved wrong item.

> Change Data Capture leveraging Max Lookback and Uncovered Indexes
> -
>
> Key: PHOENIX-7001
> URL: https://issues.apache.org/jira/browse/PHOENIX-7001
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Kadir Ozdemir
> Priority: Major
>
> The use cases for a Change Data Capture (CDC) feature center on capturing 
> changes to a given table (or updatable view) as they happen, in near real 
> time. A CDC application can retrieve changes in real time or with some delay, 
> or even retrieve the same set of changes multiple times. This means the CDC 
> use case can be generalized as time range queries where the time range is 
> typically short, such as the last x minutes or hours, or is expressed as a 
> specific time range in the last n days, where n is typically less than 7.
> A change is an update in a row. That is, a change is either updating one or 
> more columns of a table for a given row or deleting a row. It is desirable to 
> provide these changes in the order of their arrival. One can visualize the 
> delivery of these changes through a stream from a Phoenix table to the 
> application that is initiated by the application similar to the delivery of 
> any other Phoenix query results. The difference is that a regular query 
> result includes at most one result row for each row satisfying the query and 
> the deleted rows are not visible to the query result while the CDC 
> stream/result can include multiple result rows for each row and the result 
> includes deleted rows. Some use cases need to also get the pre and/or post 
> image of the row along with a change on the row. 
> The design proposed here leverages Phoenix Max Lookback and Uncovered (Global 
> or Local) Indexes. The max lookback feature retains recent changes to a 
> table, that is, the changes that have been done in the last x days typically. 
> This means that the max lookback feature already captures the changes to a 
> given table. Currently, the max lookback age is configurable at the cluster 
> level. We need to extend this capability to be able to configure the max 
> lookback age at the table level so that each table can have a different max 
> lookback age based on its CDC application requirements.
> To deliver the changes in the order of their arrival, we need a time-based 
> index. This index should be uncovered, as the changes are already retained in 
> the table by the max lookback feature. The arrival time can be defined as the 
> mutation timestamp generated by the server, or a user-specified timestamp (or 
> any other long integer) column. An uncovered index would allow us to access 
> the changes efficiently and in order. Changes to an index table are also 
> preserved by the max lookback feature.
> A CDC feature can be composed of the following components:
>  * {*}CDCUncoveredIndexRegionScanner{*}: This is a server side scanner on an 
> uncovered index used for CDC. This can inherit UncoveredIndexRegionScanner. 
> It goes through index table rows using a raw scan to identify data table rows 
> and retrieves these rows using a raw scan. Using the time range, it forms a 
> JSON blob to represent changes to the row including pre and/or post row 
> images.
>  * {*}CDC Query Compiler{*}: This is a client side component. It prepares the 
> scan object based on the given CDC query statement. 
>  * {*}CDC DDL Compiler{*}: This is a client side component. It creates the 
> time based uncovered (global/local) index based on the given CDC DDL 
> statement and a virtual table of CDC type. CDC will be a new table type. 
> A CDC DDL syntax to create CDC on a (data) table can be as follows: 
> Create CDC  on  (PHOENIX_ROW_TIMESTAMP()  | 
> ) INCLUDE (pre | post | latest | all) TTL =  seconds> INDEX =  SALT_BUCKETS=
> The above CDC DDL creates a virtual CDC table and an uncovered index. The CDC 
> table PK columns start with the timestamp or user defined column and continue 
> with the data table PK columns. The CDC table includes one non-PK column 
> which is a JSON column. The change is expressed in this JSON column in 
> multiple ways based on the CDC DDL or query statement. The change can be 
> expressed as just the mutation for the change, the latest image of the row, 
> the pre image of the row (the image before the change), the post image, or 
> any combination of these. The CDC table is not a physical table on disk. It 
> is just a virtual table to be used in a CDC query. Phoenix stores just the 
> metadata for this virtual table. 
> A CDC query can be as follows:
> Select * from  where PHOENIX_ROW_TIMESTAMP() >= TO_DATE( …) 
> AND PHOENIX_ROW_TIMESTAMP() < TO_DATE( 
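
The time-range model described above can be illustrated with a small in-memory sketch. This is not Phoenix code: `ToyTable`, its methods, and the use of plain integers as timestamps are all hypothetical names chosen for illustration. It assumes timestamps only increase and that a row is mutated at most once per timestamp. The sorted `index` list stands in for the time-based uncovered index (it holds keys only), and the retained `versions` stand in for what max lookback keeps in the data table.

```python
import bisect
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Row = Optional[Dict[str, Any]]  # None marks a deleted row


@dataclass
class ToyTable:
    """Hypothetical stand-in for a data table plus a time-based uncovered index."""

    # Sorted (timestamp, pk) pairs: keys only, no column values ("uncovered").
    index: List[Tuple[int, str]] = field(default_factory=list)
    # pk -> retained (timestamp, row image) versions, oldest first.
    versions: Dict[str, List[Tuple[int, Row]]] = field(default_factory=dict)

    def upsert(self, ts: int, pk: str, columns: Dict[str, Any]) -> None:
        # Merge the updated columns into the latest image of the row, if any.
        prev = self._latest(pk) or {}
        self._record(ts, pk, {**prev, **columns})

    def delete(self, ts: int, pk: str) -> None:
        self._record(ts, pk, None)

    def _latest(self, pk: str) -> Row:
        history = self.versions.get(pk, [])
        return history[-1][1] if history else None

    def _record(self, ts: int, pk: str, row: Row) -> None:
        self.versions.setdefault(pk, []).append((ts, row))
        bisect.insort(self.index, (ts, pk))

    def changes(self, start: int, end: int) -> List[Tuple[int, str, Row]]:
        """Return changes with start <= ts < end, in timestamp order."""
        lo = bisect.bisect_left(self.index, (start, ""))
        hi = bisect.bisect_left(self.index, (end, ""))
        out = []
        for ts, pk in self.index[lo:hi]:
            # The index only identifies the row; the image comes from the table.
            row = next(r for t, r in self.versions[pk] if t == ts)
            out.append((ts, pk, row))
        return out
```

Unlike a regular query, `changes` can return several entries for the same row and also returns delete markers, which matches the difference between a CDC result and a normal query result described in the issue.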

[jira] [Commented] (PHOENIX-7015) Extend UncoveredGlobalIndexRegionScanner for CDC region scanner usecase

2024-01-01 Thread Hari Krishna Dara (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801515#comment-17801515
 ] 

Hari Krishna Dara commented on PHOENIX-7015:


Some PoC changes have been included in this PR: 
https://github.com/apache/phoenix/pull/1766

> Extend UncoveredGlobalIndexRegionScanner for CDC region scanner usecase
> ---
>
> Key: PHOENIX-7015
> URL: https://issues.apache.org/jira/browse/PHOENIX-7015
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Viraj Jasani
> Priority: Major
>
> For the CDC region scanner use case, extend UncoveredGlobalIndexRegionScanner 
> to CDCUncoveredGlobalIndexRegionScanner. The new region scanner for CDC 
> performs a raw scan of the index table and retrieves data table rows from 
> index rows.
> Using the time range, it can form a JSON blob to represent changes to the row, 
> including pre and/or post row images.
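
As a rough illustration of the last step, the sketch below builds a JSON blob with pre and post images for one change. It is not the scanner from the PR: the function name `change_blob`, the blob field names, and the shape of `history` are assumptions made for this example, and only the "pre", "post", and "all" options are modeled.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Row = Optional[Dict[str, Any]]  # None stands for "row deleted or not present"


def change_blob(history: List[Tuple[int, Row]], ts: int, include: str = "all") -> str:
    """Describe the change made to one row at timestamp ts as a JSON string.

    history: (timestamp, row image after that mutation) pairs sorted by
             timestamp, as a raw scan over retained versions might yield them
             (an assumption of this sketch).
    include: "pre", "post", or "all".
    """
    pre: Row = None
    post: Row = None
    found = False
    for t, image in history:
        if t < ts:
            pre = image   # last image strictly before the change
        elif t == ts:
            post = image  # image produced by the change itself
            found = True
            break
    if not found:
        raise KeyError(f"no change recorded at timestamp {ts}")

    blob: Dict[str, Any] = {"timestamp": ts, "deleted": post is None}
    if include in ("pre", "all"):
        blob["pre"] = pre
    if include in ("post", "all"):
        blob["post"] = post
    return json.dumps(blob, sort_keys=True)
```

A delete shows up as a blob whose post image is null, while the pre image still carries the last retained version of the row.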





[jira] (PHOENIX-7001) Change Data Capture leveraging Max Lookback and Uncovered Indexes

2024-01-01 Thread Hari Krishna Dara (Jira)


[ https://issues.apache.org/jira/browse/PHOENIX-7001 ]


Hari Krishna Dara deleted comment on PHOENIX-7001:


was (Author: haridsv):
PR: https://github.com/apache/phoenix/pull/1766


[jira] (PHOENIX-7001) Change Data Capture leveraging Max Lookback and Uncovered Indexes

2024-01-01 Thread Hari Krishna Dara (Jira)


[ https://issues.apache.org/jira/browse/PHOENIX-7001 ]


Hari Krishna Dara deleted comment on PHOENIX-7001:


was (Author: haridsv):
Change merged into the feature branch.


[jira] [Commented] (PHOENIX-7001) Change Data Capture leveraging Max Lookback and Uncovered Indexes

2024-01-01 Thread Hari Krishna Dara (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801509#comment-17801509
 ] 

Hari Krishna Dara commented on PHOENIX-7001:


PR: https://github.com/apache/phoenix/pull/1766
