[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1477517177 @zentol I addressed all your latest comments. Can we merge the PR ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1475934338 @zentol I addressed all the points PTAL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1474052060 @zentol I pushed for security reasons, I have only the addSplitsBack case to deal with in lazy splits generation scenario. You can start to take a look at the other things if you have time or just wait for this part to be finished. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1468398944 > (Note: This review is specifically about the JMX parts; nothing else) OK. Tell me if you have comments on the size-based split. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1468389988 > > Instead of this whole JMX business we could use nodetool directly. > > (Note: This review is specifically about the JMX parts; nothing else) > > Indeed, this binary is included in the offical cassandra image ! Way simpler. Thanks for the suggestion, I removed all the JMX related code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1468388485 > Instead of this whole JMX business we could use nodetool directly. > > (Note: This review is specifically about the JMX parts; nothing else) Indeed, this binary is included in the offical cassandra image ! Way simpler. Thanks for the suggestion, I removed all the JMX related code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1465798685 @zentol PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1454671877 @mosche contributed testContainers configuration that allows to avoid having a private docker image + timeouts etc... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1448409889 @zentol I addressed all your comments and changed the splits architecture: I Introduced a table size estimation (based on Cassandra statistical size estimates). I added an optional user conf to specify max split memory size. If set, the source generates splits of `maxSplitMemorySize` with protection measures (in comparison to task parallelism) for number of splits. I now read a split as a whole (no state needed). I added the related splits and size tests. They require to use JMX to force mem tables flush on the cassandra cluster so that the system size estimates can be updated (as we just wrote test data). The official Cassandra image deactivates jmx, to enable it we need to provide authentication and modify cassandra-env.sh so I had to create my own image (!) Also the flush is very long (30s) so for all split tests I wrote and flush only once (contrary to other tests that write test data for each test) PTAL. I hope it will be last round of review as I changed a lot and spent so much time on that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1429539810 @zentol the PR is not ready for review yet as I changed a lot (split size estimation and fetch state management) it is still under coding, the tests don't pass yet. But still thanks for the confirmation about the general architecture about splits and memory -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1406744295 I have implified a lot SplitsGenerator: it is now based only on Cassandra min and max tokens per partitioner and generates (min,x][x, max) for 2 splits for example. It now supports all the Cassandra partitioners. Now a Cassandra split is just a ring range (a split of cassandra token range). This also simplifies a lot the range query generation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1405116976 @zentol I'm almost done replacing the whole split mechanism. I could even remove the size estimates as Flink does not rely on expected split memory size as Beam. So I could use a very simple splitting mechanism based only on Cassandra min and max tokens per partitioner (min,x][x, max) for 2 splits for example. I also could consider that a Cassandra split processes only a single ring range and get rid of RingRange class. It'll change much. I also need to do a final pass before it is ready for another round. Otherwise you'll see sketchy things. don't waste your time reviewing now. I'll ping you to take a look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1404820642 > > I don't think it is an ASF rule but fair enough, > > FYI; By and large it should be viewed as a legal requirement. By copying code from cassandra you have to adhere to their licensing, which among other state that you must have prominent notices for changes to a file. You're right cf ASF v2 4.d: `If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.` Anyway, as we decided to replace the split code with something that was never merged to Beam, there is no need anymore. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1397142509 > > I did not author the RingRange and SplitsGenerator classes. I got them from the Apache Beam Cassandra connector. > > If you're copying stuff from other projects it please add some notice/attribution to the files and update the Flink source notice accordingly. ah yes I forgot this Flink guideline. I don't think it is an ASF rule but fair enough, I'll add it to the javadoc and notice. > > > Back in 2017 I coded a [splitter for Cassandra Beam connector](https://github.com/echauchot/beam/blob/bfa33b85b6b310556ffa5c44c99bef50575b2c56/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java#L346) that works [with tokens](https://github.com/echauchot/beam/blob/BEAM-245-CassandraIO/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/DataSizeEstimates.java) also but that is simpler and supports all the Cassandra partitionners. Would you prefer that we use this other approach ? > > Not sure? Why didn't it make it into Beam? Do you know why the Beam code is written the way it is? Actually, another splitting approach was opted in by the reviewer in 2017. But short after there was another author who changed the splitting to something similar to my 2017 token based code. So when I thought about coding the split for Flink connector I decided to take the version of the code that was merged to Beam master. But it is true that it is over complicated, redundant and not supporting the non-default Cassandra partitioner. The approach I had in 2017 was the same as the Cassandra Spark connector written by datastax (tokens + cassandra size estimates statistics). So I think I'll try to reuse this code, migrate it to Flink and update it to the latest Cassandra version and push it in this PR. WDYT ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1387058843 > @zentol I did not author the `RingRange` and `SplitsGenerator` classes. I got them from the `Apache Beam` Cassandra connector. I agree, some notions need clarification. I'll add comments. I was thinking: this `SplitsGenerator/RingRanges` I got from the Apache Beam project seems weird in many aspects that you pointed out. Back in 2017 I coded a [partitionner for Cassandra Beam connector]( https://github.com/echauchot/beam/blob/BEAM-245-CassandraIO/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/DataSizeEstimates.java) that works with tokens also but that is simpler and supports all the Cassandra partitionners. Would you prefer that we use this other approach ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1386836888 @zentol I did not author the `RingRange` and `SplitsGenerator` classes. I got them from the `Apache Beam` Cassandra connector. I agree, some notions need clarification. I'll add comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1368798535 @zentol Ryan gave a LGTM can you take a look at this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1364182149 @RyanSkraba I finished everything I wanted to change PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1363862897 Also I figured out that the the test of the cassandra tokens split I got from Apache Beam tests only for `RandomPartitioner` (the tokens split method is compatible with both `RandomPartitioner` and `Murmur3Partitioner` as only tokens boundaries vary between the 2) whereas in the actual data split it is compatible only with `Murmur3Partitioner`. So it sould test the tokens split for this cassandra partitionner. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1363328649 @RyanSkraba thanks for your review. I addressed all your comments PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1344284613 > > Is there anything I can do to help the review ? > > For better or worse, no :) > > Holidays are coming up so I won't review it this year (scream) at the very least. If nothing happened til January (which I'd like to avoid!) I'll take a look myself to get things moving. Thanks @zentol -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1337060717 Any updates on this PR ? This subject has been open for a long time: all the other Cassandra PRs (https://github.com/apache/flink/pull/19586, https://github.com/apache/flink/pull/19680, https://github.com/apache/flink-connector-cassandra/pull/2) were just a preparation for this PR and this PR was submitted in October. Is there anything I can do to help the review ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1332013091 > I'm taking a look! Thanks for the ping +1 @RyanSkraba any review comments so far ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-cassandra] echauchot commented on pull request #3: [FLINK-26822] Add Cassandra Source
echauchot commented on PR #3: URL: https://github.com/apache/flink-connector-cassandra/pull/3#issuecomment-1327516095 R: @zentol R: @RyanSkraba you mentioned that you already started to review this PR (when it was in the main flink repo) car you send your review comments ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org