[jira] [Commented] (BEAM-4049) Improve write throughput of CassandraIO

2018-04-11 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433671#comment-16433671
 ] 

Alexander Dejanovski commented on BEAM-4049:


[~jbonofre]: I have a patch in the works so you can assign me this ticket if 
you want to.

> Improve write throughput of CassandraIO
> ---
>
> Key: BEAM-4049
> URL: https://issues.apache.org/jira/browse/BEAM-4049
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-cassandra
>Affects Versions: 2.4.0
>Reporter: Alexander Dejanovski
>Assignee: Jean-Baptiste Onofré
>Priority: Major
>  Labels: performance
>
> The CassandraIO currently uses the mapper to perform writes in a synchronous 
> fashion. 
> This implies that writes are serialized and is a very suboptimal way of 
> writing to Cassandra.
> The IO should use the saveAsync() method instead of save() and should wait 
> for completion each time 100 queries are in flight, in order to avoid 
> overwhelming clusters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4049) Improve write throughput of CassandraIO

2018-04-11 Thread Alexander Dejanovski (JIRA)
Alexander Dejanovski created BEAM-4049:
--

 Summary: Improve write throughput of CassandraIO
 Key: BEAM-4049
 URL: https://issues.apache.org/jira/browse/BEAM-4049
 Project: Beam
  Issue Type: Improvement
  Components: io-java-cassandra
Affects Versions: 2.4.0
Reporter: Alexander Dejanovski
Assignee: Jean-Baptiste Onofré


The CassandraIO currently uses the mapper to perform writes in a synchronous 
fashion. 

This implies that writes are serialized and is a very suboptimal way of writing 
to Cassandra.

The IO should use the saveAsync() method instead of save() and should wait for 
completion each time 100 queries are in flight, in order to avoid overwhelming 
clusters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4049) Improve write throughput of CassandraIO

2018-04-12 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435072#comment-16435072
 ] 

Alexander Dejanovski commented on BEAM-4049:


PR sent : [https://github.com/apache/beam/pull/5112]

I had to exclude Guava's ListenableFuture from the relocation to avoid 
exceptions at runtime since the DS java driver uses it in saveAsync().

> Improve write throughput of CassandraIO
> ---
>
> Key: BEAM-4049
> URL: https://issues.apache.org/jira/browse/BEAM-4049
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-cassandra
>Affects Versions: 2.4.0
>Reporter: Alexander Dejanovski
>Assignee: Jean-Baptiste Onofré
>Priority: Major
>  Labels: performance
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The CassandraIO currently uses the mapper to perform writes in a synchronous 
> fashion. 
> This implies that writes are serialized and is a very suboptimal way of 
> writing to Cassandra.
> The IO should use the saveAsync() method instead of save() and should wait 
> for completion each time 100 queries are in flight, in order to avoid 
> overwhelming clusters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437089#comment-16437089
 ] 

Alexander Dejanovski commented on BEAM-3485:


I've created a PR to fix the split generation : 
[https://github.com/apache/beam/pull/5124]

There are other issues with how connection are established, and more precisely 
how many since there should be a single Cluster object generated per physical 
cluster and JVM, while currently we're creating Cluster objects each time one 
is needed.

I'll create follow up tickets to handle this and expand the capabilities of 
both the reader (ability to add a custom where clause) and the writer (allow to 
use PreparedStatements instead of relying on the mapper).

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437309#comment-16437309
 ] 

Alexander Dejanovski commented on BEAM-3485:


Hi [~aromanenko],

yes both BEAM-3424 and BEAM-3425 will be fixed by this PR.

For BEAM-3424, we will have one split per token range at least, which in most 
"modern" installs will be 256 splits per node in the cluster. For clusters that 
do not use vnodes, we'll have one split per node in the cluster in case the 
size can't be estimated.

We could refine this and add : 
 * a heuristical default of... 10 to 20 splits per node at least (for what it's 
worth)
 * and a way of enforcing a number of splits in the reader .withSplits(...)

What do you think ?

For BEAM-3425, it's fixed here as the size_estimates table was storing 
start_token and end_token as strings, not longs, and then those strings needed 
to be converted to BigInteger.

 

 

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437322#comment-16437322
 ] 

Alexander Dejanovski commented on BEAM-3485:


Apparently I have test failures triggered by findbugs here : 
[https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4104/]

Any pointer on how to run the same tests locally ? Or how to check the findbugs 
report from Jenkins ?

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437344#comment-16437344
 ] 

Alexander Dejanovski commented on BEAM-3485:


Neat !

It's fixed.

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437588#comment-16437588
 ] 

Alexander Dejanovski commented on BEAM-3485:


# So, out of experience I know that most clusters out there are running with 16 
to 256 vnodes per node, times the number of nodes we're going to generate a lot 
of splits. Still, it would be good to be able to enforce a minimum number of 
splits if needed, so I'd be in favor of adding it as optional input. If the 
computed number of splits is lower (or if Beam fails to compute it) then we 
should fallback to the user input.
Tell me if you agree and I'll add it.
 # It is for Murmur3 but it could be good to support the RandomPartitioner 
which uses tokens between 0 and 2^127-1, which should be out of the Long span. 

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-14 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438364#comment-16438364
 ] 

Alexander Dejanovski commented on BEAM-3485:


I've just pushed a new version that allows the user to specify the minimum 
number of desired splits.

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)