Wide rows splitting

2017-09-17 Thread Adam Smith
Dear community,

I have a table with inlinks to URLs, i.e. many URLs point to
http://google.com, while far fewer point to http://somesmallweb.page.

It has both very wide and very skinny rows - the distribution follows a
power law. I do not know a priori how many columns a row has, and I can't
identify a schema that would give a good partitioning.

Currently, I am thinking about introducing splits: the primary key becomes
(URL, splitnumber), where the split count is initially 1 and a hash mod the
split count determines the splitnumber on insert. I would need a separate
table to maintain the split count, and a spark-cassandra-connector job
would count the columns and increase/double the number of splits on demand.
This means I would then have to move rows, e.g. (URL1, 0) -> (URL1, 1),
when the split count becomes 2.
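A minimal sketch of this scheme in Python (all names here - split_counts,
route_insert - are hypothetical, and the real split counts would live in the
separate Cassandra table rather than a dict). One detail assumed in the
sketch: the hash is taken over the inserted inlink rather than the URL
itself, since hash(URL) mod the split count is constant per URL and would
put every row of a URL into the same split:

```python
import hashlib

# In the real design this mapping lives in a separate Cassandra table;
# a plain dict stands in for it here.
split_counts = {}  # url -> current number of splits (defaults to 1)

def bucket(value: str, num_splits: int) -> int:
    """Stable hash of an inserted inlink, reduced to a split index."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_splits

def route_insert(url: str, inlink: str) -> tuple:
    """Compute the partition key (url, splitnumber) for one insert."""
    num_splits = split_counts.get(url, 1)
    return (url, bucket(inlink, num_splits))

def double_splits(url: str) -> None:
    """What the maintenance job would do when a partition grows too wide;
    existing rows then need re-bucketing, e.g. (URL1, 0) -> (URL1, 1)."""
    split_counts[url] = split_counts.get(url, 1) * 2
```

Reads then have to fan out over all current splits, e.g. SELECT ... WHERE
url = ? AND splitnumber IN (0, ..., n-1), which is the price of this
pattern.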

Would you do the same? Is there a better way?

Thanks!
Adam


Re: Maturity and Stability of Enabling CDC

2017-09-17 Thread Jeff Jirsa
Haven't tried out CDC, but the answer based on the design doc is yes - you have 
to manually dedup CDC at the consumer level
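As a purely illustrative sketch of the consumer-side dedup described here
(the class name and event-key shape are assumptions, not anything from
Cassandra's CDC format): with CDC enabled on all replicas, the same mutation
can reach the downstream once per replica, so the consumer can drop repeats
by tracking recently seen event keys in a bounded window:

```python
from collections import OrderedDict

class Deduper:
    """Drop repeated CDC events by remembering recently seen keys.

    Bounded LRU-style window: with RF=3, the same mutation may arrive
    up to three times (once per replica), normally close together in
    time, so a sliding window of recent keys is usually sufficient.
    """

    def __init__(self, max_entries: int = 100_000):
        self.seen = OrderedDict()
        self.max_entries = max_entries

    def is_new(self, event_key) -> bool:
        """Return True the first time a key is seen, False on repeats."""
        if event_key in self.seen:
            return False
        self.seen[event_key] = True
        if len(self.seen) > self.max_entries:
            self.seen.popitem(last=False)  # evict the oldest entry
        return True
```

A Kafka producer wrapper would call is_new((partition_key, write_timestamp))
before publishing; what uniquely identifies a mutation depends on the
table's schema.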




-- 
Jeff Jirsa


> On Sep 17, 2017, at 6:21 PM, Michael Fong  wrote:
> 
> Thanks for your reply. 
> 
> 
> If anyone has tried out this new feature, perhaps they could answer this 
> question: would multiple copies of CDC events be sent downstream (e.g., to 
> Kafka) when all nodes have CDC enabled?
> 
> Regards,
> 
>> On Mon, Sep 18, 2017 at 6:59 AM, kurt greaves  wrote:
>> I don't believe it's used by many, if any. It certainly hasn't had enough 
>> attention to be deemed production-ready, nor has it been out long enough 
>> for many people to be on a version where CDC is available. FWIW, I've never 
>> even seen any inquiries about using it.
>> 
>> On 17 Sep. 2017 13:18, "Michael Fong"  wrote:
>> anyone?
>> 
>> On Wed, Sep 13, 2017 at 5:10 PM, Michael Fong  wrote:
>>> Hi, all,
>>> 
>>> We've noticed there is a new feature for streaming changed data to 
>>> another streaming service. Doc: 
>>> http://cassandra.apache.org/doc/latest/operating/cdc.html
>>> 
>>> We are evaluating the stability (and maturity) of this feature, and 
>>> possibly integrating it with Kafka (via its connector). Has anyone 
>>> adopted this in production for a real application?
>>> 
>>> Regards,
>>> 
>>> Michael
>> 
>> 
> 


Re: Maturity and Stability of Enabling CDC

2017-09-17 Thread Michael Fong
Thanks for your reply.


If anyone has tried out this new feature, perhaps they could answer this
question: would multiple copies of CDC events be sent downstream (e.g., to
Kafka) when all nodes have CDC enabled?

Regards,



Stack overflow error with UDF using IBM JVM

2017-09-17 Thread Sumant Padbidri
I'm using Cassandra 3.11 right out of the box (i.e. all default parameters) 
with the IBM JRE. Using any UDF results in a stack overflow error. They work 
fine with the Oracle JVM. I've tried increasing some stack sizes (-Xss and 
-Xmso), but that does not help. Is there some configuration I'm missing?

CREATE TABLE test (
    id int,
    val1 int,
    val2 int,
    PRIMARY KEY(id)
);

INSERT INTO test(id, val1, val2) VALUES(1, 100, 200);
INSERT INTO test(id, val1, val2) VALUES(2, 100, 300);
INSERT INTO test(id, val1, val2) VALUES(3, 200, 150);

CREATE OR REPLACE FUNCTION maxOf(current int, testvalue int) 
CALLED ON NULL INPUT 
RETURNS int
LANGUAGE java 
AS $$return Math.max(current,testvalue);$$;

SELECT id, val1, val2, maxOf(val1,val2) FROM test WHERE id = 1;

Here's the stack trace from debug.log:
java.lang.RuntimeException: java.lang.StackOverflowError
    at org.apache.cassandra.cql3.functions.UDFunction.async(UDFunction.java:453) ~[apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.functions.UDFunction.executeAsync(UDFunction.java:398) ~[apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.functions.UDFunction.execute(UDFunction.java:298) ~[apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.selection.ScalarFunctionSelector.getOutput(ScalarFunctionSelector.java:61) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.selection.Selection$SelectionWithProcessing$1.getOutputRow(Selection.java:592) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.selection.Selection$ResultSetBuilder.getOutputRow(Selection.java:430) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.selection.Selection$ResultSetBuilder.build(Selection.java:417) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.statements.SelectStatement.process(SelectStatement.java:763) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.statements.SelectStatement.processResults(SelectStatement.java:400) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:378) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:251) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:79) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:217) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:248) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:233) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:116) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:517) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:410) [apache-cassandra-3.11.0.jar:3.11.0]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:348) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522) [na:1.8.0]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) [apache-cassandra-3.11.0.jar:3.11.0]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.11.0.jar:3.11.0]
    at java.lang.Thread.run(Thread.java:795) [na:2.9 (09-01-2017)]
Caused by: java.lang.StackOverflowError: null
    at java.lang.String.substring(String.java:2637) ~[na:2.9 (09-01-2017)]
    at java.lang.Class.getNonArrayClassPackageName(Class.java:1531) ~[na:2.9 (09-01-2017)]
    at java.lang.Class.getPackageName(Class.java:1546) ~[na:2.9 (09-01-2017)]
    at java.lang.J9VMInternals$2.run(J9VMInternals.java:252) ~[na:2.9 (09-01-2017)]
    at java.security.AccessController.doPrivileged(AccessController.java:647) ~[na:1.8.0]
    at java.lang.J9VMInternals.checkPackageAccess(J9VMInternals.java:250) ~[na:2.9 (09-01-2017)]
    at org.apache.cassandra.cql3.functions.ThreadAwareSecurityManager.isSecuredThread(ThreadAwareSecurityManager.java:210)

Re: Maturity and Stability of Enabling CDC

2017-09-17 Thread kurt greaves
I don't believe it's used by many, if any. It certainly hasn't had enough
attention to be deemed production-ready, nor has it been out long enough
for many people to be on a version where CDC is available. FWIW, I've never
even seen any inquiries about using it.
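For anyone wanting to experiment: the docs linked in this thread describe CDC as opt-in at two levels, a node-level switch in cassandra.yaml and a per-table schema property. A sketch, with a hypothetical keyspace/table and an illustrative directory path:

```sql
-- Per table (CQL); the keyspace and table names are hypothetical.
ALTER TABLE my_ks.my_table WITH cdc = true;

-- Per node, in cassandra.yaml (not CQL):
--   cdc_enabled: true
--   cdc_raw_directory: /var/lib/cassandra/cdc_raw
```

Flushed commit log segments containing CDC-enabled mutations then land in cdc_raw for a consumer to read and delete.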



Question about counters read before write behavior

2017-09-17 Thread Paul Pollack
Hi,

We're trying to confirm whether, on a counter write, the entire partition
is read from disk, versus just the row and column of the partition being
incremented. We've traced the code to this line.
It looks like the code only uses a filter on the partition for reading if
the read does not involve collections or counters. Can anyone familiar with
the source code confirm whether this is true, and whether we're looking at
the right lines of code that show what data is read from disk (or from an
internal cache)?

Thanks,
Paul