Re: Audit logging to tables.

2019-04-03 Thread Jon Haddad
The virtual table could read the data out of the audit log, just like
it could read a hosts file or list the output of the ps command.
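
As a rough sketch of that idea against the 4.0 virtual table API: a node-local
table whose data set is built on demand from the audit log. AuditLogReader and
AuditLogEntry below are hypothetical stand-ins for whatever the audit log
implementation actually exposes, and the schema is illustrative only:

    import java.util.Date;

    import org.apache.cassandra.db.marshal.TimestampType;
    import org.apache.cassandra.db.marshal.UTF8Type;
    import org.apache.cassandra.db.virtual.AbstractVirtualTable;
    import org.apache.cassandra.db.virtual.SimpleDataSet;
    import org.apache.cassandra.dht.LocalPartitioner;
    import org.apache.cassandra.schema.TableMetadata;

    public class AuditLogTable extends AbstractVirtualTable
    {
        AuditLogTable(String keyspace)
        {
            super(TableMetadata.builder(keyspace, "audit_log")
                               .comment("recent audit events on this node")
                               .kind(TableMetadata.Kind.VIRTUAL)
                               .partitioner(new LocalPartitioner(TimestampType.instance))
                               .addPartitionKeyColumn("event_time", TimestampType.instance)
                               .addRegularColumn("user", UTF8Type.instance)
                               .addRegularColumn("operation", UTF8Type.instance)
                               .build());
        }

        @Override
        public DataSet data()
        {
            // Built on demand for each query and thrown away afterwards.
            SimpleDataSet result = new SimpleDataSet(metadata());
            for (AuditLogEntry e : AuditLogReader.recentEntries()) // hypothetical
                result.row(new Date(e.timestampMillis()))
                      .column("user", e.user())
                      .column("operation", e.operation());
            return result;
        }
    }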


Fwd: Audit logging to tables.

2019-04-03 Thread Sagar
Thanks Alex.

I was going through the implementation of Virtual tables so far, and the data
we get when we query them is either point-in-time, like CachesTable, or fairly
static, like Settings. Audit log data doesn't fall into either of those two
categories: it is a stream of events happening on that node, and almost all of
those events need to be captured. The AbstractDataSet class used by the
Virtual tables suggests that a data set can either be built on demand and
thrown away after use (which is what happens currently) or be persisted. IMO,
if we want audit logs as Virtual tables, we will have to go the route of
persisting the generated events.

Sagar.
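
One rough sketch of what "persisting the events" could mean in practice,
purely illustrative (no such class exists in Cassandra): a bounded in-memory
buffer that the audit path appends to and that a virtual table's data() can
snapshot per query:

    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;

    public final class AuditEventBuffer
    {
        private static final int CAPACITY = 10_000;
        private static final Deque<String> EVENTS = new ConcurrentLinkedDeque<>();

        // Called from the audit logging path for every captured event.
        public static void append(String event)
        {
            EVENTS.addLast(event);
            while (EVENTS.size() > CAPACITY) // bounded: drop the oldest first
                EVENTS.pollFirst();
        }

        // Point-in-time copy served to a single virtual table query.
        public static List<String> snapshot()
        {
            return new ArrayList<>(EVENTS);
        }
    }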

On Sun, Mar 31, 2019 at 11:35 PM Alex Ott  wrote:

> Hi Sagar
>
> 3.x/4.x are the versions of the open source variant of the drivers, while
> the DSE versions are 1.x/2.x.
>
> A description of this function is at
> https://docs.datastax.com/en/drivers/java/3.6/
>
> Sagar  at "Tue, 26 Mar 2019 22:12:56 +0530" wrote:
>  S> Thanks Andy,
>
>  S> This enhancement is in the DataStax version and not in the Apache
>  S> Cassandra driver?
>
>  S> Thanks!
>  S> Sagar.
>
>  S> On Tue, Mar 26, 2019 at 3:23 AM Andy Tolbert <andrew.tolb...@datastax.com>
>  S> wrote:
>
>  >> Hello
>  >>
>  >> > 1) yes, it's local only. The driver by default does connect to each
>  >> > host though, so it's pretty trivial to have a load balancing policy
>  >> > that you can direct to specific hosts (this should probably be in
>  >> > the driver so people don't have to keep reimplementing it).
>  >> >
>  >>
>  >> The capability to target a specific host was added to the java driver
>  >> (and others) recently, in anticipation of Virtual Tables, in version
>  >> 3.6.0+ via Statement.setHost [1].  This bypasses the load balancing
>  >> policy completely and sends the request directly to that Host
>  >> (assuming it's connected).
>  >>
>  >> The drivers also parse virtual table metadata.
>  >>
>  >> [1]:
>  >>
>  >>
> https://docs.datastax.com/en/drivers/java/3.6/com/datastax/driver/core/Statement.html#setHost-com.datastax.driver.core.Host-
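
A minimal illustration of the setHost call described above, assuming driver
3.6+; the address and the queried table are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
    Session session = cluster.connect();

    // Look up the Host object for the node we want to query.
    Host target = null;
    for (Host h : cluster.getMetadata().getAllHosts())
        if (h.getAddress().getHostAddress().equals("10.0.0.1"))
            target = h;

    // setHost bypasses the load balancing policy entirely.
    Statement stmt = new SimpleStatement("SELECT * FROM system_views.settings")
                         .setHost(target);
    session.execute(stmt);
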
>  >>
>  >> Thanks!
>  >> Andy
>  >>
>  >> On Mon, Mar 25, 2019 at 11:29 AM Sagar  wrote:
>  >>
>  >> > Thanks Chris. I got caught up with a few things and couldn't reply
>  >> > back. So, I looked at this again, and I think virtual tables can be
>  >> > used for audit logging, considering that they don't have any
>  >> > replication - so we won't be clogging the network with replication
>  >> > IO.
>  >> >
>  >> > In terms of storage, from what I understood, virtual tables don't
>  >> > have any associated SSTables. So, is data stored only in Memtables?
>  >> > Can you please shed some light on storage and retention, given that?
>  >> >
>  >> > Lastly, on the driver changes: I agree, we should make the driver
>  >> > able to contact specific hosts with the correct LBP. If we do go
>  >> > this route, I can start taking a look at it.
>  >> >
>  >> > Thanks!
>  >> > Sagar.
>  >> >
>  >> > On Wed, Mar 6, 2019 at 10:42 PM Chris Lohfink  wrote:
>  >> >
>  >> > > 1) yes, it's local only. The driver by default does connect to
>  >> > > each host though, so it's pretty trivial to have a load balancing
>  >> > > policy that you can direct to specific hosts (this should probably
>  >> > > be in the driver so people don't have to keep reimplementing it).
>  >> > >
>  >> > > 2) yes, the easiest way is to set up a whitelist load balancing
>  >> > > policy like cqlsh does, but as above, the best option is a custom
>  >> > > LBP + StatementWrapper that holds the host target and can direct
>  >> > > individual queries to specific hosts.
>  >> > >
>  >> > > 3) yes, cqlsh makes a connection to the local C* instance with a
>  >> > > whitelist policy, so it only queries that one node.
>  >> > >
>  >> > > Chris
>  >> > >
>  >> > > On Wed, Mar 6, 2019 at 9:43 AM Sagar  wrote:
>  >> > >
>  >> > > > So, I went through the ticket for the creation of Virtual
>  >> > > > Tables (must say it was quite a long ticket, spanning four
>  >> > > > years).
>  >> > > >
>  >> > > > I see that there are a few tables created in the db.virtual
>  >> > > > package. These appear to be metrics-related tables.
>  >> > > >
>  >> > > > A couple of questions here:
>  >> > > >
>  >> > > > 1) Do all the tables contain only local data? What I mean is:
>  >> > > > in a cluster, each node will have its own ThreadPoolsTable
>  >> > > > pertaining to the thread pools on that node? Is that assumption
>  >> > > > correct?
>  >> > > > 2) In terms of querying, again, can we query only locally? I saw
>  >> > > > a lot of discussion on the ticket about "where node = 1.2.3.4".
>  >> > > > I guess that isn't supported? So, for any user to query the
>  >> > > > metrics of a given node, they will have to log in and query on
>  >> > > > that node.
>
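
For completeness, the whitelist approach Chris mentions in point 2 above looks
roughly like this with the 3.x driver (the address is a placeholder). Only the
listed node is ever queried, which is how cqlsh pins itself to one node:

    import java.net.InetSocketAddress;
    import java.util.Collections;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.WhiteListPolicy;

    // Restrict the driver to a single node, as cqlsh does.
    InetSocketAddress node = new InetSocketAddress("10.0.0.1", 9042);
    Cluster cluster = Cluster.builder()
                             .addContactPointsWithPorts(Collections.singletonList(node))
                             .withLoadBalancingPolicy(
                                 new WhiteListPolicy(DCAwareRoundRobinPolicy.builder().build(),
                                                     Collections.singletonList(node)))
                             .build();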

Re: Infinite loop in org.apache.cassandra.hadoop.cql3.CqlBulkRecordWriter

2019-04-03 Thread Brett Marcott
Thanks for the recommendation, Russell. I haven't looked into that code yet,
but the docs didn't seem to indicate whether it writes SSTables directly
instead of going through the normal write path.


Projects Can Apply Individually for Google Season of Docs

2019-04-03 Thread sharan

Hi All

Initially, the ASF as an organisation was planning to apply as a
mentoring organisation for Google Season of Docs on behalf of all Apache
projects, but if accepted, the maximum number of technical writers we
could be allocated is two. Two technical writers would probably not be
enough to cover the potential demand from all of our projects interested
in participating.


We've received feedback from Google that individual projects can apply. 
I will withdraw the ASF application so that any Apache project 
interested can apply individually for Season of Docs and so have the 
potential of being allocated a technical writer.


Applications for Season of Docs are open now and close on 23rd April
2019. If your project would like to apply, please see the following
link:


https://developers.google.com/season-of-docs/docs/get-started/

Good luck everyone!

Thanks
Sharan




Re: Infinite loop in org.apache.cassandra.hadoop.cql3.CqlBulkRecordWriter

2019-04-03 Thread Russell Spitzer
I would recommend using the Spark Cassandra Connector instead of the Hadoop
based writers. The Hadoop code has not had a lot of love in a long time. See

https://github.com/datastax/spark-cassandra-connector
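
A minimal sketch of writing through the connector's Java API (keyspace, table,
and the MyRow bean are placeholders). Note that saveToCassandra issues CQL
writes through the normal write path rather than streaming SSTables:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaRDD;

    public class ConnectorWriteExample
    {
        // Hypothetical bean matching the target table's columns.
        public static class MyRow implements Serializable
        {
            private String key;
            private int value;
            public String getKey() { return key; }
            public void setKey(String k) { key = k; }
            public int getValue() { return value; }
            public void setValue(int v) { value = v; }
        }

        // Saves the RDD via CQL inserts; no sstable streaming involved.
        static void save(JavaRDD<MyRow> rows)
        {
            javaFunctions(rows)
                .writerBuilder("my_keyspace", "my_table", mapToRow(MyRow.class))
                .saveToCassandra();
        }
    }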



Infinite loop in org.apache.cassandra.hadoop.cql3.CqlBulkRecordWriter

2019-04-03 Thread Brett Marcott
Hi folks,

I am noticing my spark jobs getting stuck when using
org.apache.cassandra.hadoop.cql3.CqlBulkRecordWriter/CqlBulkOutputFormat.


Based on the code, it seems that whenever there is a stream failure, the
expected behavior is to loop forever.

Here are one executor's logs:
19/04/03 15:35:06 INFO streaming.StreamResultFuture: [Stream 
#59290530-5625-11e9-a2bb-8bc7b49d56b0] Session with /10.82.204.173 is complete
19/04/03 15:35:06 WARN streaming.StreamResultFuture: [Stream 
#59290530-5625-11e9-a2bb-8bc7b49d56b0] Stream failed


On stream failure, it seems StreamResultFuture sets the exception on the
AbstractFuture.
AFAIK this should cause the AbstractFuture's get() to throw a new
ExecutionException.

The problem seems to lie in the fact that CqlBulkRecordWriter swallows the
ExecutionException and continues in a while loop:
https://github.com/apache/cassandra/blob/207c80c1fd63dfbd8ca7e615ec8002ee8983c5d6/src/java/org/apache/cassandra/hadoop/cql3/CqlBulkRecordWriter.java#L256-L274
 


When taking consecutive thread dumps on the same process, I see that the only
thread doing work is constantly creating new ExecutionExceptions (the memory
location of the ExecutionException was different in each thread dump):
java.lang.Throwable.fillInStackTrace(Native Method)
java.lang.Throwable.fillInStackTrace(Throwable.java:783) => holding
Monitor(java.util.concurrent.ExecutionException@80240763)
java.lang.Throwable.<init>(Throwable.java:310)
java.lang.Exception.<init>(Exception.java:102)
java.util.concurrent.ExecutionException.<init>(ExecutionException.java:90)
com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:476)
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:357)
org.apache.cassandra.hadoop.cql3.CqlBulkRecordWriter.close(CqlBulkRecordWriter.java:257)
org.apache.cassandra.hadoop.cql3.CqlBulkRecordWriter.close(CqlBulkRecordWriter.java:237)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1131)
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1359)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
org.apache.spark.scheduler.Task.run(Task.scala:99)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:285)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

It seems the logic right below the while loop in the code linked above, which
checks for failed hosts/stream sessions, should perhaps have been inside the
while loop?
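
A simplified sketch of that suggestion (condensed, not the actual method
body): treat ExecutionException as terminal, so a failed stream fails the
task instead of spinning:

    import java.io.IOException;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    private void waitForStreams(Future<?> future) throws IOException
    {
        while (true)
        {
            try
            {
                future.get(1000, TimeUnit.MILLISECONDS);
                break; // streaming finished successfully
            }
            catch (TimeoutException te)
            {
                // still streaming: report progress and poll again
            }
            catch (ExecutionException | InterruptedException e)
            {
                // A failed future re-throws the same ExecutionException on
                // every get(), so retrying can never succeed; surface it.
                throw new IOException(e);
            }
        }
    }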

Thanks,

Brett