Re: How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Justin Cameron
You can run multiple applications in parallel in Standalone mode - you just
need to configure Spark to allocate resources among your applications the way
you want (by default it assigns all available resources to the first
application you submit, so they won't be freed up until it has finished).

You can use Spark's web UI to check the resources that are available and
those allocated to each application. See
http://spark.apache.org/docs/latest/job-scheduling.html for more details.
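For example, a per-application cap on cores and executor memory (a minimal
sketch in Scala; spark.cores.max and spark.executor.memory are standard Spark
settings, but the numbers are placeholders to size for your own cluster) could
look like:

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap what this application may claim from the standalone cluster,
    // so that other applications can run alongside it.
    val conf = new SparkConf()
      .setAppName("individuals-export")
      .set("spark.cores.max", "8")        // total cores this app may use cluster-wide
      .set("spark.executor.memory", "4g") // memory per executor
    val sc = new SparkContext(conf)

The same settings can also be passed with spark-submit (e.g.
--conf spark.cores.max=8) or in spark-defaults.conf.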

On Thu, 27 Apr 2017 at 15:12 Tobias Eriksson 
wrote:

> Well, I have done some work with Spark, and the biggest hurdle is that
> Spark does not allow me to run multiple jobs in parallel
>
> i.e. once I start the job that takes the table of “Individuals”
> I have to wait until all that processing is done before I can start an
> additional one
>
> so I will need to start, on demand, various additional jobs that fetch
> “Addresses”, “Invoices”, … and so on
>
> I know I could increase the number of Workers/Executors and use Mesos for
> scheduling and resource management, but we have so far not been
> able to get that dynamic/flexible enough
>
> Although I admit this could still be a way forward, we have not
> evaluated it 100% yet, so I have not completely given up on that thought
>
>
>
> -Tobias
>
>
>
>
>
> *From: *Justin Cameron 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Thursday, 27 April 2017 at 01:36
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: How can I efficiently export the content of my table to
> KAFKA
>
>
>
> You could probably save yourself a lot of hassle by just writing a Spark
> job that scans through the entire table, converts each row to JSON and
> dumps the output into a Kafka topic. It should be fairly straightforward to
> implement.
>
>
>
> Spark will manage the partitioning of "Producer" processes for you - no
> need for a "Coordinator" topic.
>
>
>
> On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson 
> wrote:
>
> Hi
>
> I would like to make a dump of the database, in JSON format, to KAFKA
>
> The database contains lots of data, millions and in some cases billions of
> “rows”
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
>
>
>
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
>
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
>
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
>
> One by one until all the “rows” have been processed.
>
> So the JSON-Producer will take e.g. a range of 1000 “rows”, convert
> them into my own JSON format and post them to KAFKA
>
> After that it will take another 1000 “rows”, and then another
> 1000 “rows”, and so on, until it is done.
>
>
>
> I base my idea on how I believe the Apache Spark Connector accomplishes data
> locality, i.e. by being aware of where tokens reside. Since that is possible,
> I figured it should be possible to create a job list in a KAFKA
> topic, have each Producer pick jobs from there, read the data from
> Cassandra based on the partition key (token) and then post the JSON to the
> export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
>
>
>
>
> Would you consider this a good idea?
>
> Would there in fact be a better approach? If so, what would that be?
>
>
>
> -Tobias
>
>
>
> --
>
> *Justin Cameron*
> Senior Software Engineer
>
>
>
> 
>
>
>
-- 


*Justin Cameron*
Senior Software Engineer





This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
and Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.


Re: How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Tobias Eriksson
Well, I have done some work with Spark, and the biggest hurdle is that Spark
does not allow me to run multiple jobs in parallel,
i.e. once I start the job that takes the table of “Individuals” I
have to wait until all that processing is done before I can start an
additional one.
so I will need to start, on demand, various additional jobs that fetch
“Addresses”, “Invoices”, … and so on.
I know I could increase the number of Workers/Executors and use Mesos for
scheduling and resource management, but we have so far not been able to get
that dynamic/flexible enough.
Although I admit this could still be a way forward, we have not evaluated
it 100% yet, so I have not completely given up on that thought.

-Tobias


From: Justin Cameron 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, 27 April 2017 at 01:36
To: "user@cassandra.apache.org" 
Subject: Re: How can I efficiently export the content of my table to KAFKA

You could probably save yourself a lot of hassle by just writing a Spark job 
that scans through the entire table, converts each row to JSON and dumps the 
output into a Kafka topic. It should be fairly straightforward to implement.

Spark will manage the partitioning of "Producer" processes for you - no need 
for a "Coordinator" topic.

On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson 
> wrote:
Hi
I would like to make a dump of the database, in JSON format, to KAFKA
The database contains lots of data, millions and in some cases billions of 
“rows”
I will provide the customer with an export of the data, where they can read it 
off of a KAFKA topic

My thinking was to have it scalable such that I will distribute the token range 
of all available partition-keys to a number of (N) processes (JSON-Producers)
First I will have a process which will read through the available tokens and 
then publish them on a KAFKA “Coordinator” Topic
And then I can create 1, 10, 20 or N processes that will act as Producers to 
the real KAFKA topic, and pick available tokens/partition-keys off of the 
“Coordinator” Topic
One by one until all the “rows” have been processed.
So the JSON-Producer will take e.g. a range of 1000 “rows”, convert them
into my own JSON format and post them to KAFKA.
After that it will take another 1000 “rows”, and then another 1000
“rows”, and so on, until it is done.

I base my idea on how I believe the Apache Spark Connector accomplishes data
locality, i.e. by being aware of where tokens reside. Since that is possible, I
figured it should be possible to create a job list in a KAFKA topic, have
each Producer pick jobs from there, read the data from Cassandra based
on the partition key (token) and then post the JSON to the export KAFKA topic.
https://dzone.com/articles/data-locality-w-cassandra-how


Would you consider this a good idea?
Would there in fact be a better approach? If so, what would that be?

-Tobias

--
Justin Cameron
Senior Software Engineer





Re: How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Justin Cameron
You could probably save yourself a lot of hassle by just writing a Spark
job that scans through the entire table, converts each row to JSON and
dumps the output into a Kafka topic. It should be fairly straightforward to
implement.

Spark will manage the partitioning of "Producer" processes for you - no
need for a "Coordinator" topic.
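For illustration, a rough (untested) sketch of such a job in Scala, assuming
Spark 2.x with the DataStax spark-cassandra-connector and the plain Kafka
producer API; the keyspace, table, topic and broker addresses below are
placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.SparkSession

    object TableToKafkaExport {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("table-to-kafka-export")
          .config("spark.cassandra.connection.host", "10.0.0.1") // placeholder
          .getOrCreate()

        // Full-table scan: the connector splits it by token range, so each
        // Spark partition covers a contiguous slice of the table.
        val df = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "my_keyspace", "table" -> "individuals")) // placeholders
          .load()

        // toJSON yields one JSON string per row; each partition writes its
        // rows with its own Kafka producer.
        val writePartitionToKafka: Iterator[String] => Unit = { rows =>
          val props = new Properties()
          props.put("bootstrap.servers", "kafka1:9092") // placeholder
          props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          rows.foreach { json =>
            producer.send(new ProducerRecord[String, String]("individuals-export", json))
          }
          producer.flush()
          producer.close()
        }
        df.toJSON.foreachPartition(writePartitionToKafka)

        spark.stop()
      }
    }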

On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson 
wrote:

> Hi
>
> I would like to make a dump of the database, in JSON format, to KAFKA
>
> The database contains lots of data, millions and in some cases billions of
> “rows”
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
>
>
>
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
>
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
>
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
>
> One by one until all the “rows” have been processed.
>
> So the JSON-Producer will take e.g. a range of 1000 “rows”, convert
> them into my own JSON format and post them to KAFKA
>
> After that it will take another 1000 “rows”, and then another
> 1000 “rows”, and so on, until it is done.
>
>
>
> I base my idea on how I believe the Apache Spark Connector accomplishes data
> locality, i.e. by being aware of where tokens reside. Since that is possible,
> I figured it should be possible to create a job list in a KAFKA
> topic, have each Producer pick jobs from there, read the data from
> Cassandra based on the partition key (token) and then post the JSON to the
> export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
>
>
>
>
> Would you consider this a good idea?
>
> Would there in fact be a better approach? If so, what would that be?
>
>
>
> -Tobias
>
>
>
-- 


*Justin Cameron*
Senior Software Engineer







How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Tobias Eriksson
Hi
I would like to make a dump of the database, in JSON format, to KAFKA
The database contains lots of data, millions and in some cases billions of 
“rows”
I will provide the customer with an export of the data, where they can read it 
off of a KAFKA topic

My thinking was to have it scalable such that I will distribute the token range 
of all available partition-keys to a number of (N) processes (JSON-Producers)
First I will have a process which will read through the available tokens and 
then publish them on a KAFKA “Coordinator” Topic
And then I can create 1, 10, 20 or N processes that will act as Producers to 
the real KAFKA topic, and pick available tokens/partition-keys off of the 
“Coordinator” Topic
One by one until all the “rows” have been processed.
So the JSON-Producer will take e.g. a range of 1000 “rows”, convert them
into my own JSON format and post them to KAFKA.
After that it will take another 1000 “rows”, and then another 1000
“rows”, and so on, until it is done.
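As an illustration of the “Coordinator” idea, here is a rough (untested)
sketch in Scala of the coordinator side only, assuming the DataStax Java
driver 3.x and the plain Kafka producer API; the contact point, broker and
topic names are placeholders. It enumerates the cluster's token ranges and
publishes one “job” per range; the JSON-Producers would then consume these
jobs and page through the matching rows with a token-bounded query
(WHERE token(pk) > ? AND token(pk) <= ?):

    import java.util.Properties
    import com.datastax.driver.core.Cluster
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import scala.collection.JavaConverters._

    object TokenRangeCoordinator {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("10.0.0.1").build() // placeholder
        val props = new Properties()
        props.put("bootstrap.servers", "kafka1:9092") // placeholder
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        try {
          // Publish one job message per contiguous token range, encoded as "start,end".
          for (range <- cluster.getMetadata.getTokenRanges.asScala;
               piece <- range.unwrap().asScala) {
            val job = s"${piece.getStart},${piece.getEnd}"
            producer.send(new ProducerRecord[String, String]("export-coordinator", job))
          }
          producer.flush()
        } finally {
          producer.close()
          cluster.close()
        }
      }
    }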

I base my idea on how I believe the Apache Spark Connector accomplishes data
locality, i.e. by being aware of where tokens reside. Since that is possible, I
figured it should be possible to create a job list in a KAFKA topic, have
each Producer pick jobs from there, read the data from Cassandra based
on the partition key (token) and then post the JSON to the export KAFKA topic.
https://dzone.com/articles/data-locality-w-cassandra-how


Would you consider this a good idea?
Would there in fact be a better approach? If so, what would that be?

-Tobias



Last chance: ApacheCon is just three weeks away

2017-04-26 Thread Rich Bowen
ApacheCon is just three weeks away, in Miami, Florida, May 15th - 18th.
http://apachecon.com/

There's still time to register and attend. ApacheCon is the best place
to find out about tomorrow's software, today.

ApacheCon is the official convention of The Apache Software Foundation,
and includes the co-located events:
  * Apache: Big Data
  * Apache: IoT
  * TomcatCon
  * FlexJS Summit
  * Cloudstack Collaboration Conference
  * BarCampApache
  * ApacheCon Lightning Talks

And there are dozens of opportunities to meet your fellow Apache
enthusiasts, both from your project, and from the other 200+ projects at
the Apache Software Foundation.

Register here:
http://events.linuxfoundation.org/events/apachecon-north-america/attend/register-

More information here: http://apachecon.com/

Follow us and learn more about ApacheCon:
  * Twitter: @ApacheCon
  * Discussion mailing list:
https://lists.apache.org/list.html?apachecon-disc...@apache.org
  * Podcasts and speaker interviews: http://feathercast.apache.org/
  * IRC: #apachecon on Freenode (https://freenode.net/)

We look forward to seeing you in Miami!

-- 
Rich Bowen - VP Conferences, The Apache Software Foundation
http://apachecon.com/
@apachecon





Re: cassandra OOM

2017-04-26 Thread Jean Carlo
Hello @Durity

Would you mind sharing some information about your cluster? I am mainly
interested in which version of Cassandra you use, and how long the GC
pauses take.


Thank you very much


Regards

Jean Carlo

"The best way to predict the future is to invent it" Alan Kay

On Tue, Apr 25, 2017 at 7:47 PM, Durity, Sean R  wrote:

> We have seen much better stability (and MUCH fewer GC pauses) from G1 with
> a variety of heap sizes. I don’t even consider CMS any more.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Gopal, Dhruva [mailto:dhruva.go...@aspect.com]
> *Sent:* Tuesday, April 04, 2017 5:34 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: cassandra OOM
>
>
>
> Thanks, that’s interesting – so CMS is a better option for
> stability/performance? We’ll try this out in our cluster.
>
>
>
> *From: *Alexander Dejanovski 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, April 3, 2017 at 10:31 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: cassandra OOM
>
>
>
> Hi,
>
>
>
> we've seen G1GC going OOM on production clusters (repeatedly) with a 16GB
> heap when the workload is intense, and given you're running on m4.2xl I
> wouldn't go over 16GB for the heap.
>
>
>
> I'd suggest reverting to CMS, using a 16GB heap and up to 6GB of new
> gen. You can use 5 as MaxTenuringThreshold as an initial value and activate
> GC logging to fine-tune the settings afterwards.
>
>
>
> FYI CMS tends to perform better than G1 even though it's a little bit
> harder to tune.
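For reference, that suggestion could be expressed as jvm.options entries
roughly like the following (an illustrative sketch only; the heap and new-gen
sizes mirror the values mentioned above, and the GC log path is a
placeholder):

    -Xms16G
    -Xmx16G
    -Xmn6G
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled
    -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=5
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly
    # GC logging, to fine-tune from real pause data
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/var/log/cassandra/gc.log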
>
>
>
> Cheers,
>
>
>
> On Mon, Apr 3, 2017 at 10:54 PM Gopal, Dhruva 
> wrote:
>
> 16 Gig heap, with G1. Pertinent info from jvm.options below (we’re using
> m2.2xlarge instances in AWS):
>
>
>
>
>
> #################
>
> # HEAP SETTINGS #
>
> #################
>
>
>
> # Heap size is automatically calculated by cassandra-env based on this
>
> # formula: max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
>
> # That is:
>
> # - calculate 1/2 ram and cap to 1024MB
>
> # - calculate 1/4 ram and cap to 8192MB
>
> # - pick the max
>
> #
>
> # For production use you may wish to adjust this for your environment.
>
> # If that's the case, uncomment the -Xmx and Xms options below to override
> the
>
> # automatic calculation of JVM heap memory.
>
> #
>
> # It is recommended to set min (-Xms) and max (-Xmx) heap sizes to
>
> # the same value to avoid stop-the-world GC pauses during resize, and
>
> # so that we can lock the heap in memory on startup to prevent any
>
> # of it from being swapped out.
>
> -Xms16G
>
> -Xmx16G
>
>
>
> # Young generation size is automatically calculated by cassandra-env
>
> # based on this formula: min(100 * num_cores, 1/4 * heap size)
>
> #
>
> # The main trade-off for the young generation is that the larger it
>
> # is, the longer GC pause times will be. The shorter it is, the more
>
> # expensive GC will be (usually).
>
> #
>
> # It is not recommended to set the young generation size if using the
>
> # G1 GC, since that will override the target pause-time goal.
>
> # More info: http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html
>
> #
>
> # The example below assumes a modern 8-core+ machine for decent
>
> # times. If in doubt, and if you do not particularly want to tweak, go
>
> # 100 MB per physical CPU core.
>
> #-Xmn800M
>
>
>
> #################
>
> #  GC SETTINGS  #
>
> #################
>
>
>
> ### CMS Settings
>
>
>
> #-XX:+UseParNewGC
>
> #-XX:+UseConcMarkSweepGC
>
> #-XX:+CMSParallelRemarkEnabled
>
> #-XX:SurvivorRatio=8
>
> #-XX:MaxTenuringThreshold=1
>
> #-XX:CMSInitiatingOccupancyFraction=75
>
> #-XX:+UseCMSInitiatingOccupancyOnly
>
> #-XX:CMSWaitDuration=10000
>
> #-XX:+CMSParallelInitialMarkEnabled
>
> #-XX:+CMSEdenChunksRecordAlways
>
> # some JVMs will fill up their heap when accessed via JMX, see
> CASSANDRA-6541
>
> #-XX:+CMSClassUnloadingEnabled
>
>
>
> ### G1 Settings (experimental, comment previous section and uncomment
> section below to enable)
>
>
>
> ## Use the Hotspot garbage-first collector.
>
> -XX:+UseG1GC
>
> #
>
> ## Have the JVM do less remembered set work during STW, instead
>
> ## preferring concurrent GC. Reduces p99.9 latency.
>
> -XX:G1RSetUpdatingPauseTimePercent=5
>
> #
>
> ## Main G1GC tunable: lowering the pause target will lower throughput and
> vise versa.
>
> ## 200ms is the JVM default and lowest viable setting
>
> ## 1000ms increases throughput. Keep it smaller than the timeouts in
> cassandra.yaml.
>
> -XX:MaxGCPauseMillis=500
>
>
>
> ## Optional G1 Settings
>
>
>
> # Save CPU time on large (>= 16GB) heaps by delaying region scanning
>
> #