Re: How to deal with bootstrapping

2015-04-15 Thread Yan Fang
Hi Jeremy, In order to reach this goal, we have to assume that the job with new rules can always catch up with the one with old rules. Otherwise, I think we have no choice but to run a lot of jobs simultaneously. Under our assumption, we have job1 with old rules running, and now add job2

Re: How to deal with bootstrapping

2015-04-15 Thread Benjamin Black
What about this: 1) Add the new rule to the classifier task 2) Take note of the offset of the first message processed after restart 3) Run a job to process from offset 0 to the offset from #2, after which the job is deleted I don't know how to do 2 or 3, but perhaps some of the core Samza folk could shed
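The three steps above amount to a bounded backfill. A minimal sketch in plain Python, not the Samza API: the log is modeled as a list of (offset, value) pairs, and `backfill`, `first_live_offset`, and the divisible-by-three rule are hypothetical names chosen for illustration. It assumes the offset noted in step 2 is exclusive, since the restarted task already handles that message with the new rule.

```python
from typing import Callable, List, Tuple

Message = Tuple[int, int]  # (offset, value); a stand-in for a Kafka message


def backfill(log: List[Message],
             new_rule: Callable[[int], bool],
             stop_offset: int) -> List[Tuple[int, bool]]:
    """Apply new_rule to every message whose offset is below stop_offset."""
    results = []
    for offset, value in log:
        if offset >= stop_offset:
            # The restarted task covers everything from here on.
            break
        results.append((offset, new_rule(value)))
    return results


# Hypothetical data: five messages, restart observed at offset 3.
log = list(enumerate([4, 9, 12, 7, 30]))
first_live_offset = 3
backfilled = backfill(log, lambda v: v % 3 == 0, first_live_offset)
```

Once `backfill` returns, the one-off job has nothing left to do and can be deleted, as in step 3.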

Re: Samza Unit Test Instructions

2015-04-15 Thread Yan Fang
Hi Yuanchi, Samza does not provide out-of-the-box unit tests, but there are some ways to test: 1) If you only want to test the logic in the Task class, normal unit tests will work. You can create a unit test that tests init(), process(), etc. 2) Create mock systems by implementing SystemAdmin, System
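Option 1 can be illustrated with a small sketch of the pattern: call the task's init() and process() directly and capture its output in a fake collector. This is plain Python for brevity, not Samza's Java interfaces; `FakeCollector`, `UppercaseTask`, and the method signatures are hypothetical and simpler than the real ones.

```python
class FakeCollector:
    """Records every message the task emits so a test can inspect it."""

    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


class UppercaseTask:
    """Hypothetical task: upper-cases each message and counts what it has seen."""

    def init(self):
        self.count = 0

    def process(self, message, collector):
        self.count += 1
        collector.send(message.upper())


def run_task(messages):
    """Drive the task the way a unit test would, with no cluster involved."""
    task = UppercaseTask()
    task.init()
    collector = FakeCollector()
    for message in messages:
        task.process(message, collector)
    return task, collector


task, collector = run_task(["a", "bc"])
```

Because the collector is the only dependency, the test needs no running Kafka or YARN.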

Samza Unit Test Instructions

2015-04-15 Thread Yuanchi Ning
Hello Samza Team, This is Yuanchi Ning from Uber Data Engineering, Realtime Metrics, Streaming Platform team. We are planning to use Samza to process the realtime data we have, and thanks for developing such an awesome open source project. While I am building our streaming service using Samza, I

Re: How to deal with bootstrapping

2015-04-15 Thread jeremy p
Hello Yan, Thank you for the suggestion! I think your solution would work; however, I am afraid it would create a performance problem for our users. Let's say we kill the Classifier task, and create a new Classifier task with both the existing rules and new rules. We get the offset of the latest

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Milinda Pathirage
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/ --- (Updated April 15, 2015, 6:56 p.m.) Review request for samza, Chris Riccomini a

Re: How to deal with bootstrapping

2015-04-15 Thread Yan Fang
Hi Jeremy, If my understanding is correct, whenever you add a new rule, you want to apply this rule to the historical data. Right? If you do not care about duplication, you can create a new task that contains existing rules and new rules. Configure bootstrap. This will apply all the rules from th

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Yi Pan (Data Infrastructure)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/#review80248 --- Ship it! - Yi Pan (Data Infrastructure) On April 15, 20

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Milinda Pathirage
> On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote: > > samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/QueryPlanner.java, > > line 61 > > > > > > One quick question: do we need to

Re: Review Request 33146: Adding a new KV store contract: BatchingKeyValueStore

2015-04-15 Thread Yi Pan (Data Infrastructure)
> On April 14, 2015, 9:14 p.m., Chris Riccomini wrote: > > I'm concerned that there might be an issue with this approach. In > > BaseKeyValueStorageEngineFactory, we compose stores by nesting them. If > > this is the case, I think that the top-most store will implement the > > batching key val

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Yi Pan (Data Infrastructure)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/#review80237 --- Ship it! +1 - Yi Pan (Data Infrastructure) On April 15, 2015, 2:

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Yi Pan (Data Infrastructure)
> On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote: > > samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/AvroSchemaConverter.java, > > line 37 > > > > > > I assume that this class is

Re: How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Chris Riccomini
Hey Roger, Hmm, that's good to know, lol. Wonder how ours is working. :) I'll poke around. Cheers, Chris On Wed, Apr 15, 2015 at 11:17 AM, Roger Hoover wrote: > Turns out that HADOOP_CONF_DIR is the right env var (YARN_CONF_DIR did not > work). I had just messed up the directory path. Doh!

Re: Maximum number of jobs

2015-04-15 Thread jeremy p
Thank you, Chris. I just wrote a separate question, "How to deal with bootstrapping" where I describe the problem in detail. On Wed, Apr 15, 2015 at 1:35 PM, Chris Riccomini wrote: > Hey Jeremy, > > Samza will be fine, but at this scale you need to start worrying about > Kafka and YARN. 1 milli

How to deal with bootstrapping

2015-04-15 Thread jeremy p
So, I'm wanting to use Samza for a project I'm working on, but I keep running into a problem with bootstrapping. Let's say there's a Kafka topic called Numbers that I want to consume with Samza. Let's say each message has a single integer in it, and I want to classify it as even or odd. So I hav
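The even/odd classification described above can be sketched in a few lines. This is plain Python for illustration only, not a Samza task; `classify` and the sample list standing in for the Numbers topic are hypothetical.

```python
def classify(number: int) -> str:
    """Label a single integer message as 'even' or 'odd'."""
    return "even" if number % 2 == 0 else "odd"


# Stand-in for messages consumed from the Numbers topic.
numbers = [3, 8, 15, 0, -7]
labels = [classify(n) for n in numbers]
```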

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Yi Pan (Data Infrastructure)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/#review80228 --- samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planne

Re: How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Roger Hoover
Turns out that HADOOP_CONF_DIR is the right env var (YARN_CONF_DIR did not work). I had just messed up the directory path. Doh! > On Apr 15, 2015, at 9:41 AM, Roger Hoover wrote: > > I'll try that. Thanks, Chris. > >> On Wed, Apr 15, 2015 at 9:37 AM, Chris Riccomini >

Request: Speakers for May 5 meet up at LinkedIn in Mountain View, CA

2015-04-15 Thread Ed Yakabosky
Hi all - I am helping to set the agenda for a May 5 meetup on Samza hosted by LinkedIn in Mountain View, CA. We’ve got several excellent speakers lined up for future meetups but are looking for some additional content for May 5. Is a

Re: Maximum number of jobs

2015-04-15 Thread Chris Riccomini
Hey Jeremy, Samza will be fine, but at this scale you need to start worrying about Kafka and YARN. 1 million jobs will likely start to put pressure on YARN's RM due to memory usage and CPU usage for the scheduler. With 1 million jobs, assuming 1 container each, you'll have over 1 million connectio

Maximum number of jobs

2015-04-15 Thread jeremy p
What's the maximum number of Samza jobs I can run simultaneously on a single cluster? Let's say these jobs are very lightweight -- they require little memory or processing power. However, I need a lot of them -- let's say I need to have 1,000,000 running at any given time. Is this reasonable or

Re: How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Roger Hoover
I'll try that. Thanks, Chris. On Wed, Apr 15, 2015 at 9:37 AM, Chris Riccomini wrote: > Hey Roger, > > Not sure if this makes a difference, but have you tried using: > > export YARN_CONF_DIR=... > > Instead? This is what we use. > > Cheers, > Chris > > On Wed, Apr 15, 2015 at 9:33 AM, Roger H

Re: Extra Systems and other extensions.

2015-04-15 Thread Chinmay Soman
+1! I was going to do this for my use case as well. Would love to have this! On Wed, Apr 15, 2015 at 9:24 AM, Roger Hoover wrote: > Dan, > > This is great. Would love to have a common ElasticSearch system producer. > > Cheers, > > Roger > > On Tue, Apr 14, 2015 at 1:34 PM, Dan wrote: > > > T

Re: How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Chris Riccomini
Hey Roger, Not sure if this makes a difference, but have you tried using: export YARN_CONF_DIR=... Instead? This is what we use. Cheers, Chris On Wed, Apr 15, 2015 at 9:33 AM, Roger Hoover wrote: > Hi, > > I'm trying to deploy a job to a small YARN cluster. How do I tell the > launcher scri

How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Roger Hoover
Hi, I'm trying to deploy a job to a small YARN cluster. How do I tell the launcher script where to find the Resource Manager? I tried creating a yarn-site.xml and setting the HADOOP_CONF_DIR environment variable but it doesn't find my config. 2015-04-14 22:02:45 ClientHelper [INFO] trying to connect

Re: Extra Systems and other extensions.

2015-04-15 Thread Roger Hoover
Dan, This is great. Would love to have a common ElasticSearch system producer. Cheers, Roger On Tue, Apr 14, 2015 at 1:34 PM, Dan wrote: > Thanks Jakob, I agree they'll be better maintained and tested if they're in > the main repo so that's great. > > I'll sort out JIRAs and get some patches

Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Milinda Pathirage
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/ --- Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure). Bug

Re: Updating samza-sql branch to Java 1.7

2015-04-15 Thread Milinda Pathirage
Thanks everyone. Milinda On Tue, Apr 14, 2015 at 6:06 PM, Yi Pan wrote: > Merged master to samza-sql. > > On Tue, Apr 14, 2015 at 2:57 PM, Jakob Homan wrote: > > > Yes, I removed the tests for JDK6 yesterday. We're 1.7 or above now > > for development. > > > > On 14 April 2015 at 12:47, Milin