Re: integrate Camus and Hive?

2015-03-11 Thread Andrew Otto
 e.g File produce by the camus job:  /user/[hive.user]/output/
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*

Bhavesh, how do you get Camus to write into a directory hierarchy like this?  
Is it reading the partition values from your messages' timestamps?


 On Mar 11, 2015, at 11:29, Bhavesh Mistry mistry.p.bhav...@gmail.com wrote:
 
 Hi Yang,
 
 We do this today from Camus to Hive (without Avro), just plain old
 tab-separated log lines.
 
 We use the hive -f command to add the dynamic partitions to the Hive table:
 
 Bash shell script that adds the time buckets into the HIVE table before the camus job runs:
 
 # ${@//\//,} rewrites each argument (a Camus output path), replacing every "/" with ","
 for partition in ${@//\//,}; do
   echo ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION
 ($partition);
 done | hive -f
 
 
 e.g. file produced by the camus job:  /user/[hive.user]/output/
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
 
 The above will add the Hive dynamic partitions before the camus job runs.  It works, and
 you can have any schema:
 
 CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
  SOME Table FIELDS...
  )
  PARTITIONED BY (
partition_month_utc STRING,
partition_day_utc STRING,
partition_minute_bucket STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS SEQUENCEFILE
  LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
 ;
 
 
 I hope this helps!  You will have to construct the Hive query according
 to the partitions you define.
 
 Thanks,
 
 Bhavesh
 
 On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto ao...@wikimedia.org wrote:
 
 Hive provides the ability to provide custom patterns for partitions. You
 can use this in combination with MSCK REPAIR TABLE to automatically
 detect
 and load the partitions into the metastore.
 
 I tried this yesterday, and as far as I can tell it doesn’t work with a
 custom partition layout.  At least not with external tables.  MSCK REPAIR
 TABLE reports that there are directories in the table’s location that are
 not partitions of the table, but it wouldn’t actually add the partition
 unless the directory layout matched Hive’s default
 (key1=value1/key2=value2, etc.)
 
 
 
 On Mar 9, 2015, at 17:16, Pradeep Gollakota pradeep...@gmail.com
 wrote:
 
 If I understood your question correctly, you want to be able to read the
 output of Camus in Hive and be able to know partition values. If my
 understanding is right, you can do so by using the following.
 
 Hive provides the ability to provide custom patterns for partitions. You
 can use this in combination with MSCK REPAIR TABLE to automatically
 detect
 and load the partitions into the metastore.
 
 Take a look at this SO
 
 http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
 
 Does that help?
 
 
 On Mon, Mar 9, 2015 at 1:42 PM, Yang tedd...@gmail.com wrote:
 
 I believe many users like us would export the output from camus as a
 hive
 external table. but the dir structure of camus is like
 /YYYY/MM/DD/xx
 
 while hive generally expects /year=YYYY/month=MM/day=DD/xx if you
 define that table to be
 partitioned by (year, month, day). otherwise you'd have to add those
 partitions created by camus through a separate command. but in the
 latter
 case, would a camus job create 1 partitions ? how would we find out the
 /YYYY/MM/DD values from outside ?  well you could always do
 something by
 hadoop dfs -ls and then grep the output, but it's kind of not clean
 
 
 thanks
 yang
 
 
 



Re: Does consumer support combination of whitelist and blacklist topic filtering

2015-03-11 Thread Guozhang Wang
Tao,

In MM people can pass in consumer configs, in which people can specify
consumption topics, either in regular topic list format or whitelist /
blacklist. So I think it already does what you need?

Guozhang

On Tue, Mar 10, 2015 at 10:09 PM, tao xiao xiaotao...@gmail.com wrote:

 Thank you guys for answering. I think it would be good if we could pass in a
 customised topicCount (I think this is the interface that whitelist and
 blacklist implement, if I am not mistaken) to MM to achieve a similar thing.

 On Wednesday, March 11, 2015, Guozhang Wang wangg...@gmail.com wrote:

  Hi Tao,
 
  Unfortunately MM does not support whitelist / blacklist at the same time,
  and you have to choose either one upon initialization. As for your case, I
  think it can be captured by some reg-ex that excludes nothing but topic.10,
  but I do not know the exact expression.
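 
  (A minimal sketch of such an expression, assuming the whitelist is compiled as a
  plain java.util.regex pattern; this is not verified against MirrorMaker's
  whitelist parsing, so treat the pattern as a starting point to test. A negative
  lookahead excludes only topic.10:)
 
  import java.util.regex.Pattern;
 
  public class WhitelistRegexCheck {
      public static void main(String[] args) {
          // Matches topic.1, topic.2, topic.100, ... but not topic.10 itself.
          Pattern whitelist = Pattern.compile("topic\\.(?!10$).*");
          for (String t : new String[]{"topic.1", "topic.10", "topic.100"}) {
              System.out.println(t + " -> " + whitelist.matcher(t).matches());
          }
      }
  }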
 
  Guozhang
 
  On Tue, Mar 10, 2015 at 7:58 AM, tao xiao xiaotao...@gmail.com
  javascript:; wrote:
 
   I actually mean if we can achieve this in mirror maker.
  
   On Tue, Mar 10, 2015 at 10:52 PM, tao xiao xiaotao...@gmail.com
  javascript:; wrote:
  
Hi,
   
I have a use case where I need to consume a list of topics with names
  that
match the pattern topic.* except for one that is topic.10. Is there a
 way
that I can combine the use of whitelist and blacklist so that I can
   achieve
something like accept all topics with regex topic.* but exclude
  topic.10?
   
--
Regards,
Tao
   
  
  
  
   --
   Regards,
   Tao
  
 
 
 
  --
  -- Guozhang
 


 --
 Regards,
 Tao




-- 
-- Guozhang


Re: createMessageStreams vs createMessageStreamsByFilter

2015-03-11 Thread Guozhang Wang
Hi James,

What I meant before is that a single fetcher may be responsible for putting
fetched data to multiple queues according to the construction of the
streams setup, where each queue may be consumed by a different thread. And
the queues are actually bounded. Now say if there are two queues that are
getting data from the same fetcher F, and are consumed by two different
user threads A and B. If thread A for some reason got slowed / hung
consuming data from queue 1, then queue 1 will eventually get full, and F
trying to put more data to it will be blocked. Since F is parked on trying
to put data to queue 1, queue 2 will not get more data from it, and thread
B may hence get starved. Does that make sense now?
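
As a minimal illustration of that coupling (plain Java bounded queues standing in
for Kafka's internal chunk queues; the names and sizes are made up, this is not
Kafka code), one "fetcher" feeds two small bounded queues; once nothing drains
queue 1, the fetcher blocks on put() and the consumer of queue 2 starves:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StarvationSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue1 = new ArrayBlockingQueue<>(2); // meant for thread A
        BlockingQueue<String> queue2 = new ArrayBlockingQueue<>(2); // consumed by thread B

        // One "fetcher" fills both queues, like a single fetcher feeding several streams.
        Thread fetcher = new Thread(() -> {
            try {
                for (int i = 0; ; i++) {
                    queue1.put("chunk-" + i); // blocks for good once queue1 fills up
                    queue2.put("chunk-" + i); // queue2 then stops receiving anything
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Thread B drains queue2; thread A is deliberately missing (it "hung").
        Thread threadB = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("B got " + queue2.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        threadB.start();
        Thread.sleep(2000);
        // Only the first couple of chunks ever reach B; the fetcher is parked on queue1.put().
        System.exit(0);
    }
}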

Guozhang

On Tue, Mar 10, 2015 at 5:15 PM, James Cheng jch...@tivo.com wrote:

 Hi,

 Sorry to bring up this old thread, but my question is about this exact
 thing:

 Guozhang, you said:
  A more concrete example: say you have topic AC: 3 partitions, topic BC: 6
  partitions.
 
  With createMessageStreams(AC = 3, BC = 2) a total of 5 threads will
  be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6 respectively;
 
  With createMessageStreamsByFilter(*C = 3) a total of 3 threads will be
  created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
  respectively.


 You said that in the createMessageStreamsByFilter case, if topic AC had no
 messages in it and consumer.timeout.ms = -1, then the 3 threads might all
 be blocked waiting for data to arrive from topic AC, and so messages from
 BC would not be processed.

 createMessageStreamsByFilter(*C = 1) (single stream) would have the
 same problem but just worse. Behind the scenes, is there a single thread
 that is consuming (round-robin?) messages from the different partitions and
 inserting them all into a single queue for the application code to process?
 And that is why a single partition with no messages will block the other
 messages from getting through?

 What about createMessageStreams(AC = 1)? That creates a single stream
 that contains messages from multiple partitions, which might be on
 different brokers. Does that also suffer the same problem, where if one
 partition has no messages, that the application would not receive messages
 from the other partitions?

 Thanks,
 -James


 On Feb 11, 2015, at 8:13 AM, Guozhang Wang wangg...@gmail.com wrote:

  The new consumer will be released in 0.9, which is targeted for end of
 this
  quarter.
 
  On Tue, Feb 10, 2015 at 7:11 PM, tao xiao xiaotao...@gmail.com wrote:
 
  Do you know when the new consumer API will be publicly available?
 
  On Wed, Feb 11, 2015 at 10:43 AM, Guozhang Wang wangg...@gmail.com
  wrote:
 
  Yes, it can get stuck. For example, AC and BC are processed by two
  different processes and AC processors gets stuck, hence AC messages
 will
  fill up in the consumer's buffer and eventually prevents the fetcher
  thread
  to put more data into it; the fetcher thread will be blocked on that
 and
  not be able to fetch BC.
 
  This issue has been addressed in the new consumer client, which is
  single-threaded with non-blocking APIs.
 
  Guozhang
 
  On Tue, Feb 10, 2015 at 6:24 PM, tao xiao xiaotao...@gmail.com
 wrote:
 
  Thank you Guozhang for your detailed explanation. In your example
  createMessageStreamsByFilter(*C = 3)  since threads are shared
 among
  topics there may be situation where all 3 threads threads get stuck
  with
  topic AC e.g. topic is empty which will be holding the connecting
  threads
  (setting consumer.timeout.ms=-1) hence there is no thread to serve
  topic
  BC. do you think this situation will happen?
 
  On Wed, Feb 11, 2015 at 2:15 AM, Guozhang Wang wangg...@gmail.com
  wrote:
 
  I was not clear before .. for createMessageStreamsByFilter each
  matched
  topic will have num-threads, but shared: i.e. there will be totally
  num-threads created, but each thread will be responsible for fetching
  all
  matched topics.
 
  A more concrete example: say you have topic AC: 3 partitions, topic
  BC: 6
  partitions.
 
  With createMessageStreams(AC = 3, BC = 2) a total of 5 threads
  will
  be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6
  respectively;
 
  With createMessageStreamsByFilter(*C = 3) a total of 3 threads
  will
  be
  created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
  respectively.
 
  Guozhang
 
  On Tue, Feb 10, 2015 at 8:37 AM, tao xiao xiaotao...@gmail.com
  wrote:
 
  Guozhang,
 
  Do you mean that each regex matched topic owns number of threads
  that
  get
  passed in to createMessageStreamsByFilter ? For example in below
  code
  If
  I
  have 3 matched topics each of which has 2 partitions then I should
  have
  3 *
  2 = 6 threads in total with each topic owning 2 threads.
 
  TopicFilter filter = new Whitelist(.*);
 
  int threadTotal = 2;
 
  ListKafkaStreambyte[], byte[] streams = connector
  .createMessageStreamsByFilter(filter, threadTotal);
 
 
  But what I observed from the 

Re: High Level Consumer Example in 0.8.2

2015-03-11 Thread Ewen Cheslack-Postava
That example still works; the high level consumer interface hasn't changed.

There is a new high level consumer on the way and an initial version has
been checked into trunk, but it won't be ready to use until 0.9.
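
A minimal sketch of the existing high-level consumer with multiple threads,
following the pattern in the wiki example linked below (the connection settings
and topic name are placeholders): one ConsumerConnector, several KafkaStreams,
and one thread per stream.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class HighLevelConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // placeholder
        props.put("group.id", "example-group");           // placeholder

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for 3 streams (and hence 3 processing threads) for one topic.
        String topic = "my-topic"; // placeholder
        Map<String, Integer> topicCountMap = Collections.singletonMap(topic, 3);
        List<KafkaStream<byte[], byte[]>> streams =
                connector.createMessageStreams(topicCountMap).get(topic);

        ExecutorService executor = Executors.newFixedThreadPool(streams.size());
        for (final KafkaStream<byte[], byte[]> stream : streams) {
            executor.submit(new Runnable() {
                public void run() {
                    // Iterating a stream blocks until messages arrive
                    // (or consumer.timeout.ms expires, if it is set).
                    for (MessageAndMetadata<byte[], byte[]> msg : stream) {
                        System.out.println(msg.topic() + "-" + msg.partition()
                                + ": " + new String(msg.message()));
                    }
                }
            });
        }
    }
}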


On Wed, Mar 11, 2015 at 9:05 AM, ankit tyagi ankittyagi.mn...@gmail.com
wrote:

 Hi All,

 we are upgrading our kafka client version from 0.8.0 to 0.8.2.

 Is there any document for the high-level kafka consumer with multiple threads,
 like
 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
 for this newer version?




-- 
Thanks,
Ewen


Re: Idle/dead producer connections on broker

2015-03-11 Thread Guozhang Wang
Hmm, this sounds like a serious bug. I do remember we have some ticket
reporting similar issues before but I cannot find it now. Let me dig a bit
deeper later.

BTW, could you try out the 0.8.2 broker version and see if this is still
easily re-producible, i.e. starting a bunch of producers to send data for a
while, and terminate them?

Guozhang

On Tue, Mar 10, 2015 at 1:00 PM, Allen Wang aw...@netflix.com.invalid
wrote:

 Hello,

 We are using Kafka 0.8.1.1 on the broker and 0.8.2 producer on the client.
 After running for a few days, we have found that there are way too many
 open file descriptors on the broker side. When we compare the connections
 on the client side, we found some connections are already gone on the
 client but still exist on the broker. Also there are connections on the
 broker where the producer instances are already terminated.

 We then did a netstat -o and found that the connections on the broker side
 does not have keep-alive enabled (as timewait is off):

 tcp6   0  0 kafka-xyz:7101 ip-a-b-c-d:33471 ESTABLISHED off
 (0.00/0/0)

 We suspect that because there is no keep-alive on the broker, there is no
 probing on the idle connections and therefore no connection clean up.

 There is a default 2 hours TCP keep alive set on the OS level on both
 sides:

 net.ipv4.tcp_keepalive_time = 7200

 On the producer side, keepalive is enabled on the connection:

 tcp6   0  0 ip-a-b-c-d:33471kafka-xyz.:7101 ESTABLISHED
 keepalive (975.50/0/0)

 Is there any way to clean up the idle producer connections on the broker
 side? Does keepalive help clean up the idle connections?

 Thanks,
 Allen




-- 
-- Guozhang


Re: integrate Camus and Hive?

2015-03-11 Thread Bhavesh Mistry
Hi Andrew,

You have to implement a custom partitioner, and you will also have to create
whatever path you want (based on the message, e.g. a log line timestamp, or however you
choose to create the directory hierarchy from each of your messages).

You will need to implement your own Partitioner class implementation:
https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
and set the configuration etl.partitioner.class=CLASSNAME; then you can
organize the output any way you like.

I hope this helps.


Thanks,

Bhavesh


On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto ao...@wikimedia.org wrote:

  e.g File produce by the camus job:  /user/[hive.user]/output/
 
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*

 Bhavesh, how do you get Camus to write into a directory hierarchy like
 this?  Is it reading the partition values from your messages' timestamps?


  On Mar 11, 2015, at 11:29, Bhavesh Mistry mistry.p.bhav...@gmail.com
 wrote:
 
  HI Yang,
 
  We do this today camus to hive (without the Avro) just plain old tab
  separated log line.
 
  We use the hive -f command to add dynamic partition to hive table:
 
  Bash Shell Scripts add time buckets into HIVE table before camus job
 runs:
 
  for partition in ${@//\//,}; do
echo ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION
  ($partition);
  done | hive -f
 
 
  e.g File produce by the camus job:  /user/[hive.user]/output/
 
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
 
  Above will add hive dynamic partition before camus job runs.  It works,
 and
  you can have any schema:
 
  CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
   SOME Table FIELDS...
   )
   PARTITIONED BY (
 partition_month_utc STRING,
 partition_day_utc STRING,
 partition_minute_bucket STRING
   )
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
   STORED AS SEQUENCEFILE
   LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
  ;
 
 
  I hope this will help !   You will have to construct  hive query
 according
  to partition define.
 
  Thanks,
 
  Bhavesh
 
  On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto ao...@wikimedia.org
 wrote:
 
  Hive provides the ability to provide custom patterns for partitions.
 You
  can use this in combination with MSCK REPAIR TABLE to automatically
  detect
  and load the partitions into the metastore.
 
  I tried this yesterday, and as far as I can tell it doesn’t work with a
  custom partition layout.  At least not with external tables.  MSCK
 REPAIR
  TABLE reports that there are directories in the table’s location that
 are
  not partitions of the table, but it wouldn’t actually add the partition
  unless the directory layout matched Hive’s default
  (key1=value1/key2=value2, etc.)
 
 
 
  On Mar 9, 2015, at 17:16, Pradeep Gollakota pradeep...@gmail.com
  wrote:
 
  If I understood your question correctly, you want to be able to read
 the
  output of Camus in Hive and be able to know partition values. If my
  understanding is right, you can do so by using the following.
 
  Hive provides the ability to provide custom patterns for partitions.
 You
  can use this in combination with MSCK REPAIR TABLE to automatically
  detect
  and load the partitions into the metastore.
 
  Take a look at this SO
 
 
 http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
 
  Does that help?
 
 
  On Mon, Mar 9, 2015 at 1:42 PM, Yang tedd...@gmail.com wrote:
 
  I believe many users like us would export the output from camus as a
  hive
  external table. but the dir structure of camus is like
  //MM/DD/xx
 
  while hive generally expects /year=/month=MM/day=DD/xx if you
  define that table to be
  partitioned by (year, month, day). otherwise you'd have to add those
  partitions created by camus through a separate command. but in the
  latter
  case, would a camus job create 1 partitions ? how would we find out
 the
  /MM/DD values from outside ?  well you could always do
  something by
  hadoop dfs -ls and then grep the output, but it's kind of not
 clean
 
 
  thanks
  yang
 
 
 




Re: integrate Camus and Hive?

2015-03-11 Thread Bhavesh Mistry
Hi Andrew,

I would say camus is generic enough (but you can propose this to Camus
Team).

Here is sample code with the methods that you can use to create any path or
directory structure (and a corresponding Hive table schema for it).

// Imports assumed from the camus-api / camus-etl-kafka modules and Hadoop:
import com.linkedin.camus.etl.IEtlKey;
import com.linkedin.camus.etl.Partitioner;
import com.linkedin.camus.etl.kafka.common.DateUtils;
import com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat;
import org.apache.hadoop.mapreduce.JobContext;

public class UTCLogPartitioner extends Partitioner {

    @Override
    public String encodePartition(JobContext context, IEtlKey key) {
        // The output-file partition setting is in minutes; convert to milliseconds.
        long outfilePartitionMs =
            EtlMultiOutputFormat.getEtlOutputFileTimePartitionMins(context) * 60000L;
        return "" + DateUtils.getPartition(outfilePartitionMs, key.getTime());
    }

    @Override
    public String generatePartitionedPath(JobContext context, String topic,
            String brokerId, int partitionId, String encodedPartition) {
        StringBuilder sb = new StringBuilder();
        sb.append("Create your HDFS custom path here");
        return sb.toString();
    }

}
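
For a layout like the one earlier in this thread
(partition_month_utc=.../partition_day_utc=.../partition_minute_bucket=...),
generatePartitionedPath just needs to emit that key=value hierarchy. A
hypothetical sketch of the path-building part only, assuming the encoded
partition is the bucket timestamp in epoch milliseconds; adjust to your own
format:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class HivePathSketch {

    // Turn an encoded partition (epoch millis as a String) into the
    // Hive-style key=value directory hierarchy shown above.
    static String hivePathFor(String topic, String encodedPartition) {
        Date d = new Date(Long.parseLong(encodedPartition));
        StringBuilder sb = new StringBuilder();
        sb.append(topic)
          .append("/partition_month_utc=").append(utc("yyyy-MM").format(d))
          .append("/partition_day_utc=").append(utc("yyyy-MM-dd").format(d))
          .append("/partition_minute_bucket=").append(utc("yyyy-MM-dd-HH-mm").format(d));
        return sb.toString();
    }

    private static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f;
    }

    public static void main(String[] args) {
        // 2015-03-11T02:09:00Z, roughly the minute bucket from the example path
        System.out.println(hivePathFor("output", "1426039740000"));
        // output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09
    }
}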

I hope this helps.

Thanks,
Bhavesh

On Wed, Mar 11, 2015 at 10:42 AM, Andrew Otto ao...@wikimedia.org wrote:

 Thanks,

 Do you have this partitioner implemented?  Perhaps it would be good to try
 to get this into Camus as a build in option.  HivePartitioner? :)

 -Ao


  On Mar 11, 2015, at 13:11, Bhavesh Mistry mistry.p.bhav...@gmail.com
 wrote:
 
  Hi Ad
 
  You have to implement custom partitioner and also you will have to create
  what ever path (based on message eg log line timestamp, or however you
  choose to create directory hierarchy from your each message).
 
  You will need to implement your own Partitioner class implementation:
 
 https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
  and use configuration etl.partitioner.class=CLASSNAME  then you can
  organize any way you like.
 
  I hope this helps.
 
 
  Thanks,
 
  Bhavesh
 
 
  On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto ao...@wikimedia.org
 wrote:
 
  e.g File produce by the camus job:  /user/[hive.user]/output/
 
 
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
 
  Bhavesh, how do you get Camus to write into a directory hierarchy like
  this?  Is it reading the partition values from your messages'
 timestamps?
 
 
  On Mar 11, 2015, at 11:29, Bhavesh Mistry mistry.p.bhav...@gmail.com
  wrote:
 
  HI Yang,
 
  We do this today camus to hive (without the Avro) just plain old tab
  separated log line.
 
  We use the hive -f command to add dynamic partition to hive table:
 
  Bash Shell Scripts add time buckets into HIVE table before camus job
  runs:
 
  for partition in ${@//\//,}; do
   echo ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION
  ($partition);
  done | hive -f
 
 
  e.g File produce by the camus job:  /user/[hive.user]/output/
 
 
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
 
  Above will add hive dynamic partition before camus job runs.  It works,
  and
  you can have any schema:
 
  CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
  SOME Table FIELDS...
  )
  PARTITIONED BY (
partition_month_utc STRING,
partition_day_utc STRING,
partition_minute_bucket STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS SEQUENCEFILE
  LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
  ;
 
 
  I hope this will help !   You will have to construct  hive query
  according
  to partition define.
 
  Thanks,
 
  Bhavesh
 
  On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto ao...@wikimedia.org
  wrote:
 
  Hive provides the ability to provide custom patterns for partitions.
  You
  can use this in combination with MSCK REPAIR TABLE to automatically
  detect
  and load the partitions into the metastore.
 
  I tried this yesterday, and as far as I can tell it doesn’t work with
 a
  custom partition layout.  At least not with external tables.  MSCK
  REPAIR
  TABLE reports that there are directories in the table’s location that
  are
  not partitions of the table, but it wouldn’t actually add the
 partition
  unless the directory layout matched Hive’s default
  (key1=value1/key2=value2, etc.)
 
 
 
  On Mar 9, 2015, at 17:16, Pradeep Gollakota pradeep...@gmail.com
  wrote:
 
  If I understood your question correctly, you want to be able to read
  the
  output of Camus in Hive and be able to know partition values. If my
  understanding is right, you can do so by using the following.
 
  Hive provides the ability to provide custom patterns for partitions.
  You
  can use this in combination with MSCK REPAIR TABLE to automatically
  detect
  and load the partitions into the metastore.
 
  Take a look at this SO
 
 
 
 http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
 
  Does that help?
 
 
  On Mon, Mar 9, 2015 at 1:42 PM, Yang tedd...@gmail.com wrote:
 
  I believe many users like us would export the output from camus as a
  hive
  external table. but the dir structure of camus is like
  //MM/DD/xx
 
  while hive generally expects 

Examples of kafka based architectures?

2015-03-11 Thread Joseph Pachod
Hi all

In December Adrian Cockcroft presented some big names distributed
architectures in his talk State of the Art in Microservices at dockercon.
For each he put tooling/configuration/discovery/routing/observability on
top and then under datastores, orchestration and development. One can see
some example in the slides 51 and later there
http://fr.slideshare.net/adriancockcroft/dockercon-state-of-the-art-in-microservices
.

Regarding kafka, I feel like some similar schemes would be very welcome,
either in general or in the case of linkedin (or other big names using it).
Obviously, at least IMHO, routing is tackled, but all the rest is of
interest, since there are interdependencies. On top of that, it would provide some
examples of where to go for newcomers like me.

So if ever some of you are able to present such schemes, well, please do.
Really.

Best
Joseph


[ANNOUNCEMENT] Apache Kafka 0.8.2.1 Released

2015-03-11 Thread Jun Rao
The Apache Kafka community is pleased to announce the release for Apache Kafka 
0.8.2.1.

The 0.8.2.1 release fixes 4 critical issues in 0.8.2.0.

All of the changes in this release can be found: 
https://archive.apache.org/dist/kafka/0.8.2.1/RELEASE_NOTES.html

Apache Kafka is high-throughput, publish-subscribe messaging system rethought 
of as a distributed commit log.

** Fast = A single Kafka broker can handle hundreds of megabytes of reads and
writes per second from thousands of clients.

** Scalable = Kafka is designed to allow a single cluster to serve as the 
central data backbone
for a large organization. It can be elastically and transparently expanded 
without downtime.
Data streams are partitioned and spread over a cluster of machines to allow 
data streams
larger than the capability of any single machine and to allow clusters of 
co-ordinated consumers.

** Durable = Messages are persisted on disk and replicated within the cluster 
to prevent
data loss. Each broker can handle terabytes of messages without performance 
impact.

** Distributed by Design = Kafka has a modern cluster-centric design that 
offers
strong durability and fault-tolerance guarantees.

You can download the release from: http://kafka.apache.org/downloads.html

We welcome your help and feedback. For more information on how to
report problems, and to get involved, visit the project website at 
http://kafka.apache.org/

Thanks,

Jun


Re: integrate Camus and Hive?

2015-03-11 Thread Andrew Otto
Thanks,

Do you have this partitioner implemented?  Perhaps it would be good to try to 
get this into Camus as a built-in option.  HivePartitioner? :)

-Ao


 On Mar 11, 2015, at 13:11, Bhavesh Mistry mistry.p.bhav...@gmail.com wrote:
 
 Hi Ad
 
 You have to implement custom partitioner and also you will have to create
 what ever path (based on message eg log line timestamp, or however you
 choose to create directory hierarchy from your each message).
 
 You will need to implement your own Partitioner class implementation:
 https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
 and use configuration etl.partitioner.class=CLASSNAME  then you can
 organize any way you like.
 
 I hope this helps.
 
 
 Thanks,
 
 Bhavesh
 
 
 On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto ao...@wikimedia.org wrote:
 
 e.g File produce by the camus job:  /user/[hive.user]/output/
 
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
 
 Bhavesh, how do you get Camus to write into a directory hierarchy like
 this?  Is it reading the partition values from your messages' timestamps?
 
 
 On Mar 11, 2015, at 11:29, Bhavesh Mistry mistry.p.bhav...@gmail.com
 wrote:
 
 HI Yang,
 
 We do this today camus to hive (without the Avro) just plain old tab
 separated log line.
 
 We use the hive -f command to add dynamic partition to hive table:
 
 Bash Shell Scripts add time buckets into HIVE table before camus job
 runs:
 
 for partition in ${@//\//,}; do
  echo ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION
 ($partition);
 done | hive -f
 
 
 e.g File produce by the camus job:  /user/[hive.user]/output/
 
 *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
 
 Above will add hive dynamic partition before camus job runs.  It works,
 and
 you can have any schema:
 
 CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
 SOME Table FIELDS...
 )
 PARTITIONED BY (
   partition_month_utc STRING,
   partition_day_utc STRING,
   partition_minute_bucket STRING
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS SEQUENCEFILE
 LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
 ;
 
 
 I hope this will help !   You will have to construct  hive query
 according
 to partition define.
 
 Thanks,
 
 Bhavesh
 
 On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto ao...@wikimedia.org
 wrote:
 
 Hive provides the ability to provide custom patterns for partitions.
 You
 can use this in combination with MSCK REPAIR TABLE to automatically
 detect
 and load the partitions into the metastore.
 
 I tried this yesterday, and as far as I can tell it doesn’t work with a
 custom partition layout.  At least not with external tables.  MSCK
 REPAIR
 TABLE reports that there are directories in the table’s location that
 are
 not partitions of the table, but it wouldn’t actually add the partition
 unless the directory layout matched Hive’s default
 (key1=value1/key2=value2, etc.)
 
 
 
 On Mar 9, 2015, at 17:16, Pradeep Gollakota pradeep...@gmail.com
 wrote:
 
 If I understood your question correctly, you want to be able to read
 the
 output of Camus in Hive and be able to know partition values. If my
 understanding is right, you can do so by using the following.
 
 Hive provides the ability to provide custom patterns for partitions.
 You
 can use this in combination with MSCK REPAIR TABLE to automatically
 detect
 and load the partitions into the metastore.
 
 Take a look at this SO
 
 
 http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
 
 Does that help?
 
 
 On Mon, Mar 9, 2015 at 1:42 PM, Yang tedd...@gmail.com wrote:
 
 I believe many users like us would export the output from camus as a
 hive
 external table. but the dir structure of camus is like
 //MM/DD/xx
 
 while hive generally expects /year=/month=MM/day=DD/xx if you
 define that table to be
 partitioned by (year, month, day). otherwise you'd have to add those
 partitions created by camus through a separate command. but in the
 latter
 case, would a camus job create 1 partitions ? how would we find out
 the
 /MM/DD values from outside ?  well you could always do
 something by
 hadoop dfs -ls and then grep the output, but it's kind of not
 clean
 
 
 thanks
 yang
 
 
 
 
 



Re: createMessageStreams vs createMessageStreamsByFilter

2015-03-11 Thread tao xiao
Fetcher threads are created on a per-broker basis; there is at least one fetcher
thread per broker. The fetcher thread sends the broker a fetch request that
asks for all of its partitions. So if A, B, C are on the same broker, the fetcher
thread is still able to fetch data from B and C even though A returns no data.
The same logic applies across different brokers.

On Thu, Mar 12, 2015 at 6:25 AM, James Cheng jch...@tivo.com wrote:


 On Mar 11, 2015, at 9:12 AM, Guozhang Wang wangg...@gmail.com wrote:

  Hi James,
 
  What I meant before is that a single fetcher may be responsible for
 putting
  fetched data to multiple queues according to the construction of the
  streams setup, where each queue may be consumed by a different thread.
 And
  the queues are actually bounded. Now say if there are two queues that are
  getting data from the same fetcher F, and are consumed by two different
  user threads A and B. If thread A for some reason got slowed / hung
  consuming data from queue 1, then queue 1 will eventually get full, and F
  trying to put more data to it will be blocked. Since F is parked on
 trying
  to put data to queue 1, queue 2 will not get more data from it, and
 thread
  B may hence gets starved. Does that make sense now?
 

 Yes, that makes sense. That is the scenario where one thread of a consumer
 can cause a backup in the queue, which would cause other threads to not
 receive data.

 What about the situation I described, where a thread consumes a queue that
 is supposed to be filled with messages from multiple partitions? If
 partition A has no messages and partitions B and C do, how will the fetcher
 behave? Will the processing thread receive messages from partitions B and C?

 Thanks,
 -James


  Guozhang
 
  On Tue, Mar 10, 2015 at 5:15 PM, James Cheng jch...@tivo.com wrote:
 
  Hi,
 
  Sorry to bring up this old thread, but my question is about this exact
  thing:
 
  Guozhang, you said:
  A more concrete example: say you have topic AC: 3 partitions, topic
 BC: 6
  partitions.
 
  With createMessageStreams(AC = 3, BC = 2) a total of 5 threads
 will
  be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6
 respectively;
 
  With createMessageStreamsByFilter(*C = 3) a total of 3 threads will
 be
  created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
  respectively.
 
 
  You said that in the createMessageStreamsByFilter case, if topic AC had
 no
  messages in it and consumer.timeout.ms = -1, then the 3 threads might
 all
  be blocked waiting for data to arrive from topic AC, and so messages
 from
  BC would not be processed.
 
  createMessageStreamsByFilter(*C = 1) (single stream) would have the
  same problem but just worse. Behind the scenes, is there a single thread
  that is consuming (round-robin?) messages from the different partitions
 and
  inserting them all into a single queue for the application code to
 process?
  And that is why a single partition with no messages with block the other
  messages from getting through?
 
  What about createMessageStreams(AC = 1)? That creates a single stream
  that contains messages from multiple partitions, which might be on
  different brokers. Does that also suffer the same problem, where if one
  partition has no messages, that the application would not receive
 messages
  from the other paritions?
 
  Thanks,
  -James
 
 
  On Feb 11, 2015, at 8:13 AM, Guozhang Wang wangg...@gmail.com wrote:
 
  The new consumer will be released in 0.9, which is targeted for end of
  this
  quarter.
 
  On Tue, Feb 10, 2015 at 7:11 PM, tao xiao xiaotao...@gmail.com
 wrote:
 
  Do you know when the new consumer API will be publicly available?
 
  On Wed, Feb 11, 2015 at 10:43 AM, Guozhang Wang wangg...@gmail.com
  wrote:
 
  Yes, it can get stuck. For example, AC and BC are processed by two
  different processes and AC processors gets stuck, hence AC messages
  will
  fill up in the consumer's buffer and eventually prevents the fetcher
  thread
  to put more data into it; the fetcher thread will be blocked on that
  and
  not be able to fetch BC.
 
  This issue has been addressed in the new consumer client, which is
  single-threaded with non-blocking APIs.
 
  Guozhang
 
  On Tue, Feb 10, 2015 at 6:24 PM, tao xiao xiaotao...@gmail.com
  wrote:
 
  Thank you Guozhang for your detailed explanation. In your example
  createMessageStreamsByFilter(*C = 3)  since threads are shared
  among
  topics there may be situation where all 3 threads threads get stuck
  with
  topic AC e.g. topic is empty which will be holding the connecting
  threads
  (setting consumer.timeout.ms=-1) hence there is no thread to serve
  topic
  BC. do you think this situation will happen?
 
  On Wed, Feb 11, 2015 at 2:15 AM, Guozhang Wang wangg...@gmail.com
  wrote:
 
  I was not clear before .. for createMessageStreamsByFilter each
  matched
  topic will have num-threads, but shared: i.e. there will be totally
  num-threads created, but each thread will be responsible for
 fetching
  all
 

Out of Disk Space - Infinite loop

2015-03-11 Thread Saladi Naidu
We have 3 DCs and created a 5-node Kafka cluster in each DC, then connected these 3
DCs using Mirror Maker for replication. We were conducting performance testing
using the Kafka producer performance tool to load 100 million rows into 7 topics.
We expected that the data would be loaded evenly across the 7 topics, but 4 topics got
loaded with ~2 million messages and the remaining 3 topics with 90 million
messages. The nodes that were leaders of those 3 topics ran out of disk space
and went down.
We tried to bring back these 2 nodes by doing the following:
1. Stopped the Kafka service
2. Deleted a couple of topics that were taking up too much space, i.e. /var/kafka/logs/{topic$}/, after which the file system showed 47% available
3. Brought back the Kafka nodes
As soon as the nodes are back, we start observing the file system growing, and in
15 minutes the mount point becomes full again. The deleted topics get recreated and
take up space again. Looking at kafka.log, it shows many of the following
messages. Ultimately the node goes down. We don't need to recover data now; we
would like to bring the nodes back. What are the steps to bring back these nodes?
[2015-03-11 20:52:36,323] INFO Rolled new log segment for 'dc2-perf-topic5-0' 
in 3 ms. (kafka.log.Log)
[2015-03-11 15:58:07,321] INFO [Kafka Server 1021124614], started 
(kafka.server.KafkaServer)
[2015-03-11 15:58:07,882] INFO Completed load of log dc2-perf-topic5-0 with log 
end offset 0 (kafka.log.Log)
[2015-03-11 15:58:07,900] INFO Created log for partition [dc2-perf-topic5,0] in 
/var/kafka/log with properties {segment.index.bytes - 10485760, 
file.delete.delay.ms - 6, segment.bytes - 1073741824, flush.ms - 
9223372036854775807, delete.retention.ms - 360, index.interval.bytes - 
4096, retention.bytes - -1, cleanup.policy - delete, segment.ms - 60480, 
max.message.bytes - 112, flush.messages - 9223372036854775807, 
min.cleanable.dirty.ratio - 0.5, retention.ms - 60480}. 
(kafka.log.LogManager)
[2015-03-11 15:58:07,914] INFO Completed load of log dc2-perf-topic2-0 with log 
end offset 0 (kafka.log.Log)
[2015-03-11 15:58:07,916] INFO Created log for partition [dc2-perf-topic2,0] in 
/var/kafka/log with properties {segment.index.bytes - 10485760, 
file.delete.delay.ms - 6, segment.bytes - 1073741824, flush.ms - 
9223372036854775807, delete.retention.ms - 360, index.interval.bytes - 
4096, retention.bytes - -1, cleanup.policy - delete, segment.ms - 60480, 
max.message.bytes - 112, flush.messages - 9223372036854775807, 
min.cleanable.dirty.ratio - 0.5, retention.ms - 60480}. 
(kafka.log.LogManager)
[2015-03-11 15:58:07,935] INFO Completed load of log dc2-perf-topic9-0 with log 
end offset 0 (kafka.log.Log)

 SP Naidu 


Re: createMessageStreams vs createMessageStreamsByFilter

2015-03-11 Thread James Cheng

On Mar 11, 2015, at 9:12 AM, Guozhang Wang wangg...@gmail.com wrote:

 Hi James,
 
 What I meant before is that a single fetcher may be responsible for putting
 fetched data to multiple queues according to the construction of the
 streams setup, where each queue may be consumed by a different thread. And
 the queues are actually bounded. Now say if there are two queues that are
 getting data from the same fetcher F, and are consumed by two different
 user threads A and B. If thread A for some reason got slowed / hung
 consuming data from queue 1, then queue 1 will eventually get full, and F
 trying to put more data to it will be blocked. Since F is parked on trying
 to put data to queue 1, queue 2 will not get more data from it, and thread
 B may hence gets starved. Does that make sense now?
 

Yes, that makes sense. That is the scenario where one thread of a consumer can 
cause a backup in the queue, which would cause other threads to not receive 
data.

What about the situation I described, where a thread consumes a queue that is 
supposed to be filled with messages from multiple partitions? If partition A 
has no messages and partitions B and C do, how will the fetcher behave? Will 
the processing thread receive messages from partitions B and C?

Thanks,
-James


 Guozhang
 
 On Tue, Mar 10, 2015 at 5:15 PM, James Cheng jch...@tivo.com wrote:
 
 Hi,
 
 Sorry to bring up this old thread, but my question is about this exact
 thing:
 
 Guozhang, you said:
 A more concrete example: say you have topic AC: 3 partitions, topic BC: 6
 partitions.
 
 With createMessageStreams(AC = 3, BC = 2) a total of 5 threads will
 be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6 respectively;
 
 With createMessageStreamsByFilter(*C = 3) a total of 3 threads will be
 created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
 respectively.
 
 
 You said that in the createMessageStreamsByFilter case, if topic AC had no
 messages in it and consumer.timeout.ms = -1, then the 3 threads might all
 be blocked waiting for data to arrive from topic AC, and so messages from
 BC would not be processed.
 
 createMessageStreamsByFilter(*C = 1) (single stream) would have the
 same problem but just worse. Behind the scenes, is there a single thread
 that is consuming (round-robin?) messages from the different partitions and
 inserting them all into a single queue for the application code to process?
 And that is why a single partition with no messages with block the other
 messages from getting through?
 
 What about createMessageStreams(AC = 1)? That creates a single stream
 that contains messages from multiple partitions, which might be on
 different brokers. Does that also suffer the same problem, where if one
 partition has no messages, that the application would not receive messages
 from the other paritions?
 
 Thanks,
 -James
 
 
 On Feb 11, 2015, at 8:13 AM, Guozhang Wang wangg...@gmail.com wrote:
 
 The new consumer will be released in 0.9, which is targeted for end of
 this
 quarter.
 
 On Tue, Feb 10, 2015 at 7:11 PM, tao xiao xiaotao...@gmail.com wrote:
 
 Do you know when the new consumer API will be publicly available?
 
 On Wed, Feb 11, 2015 at 10:43 AM, Guozhang Wang wangg...@gmail.com
 wrote:
 
 Yes, it can get stuck. For example, AC and BC are processed by two
 different processes and AC processors gets stuck, hence AC messages
 will
 fill up in the consumer's buffer and eventually prevents the fetcher
 thread
 to put more data into it; the fetcher thread will be blocked on that
 and
 not be able to fetch BC.
 
 This issue has been addressed in the new consumer client, which is
 single-threaded with non-blocking APIs.
 
 Guozhang
 
 On Tue, Feb 10, 2015 at 6:24 PM, tao xiao xiaotao...@gmail.com
 wrote:
 
 Thank you Guozhang for your detailed explanation. In your example
 createMessageStreamsByFilter(*C = 3)  since threads are shared
 among
 topics there may be situation where all 3 threads threads get stuck
 with
 topic AC e.g. topic is empty which will be holding the connecting
 threads
 (setting consumer.timeout.ms=-1) hence there is no thread to serve
 topic
 BC. do you think this situation will happen?
 
 On Wed, Feb 11, 2015 at 2:15 AM, Guozhang Wang wangg...@gmail.com
 wrote:
 
 I was not clear before .. for createMessageStreamsByFilter each
 matched
 topic will have num-threads, but shared: i.e. there will be totally
 num-threads created, but each thread will be responsible for fetching
 all
 matched topics.
 
 A more concrete example: say you have topic AC: 3 partitions, topic
 BC: 6
 partitions.
 
 With createMessageStreams(AC = 3, BC = 2) a total of 5 threads
 will
 be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6
 respectively;
 
 With createMessageStreamsByFilter(*C = 3) a total of 3 threads
 will
 be
 created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
 respectively.
 
 Guozhang
 
 On Tue, Feb 10, 2015 at 8:37 AM, tao xiao xiaotao...@gmail.com
 wrote:
 
 

Re: How replicas catch up the leader

2015-03-11 Thread sy.pan
Hi, @Jiangjie Qin

this is the related info from controller.log:

[2015-03-11 10:54:11,962] ERROR [Controller 0]: Error completing reassignment 
of partition [ad_click_sts,3] (kafka.controller.KafkaController)
kafka.common.KafkaException: Partition [ad_click_sts,3] to be reassigned is 
already assigned to replicas 0,1. Ignoring request for partition reassignment
at 
kafka.controller.KafkaController.initiateReassignReplicasForTopicPartition(KafkaController.scala:585)

It seems like kafka ignores the kafka-reassign-partitions.sh command.

The JSON file used in the command is:

{"version":1,"partitions":[{"topic":"ad_click_sts","partition":3,"replicas":[0,1]}]}


In practice the partition has lost a replica from its ISR:

Topic: ad_click_sts  Partition: 3  Leader: 0  Replicas: 0,1  Isr: 0

Regards
sy.pan

 On Mar 11, 2015, at 12:32, Jiangjie Qin j...@linkedin.com.INVALID wrote:
 
 It looks that in your case it is because broker 1 somehow missed a
 controller LeaderAndIsrRequest for [ad_click_sts,4]. So the zkVersion
 would be different from the value stored in zookeeper from that on.
 Therefore broker 1 failed to update ISR. In this case you have to bounce
 broker to fix it.
 From what you posted, it looks both broker 0 and broker 1 are having this
 issue. So the question is how could both broker missed a controller
 LeaderAndIsrRequest. Is there anything interesting in controller.log?
 
 Jiangjie (Becket) Qin
 
 On 3/10/15, 8:33 PM, sy.pan shengyi@gmail.com wrote:
 
 @tao xiao and  Jiangjie Qin, Thank you very much
 
 I try to run kafka-reassign-partitions.sh, but the issue still exists…
 
 this the log info:
 
 [2015-03-11 11:00:40,086] ERROR Conditional update of path
 /brokers/topics/ad_click_sts/partitions/4/state with data
 {controller_epoch:23,leader:1,version:1,leader_epoch:35,isr:[1,0
 ]} and expected version 564 failed due to
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode
 = BadVersion for /brokers/topics/ad_click_sts/partitions/4/state
 (kafka.utils.ZkUtils$)
 
 [2015-03-11 11:00:40,086] INFO Partition [ad_click_sts,4] on broker 1:
 Cached zkVersion [564] not equal to that in zookeeper, skip updating ISR
 (kafka.cluster.Partition)
 
 
 finally, I had to restart the kafka node and the Isr problem is fixed, is
 there any better ways?
 
 Regards
 sy.pan
 
 
 On Mar 11, 2015, at 03:34, Jiangjie Qin j...@linkedin.com.INVALID wrote:
 
 This looks like a leader broker somehow did not respond to a fetch
 request
 from the follower. It may be because the broker was too busy. If that is
 the case, Xiao's approach could help - reassign partitions or reelect
 leaders to balance the traffic among brokers.
 
 Jiangjie (Becket) Qin
 
 On 3/9/15, 8:31 PM, sy.pan shengyi@gmail.com wrote:
 
 Hi, tao xiao and Jiangjie Qin
 
 I encounter with the same issue, my node had recovered from high load
 problem (caused by other application)
 
 this is the kafka-topic show:
 
 Topic:ad_click_sts PartitionCount:6ReplicationFactor:2 Configs:
Topic: ad_click_sts Partition: 0Leader: 1   Replicas: 1,0   
 Isr: 1
Topic: ad_click_sts Partition: 1Leader: 0   Replicas: 0,1   
 Isr: 0
Topic: ad_click_sts Partition: 2Leader: 1   Replicas: 1,0   
 Isr: 1
Topic: ad_click_sts Partition: 3Leader: 0   Replicas: 0,1   
 Isr: 0
Topic: ad_click_sts Partition: 4Leader: 1   Replicas: 1,0   
 Isr: 1
Topic: ad_click_sts Partition: 5Leader: 0   Replicas: 0,1   
 Isr: 0
 
 ReplicaFetcherThread info extracted from kafka server.log :
 
 [2015-03-09 21:06:05,450] ERROR [ReplicaFetcherThread-0-0], Error in
 fetch Name: FetchRequest; Version: 0; CorrelationId: 7331; ClientId:
 ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
 bytes; RequestInfo: [ad_click_sts,5] -
 PartitionFetchInfo(6149699,1048576),[ad_click_sts,3] -
 PartitionFetchInfo(6147835,1048576),[ad_click_sts,1] -
 PartitionFetchInfo(6235071,1048576) (kafka.server.ReplicaFetcherThread)
 java.net.SocketTimeoutException
  at 
 sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
  at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
  ...
  at 
 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsum
 er
 .scala:108)
  at 
 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scal
 a:
 108)
  at 
 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scal
 a:
 108)
  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
  at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
  at 
 
 kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherTh
 re
 ad.scala:96)
  at 
 
 kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88
 )
  at 
 kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
 
 

Re: Database Replication Question

2015-03-11 Thread Xiao
Hi, Pete, 

Thank you for sharing your experience with me! 

sendfile and mmap are common system calls, but it sounds like we still need to 
consider at least the file-system differences when deploying Kafka. 
 Cross-platform support is a headache. : )

Best wishes, 

Xiao Li


On Mar 10, 2015, at 10:28 AM, Pete Wright pwri...@rubiconproject.com wrote:

 On 03/09/15 22:56, Xiao wrote:
 Hi, Jay, 
 
 Thank you! 
 
 The Kafka document shows “Kafka should run well on any unix system.” I
 assume it includes the major two Unix versions, IBM AIX and HP-UX. Right?
 
 FWIW I have had good success running Kafka under load on FreeBSD.  I was
 using OpenJDK-8, and ZFS as my underlying filesystem (which supports lz4
 block level compression).  FreeBSD supports the sendfile() system call
 much like Linux does, and if AIX and HP-UX also support this you may be
 OK.  It's been a while since I've used those Unix systms so YMMV there.
 
 The one suggestion for AIX would be to use IBM's JDK/JRE as I am sure it
 has been optimized for Power processors.
 
 Obviously this precludes doing extensive load and failover testing
 before even considering running it in production - but I did not run
 into any issues using FreeBSD in my lab.
 
 Cheers,
 -pete
 
 -- 
 Pete Wright
 Systems Architect
 Rubicon Project
 pwri...@rubiconproject.com
 310.309.9298



Re: Database Replication Question

2015-03-11 Thread Xiao
Hi Jiangjie and Guozhang, 

Native z/OS support is what I need. I wrote a few native z/OS applications 
using C/C++ before. Based on my current understanding, I have two alternatives:

1) Write native z/OS producers in C/C++ to feed data to Kafka clusters 
that are running on LUW servers. I assume the mainframe and Kafka clusters are 
running in the same DC.  

2) Through additional component (like IBM MQ) to send data from z/OS to 
LUW, implement JAVA producers on LUW servers by reading MQ messages to feed 
Kafka clusters. 

Do you think they can work? Any potential issue based on your experiences? 

Before starting implementation of prototypes, I have to understand if Kafka is 
a right solution for enterprise customers, if we can ignore the security issues 
in the initial stage. (I already saw the security-related features will be 
released next month.)

Thank you for sharing the background with me, Guozhang! It helps me understand 
the reason why asynchronous fsync is introduced. 

However, I still think the original design can be kept. Users can choose a 
better solution based on their use cases. Replication is good for throughput 
and performance but it will increase the total operation costs and complicate 
the design of some Kafka applications. Occasional latency spikes might be ok 
for some use cases and some use cases do not have very high throughput 
requirements. Disk persistence is a fundamental requirement for us. : ( Zero 
data loss is always assumed; otherwise, it requires a very complicated recovery 
procedure. 

Cross center replication is not acceptable to most customers. It will reduce 
both performance and usability. The simplest solution is using dual power 
supplies in Kafka deployment. I am not sure if this is enough. 

BTW, the design proposal of “transactional messaging” misses a design change in 
the Log recovery? Recovery checkpoints might be in the middle of multiple 
in-flight transactions. 

Thank you very much! 

Xiao Li


On Mar 10, 2015, at 1:01 PM, Jiangjie Qin j...@linkedin.com.INVALID wrote:

 Hi Xiao,
 
 For z/OS, do you mean z/VM or native z/OS? For z/VM, it probably will work
 fine, but for z/OS, I would be surprised if Kafka can run directly on it.
 
 I think Guozhang¹s approach for cross colo replication is worth trying.
 One thing might need to be aware of for deploying Kafka cluster cross colo
 is that Kafka is not designed to tolerate network partition - which is
 more likely to occur when you deploy Kafka cross multiple DC. Another
 option is to run separate Kafka clusters in different data center and do
 double writes to two separate clusters. But that definitely needs more
 work to ensure consistence.
 
 Thanks.
 
 Jiangjie (Becket) Qin
 
 
 
 On 3/10/15, 8:27 AM, Guozhang Wang wangg...@gmail.com wrote:
 
 Hello Xiao,
 
 The proposed transactional messaging at Kafka is aimed at improving the
 at-least-once semantics of delivery to exactly-once, i.e. to avoid
 duplicates. It is not aimed for grouping fsync of multiple messages into
 one, as for avoiding data loss it is still dependent on data replication.
 
 As I understands your situation, you are mainly concerned with data center
 power outage tolerance via disk persistence, which cannot be easily
 supported via asynchronous time-based fsync. We used to do synchronized
 fsync before data replication is introduced, and later switched to async
 fsync for mainly for performance (KAFKA-615
 https://issues.apache.org/jira/browse/KAFKA-615), since we see the disk
 IO spikes can some times cause the whole broker to slow down quite a lot
 due to processor thread pool sharing. On the other hand, even with fsync,
 data loss may still happen upon power outage depending on your file system
 journaling settings. Another alternative to tolerate data center failures
 is
 geo-replication: you can try to put your replicas in different data
 centers, more specifically:
 
 1) you can use one producer to send normal data with ack mode = 1 for not
 waiting cross-colo replication latency.
 2) at each batch boundary, you can use another producer to send a commit
 message with ack mode = -1, which requires all data before this message to
 be already replicated, to mimic the group committing behavior.
 
 This will of course still increase the write latency and require a
 cross-colo ZK setup.
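 
  (A rough sketch of that two-producer pattern with the 0.8.2 Java producer; the
  broker address, topic, key and the commit payload below are placeholders, and
  this is an untested illustration, not a recipe:)
 
  import java.util.Properties;
  import java.util.concurrent.Future;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.clients.producer.RecordMetadata;
 
  public class GroupCommitSketch {
 
      static KafkaProducer<String, String> newProducer(String acks) {
          Properties p = new Properties();
          p.put("bootstrap.servers", "broker1:9092"); // placeholder
          p.put("acks", acks);
          p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          return new KafkaProducer<String, String>(p);
      }
 
      public static void main(String[] args) throws Exception {
          // 1) normal data through an acks=1 producer: no per-message cross-colo wait
          KafkaProducer<String, String> dataProducer = newProducer("1");
          // 2) batch-boundary commit marker through an acks=-1 producer
          KafkaProducer<String, String> commitProducer = newProducer("-1");
 
          String topic = "replicated-topic"; // placeholder; one key keeps everything on one partition
          Future<RecordMetadata> last = null;
          for (int i = 0; i < 1000; i++) {
              last = dataProducer.send(new ProducerRecord<String, String>(topic, "key", "msg-" + i));
          }
          // Wait until the batch has reached the leader so the marker is appended after it.
          last.get();
 
          // The commit marker is only acknowledged once all in-sync replicas have it,
          // which also covers everything appended before it on that partition.
          commitProducer.send(new ProducerRecord<String, String>(topic, "key", "commit")).get();
 
          dataProducer.close();
          commitProducer.close();
      }
  }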
 
 Guozhang
 
 On Mon, Mar 9, 2015 at 10:56 PM, Xiao lixiao1...@gmail.com wrote:
 
 Hi, Jay,
 
 Thank you!
 
 The Kafka document shows “Kafka should run well on any unix system.” I
 assume it includes the major two Unix versions, IBM AIX and HP-UX.
 Right?
 
 1. Unfortunately, we aims at supporting all the platforms, Linux, Unix,
 Windows and especially z/OS. I know z/OS is not easy to support.
 
 2. Fsync per message is very expensive and Fsync per batch will break
 the
 transaction atomicity. We are looking for transaction-level fsync,
 which is
 more efficient. Then, our producers can easily combine multiple small
 transactions 

Consuming messages on the same backend

2015-03-11 Thread jelmer
Hi. I have the following problem :

we are building a system that will generate any number of different events
published to different topics

Events have an associated client and a client can express interest in these
events. When they do, then for each event we will execute a callback to a
remote (http) endpoint

It is important that we do not overload the remote endpoint, say after a
crash when there are a metric ton of unprocessed events waiting to be
processed.

So I'd like all events for a client to end up on the same consumer and do
the throttling there.

Is it possible to do this in kafka ? I know that i can specify the client
as a key when producing the events so that events for a client for that
topic all end up on the same partition

but because we have a topic per event type and lots of events then the
messages will still be processed by many backends

Creating a new topic that contains all the events from all the other topics
would work i guess. But this feels inelegant and then you are storing data
twice

Can anyone offer any guidance?


Re: createMessageStreams vs createMessageStreamsByFilter

2015-03-11 Thread tao xiao
consumer.timeout.ms only affects how the stream reads data from the
internal chunk queue that is used to buffer received data. The actual data
fetching is done by another fetcher
thread, kafka.consumer.ConsumerFetcherThread. The fetcher thread keeps
reading data from the broker and puts it into the queue, and the stream keeps
polling the queue and passes data back to the consumer, if any.

So for a case like createMessageStreams(AC = 1), the same stream
(which means the same chunk queue) is shared by multiple partitions of topic
AC. If one of the partitions has no data, the consumer is still able to read
data from the other partitions, as the fetcher thread keeps feeding data from
the other partitions into the queue.

The only situation where the consumer will get stuck is when the fetcher thread
is blocked by the network, e.g. high network latency between consumer and broker,
or no data from the broker. This is because the fetcher thread is implemented
using blocking I/O.


On Wed, Mar 11, 2015 at 8:15 AM, James Cheng jch...@tivo.com wrote:

 Hi,

 Sorry to bring up this old thread, but my question is about this exact
 thing:

 Guozhang, you said:
  A more concrete example: say you have topic AC: 3 partitions, topic BC: 6
  partitions.
 
  With createMessageStreams(AC = 3, BC = 2) a total of 5 threads will
  be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6 respectively;
 
  With createMessageStreamsByFilter(*C = 3) a total of 3 threads will be
  created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
  respectively.


 You said that in the createMessageStreamsByFilter case, if topic AC had no
 messages in it and consumer.timeout.ms = -1, then the 3 threads might all
 be blocked waiting for data to arrive from topic AC, and so messages from
 BC would not be processed.

 createMessageStreamsByFilter(*C = 1) (single stream) would have the
 same problem but just worse. Behind the scenes, is there a single thread
 that is consuming (round-robin?) messages from the different partitions and
 inserting them all into a single queue for the application code to process?
 And that is why a single partition with no messages with block the other
 messages from getting through?

 What about createMessageStreams(AC = 1)? That creates a single stream
 that contains messages from multiple partitions, which might be on
 different brokers. Does that also suffer the same problem, where if one
 partition has no messages, that the application would not receive messages
 from the other paritions?

 Thanks,
 -James


 On Feb 11, 2015, at 8:13 AM, Guozhang Wang wangg...@gmail.com wrote:

  The new consumer will be released in 0.9, which is targeted for end of
 this
  quarter.
 
  On Tue, Feb 10, 2015 at 7:11 PM, tao xiao xiaotao...@gmail.com wrote:
 
  Do you know when the new consumer API will be publicly available?
 
  On Wed, Feb 11, 2015 at 10:43 AM, Guozhang Wang wangg...@gmail.com
  wrote:
 
  Yes, it can get stuck. For example, AC and BC are processed by two
  different processors and the AC processor gets stuck; hence AC messages
  will fill up in the consumer's buffer and eventually prevent the fetcher
  thread from putting more data into it; the fetcher thread will be blocked
  on that and not be able to fetch BC.
 
  This issue has been addressed in the new consumer client, which is
  single-threaded with non-blocking APIs.
 
  Guozhang
 
  On Tue, Feb 10, 2015 at 6:24 PM, tao xiao xiaotao...@gmail.com
 wrote:
 
  Thank you Guozhang for your detailed explanation. In your example
  createMessageStreamsByFilter(*C = 3), since threads are shared among
  topics, there may be a situation where all 3 threads get stuck with
  topic AC, e.g. the topic is empty, which will be holding the connecting
  threads (with consumer.timeout.ms=-1), hence there is no thread left to
  serve topic BC. Do you think this situation will happen?
 
  On Wed, Feb 11, 2015 at 2:15 AM, Guozhang Wang wangg...@gmail.com
  wrote:
 
  I was not clear before: for createMessageStreamsByFilter each matched
  topic will have num-threads, but shared: i.e. there will be only
  num-threads created in total, and each thread will be responsible for
  fetching all matched topics.
 
  A more concrete example: say you have topic AC: 3 partitions, topic
  BC: 6
  partitions.
 
  With createMessageStreams(AC = 3, BC = 2) a total of 5 threads
  will
  be created, and consuming AC-1,AC-2,AC-3,BC-1/2/3,BC-4/5/6
  respectively;
 
  With createMessageStreamsByFilter(*C = 3) a total of 3 threads
  will
  be
  created, and consuming AC-1/BC-1/BC-2, AC-2/BC-3/BC-4, AC-3/BC-5/BC-6
  respectively.
 
  Guozhang
 
  On Tue, Feb 10, 2015 at 8:37 AM, tao xiao xiaotao...@gmail.com
  wrote:
 
  Guozhang,
 
   Do you mean that each regex-matched topic owns the number of threads
   that gets passed in to createMessageStreamsByFilter? For example, in the
   code below, if I have 3 matched topics, each of which has 2 partitions,
   then I should have 3 * 2 = 6 threads in total, with each topic owning 2
   threads.
 
  

Re: integrate Camus and Hive?

2015-03-11 Thread Andrew Otto
 Hive provides the ability to provide custom patterns for partitions. You
 can use this in combination with MSCK REPAIR TABLE to automatically detect
 and load the partitions into the metastore.

I tried this yesterday, and as far as I can tell it doesn’t work with a custom 
partition layout.  At least not with external tables.  MSCK REPAIR TABLE 
reports that there are directories in the table’s location that are not 
partitions of the table, but it wouldn’t actually add the partition unless the 
directory layout matched Hive’s default (key1=value1/key2=value2, etc.)



 On Mar 9, 2015, at 17:16, Pradeep Gollakota pradeep...@gmail.com wrote:
 
 If I understood your question correctly, you want to be able to read the
 output of Camus in Hive and be able to know partition values. If my
 understanding is right, you can do so by using the following.
 
 Hive provides the ability to provide custom patterns for partitions. You
 can use this in combination with MSCK REPAIR TABLE to automatically detect
 and load the partitions into the metastore.
 
 Take a look at this SO
 http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
 
 Does that help?
 
 
 On Mon, Mar 9, 2015 at 1:42 PM, Yang tedd...@gmail.com wrote:
 
 I believe many users like us would export the output from camus as a hive
 external table. but the dir structure of camus is like
  /YYYY/MM/DD/xx
 
  while hive generally expects /year=YYYY/month=MM/day=DD/xx if you
 define that table to be
 partitioned by (year, month, day). otherwise you'd have to add those
 partitions created by camus through a separate command. but in the latter
 case, would a camus job create 1 partitions ? how would we find out the
 /MM/DD values from outside ?  well you could always do something by
 hadoop dfs -ls and then grep the output, but it's kind of not clean
 
 
 thanks
 yang
 



Re: High Level Consumer Example in 0.8.2

2015-03-11 Thread ankit tyagi
Hi Ewen,

I am using the *kafka-clients-0.8.2.0.jar* client jar and I don't see the above
interface in it. Which dependency do I need to include?

On Wed, Mar 11, 2015 at 10:17 PM, Ewen Cheslack-Postava e...@confluent.io
wrote:

 That example still works, the high level consumer interface hasn't changed.

 There is a new high level consumer on the way and an initial version has
 been checked into trunk, but it won't be ready to use until 0.9.


 On Wed, Mar 11, 2015 at 9:05 AM, ankit tyagi ankittyagi.mn...@gmail.com
 wrote:

  Hi All,
 
  we are upgrading our kafka client version from 0.8.0 to 0.8.2.
 
   Is there any document for the high-level Kafka consumer with multiple
   threads, like
   https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
   for this newer version?
 



 --
 Thanks,
 Ewen



Re: High Level Consumer Example in 0.8.2

2015-03-11 Thread Ewen Cheslack-Postava
Ah, I see the confusion now. The kafka-clients jar was introduced only
recently and is meant to hold the new clients, which are pure Java
implementations and cleanly isolated from the server code. The old clients
(including what we call the old consumer since a new consumer is being
developed, but is really the current consumer) are in the core module.
For that you want a kafka_scala_version.jar. For example, this artifact
should work for you:
http://search.maven.org/#artifactdetails|org.apache.kafka|kafka_2.10|0.8.2.1|jar

By the way, you probably also want 0.8.2.1 now since it fixed a few
important bugs in 0.8.2.0.
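
For instance, with that jar on the classpath the old high-level consumer is
reachable via the kafka.consumer / kafka.javaapi.consumer packages; a minimal
bootstrap sketch (ZooKeeper address and group id are placeholders):

  import java.util.Properties;
  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.javaapi.consumer.ConsumerConnector;

  public class OldConsumerBootstrap {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("zookeeper.connect", "zk1:2181"); // placeholder
          props.put("group.id", "example-group");     // placeholder
          // These classes come from the core kafka_2.10 artifact, not kafka-clients.
          ConsumerConnector connector =
              Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
          connector.shutdown();
      }
  }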

-Ewen

On Wed, Mar 11, 2015 at 10:08 PM, ankit tyagi ankittyagi.mn...@gmail.com
wrote:

 Hi Ewen,

 I am using* kafka-clients-0.8.2.0.jar* client jar and i don't see above
 interface in that. which dependency do i need to include??

 On Wed, Mar 11, 2015 at 10:17 PM, Ewen Cheslack-Postava e...@confluent.io
 
 wrote:

  That example still works, the high level consumer interface hasn't
 changed.
 
  There is a new high level consumer on the way and an initial version has
  been checked into trunk, but it won't be ready to use until 0.9.
 
 
  On Wed, Mar 11, 2015 at 9:05 AM, ankit tyagi ankittyagi.mn...@gmail.com
 
  wrote:
 
   Hi All,
  
   we are upgrading our kafka client version from 0.8.0 to 0.8.2.
  
   Is there any document  for High level kafka consumer withMultiple
 thread
   like
  
 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
   for this newer version.
  
 
 
 
  --
  Thanks,
  Ewen
 




-- 
Thanks,
Ewen


Re: High Level Consumer Example in 0.8.2

2015-03-11 Thread ankit tyagi
Hi Ewen,

Just One Question,

If I want to use the new producer implementation, then I need to include
the dependencies below:

 1. *kafka-clients-0.8.2.0.jar (for the new producer implementation)*
 2. *kafka_2.10-0.8.2.1.jar (for the consumer)*

As I see it, the producer package is different in the two jars, so there won't
be any conflicts.
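
Roughly, I imagine the combination would look like the sketch below (broker,
ZooKeeper and group settings are placeholders); the new producer lives under
org.apache.kafka.clients.producer while the old consumer lives under kafka.*,
so the two jars should be able to coexist on the classpath:

  import java.util.Properties;
  // New producer, from kafka-clients
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.Producer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  // Old high-level consumer, from kafka_2.10 (core)
  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.javaapi.consumer.ConsumerConnector;

  public class MixedClients {
      public static void main(String[] args) {
          // New producer (org.apache.kafka.clients.producer.*)
          Properties producerProps = new Properties();
          producerProps.put("bootstrap.servers", "broker1:9092"); // placeholder
          producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          Producer<String, String> producer = new KafkaProducer<>(producerProps);
          producer.send(new ProducerRecord<>("example-topic", "key", "value")); // hypothetical topic

          // Old high-level consumer (kafka.javaapi.consumer.*) -- different packages, no clash
          Properties consumerProps = new Properties();
          consumerProps.put("zookeeper.connect", "zk1:2181"); // placeholder
          consumerProps.put("group.id", "example-group");     // placeholder
          ConsumerConnector connector =
              Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerProps));

          producer.close();
          connector.shutdown();
      }
  }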



On Thu, Mar 12, 2015 at 10:51 AM, Ewen Cheslack-Postava e...@confluent.io
wrote:

 Ah, I see the confusion now. The kafka-clients jar was introduced only
 recently and is meant to hold the new clients, which are pure Java
 implementations and cleanly isolated from the server code. The old clients
 (including what we call the old consumer since a new consumer is being
 developed, but is really the current consumer) are in the core module.
 For that you want a kafka_scala_version.jar. For example, this artifact
 should work for you:

 http://search.maven.org/#artifactdetails|org.apache.kafka|kafka_2.10|0.8.2.1|jar

 By the way, you probably also want 0.8.2.1 now since it fixed a few
 important bugs in 0.8.2.0.

 -Ewen

 On Wed, Mar 11, 2015 at 10:08 PM, ankit tyagi ankittyagi.mn...@gmail.com
 wrote:

  Hi Ewen,
 
  I am using* kafka-clients-0.8.2.0.jar* client jar and i don't see above
  interface in that. which dependency do i need to include??
 
  On Wed, Mar 11, 2015 at 10:17 PM, Ewen Cheslack-Postava 
 e...@confluent.io
  
  wrote:
 
   That example still works, the high level consumer interface hasn't
  changed.
  
   There is a new high level consumer on the way and an initial version
 has
   been checked into trunk, but it won't be ready to use until 0.9.
  
  
   On Wed, Mar 11, 2015 at 9:05 AM, ankit tyagi 
 ankittyagi.mn...@gmail.com
  
   wrote:
  
Hi All,
   
we are upgrading our kafka client version from 0.8.0 to 0.8.2.
   
Is there any document  for High level kafka consumer withMultiple
  thread
like
   
  https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
for this newer version.
   
  
  
  
   --
   Thanks,
   Ewen
  
 



 --
 Thanks,
 Ewen