Hm, aye, I haven’t tried to write a custom partitioner, and that does look 
pretty easy.  I’ll put that on my backlog to think about.

The Camus team in the past has been excited to accept patches, and I think if a 
Hive partitioner came with Camus it would make it much easier to use.  Oh wait, 
this is the Kafka list, isn’t it…?  Ha, we should talk about this on the Camus 
list.   I am adding a ToDo for myself!
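
Just to sketch what I’m imagining (totally untested, and leaving out the actual 
Camus Partitioner wiring — the class and method names here are made up), the 
path-building part of a Hive-style partitioner might look something like:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch: build a Hive-style partition path
// (year=YYYY/month=MM/day=DD) from a topic name and an epoch-millis
// timestamp.  In a real Camus Partitioner this logic would live inside
// generatePartitionedPath(), fed by the encoded partition timestamp.
public class HivePathSketch {

    static String hivePath(String topic, long partitionMs) {
        // Single-quoted chunks are literal text in SimpleDateFormat patterns.
        SimpleDateFormat fmt =
            new SimpleDateFormat("'year='yyyy/'month='MM/'day='dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return topic + "/" + fmt.format(new Date(partitionMs));
    }

    public static void main(String[] args) {
        // 1426032000000L = 2015-03-11 00:00:00 UTC
        System.out.println(hivePath("webrequest", 1426032000000L));
        // prints: webrequest/year=2015/month=03/day=11
    }
}
```

With directories laid out like that, a plain ALTER TABLE … ADD PARTITION (or 
MSCK REPAIR TABLE) should pick them up without any custom partition pattern.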


> On Mar 11, 2015, at 17:38, Bhavesh Mistry <mistry.p.bhav...@gmail.com> wrote:
> 
> Hi Andrew,
> 
> I would say Camus is generic enough (but you can propose this to the Camus
> Team).
> 
> Here is sample code and the methods that you can use to create any path or
> directory structure (and a corresponding Hive table schema for it).
> 
> public class UTCLogPartitioner extends Partitioner {
> 
>     @Override
>     public String encodePartition(JobContext context, IEtlKey key) {
>         long outfilePartitionMs =
>             EtlMultiOutputFormat.getEtlOutputFileTimePartitionMins(context) * 60000L;
>         return "" + DateUtils.getPartition(outfilePartitionMs, key.getTime());
>     }
> 
>     @Override
>     public String generatePartitionedPath(JobContext context, String topic,
>             String brokerId, int partitionId, String encodedPartition) {
>         StringBuilder sb = new StringBuilder();
>         sb.append("Create your HDFS custom path here");
>         return sb.toString();
>     }
> }
> 
> Thanks,
> Bhavesh
> 
> On Wed, Mar 11, 2015 at 10:42 AM, Andrew Otto <ao...@wikimedia.org> wrote:
> 
>> Thanks,
>> 
>> Do you have this partitioner implemented?  Perhaps it would be good to try
>> to get this into Camus as a build in option.  HivePartitioner? :)
>> 
>> -Ao
>> 
>> 
>>> On Mar 11, 2015, at 13:11, Bhavesh Mistry <mistry.p.bhav...@gmail.com> wrote:
>>> 
>>> Hi Andrew,
>>> 
>>> You have to implement a custom partitioner, and in it you create whatever
>>> path you want (based on each message, e.g. the log line timestamp, or
>>> however you choose to build your directory hierarchy).
>>> 
>>> You will need to provide your own Partitioner implementation:
>>> 
>> https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
>>> and set the configuration "etl.partitioner.class=CLASSNAME"; then you can
>>> organize the output any way you like.
>>> 
>>> I hope this helps.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Bhavesh
>>> 
>>> 
>>> On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>>> 
>>>>> e.g. File produced by the Camus job:  /user/[hive.user]/output/
>>>>> partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>>>> 
>>>> Bhavesh, how do you get Camus to write into a directory hierarchy like
>>>> this?  Is it reading the partition values from your messages'
>>>> timestamps?
>>>> 
>>>> 
>>>>> On Mar 11, 2015, at 11:29, Bhavesh Mistry <mistry.p.bhav...@gmail.com> wrote:
>>>>> 
>>>>> Hi Yang,
>>>>> 
>>>>> We do this today, Camus to Hive (without Avro), with just plain old
>>>>> tab-separated log lines.
>>>>> 
>>>>> We use the hive -f command to add the partitions to the Hive table.
>>>>> 
>>>>> Bash shell script that adds the time buckets into the Hive table before
>>>>> the Camus job runs:
>>>>> 
>>>>> for partition in "${@//\//,}"; do
>>>>>   echo "ALTER TABLE \${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION ($partition);"
>>>>> done | hive -f /dev/stdin
>>>>> 
>>>>> 
>>>>> e.g. File produced by the Camus job:  /user/[hive.user]/output/
>>>>> partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>>>>> 
>>>>> The above adds the Hive partitions before the Camus job runs.  It works,
>>>>> and you can have any schema:
>>>>> 
>>>>> CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
>>>>> SOME Table FIELDS...
>>>>> )
>>>>> PARTITIONED BY (
>>>>>  partition_month_utc STRING,
>>>>>  partition_day_utc STRING,
>>>>>  partition_minute_bucket STRING
>>>>> )
>>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>>> STORED AS SEQUENCEFILE
>>>>> LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
>>>>> ;
>>>>> 
>>>>> 
>>>>> I hope this will help!  You will have to construct the Hive query
>>>>> according to the partitions defined.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Bhavesh
>>>>> 
>>>>> On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>>>>> 
>>>>>>> Hive provides the ability to provide custom patterns for partitions.
>>>>>>> You can use this in combination with MSCK REPAIR TABLE to
>>>>>>> automatically detect and load the partitions into the metastore.
>>>>>> 
>>>>>> I tried this yesterday, and as far as I can tell it doesn’t work with a
>>>>>> custom partition layout.  At least not with external tables.  MSCK
>>>>>> REPAIR TABLE reports that there are directories in the table’s location
>>>>>> that are not partitions of the table, but it wouldn’t actually add the
>>>>>> partitions unless the directory layout matched Hive’s default
>>>>>> (key1=value1/key2=value2, etc.)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Mar 9, 2015, at 17:16, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>>>>>> 
>>>>>>> If I understood your question correctly, you want to be able to read
>>>>>>> the output of Camus in Hive and be able to know the partition values.
>>>>>>> If my understanding is right, you can do so by using the following.
>>>>>>> 
>>>>>>> Hive provides the ability to provide custom patterns for partitions.
>>>>>>> You can use this in combination with MSCK REPAIR TABLE to
>>>>>>> automatically detect and load the partitions into the metastore.
>>>>>>> 
>>>>>>> Take a look at this SO post:
>>>>>>> http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
>>>>>>> 
>>>>>>> Does that help?
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 9, 2015 at 1:42 PM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I believe many users like us would export the output from Camus as a
>>>>>>>> Hive external table, but the dir structure of Camus is like
>>>>>>>> /YYYY/MM/DD/xxxxxx
>>>>>>>> 
>>>>>>>> while Hive generally expects /year=YYYY/month=MM/day=DD/xxxxxx if you
>>>>>>>> define that table to be partitioned by (year, month, day).  Otherwise
>>>>>>>> you'd have to add those partitions created by Camus through a separate
>>>>>>>> command.  But in the latter case, would a Camus job create more than
>>>>>>>> one partition?  How would we find out the YYYY/MM/DD values from
>>>>>>>> outside?  Well, you could always do something with hadoop dfs -ls and
>>>>>>>> then grep the output, but it's kind of not clean....
>>>>>>>> 
>>>>>>>> 
>>>>>>>> thanks
>>>>>>>> yang
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
