>>>> - does each slave read all of the splits, or is the master process
responsible for obtaining the...

Here is the workflow I am using. I cannot provide implementation details
but I hope it will give you enough information to proceed:

1. On the master node you will have to create a custom Hadoop InputFormat
that will be used for MapReduce.
2. That input format will generate custom HCatalog input splits.
3. The custom input split will have everything it needs to run on the
mapper. Originally we just serialized the ReaderContext with each split, but
that did not perform well when you have thousands of splits. So the solution
was to create a wrapper for the HCatSplit and serialize only the original
split itself plus the Hadoop config obtained from the ReaderContext. This
lets you deserialize what you need on the mapper (and create a record
reader); a sketch of such a wrapper follows below.
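
Here is a rough sketch of such a wrapper split. The class and field names are
my own, just for illustration, and I am assuming the org.apache.hcatalog
packages from Hive 0.12 with the new-API org.apache.hadoop.mapreduce.InputSplit:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hcatalog.mapreduce.HCatSplit;

// Illustrative wrapper: it carries the original HCatSplit plus the Hadoop
// config obtained from the ReaderContext, so the mapper can rebuild a reader.
public class HCatSplitWrapper extends InputSplit implements Writable {

  private HCatSplit wrappedSplit; // the split produced by HCatalog
  private Configuration config;   // config obtained from the ReaderContext

  public HCatSplitWrapper() {
    // no-arg constructor required by Hadoop for deserialization
  }

  public HCatSplitWrapper(HCatSplit wrappedSplit, Configuration config) {
    this.wrappedSplit = wrappedSplit;
    this.config = config;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    wrappedSplit.write(out); // HCatSplit is itself Writable
    config.write(out);       // and so is Configuration
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    wrappedSplit = new HCatSplit();
    wrappedSplit.readFields(in);
    config = new Configuration(false);
    config.readFields(in);
  }

  @Override
  public long getLength() throws IOException, InterruptedException {
    return wrappedSplit.getLength();
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException {
    return wrappedSplit.getLocations();
  }

  public HCatSplit getWrappedSplit() {
    return wrappedSplit;
  }

  public Configuration getConfig() {
    return config;
  }
}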

>>> - how does the master pass the ReaderContext to the slave?

You do not need to serialize the whole ReaderContext (which can become
rather large). Only the Hadoop config needs to be included in the wrapper
for the HCatSplit:

HCatReader hcatReader = DataTransferFactory.getHCatReader(inputSplit,
config); // on the mapper
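
On the mapper side it then looks roughly like this (HCatSplitWrapper is the
illustrative wrapper sketched above, not part of the HCatalog API):

// inside the mapper (or a custom record reader) -- a sketch
HCatSplitWrapper wrapper = (HCatSplitWrapper) context.getInputSplit();
HCatReader hcatReader = DataTransferFactory.getHCatReader(
    wrapper.getWrappedSplit(), wrapper.getConfig());
Iterator<HCatRecord> records = hcatReader.read();
while (records.hasNext()) {
  HCatRecord record = records.next();
  // process the record here
}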

>>>> - are there any real-world code examples?

I have not seen such examples in the public domain. You can probably look
at what was done for the HCat/Pig integration, but I cannot help there
because we do not use Pig.

Hope this helps.

On Mon, Jun 16, 2014 at 1:12 PM, Brian Jeltema <
brian.jelt...@digitalenvoy.net> wrote:

> Thanks. I’d already implemented something like this based on some docs I
> found. I’m a little
> confused about the scenario for reading the splits on slaves:
>
>   - does each slave read all of the splits, or is the master process
> responsible for obtaining the
>     list of splits and then modifying the ReaderContext to contain a
> partial list before passing the
>     ReaderContext to the slave?
>
>   - how does the master pass the ReaderContext to the slave?
>
>   - are there any real-world code examples?
>
> Thanks
> Brian
>
> On Jun 16, 2014, at 12:19 PM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
>
> Here is the code sketch to get you started:
>
> Step 1. Create a builder:
>
> ReadEntity.Builder builder = new ReadEntity.Builder();
> String database = ...
> builder.withDatabase(database);
> String table = ...
> builder.withTable(table);
> String filter = ...
> if (filter != null) {
> builder.withFilter(filter);
> }
> String region = getString(context.getRegion());
> if (region != null) {
> builder.withRegion(region);
> }
>
>
> Step 2: Get initial reader context
>
> Map<String, String> config = ...
> // make sure that you have hive.metastore.uris property in the config
> ReadEntity entity = builder.build();
> ReaderContext readerContext = DataTransferFactory.getHCatReader(entity,
> config).prepareRead();
>
> Step 3: Get input splits and Hadoop Configuration
>
> List<InputSplit> splits = readerContext.getSplits();
> Configuration config = readerContext.getConfig();
>
> Step 4: Get records
>
> a) for each input split get the reader:
>
> HCatReader hcatReader = DataTransferFactory.getHCatReader(inputSplit,
> config);
>
> Iterator<HCatRecord> records = hcatReader.read();
>
> b) Iterate over the records for that reader
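>
> Something along these lines, just as a sketch (you can also use
> record.get(fieldName, schema) for named access if you have the HCatSchema
> handy):
>
> while (records.hasNext()) {
>   HCatRecord record = records.next();
>   // positional access to the columns of this record
>   List<Object> fields = record.getAll();
>   System.out.println(fields);
> }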
>
> On Mon, Jun 16, 2014 at 9:57 AM, Brian Jeltema <
> brian.jelt...@digitalenvoy.net> wrote:
>
>> regarding:
>>
>> 3. To read the HCat records....
>>
>> It depends on how you'd like to read the records... will you be reading
>> ALL the records remotely from the client app, or will you get input splits
>> and read the records on the mappers?
>>
>> The code will be different (somewhat)... let me know...
>>
>>
>> in this case I’d be reading all of the records remotely from the client
>> app
>>
>> TIA
>> Brian
>>
>> On Jun 13, 2014, at 9:51 AM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
>>
>> I am not sure about the Javadocs... ;-]
>> I have spent the last three years integrating with HCat, and to make it
>> work I had to go through the code...
>>
>> So here are some samples that may be helpful to start with. If you are
>> using Hive 0.12.0 I would not bother with the new APIs... I had to create
>> some shim classes for HCat to make my code version-independent, but I
>> cannot share that.
>>
>> So
>>
>> 1. To enumerate tables ... just use the Hive client ... this seems to be
>> version-independent
>>
>>   HiveMetaStoreClient hiveMetastoreClient = new HiveMetaStoreClient(conf);
>>   // the conf should contain the "hive.metastore.uris" property that points
>>   // to your Hive Metastore Thrift server
>>   List<String> databases = hiveMetastoreClient.getAllDatabases();
>>   // this will get you all the databases
>>   List<String> tables = hiveMetastoreClient.getAllTables(database);
>>   // this will get you all the tables for the given database
>>
>> 2. To get the table schema... I assume that you are after the HCat schema
>>
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
>> import org.apache.hcatalog.data.schema.HCatSchema;
>> import org.apache.hcatalog.data.schema.HCatSchemaUtils;
>> import org.apache.hcatalog.mapreduce.HCatInputFormat;
>> import org.apache.hcatalog.mapreduce.HCatSplit;
>> import org.apache.hcatalog.mapreduce.InputJobInfo;
>>
>>
>>   Job job = new Job(config);
>>   job.setJarByClass(XXXXXX.class); // this will be your class
>>   job.setInputFormatClass(HCatInputFormat.class);
>>   job.setOutputFormatClass(TextOutputFormat.class);
>>   InputJobInfo inputJobInfo = InputJobInfo.create("my_data_base",
>>       "my_table", "partition filter");
>>   HCatInputFormat.setInput(job, inputJobInfo);
>>   HCatSchema s = HCatInputFormat.getTableSchema(job);
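>>
>> Once you have the schema you can, for example, walk the columns. A small
>> sketch (it also needs import org.apache.hcatalog.data.schema.HCatFieldSchema):
>>
>>   for (HCatFieldSchema field : s.getFields()) {
>>     // print each column name and its HCat type
>>     System.out.println(field.getName() + " : " + field.getType());
>>   }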
>>
>>
>> 3. To read the HCat records....
>>
>> It depends on how you'd like to read the records... will you be reading
>> ALL the records remotely from the client app, or will you get input splits
>> and read the records on the mappers?
>>
>> The code will be different (somewhat)... let me know...
>>
>>
>> On Fri, Jun 13, 2014 at 8:25 AM, Brian Jeltema <
>> brian.jelt...@digitalenvoy.net> wrote:
>>
>>> Version 0.12.0.
>>>
>>> I’d like to obtain the table’s schema, scan a table partition, and use
>>> the schema to parse the rows.
>>>
>>> I can probably figure this out by looking at the HCatalog source. My
>>> concern was that the HCatalog packages in the Hive distributions are
>>> excluded from the JavaDoc, which implies that the API is not public. Is
>>> there a reason for this?
>>>
>>> Brian
>>>
>>> On Jun 13, 2014, at 9:10 AM, Dmitry Vasilenko <dvasi...@gmail.com>
>>> wrote:
>>>
>>> You should be able to access this information. The exact API depends on
>>> the version of Hive/HCat. As you know, the earlier HCat API is being
>>> deprecated and will be removed in Hive 0.14.0. I can provide you with a
>>> code sample if you tell me what you are trying to do and which version of
>>> Hive you are using.
>>>
>>>
>>> On Fri, Jun 13, 2014 at 7:33 AM, Brian Jeltema <
>>> brian.jelt...@digitalenvoy.net> wrote:
>>>
>>>> I’m experimenting with HCatalog, and would like to be able to access
>>>> tables and their schema from a Java application (not Hive/Pig/MapReduce).
>>>> However, the API seems to be hidden, which leads me to believe that this
>>>> is not a supported use case. Is HCatalog use limited to one of the
>>>> supported frameworks?
>>>>
>>>> TIA
>>>>
>>>> Brian
