Matt, I realized you meant ExtractText when I saw that SplitText doesn't allow you to change the split option.
SplitText does add a `text.line.count` attribute, but ExtractText doesn't have anything like that. Thoughts?

--Peter

-----Original Message-----
From: Matt Burgess [mailto:[email protected]]
Sent: Friday, September 23, 2016 8:02 AM
To: [email protected]
Subject: Re: PutHiveQL Multiple Ordered Statements

Yes, SplitText will write a "fragment.index" attribute (as well as other attributes about the split) you could use for priority, except you may need to reverse it (${fragment.count:minus(${fragment.index})} or something like that) for priority.

On Fri, Sep 23, 2016 at 9:46 AM, Peter Wicks (pwicks) <[email protected]> wrote:
> Matt,
>
> I put some thought into this option, but I was worried about guaranteed
> order of execution. So then I started looking at the prioritized queue.
> If I use a prioritized queue and a max batch size of 1 on PutHiveQL, I
> think I could get it to work; however, I am not really sure how to apply
> the correct priority attribute to the correct split. Does split already
> apply a split index? (I haven't checked.)
>
> Thanks,
> Peter
>
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]]
> Sent: Friday, September 23, 2016 6:34 AM
> To: [email protected]
> Subject: Re: PutHiveQL Multiple Ordered Statements
>
> Peter,
>
> Since each of your statements ends with a semicolon, I would think you
> could use SplitText with Enable Multiline Mode and a delimiter of ';'
> to get flowfiles containing a single statement apiece, then route those
> to a single PutHiveQL. Not sure what the exact regex would look like,
> but on its face it looks possible :)
>
> Regards,
> Matt
>
> On Fri, Sep 23, 2016 at 8:14 AM, Peter Wicks (pwicks) <[email protected]> wrote:
>> I have a PutHDFS processor drop a file; I then have a long chain of
>> ReplaceText -> PutHiveQL processors that run a series of steps.
>>
>> The below ~4 steps allow me to take the file generated by NiFi in one
>> format and move it into the final table, which is ORC with several
>> Timestamp columns (thus why I'm not using AvroToORC, since I'd lose my
>> Timestamps).
>>
>> The exact HQL, all in one block, is roughly:
>>
>> DROP TABLE `db.tbl_$(unknown)`;
>>
>> CREATE TABLE `db.tbl_$(unknown)`(
>>   -- some list of columns goes here that exactly matches the schema of
>>   -- `prod_db.tbl`
>> )
>> ROW FORMAT DELIMITED
>> FIELDS TERMINATED BY '\001'
>> STORED AS TEXTFILE;
>>
>> LOAD DATA INPATH '${absolute.hdfs.path}/$(unknown)' INTO TABLE `db.tbl_$(unknown)`;
>>
>> INSERT INTO `prod_db.tbl`
>> SELECT * FROM `db.tbl_$(unknown)`;
>>
>> DROP TABLE `db.tbl_$(unknown)`;
>>
>> Right now I'm having to split this into 5 separate ReplaceText steps,
>> each one followed by a PutHiveQL. Is there a way I can push a
>> multi-statement, order-dependent script like this to Hive in a simpler
>> way?
>>
>> Thanks,
>>
>> Peter
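[Editor's note] The split Matt suggests can be sketched outside NiFi. This is a plain-Python stand-in for SplitText with a ';' delimiter, not NiFi itself; the HQL mirrors the thread, with `$(unknown)` left exactly as Peter's placeholder and a single hypothetical `col1 TIMESTAMP` column standing in for the elided column list:

```python
# Stand-in for SplitText splitting a multi-statement HQL script on ';'
# so each statement could become its own flowfile for PutHiveQL.
# "$(unknown)" is Peter's placeholder, kept verbatim; the column list
# is a hypothetical single column, since the real one isn't in the thread.

hql = """
DROP TABLE `db.tbl_$(unknown)`;
CREATE TABLE `db.tbl_$(unknown)`(col1 TIMESTAMP)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\\001'
STORED AS TEXTFILE;
LOAD DATA INPATH '${absolute.hdfs.path}/$(unknown)' INTO TABLE `db.tbl_$(unknown)`;
INSERT INTO `prod_db.tbl`
SELECT * FROM `db.tbl_$(unknown)`;
DROP TABLE `db.tbl_$(unknown)`;
"""

# Naive split on ';'. A production flow would need a pattern that ignores
# semicolons inside quoted strings -- none appear in this script.
statements = [s.strip() for s in hql.split(";") if s.strip()]
```

Splitting this way yields the five statements Peter is currently modeling as five separate ReplaceText -> PutHiveQL pairs, which is why a single SplitText -> PutHiveQL chain could collapse the flow.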

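[Editor's note] The reversed-priority idea from Matt's reply can also be sketched in plain Python. This mimics the arithmetic of the NiFi EL expression `${fragment.count:minus(${fragment.index})}`; the assumption (hedged, since the thread doesn't spell it out) is a prioritizer that dequeues the lowest priority value first, so subtracting the index reverses the split order:

```python
# Sketch of the reversed-priority trick: SplitText writes fragment.index
# and fragment.count on each split, and a priority-attribute prioritizer
# is assumed here to pull the LOWEST value first. With
# priority = fragment.count - fragment.index, the LAST split dequeues
# first; to run splits first-to-last, fragment.index alone would do.

def priority_for(fragment_index: int, fragment_count: int) -> int:
    """Mimics ${fragment.count:minus(${fragment.index})} in NiFi EL."""
    return fragment_count - fragment_index

fragment_indexes = [1, 2, 3, 4, 5]   # five statements -> five splits
count = len(fragment_indexes)
priorities = [priority_for(i, count) for i in fragment_indexes]

# Lower value dequeues first under the assumed prioritizer.
order = [i for _, i in sorted(zip(priorities, fragment_indexes))]
```

With a max batch size of 1 on PutHiveQL, as Peter proposes, each statement would then execute before the next is pulled from the queue.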