Re: PutHiveQL Multiple Ordered Statements

Matt Burgess Fri, 23 Sep 2016 08:24:58 -0700

Good catch, I got the two mixed together in my mind :P

Is there any way to ensure each HiveQL statement is on a single line in the
source file? If not, you could use ExecuteScript with your own regex that
does the multi-line-until-semicolon split. You could add whatever attributes
you want in that case too.
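
Something like the following is a minimal sketch of just the splitting logic,
assuming plain Python and a script with no semicolons inside string literals
or comments. The function name and the "statement.index" / "statement.count"
attribute names are made up for illustration, and the ExecuteScript session
handling (reading the flowfile, emitting one flowfile per statement) is not
shown:

```python
import re


def split_hiveql(script):
    """Split a multi-line script into statements at each semicolon.

    Naive on purpose: a ';' inside a string literal or comment would
    also cause a split, so this only suits simple scripts.
    """
    # Splitting on ';' ignores line breaks, so a statement may span many lines.
    pieces = [p.strip() for p in re.split(r";", script)]
    # Drop empty pieces (blank lines, text after the last semicolon).
    statements = [p for p in pieces if p]
    count = len(statements)
    # Hypothetical attribute names; the index is 1-based and follows file order.
    # The terminating semicolon is not kept in the statement text.
    return [
        {
            "statement": s,
            "statement.index": str(i),
            "statement.count": str(count),
        }
        for i, s in enumerate(statements, start=1)
    ]
```

Each returned dict carries the statement text plus its position, so a
priority value could be derived from the index and count in whichever
direction the prioritizer needs.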


Regards,
Matt

On Fri, Sep 23, 2016 at 11:19 AM, Peter Wicks (pwicks)
<[email protected]> wrote:
> Matt,
>
> I realized you meant ExtractText when I saw that SplitText doesn't allow you 
> to change the split option.
>
> SplitText does add an attribute for `text.line.count`, but ExtractText 
> doesn't have anything like that.  Thoughts?
>
> --Peter
>
>
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]]
> Sent: Friday, September 23, 2016 8:02 AM
> To: [email protected]
> Subject: Re: PutHiveQL Multiple Ordered Statements
>
> Yes, SplitText will write a "fragment.index" attribute (as well as other
> attributes about the split) that you could use for priority, except you may
> need to reverse it (${fragment.count:minus(${fragment.index})} or something
> like that).
>
> On Fri, Sep 23, 2016 at 9:46 AM, Peter Wicks (pwicks) <[email protected]> 
> wrote:
>> Matt,
>>
>> I put some thought into this option, but I was worried about
>> guaranteed order of execution. So then I started looking at the
>> prioritized queue. If I use a prioritized queue and a max batch size
>> of 1 on PutHiveQL, I think I could get it to work; however, I am not
>> really sure how to apply the correct priority attribute to the correct
>> split.  Does SplitText already apply a split index? (I haven't checked)
>>
>> Thanks,
>>   Peter
>>
>> -----Original Message-----
>> From: Matt Burgess [mailto:[email protected]]
>> Sent: Friday, September 23, 2016 6:34 AM
>> To: [email protected]
>> Subject: Re: PutHiveQL Multiple Ordered Statements
>>
>> Peter,
>>
>> Since each of your statements ends with a semicolon, I would think you could 
>> use SplitText with Enable Multiline Mode and a delimiter of ';'
>> to get flowfiles containing a single statement apiece, then route
>> those to a single PutHiveQL. Not sure what the exact regex would look
>> like but on its face it looks possible :)
>>
>> Regards,
>> Matt
>>
>> On Fri, Sep 23, 2016 at 8:14 AM, Peter Wicks (pwicks) <[email protected]> 
>> wrote:
>>> I have a PutHDFS processor drop a file, and then a long chain of
>>> ReplaceText -> PutHiveQL processors that runs a series of steps.
>>>
>>> The ~4 steps below allow me to take the file generated by NiFi in one
>>> format and move it into the final table, which is ORC with several
>>> Timestamp columns (which is why I’m not using AvroToORC, since I’d lose
>>> my Timestamps).
>>>
>>>
>>>
>>> The exact HQL, all in one block, is roughly:
>>>
>>>
>>>
>>> DROP TABLE `db.tbl_${filename}`;
>>>
>>> CREATE TABLE `db.tbl_${filename}`(
>>>    Some list of columns goes here that exactly matches the schema of
>>>    `prod_db.tbl`
>>> )
>>> ROW FORMAT DELIMITED
>>> FIELDS TERMINATED BY '\001'
>>> STORED AS TEXTFILE;
>>>
>>> LOAD DATA INPATH '${absolute.hdfs.path}/${filename}' INTO TABLE `db.tbl_${filename}`;
>>>
>>> INSERT INTO `prod_db.tbl`
>>> SELECT * FROM `db.tbl_${filename}`;
>>>
>>> DROP TABLE `db.tbl_${filename}`;
>>>
>>>
>>>
>>> Right now I’m having to split this into 5 separate ReplaceText steps,
>>> each one followed by a PutHiveQL.  Is there a way I can push a
>>> multi-statement, order-dependent script like this to Hive in a simpler way?
>>>
>>>
>>>
>>> Thanks,
>>>
>>>   Peter
