Matt,

I wanted to circle back on this conversation.  What are your thoughts on a new 
property for this processor that would support this and do the split internally?
I feel like a lot of people (or at least me) are coming from a world where they 
have HQL files with many statements back to back and are used to being able to 
pass these into Beeline/HiveCLI and not think about it.

I could add a property that defaults to false, named something like "HQL 
May Contain Multiple Statements" (a little wordy). If a user set it to true, 
the processor would look for semicolons at the end of trimmed lines, split on 
them, and execute the statements in order.
I'm not sure about escaping, such as when a semicolon appears inside a string 
constant, but that could be worked through.
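As a rough sketch of what that split logic might look like (this is just an illustration, not NiFi code; it ignores HQL comments and escaped quotes), a character scan that tracks whether it is inside a string constant would handle the escaping concern:

```python
# Hypothetical sketch of the proposed split: break an HQL script on
# semicolons, but skip semicolons inside single- or double-quoted
# string constants. Ignores comments and escaped quotes for brevity.
def split_hql(script):
    statements = []
    current = []
    in_quote = None  # the active quote character, or None
    for ch in script:
        if in_quote:
            if ch == in_quote:
                in_quote = None
            current.append(ch)
        elif ch in ("'", '"'):
            in_quote = ch
            current.append(ch)
        elif ch == ';':
            stmt = ''.join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
    tail = ''.join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

For example, `split_hql("DROP TABLE t1; INSERT INTO t2 VALUES ('a;b');")` yields two statements, with the quoted `;` left intact in the second.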

Let me know your thoughts.

Thanks,
  Peter 



-----Original Message-----
From: Matt Burgess [mailto:[email protected]] 
Sent: Friday, September 23, 2016 9:25 AM
To: [email protected]
Subject: Re: PutHiveQL Multiple Ordered Statements

Good catch, I got the two mixed together in my mind :P

Any way to ensure each HiveQL statement is on a single line in the source file? 
If not, you could use ExecuteScript and have your own regex that does the 
multi-line-until-semicolon split. You could add whatever attributes you want in 
that case too.
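For reference, the multi-line-until-semicolon regex could look something like the following (a sketch only; the pattern and function name are assumptions, and the ExecuteScript flowfile handling is omitted):

```python
import re

# Hypothetical "multi-line-until-semicolon" split: a semicolon at the
# end of a line (optionally followed by trailing whitespace) terminates
# a statement, even when the statement spans several lines.
SPLIT_PATTERN = re.compile(r';\s*$', re.MULTILINE)

def split_statements(hql):
    # Split on line-ending semicolons and drop empty fragments.
    parts = SPLIT_PATTERN.split(hql)
    return [p.strip() for p in parts if p.strip()]
```

Note this simple pattern would still split on a quoted semicolon that happens to end a line; handling string constants would need the more careful scan discussed elsewhere in the thread.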

Regards,
Matt

On Fri, Sep 23, 2016 at 11:19 AM, Peter Wicks (pwicks) <[email protected]> 
wrote:
> Matt,
>
> I realized you meant ExtractText when I saw that SplitText doesn't allow you 
> to change the split option.
>
> SplitText does add an attribute for `text.line.count`, but ExtractText 
> doesn't have anything like that.  Thoughts?
>
> --Peter
>
>
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]]
> Sent: Friday, September 23, 2016 8:02 AM
> To: [email protected]
> Subject: Re: PutHiveQL Multiple Ordered Statements
>
> Yes, SplitText will write a "fragment.index" attribute (as well as other 
> attributes about the split) you could use for priority, except you may need 
> to reverse it (${fragment.count:minus(${fragment.index})} or something like 
> that) for priority.
>
> On Fri, Sep 23, 2016 at 9:46 AM, Peter Wicks (pwicks) <[email protected]> 
> wrote:
>> Matt,
>>
>> I put some thought into this option, but I was worried about 
>> guaranteeing order of execution. So then I started looking at the 
>> prioritized queue. If I use a prioritized queue and a max batch size 
>> of 1 on PutHiveQL, I think I could get it to work; however, I'm not 
>> really sure how to apply the correct priority attribute to the 
>> correct split. Does split already apply a split index? (I haven't 
>> checked.)
>>
>> Thanks,
>>   Peter
>>
>> -----Original Message-----
>> From: Matt Burgess [mailto:[email protected]]
>> Sent: Friday, September 23, 2016 6:34 AM
>> To: [email protected]
>> Subject: Re: PutHiveQL Multiple Ordered Statements
>>
>> Peter,
>>
>> Since each of your statements ends with a semicolon, I would think you could 
>> use SplitText with Enable Multiline Mode and a delimiter of ';'
>> to get flowfiles containing a single statement apiece, then route 
>> those to a single PutHiveQL. Not sure what the exact regex would look 
>> like but on its face it looks possible :)
>>
>> Regards,
>> Matt
>>
>> On Fri, Sep 23, 2016 at 8:14 AM, Peter Wicks (pwicks) <[email protected]> 
>> wrote:
>>> I have a PutHDFS processor drop a file, I then have a long chain of 
>>> ReplaceText -> PutHiveQL processors that runs a series of steps.
>>>
>>> The ~4 steps below allow me to take the file generated by NiFi in 
>>> one format and move it into the final table, which is ORC with 
>>> several Timestamp columns (thus why I’m not using AvroToORC, since I’d lose 
>>> my Timestamps).
>>>
>>>
>>>
>>> The exact HQL, all in one block, is roughly:
>>>
>>> DROP TABLE `db.tbl_$(unknown)`;
>>>
>>> CREATE TABLE `db.tbl_$(unknown)`(
>>>    Some list of columns goes here that exactly matches the schema of 
>>> `prod_db.tbl`
>>> )
>>> ROW FORMAT DELIMITED
>>> FIELDS TERMINATED BY '\001'
>>> STORED AS TEXTFILE;
>>>
>>> LOAD DATA INPATH '${absolute.hdfs.path}/$(unknown)' INTO TABLE 
>>> `db.tbl_$(unknown)`;
>>>
>>> INSERT INTO `prod_db.tbl`
>>> SELECT * FROM `db.tbl_$(unknown)`;
>>>
>>> DROP TABLE `db.tbl_$(unknown)`;
>>>
>>> Right now I’m having to split this into 5 separate ReplaceText 
>>> steps, each one followed by a PutHiveQL. Is there a way I can push 
>>> a multi-statement, order-dependent script like this to Hive in a 
>>> simpler way?
>>>
>>>
>>>
>>> Thanks,
>>>
>>>   Peter
