There is a setting in hive-site.xml that allows transform scripts to
continue even if they take a long time to return a single row.
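
If I remember right, the property is hive.script.auto.progress, but check
the name against your Hive version. It only keeps the task from being timed
out; it will not clean up a parser that hangs. For that you could put a time
limit in worker.sh itself. A rough sketch, assuming GNU coreutils `timeout`
is installed on the nodes. The `process_files` name and the 600-second limit
are made up for illustration; they are not anything Hive provides:

```bash
#!/bin/bash
# Hypothetical hardened worker.sh (a sketch, not a tested production script).
# Assumes GNU coreutils `timeout` is available on every task node.

# process_files LIMIT CMD [ARGS...]
# Reads one filename per line from stdin and runs
#   CMD ARGS... -i <filename> -o STDOUT
# under a time limit of LIMIT seconds per file.
process_files() {
    local limit rc filename
    limit=$1
    shift
    while IFS= read -r filename; do
        # Skip blank input lines.
        [ -n "$filename" ] || continue
        rc=0
        # TERM after $limit seconds, KILL 10 seconds later if still alive.
        # stdin is redirected so the parser cannot swallow the remaining
        # filenames meant for the loop.
        timeout -k 10 "$limit" "$@" -i "$filename" -o STDOUT < /dev/null || rc=$?
        if [ "$rc" -ne 0 ]; then
            # timeout exits with 124 when the limit was hit.
            echo "parser failed or timed out (exit $rc) on: $filename" >&2
            return "$rc"
        fi
    done
}

# In the real script the last line would be something like:
# process_files 600 python /mnt/node_scripts/parser.py
```

Because the function returns non-zero on a failure or timeout, the script
exits with an error, and Hive should then fail and retry the task instead of
leaving the parser running on the node.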

Edward

On Fri, Sep 21, 2012 at 8:55 AM, John Omernik <[email protected]> wrote:
> Greetings All -
>
> I have a transform script that does some awesome stuff (at least to my eyes)
>
> Basically, here is the SQL
>
>
>   SELECT TRANSFORM (filename)
>   USING 'worker.sh' as (col1, col2, col3, col4, col5)
>   FROM mysource_filetable
>
>
> worker.sh is actually a wrapper script that
>
> looks like this:
>
> #!/bin/bash
>
> while read -r line; do
>     filename=$line
>     python /mnt/node_scripts/parser.py -i "$filename" -o STDOUT
> done
>
> The reason for calling the python script from a bash script is so I
> can read off stdin, process the data, and then shoot it off to standard OUT.
> There are some other reasons... but it works great, most of the time.
>
> Sometimes, for whatever reason, we have a situation where the hive
> "listener" (I don't know what else to call it) gets bored listening for
> data. The python script can take a long time depending on the data being
> sent to it.  It gives up listening for STDOUT, the task times out, and the
> job retries that file somewhere else where it succeeds. No big deal.
> However, the python script and the java that's calling it seem to still be
> running, using up resources. If it doesn't exit cleanly, it kinda wigs out
> and goes on to TRANSFORM THE WORLD (said in a loud echoing booming voice).
> Anywho, just curious if there are ways I can monitor for that. Perhaps check
> for things in my worker.sh, maybe run python direct from hive? Settings in
> hive that will force kill the runaways?  Transform and its capabilities
> are AWESOME, but like much in hive, documentation is all over the place.
>
>
