There is a setting in hive-site.xml that lets transform scripts keep running even if they take a long time to return a single row.
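The setting is not named above. One likely candidate (an assumption on my part, not something confirmed in this thread) is `hive.script.auto.progress`, which tells Hive to report progress for a TRANSFORM script automatically so the task is not killed for inactivity. A minimal sketch that enables it for one session rather than cluster-wide; the file name `transform_query.hql` is hypothetical:

```shell
#!/bin/bash
# Write the query from the quoted message to a file, preceded by the setting.
# Assumption: hive.script.auto.progress is the setting being referred to.
cat > transform_query.hql <<'EOF'
SET hive.script.auto.progress=true;

SELECT TRANSFORM (filename)
USING 'worker.sh' AS (col1, col2, col3, col4, col5)
FROM mysource_filetable;
EOF

# Then run it with the Hive CLI:
#   hive -f transform_query.hql
```

The same property can be set in hive-site.xml to apply it to every session. Be cautious with it: once progress is reported automatically, a script that is genuinely stuck may no longer be killed by the task timeout.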
Edward

On Fri, Sep 21, 2012 at 8:55 AM, John Omernik <[email protected]> wrote:
> Greetings All -
>
> I have a transform script that does some awesome stuff (at least to my eyes)
>
> Basically, here is the SQL
>
>
> SELECT TRANSFORM (filename)
> USING 'worker.sh' as (col1, col2, col3, col4, col5)
> FROM mysource_filetable
>
>
> worker.sh is actually a wrapper script that looks like this:
>
> #!/bin/bash
>
> while read line; do
>     filename=$line
>     python /mnt/node_scripts/parser.py -i $filename -o STDOUT
> done
>
> The reason for calling the python script from a bash script is so I
> can read off stdin, process the data, and then shoot it off to standard OUT.
> There are some other reasons... but it works great, most of the time.
>
> Sometimes, for whatever reason, we have a situation where the hive
> "listener" (I don't know what else to call it) gets bored listening for
> data. The python script can take a long time depending on the data being
> sent to it. It gives up listening for STDOUT, the task times out, and the
> job retries that file somewhere else where it succeeds. No big deal.
> However, the python script and the java that's calling it seem to still be
> running, using up resources. If it doesn't exit cleanly, it kinda wigs out
> and goes on to TRANSFORM THE WORLD (said in a loud echoing booming voice).
> Anywho, just curious if there are ways I can monitor for that. Perhaps check
> for things in my worker.sh, maybe run python direct from hive? Settings in
> hive that will force kill the runaways? Transform, and its capabilities,
> are AWESOME, but like much in hive, documentation is all over the place.
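On the question of checking for runaways inside worker.sh: one option is to cap how long each parser call may run, so a stuck process cannot outlive the limit even after the task has been retried elsewhere. This is only a sketch, assuming GNU coreutils `timeout` is available on the nodes. The names `MAX_SECONDS`, `KILL_GRACE`, `PARSER_CMD` and `run_with_limit` are made up for illustration, and the 600-second default is a placeholder, not a recommendation.

```shell
#!/bin/bash
# Hypothetical limits; pick values that fit the longest legitimate parse.
MAX_SECONDS="${MAX_SECONDS:-600}"   # send TERM after this many seconds
KILL_GRACE="${KILL_GRACE:-10}"      # send KILL this long after TERM

# The parser command from the original worker.sh, kept as an array so that
# arguments survive word splitting.
PARSER_CMD=(python /mnt/node_scripts/parser.py)

# Run the parser on one file under a time limit.
# Returns the parser's exit status, 124 if timeout had to send TERM,
# or 137 if it had to fall back to KILL.
run_with_limit() {
  local filename=$1
  timeout -k "$KILL_GRACE" "$MAX_SECONDS" \
    "${PARSER_CMD[@]}" -i "$filename" -o STDOUT
  local rc=$?
  if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
    # stderr ends up in the task logs, which gives something to monitor.
    echo "parser exceeded ${MAX_SECONDS}s on $filename" >&2
  fi
  return "$rc"
}
```

In worker.sh the body of the `while read` loop would then become `run_with_limit "$filename"`. Whether a timed-out file should fail the whole task or just be logged and skipped is a design choice this sketch leaves open.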
