Hello, folks! I'm using a heavily customized HBaseStorage in my Pig script. During HBaseStorage.setLocation() I prepare a file with values that serve as the source for my filter. The filter is applied in HBaseStorage.getNext().
Since a Pig script is basically an MR job with many mappers, my values file must be accessible to all of my map tasks. DistributedCache is supposed to copy files across the cluster so that they are local to every map task.

I don't want to write my file to HDFS in the first place, because I see no way to clean it up after the MR job is done (maybe you can point me in the right direction). On the other hand, if I write the file to the local file system under "/tmp", I can either call deleteOnExit() or just forget about it, since Linux will take care of its local "/tmp".

But here is a small problem. DistributedCache copies files only when they are passed with a command-line parameter like "-files". In that case GenericOptionsParser handles the files and they get copied, but the DistributedCache API itself only lets you set parameters in the jobConf; it doesn't actually do any copying.

I've found that GenericOptionsParser sets the property "tmpfiles", which JobClient uses to copy the files before it runs the MR job. I've been able to set the same property in the jobConf from my HBaseStorage. It does the trick, but it's a hack. Is there a proper way to achieve this?

Thanks in advance.

--
Evgeny Morozov
Developer
Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
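
P.S. To make the workaround concrete, here is a minimal sketch of the idea rather than my actual code. It uses only the JDK: a plain Map stands in for the jobConf, and the class and method names (TmpFilesSketch, writeValuesFile, addTmpFile) are made up for illustration. It also assumes that "tmpfiles" holds a comma-separated list of file URIs; I haven't confirmed that this is a supported contract.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TmpFilesSketch {
    // Property that GenericOptionsParser is reported to set for "-files".
    static final String TMPFILES = "tmpfiles";

    // Writes the filter values to a local temp file, one value per line.
    // The file is scheduled for deletion when the JVM exits.
    static File writeValuesFile(List<String> values) throws IOException {
        File file = File.createTempFile("filter-values", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), values, StandardCharsets.UTF_8);
        return file;
    }

    // Appends the file's URI to the (assumed comma-separated) "tmpfiles" value,
    // keeping any entries that are already there.
    // The Map is a hypothetical stand-in for the real job configuration.
    static void addTmpFile(Map<String, String> conf, File file) {
        String uri = file.toURI().toString();
        String existing = conf.get(TMPFILES);
        if (existing == null || existing.isEmpty()) {
            conf.put(TMPFILES, uri);
        } else {
            conf.put(TMPFILES, existing + "," + uri);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> conf = new HashMap<>();
        File values = writeValuesFile(Arrays.asList("row-a", "row-b"));
        addTmpFile(conf, values);
        System.out.println(TMPFILES + "=" + conf.get(TMPFILES));
    }
}
```

In the real loader the same logic would run inside setLocation() against the job's configuration object instead of a Map.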
