Hi, I'm processing Squid log files with Pig, courtesy of MyRegexLoader. After a first processing step (whose results I save with PigStorage), there is still quite a lot of data processing left to do.
There's a catch, though: a superfluous copy operation.

Variant 1: Copy the original Squid logs manually to HDFS with "hdfs dfs -copyFromLocal", read them in Pig (distributed mode) from HDFS with MyRegexLoader, then store them in HDFS with PigStorage.

Variant 2: Read the original logs from the local filesystem in Pig (local mode) with MyRegexLoader, store them on the local filesystem with PigStorage, then copy the result to HDFS with "hdfs dfs -copyFromLocal".

Is there a way to have Pig read files from the local fs but store the result in HDFS? Given that reading files from the local fs can't be done in distributed mode, I'd be totally happy to have that operation run only on the local node, as long as the stored file is accessible via HDFS afterwards. I tried various ways of specifying file locations as hdfs:// and file://, but that didn't work out. AFAICS the documentation is pretty silent on this.

Any ideas or hints about what to do?

Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/
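For reference, here is roughly what I tried (paths, the namenode address, and the regex argument are placeholders, not my real setup):

```
-- run with: pig -x local squid.pig
-- load the raw logs from the local filesystem
logs = LOAD 'file:///var/log/squid/access.log'
       USING MyRegexLoader('...');   -- actual regex omitted

-- this is the part that fails for me: storing to an
-- hdfs:// URL while Pig is running in local mode
STORE logs INTO 'hdfs://namenode:8020/user/carl/squid-parsed'
      USING PigStorage('\t');
```

With -x local, Pig seems to resolve both URLs against the local filesystem, and with -x mapreduce the file:/// input is not usable, which is exactly the conflict I'm asking about.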