This is going to sound very odd, but I am hoping to use a transform script in such a way that I pass it a file path, and it reads that file and produces a bunch of rows in Hive. In this case the data is pcaps. I have a location accessible to all nodes, and I want my transform script to read in a file location and then emit, for example, the IP addresses that were seen in the packet capture (using a script I've already written). Can I load my file locations into a Hive table (one file per row), feed that table to a transform script, and have only one map task per source row? I don't want a single invocation of my script to parse several files, since that would make for poor parallelization, but I am having trouble forcing such a small record count per map task.
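To make the idea concrete, here is a minimal Python sketch of the kind of transform script described above. It assumes Hive streams each input row to the script as one tab-separated line on stdin and reads tab-separated rows back from stdout. The names `extract_ips` and `transform` are hypothetical; `extract_ips` is only a placeholder for the pcap parser that already exists, and the (path, ip) output layout is an assumption too.

```python
import sys


def extract_ips(path):
    """Hypothetical placeholder for the existing pcap parser.

    Assumed to open the capture at `path` (on storage every node can
    reach) and return the IP addresses seen in it.
    """
    return []


def transform(lines, out, extractor=extract_ips):
    """Read one file path per input line and emit one (path, ip) row per address."""
    for line in lines:
        # Take the first tab-separated column as the file path.
        path = line.rstrip("\n").split("\t")[0]
        if not path:
            continue  # skip blank lines
        for ip in extractor(path):
            out.write(f"{path}\t{ip}\n")


if __name__ == "__main__":
    transform(sys.stdin, sys.stdout)
```

This only covers the script side. It does not control how many rows each map task receives, which is the part I am stuck on.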
Thoughts?