Hi all,

I am a little stuck on something and after advice on the best path. I have a file containing a list of domain names, from which I can obtain a unique list of the domains I have. The harder part is that I now need to index that list, so that I end up with a list of (domain name, index) pairs. I have considered using a Python UDF, as I don't think this is possible in Pig directly, but how do I then pass in the whole list and get the whole list back with indexes?

Once I have the new list I will store it on S3 (the easy bit). But the next time I run the job I need to merge that list with the new domains, keeping the old indexes and giving any new records an index counting up from there. I expect I could join the old and new unique lists before passing them to Python, so that in Python I know I only need to add an index to the new records. I presume I can do the Python part by effectively just building an array; my real issue is how I make sure this part of the job is handled as a single reduce rather than a map, after which I would then run further map jobs.
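To illustrate, here is a minimal sketch of the kind of UDF I have in mind. Everything in it is an assumption of mine rather than working code: the file name index_domains.py, the function name assign_indexes, and the idea that the input is a single bag of (domain, old_index) tuples produced by an outer join of the stored list with this run's unique domains, with old_index null for domains that are new.

# index_domains.py -- a rough sketch, untested.
# Assumes Pig passes in one bag of (domain, old_index) tuples,
# with old_index null (None) for domains that are new this run.

try:
    # For CPython streaming UDFs, Pig ships outputSchema in pig_util;
    # under Jython, Pig supplies the decorator itself, so only define a
    # no-op fallback if nothing has provided one (e.g. local testing).
    from pig_util import outputSchema
except ImportError:
    if 'outputSchema' not in globals():
        def outputSchema(schema):
            def wrap(func):
                return func
            return wrap

@outputSchema("indexed:bag{t:tuple(domain:chararray, idx:long)}")
def assign_indexes(bag):
    # A Pig bag arrives as an iterable of tuples.
    pairs = [(t[0], t[1]) for t in bag]
    # Next free index: one past the highest index already assigned,
    # or 0 if this is the first run and nothing is indexed yet.
    existing = [idx for _, idx in pairs if idx is not None]
    next_idx = max(existing) + 1 if existing else 0
    out = []
    for domain, idx in pairs:
        if idx is None:  # new domain: hand out the next index
            idx = next_idx
            next_idx += 1
        out.append((domain, idx))
    return out

In the Pig script I would then register it with something like REGISTER 'index_domains.py' USING jython AS udfs; and, before calling udfs.assign_indexes, collapse the relation into one bag with a GROUP ... ALL. As far as I understand, GROUP ALL sends the whole relation to a single reducer, which I think is the "make it a reduce" behaviour I am after, but I would welcome confirmation.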
Thanks,
Mark

Mark Olliver
DevOps, InfectiousMedia