Hi All,

As we all know, Apache Pig is a data flow language. If I write a Pig script and Pig decides to split it into two or more MapReduce jobs to execute the task at hand, how does Pig store the data that it passes from job 1 to job 2?
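For concreteness, here is a minimal Pig Latin sketch of the kind of script I mean (file and field names are made up for illustration). A GROUP followed by an ORDER BY typically compiles into more than one MapReduce job, and EXPLAIN should show the job boundaries in the plan:

    -- hypothetical input: tab-separated (user, amount) records
    raw     = LOAD 'input/sales.txt' USING PigStorage('\t')
                  AS (user:chararray, amount:double);
    grouped = GROUP raw BY user;                          -- aggregation job
    totals  = FOREACH grouped GENERATE group AS user,
                  SUM(raw.amount) AS total;
    ranked  = ORDER totals BY total DESC;                 -- ORDER BY forces a separate job
    STORE ranked INTO 'output/totals';

    -- EXPLAIN ranked;   -- prints the logical/physical/MapReduce plans

My question is about whatever Pig does with the data that flows between those jobs.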
I read the Pig documentation, which says: "Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner." (url: http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)

So does Pig have a writer that stores the output of an intermediate job in memory/RAM for better performance (spilling to disk if required), and has Pig implemented a reader that reads that data directly from memory to pass it to the next job for processing? In plain MapReduce, we write the entire output to disk and then read it back for the next job to start. Does Pig have an upper hand here, by implementing readers and writers that write to RAM (spilling if required) and read from RAM (and disk if required) for better performance?

Kindly share your expertise/views on the highlighted comment from the Pig documentation: what does it actually mean, or is it stating something else?

Thanks in advance,
Cheers :)
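P.S. My tentative reading, which I would like confirmed or corrected: the quoted passage seems to be about bag memory management *within* a single job, while the data handed from one job to the next still goes through temporary files on the distributed filesystem. That reading is suggested by the tuning knobs in the same performance docs; a minimal pig.properties sketch (the values here are just illustrative, not recommendations):

    # fraction of the JVM heap available for in-memory bags before spilling
    pig.cachedbag.memusage=0.2

    # compress the temporary files Pig writes between chained MapReduce jobs
    pig.tmpfilecompression=true
    pig.tmpfilecompression.codec=gz

If inter-job data can be compressed via pig.tmpfilecompression, that would imply it is written to disk rather than kept in RAM, but I may be misreading the docs.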