Hi All,

As we all know, Apache Pig is a data flow language. If I write a Pig script and Pig decides to split it into two or more MapReduce jobs to execute the task at hand, how does Pig store the data that it passes from job 1 to job 2?
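For concreteness, here is a minimal Pig Latin sketch of the kind of script I mean (file and field names are made up for illustration). A GROUP followed by an ORDER BY typically compiles into more than one MapReduce job, and EXPLAIN should show the job boundaries in the plan:

    -- hypothetical input: tab-separated (user, amount) records
    raw     = LOAD 'input/sales.txt' USING PigStorage('\t')
                  AS (user:chararray, amount:double);
    grouped = GROUP raw BY user;                          -- aggregation job
    totals  = FOREACH grouped GENERATE group AS user,
                  SUM(raw.amount) AS total;
    ranked  = ORDER totals BY total DESC;                 -- ORDER BY forces a separate job
    STORE ranked INTO 'output/totals';

    -- EXPLAIN ranked;   -- prints the logical/physical/MapReduce plans

My question is about whatever Pig does with the data that flows between those jobs.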
I read the Pig documentation, which says: "Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner." (url: http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)

So does Pig have a writer that stores the output of an intermediate job in memory/RAM for better performance (spilling to disk if required), and has Pig implemented a reader that reads that data directly from memory to pass it to the next job for processing? In plain MapReduce, we write the entire output to disk and then read it back for the next job to start. Does Pig have an upper hand here, by implementing readers and writers that write to RAM (spilling if required) and read from RAM (and disk if required) for better performance?

Kindly share your expertise/views on the highlighted comment from the Pig documentation: what does it actually mean, or is it stating something else?

Thanks in advance,
Cheers :)
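P.S. My tentative reading, which I would like confirmed or corrected: the quoted passage seems to be about bag memory management *within* a single job, while the data handed from one job to the next still goes through temporary files on the distributed filesystem. That reading is suggested by the tuning knobs in the same performance docs; a minimal pig.properties sketch (the values here are just illustrative, not recommendations):

    # fraction of the JVM heap available for in-memory bags before spilling
    pig.cachedbag.memusage=0.2

    # compress the temporary files Pig writes between chained MapReduce jobs
    pig.tmpfilecompression=true
    pig.tmpfilecompression.codec=gz

If inter-job data can be compressed via pig.tmpfilecompression, that would imply it is written to disk rather than kept in RAM, but I may be misreading the docs.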