Re: Running multiple Pig jobs simultaneously on same data

勇胡 Wed, 15 Jun 2011 06:56:49 -0700

Jon,

If I want to modify data (insert or delete) in HDFS, how can I do it?
From the description, I cannot modify the data in place (update it), and I
cannot append new data to a file either! How does HDFS implement data
modification? I am a little confused.


Yong
On June 15, 2011 at 3:36 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> Yong,
>
> Currently, HDFS does not support appending to a file. So once a file is
> created, it literally cannot be changed (although it can be deleted, I
> suppose). This lets you avoid issues where I do a SELECT * on the entire
> database and the DBA can't update a row, or other things like that. There
> are some append patches in the works, but I am not sure how they handle the
> concurrency implications.
>
> Make sense?
> Jon
>
> 2011/6/15 勇胡 <yongyong...@gmail.com>
>
> > I read the link, and my impression is that HDFS is designed for
> > read-heavy workloads, not write-heavy ones ("A file once created,
> > written, and closed need not be changed.").
> >
> > From your description ("Immutable means that after creation it cannot be
> > modified."), if I understand correctly, you mean that HDFS cannot
> > implement "update" semantics the way a database does? A write
> > operation cannot be applied directly to a specific tuple or record? The
> > result of a write operation is just appended at the end of the file.
> >
> > Regards
> >
> > Yong
> >
> > 2011/6/15 Nathan Bijnens <nat...@nathan.gs>
> >
> > > Immutable means that after creation it cannot be modified.
> > >
> > > HDFS applications need a write-once-read-many access model for files. A
> > > file once created, written, and closed need not be changed. This
> > > assumption simplifies data coherency issues and enables high throughput
> > > data access. A MapReduce application or a web crawler application fits
> > > perfectly with this model. There is a plan to support appending-writes
> > > to files in the future.
> > >
> > > http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Simple+Coherency+Model
> > >
> > > Best regards,
> > >  Nathan
> > > ---
> > > nat...@nathan.gs : http://nathan.gs : http://twitter.com/nathan_gs
> > >
> > >
> > > On Wed, Jun 15, 2011 at 12:58 PM, 勇胡 <yongyong...@gmail.com> wrote:
> > >
> > > > How should I understand "immutable"? I mean, does HDFS implement a
> > > > locking mechanism to provide immutable data access when concurrent
> > > > tasks process the same set of data, or does it use some other
> > > > strategy to implement immutability?
> > > >
> > > > Thanks
> > > >
> > > > Yong
> > > >
> > > > 2011/6/14 Bill Graham <billgra...@gmail.com>
> > > >
> > > > > Yes, this is possible. Data in HDFS is immutable and MR tasks are
> > > > > spawned in their own VM, so multiple concurrent jobs acting on the
> > > > > same input data are fine.
> > > > >
> > > > > On Tue, Jun 14, 2011 at 11:18 AM, Pradipta Kumar Dutta <
> > > > > pradipta.du...@me.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We have a requirement to process the same set of data (in a
> > > > > > Hadoop cluster) by running multiple Pig jobs simultaneously.
> > > > > >
> > > > > > Any idea whether this is possible in Pig?
> > > > > >
> > > > > > Thanks,
> > > > > > Pradipta
> > > > > >
> > > > >
> > > >
> > >
> >
>
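
On the "how do I modify data" question: one common way to work with a write-once store is to never touch the existing file and instead write a new version containing the changed data. Below is a minimal Python sketch of that idea. It is only an illustration: the local filesystem stands in for HDFS, and the function name, file names, and one-record-per-line layout are all made up for the example. Nothing here is an HDFS or Pig API.

```python
import os
import tempfile


def write_new_version(src_path, dst_path, keep, extra_records=()):
    """Write a new file from an old one instead of editing the old one.

    Hypothetical helper: `keep` decides which existing records survive
    (a "delete"), and `extra_records` are added at the end (an "insert").
    The source file is only ever opened for reading.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            record = line.rstrip("\n")
            if keep(record):
                dst.write(record + "\n")
        for record in extra_records:
            dst.write(record + "\n")


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    v1 = os.path.join(workdir, "users_v1.txt")  # assumed file names
    v2 = os.path.join(workdir, "users_v2.txt")
    with open(v1, "w") as f:
        f.write("alice\nbob\ncarol\n")

    # "Delete" bob and "insert" dave by producing a second version.
    write_new_version(v1, v2, keep=lambda r: r != "bob", extra_records=["dave"])
    with open(v2) as f:
        print(f.read().split())
```

Jobs that were already reading the old version are unaffected, because that file never changes; later jobs are simply pointed at the new path. In Pig the same shape would roughly be a LOAD of the old data, a FILTER or UNION, and a STORE into a new output location.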

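On the original question of running several jobs over the same input: because the input is never modified, each job can open it independently and no locking is needed. The Python sketch below illustrates just that point. It is an assumption-laden toy: threads stand in for separate jobs (real MR tasks run in their own VMs), a local file stands in for HDFS, and the job functions and `run_jobs` are hypothetical names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    # First "job": count the records in the input.
    with open(path) as f:
        return sum(1 for _ in f)


def longest_line(path):
    # Second "job": find the longest record in the same input.
    with open(path) as f:
        return max((line.rstrip("\n") for line in f), key=len, default="")


def run_jobs(path, jobs):
    # Each job opens its own read-only handle; nothing is shared or locked.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(job, path) for job in jobs]
        return [future.result() for future in futures]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "input.txt")  # assumed file name
    with open(path, "w") as f:
        f.write("a\nbbb\ncc\n")
    print(run_jobs(path, [count_lines, longest_line]))
```

The jobs only interfere with each other if they try to write to the same output location, so each one should get its own.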