Hi Pedro,

Yes, Pig Latin is always compiled to MapReduce. Usually you don't have to specify the number of mappers (I am not sure whether you really can). If you have a 500MB file and it is splittable, then the number of mappers automatically equals 500MB / 64MB (the block size), which is around 8. Here I assume you have the default block size of 64MB. If your file is not splittable, then the whole file will go to a single mapper :(
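To make the arithmetic concrete, here is a minimal sketch of a map-only job over such a file; 'myfile.txt' and 'out' are just placeholders, not paths from your script:

A = LOAD 'myfile.txt' USING PigStorage() AS (t);   -- 500MB splittable text file
STORE A INTO 'out' USING PigStorage();
-- With the default 64MB block size: 500MB / 64MB -> 8 blocks -> ~8 input
-- splits -> ~8 map tasks. Because the job is map-only, 'out' should also
-- end up with roughly 8 part-m-* files, one per mapper.
-- A gzipped input, by contrast, is not splittable: one split, one mapper,
-- one output file.

Note that newer Pig versions can also combine small input splits (pig.splitCombination / pig.maxCombinedSplitSize), so the exact number of map tasks can differ a little; check the docs for your version.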
Let me know if you have further questions. Thanks.

On Mon, Jun 10, 2013 at 1:36 PM, Pedro Sá da Costa <[email protected]> wrote:

> Yes, I understand the previous answers now. The reason for my question is
> that I was trying to "split" a file with Pig Latin by loading the file and
> writing portions of the file again to HDFS. From both replies, it seems
> that Pig Latin uses MapReduce to compute the scripts, correct?
>
> And in MapReduce, if I have one 500MB file and I run an example with 10
> maps (forgetting the reducers for now), does that mean each map will read
> more or less 50MB?
>
> On 10 June 2013 11:21, Bertrand Dechoux <[email protected]> wrote:
>
> > I wasn't clear. Specifying the size of the files is not your real aim, I
> > guess. But you think that's what is needed in order to solve your
> > problem, which we don't know about. 500MB is not a really big file in
> > itself and is not an issue for HDFS and MapReduce.
> >
> > There is no absolute way to know how much data a reducer will produce
> > given its input, because it depends on the implementation. In order to
> > have a simple life cycle, each Reducer will write its own file. So if
> > you want to have smaller files, you will need to increase the number of
> > Reducers. (Same size + more files -> smaller files) However, there is no
> > way to have files with an exact size. One obvious reason is that you
> > would need to break a record (key/value) across two files, and there is
> > no reconciliation strategy for that. It does happen between blocks of a
> > file, but blocks of a file are ordered, so the RecordReader knows how to
> > deal with it.
> >
> > Bertrand
> >
> > On Mon, Jun 10, 2013 at 8:58 AM, Johnny Zhang <[email protected]> wrote:
> >
> > > Hi, Pedro:
> > > Basically, how many files the result is split into depends on how many
> > > reducers you have in your Pig job. So if the total result data size is
> > > 100MB and you have 10 reducers, you will get 10 files, each of 10MB.
> > > Bertrand's pointer is about specifying the number of reducers for your
> > > Pig job.
> > >
> > > Johnny
> > >
> > > On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <[email protected]> wrote:
> > >
> > > > I don't understand why my purpose is not clear. The previous e-mails
> > > > explain it very clearly. I want to split a single 500MB txt file in
> > > > HDFS into multiple files using Pig Latin. Is it possible? E.g.,
> > > >
> > > > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > > > STORE A INTO 'multiplefiles' USING PigStorage(); -- and here it
> > > > creates multiple files with a specific size
> > > >
> > > > On 10 June 2013 07:29, Bertrand Dechoux <[email protected]> wrote:
> > > >
> > > > > The purpose is not really clear. But if you are looking for how to
> > > > > specify multiple Reducer tasks, it is well explained in the
> > > > > documentation.
> > > > > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> > > > >
> > > > > You will get one file per reducer. It is up to you to specify the
> > > > > right number, but be careful not to fall into the small files
> > > > > problem in the end.
> > > > > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> > > > >
> > > > > If you have specific questions on HDFS itself or Pig optimisation,
> > > > > you should provide more explanation.
> > > > > (64MB is the default block size for HDFS.)
> > > > >
> > > > > Regards
> > > > >
> > > > > Bertrand
> > > > >
> > > > > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa <[email protected]> wrote:
> > > > >
> > > > > > I said 64MB, but it could be 128MB or 5KB. The number doesn't
> > > > > > matter. I just want to extract the data and put it into several
> > > > > > files of a specific size. Basically, I am doing a cat on a big
> > > > > > txt file, and I want to split the content into multiple files of
> > > > > > a fixed size.
> > > > > >
> > > > > > On 7 June 2013 10:14, Johnny Zhang <[email protected]> wrote:
> > > > > >
> > > > > > > Pedro, you can try Piggybank MultiStorage, which splits
> > > > > > > results into different dirs/files by a specific index
> > > > > > > attribute. But I am not sure how it can make sure the file
> > > > > > > size is 64MB. Why 64MB specifically? What's the connection
> > > > > > > between your data and 64MB?
> > > > > > >
> > > > > > > Johnny
> > > > > > >
> > > > > > > On Fri, Jun 7, 2013 at 12:56 AM, Pedro Sá da Costa <[email protected]> wrote:
> > > > > > >
> > > > > > > > I am using the instruction:
> > > > > > > >
> > > > > > > > store A into 'result-australia-0' using PigStorage('\t');
> > > > > > > >
> > > > > > > > to store the data in HDFS. But the problem is that this
> > > > > > > > creates one file of 500MB. Instead, I want to save several
> > > > > > > > 64MB files. How do I do this?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
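For the original goal (turning the single 500MB output file into several smaller ones), here is a minimal sketch of the reducer-side approach Bertrand and Johnny describe above: force a reduce phase and raise its parallelism, so that each reducer writes its own part file. The GROUP key, the value 10, and the paths are only placeholders, and as noted above the resulting file sizes will only ever be approximate:

SET default_parallel 10;            -- job-wide number of reducers

A = LOAD 'myfile.txt' USING PigStorage() AS (t:chararray);
B = GROUP A BY t PARALLEL 10;       -- GROUP introduces a reduce phase; PARALLEL overrides per operator
C = FOREACH B GENERATE FLATTEN(A);  -- flatten back to the original records
STORE C INTO 'multiplefiles' USING PigStorage();

'multiplefiles' should then contain roughly 10 part-r-* files, whose sizes depend on how the group keys hash across the reducers. If you instead want the output partitioned by the value of a field rather than by size, Piggybank's MultiStorage, which Johnny mentioned earlier in the thread, is the option to look at.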
