I wasn't clear. Specifying the size of the files is probably not your real
aim; rather, you think it is what is needed to solve a problem we don't know
about. 500MB is not a really big file in itself, and it is not an issue for
HDFS or MapReduce.

There is no absolute way to know how much data a reducer will produce given
its input, because it depends on the implementation. To keep the life cycle
simple, each Reducer writes its own file. So if you want smaller files, you
will need to increase the number of Reducers (same total size + more files ->
smaller files). However, there is no way to get files with an exact size. One
obvious reason is that you would need to break a record (key/value) across
two files, and there is no reconciliation strategy for that. It does happen
between blocks of a file, but the blocks of a file are ordered, so the
RecordReader knows how to deal with it.
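
For example, something like this (a rough, untested sketch; the grouping key
and the value 10 are arbitrary) forces a reduce phase with ten reducers, and
therefore produces roughly ten output files:

A = LOAD 'myfile.txt' USING PigStorage() AS (t);
B = GROUP A BY t PARALLEL 10;      -- PARALLEL sets the number of reducers
C = FOREACH B GENERATE FLATTEN(A); -- restore the original records
STORE C INTO 'multiplefiles' USING PigStorage();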

Bertrand

On Mon, Jun 10, 2013 at 8:58 AM, Johnny Zhang <[email protected]> wrote:

> Hi, Pedro:
> Basically, how many output files you get depends on how many reducers you
> have in your Pig job. So if the total result data size is 100MB and you have
> 10 reducers, you will get 10 files of roughly 10MB each. Bertrand's pointer
> is about specifying the number of reducers for your Pig job.
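>
> For example (untested; 10 is just an example value), you can set the number
> of reducers for the whole script with:
>
> SET default_parallel 10;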
>
> Johnny
>
>
> > On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <[email protected]>
> > wrote:
>
> > I don't understand why my purpose is not clear. The previous e-mails
> > explain it very clearly. I want to split a single 500MB txt file in HDFS
> > into multiple files using Pig Latin. Is it possible? E.g.,
> >
> > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > STORE A INTO 'multiplefiles' USING PigStorage(); -- and here create
> > multiple files with a specific size
> >
> >
> >
> >
> > On 10 June 2013 07:29, Bertrand Dechoux <[email protected]> wrote:
> >
> > > The purpose is not really clear. But if you are looking for how to
> > > specify multiple Reducer tasks, it is well explained in the
> > > documentation.
> > > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> > >
> > > You will get one file per reducer. It is up to you to specify the right
> > > number, but be careful not to fall into the small files problem in the
> > > end.
> > > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> > >
> > > If you have a specific question on HDFS itself or Pig optimisation, you
> > > should provide more explanation.
> > > (64MB is the default block size for HDFS)
> > >
> > > Regards
> > >
> > > Bertrand
> > >
> > >
> > > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa <[email protected]>
> > > wrote:
> > >
> > > > I said 64MB, but it could be 128MB or 5KB; the exact number doesn't
> > > > matter. I just want to extract the data and put it into several files
> > > > of a specific size. Basically, I am doing a cat on a big txt file, and
> > > > I want to split the content into multiple files with a fixed size.
> > > >
> > > >
> > > > On 7 June 2013 10:14, Johnny Zhang <[email protected]> wrote:
> > > >
> > > > > Pedro, you can try Piggybank MultiStorage, which splits results into
> > > > > different dirs/files by a specific index attribute. But I am not sure
> > > > > how it can guarantee that the file size is 64MB. Why 64MB
> > > > > specifically? What's the connection between your data and 64MB?
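> > > > >
> > > > > A rough sketch of the MultiStorage usage I mean (not tested; the jar
> > > > > path, field names, and output directory are just examples):
> > > > >
> > > > > REGISTER /path/to/piggybank.jar;  -- path is just an example
> > > > > A = LOAD 'myfile.txt' USING PigStorage('\t') AS (country, t);
> > > > > -- writes one sub-directory under 'out' per distinct value of field 0
> > > > > STORE A INTO 'out' USING
> > > > > org.apache.pig.piggybank.storage.MultiStorage('out', '0');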
> > > > >
> > > > > Johnny
> > > > >
> > > > >
> > > > > On Fri, Jun 7, 2013 at 12:56 AM, Pedro Sá da Costa
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > I am using the instruction:
> > > > > >
> > > > > > store A into 'result-australia-0' using PigStorage('\t');
> > > > > >
> > > > > > to store the data in HDFS. But the problem is that this creates one
> > > > > > 500MB file. Instead, I want to save several 64MB files. How do I do
> > > > > > this?
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > >
> > >
> > >
> > > --
> > > Bertrand Dechoux
> > >
> >
> >
> >
> > --
> > Best regards,
> >
>



-- 
Bertrand Dechoux
