Hi Pedro,

Yes, Pig Latin is always compiled to MapReduce. Usually you don't have to specify the number of mappers (I am not sure whether you really can). If you have a 500MB file and it is splittable, then the number of mappers automatically equals 500MB / 64MB (the block size), which is around 8. Here I assume you have the default block size of 64MB. If your file is not splittable, then the whole file will go to a single mapper :(
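To make the arithmetic concrete, here is a minimal sketch of a map-only job over such a file; 'myfile.txt' and 'out' are just placeholders, not paths from your script:

A = LOAD 'myfile.txt' USING PigStorage() AS (t);   -- 500MB splittable text file
STORE A INTO 'out' USING PigStorage();
-- With the default 64MB block size: 500MB / 64MB -> 8 blocks -> ~8 input
-- splits -> ~8 map tasks. Because the job is map-only, 'out' should also
-- end up with roughly 8 part-m-* files, one per mapper.
-- A gzipped input, by contrast, is not splittable: one split, one mapper,
-- one output file.

Note that newer Pig versions can also combine small input splits (pig.splitCombination / pig.maxCombinedSplitSize), so the exact number of map tasks can differ a little; check the docs for your version.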
Let me know if you have further questions. Thanks.

On Mon, Jun 10, 2013 at 1:36 PM, Pedro Sá da Costa <[email protected]> wrote:

> Yes, I understand the previous answers now. The reason for my question is
> that I was trying to "split" a file with Pig Latin by loading the file and
> writing portions of the file again to HDFS. From both replies, it seems
> that Pig Latin uses MapReduce to compute the scripts, correct?
>
> And in MapReduce, if I have one 500MB file and I run an example with 10
> maps (forgetting the reducers for now), does that mean each map will read
> more or less 50MB?
>
> On 10 June 2013 11:21, Bertrand Dechoux <[email protected]> wrote:
>
> > I wasn't clear. Specifying the size of the files is not your real aim, I
> > guess. But you think that's what is needed in order to solve your
> > problem, which we don't know about. 500MB is not a really big file in
> > itself and is not an issue for HDFS and MapReduce.
> >
> > There is no absolute way to know how much data a reducer will produce
> > given its input, because it depends on the implementation. In order to
> > have a simple life cycle, each Reducer will write its own file. So if
> > you want to have smaller files, you will need to increase the number of
> > Reducers. (Same size + more files -> smaller files) However, there is no
> > way to have files with an exact size. One obvious reason is that you
> > would need to break a record (key/value) across two files, and there is
> > no reconciliation strategy for that. It does happen between blocks of a
> > file, but blocks of a file are ordered, so the RecordReader knows how to
> > deal with it.
> >
> > Bertrand
> >
> > On Mon, Jun 10, 2013 at 8:58 AM, Johnny Zhang <[email protected]> wrote:
> >
> > > Hi, Pedro:
> > > Basically, how many files the result is split into depends on how many
> > > reducers you have in your Pig job. So if the total result data size is
> > > 100MB and you have 10 reducers, you will get 10 files, each of 10MB.
> > > Bertrand's pointer is about specifying the number of reducers for your
> > > Pig job.
> > >
> > > Johnny
> > >
> > > On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <[email protected]> wrote:
> > >
> > > > I don't understand why my purpose is not clear. The previous e-mails
> > > > explain it very clearly. I want to split a single 500MB txt file in
> > > > HDFS into multiple files using Pig Latin. Is it possible? E.g.,
> > > >
> > > > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > > > STORE A INTO 'multiplefiles' USING PigStorage(); -- and here it
> > > > creates multiple files with a specific size
> > > >
> > > > On 10 June 2013 07:29, Bertrand Dechoux <[email protected]> wrote:
> > > >
> > > > > The purpose is not really clear. But if you are looking for how to
> > > > > specify multiple Reducer tasks, it is well explained in the
> > > > > documentation.
> > > > > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> > > > >
> > > > > You will get one file per reducer. It is up to you to specify the
> > > > > right number, but be careful not to fall into the small files
> > > > > problem in the end.
> > > > > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> > > > >
> > > > > If you have specific questions on HDFS itself or Pig optimisation,
> > > > > you should provide more explanation.
> > > > > (64MB is the default block size for HDFS.)
> > > > >
> > > > > Regards
> > > > >
> > > > > Bertrand
> > > > >
> > > > > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa <[email protected]> wrote:
> > > > >
> > > > > > I said 64MB, but it could be 128MB or 5KB. The number doesn't
> > > > > > matter. I just want to extract the data and put it into several
> > > > > > files of a specific size. Basically, I am doing a cat on a big
> > > > > > txt file, and I want to split the content into multiple files of
> > > > > > a fixed size.
> > > > > >
> > > > > > On 7 June 2013 10:14, Johnny Zhang <[email protected]> wrote:
> > > > > >
> > > > > > > Pedro, you can try Piggybank MultiStorage, which splits
> > > > > > > results into different dirs/files by a specific index
> > > > > > > attribute. But I am not sure how it can make sure the file
> > > > > > > size is 64MB. Why 64MB specifically? What's the connection
> > > > > > > between your data and 64MB?
> > > > > > >
> > > > > > > Johnny
> > > > > > >
> > > > > > > On Fri, Jun 7, 2013 at 12:56 AM, Pedro Sá da Costa <[email protected]> wrote:
> > > > > > >
> > > > > > > > I am using the instruction:
> > > > > > > >
> > > > > > > > store A into 'result-australia-0' using PigStorage('\t');
> > > > > > > >
> > > > > > > > to store the data in HDFS. But the problem is that this
> > > > > > > > creates one file of 500MB. Instead, I want to save several
> > > > > > > > 64MB files. How do I do this?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
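For the original goal (turning the single 500MB output file into several smaller ones), here is a minimal sketch of the reducer-side approach Bertrand and Johnny describe above: force a reduce phase and raise its parallelism, so that each reducer writes its own part file. The GROUP key, the value 10, and the paths are only placeholders, and as noted above the resulting file sizes will only ever be approximate:

SET default_parallel 10;            -- job-wide number of reducers

A = LOAD 'myfile.txt' USING PigStorage() AS (t:chararray);
B = GROUP A BY t PARALLEL 10;       -- GROUP introduces a reduce phase; PARALLEL overrides per operator
C = FOREACH B GENERATE FLATTEN(A);  -- flatten back to the original records
STORE C INTO 'multiplefiles' USING PigStorage();

'multiplefiles' should then contain roughly 10 part-r-* files, whose sizes depend on how the group keys hash across the reducers. If you instead want the output partitioned by the value of a field rather than by size, Piggybank's MultiStorage, which Johnny mentioned earlier in the thread, is the option to look at.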
