Hi,

The log says your command returned a non-zero exit code. Does it also return a non-zero code when you invoke it manually, outside of Pig? Otherwise I'm afraid I don't have any other ideas.
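For example, you could run the exact same pipeline in a shell and check its exit status:

  cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}' > /dev/null
  echo $?

One thing worth noting: exit code 130 normally means the process was interrupted (128 + SIGINT), and it matches the ^C at the start of your ERROR line, so that particular code may just be the result of you killing the hung command rather than awk itself failing.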
Thanks

On Mon, Sep 30, 2013 at 10:37 AM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:

> Hello again,
>
> Any comments on this?
>
> Thanks,
> Anastasis
>
> On 27 Sep 2013, at 5:36 PM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:
>
> > Hello,
> >
> > I am working on a very small project for my university and I have a
> > small cluster with 2 worker nodes and 1 master node. I'm using Pig to
> > do some calculations and I have a question regarding small files.
> >
> > I have a UDF that reads a small input (around 200 KB) and correlates
> > the data from HDFS. My first approach was to upload the small file to
> > HDFS and later, using getCacheFiles(), access it in my UDF.
> >
> > Afterwards, though, I needed to change things in this small file,
> > which meant deleting the file on HDFS, re-uploading it, and re-running
> > Pig. Since in the end I need to change this small file frequently, I
> > wanted to bypass HDFS (because all those read + write + read cycles in
> > Pig are very slow over multiple iterations of my script), so what I
> > did was:
> >
> > === pig script ===
> > %declare MYFILE `cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'`
> >
> > .... MyUDF( line, '$MYFILE') .....
> >
> > In the beginning it worked great, but later (when my file grew larger
> > than 100 KB) Pig got stuck and I had to kill it:
> >
> > 2013-09-27 16:14:47,722 [main] INFO org.apache.pig.tools.parameters.PreprocessorContext - Executing command : cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
> > ^C2013-09-27 16:15:28,102 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Error executing shell command: cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'. Command exit with exit code of 130
> >
> > (BTW, is this a bug or something? Should it hang like that?)
> >
> > How can I manage small files in cases like this, so that I don't need
> > to re-upload everything to HDFS every time and my iterations get
> > faster?
> >
> > Thanks,
> > Anastasis
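P.S. For the archives: if you go back to the distributed-cache approach, a minimal sketch of what the UDF side can look like is below. The class name matches your MyUDF, but the HDFS path and the toy matching logic are placeholders for your real correlation code:

=== MyUDF.java ===
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {

    private List<String> smallFile;   // lazily loaded copy of the cached file

    // Ask Pig to ship this HDFS file to every task via the distributed
    // cache; at runtime it is visible under the symlink name after '#'.
    @Override
    public List<String> getCacheFiles() {
        return Arrays.asList("/user/me/myfile.txt#myfile");   // placeholder path
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        if (smallFile == null) {
            // Read the local cached copy once, via the symlink name.
            smallFile = new ArrayList<String>();
            BufferedReader r = new BufferedReader(new FileReader("./myfile"));
            String line;
            while ((line = r.readLine()) != null) {
                smallFile.add(line);
            }
            r.close();
        }
        String needle = (String) input.get(0);
        // Toy "correlation": return the first cached line containing the input.
        for (String cached : smallFile) {
            if (cached.contains(needle)) {
                return cached;
            }
        }
        return null;
    }
}

With this in place, updating the data only means replacing the one small file on HDFS before re-running the script; Pig ships it to the tasks for you.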