Hadoop Streaming in combination with HDFS

rpereira Sat, 23 Apr 2016 03:36:07 -0700

Hi

I have a text file that I'm processing through Hadoop Streaming.
I placed the file on HDFS.

My data transformation process is a set of awk and sed commands that creates a table structure. I can choose the number of mappers. When I use one mapper, the data is correct.

When I choose more than one mapper, the data is split up.
The splitting is done at end of line (EOL).
I would like it to be split just before the text markers.

The text blocks must not be split up, as that would mean loss of information.

I would also like to be able to use more than one mapper.


Example:

============================================
Current situation :
        text mark 1
           some data
                  ...
                         some data
        text mark 2
                  some data
        ----------------split-----------------
                  ...
           some data
        text mark 3
                  some data
                  ...
           some data

============================================
Correct situation :
        text mark 1
           some data
                  ...
                         some data
        ----------------split here -------------
        text mark 2
                  some data
                  ...
           some data
        ----------------or split here ----------
        text mark 3
                  some data
                  ...
           some data
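
To make the desired boundary rule concrete, here is a minimal sketch in Python. It is not Hadoop code: the function name read_blocks, the marker prefix "text mark" and the byte offsets are all assumptions made for illustration. The idea is that a block belongs to the split in which its marker line starts, and the reader of that split reads past its end offset until the next marker, so no block is ever cut in two. In Hadoop itself, logic like this would presumably have to live in a custom input format's record reader rather than in the awk/sed mapper.

```python
import io

MARKER = b"text mark"  # hypothetical marker prefix, taken from the example above


def read_blocks(stream, start, end, marker=MARKER):
    """Yield complete blocks whose marker line starts in the byte range [start, end)."""
    if start > 0:
        # Step back one byte and discard up to the next newline, so that we
        # only ever look at lines that begin at or after `start`.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    pos = stream.tell()
    block = None  # lines of the block currently being collected
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        line_start = pos
        pos += len(line)
        if line.lstrip().startswith(marker):
            if block is not None:
                yield b"".join(block)  # previous block is now complete
            if line_start >= end:
                return  # this marker belongs to the next split
            block = [line]
        elif block is not None:
            block.append(line)
        # Lines seen before the first marker belong to a block owned by the
        # previous split (or precede any marker) and are skipped here.
    if block is not None:
        yield b"".join(block)


data = b"text mark 1\n a\n b\ntext mark 2\n c\ntext mark 3\n d\n e\n"
cut = 22  # an arbitrary split point that falls inside block 2
first = list(read_blocks(io.BytesIO(data), 0, cut))
second = list(read_blocks(io.BytesIO(data), cut, len(data)))
```

With this rule, wherever the cut falls, the blocks returned for the two ranges together are the same as the blocks of the whole file, each one intact.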

I would prefer not to preprocess the file before placing it on HDFS to solve this issue. I want to work directly from the HDFS filesystem and stay flexible in the number of mapper processes applied.

Is there any way to have the splitting done outside the text blocks, keeping the text blocks complete?


Kind Regards
Rene
