Hi

I have a text file that I'm processing with Hadoop streaming.
I placed the file on HDFS.
My data transformation is a set of awk and sed commands that creates a table structure. I can choose the number of mappers. When I use one mapper, the output is correct.
When I choose more than one mapper, the input is split across them.
The splits are made at end-of-line boundaries.
I would like the splits to fall just before the text markers instead.
The text blocks must not be split, because splitting them means loss of information.
And I would still like to be able to use more than one mapper.

Example:

============================================
Current situation :
        text mark 1
           some data
                  ...
                         some data
        text mark 2
                  some data
        ----------------split-----------------
                  ...
           some data
        text mark 3
                  some data
                  ...
           some data

============================================
Correct situation :
        text mark 1
           some data
                  ...
                         some data
        ----------------split here -------------
        text mark 2
                  some data
                  ...
           some data
        ----------------or split here ----------
        text mark 3
                  some data
                  ...
           some data
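
To make the grouping I'm after concrete, here is a minimal sketch in plain Python, run outside Hadoop. It only illustrates where the boundaries should fall; it is not a Hadoop input format. The function name split_into_blocks and the assumption that every marker line starts with "text mark" are made up for this example.

```python
from typing import Iterable, Iterator, List

# Assumption for illustration: every block begins with a line whose
# stripped text starts with this prefix.
MARKER_PREFIX = "text mark"


def split_into_blocks(lines: Iterable[str],
                      marker_prefix: str = MARKER_PREFIX) -> Iterator[List[str]]:
    """Group lines into blocks, starting a new block at each marker line."""
    block: List[str] = []
    for line in lines:
        # A marker line closes the block collected so far (if any).
        if line.strip().startswith(marker_prefix) and block:
            yield block
            block = []
        block.append(line)
    # Emit the trailing block after the last marker.
    if block:
        yield block


sample = [
    "text mark 1",
    "   some data",
    "      some data",
    "text mark 2",
    "      some data",
    "   some data",
    "text mark 3",
    "      some data",
]

blocks = list(split_into_blocks(sample))
for number, block in enumerate(blocks, start=1):
    print(f"block {number}: {len(block)} lines, starts with {block[0].strip()!r}")
```

Each block here is what I would want a single mapper to receive in one piece.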


I would prefer not to preprocess the file before placing it on HDFS to solve this. I want to work from the file as it sits on HDFS and stay flexible in the number of mappers applied.

Is there any way to make the splits fall outside the text blocks, so that each text block stays complete?

Kind Regards
Rene
