Have a look at sparkContext.binaryFiles; it works like wholeTextFiles but
returns a PortableDataStream per file. It might be a workable solution, though
you'll need to handle the binary to UTF-8 (or equivalent) conversion.
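Something along these lines might do it (an untested sketch, assuming UTF-8
text and that each file fits in executor memory, since toArray() materializes
the whole file; ctx is the JavaSparkContext from your snippet):

    import java.nio.charset.StandardCharsets;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.input.PortableDataStream;
    import scala.Tuple2;

    // One (path, stream) pair per file; each file lands in a single partition
    JavaPairRDD<String, PortableDataStream> files = ctx.binaryFiles(inputPath);
    JavaRDD<String> contents = files.map(
        new Function<Tuple2<String, PortableDataStream>, String>() {
            @Override
            public String call(Tuple2<String, PortableDataStream> file) throws Exception {
                // toArray() reads the entire file into memory as a byte[]
                return new String(file._2().toArray(), StandardCharsets.UTF_8);
            }
        });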

Thanks,
Ewan

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: 03 September 2015 15:22
To: user@spark.apache.org
Subject: How to Take the whole file as a partition

Hi All,

I have 1000 files, ranging from 500 MB to 1-2 GB at the moment, and I want
Spark to read each of them as one partition at the file level, which means I
want the FileSplit turned off.

I know there are some solutions, but none of them works well in my case:
1. I can't use the wholeTextFiles method, because my files are too big and I
don't want to risk the performance.
2. I tried to use newAPIHadoopFile and turn off the file split:

    lines = ctx.newAPIHadoopFile(inputPath, NonSplitableTextInputFormat.class,
            LongWritable.class, Text.class, hadoopConf)
        .values()
        .map(new Function<Text, String>() {
            @Override
            public String call(Text arg0) throws Exception {
                return arg0.toString();
            }
        });
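(NonSplitableTextInputFormat is not a stock Hadoop class; a typical version
just extends TextInputFormat and disables splitting, roughly like this:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class NonSplitableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Never split a file: each file maps to exactly one split/partition
            return false;
        }
    }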

This works in some cases, but it truncates some lines (I am not sure why, but
it looks like there is a limit on the read). I have a feeling that Spark
truncates the file at 2 GB. Whatever the cause (the same data has no issue
when I use MapReduce for the input), Spark sometimes truncates a very big file
when it tries to read the whole thing.

3. Another way I can do it is to distribute the file names as the input to
Spark and, inside a function, open a stream to read each file directly
(sketched below). This is what I am planning to do, but I think it is ugly.
Does anyone have a better solution for it?
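Roughly what I have in mind (a hypothetical sketch; it assumes the paths are
on HDFS, the text is UTF-8, and commons-io is on the classpath):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.commons.io.IOUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // fileNames: List<String> of input paths; one partition per file
    JavaRDD<String> contents = ctx.parallelize(fileNames, fileNames.size())
        .map(new Function<String, String>() {
            @Override
            public String call(String pathStr) throws Exception {
                Path path = new Path(pathStr);
                FileSystem fs = path.getFileSystem(new Configuration());
                FSDataInputStream in = fs.open(path);
                try {
                    // Read the whole file into one string (fits-in-memory assumption)
                    return IOUtils.toString(in, StandardCharsets.UTF_8);
                } finally {
                    in.close();
                }
            }
        });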

BTW: the files are currently in text format, but they might be in Parquet
format later; that is also a reason I don't like my third option.

Regards,

Shuai
