Re: How to preserve/preset partition information when load time series data?
Hi Shuai,

It should certainly be possible to do it that way, but I would recommend against it. If you look at HadoopRDD, it's doing all sorts of little book-keeping that you would most likely want to mimic, e.g., tracking the number of bytes and records that are read, setting up all the Hadoop configuration, splits, and readers, scheduling tasks for locality, etc. That's why I suggested that really you want to just create a small variant of HadoopRDD.

hope that helps,
Imran

On Sat, Mar 14, 2015 at 11:10 AM, Shawn Zheng wrote:

> Sorry for the late reply.
>
> But I just thought of one solution: if I load all the file names themselves
> (not the contents of the files), then I have an RDD[key, Iterable[filename]],
> and I run mapPartitionsToPair on it with preservesPartitioning=true.
>
> Do you think it is a right solution? I am not sure whether it has a
> potential issue if I try to fake/enforce the partitioning in my own way.
>
> Regards,
>
> Shuai
>
> On Wed, Mar 11, 2015 at 8:09 PM, Imran Rashid wrote:
>
>> [...]
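The filename-grouping step in Shuai's proposal — building a keyed collection of day -> file names before reading any file contents — can be sketched independently of Spark. This plain-Python sketch (function names are illustrative, not from any Spark API) shows the day-keying logic such an RDD[key, Iterable[filename]] would be built from:

```python
import re
from collections import defaultdict

def day_key(filename):
    """Extract the YYYYMMDD day key from names like '20150102c.parq'."""
    m = re.match(r"(\d{8})", filename)
    if m is None:
        raise ValueError("unexpected file name: " + filename)
    return m.group(1)

def group_files_by_day(filenames):
    """Group file names by day, mirroring the (day, Iterable[filename])
    pairs that would seed the keyed RDD before any file is read."""
    groups = defaultdict(list)
    for name in filenames:
        groups[day_key(name)].append(name)
    return dict(groups)

files = ["20150101a.parq", "20150101b.parq",
         "20150102a.parq", "20150102b.parq", "20150102c.parq"]
print(group_files_by_day(files))
# {'20150101': [...2 files...], '20150102': [...3 files...]}
```

In the actual Spark version, each task would then open its partition's files inside mapPartitionsToPair, so the day-to-partition mapping survives because no shuffle happens after the grouping.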
Re: How to preserve/preset partition information when load time series data?
It should be *possible* to do what you want ... but if I understand you right, there isn't really any very easy way to do it. I think you would need to write your own subclass of RDD, which has its own logic on how the input files get divided among partitions. You can probably subclass HadoopRDD and just modify getPartitions(). Your logic could look at the day in each filename to decide which partition it goes into. You'd need to make corresponding changes for HadoopPartition & the compute() method.

(Or, if you can't subclass HadoopRDD directly, you can use it for inspiration.)

On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng wrote:

> [...]
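The partition-assignment logic an overridden getPartitions() would implement can be sketched outside Spark. Plain Python stands in here for the Scala a real HadoopRDD subclass would use, and the function name is illustrative — the point is simply: one output partition per distinct day, each holding all of that day's files:

```python
def partitions_by_day(filenames):
    """Sketch of what an overridden getPartitions() would compute:
    one partition per distinct day, holding all of that day's files."""
    days = sorted({name[:8] for name in filenames})   # YYYYMMDD prefixes
    index_of = {day: i for i, day in enumerate(days)}
    partitions = [[] for _ in days]
    for name in filenames:
        partitions[index_of[name[:8]]].append(name)
    return partitions

files = ["20150102c.parq", "20150101a.parq",
         "20150101b.parq", "20150103a.parq"]
for i, part in enumerate(partitions_by_day(files)):
    print(i, part)
```

In the real subclass, each element of this list would become a Partition object listing its files, and compute() would open and concatenate exactly those files for its partition.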
How to preserve/preset partition information when load time series data?
Hi All,

If I have a set of time series data files, they are in Parquet format and the data for each day are stored following a naming convention, but I will not know how many files there are for one day:

20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
…
201501010a.parq
…

Now I am trying to write a program to process the data, and I want to make sure each day's data goes into one partition. Of course I can load everything into one big RDD and then partition it, but that will be very slow. As I already know the time series from the file names, is it possible for me to load the data into the RDD and also preserve the partitioning? I know I can preserve a partition per file, but is there any way for me to load the RDD and preserve the partitioning based on a set of files: one partition, multiple files?

I think it should be possible, because when I load an RDD from 100 files (assume they cover 100 days), I will have 100 partitions (if I disable file splitting when loading). Then I could use a special coalesce to repartition the RDD? But I don't know whether that is possible in current Spark.

Regards,

Shuai
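The "special coalesce" idea amounts to computing, for each day, which of the single-file input partitions should be merged into one output partition. A minimal sketch of that grouping, in plain Python with illustrative names (applying it in Spark would require a coalesce that accepts a custom grouping, which the question is asking about):

```python
def coalesce_groups(partition_days):
    """Given the day of each input partition's single file
    (list index = partition id), compute which input partitions a
    day-preserving coalesce would merge into each output partition."""
    order = []    # days in first-seen order
    groups = {}   # day -> list of parent partition ids
    for pid, day in enumerate(partition_days):
        if day not in groups:
            groups[day] = []
            order.append(day)
        groups[day].append(pid)
    return [groups[day] for day in order]

# five one-file partitions covering two days
days = ["20150101", "20150101", "20150102", "20150102", "20150102"]
print(coalesce_groups(days))
# [[0, 1], [2, 3, 4]]
```

Because coalesce (without shuffle) only merges existing partitions, this keeps each day's files together without moving data across the cluster.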