On 2018/08/21 16:20:13, Lukasz Cwik <[email protected]> wrote:
> I would agree with Eugene. A simple application that does this is probably
> what you're looking for.
>
> There are ways to make this work with parallel processing systems, but it's
> quite a hassle and only worthwhile if your computation is very expensive
> and you want the additional computational power of multiple CPU cores. For
> example, in a parallel processing system you could read the records from
> the file and remember the file offset / line number of each record. You
> could then group them under a single key, use the sorting extension to sort
> by the file offset / line number, and then write all the sorted records out
> to a single file. Note that this will likely be a lot slower than a simple
> program.
>
> On Tue, Aug 21, 2018 at 8:02 AM Eugene Kirpichov <[email protected]>
> wrote:
>
> > It sounds like you want to sequentially read a file, sequentially process
> > the records and sequentially write them. The best way to do this is likely
> > without using Beam, just write some Java or Python code using standard file
> > APIs (use Beam's FileSystem APIs if you need to access data on a non-local
> > filesystem).
> >
> > On Tue, Aug 21, 2018 at 7:11 AM [email protected] <[email protected]>
> > wrote:
> >
> >> Hi
> >>
> >> I have to process a big file and call several ParDos to do some
> >> transformations. Records in the file don't have any unique key.
> >>
> >> Let's say the file 'testfile' has 1 million records.
> >>
> >> After processing, I want to generate only one output file, the same as my
> >> input 'testfile', and I also have a requirement to write those 1 million
> >> records in the same order (after applying some ParDos).
> >>
> >> What is the best way to do it?
> >>
> >> Thanks
> >> Aniruddh
> >>
>
Thanks Eugene and Lukasz for the replies. One further query, please. The use
case is to process 50,000 files together multiple times a day, do some
processing on the records (which includes DLP calls), and then generate the
same set of 50,000 output files, each output file containing the same number
of records in the same order as its input file (a replica of the input file's
records, preserving order). Is this an ideal use case for Dataflow? If yes, is
there any example covering the same scenario, and if not, what is the
recommended way? Thanks a lot
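[Editor's note] The parallel approach Lukasz outlines above (tag each record with its file offset while reading, process in any order, then group under one key and sort by offset before writing) can be sketched in plain Python. This is an illustrative simulation, not real Beam code: `transform` stands in for the ParDo logic, and the shuffle stands in for the arbitrary record ordering a parallel runner would produce.

```python
# Plain-Python sketch of the offset-tag / group / sort technique.
# Not Beam API code; names here are illustrative stand-ins.
import random

def transform(record):
    # Stand-in for the actual ParDo transformation (hypothetical).
    return record.upper()

input_records = ["alpha", "bravo", "charlie", "delta"]

# 1. Tag each record with its line number / offset while reading the file.
tagged = list(enumerate(input_records))

# 2. A parallel runner may process records in arbitrary order;
#    the shuffle simulates that.
random.shuffle(tagged)
processed = [(offset, transform(rec)) for offset, rec in tagged]

# 3. Group everything under a single key and sort by the saved offset
#    (in Beam this would be a GroupByKey plus the sorter extension).
processed.sort(key=lambda pair: pair[0])

# 4. Write the records back out in their original input order.
output_records = [rec for _, rec in processed]
print(output_records)  # ['ALPHA', 'BRAVO', 'CHARLIE', 'DELTA']
```

As the thread notes, forcing everything through a single key serializes the final write, so this is only worthwhile when the per-record processing itself is expensive.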