In my use-case, I’m reading a single CSV file, parsing the records, and then inserting them into a database.
The whole bounded/unbounded distinction is very unclear to me: when is a PCollection one or the other? How do I make it be one or the other? More generally, how does the Pipeline execute? Can multiple PTransforms be running at once? My code has a simple DoFn that takes the ReadableFile provided by FileIO, opens it for streaming, and starts generating records (using Apache commons-csv). The next PTransform then takes those records and inserts them into a DB. Will this be handled continuously, or will each PTransform block until it has finished all of its processing? In other words, will JdbcIO start inserting data before the CSVParser has reached the end of the file (assuming it's a big file)?

From: Chamikara Jayalath <[email protected]>
Sent: Tuesday, July 24, 2018 18:34
To: [email protected]
Subject: Re: Large CSV files

Are you trying to read a growing file? I don't think that scenario is well supported. You can use FileIO.MatchAll.continuously() if you want to read a growing list of files (where new files get added to a given directory). If you are reading a large but fixed set of files, then what you need is a bounded source, not an unbounded source. We do not have a pre-defined source for reading CSV files with multi-line records (unless you can identify a record delimiter and use TextIO with the withDelimiter() option). So I'd suggest using FileIO.match() or FileIO.matchAll() and a custom ParDo to read records.

Thanks,
Cham

On Mon, Jul 23, 2018 at 11:28 PM Kai Jiang <[email protected]> wrote:

I have the same situation. If the CSV is splittable, we could use SDF.

On Mon, Jul 23, 2018 at 1:38 PM Raghu Angadi <[email protected]> wrote:

It might be simpler to discuss if you replicate the question here. Are your CSV files splittable? Otherwise, the Flink/Dataflow runners would not load the entire file into memory. This is a streaming application, right?
MatchAll in FileIO.java is used in TextIO, AvroIO, etc. to read files continuously in streaming applications. It is built on SDF and allows reading smaller chunks of a file (as long as the file is splittable).

Raghu.

On Mon, Jul 23, 2018 at 7:16 AM Andrew Pilloud <[email protected]> wrote:

Hi Kelsey,

I posted a reply on Stack Overflow. It sounds like you might be using the DirectRunner, which isn't meant to handle datasets that are too big to fit into memory. If that is the case, have you tried the Flink local runner or the Dataflow runner?

Andrew

On Mon, Jul 23, 2018 at 4:06 AM Kelsey RIDER <[email protected]> wrote:

Hello,

SO question here: https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam

Anybody have any ideas? Am I missing something?

Thanks
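Cham's suggestion above (FileIO.match() plus a custom ParDo to parse records, with JdbcIO for the inserts) could be sketched roughly as below. This is a minimal, untested sketch of the original use case, not a definitive implementation: the file pattern, the two-column CSV layout, the "people" table, and the JDBC driver/URL are all made up. Note that runners execute fused transforms concurrently on a per-bundle basis, so JdbcIO can begin writing records while the parsing DoFn is still streaming through the file; it does not wait for the whole PCollection to be produced.

```java
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvToDbPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(FileIO.match().filepattern("/data/input*.csv"))   // hypothetical path
     .apply(FileIO.readMatches())
     // Parse each matched file record-by-record; elements are emitted as they
     // are parsed, not buffered until the end of the file.
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(@Element FileIO.ReadableFile file,
                           OutputReceiver<KV<String, String>> out) throws Exception {
         try (CSVParser parser = CSVFormat.DEFAULT.parse(
                 new InputStreamReader(Channels.newInputStream(file.open()),
                                       StandardCharsets.UTF_8))) {
           for (CSVRecord record : parser) {
             // Assumes a two-column CSV: name, email.
             out.output(KV.of(record.get(0), record.get(1)));
           }
         }
       }
     }))
     .apply(JdbcIO.<KV<String, String>>write()
         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
             "org.postgresql.Driver", "jdbc:postgresql://localhost/mydb"))
         .withStatement("INSERT INTO people (name, email) VALUES (?, ?)")
         .withPreparedStatementSetter((element, statement) -> {
           statement.setString(1, element.getKey());
           statement.setString(2, element.getValue());
         }));

    p.run().waitUntilFinish();
  }
}
```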
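For the growing-directory case Cham mentions, the match can be made continuous, which is also where the bounded/unbounded distinction comes from: a plain FileIO.match() yields a bounded PCollection, while continuously() yields an unbounded one and makes the pipeline a streaming job. A sketch, with a made-up path and poll interval:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

public class WatchCsvDirectory {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Poll the directory every 30 seconds for newly added files. The resulting
    // PCollection of matches is unbounded; Watch.Growth.never() means the
    // watch never terminates on its own.
    p.apply(FileIO.match()
        .filepattern("/incoming/*.csv")
        .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
     .apply(FileIO.readMatches());
    // ... downstream parsing/writing transforms would follow here ...

    p.run();
  }
}
```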
