In my use-case, I’m reading a single CSV file, parsing the records, and then inserting them into a database.
The whole bounded/unbounded distinction is very unclear to me: when is a PCollection one or the other? How do I make it be one or the other? More generally, how does the Pipeline execute? Can multiple PTransforms be running at once? My code has a simple DoFn that takes the ReadableFile provided by FileIO, opens it for streaming, and starts generating records (using Apache commons-csv). The next PTransform then takes those records and inserts them into a DB. Will this be handled continuously, or will each PTransform block until it has finished all of its processing? In other words, will JdbcIO start inserting data before the CSVParser has reached the end of the file (assuming it's a big file)?

From: Chamikara Jayalath <[email protected]>
Sent: Tuesday, July 24, 2018 18:34
To: [email protected]
Subject: Re: Large CSV files

Are you trying to read a growing file? I don't think that scenario is well supported. You can use FileIO.MatchAll.continuously() if you want to read a growing list of files (where new files get added to a given directory). If you are reading a large but fixed set of files, then what you need is a bounded source, not an unbounded source. We do not have a pre-defined source for reading CSV files with multi-line records (unless you can identify a record delimiter and use TextIO with the withDelimiter() option). So I'd suggest using FileIO.match() or FileIO.matchAll() and a custom ParDo to read records.

Thanks,
Cham

On Mon, Jul 23, 2018 at 11:28 PM Kai Jiang <[email protected]> wrote:

I have the same situation. If the CSV is splittable, we could use SDF.

On Mon, Jul 23, 2018 at 1:38 PM Raghu Angadi <[email protected]> wrote:

It might be simpler to discuss if you replicate the question here. Are your CSV files splittable? Otherwise, the Flink/Dataflow runners would not load the entire file into memory. This is a streaming application, right?
MatchAll in FileIO.java is used in TextIO, AvroIO, etc. to read files continuously in streaming applications. It is built on SDF and allows reading smaller chunks of a file (as long as the file is splittable).

Raghu.

On Mon, Jul 23, 2018 at 7:16 AM Andrew Pilloud <[email protected]> wrote:

Hi Kelsey,

I posted a reply on Stack Overflow. It sounds like you might be using the DirectRunner, which isn't meant to handle datasets that are too big to fit into memory. If that is the case, have you tried the Flink local runner or the Dataflow runner?

Andrew

On Mon, Jul 23, 2018 at 4:06 AM Kelsey RIDER <[email protected]> wrote:

Hello,

SO question here: https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam

Anybody have any ideas? Am I missing something?

Thanks
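Cham's suggestion above (FileIO.match() plus a custom ParDo to parse records, with JdbcIO for the inserts) could be sketched roughly as below. This is a minimal, untested sketch of the original use case, not a definitive implementation: the file pattern, the two-column CSV layout, the "people" table, and the JDBC driver/URL are all made up. Note that runners execute fused transforms concurrently on a per-bundle basis, so JdbcIO can begin writing records while the parsing DoFn is still streaming through the file; it does not wait for the whole PCollection to be produced.

```java
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvToDbPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(FileIO.match().filepattern("/data/input*.csv"))   // hypothetical path
     .apply(FileIO.readMatches())
     // Parse each matched file record-by-record; elements are emitted as they
     // are parsed, not buffered until the end of the file.
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(@Element FileIO.ReadableFile file,
                           OutputReceiver<KV<String, String>> out) throws Exception {
         try (CSVParser parser = CSVFormat.DEFAULT.parse(
                 new InputStreamReader(Channels.newInputStream(file.open()),
                                       StandardCharsets.UTF_8))) {
           for (CSVRecord record : parser) {
             // Assumes a two-column CSV: name, email.
             out.output(KV.of(record.get(0), record.get(1)));
           }
         }
       }
     }))
     .apply(JdbcIO.<KV<String, String>>write()
         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
             "org.postgresql.Driver", "jdbc:postgresql://localhost/mydb"))
         .withStatement("INSERT INTO people (name, email) VALUES (?, ?)")
         .withPreparedStatementSetter((element, statement) -> {
           statement.setString(1, element.getKey());
           statement.setString(2, element.getValue());
         }));

    p.run().waitUntilFinish();
  }
}
```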
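For the growing-directory case Cham mentions, the match can be made continuous, which is also where the bounded/unbounded distinction comes from: a plain FileIO.match() yields a bounded PCollection, while continuously() yields an unbounded one and makes the pipeline a streaming job. A sketch, with a made-up path and poll interval:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

public class WatchCsvDirectory {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Poll the directory every 30 seconds for newly added files. The resulting
    // PCollection of matches is unbounded; Watch.Growth.never() means the
    // watch never terminates on its own.
    p.apply(FileIO.match()
        .filepattern("/incoming/*.csv")
        .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
     .apply(FileIO.readMatches());
    // ... downstream parsing/writing transforms would follow here ...

    p.run();
  }
}
```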
