On Tue, Nov 19, 2019 at 1:56 AM 👌👌 <[email protected]> wrote:

> Hello! I am a Beam user, and I want to ask you two questions.
>
> First: I use Beam in my project, and my data is a JSON object. I find that
> in my pipeline the data is serialized and deserialized many times, but I
> don't know where this happens, and it costs me a lot of time. Could you
> please tell me whether I can turn it off, so that serialization happens
> only on read and write?

If you mean "does Beam only apply coders when reading from and writing to an external storage system" (e.g. files, Kafka, BigQuery, etc.), the answer is no:

- Data in an external storage system is stored in the format appropriate for that system, which is different from and unrelated to the wire format of Beam coders, so Beam coders cannot be used to parse or format data for external storage.
- Beam runners apply coders to transmit data over the wire between workers, or to write it to disk for temporary materialization (e.g. for fault tolerance). There is no way to know which elements of which PCollections will or won't be materialized - a runner is allowed to do this with any element, at any time, anywhere in the pipeline. Runners try to do it as little as possible, but there are no hard guarantees, and you cannot even assume that if a runner didn't materialize something this time, it won't materialize it the next time you run exactly the same pipeline on exactly the same data.
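To make the cost model concrete, here is a minimal pure-Python sketch of what a coder does (this is an illustration of the concept, not the actual Beam API): a coder turns an element into bytes and back, and a runner may invoke that round trip for any element at any shuffle or checkpoint boundary, so an element representation that is expensive to encode is paid for repeatedly, not just at read/write time.

```python
import json

# Sketch of a coder for elements that are parsed JSON dicts.
# (Illustrative only - not Beam's Coder interface.)
class JsonDictCoder:
    def encode(self, element):
        # Serialize the parsed dict to UTF-8 JSON bytes.
        # A runner may call this at any materialization point.
        return json.dumps(element, sort_keys=True).encode("utf-8")

    def decode(self, data):
        # Parse the bytes back into a dict on the receiving worker.
        return json.loads(data.decode("utf-8"))

coder = JsonDictCoder()
element = {"user": "a", "clicks": 3}
roundtrip = coder.decode(coder.encode(element))
assert roundtrip == element
```

Since you cannot prevent these round trips, the practical lever is to make them cheap, e.g. by carrying a compact representation (such as the raw JSON bytes) through the pipeline instead of a heavyweight parsed object.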
> Second: I use Beam running on Spark, but I have a problem: some keys have
> many values, which creates data skew. I want to know whether there are
> methods to solve this. I tried Reshuffle.of(), but it had no effect.

Please elaborate on what you're doing with the (key, [value...]) tuples produced by GroupByKey. Depending on what you do with them, there may or may not be a way to speed things up.

> Thanks for your answer!
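For example, if what you do per key is an associative, commutative combine (a sum, a max, etc.), the standard fix for a hot key is to fan it out: salt the key into sub-keys, combine the shards in parallel, then combine the partial results. This is what Beam's Combine.perKey(...).withHotKeyFanout(...) does for you. Here is a pure-Python sketch of the idea (not Beam code), assuming the combine is a sum:

```python
import random
from collections import defaultdict

def combine_with_fanout(pairs, fanout=4):
    """Two-stage combine that spreads a hot key over `fanout` sub-keys."""
    # Stage 1: salt each key with a random shard id so a hot key's
    # values land on several sub-keys and can be combined in parallel.
    partials = defaultdict(int)
    for key, value in pairs:
        partials[(key, random.randrange(fanout))] += value
    # Stage 2: strip the salt and combine the (at most `fanout`)
    # partial results per original key - this stage is cheap because
    # each original key now has only a handful of values.
    totals = defaultdict(int)
    for (key, _shard), partial in partials.items():
        totals[key] += partial
    return dict(totals)

pairs = [("hot", 1)] * 1000 + [("cold", 2)]
```

This only works when the per-key operation is a combine; if you genuinely need all of a key's values together in one place, the skew is inherent to the grouping, which is why it matters what you do with the GroupByKey output.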
