Re: Beam File Input - unexpected result

podunk Wed, 09 Nov 2022 07:34:43 -0800

I opened an issue in Jira: 
HOP-4575[https://issues.apache.org/jira/browse/HOP-4575]
It looks like a bug.
 
The Execution information perspective ([2] 
https://hop.apache.org/manual/latest/hop-gui/perspective-execution-information.html#top)
does not work here. Clicking on the localization icon does nothing.
Pressing CTRL-Shift-I takes me to the perspective, but the pane is blank and 
there is no Data tab either. This is after the pipeline has been executed.
Regards

Sent: Friday, November 04, 2022 at 10:10 PM
From: "Hans Van Akelyen" <[email protected]>
To: [email protected]
Subject: Re: Beam File Input - unexpected result
Hi,

Seems like an odd thing you are encountering.
I'm not sure how you are ending up with that result. If you think you are 
hitting a bug, feel free to create a ticket with a reproduction path.

As for debugging, you are right that the Beam file definitions currently have 
no way of making a best guess at the structure of a file.
The Text file input transform can do this. It is not optimised for Beam usage, 
but you can build the definition in the Text file input and then copy it to the 
file definition.

When using the direct runner you can preview the data flowing through a 
transform: click on the transform and use the “preview” output [1].
It will launch the pipeline and show you the result. When executing on 
Dataflow/Spark/Flink we also have a way to “see” what is happening inside the 
pipeline: the Execution information perspective [2]. It can save execution 
information and sample data. When running on a remote cluster it is best to 
also have a Hop Server running as an endpoint to save the execution 
information.

As for the final part: because of the distributed execution, retry on failure 
and other mechanisms in Beam, transforms like a text file output let every 
bundle/instance write to a new file. This is the safest way to do it, and it is 
the default and recommended approach. If the output must go to a single file, 
there is a workaround: change the number of copies on the output transform and 
enter the value “SINGLE_BEAM”. This adds a group by to the Beam pipeline, 
forcing it onto a single thread so that it can write to a single file. It also 
comes with a performance penalty. For more information on this you can take a 
look at how we handle our transforms [3].
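
An alternative that is not mentioned in this thread is to keep the default 
sharded output and concatenate the files afterwards, outside of Hop. Below is a 
minimal Python sketch of that idea. The function name merge_shards and the 
file names in the usage comment are hypothetical, and the sketch assumes that 
every output file starts with the same header line, which should be checked 
against the actual output before relying on it.

```python
from pathlib import Path
from typing import Iterable


def merge_shards(shards: Iterable[Path], target: Path) -> int:
    """Concatenate sharded text files into target, keeping a single header.

    Assumption: every non-empty shard starts with the same header line.
    Returns the number of data lines written (header excluded).
    """
    ordered = sorted(shards)  # fix the order before the target file is created
    written = 0
    header_done = False
    with target.open("w", encoding="utf-8", newline="") as out:
        for shard in ordered:
            with shard.open("r", encoding="utf-8", newline="") as src:
                header = src.readline()
                if not header:
                    continue  # skip empty shards
                if not header_done:
                    # Write the header once, from the first non-empty shard.
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_done = True
                for line in src:
                    # Make sure a shard without a final newline does not
                    # run into the first line of the next shard.
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written


# Hypothetical usage, file names are placeholders:
# merge_shards(Path("output").glob("part-*.txt"), Path("merged.txt"))
```

This reads one line at a time, so it should cope with big files, but unlike 
“SINGLE_BEAM” it is an extra step that runs after the pipeline has finished.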

Cheers,
Hans

[1] https://hop.apache.org/manual/latest/pipeline/run-preview-debug-pipeline.html#top
[2] https://hop.apache.org/manual/latest/hop-gui/perspective-execution-information.html#top
[3] https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_all_others

On 4 November 2022 at 19:26:27, [email protected] wrote:

Hi,

I'm playing with a Beam pipeline. My goal is to merge two big files.
I have a source file (one of the two) like:

column_one|colum_two
0099"|"0080199111"
...

My trivial pipeline is:
Beam File Input => Text file output
I created a definition for Beam File Input: separator "|", column_one - string, 
column_two - string

But in the result (Text file output) I get:

column_one|colum_two
0|0|9|9|"|"|0|0|8|0|1|9|9|1|1|1|"
...

Why is each character separated by "|"?
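
For comparison, here is a minimal Python sketch, outside of Hop, of what 
splitting the sample row on a literal "|" would be expected to produce: two 
fields, with the quote characters left in place. The function name split_row 
is hypothetical. It only illustrates the expected result and says nothing 
about how Beam File Input is implemented or why the output above differs.

```python
from typing import List


def split_row(line: str, separator: str = "|") -> List[str]:
    """Split one row on a literal separator, without any quote handling."""
    # str.split treats the separator as plain text, not as a pattern.
    return line.rstrip("\n").split(separator)


sample = '0099"|"0080199111"'
fields = split_row(sample)
print(len(fields))  # number of fields found in the sample row
print(fields)
```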

I also get 51 result files, even though I set 'Number of workers : 3' in the 
Pipeline Run Configuration for the engine 'Beam Direct pipeline engine'.

Also, this source file is really big and building the definition is quite a 
time-consuming process. It would be great to have an option like in Text file 
input, where Hop detects the fields and is able to preview them.

Best
