That "fix" caused an out-of-memory error, so I wouldn't rely on it too much
for larger volumes of files.
But this is why we added the integration test. Listing thousands of files
IIRC. I couldn't possibly comment beyond that ;-)

Take care,

Matt

On Tue, 12 Sep 2023 at 17:18, Fabian Peters <[email protected]> wrote:

> Hi Matt,
>
> Well, I've since applied the recursion-based fix again and the pipeline
> started working as expected. Was anything else changed in the logic that
> would ensure that multiple rows get passed into the transform? This was the
> original problem, that only the first row was acted upon. (The problem only
> occurs if the path to the directory is set via "Get filename from field".)
>
> cheers
>
> Fabian
>
> On 12.09.2023 at 16:37, Matt Casters <[email protected]> wrote:
>
> It's surprising since we have a successful test running with "Get File
> Names" on the Beam direct runner.
>
>
> https://ci-builds.apache.org/job/Hop/job/Hop-integration-tests/lastCompletedBuild/testReport/(root)/beam_directrunner/0010_get_file_names/
>
> I think that the main thing is to have permissions on the gs:// location
> you want to get files from.
>
> Cheers,
>
> Matt
>
>
> On Wed, 6 Sep 2023 at 09:05, Fabian Peters <[email protected]> wrote:
>
>> Good morning all!
>>
>> Not having worked with Hop for a couple of months I downloaded the 2.5.0
>> version and found that an existing pipeline failed to work as expected.
>> This is due to the "Get file names" transform returning only a single row
>> for each row passed to "Get filename from field". I ran into the same
>> issue <https://issues.apache.org/jira/projects/HOP/issues/HOP-4191?filter=allissues>
>> last year, but the fix <https://github.com/apache/hop/pull/1674/files> I
>> provided turned out to sometimes cause a stack overflow
>> <https://issues.apache.org/jira/projects/HOP/issues/HOP-4528?filter=allissues>
>> and was reverted. (No hard feelings…)
>>
>> Is there another way to make this work on Beam/Dataflow? Or is there an
>> alternative approach I can use to get all files in a GCS path, short of
>> using their HTTP API?
>>
>> Besides this: Great work on the Dataflow template handling – works like a
>> charm now!
>>
>> cheers
>>
>> Fabian
>>
>
>
