I don't know what your use case is, but we avoid anything beyond gzip
because S3 is so cheap.

On Thu, Sep 29, 2022 at 10:51 AM James McMahon <[email protected]> wrote:
>
> Thank you, Mark. I had no idea there was this file-based dependency in 7z
> files. Since my workaround appears to be working, I think I may just move
> forward with that.
> Steve, Mark - thank you again for replying.
> Jim
>
> On Thu, Sep 29, 2022 at 9:15 AM Mark Payne <[email protected]> wrote:
>>
>> It’s been a while. But if I remember correctly, the reason that NiFi does 
>> not natively support 7-zip format is that with 7-zip, the dictionary is 
>> written at the end of the file.
>> So when data is compressed, the dictionary is built up during compression 
>> and written at the end. This makes sense from a compression standpoint.
>> However, it means that in order to decompress, you must first jump to the
>> end of the file to access the dictionary, then jump back to the beginning to
>> perform the decompression.
>> NiFi makes use of Input Streams and Output Streams for FlowFile access - it 
>> doesn’t provide a File-based approach. And this ability to jump to the end, 
>> read the dictionary, and then jump back to the beginning isn’t really 
>> possible with Input/Output Streams - at least, not without buffering 
>> everything into memory.
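>>
>> For example (a sketch; the archive name and the /tmp/unpacked output
>> directory are just placeholders, assuming the p7zip “7za” binary), the
>> file-based call works because 7za can seek within the file, while the piped
>> call fails because a pipe does not support seeking:
>>
>> # archive on disk: 7za can jump to the end and back, so this works
>> 7za x testArchive.7z -o/tmp/unpacked -aou
>>
>> # archive piped via STDIN: no seeking possible, so this fails with E_NOTIMPL
>> cat testArchive.7z | 7za x -si -so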
>>
>> So it would make sense that there would be a “Not Implemented” error when 
>> attempting to do the same thing using the 7-zip application directly with
>> input streams & output streams.
>> I think that if you’re stuck with 7-zip, your only option will be to do what 
>> you’re doing - write the data out as a file, run the 7-zip application 
>> against that file, writing the output to some directory, and then picking up 
>> the files from that directory.
>> The alternative, of course, would be to update the source so that it’s 
>> creating zip files instead of 7-zip files, if you have sway over the source 
>> producer.
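>>
>> A rough sketch of that file-based round trip, as a wrapper script that
>> ExecuteStreamCommand could call (the script name, the temp handling and the
>> /data/unpacked staging directory are only placeholders):
>>
>> #!/bin/sh
>> # unpack7z.sh - hypothetical wrapper; the FlowFile content arrives on STDIN
>> set -e
>> tmp=$(mktemp -d)
>> cat > "$tmp/archive.7z"                        # spool the stream to a real file
>> 7za x "$tmp/archive.7z" -o/data/unpacked -aou  # extract into a staging directory
>> rm -rf "$tmp"                                  # discard the temporary copy
>> # a GetFile or ListFile/FetchFile step then picks the files up from /data/unpacked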
>>
>> Thanks
>> -Mark
>>
>>
>> On Sep 29, 2022, at 8:58 AM, stephen.hindmarch.bt.com via users 
>> <[email protected]> wrote:
>>
>> James,
>>
>> E_NOTIMPL means that the feature is not implemented. I can see there is 
>> discussion about this over at SourceForge, but the detail is blocked by my 
>> employer’s firewall.
>>
>> p7zip / Discussion / Help: E_NOTIMPL for stdin / stdout pipe
>>
>> https://sourceforge.net/p/p7zip/discussion/383044/thread/8066736d
>>
>> Steve Hindmarch
>>
>> From: James McMahon <[email protected]>
>> Sent: 29 September 2022 12:12
>> To: Hindmarch,SJ,Stephen,VIR R <[email protected]>
>> Cc: [email protected]
>> Subject: Re: Can ExecuteStreamCommand do this?
>>
>> I ran with these Command Arguments in the ExecuteStreamCommand configuration:
>> x;-si;-so;-spf;-aou
>> That is, ${filename} is removed, with -si indicating use of STDIN and -so use
>> of STDOUT.
>>
>> The same error is thrown by 7z through ExecuteStreamCommand: Executable 
>> command /bin/7za ended in an error: ERROR: Can not open the file as an 
>> archive  E_NOTIMPL
>>
>> I tried this at the command line, getting the same failure:
>> cat testArchive.7z | 7za x -si -so | dd of=stooges.txt
>>
>>
>> On Thu, Sep 29, 2022 at 6:44 AM James McMahon <[email protected]> wrote:
>>
>> Good morning, Steve. Indeed, that second paragraph is exactly how I did get 
>> this to work. I unpack to disk and then read in the twelve results using a 
>> GetFile. So far it is working well. It just feels a little wrong to me to do 
>> this, as I have introduced an extra write to and read from disk, which is 
>> going to be slower than doing it all in memory within the JVM. While that 
>> overhead may not seem significant for a single 7z file, as we work across 
>> thousands and thousands of files it adds up.
>>
>> I am about to try what you suggested above: dropping the ${filename} 
>> entirely from the STDIN / STDOUT configuration. I realize it is not likely 
>> going to give me the twelve output flowfiles I'm seeking in the "output 
>> stream" path from ExecuteStreamCommand. I just want to see if it works 
>> without throwing that error.
>>
>> I welcome any other thoughts or comments you may have. Thanks again for your 
>> input so far.
>>
>> Jim
>>
>> On Thu, Sep 29, 2022 at 5:23 AM <[email protected]> wrote:
>>
>> James,
>>
>> I have been thinking more about your problem and this may be the wrong 
>> approach. If you successfully unpack your files into the flow file content, 
>> you will still have one output flow file containing the unpacked contents of 
>> all of your files. If you need 12 separate files in their own flowfiles then 
>> you will need to find some way of splitting them up. Is there a byte 
>> sequence you can use in a SplitContent process, or a specific file length 
>> you can use in SplitText?
>>
>> Otherwise you may be better off using ExecuteStreamCommand to unpack the 
>> files on disk. Run it verbosely and use the output of that step to create a 
>> list of the locations where your recently unpacked files are. Or create a 
>> temporary directory to unpack in and fetch all the files in there, cleaning 
>> up afterwards. Then you can load the files with FetchFile. FetchFile can be 
>> instructed to delete the file it has just read, so it can also clean up after 
>> itself.
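>>
>> One way to build that list, as a sketch (the archive name here is just
>> illustrative), is from the archive’s own listing, which prints a Path entry
>> for every contained file:
>>
>> 7za l -slt testArchive.7z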
>>
>> Steve Hindmarch
>>
>> From: stephen.hindmarch.bt.com via users <[email protected]>
>> Sent: 29 September 2022 09:19
>> To: [email protected]; [email protected]
>> Subject: RE: Can ExecuteStreamCommand do this?
>>
>> James,
>>
>> Using ${filename} and -si together seems wrong to me. What happens when you 
>> try that on the command line?
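>>
>> That is, on the command line it would normally be one or the other (the path
>> here is illustrative):
>>
>> 7za x /dir/subdir/myTestFile.7z          # read the named archive file
>> 7za x -si < /dir/subdir/myTestFile.7z    # read the archive from STDIN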
>>
>> Steve Hindmarch
>>
>> From: James McMahon <[email protected]>
>> Sent: 28 September 2022 13:49
>> To: [email protected]; Hindmarch,SJ,Stephen,VIR R 
>> <[email protected]>
>> Subject: Re: Can ExecuteStreamCommand do this?
>>
>> Thank you, Steve. I’ve employed a ListFile/FetchFile to load the 7z files 
>> into the flow. When I have my ESC configured like the following, I get my 
>> unpacked file results in the #{unpacked.destination} directory on disk:
>> Command Arguments             x;${filename};-spf;-o#{unpacked.destination};-aou
>> Command Path                  /bin/7za
>> Ignore STDIN                  true
>> Working Directory             #{unpacked.destination}
>> Argument Delimiter            ;
>> Output Destination Attribute  No value set
>> I get twelve files in my output destination folder.
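>>
>> Roughly, that configuration amounts to running the following, with
>> #{unpacked.destination} left as a placeholder and the illustrative path from
>> further down standing in for ${filename}:
>>
>> /bin/7za x /dir/subdir/myTestFile.7z -spf -o#{unpacked.destination} -aou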
>>
>> When I try this one, I get an error and no output:
>> Command Arguments             x;${filename};-si;-so;-spf;-aou
>> Command Path                  /bin/7za
>> Ignore STDIN                  false
>> Working Directory             #{unpacked.destination}
>> Argument Delimiter            ;
>> Output Destination Attribute  No value set
>>
>> This yields this error...
>> Executable command /bin/7za ended in an error: ERROR: Can not open the file 
>> as archive
>> E_NOTIMPL
>> ...and it yields only one flowfile result in Output Stream: a brief 
>> text/plain report of the results of the 7za extraction, shown below. The 
>> report indicates it did indeed find my 7z file and identify the 12 files in 
>> it, yet I still get no output on my outgoing flow path:
>> Extracting archive: /parent/subparent/testArchive.7z
>> - -
>> Path = /parentdir/subdir/testArchive.7z
>> Type = 7z
>> Physical Size = 7204
>> Headers Size = 298
>> Method = LZMA2:96k
>> Solid = +
>> Blocks = 1
>>
>> Everything is Ok
>>
>> Folders: 1
>> Files: 12
>> Size: 90238
>> Compressed: 7204
>>
>> ${filename} in both cases is a fully qualified path to the file, like this: 
>> /dir/subdir/myTestFile.7z.
>>
>> I can't seem to get the ESC output stream to be the extracted files. 
>> Anything jump out at you?
>>
>> On Wed, Sep 28, 2022 at 8:06 AM stephen.hindmarch.bt.com via users 
>> <[email protected]> wrote:
>>
>> Hi James,
>>
>> I am not in a position to test this right now, but you have to think of the 
>> flowfile content as STDIN and STDOUT. So with 7-zip you need to use the “-si” 
>> and “-so” flags to ensure there are no files involved. Then, if you can load 
>> the content of a file into a flowfile, e.g. with GetFile, you should be able 
>> to unpack it with ExecuteStreamCommand. Set “Ignore STDIN” = “false”.
>>
>> I have written up my own use case on GitHub. This involves having a Redis 
>> script as the input, and the results of the script as the output.
>>
>> my-nifi-cluster/experiment-redis_direct.md at main · 
>> hindmasj/my-nifi-cluster · GitHub
>>
>> The first part of the post shows how to do it with the input commands on the 
>> command line, a bit like you running “7za ${filename} -so”. The second part 
>> has the script inside the flowfile, treated as STDIN, a bit like you doing 
>> “unzip -si -so”.
>>
>> See if that helps. Fundamentally, if you do “7za -si -so < myfile.7z” on the 
>> command line and see the output on the console, ExecuteStreamCommand will 
>> behave the same.
>>
>> Steve Hindmarch
>> From: James McMahon <[email protected]>
>> Sent: 28 September 2022 12:02
>> To: [email protected]
>> Subject: Can ExecuteStreamCommand do this?
>>
>> I continue to struggle with ExecuteStreamCommand, and am hoping one of you 
>> from our user community can help me with the following:
>> 1. Can ExecuteStreamCommand be used as I am trying to use it?
>> 2. Can you direct me to an example where ExecuteStreamCommand is configured 
>> to do something similar to my use case?
>>
>> My use case:
>> The incoming flowfiles in my flow path are 7z archives. Based on what I’ve 
>> researched so far, NiFi’s native processors don’t handle unpacking of 7z 
>> files.
>>
>> I want to read the 7z files as STDIN to ExecuteStreamCommand.
>> I'd like the processor to call out to a 7za app, which will unpack the 7z.
>> One incoming flowfile will yield multiple output files. Let's say twelve in 
>> this case.
>> My goal is to output those twelve as new flowfiles out of 
>> ExecuteStreamCommand, to its output stream path.
>>
>> I can’t yet get this to work. The best I’ve been able to do is configure 
>> ExecuteStreamCommand to unpack ${filename} to a temporary output directory 
>> on disk. Then I have another path in my flow polling that directory every 
>> few minutes looking for new data. I am hoping to eliminate that intermediate 
>> write/read to/from disk by keeping this all within the flow and JVM memory.
>>
>> Thanks very much in advance for any assistance.
>>
>>
