Unfortunately, my version of NiFi does not have Range Sampling.
If I get the flowfile through a session as done in the Cookbook, does
anyone know of an approach in Groovy to grab line N and avoid loading the
entire CSV file into the string variable *text*?
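
One possible approach, as a minimal sketch rather than a tested NiFi script: wrap the input stream in a BufferedReader and walk it line by line, keeping only the wanted line and a running count, so no more than one line is held in memory at a time. The helper name scanLines and the sample data are hypothetical and only for illustration. It assumes UTF-8 content and that no quoted CSV field contains an embedded newline (otherwise physical lines and records would differ).

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical helper (not part of NiFi): reads the stream one line at a
// time, remembers only line number `wanted` (1-based), and counts all lines.
def scanLines(InputStream inputStream, int wanted) {
    String picked = null
    long total = 0
    def reader = new BufferedReader(
            new InputStreamReader(inputStream, StandardCharsets.UTF_8))
    String line
    while ((line = reader.readLine()) != null) {
        total++
        if (total == wanted) {
            picked = line   // the only line that is kept
        }
    }
    return [line: picked, total: total]
}

// Standalone demo with made-up data; line 2 plays the role of the header.
def sample = 'table: customers\nid,name\n1,Ann\n2,Bob\n'
def result = scanLines(
        new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)), 2)
println "header=${result.line} datarecords=${result.total - 2}"

// Inside ExecuteScript this might be wired up roughly as follows (untested
// sketch; attributes are set after session.read returns so the flowfile is
// not modified while its content stream is still open):
//   def found = null
//   session.read(ff, { inputStream ->
//       found = scanLines(inputStream, 2)
//   } as InputStreamCallback)
//   ff = session.putAttribute(ff, 'mdb.table.header', found.line ?: '')
//   ff = session.putAttribute(ff, 'mdb.table.datarecords',
//           String.valueOf(found.total - 2))
```

The whole file is still read once to get the total, but only as a stream. If only line N were needed, the loop could stop as soon as it reaches that line.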

On Thu, Feb 9, 2023 at 7:18 PM Matt Burgess wrote:

> I’m AFK ATM but Range Sampling was added into the SampleRecord processor (
> https://issues.apache.org/jira/browse/NIFI-9814). The Jira doesn’t say
> which version it went into, but it is definitely in 1.19.1+. If that’s
> available to you then you can just specify “2” as the range and it will
> only return that line.
>
> For total record count without loading the whole thing into memory,
> there’s probably a more efficient way but you could use ConvertRecord and
> convert it from CSV to CSV and it should write out the “record.count”
> attribute. I think some/most/all record processors write this attribute,
> and they work record by record so they don’t load the whole thing into
> memory. Even SampleRecord adds a record.count attribute but if you specify
> one line the value will be 1 :)
>
> Regards,
> Matt
>
>
> On Feb 9, 2023, at 6:57 PM, James McMahon wrote:
>
> 
> Hello. I am trying to identify a header line and a data line count from a
> flowfile that is in csv format.
>
> Most of us are familiar with Matt B's outstanding Cookbook series, and I
> am trying to use that as my starting point. Here is my Groovy code:
>
> import org.apache.commons.io.IOUtils
> import java.nio.charset.StandardCharsets
> def ff=session.get()
> if(!ff)return
> try {
>      def text = ''
>      // Cast a closure with an inputStream parameter to InputStreamCallback
>      session.read(ff, {inputStream ->
>           text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>           // Do something with text here
>           // get header from the second line of the flowfile
>           // set datacount as the total line count of the file - 2
>           ...
>           ff = session.putAttribute(ff, 'mdb.table.header', header)
>           ff = session.putAttribute(ff, 'mdb.table.datarecords', datacount)
>      } as InputStreamCallback)
>      session.transfer(ff, REL_SUCCESS)
> } catch(e) {
>      log.error('Error occurred identifying tables in mdb file', e)
>      session.transfer(ff, REL_FAILURE)
> }
>
> I want to avoid the IOUtils.toString line (shown in red in my original
> message) because, as Matt cautions in his cookbook, it reads the entire
> file into the variable text, and our CSV files are too large for that.
>
> How in Groovy can I cherry pick only the line I want from the stream (line
> #2 in this case)?
>
> Also, how can I get a count of the total lines without loading them all
> into text?
>
> Thanks in advance for your help.
>
>
