Hi folks, I have a problematic dataset I'm working with and am trying to find ways of "debugging" the data.
For example, the simplest thing I'd like to do is know how many rows of data I've read and compare that to a simple count of the lines in the file. I could do df.count(), but that seems clunky (and expensive) for something that should be easy to keep track of.

I then thought accumulators might be the solution, but it seems I would need at least a second pass through the data just to addInPlace the line total, at which point I might as well do the count. I would also like to tally rows that are missing the relevant data, say, a record without the requisite primary key.

I note too that accumulators are only tallies, but what if I want to keep track of every file read? Say my directory has 100k files or some such; I want to know that I have read each file, by its filename. A plain accumulator won't help me there, since I want to keep filenames rather than just a count of files read. I might then, for example, be able to work out that file X was missed because it was corrupt.

Has anyone got advice on handling this sort of thing? Thanks in advance.