Hi folks, I have a problematic dataset I'm working with and am trying to find ways of "debugging" the data.
For example, the simplest thing I'd like to do is know how many rows of data I've read and compare that to a simple count of the lines in the file. I could do df.count(), but that seems clunky (and expensive) for something that should be easy to keep track of.

I then thought accumulators might be the solution, but it seems I would need at least a second pass through the data just to addInPlace the line total, at which point I might as well do the count. I would also like to tally rows that are missing the relevant data, say, a record without the requisite primary key.

I note too that accumulators are only tallies, but what if I want to keep track of every file read? Say my directory has 100k files or some such; I want to know that I have read each file, by its filename. A plain accumulator won't help me there, since I want to keep filenames rather than just a count of files read. I might then, for example, be able to work out that file X was missed because it was corrupt.

Has anyone got advice on handling this sort of thing? Thanks in advance.