It seems nearly impossible to use CSV files as Hadoop input. I see that there
is a CsvRecordInput class, but I have found virtually no examples online of how
to use it... and the one example I did find blatantly assumed that CSV records
are delimited by newlines, which is not what the CSV spec says: quoted fields
may contain embedded newlines, so a single record can span several physical lines.
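For example, the following is one valid record under the spec, but a
newline-delimited reader would split it in two (illustrative data only):

    42,"first physical line of the field
    second physical line of the same field",done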
Hello there,
I got the following errors related to a disk problem. I checked the
slave node that runs the task; only half of its disk space is used, so I
don't understand why this happens. The application I run dumps most of
the intermediate keys to one particular reducer, so that reducer has to
handle far more data than the others.
It sounds like you may need to give up a little to make things work.
Suppose, for example, that you placed a limit on the length of a quoted
string, say 1024 characters. The reader can then either start at the
beginning of the file, or read back by, say, 1024 characters to see whether
its split starts inside a quote, and proceed accordingly.
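A rough sketch of that probe, assuming the 1024-character cap (the class
and method names and the MAX_QUOTED constant are made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;

    public class QuoteStateProbe {
        static final int MAX_QUOTED = 1024; // the proposed cap

        /**
         * Decide whether byte offset splitStart falls inside a quoted
         * field, given that no quoted field is longer than MAX_QUOTED.
         */
        static boolean startsInsideQuote(FSDataInputStream in, long splitStart)
                throws IOException {
            // Read back up to 2 * MAX_QUOTED bytes so the window can hold
            // both a quote-free "anchor" run and the bytes after it.
            long from = Math.max(0, splitStart - 2L * MAX_QUOTED);
            byte[] buf = new byte[(int) (splitStart - from)];
            in.readFully(from, buf, 0, buf.length);

            // An anchor is a position known to be outside quotes: the start
            // of the file, or any position whose preceding MAX_QUOTED bytes
            // contain no quote character (a field of <= MAX_QUOTED bytes
            // cannot span such a run).
            int anchor = -1;
            if (from == 0) {
                anchor = 0;
            } else {
                int lastQuote = -1;
                for (int i = 0; i < buf.length; i++) {
                    if (i - lastQuote > MAX_QUOTED) { anchor = i; break; }
                    if (buf[i] == '"') lastQuote = i;
                }
            }
            if (anchor < 0) {
                // Quotes too dense in this window: a real reader would keep
                // backing up, or fall back to scanning from the file start.
                throw new IOException("no quote-free run before split");
            }

            // Parse forward from the anchor, toggling on every quote.
            // Doubled quotes ("" escapes) toggle twice and cancel out.
            boolean inQuote = false;
            for (int i = anchor; i < buf.length; i++) {
                if (buf[i] == '"') inQuote = !inQuote;
            }
            return inQuote;
        }
    }

The key observation is that once quoted strings are capped, any position
preceded by a quote-free run of MAX_QUOTED bytes must be outside quotes,
which gives the probe a safe place to start parsing.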
Thanks for responding. Unfortunately, the data already exists. I have no way
of instituting limitations on the format, much less reformatting it to suit my
needs. It is true that I can make some general assumptions about the data
(unrealistically long strings are unlikely to occur), but I can't guarantee
that those assumptions will always hold.
Two other points. First, if you have several input files, use a custom input
format whose

    protected boolean isSplitable(JobContext context, Path file)

returns false; then no reader ever starts in the middle of a file, and the
problem goes away (a sketch follows below). Second, if the input is not truly
massive, you can simply write a piece of code to find the record boundaries
before the job runs.
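A minimal sketch of the non-splittable input format, assuming the newer
org.apache.hadoop.mapreduce API; the class name is made up, and it extends
TextInputFormat only for brevity (a real CSV format would plug in a
CSV-aware RecordReader as discussed above):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical name; any FileInputFormat subclass works the same way.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Never split: each reader starts at byte 0 of its file, so a
            // CSV-aware record reader would never begin inside a quote.
            return false;
        }
    }

The trade-off is that each file is consumed by a single mapper, so
parallelism comes only from having many files.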