CSV files as input

2012-02-22 Thread Keith Wiley
It seems nearly impossible to use CSV files as Hadoop input. I see that there is a CsvRecordInput class, but I have found virtually no examples online of how to use it...and the one example I did find blatantly assumed that CSV records are delimited by newlines, which the CSV spec does not guarantee, since quoted fields may contain embedded newlines. Base
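
For context (not from the original post): because a quoted field may legally contain a newline, one CSV record can span several physical lines, so a reader that treats every newline as a record boundary mis-parses made-up input like this:

    id,comment
    1,"fits on one line"
    2,"spans
    two lines"

Here the second record only ends at the closing quote on the last line, not at the newline inside it.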

disk problem while dealing with large amount of data

2012-02-22 Thread Allen
Hello there, I got the following errors related to a disk problem. I checked the slave node which runs the task; only half of the disk space is used, so I don't understand why that happens. The application I run dumps most of the intermediate keys to one particular reducer, so that particular reducer h
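
Not from the thread, but one common mitigation for that kind of skew, sketched under assumptions: if a single known key dominates the intermediate data, the mapper can append a small random salt to it so its records land on several reducers instead of filling one node's local disk, and a follow-up pass combines the partial results. The key name, salt range, and the comma-split parsing below are made up for illustration.

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final String HOT_KEY = "the-dominant-key"; // assumed, for illustration
        private static final int SALT_BUCKETS = 16;               // spread the hot key over 16 reducers
        private final Random random = new Random();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assume simple "key,value" lines; real parsing would be job-specific.
            String[] parts = line.toString().split(",", 2);
            String key = parts[0];
            String value = parts.length > 1 ? parts[1] : "";
            if (HOT_KEY.equals(key)) {
                // Salted keys hash to different partitions, so the skewed key's
                // records no longer pile up on a single reducer's local disk.
                key = key + "#" + random.nextInt(SALT_BUCKETS);
            }
            context.write(new Text(key), new Text(value));
        }
    }

A second job (or a merge keyed on the unsalted prefix) then folds the partial outputs back together.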

Re: CSV files as input

2012-02-22 Thread Steve Lewis
It sounds like you may need to give up a little to make things work. Suppose, for example, that you placed a limit on the length of a quoted string, say 1024 characters; the reader can then either start at the beginning or read back by, say, 1024 characters to see if the start is in a quote and pr
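
A rough sketch of one way to read that suggestion, with the caveats stated in the comments (it leans on the assumed 1024-character cap, assumes the look-behind window starts outside any quoted field, and ignores escaped quotes):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;

    public final class QuoteStateProbe {

        private QuoteStateProbe() {}

        // Reads back up to 1024 bytes before the split start and counts
        // double-quote characters: odd parity is taken to mean the split begins
        // inside a quoted field. This is only trustworthy if quoted fields are
        // capped at 1024 characters and the window boundary itself falls outside
        // any quote; escaped quotes ("") and multi-byte encodings are ignored
        // for brevity.
        public static boolean startsInsideQuote(FSDataInputStream in, long splitStart)
                throws IOException {
            int lookBehind = (int) Math.min(1024L, splitStart);
            byte[] window = new byte[lookBehind];
            in.readFully(splitStart - lookBehind, window);
            int quotes = 0;
            for (byte b : window) {
                if (b == '"') {
                    quotes++;
                }
            }
            return quotes % 2 == 1;
        }
    }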

Re: CSV files as input

2012-02-22 Thread Keith Wiley
Thanks for responding. Unfortunately, the data already exists. I have no way of instituting limitations on the format, much less reformatting it to suit my needs. It is true that I can make some general assumptions about the data (unrealistically long strings are unlikely to occur), but I can

Re: CSV files as input

2012-02-22 Thread Steve Lewis
Two other points: if you have several input files, make a custom input format whose protected boolean isSplitable(JobContext context, Path file) returns false, and you do not have problems starting in the middle. If the input is not truly massive, you can simply write a piece of code to find th
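
For the first point, a minimal sketch of what such an input format might look like (the class name is made up, and the stock LineRecordReader stands in for a real CSV-aware reader):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class WholeFileCsvInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Never split: each file goes to one mapper in one piece, so the
            // reader can track quote state from the very first byte.
            return false;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // A real implementation would return a CSV-aware reader; the stock
            // line reader stands in here to keep the sketch short.
            return new LineRecordReader();
        }
    }

The job would select it with job.setInputFormatClass(WholeFileCsvInputFormat.class); the trade-off is that parallelism then comes from having many files rather than many splits per file.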