Hi,

The following snippet lets me iterate over each character of a file in HDFS:

// Opening the file
Configuration conf = new Configuration();
FSDataInputStream in = null;
FileSystem fs = FileSystem.get(conf);
Path inFile = new Path(args[0]);
in = fs.open(inFile);
// Reading the file
Reader reader = new BufferedReader(new InputStreamReader(in,
    StandardCharsets.UTF_8));
int c = 0;
while ((c = reader.read()) != -1) {
  System.out.println((char)c);
}

But I suspect this is inefficient because of the BufferedReader.
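
For comparison, here is a minimal sketch of reading in blocks rather than one
character per call. It assumes only that the stream is a java.io.InputStream
(FSDataInputStream is one, so the result of fs.open(inFile) could be passed
in). The class name ChunkedCharReadSketch, its method names, and the buffer
sizes are made up for illustration. InputStreamReader keeps track of
multi-byte UTF-8 sequences that straddle two reads, so characters should come
out intact. The per-character System.out.println in the snippet above may
well cost more than the BufferedReader itself, so it is worth measuring
without the printing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.IntConsumer;

// Hypothetical helper for illustration; not part of Hadoop.
class ChunkedCharReadSketch {

    // Decodes the stream as UTF-8 and hands every char to 'action',
    // pulling up to bufferSize chars from the reader per call.
    static void forEachChar(InputStream in, int bufferSize, IntConsumer action)
            throws IOException {
        char[] buf = new char[bufferSize];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    action.accept(buf[i]);
                }
            }
        }
    }

    // Convenience wrapper for in-memory data, used to try the loop locally.
    static String collect(byte[] utf8, int bufferSize) {
        StringBuilder sb = new StringBuilder();
        try {
            forEachChar(new ByteArrayInputStream(utf8), bufferSize,
                    c -> sb.append((char) c));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        byte[] data = "h\u00e9llo \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(collect(data, 3));
    }
}
```

With HDFS, the call would presumably look like
forEachChar(fs.open(inFile), 8192, c -> { ... }), where 8192 is an arbitrary
buffer size.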

I tried something like this:

Configuration conf = new Configuration();
FSDataInputStream in = null;
FileSystem fs = FileSystem.get(conf);
Path inFile = new Path(args[0]);
in = fs.open(inFile);
ByteBuffer x = ByteBuffer.allocate(655360);
int length = in.read(x);
while (length > 0) {
  int c = 0;
  while (c < length) {
    System.out.println(x.getChar(c));
    c++;
  }
  x.clear();
   length = in.read(x);
}

Although this is significantly faster, it does not print the correct
characters.
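
A possible explanation, sketched without Hadoop so it runs on its own:
ByteBuffer.getChar(int) reads two bytes starting at the given index and
combines them into one UTF-16 char (big-endian by default). Stepping the
index one byte at a time, as the loop above does, therefore yields
overlapping two-byte pairs rather than decoded UTF-8 characters. The class
GetCharSketch and its method names are made up for illustration, and the byte
array stands in for whatever in.read(x) would have filled.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of Hadoop.
class GetCharSketch {

    // Mirrors the loop above: one 2-byte big-endian char per byte offset.
    // getChar(i) needs bytes i and i+1, hence the i + 1 < length bound.
    static String viaGetChar(byte[] utf8) {
        ByteBuffer buf = ByteBuffer.wrap(utf8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < utf8.length; i++) {
            sb.append(buf.getChar(i));
        }
        return sb.toString();
    }

    // Decodes the same bytes as UTF-8 instead of reinterpreting them.
    static String viaDecode(byte[] utf8) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "Hi".getBytes(StandardCharsets.UTF_8); // 0x48 0x69
        char c = viaGetChar(bytes).charAt(0);
        System.out.printf("U+%04X%n", (int) c);  // prints U+4869, not 'H'
        System.out.println(viaDecode(bytes));    // prints Hi
    }
}
```

Decoding each chunk on its own like this would still break a multi-byte
sequence that happens to be split across two reads. A
java.nio.charset.CharsetDecoder, or an InputStreamReader, carries such
leftover bytes over to the next read.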

What is the best way to iterate over each character of a file stored in
HDFS?

Thanks,

-- 
Pratyush Das
