Hi,

I have gone through k-means clustering and canopy clustering in Mahout. Before running either one, the text files have to be converted to sequence files using the seqdirectory command. The input to seqdirectory is a directory with one file per record, where the filename is the record ID.

However, I have more than 10 million records, stored in no more than 5 to 10 text files in HDFS. Each line holds one record, with the ID and the record text separated by a tab. Creating 10 million separate files just to feed seqdirectory doesn't seem right. Is there any other way to do this?

Thanks,
Subbu
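
P.S. To make the input format concrete, here is a minimal Python sketch of how I picture each line being split into an ID and a record. This is not Mahout code; the function name `parse_records` and the sample values are made up for illustration, and I am assuming the first tab on a line is the separator.

```python
from typing import Iterable, Iterator, Tuple


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, text) pairs from 'id<TAB>record' lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split only at the first tab so tabs inside the record text survive.
        record_id, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        yield record_id, text


# Hypothetical sample lines in the same layout as my files.
sample = ["101\tfirst record text\n", "102\tsecond record\twith a tab\n"]
pairs = list(parse_records(sample))
```

What I am looking for is a way to get each such (ID, record) pair into a sequence file as one key/value entry, the way seqdirectory does with a filename and its file contents, without first writing one file per record.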
