Look at the last third of the book, especially chapter 14. One important thing to check is whether your integers are really codes (identifiers, where the magnitude means nothing) or actual numeric quantities. Codes should be encoded as keywords, not as numbers.
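To make the distinction between codes and numbers concrete, here is a minimal sketch of hashed encoding for one key/value row. It is plain Python and does not use Mahout's own encoder classes; the names bucket, encode_row, NUM_FEATURES and the sample row are all made up for illustration, and the choice of MD5 as the hash is an assumption, not what Mahout does internally.

```python
import hashlib

NUM_FEATURES = 32  # assumed vector size for the demo; real data would use far more


def bucket(token: str, size: int = NUM_FEATURES) -> int:
    """Map a token to a stable index (Python's built-in hash() is salted per run)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def encode_row(row: dict, numeric_keys: set, size: int = NUM_FEATURES) -> list:
    """Hash one key/value row into a fixed-length list of floats."""
    vec = [0.0] * size
    for key, value in row.items():
        if key in numeric_keys:
            # True quantity: the slot depends on the key only, the value is the weight.
            vec[bucket(key, size)] += float(value)
        else:
            # Code or keyword: the slot depends on key and value, the weight is 1.
            vec[bucket(f"{key}={value}", size)] += 1.0
    return vec


# Hypothetical row: "age" is a real number, "zip" is an integer that is really a code.
row = {"age": "42", "zip": "94110", "color": "red"}
vector = encode_row(row, numeric_keys={"age"})
```

Because the vector length is fixed up front, no dictionary of keys and values is needed, which is what makes this workable on a huge data set. The cost is that two features can collide in the same slot; a larger vector makes that rarer.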
Hashed vector encoding should work quite well.

On Wed, Jan 9, 2013 at 10:10 PM, Haddad Said <[email protected]> wrote:

> Hi,
>
> I have a data set in CSV which is a set of key value pairs, the data set is
> huge and the values are a mixture of integers and short strings (i.e. not
> lengthy texts, but rather key words) and I want to process it using
> Mahout's clustering algorithms.
>
> The issue is in converting this CSV into vectors that can be consumed by
> Mahout. I have been reading "Mahout In Action" and there seem to be two
> options for vectorizing, using numeric values with Mahout's DenseVector,
> RandomAccessSparseVector, and SequentialAccessSparseVector implementation
> or use Vector Space Model to vectorize text documents.
>
> The data I want to vectorize is not really a text document, but as it is a
> huge data set with many different keys and values it is difficult to map it
> to numeric values. What is the best way to vectorize this kind of data for
> use in Mahout?
>
> Any pointers would be appreciated.
>
> Thanks
