Mahout now supports running its distributed linear algebra natively on Spark, so the problem of loading sequence file input into Spark is already solved there (trunk, http://mahout.apache.org/users/sparkbindings/home.html, see the drmFromHDFS() call; you can then access the underlying RDD directly via the matrix's "rdd" property if needed).
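
If you would rather stay with plain Spark than go through the Mahout bindings, here is a rough sketch of how the sequenceFile call from the quoted message might be written. It assumes the Mahout jar that contains org.apache.mahout.math.VectorWritable is on the Spark classpath (how you add it depends on your Spark version). The names MahoutSeqFileSketch, toArray and load are made up for illustration, and I have not run this against your data.

```scala
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; not part of Mahout or Spark.
object MahoutSeqFileSketch {

  // Copy the wrapped Mahout vector into a plain array of doubles.
  def toArray(vw: VectorWritable): Array[Double] = {
    val v = vw.get()
    Array.tabulate(v.size())(i => v.get(i))
  }

  // Read a <Text, VectorWritable> sequence file.
  // The key and value classes are passed as classOf[...] values, and the key
  // stays a Hadoop Text (not String) until it is converted in the map step.
  def load(sc: SparkContext, path: String): RDD[(String, Array[Double])] =
    sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
      // Convert right away: Hadoop reuses Writable instances between records,
      // and Writables are not java.io.Serializable.
      .map { case (key, value) => (key.toString, toArray(value)) }
}
```

In the shell this would be called as MahoutSeqFileSketch.load(sc, "/KMeans_dataset_seq/part-r-00000"). Converting to plain strings and arrays in the first map step also means no Mahout types need to be serialized by Spark afterwards.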

If you are specifically trying to ensure interoperability with MLlib, I have not tried that. However, Mahout's linalg and its bindings to Spark work with the Kryo serializer only, so if/when MLlib algorithms do not support the Kryo serializer, the two would not be interoperable.

-d

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi <stutiawas...@hcl.com> wrote:

> Hi All,
>
> I am very new to Spark and trying to play around with MLlib, hence
> apologies for the basic question.
>
> I am trying to run the KMeans algorithm using Mahout and Spark MLlib to
> compare the performance. The initial data size was 10 GB. Mahout converts
> the data into a Sequence File <Text,VectorWritable>, which is used for
> KMeans clustering. The Sequence File created was ~6 GB in size.
>
> Now I wanted to know if I can use the Mahout Sequence File in Spark MLlib
> for KMeans. I have read that SparkContext.sequenceFile may be used here.
> Hence I tried to read my sequence file as below, but I am getting this error:
>
> Command on Spark Shell:
>
> scala> val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-00000",String,VectorWritable)
> <console>:12: error: not found: type VectorWritable
>        val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-00000",String,VectorWritable)
>
> Here I have 2 questions:
>
> 1. Mahout has “Text” as the key, but Spark is printing “not found: type:Text”,
>    hence I changed it to String. Is this correct?
> 2. How will VectorWritable be found in Spark? Do I need to include the Mahout
>    jar in the classpath, or is there another option?
>
> Please suggest.
>
> Regards
> Stuti Awasthi