Sure: skip the step of converting the single file into multiple files, and adjust your Hadoop split size to get maybe 4 mappers on your 2-node cluster. Seq2sparse will handle that directly and give you a few vector sequence files to use in subsequent processing. By breaking your big file into 2 million small files you incur huge internal fragmentation overhead in Hadoop and create 2 million map tasks, which will take forever to run on a small cluster.
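If the corpus starts out as one big text file with one document per line, a minimal sketch of writing it into a single SequenceFile of (doc id, text) pairs, instead of 2 million small files, could look like the code below. The class name, paths, and "doc-" id scheme are just placeholders for illustration.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSequenceFile {
  public static void main(String[] args) throws Exception {
    // args[0] = local text file (one document per line), args[1] = HDFS output path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One SequenceFile, one record per line: key = synthetic doc id, value = document text,
    // which is the <Text, Text> layout seq2sparse expects as input.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, new Path(args[1]), Text.class, Text.class);
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      long docId = 0;
      while ((line = reader.readLine()) != null) {
        writer.append(new Text("doc-" + docId++), new Text(line));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

With the resulting SequenceFile on HDFS, seq2sparse can read it directly; if you also lower the maximum split size (for example via the mapred.max.split.size property, assuming your Mahout driver passes generic Hadoop -D options through), you should end up with on the order of 4 map tasks instead of 2 million.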
On 4/25/12 7:19 AM, balaprasanna wrote:
Hi,
I am currently using Mahout for machine learning algorithms. I have a single
file which consists of 2 million lines of text, and I want to build a
document-term matrix from it. I have converted the single file into a directory
consisting of 2 million individual files. Now I am running seq2sparse to
calculate the tf matrices. I am running this on a two-node Cloudera Hadoop
cluster. Since the input is large, calculating the tf matrices for 2 million
files takes a lot of time. Is there any alternative way to speed up this
process?
Regards,
Prasanna