Hi,

we are running LDA on 50 million files.

Each file is no more than 5 MB and represents the content of one
user. Files keep getting updated as we receive new information about the user.

Currently we store all these files on EC2, and when we need to run LDA we
transfer them to S3 and run the Mahout process. Transferring the files
to S3 takes a long time. Also, the Hadoop job is not very efficient when
the files are smaller than 128 MB.

We are now thinking of writing the files directly to S3 whenever the
collector gets the data. For example, as we receive content, User1 would
have the files User1_1, User1_2, etc.
Before running LDA, another process would then aggregate all the data for
User1 and feed it into the vector conversion step.
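To make the aggregation step concrete, here is a minimal sketch of what we
have in mind, assuming shard files named like User1_1, User1_2 (the naming
scheme and directory layout here are illustrative, not our actual collector
output): shards are grouped by user and concatenated into one document per
user, so the downstream job sees one larger file instead of many small ones.

```python
import os
import re
from collections import defaultdict

# Assumed shard naming convention: "<user>_<sequence number>".
SHARD_RE = re.compile(r"^(?P<user>User\d+)_(?P<seq>\d+)$")

def aggregate_shards(src_dir: str, dst_dir: str) -> dict:
    """Concatenate per-user shard files into one file per user.

    Returns a mapping of user -> number of shards merged.
    """
    # Group shard filenames by user, keeping the numeric sequence
    # so shards are merged in arrival order.
    shards = defaultdict(list)
    for name in os.listdir(src_dir):
        m = SHARD_RE.match(name)
        if m:
            shards[m.group("user")].append((int(m.group("seq")), name))

    os.makedirs(dst_dir, exist_ok=True)
    merged = {}
    for user, parts in shards.items():
        parts.sort()  # merge in shard order
        with open(os.path.join(dst_dir, user), "wb") as out:
            for _, name in parts:
                with open(os.path.join(src_dir, name), "rb") as f:
                    out.write(f.read())
        merged[user] = len(parts)
    return merged
```

The per-user output files could then be fed to the vectorization step
(e.g. Mahout's seqdirectory/seq2sparse) in place of the raw shards.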

Can you please help us design this workflow?

Thanks,
Nishant
