Hi,

We are running LDA on 50 million files. Each file is at most 5 MB and represents the content of one user. The files keep getting updated as we receive new information about a user.

Currently we store all these files on EC2, and when we need to run LDA we transfer them to S3 and run the Mahout process. Transferring the files to S3 takes a long time. Also, the Hadoop job is not efficient when the input files are much smaller than 128 MB.

We are now thinking of writing the files directly to S3 as the collector receives the data. For example, User1 would accumulate files as content arrives: User1_1, User1_2, etc. Then, before running LDA, another process would aggregate all of User1's data and feed it into the vectorization step.

Can you please help us design this workflow?

Thanks,
Nishant
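P.S. A rough sketch of the aggregation step we have in mind, in Python. This assumes the part files are on a local filesystem for illustration (in practice they would be read from S3), and the `merge_user_files` name and the `User<N>_<seq>` filename convention are just how we have been describing it:

```python
import os
import re
from collections import defaultdict

def merge_user_files(src_dir, dst_dir):
    """Concatenate User<N>_<seq> part files into one file per user,
    so the LDA job sees a few large inputs instead of many small ones."""
    os.makedirs(dst_dir, exist_ok=True)
    # Group part files by user prefix, keeping the numeric sequence
    # so we can merge them in arrival order.
    parts = defaultdict(list)
    pattern = re.compile(r"^(User\d+)_(\d+)$")
    for name in os.listdir(src_dir):
        m = pattern.match(name)
        if m:
            parts[m.group(1)].append((int(m.group(2)), name))
    # Write one aggregated file per user.
    for user, files in parts.items():
        with open(os.path.join(dst_dir, user), "w") as out:
            for _, name in sorted(files):
                with open(os.path.join(src_dir, name)) as f:
                    out.write(f.read())
                    out.write("\n")
    return sorted(parts)
```

The aggregated per-user files could then be packed into SequenceFiles and vectorized for Mahout; merging first also sidesteps the many-small-files inefficiency on the Hadoop side.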
