Hi! I'm a student interested in using Spark for my big data research project. I've successfully set up a cluster in the cloud, and now I'm developing a data pipeline for batch processing.
Input: an S3 folder of flat files (CSV), all with the same number of columns. Columns include:

- ID (integer)
- in_degree connections (list of integer IDs)
- out_degree connections (list of integer IDs)
- year (int)
- month (string)
- latitude (double)
- longitude (double)
- city (string)
- name (string)
- descriptionA (a very long string), descriptionB, descriptionC, descriptionD, and so on (more discrete and categorical variables)

The current total size is about 100GB, but I'd like to write a workflow that can scale up to the full 2TB dataset later.

Desired action: I want to import the data from S3 and then transform the dataset by adding columns. For instance, I'd like to apply operations to every row, creating new features/columns for each row. Some of these operations would be functions that take in values from multiple columns and either run simple operations (addition, division, split-by-word, etc.) or more complicated ones (interface with an external website such as Google or Facebook to grab further information about a given row ID, and plug the result into a new column). By nature, all of these operations can run totally in parallel.

What's a workable way to transform my dataset within Spark (or Spark SQL) into a format that I can run through MLlib?

Thanks in advance!

Brian
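P.S. For context, here is roughly what I picture the pipeline looking like, written as a minimal PySpark sketch. The bucket paths, the derived column names, and the lookup_external stub are placeholders I made up for the kinds of operations described above, not working code against my real data:

```
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("feature-pipeline").getOrCreate()

# Read the whole S3 folder of CSVs into one DataFrame.
# (inferSchema is convenient at 100GB; for the full 2TB run I'd probably
# pass an explicit schema instead to avoid the extra scan.)
df = spark.read.csv("s3a://my-bucket/input/", header=True, inferSchema=True)

# Simple column-level transforms (addition, division, string split, ...)
df = (df
      .withColumn("lat_plus_lon", F.col("latitude") + F.col("longitude"))
      .withColumn("desc_words", F.split(F.col("descriptionA"), " ")))

# A more complicated per-row function wrapped as a UDF; in the real pipeline
# this is where the call to an external service for a given row ID would go.
def lookup_external(row_id):
    # placeholder only: pretend this queried Google/Facebook for the ID
    return "info-for-{}".format(row_id)

lookup_udf = F.udf(lookup_external, StringType())
df = df.withColumn("external_info", lookup_udf(F.col("ID")))

# Assemble numeric feature columns into the single vector column MLlib expects
assembler = VectorAssembler(
    inputCols=["latitude", "longitude", "lat_plus_lon"],
    outputCol="features")
features_df = assembler.transform(df)

features_df.write.parquet("s3a://my-bucket/features/")
```

Does something along these lines make sense, or is there a better-suited way to structure the row-wise transforms and the hand-off to MLlib?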