Hi, I'm currently working on a Spark/HBase setup which processes log files (~10 GB/day). These log files are persisted hourly on n > 10 application servers and copied to a 4-node HDFS.
Our current Spark job aggregates single visits (based on a session UUID) across all application servers on a daily basis (rough shape of the job in the first sketch below). Visits are filtered (only about 1% of the data remains) and stored in HBase for further processing. We currently make no use of the Spark Streaming API, i.e. a cron job runs every day and fires the visit calculation.

Questions:

1) Is it really necessary to store the log files in HDFS, or can Spark somehow read the files from a local file system and distribute the data to the other nodes (second sketch below)? Rationale: the data is (probably) only read once, during the visit calculation, which defeats the purpose of a DFS.

2) If the raw log files have to be in HDFS, I have to remove them from HDFS after processing, so COPY -> PROCESS -> REMOVE (third sketch below). Is this the way to go?

3) Before I can process the visits for an hour, I have to wait until the log files of all application servers have been copied to HDFS. It doesn't seem like StreamingContext.fileStream can wait for more sophisticated patterns, e.g. ("context*/logs-2016-08-01-15"). Do you guys have a recommendation for solving this problem? One possible solution: after the files have been copied, create an additional marker file that tells Spark that all files are available (last sketch below).
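For reference, here is roughly the shape of the current daily job (untested sketch; parseLine and isRelevantVisit are simplified stand-ins for our real parsing and filter logic, and the paths are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object DailyVisits {
      // Stand-in: assumes the session UUID is the first token of a log line.
      def parseLine(line: String): (String, String) = (line.takeWhile(_ != ' '), line)
      // Stand-in for the real visit filter (only ~1% of visits survive it).
      def isRelevantVisit(events: Iterable[String]): Boolean = events.size > 1

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("daily-visits"))
        val visits = sc.textFile("hdfs:///logs/2016-08-01-*") // all servers, one day
          .map(parseLine)   // -> (sessionUuid, rawLine)
          .groupByKey()     // one visit = all lines sharing a session UUID
          .filter { case (_, events) => isRelevantVisit(events) }
        // visits are then written to HBase (omitted here)
        println(visits.count())
        sc.stop()
      }
    }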
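Regarding question 1, this is what I have in mind (minimal sketch; the path is made up, and as far as I understand each executor resolves a "file://" path on its own machine, so the logs would have to exist at the same path on every worker node):

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("local-read"))
        // "file://" makes Spark read from the local file system instead of HDFS.
        val logs = sc.textFile("file:///var/log/app/logs-2016-08-01-15")
        println(logs.count())
        sc.stop()
      }
    }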
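The REMOVE step of question 2 would then be something like this, via the Hadoop FileSystem API (sketch; the path is again made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object Cleanup {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())
        // Recursively delete the hour's raw logs once the visits are in HBase.
        fs.delete(new Path("/logs/2016-08-01-15"), true)
      }
    }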
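And the marker-file idea from question 3 could look like this (sketch; the "_COPIED" file name is my own convention, which the copy script would create as its very last step):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object WaitForMarker {
      // Block until every application server has finished copying its hourly logs,
      // i.e. until the copy script has dropped the marker file into the hour's dir.
      def waitForMarker(hourDir: String, pollMillis: Long = 30000L): Unit = {
        val fs = FileSystem.get(new Configuration())
        val marker = new Path(hourDir + "/_COPIED")
        while (!fs.exists(marker)) Thread.sleep(pollMillis)
      }

      def main(args: Array[String]): Unit = {
        waitForMarker("/logs/2016-08-01-15")
        // ...all files for the hour are available; fire the visit calculation.
      }
    }

If you have any questions, please don't hesitate to ask.

Thanks,
David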