Hi, I want to work on a use case something like the one below.
Just want to know if something similar has already been done that can be reused. The idea is to use Spark for ETL / Data Science / Streaming pipelines. When data arrives at the cluster front door we will do the following steps:

1) Upload the raw files onto HDFS.

2) The schema of the raw file is specified in a JSON file (other formats are also open to suggestion). We want to specify datatypes, field names, and whether a field is optional or required, for example: name string required. (A possible layout is sketched after the example flows below.)

3) Process the raw data uploaded in step 1 and check whether it conforms to the schema above. Push the good rows to a Hive table or HDFS, and push the error rows to an errors folder. (A Spark sketch of this step also follows below.)

4) The Hive table is created based on the schema we specify.

----------------------

Example user flows:

mycode.upload  mycode.validate  mycode.createHiveTable  mycode.loadHive

or

mycode.loadFromDatabase  mycode.validate  mycode.createHiveTable

or

mycode.loadFromDatabase  mycode.validate  mycode.storeToHdfs
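
For step 2, one low-effort option is to reuse Spark's own JSON representation of a StructType (the string produced by StructType.json), where "required" simply maps to "nullable": false. A hand-rolled layout such as "name string required" works too, but it needs its own small parser. As a sketch, a schema file for a three-column feed might look like:

{
  "type": "struct",
  "fields": [
    { "name": "name",  "type": "string",  "nullable": false, "metadata": {} },
    { "name": "age",   "type": "integer", "nullable": true,  "metadata": {} },
    { "name": "email", "type": "string",  "nullable": false, "metadata": {} }
  ]
}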
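For steps 3 and 4, a minimal sketch using Spark's Scala API could look like the following. The object and method names (RawFileValidator.validate, loadHive), the paths, and the CSV input format are assumptions for illustration that roughly mirror the mycode.validate / mycode.loadHive calls in the flows above. The technique is to rebuild the StructType from the schema file, read the raw data in PERMISSIVE mode with a _corrupt_record column, and split the result into good and bad rows.

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

object RawFileValidator {

  // Hypothetical entry point corresponding to "mycode.validate" above.
  // Assumes the schema file holds Spark's own JSON schema representation
  // (the string produced by StructType.json), so it can be parsed back directly.
  def validate(spark: SparkSession,
               rawPath: String,
               schemaPath: String,
               errorPath: String): DataFrame = {

    // Read the schema definition as a single JSON string and rebuild the StructType.
    val schemaJson = spark.read.textFile(schemaPath).collect().mkString
    val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

    // Read the raw file in PERMISSIVE mode: rows that do not fit the schema get a
    // populated _corrupt_record column instead of failing the whole job.
    val raw = spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema.add("_corrupt_record", "string"))
      .csv(rawPath)
      .cache()

    // Good rows parsed cleanly and have every required (non-nullable) field populated.
    val requiredNotNull = schema.fields.filterNot(_.nullable).map(f => col(f.name).isNotNull)
    val goodFilter = requiredNotNull.foldLeft(col("_corrupt_record").isNull)(_ && _)

    val good = raw.filter(goodFilter).drop("_corrupt_record")
    val bad  = raw.filter(!goodFilter)

    // Push the error rows to the errors folder on HDFS.
    bad.write.mode(SaveMode.Append).json(errorPath)

    good
  }

  // Hypothetical counterpart of "mycode.createHiveTable" / "mycode.loadHive": persist
  // the valid rows as a Hive table; saveAsTable creates the table from the DataFrame's
  // schema if it does not exist yet.
  def loadHive(good: DataFrame, table: String): Unit =
    good.write.mode(SaveMode.Append).saveAsTable(table)
}

Called roughly in the order of the first example flow (paths and the table name are placeholders; enableHiveSupport is needed so saveAsTable targets the Hive metastore):

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val good  = RawFileValidator.validate(spark, "/data/raw/customers.csv",
                                      "/schemas/customers.json", "/data/errors/customers")
RawFileValidator.loadHive(good, "customers")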