Hi,

I want to work on a use case, something like the below.

Just want to know if something similar has already been done that can be
reused.

The idea is to use Spark for an ETL / data science / streaming pipeline.

So when data comes in through the cluster front door, we will do the following steps:


1)

Upload raw files onto HDFS
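
The upload itself could be a plain hdfs dfs -put, or the Hadoop FileSystem API
from code. A minimal sketch of the latter, assuming the default Hadoop
configuration is on the classpath (both paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Copy a local raw file into an HDFS landing folder (placeholder paths).
val fs = FileSystem.get(new Configuration())
fs.copyFromLocalFile(
  new Path("file:///incoming/customers.csv"),
  new Path("/data/raw/incoming/customers.csv"))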

2)
The schema of the raw file is specified in a JSON file (other formats are also
open for suggestion). We want to specify data types, field names, and whether a
field is optional or required.

For example:

name string required
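
A minimal sketch of how such a JSON schema file could look (the layout and
field names here are just an assumption, not an existing format):

[
  { "name": "name", "type": "string",  "required": true  },
  { "name": "age",  "type": "integer", "required": false }
]

An alternative would be to reuse Spark's own schema serialization
(StructType.json / DataType.fromJson), since StructField's nullable flag
already covers the optional/required part.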

3)

Process the raw data uploaded in step 1 and check whether it conforms to the
schema above.
Push the good rows to a Hive table or HDFS.
Push the error rows to an errors folder.
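
Rough sketch of how I imagine the validation in Spark (Scala), assuming CSV
input and the example schema above; all paths, column names and types below
are placeholders. Spark's PERMISSIVE read mode with a corrupt-record column
does the parse-level checking, and the required-field check is a filter on top:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("raw-file-validation")
  .enableHiveSupport()
  .getOrCreate()

// Schema matching the "name string required" example; in practice this would
// be built from the JSON schema file. The extra _corrupt_record column is
// where Spark keeps rows it could not parse.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)))

val raw = spark.read
  .option("mode", "PERMISSIVE")                          // keep malformed rows instead of failing
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/data/raw/incoming/")                            // HDFS landing folder from step 1
  .cache()                                               // avoid re-reading the files for the two filters

// Error rows: failed to parse, or missing a required field.
val bad  = raw.filter(col("_corrupt_record").isNotNull || col("name").isNull)
val good = raw.filter(col("_corrupt_record").isNull && col("name").isNotNull)
              .drop("_corrupt_record")

bad.write.mode("append").json("/data/errors/customers/")     // errors folder
good.write.mode("append").parquet("/data/clean/customers/")  // or straight into Hive (step 4)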

4)

The Hive table is created based on the schema which we specify.
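
If the validated rows are already in a DataFrame with the right schema,
saveAsTable can create the table in the metastore from that schema when it
does not exist yet, so a separate CREATE TABLE step may not even be needed.
A minimal sketch, continuing from the good DataFrame in the previous sketch
(database and table names are placeholders):

// Creates mydb.customers from the DataFrame's schema if it is missing,
// then appends the validated rows.
good.write
  .mode("append")
  .saveAsTable("mydb.customers")

If I remember right, the default here is a Spark datasource table; writing
with .format("hive") should give a plain Hive-serde table if other Hive
clients also need to read it, but I have not double-checked that.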

----------------------

An example user flow could be:

mycode.upload
mycode.validate
mycode.createHiveTable
mycode.loadHive

or

mycode.loadFromDatabase
mycode.validate
mycode.createHiveTable

or

mycode.loadFromDatabase
mycode.validate
mycode.storeToHdfs
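
To make those flows concrete, the wrapper I have in mind would look roughly
like this; every name and signature below is a placeholder, not an existing
library:

import org.apache.spark.sql.DataFrame

object mycode {
  def upload(localPath: String, hdfsPath: String): Unit = ???
  def loadFromDatabase(jdbcUrl: String, table: String): DataFrame = ???
  def validate(rawPath: String, schemaPath: String): (DataFrame, DataFrame) = ??? // (good, bad)
  def createHiveTable(table: String, schemaPath: String): Unit = ???
  def loadHive(table: String, good: DataFrame): Unit = ???
  def storeToHdfs(good: DataFrame, path: String): Unit = ???
}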
