We are trying to implement Sqoop in our environment, which has 30 sharded MySQL servers. Each server hosts around 30 databases, and each database has about 150 tables, all horizontally sharded (meaning the data for a given table is split across the shards).
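To make the scale concrete, the idea would be to generate the per-table import commands along the lines of the sketch below. Everything in it is a placeholder rather than our real setup: the host names, the etl user and password file, the target directories, the shard/database/table counts, and the "id" check column.

#!/usr/bin/env python
# Sketch of generating one Sqoop import command per table across all shards.
# Hosts, credentials, paths, counts, and the check column are placeholders.

SHARDS = ["mysql-shard%02d.example.com" % i for i in range(1, 31)]  # ~30 shard servers
DATABASES = ["db%02d" % i for i in range(1, 31)]                    # ~30 databases per server
TABLES = ["table%03d" % i for i in range(1, 151)]                   # ~150 tables per database
# (these placeholder counts do not match our layout exactly; the real total is ~70K tables)

def full_import(host, db, table):
    return ("sqoop import"
            " --connect jdbc:mysql://{host}/{db}"
            " --username etl --password-file /user/etl/.mysql.pass"
            " --table {table}"
            " --target-dir /data/raw/{host}/{db}/{table}"
            " --num-mappers 4").format(host=host, db=db, table=table)

def incremental_import(host, db, table, last_value):
    # Incremental runs are the same command plus Sqoop's incremental-append flags;
    # the --last-value has to be tracked separately for every single table.
    return (full_import(host, db, table) +
            " --incremental append --check-column id --last-value {0}".format(last_value))

if __name__ == "__main__":
    for host in SHARDS:
        for db in DATABASES:
            for table in TABLES:
                print(full_import(host, db, table))

Every generated line ends up as its own Sqoop job (and its own MapReduce job), and the incremental variant means keeping --last-value state for every table as well.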
The problem is that we have a total of around 70K tables that need to be pulled from MySQL into HDFS. So my question is: is generating 70K Sqoop commands and running them in parallel feasible? Doing incremental updates would mean invoking another 70K Sqoop jobs, each of which in turn kicks off a MapReduce job, and the main problem is monitoring and managing this huge number of jobs. Can anyone suggest the best way of doing this, or say whether Sqoop is even a good candidate for this type of scenario?

Currently the same process is done by generating TSV files on the MySQL servers, dumping them onto a staging server, and from there generating hdfs put statements.
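At a high level, the current per-table flow looks something like the sketch below. Again, the hosts, credentials, and paths are placeholders, and the remote mysql dump is just a stand-in for generating the TSV on the MySQL server and copying it to staging.

# Rough outline of the current (non-Sqoop) flow for a single table: dump the
# table as TSV, then push the file into HDFS from the staging server.
# Hosts, credentials, and paths are placeholders; the mysql client password is
# assumed to come from the client config.
import subprocess

def export_table(host, db, table):
    tsv = "/staging/tsv/{0}.{1}.tsv".format(db, table)
    # 1. Dump the table as tab-separated output (mysql --batch prints TSV).
    with open(tsv, "w") as out:
        subprocess.check_call(
            ["mysql", "-h", host, "-u", "etl", "--batch",
             "-e", "SELECT * FROM {0}".format(table), db],
            stdout=out)
    # 2. Push the TSV into HDFS.
    subprocess.check_call(
        ["hdfs", "dfs", "-put", "-f", tsv,
         "/data/raw/{0}/{1}/{2}/".format(host, db, table)])

Multiplying that by roughly 70K tables per load is what we are hoping to replace with Sqoop.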
Appreciate your suggestions!

Thanks,
Srinivas Surasani