Hi Srinivas Surasani,

What is the size of your data? A single Sqoop command imports a 75 GB table from SQL Server into a Hive table in about 10 minutes with 40 mappers. The SQL Server table has a clustered index on rowid. The schema of the Hive table does not need to be created separately; it is created automatically by the Sqoop import.
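Something along these lines is roughly what that import looks like; the host, database, table, and user names below are placeholders, so adjust the connect string and split column for your own setup:

  # SQL Server -> Hive import, 40 parallel mappers split on the clustered-index column
  sqoop import \
    --connect "jdbc:sqlserver://dbhost:1433;databaseName=salesdb" \
    --username sqoop_user -P \
    --table big_table \
    --split-by rowid \
    --num-mappers 40 \
    --hive-import \
    --hive-table big_table

The --split-by column is what lets the 40 mappers carve the table into even ranges (the clustered index on rowid helps a lot there), and --hive-import is what creates the Hive table and its schema for you.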
The Sqoop tool is amazing. You would, I think, have to set up the 70K imports, split them into separate batches of Sqoop jobs, and run the batches in parallel. I handle incremental updates by using partitions in the Hive table. Since you are already getting the data into HDFS, it looks like your cluster can handle the volume of your data. I have put a rough sketch of how I would generate and batch the jobs below the quoted message.

Happy sqooping!
Chalcy

On Thu, May 31, 2012 at 5:59 PM, Srinivas Surasani <[email protected]> wrote:
> We are trying to implement Sqoop in our environment, which has 30 sharded
> MySQL databases; each has around 30 databases with 150 tables per
> database, all of which are sharded (horizontally sharded, meaning the
> data is divided across all the tables in MySQL).
>
> The problem is that we have a total of around 70K tables which need to
> be pulled from MySQL into HDFS.
>
> So, my question is: is generating 70K Sqoop commands and running them in
> parallel feasible or not?
>
> Also, doing incremental updates would mean invoking another 70K Sqoop
> jobs, which in turn kick off map-reduce jobs.
>
> The main problem is monitoring and managing this huge number of jobs.
>
> Can anyone suggest the best way of doing this, or is Sqoop a good
> candidate for this type of scenario?
>
> Currently the same process is done by generating TSV files on the MySQL
> server, dumping them onto a staging server, and from there generating
> HDFS put statements.
>
> Appreciate your suggestions!!!
>
>
> Thanks,
> Srinivas Surasani
>
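P.S. Here is the kind of batching I had in mind. This is only a sketch: the shard list file, hostnames, options file holding the MySQL credentials, Hive table naming, and batch size are all made up for illustration, and you would tune the mapper count and the number of concurrent batches to what your MySQL shards and cluster can absorb.

  #!/bin/bash
  # shard_tables.txt (hypothetical): one line per table, e.g. "shard01 salesdb_07 orders_0042"
  # Generate one single-line sqoop command per (shard, database, table).
  while read shard db table; do
    echo "sqoop import --options-file /etc/sqoop/mysql_creds.opts" \
         "--connect jdbc:mysql://${shard}:3306/${db}" \
         "--table ${table} --num-mappers 4" \
         "--hive-import --hive-table ${db}_${table}"
  done < shard_tables.txt > sqoop_jobs.txt

  # Run roughly 10 imports at a time: split the command list into batches and
  # run each batch sequentially in its own background shell.
  split -l 7000 sqoop_jobs.txt batch_
  for f in batch_*; do
    bash "$f" &
  done
  wait

For the incremental side, one way that fits partitioned Hive tables is to pull only the new rows into a dated directory and register it as a partition; again, the check column, last value, and paths below are placeholders:

  # assumes the Hive table was created with PARTITIONED BY (dt STRING)
  sqoop import --options-file /etc/sqoop/mysql_creds.opts \
    --connect jdbc:mysql://shard01:3306/salesdb_07 --table orders_0042 \
    --incremental append --check-column id --last-value 1234567 \
    --target-dir /staging/salesdb_07/orders_0042/dt=2012-06-01
  hive -e "ALTER TABLE salesdb_07_orders_0042 ADD PARTITION (dt='2012-06-01') LOCATION '/staging/salesdb_07/orders_0042/dt=2012-06-01'"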
