> > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > easy, since iirc Sqoop does generate an input format for you.

Yes, but if I remember correctly (it has been quite some time since I
looked at Sqoop), Sqoop generates classes based on SQL the user provides.
Unless you suggest using the input format classes only as a starting point?
That would probably work...
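
For what it's worth, the generation step I have in mind looks roughly like
this (a sketch only; the connect string, credentials, and table name are
placeholders, and I have not re-tested this against a current Sqoop):
**********
# Ask Sqoop to generate its record class for one table; the .java
# source is written to --outdir and compiled for use by the job.
sqoop codegen \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> \
  --outdir /tmp/sqoop-gen
**********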

> > Good project for someone looking to get started in contributing to Pig

It is tempting. :)

> Also, Sqoop only needs to be installed on the client machine; it doesn't
> require modifying your Hadoop deployment on your servers anywhere. If
> you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> likely you've already got the JDK on this machine already.

I am running Pig remotely, not from a local machine. But this will change
soon, so it will not be a problem for me anymore.

Thanks,

Anze

On Friday 05 November 2010, Aaron Kimball wrote:
> By default, Sqoop should load files into HDFS as delimited text. The
> existing PigStorage should be able to work with the data loaded here.
> Similarly, if you use PigStorage to write data back to HDFS in delimited
> text form, Sqoop can export those files to your RDBMS.
>
> Also, Sqoop only needs to be installed on the client machine; it doesn't
> require modifying your Hadoop deployment on your servers anywhere. If
> you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> likely you've already got the JDK on this machine already.
>
> - Aaron
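
Concretely, the round trip Aaron describes might look like this (the
connect string, table names, field names, and paths are invented
placeholders; double-check the flags against your Sqoop release):
**********
# 1. Import one table as comma-delimited text files in HDFS:
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> \
  --fields-terminated-by ',' \
  --target-dir /pig/mysql/<table>

# 2. Read it back with the stock PigStorage loader and store it again:
pig -e "a = LOAD '/pig/mysql/<table>' USING PigStorage(',')
            AS (id:int, total:double);
        STORE a INTO '/pig/out/<table>' USING PigStorage(',');"

# 3. Export the PigStorage output back to the database:
sqoop export \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <export_table> \
  --input-fields-terminated-by ',' \
  --export-dir /pig/out/<table>
**********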

> On Thu, Nov 4, 2010 at 3:22 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > easy, since iirc Sqoop does generate an input format for you.
> >
> > Good project for someone looking to get started in contributing to Pig
> > ...
> >
> > :)
> >
> > -D
> >
> > 2010/11/4 Anze <[email protected]>
> > > Hi Arvind!
> > >
> > > Should we take this discussion off the list? It is not really
> > > Pig-related anymore... Not sure what the custom is around here. :)
> > >
> > > > process. Apart from just the speed, Sqoop offers many other
> > > > advantages too, such as incremental loads, exporting data from
> > > > HDFS back to the database, automatic creation of Hive tables, or
> > > > populating HBase, etc.
> > >
> > > Only Pig is missing then... >:-D
> > > Sorry, couldn't hold that back... ;)
> > >
> > > I would love to use Sqoop for another task (periodically importing
> > > MySQL tables into HBase) if the schema is more or less preserved,
> > > but I don't dare upgrade the JRE to a JDK at the moment for fear of
> > > breaking things.
> > >
> > > > Anze - I just checked that our Sqoop packages do declare the JDK
> > > > dependency. Which package did you see as not having this
> > > > dependency?
> > >
> > > We are using:
> > > -----
> > > deb http://archive.cloudera.com/debian lenny-cdh3b1 contrib
> > > -----
> > > But there is no sqoop package per se; I guess it is part of the
> > > hadoop package:
> > > -----
> > > $ aptitude show hadoop-0.20 | grep Depends
> > > Depends: adduser, sun-java6-jre, sun-java6-bin
> > > -----
> > > $ aptitude search sun-java6 | grep "jdk\|jre"
> > > p   sun-java6-jdk - Sun Java(TM) Development Kit (JDK) 6
> > > i A sun-java6-jre - Sun Java(TM) Runtime Environment (JRE) 6
> > > -----
> > >
> > > This is where aa...@cloudera advises that the JDK (instead of the
> > > JRE) is needed for successfully running Sqoop:
> > > http://getsatisfaction.com/cloudera/topics/error_sqoop_sqoop_got_exception_running_sqoop_java_lang_nullpointerexception_java_lang_nullpointerexception-j7ziz
> > >
> > > As I said, I am interested in Sqoop (and alternatives), as we will
> > > be facing this problem in the near future, so I appreciate your
> > > involvement in this thread!
> > >
> > > Anze
> > >
> > > On Thursday 04 November 2010, [email protected] wrote:
> > > > Anze - I just checked that our Sqoop packages do declare the JDK
> > > > dependency. Which package did you see as not having this
> > > > dependency?
> > > >
> > > > Arvind
> > > >
> > > > On Thu, Nov 4, 2010 at 9:25 AM, [email protected]
> > > > <[email protected]> wrote:
> > > > > Sqoop is Java based and you should have JDK 1.6 or higher
> > > > > available on your system. We will add this as a dependency for
> > > > > the package.
> > > > >
> > > > > Regarding accessing MySQL from a cluster - it should not be a
> > > > > problem if you control the number of tasks that do it. Sqoop
> > > > > allows you to explicitly specify the number of mappers, where
> > > > > each mapper holds a connection to the database and effectively
> > > > > parallelizes the loading process. Apart from just the speed,
> > > > > Sqoop offers many other advantages too, such as incremental
> > > > > loads, exporting data from HDFS back to the database, automatic
> > > > > creation of Hive tables, or populating HBase, etc.
> > > > >
> > > > > Arvind
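
Concretely, the knobs Arvind mentions map to flags along these lines (the
connect string, table, check column, and values are invented placeholders;
availability of the incremental flags depends on the Sqoop release):
**********
# Hold the parallel DB connections to 4 mappers, and only pull rows
# whose id is greater than the value seen on the previous run:
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> \
  --num-mappers 4 \
  --incremental append --check-column id --last-value <last_id>

# The same import can create and populate a Hive table instead:
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> --hive-import
**********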

> > > > >
> > > > > 2010/11/4 Anze <[email protected]>
> > > > > > So Sqoop doesn't require a JDK? That seemed weird to me too.
> > > > > > Also, if it did require one, the JDK would probably have to
> > > > > > be among the dependencies of the package Sqoop is in.
> > > > > >
> > > > > > I started working on a DBLoader, but the learning curve seems
> > > > > > quite steep and I don't have enough time for it right now.
> > > > > > Also, as Ankur said, it might not be a good idea to hit MySQL
> > > > > > from the cluster.
> > > > > >
> > > > > > The ideal solution IMHO would be loading data from MySQL to
> > > > > > HDFS from a single machine (but within a LoadFunc, of course)
> > > > > > and working with the data from there (with the schema
> > > > > > automatically converted from MySQL). But I don't know enough
> > > > > > about Pig to do that kind of thing... yet. :)
> > > > > >
> > > > > > Anze
> > > > > >
> > > > > > On Wednesday 03 November 2010, [email protected] wrote:
> > > > > > > Sorry that you ran into a problem. Typically it is
> > > > > > > something like a missing required option that causes this,
> > > > > > > and if you sent a mail to [email protected], you
> > > > > > > would get prompt assistance. Regardless, if you still have
> > > > > > > any use cases like this, I will be glad to help you out in
> > > > > > > using Sqoop for that purpose.
> > > > > > >
> > > > > > > Arvind
> > > > > > >
> > > > > > > 2010/11/3 Anze <[email protected]>
> > > > > > > > I tried to run it, got a NullPointerException, searched
> > > > > > > > the net, found that Sqoop requires a JDK (instead of a
> > > > > > > > JRE) and gave up. I am working on a production cluster -
> > > > > > > > so I'd rather not upgrade to a JDK if not necessary. :)
> > > > > > > >
> > > > > > > > But I was able to export MySQL with a simple bash script:
> > > > > > > > **********
> > > > > > > > #!/bin/bash
> > > > > > > >
> > > > > > > > MYSQL_TABLES=( table1 table2 table3 )
> > > > > > > > WHERE=/home/hadoop/pig
> > > > > > > >
> > > > > > > > for i in "${MYSQL_TABLES[@]}"
> > > > > > > > do
> > > > > > > >     # dump the table as tab-separated text, no header row
> > > > > > > >     mysql -BAN -h <mysql_host> -u <username> \
> > > > > > > >         --password=<pass> <database> \
> > > > > > > >         -e "select * from $i;" --skip-column-names \
> > > > > > > >         > "$WHERE/$i.csv"
> > > > > > > >     # push the dump into HDFS, then drop the local copy
> > > > > > > >     hadoop fs -copyFromLocal "$WHERE/$i.csv" /pig/mysql/
> > > > > > > >     rm "$WHERE/$i.csv"
> > > > > > > > done
> > > > > > > > **********
> > > > > > > > Of course, in my case the tables were small enough that
> > > > > > > > I could do it this way. And of course I lost the schema
> > > > > > > > in the process.
> > > > > > > >
> > > > > > > > Hope it helps someone else too...
> > > > > > > >
> > > > > > > > Anze
> > > > > > > >
> > > > > > > > On Wednesday 03 November 2010, [email protected]
> > > > > > > > wrote:
> > > > > > > > > Anze,
> > > > > > > > >
> > > > > > > > > Did you get a chance to try out Sqoop? If not, I would
> > > > > > > > > encourage you to do so. Here is a link to the user
> > > > > > > > > guide:
> > > > > > > > > http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html
> > > > > > > > >
> > > > > > > > > Sqoop allows you to easily move data from relational
> > > > > > > > > databases and other enterprise systems to HDFS and
> > > > > > > > > back.
> > > > > > > > >
> > > > > > > > > Arvind
> > > > > > > > >
> > > > > > > > > 2010/11/3 Anze <[email protected]>
> > > > > > > > > > Alejandro, thanks for answering!
> > > > > > > > > >
> > > > > > > > > > I was hoping it could be done directly from Pig,
> > > > > > > > > > but... :)
> > > > > > > > > >
> > > > > > > > > > I'll take a look at Sqoop then, and if that doesn't
> > > > > > > > > > help, I'll just write a simple batch job to export
> > > > > > > > > > the data to TXT/CSV. Thanks for the pointer!
> > > > > > > > > >
> > > > > > > > > > Anze
> > > > > > > > > >
> > > > > > > > > > On Wednesday 03 November 2010, Alejandro Abdelnur
> > > > > > > > > > wrote:
> > > > > > > > > > > Not a 100% Pig solution, but you could use Sqoop
> > > > > > > > > > > to get the data in as a pre-processing step. And
> > > > > > > > > > > if you want to handle it all as a single job, you
> > > > > > > > > > > could use Oozie to create a workflow that does
> > > > > > > > > > > Sqoop and then your Pig processing.
> > > > > > > > > > >
> > > > > > > > > > > Alejandro
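
A minimal sketch of Alejandro's two-step pipeline as a plain shell script
(table name, paths, and the Pig script name are placeholders; Oozie would
coordinate the same two steps as workflow actions instead of a script):
**********
#!/bin/bash
# Step 1: the Sqoop pre-processing step - pull the table into HDFS.
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> --target-dir /pig/mysql/<table> \
  || exit 1

# Step 2: run the Pig script over the freshly imported data.
pig -param INPUT=/pig/mysql/<table> process.pig
**********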

> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 3, 2010 at 3:22 PM, Anze
> > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > > Hi!
> > > > > > > > > > > >
> > > > > > > > > > > > Part of the data I have resides in MySQL. Is
> > > > > > > > > > > > there a loader that would allow loading directly
> > > > > > > > > > > > from it?
> > > > > > > > > > > >
> > > > > > > > > > > > I can't find anything on the net, but it seems
> > > > > > > > > > > > to me this must be quite a common problem. I
> > > > > > > > > > > > checked piggybank, but there is only DBStorage
> > > > > > > > > > > > (and no DBLoader).
> > > > > > > > > > > >
> > > > > > > > > > > > Is some DBLoader out there too?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Anze
