Hi Arvind!

Should we take this discussion off the list? It is not really Pig-related 
anymore... Not sure what the custom is around here. :)

> > process. Apart from just the speed, Sqoop offers many other advantages
> > too such as incremental loads, exporting data from HDFS back to the
> > database, automatic creation of Hive tables or populating hbase etc.

Only Pig is missing then... >:-D
Sorry, couldn't hold that back... ;)
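Just to make the quoted point concrete for the archives: the "exporting data from HDFS back to the database" direction would, as far as I can tell from the user guide (I have not actually run this - host, credentials, table and path below are all made-up placeholders), look something like:

```shell
# Hypothetical sketch only: pushes tab-separated records from an HDFS
# directory back into an existing MySQL table via JDBC.
sqoop export \
  --connect jdbc:mysql://mysql_host/mydb \
  --username myuser --password mypass \
  --table table1 \
  --export-dir /pig/mysql/table1 \
  --input-fields-terminated-by '\t'
```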

I would love to use Sqoop for another task (periodically importing MySQL 
tables to HBase) if the schema gets more or less preserved; however, I don't 
dare upgrade the JRE to a JDK at the moment for fear of breaking things.
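For my own future reference (once we have a JDK on those machines): if I read the Sqoop user guide correctly, the periodic MySQL-to-HBase import should look roughly like the sketch below. Untested on our side, and the connect string, table, check column and column family are invented placeholders:

```shell
# Hypothetical sketch: imports only rows whose id exceeds the last recorded
# value, writing them into an HBase table (created if missing).
sqoop import \
  --connect jdbc:mysql://mysql_host/mydb \
  --username myuser --password mypass \
  --table table1 \
  --hbase-table table1 \
  --column-family d \
  --hbase-create-table \
  --incremental append \
  --check-column id \
  --last-value 0
```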

> Anze - I just checked that our Sqoop packages do declare the JDK
> dependency. Which package did you see as not having this dependency?

We are using:
-----
deb http://archive.cloudera.com/debian lenny-cdh3b1 contrib
-----
But there is no sqoop package per se - I guess it is part of the hadoop package:
-----
$ aptitude show hadoop-0.20 | grep Depends
Depends: adduser, sun-java6-jre, sun-java6-bin
-----
$ aptitude search sun-java6 | grep "jdk\|jre"
p   sun-java6-jdk                   - Sun Java(TM) Development Kit (JDK) 6
i A sun-java6-jre                   - Sun Java(TM) Runtime Environment (JRE) 6
-----
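By the way, a quick sanity check for whether a box has a full JDK (and not just the JRE) before blaming Sqoop - just a generic shell sketch, nothing Sqoop-specific:

```shell
# Sqoop's code generation step needs the compiler (javac) from a JDK;
# a JRE alone ships only the java launcher.
if command -v javac >/dev/null 2>&1; then
  echo "JDK present: $(javac -version 2>&1)"
else
  echo "javac not found - only a JRE is installed?"
fi
```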


This is where aa...@cloudera advises that JDK is needed (instead of JRE) for 
successful running of sqoop:
http://getsatisfaction.com/cloudera/topics/error_sqoop_sqoop_got_exception_running_sqoop_java_lang_nullpointerexception_java_lang_nullpointerexception-j7ziz

As I said, I am interested in Sqoop (and alternatives) as we will be facing 
this problem in the near future, so I appreciate your involvement in this thread!

Anze


On Thursday 04 November 2010, [email protected] wrote:
> Anze - I just checked that our Sqoop packages do declare the JDK
> dependency. Which package did you see as not having this dependency?
> 
> Arvind
> 
> On Thu, Nov 4, 2010 at 9:25 AM, [email protected]
> <[email protected]> wrote:
> > Sqoop is Java based and you should have JDK 1.6 or higher available on
> > your system. We will add this as a dependency for the package.
> > 
> > Regarding accessing MySQL from a cluster - it should not be a problem if
> > you control the number of tasks that do that. Sqoop allows you to
> > explicitly specify the number of mappers, where each mapper holds a
> > connection to the database and effectively parallelizes the loading
> > process. Apart from just the speed, Sqoop offers many other advantages
> > too such as incremental loads, exporting data from HDFS back to the
> > database, automatic creation of Hive tables or populating hbase etc.
> > 
> > Arvind
> > 
> > 2010/11/4 Anze <[email protected]>
> > 
> >> So Sqoop doesn't require JDK?
> >> It seemed weird to me too. Also, if it did require it, then JDK would
> >> probably have to be among the dependencies of the package Sqoop is in.
> >> 
> >> I started working on DBLoader, but the learning curve seems quite steep
> >> and I don't have enough time for it right now. Also, as Ankur said, it
> >> might not be a good idea to hit MySQL from the cluster.
> >> 
> >> The ideal solution IMHO would be loading data from MySQL to HDFS from a
> >> single machine (but within LoadFunc, of course) and working with the data
> >> from there (with the schema automatically converted from MySQL). But I
> >> don't know enough about Pig to do that kind of thing... yet. :)
> >> 
> >> Anze
> >> 
> >> On Wednesday 03 November 2010, [email protected] wrote:
> >> > Sorry that you ran into a problem. Typically, it is something like a
> >> > missing required option etc. that could cause this, and if you were to
> >> > send a mail to [email protected], you would get prompt assistance.
> >> > 
> >> > Regardless, if you still have any use cases like this, I will be glad
> >> > to help you out in using Sqoop for that purpose.
> >> > 
> >> > Arvind
> >> > 
> >> > 2010/11/3 Anze <[email protected]>
> >> > 
> >> > > I tried to run it, got NullPointerException, searched the net, found
> >> > > Sqoop requires JDK (instead of JRE) and gave up. I am working on a
> >> > > production cluster - so I'd rather not upgrade to JDK if not
> >> > > necessary. :)
> >> > > 
> >> > > But I was able to export MySQL with a simple bash script:
> >> > > **********
> >> > > #!/bin/bash
> >> > > 
> >> > > MYSQL_TABLES=( table1 table2 table3 )
> >> > > WHERE=/home/hadoop/pig
> >> > > 
> >> > > for i in "${MYSQL_TABLES[@]}"
> >> > > do
> >> > >   mysql -BAN -h <mysql_host> -u <username> --password=<pass> <database> \
> >> > >     -e "select * from $i;" --skip-column-names > $WHERE/$i.csv
> >> > >   hadoop fs -copyFromLocal $WHERE/$i.csv /pig/mysql/
> >> > >   rm $WHERE/$i.csv
> >> > > done
> >> > > **********
> >> > > 
> >> > > Of course, in my case the tables were small enough so I could do it.
> >> > > And of course I lost the schema in the process.
> >> > > 
> >> > > Hope it helps someone else too...
> >> > > 
> >> > > Anze
> >> > > 
> >> > > On Wednesday 03 November 2010, [email protected] wrote:
> >> > > > Anze,
> >> > > > 
> >> > > > Did you get a chance to try out Sqoop? If not, I would encourage
> >> > > > you to do so. Here is a link to the user guide:
> >> > > > <http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html>
> >> > > > 
> >> > > > Sqoop allows you to easily move data across from relational
> >> > > > databases and other enterprise systems to HDFS and back.
> >> > > > 
> >> > > > Arvind
> >> > > > 
> >> > > > 2010/11/3 Anze <[email protected]>
> >> > > > 
> >> > > > > Alejandro, thanks for answering!
> >> > > > > 
> >> > > > > I was hoping it could be done directly from Pig, but... :)
> >> > > > > 
> >> > > > > I'll take a look at Sqoop then, and if that doesn't help, I'll
> >> > > > > just write a simple batch to export data to TXT/CSV. Thanks for
> >> > > > > the pointer!
> >> > > > > 
> >> > > > > Anze
> >> > > > > 
> >> > > > > On Wednesday 03 November 2010, Alejandro Abdelnur wrote:
> >> > > > > > Not a 100% Pig solution, but you could use Sqoop to get the
> >> > > > > > data in as a pre-processing step. And if you want to handle it
> >> > > > > > all as a single job, you could use Oozie to create a workflow
> >> > > > > > that does Sqoop and then your Pig processing.
> >> > > > > > 
> >> > > > > > Alejandro
> >> > > > > > 
> >> > > > > > On Wed, Nov 3, 2010 at 3:22 PM, Anze <[email protected]>
> >> 
> >> wrote:
> >> > > > > > > Hi!
> >> > > > > > > 
> >> > > > > > > Part of the data I have resides in MySQL. Is there a loader
> >> > > > > > > that would allow loading directly from it?
> >> > > > > > > 
> >> > > > > > > I can't find anything on the net, but it seems to me this
> >> > > > > > > must be quite a common problem.
> >> > > > > > > I checked piggybank but there is only DBStorage (and no
> >> > > > > > > DBLoader).
> >> > > > > > > 
> >> > > > > > > Is some DBLoader out there too?
> >> > > > > > > 
> >> > > > > > > Thanks,
> >> > > > > > > 
> >> > > > > > > Anze
