Actually, I started using Spark to import data from an RDBMS (Oracle in this
case) after upgrading to Hive 2. I was running an import like the one below:

sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" \
    --username scratchpad -P \
    --query "select * from scratchpad.dummy2 where \$CONDITIONS" \
    --split-by ID \
    --hive-import --hive-table "test.dummy2" \
    --target-dir "/tmp/dummy2" --direct

This gets the data into HDFS but then throws this error:

ERROR [main] tool.ImportTool: Imported Failed: No enum constant
org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

I could easily load the data into Hive from the file already on HDFS, or dig
into the problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5), but I find Spark
trouble-free, as in the example below:

val df = HiveContext.read.format("jdbc").options(
  Map("url" -> dbURL,
      "dbtable" -> "scratchpad.dummy",
      "partitionColumn" -> partitionColumnName,
      "lowerBound" -> lowerBoundValue,
      "upperBound" -> upperBoundValue,
      "numPartitions" -> numPartitionsValue,
      "user" -> dbUserName,
      "password" -> dbPassword)).load

It does work: it opens parallel connections to the Oracle DB and creates a
DataFrame with the specified number of partitions.
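Once loaded, the DataFrame can be written straight into Hive from the same
session, which avoids the Sqoop step entirely. A minimal sketch (the target
table name "test.dummy" is illustrative, and this assumes the df from the
snippet above):

```scala
// Write the JDBC-sourced DataFrame into a Hive-managed table.
// mode("overwrite") replaces the table if it already exists;
// use "append" instead to add to an existing table.
df.write
  .mode("overwrite")
  .saveAsTable("test.dummy")
```

Because the DataFrame already carries the partitioning from the JDBC read,
the write is parallelised across those partitions as well.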

One thing I am not sure about (nor have tried) is whether Spark supports
direct mode yet.

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 09:07, Bhaskar Dutta <bhas...@gmail.com> wrote:

> Which RDBMS are you using here, and what is the data volume and frequency
> of pulling data off the RDBMS?
> Specifying these would help in giving better answers.
>
> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and Oracle,
> so you can use that for better performance if using one of these databases.
>
> And don't forget that Sqoop can load data directly into Parquet or
> Avro (I think direct mode is not supported in this case).
> Also you can use Kite SDK with Sqoop to manage/transform datasets, perform
> schema evolution and such.
>
> ~bhaskar
>
>
> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
> mail.venkatakart...@gmail.com> wrote:
>
>> Team,
>> Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS.
>> Sqoop has a lot of optimizations for fetching data; does Spark JDBC have
>> those as well?
>>
>> I'm performing some analytics using Spark on data that resides in an RDBMS.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>>
>
