Hi Roshan,
Sqoop currently first exports data to a separate directory on HDFS (let's 
call it the "intermediate" directory) and then moves the data into Hive. If I 
read the thread correctly, your intermediate directory is called "users" and 
lives in your home directory (/user/$username/users), whereas the Hive data 
are most likely stored in /user/hive/warehouse/users. Krishnan's solution of 
removing the directory should remove only the empty intermediate directory on 
HDFS; it definitely shouldn't remove your actual data in Hive.
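
If you want to double-check before removing anything, here is a minimal sketch 
(assuming the default paths above; substitute your own $username and table name):

  # list the intermediate directory in your HDFS home
  hadoop dfs -ls /user/$username/users
  # list the Hive warehouse copy, which is left untouched
  hadoop dfs -ls /user/hive/warehouse/users
  # remove only the intermediate directory
  hadoop dfs -rmr /user/$username/users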

I've already noticed this issue and provided a patch in JIRA SQOOP-443. It's 
not committed yet, but feel free to check out the current trunk and apply the 
attached patch.

https://issues.apache.org/jira/browse/SQOOP-443
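
In case it helps, a rough sketch of applying it (the repository URL, patch file 
name, and build target below are placeholders/assumptions rather than exact 
values; check the JIRA attachment and the build instructions in trunk):

  # check out the current Sqoop trunk (substitute the real repository URL)
  svn checkout <sqoop-trunk-url> sqoop-trunk
  cd sqoop-trunk
  # apply the patch attached to SQOOP-443 (substitute the actual file name)
  patch -p0 < SQOOP-443.patch
  # rebuild Sqoop (the ant target may differ in your checkout)
  ant package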

Jarcec

On Apr 16, 2012, at 6:29 AM, Roshan Pradeep wrote:

> Hi Krishnan
> 
> Thanks for your perfect explanation.
> 
> In that case, can I use a SQL command to filter the data from PostgreSQL and 
> push it to Hive? Every time I delete the HDFS folder and re-import, the full 
> 1.12 GB of data is imported back into Hadoop from PostgreSQL.
> 
> I am trying to minimize the network traffic and hope you can understand my 
> requirement.
> 
> Thanks again.
> 
> On Mon, Apr 16, 2012 at 1:49 PM, Krishnan K <[email protected]> wrote:
> Hi Roshan,
> 
> Yes, because Sqoop first pulls the data into HDFS before pushing it into the 
> Hive table. The "FileAlreadyExistsException" occurs at the HDFS level, not at 
> the Hive end.
> 
> You may want to write a shell script that:
> 1) first deletes the existing directory in HDFS, and then
> 2) executes the sqoop command (see the sketch below).
> 
> You can now run the shell script to perform your import operation from 
> PostgreSQL to Hive.
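> 
> For example, a minimal sketch of such a wrapper script (the paths, username, 
> and password are taken from your command and are placeholders; adjust them to 
> your environment):
> 
> #!/bin/bash
> # 1) delete the intermediate import directory left over from the previous run
> #    (prints a harmless error if it does not exist yet)
> hadoop dfs -rmr users
> # 2) run the sqoop import again
> /app/sqoop/bin/sqoop import --hive-table users \
>   --connect jdbc:postgresql://<server_url>/<database> \
>   --table users --username XXXXXXX --password YYYYYY \
>   --hive-home /app/hive --hive-import \
>   --incremental lastmodified --check-column lastmodifiedtime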
> 
> Regards,
> Krishnan
> 
> 
> 
> On Mon, Apr 16, 2012 at 6:39 AM, Roshan Pradeep <[email protected]> wrote:
> Hi 
> 
> Thanks for the quick reply.
> 
> Regarding my incremental import, do I need to remove the directory every time 
> I want to re-import the updated data from PostgreSQL?
> 
> I am working with several GB of data, so I want to confirm this.
> 
> Thanks.
> 
> 
> On Mon, Apr 16, 2012 at 9:48 AM, Krishnan K <[email protected]> wrote:
> Hi Roshan,
> 
> If you have run the sqoop command once, even if it fails, it creates an 
> output directory in HDFS.
> You can delete this folder (users) and then try running the sqoop command 
> once again:
> 
> hadoop dfs -rmr users
> 
> Sqoop first imports the data from PostgreSQL into HDFS and 
> then moves it into the Hive default directory (/user/hive/warehouse/<tablename>).
> 
> For Sqoop to copy the data into HDFS first, you must ensure that the output 
> directory does not already exist (see the check below).
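> 
> As a minimal sketch (the directory name "users" is assumed from your command):
> 
> # remove the leftover directory only if it already exists,
> # then re-run the sqoop import
> hadoop dfs -test -d users && hadoop dfs -rmr users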
> 
> -Krishnan
> 
> 
> On Mon, Apr 16, 2012 at 4:49 AM, Roshan Pradeep <[email protected]> wrote:
> Hi All
> 
> I want to import the updated data from my source (PostgreSQL) into Hive, based 
> on a column (lastmodifiedtime) in PostgreSQL.
> 
> The command I am using
> 
> /app/sqoop/bin/sqoop import --hive-table users --connect 
> jdbc:postgresql://<server_url>/<database> --table users --username XXXXXXX 
> --password YYYYYY --hive-home /app/hive --hive-import --incremental 
> lastmodified --check-column lastmodifiedtime
> 
> With the above command, I am getting the below error
> 
> 12/04/13 16:31:21 INFO orm.CompilationManager: Writing jar file: 
> /tmp/sqoop-root/compile/11ce8600a5656ed49e631a260c387692/users.jar
> 12/04/13 16:31:21 INFO tool.ImportTool: Incremental import based on column 
> "lastmodifiedtime"
> 12/04/13 16:31:21 INFO tool.ImportTool: Upper bound value: '2012-04-13 
> 16:31:21.865429'
> 12/04/13 16:31:21 WARN manager.PostgresqlManager: It looks like you are 
> importing from postgresql.
> 12/04/13 16:31:21 WARN manager.PostgresqlManager: This transfer can be 
> faster! Use the --direct
> 12/04/13 16:31:21 WARN manager.PostgresqlManager: option to exercise a 
> postgresql-specific fast path.
> 12/04/13 16:31:21 INFO mapreduce.ImportJobBase: Beginning import of users
> 12/04/13 16:31:23 ERROR tool.ImportTool: Encountered IOException running 
> import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output 
> directory users already exists
>         at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
>         at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>         at 
> org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
>         at 
> org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:201)
>         at 
> org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:413)
>         at 
> org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:102)
>         at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
>         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
>         at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
>         at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
>         at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
> 
> According to the above output, Sqoop identifies the updated data from 
> PostgreSQL, but the job fails because the output directory already exists. 
> Could someone please help me correct this issue?
> 
> I am using 
> 
> Hadoop - 0.20.2
> Hive - 0.8.1
> Sqoop - 1.4.1-incubating
> 
> 
> Thanks.
> 
