The best way to solve this is to load the data into a different partition each
time you load it. (Depending on your data, you can partition by date or by a
date-hour combination, matching the frequency at which you load the data.)
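
For example, something along these lines (a rough sketch, not tested against
your setup; the table name, columns, partition value, and paths are made up,
and I am assuming the Sqoop 1.4.x --hive-partition-key/--hive-partition-value
and --target-dir options here):

    -- one-time Hive DDL: a table partitioned by load date (columns illustrative)
    CREATE TABLE users_part (id INT, name STRING, lastmodifiedtime STRING)
    PARTITIONED BY (load_date STRING);

    # per-run import: write into a fresh partition and a fresh staging directory
    /app/sqoop/bin/sqoop import \
      --connect jdbc:postgresql://<server_url>/<database> --table users \
      --username XXXXXXX --password YYYYYY \
      --hive-home /app/hive --hive-import --hive-table users_part \
      --hive-partition-key load_date --hive-partition-value 2012-04-16 \
      --target-dir /tmp/sqoop-staging/users_2012-04-16 \
      --incremental lastmodified --check-column lastmodifiedtime \
      --last-value '2012-04-13 16:31:21'

Since --target-dir changes on every run, you should not hit the "output
directory already exists" error again, and each load ends up in its own Hive
partition.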

I am not sure how you are installing Sqoop. If you are using yum on Red Hat,
you can try doing a yum update; on Debian/Ubuntu you can use apt-get to update.
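
For example (the exact package name depends on how Sqoop was installed, e.g.
via the CDH repositories, so treat "sqoop" below as a placeholder):

    # RHEL/CentOS with a yum repository that provides Sqoop
    sudo yum update sqoop

    # Debian/Ubuntu with an apt repository that provides Sqoop
    sudo apt-get update && sudo apt-get install --only-upgrade sqoop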

In Hive 0.8.0 there is an option to append to already existing data, but in
that case you will need to make sure that the data does not get duplicated.
So partitioning the data is the simplest and easiest way to go for now.
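
Roughly, the Hive 0.8 append path would look like this (table and column names
are illustrative; users_staging would hold the freshly imported rows, and the
join is just one way to keep duplicates out):

    -- append only rows whose id is not already in the target table
    INSERT INTO TABLE users
    SELECT s.*
    FROM users_staging s
    LEFT OUTER JOIN users u ON s.id = u.id
    WHERE u.id IS NULL;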

Thanks,
Nitin

On Mon, Apr 16, 2012 at 4:26 AM, Roshan Pradeep <codeva...@gmail.com> wrote:

> Hi Nitin
>
> Thanks for your reply.
>
> I am using the sqoop *1.4.1-incubating* version. On the sqoop releases
> download page there is no such version as the one you are referring to.
> Please correct me if I am wrong.
>
> Deleting the warehouse folder and importing works fine, but my tables
> have GBs of data, so deleting & importing every time is not a good answer
> for me. I am working on a solution for our production system.
>
> Is there any way to solve this issue?
>
> Thanks.
>
>
> On Fri, Apr 13, 2012 at 11:13 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote:
>
>> Hi Roshan,
>>
>> I guess you are using a sqoop version older than 17.
>>
>> You are facing similar issue mentioned in 
>> SQOOP-216<https://issues.cloudera.org/browse/SQOOP-216>
>>
>> You can try deleting the already existing directory.
>>
>> Thanks,
>> Nitin
>>
>>
>> On Fri, Apr 13, 2012 at 6:12 PM, Roshan Pradeep <codeva...@gmail.com>wrote:
>>
>>> Hadoop - 0.20.2
>>> Hive - 0.8.1
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Apr 13, 2012 at 5:03 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote:
>>>
>>>> can you tell us:
>>>> 1) the hive version
>>>> 2) the hadoop version that you are using?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 13, 2012 at 12:23 PM, Roshan Pradeep 
>>>> <codeva...@gmail.com>wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I want to import the updated data from my source (PostgreSQL) into Hive,
>>>>> based on a column (lastmodifiedtime) in PostgreSQL.
>>>>>
>>>>> *The command I am using*
>>>>>
>>>>> /app/sqoop/bin/sqoop import --hive-table users --connect
>>>>> jdbc:postgresql://<server_url>/<database> --table users --username XXXXXXX
>>>>> --password YYYYYY --hive-home /app/hive --hive-import --incremental
>>>>> lastmodified --check-column lastmodifiedtime
>>>>>
>>>>> *With the above command, I am getting the below error*
>>>>>
>>>>> 12/04/13 16:31:21 INFO orm.CompilationManager: Writing jar file:
>>>>> /tmp/sqoop-root/compile/11ce8600a5656ed49e631a260c387692/users.jar
>>>>> 12/04/13 16:31:21 INFO tool.ImportTool: Incremental import based on
>>>>> column "lastmodifiedtime"
>>>>> 12/04/13 16:31:21 INFO tool.ImportTool: Upper bound value: '2012-04-13
>>>>> 16:31:21.865429'
>>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: It looks like you
>>>>> are importing from postgresql.
>>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: This transfer can be
>>>>> faster! Use the --direct
>>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: option to exercise a
>>>>> postgresql-specific fast path.
>>>>> 12/04/13 16:31:21 INFO mapreduce.ImportJobBase: Beginning import of
>>>>> users
>>>>> 12/04/13 16:31:23 ERROR tool.ImportTool: Encountered IOException
>>>>> running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
>>>>> Output directory users already exists
>>>>>         at
>>>>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
>>>>>         at
>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>>>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>>>         at
>>>>> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>>>         at
>>>>> org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
>>>>>         at
>>>>> org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:201)
>>>>>         at
>>>>> org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:413)
>>>>>         at
>>>>> org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:102)
>>>>>         at
>>>>> org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
>>>>>         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
>>>>>         at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
>>>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
>>>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
>>>>>         at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
>>>>>         at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
>>>>>
>>>>> According to the above, it identifies the updated data from PostgreSQL,
>>>>> but it says the output directory already exists. Could someone please
>>>>> help me correct this issue?
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>


-- 
Nitin Pawar
