It's OK I appreciate your help no matter what, so don’t worry about the timing…

As an interesting side note we are testing now on a Greenplum cluster with 
eight segment hosts and some serious hardware on each node.  I'm told that load 
performance using this tweak, on this db, like our loads on test instances, is 
equal or greater than performance of our existing production fastloads of same 
data on Teradata; which is a good thing.


From: Szabolcs Vasas [mailto:va...@apache.org]
Sent: Thursday, January 31, 2019 4:19 AM
To: user@sqoop.apache.org
Subject: [EXTERNAL] Re: Re: Postgres direct

Hi Robert,

Sorry for the late reply, I have been quite busy recently, I will try to post 
at least an initial review by the end of tomorrow.
Don't worry about the Oracle tests this is again something we will need to 
address separately.

Szabolcs

On Fri, Jan 25, 2019 at 11:17 PM Robert B Hamilton 
<mailto:robert.hamil...@gm.com> wrote:
Hi. Szabolcs and many thanks!

I put in a pull request but the CI is failing on some oracle checks that are 
not related to any of these changes.  So I'm not sure where to go from there.

There are several changes we put into this one. The first two are the most 
important since without them it would be pretty much a showstopper for our shop:

1. I added optional support for version 8.x of postgress, for which the COPY 
command has a different syntax than later versions.

2. I also noticed some performance challenges, which I think are related to 
mapreduce mapper feeding lines to the CopyIn object one by one.This could 
introduce some considerable delays if the database is some network distant from 
the Hadoop cluster.     So I added a rudimentary line buffering and the export 
was about 10 to 20 times faster.  The option is enabled by using  --batch in 
combination with --direct.  The only downside seems to be that if a data issue 
causes failure we cannot identify the exact row that has the problem, only that 
it was in that particular batch.

3. We use '\x1c' for our standard delimiter in text mode. But unfortunately 
that is not a valid XML char so is difficult to pass to the map tasks via 
standard Hadoop config.  This introduces a limitation to direct mode that is 
not present in non direct exports.  I made a minor change to base64 encode this 
before putting into config and base64 decode when initializing the map. So with 
this change we can do - -fields-terminated-by '\0x1c' and it just works.

4. Likewise, using empty string to represent null is supported by non-direct 
but not by direct mode. I made a modification so that  --null-string ''  also 
works now with direct mode

5. We have some bizarre data that contains strings with escape chars, things 
like '\x96'.  Its in a verbatim field and we are required to load it into the 
database table as is. (yes I know the security implications 😊)
   COPY normally interprets these escape sequences and converts them to chars 
before inserting into tables. Big problem if the escape sequence evaluates to a 
non-utf char. So I added an option to disable this escape interpretation.

6. I added a switch -Dpostegresql.format.text=true, which allows us to switch 
to TEXT mode copy instead of CSV mode.  This is convenient when we have data 
coming in from many sources that don't know how to properly format a CSV file.

PS please forgive the verbiage in the Confidentiality note; I don't have any 
way to remove it when mailing from a corporate account....
--

From: Szabolcs Vasas [mailto:mailto:va...@apache.org]
Sent: Friday, January 18, 2019 7:05 AM
To: mailto:user@sqoop.apache.org
Subject: [EXTERNAL] Re: Postgres direct

Hi Robert,

Sounds great! We can discuss it in this chain or you could create a pull 
request and start the discussion there.

On Wed, Jan 16, 2019 at 12:05 AM Attila Szabó <mailto:mailto:mau...@apache.org> 
wrote:
Hi Robert,

I would be happy to assist you with this change.

Cheers,
Attila
On Tue, Jan 15, 2019, 11:55 PM Robert B Hamilton 
<mailto:mailto:robert.hamil...@gm.com wrote:
Hi.  Is there much active work being done on the Postgres direct export?

I have found that there is very significant performance improvement if we 
buffer lines into the CopyIn in the map method. We've incorporated this 
in-house with some pretty good results.

Who would I contact if I wanted to share this code? I know about the dev mail 
list but I would like to talk it over first before jumping in with a jira...

There are also a few other features that we've added like an option for CSV vs 
TEXT format,  and some miscellaneous options parsing.


Nothing in this message is intended to constitute an electronic signature 
unless a specific statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or entity to 
which it is addressed. It may contain confidential and/or privileged material. 
Any review, transmission, dissemination or other use, or taking of any action 
in reliance upon this message by persons or entities other than the intended 
recipient is prohibited and may be unlawful. If you received this message in 
error, please contact the sender and delete it from your computer.


Nothing in this message is intended to constitute an electronic signature 
unless a specific statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or entity to 
which it is addressed. It may contain confidential and/or privileged material. 
Any review, transmission, dissemination or other use, or taking of any action 
in reliance upon this message by persons or entities other than the intended 
recipient is prohibited and may be unlawful. If you received this message in 
error, please contact the sender and delete it from your computer.


Nothing in this message is intended to constitute an electronic signature 
unless a specific statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or entity to 
which it is addressed. It may contain confidential and/or privileged material. 
Any review, transmission, dissemination or other use, or taking of any action 
in reliance upon this message by persons or entities other than the intended 
recipient is prohibited and may be unlawful. If you received this message in 
error, please contact the sender and delete it from your computer.

Reply via email to