RE: sqlline parquet to tsv filesize imabalance causing slow sqoop export to MS sql server

Kunal Khatua Thu, 16 Nov 2017 22:29:28 -0800

It might be that your parallelization is causing it to generate 4 files, where 
only <= 3  files are sufficient.


Try experimenting with the planner. width .max_per_query  to a value of 3 ... 
that might help.

https://drill.apache.org/docs/configuration-options-introduction/


-----Original Message-----
From: Reed Villanueva [mailto:[email protected]] 
Sent: Thursday, November 16, 2017 5:29 PM
To: [email protected]
Subject: sqlline parquet to tsv filesize imabalance causing slow sqoop export 
to MS sql server

I am new to using drill and am trying to convert a table stored on hadoop dfs 
as .parquet to .tsv format using sqlline that came with the drill package. The 
problem is that when doing this, the tsv files are poorly 'balanced'. When 
checking the sizes of the converted files, I see:

+-----------+----------------------------+

| Fragment  | Number of records written  |

+-----------+----------------------------+

| 1_3       | 1                          |

| 1_1       | 306955                     |

| 1_2       | 493009                     |

| 1_0       | 698228                     |

+-----------+----------------------------+

[mapr@mnode02 mytable_export]$ ls -l
total 486470
-rwxr-xr-x 1 mapr mapr 105581719 Oct 19 10:25 1_0_0.tsv -rwxr-xr-x 1 mapr mapr 
155385226 Oct 19 10:25 1_1_0.tsv -rwxr-xr-x 1 mapr mapr 237176680 Oct 19 10:25 
1_2_0.tsv
-rwxr-xr-x 1 mapr mapr       279 Oct 19 10:25 1_3_0.tsv

So when trying to export the files from hadoop using sqoop export, (I
think) the imbalance is causing the export progress to go very slowly for some 
of the job mappers (when checking the hadoop applications web interface).

My question, then, is whether there is a way to control the size and amount of 
the tsv files being created? Ultimately I am trying to export the table in tsv 
format to a Microsoft SQL Server DB(and can't alleviate the slowdown with 
--batch or --direct because sqoop apparently does not support those options for 
MS SQL server, nor is there a way to sqoop export parquet to sql server).

The sql I am using to use to convert the table is mytable.sql:

alter session set `store.format`='tsv';
create table dfs.tmp.`mytable_export`
as select
    ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date from 
dfs.root.`/location/of/table/to/convert/mytable`;

with this script bash myscript.sh mytable:

#!/bin/bash
...
tablename=$1
sqldir="/path/to/sql/to/run/"$tablename".sql"
echo $sqldir
...
#write table to tsv
/opt/mapr/drill/drill-1.8.0/bin/sqlline \
    -u jdbc:drill:zk=mnode01:5181,mnode02:5181,mnode03:5181 \
    -n username\
    -p password \
    --run=$sqldir
...

Any suggestions or advice would be appreciated, thanks.

--
This electronic message is intended only for the named recipient, and may 
contain information that is confidential or privileged. If you are not the 
intended recipient, you are hereby notified that any disclosure, copying, 
distribution or use of the contents of this message is strictly prohibited. If 
you have received this message in error or are not the named recipient, please 
notify us immediately by contacting the sender at the electronic mail address 
noted above, and delete and destroy all copies of this message. Thank you.

RE: sqlline parquet to tsv filesize imabalance causing slow sqoop export to MS sql server

Reply via email to