I am new to using drill and am trying to convert a table stored on hadoop
dfs as .parquet to .tsv format using sqlline that came with the drill
package. The problem is that when doing this, the tsv files are poorly
'balanced'. When checking the sizes of the converted files, I see:

+-----------+----------------------------+

| Fragment  | Number of records written  |

+-----------+----------------------------+

| 1_3       | 1                          |

| 1_1       | 306955                     |

| 1_2       | 493009                     |

| 1_0       | 698228                     |

+-----------+----------------------------+

[mapr@mnode02 mytable_export]$ ls -l
total 486470
-rwxr-xr-x 1 mapr mapr 105581719 Oct 19 10:25 1_0_0.tsv
-rwxr-xr-x 1 mapr mapr 155385226 Oct 19 10:25 1_1_0.tsv
-rwxr-xr-x 1 mapr mapr 237176680 Oct 19 10:25 1_2_0.tsv
-rwxr-xr-x 1 mapr mapr       279 Oct 19 10:25 1_3_0.tsv

So when trying to export the files from hadoop using sqoop export, (I
think) the imbalance is causing the export progress to go very slowly for
some of the job mappers (when checking the hadoop applications web
interface).

My question, then, is whether there is a way to control the size and amount
of the tsv files being created? Ultimately I am trying to export the table
in tsv format to a Microsoft SQL Server DB(and can't alleviate the slowdown
with --batch or --direct because sqoop apparently does not support those
options for MS SQL server, nor is there a way to sqoop export parquet to sql
server).

The sql I am using to use to convert the table is mytable.sql:

alter session set `store.format`='tsv';
create table dfs.tmp.`mytable_export`
as select
    ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date
from dfs.root.`/location/of/table/to/convert/mytable`;

with this script bash myscript.sh mytable:

#!/bin/bash
...
tablename=$1
sqldir="/path/to/sql/to/run/"$tablename".sql"
echo $sqldir
...
#write table to tsv
/opt/mapr/drill/drill-1.8.0/bin/sqlline \
    -u jdbc:drill:zk=mnode01:5181,mnode02:5181,mnode03:5181 \
    -n username\
    -p password \
    --run=$sqldir
...

Any suggestions or advice would be appreciated, thanks.

-- 
This electronic message is intended only for the named 
recipient, and may contain information that is confidential or 
privileged. If you are not the intended recipient, you are 
hereby notified that any disclosure, copying, distribution or 
use of the contents of this message is strictly prohibited. If 
you have received this message in error or are not the named
recipient, please notify us immediately by contacting the 
sender at the electronic mail address noted above, and delete 
and destroy all copies of this message. Thank you.

Reply via email to