Ok.
No, do not split into smaller files; that is done automatically. The behavior 
you see looks strange: for that file size I would expect it to take under one 
minute. 
Maybe you hit a bug in the Hive-on-Spark engine. You could try with a file with 
fewer columns but the same size. I assume that this is a Hive table with simple 
columns (nothing deeply nested) and that you do not apply any transformations.
What is the CTAS query?
Do you enable vectorization in Hive?
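If vectorization is off, it is worth trying; a minimal sketch of the session settings, assuming Hive 2.x property names (vectorized reads mainly benefit ORC input, so the textfile scan itself may not vectorize):

```sql
-- Vectorized query execution processes rows in batches instead of one by one.
SET hive.vectorized.execution.enabled = true;
-- In Hive 2.x the reduce side can be vectorized as well.
SET hive.vectorized.execution.reduce.enabled = true;
```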

If you just need a simple mapping from CSV to ORC you can use any framework 
(MR, Tez, Spark, etc.), because performance does not differ much in these 
cases, especially for the small amount of data you process.
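For example, such a mapping can be done entirely in HiveQL; a sketch with placeholder table and column names (the real table would have ~550 columns):

```sql
-- External table over the raw CSV files (schema abbreviated to two columns).
CREATE EXTERNAL TABLE events_text (
  id      BIGINT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/events_csv';

-- CTAS: create the ORC table and copy the data in one statement.
CREATE TABLE events_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB')
AS SELECT * FROM events_text;
```

The same result can be achieved with a separate CREATE TABLE ... STORED AS ORC followed by an INSERT OVERWRITE, as in the steps quoted below.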

> On 9 Dec 2016, at 11:02, Joaquin Alzola <joaquin.alz...@lebara.com> wrote:
> 
> Hi Jorn
>  
> The file is about 1.5 GB with 1.5 million records and about 550 fields in 
> each row.
>  
> ORC is compressed with Zlib.
>  
> I am using a standalone solution before expanding it, so everything is on the 
> same node.
> Hive 2.0.1 -> Spark 1.6.3 -> HDFS 2.6.5
>  
> The configuration is pretty much standard; I have not changed much.
>  
> It cannot be a network issue because all the apps are on the same node.
>  
> Since I am doing all of this conversion at the Hive level (from textfile to 
> ORC), I wanted to know if I could do it quicker at the Spark or HDFS level 
> (doing the file conversion some other way) rather than at the top of the “stack”.
>  
> We receive the files once a day, so if I load them as textfile and then 
> convert to ORC it will take me almost half a day just to make the data 
> available.
>  
> It is basically a time-consuming task, and I want to do it much quicker. A 
> better solution, of course, would be to ingest smaller files with Flume, but 
> I will do that in the future.
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: 09 December 2016 09:48
> To: user@hive.apache.org
> Subject: Re: Hive Stored Textfile to Stored ORC taking long time
>  
> How large is the file? Might IO be an issue? How many disks have you on the 
> only node?
>  
> Do you compress the ORC (snappy?). 
>  
> What is the Hadoop distribution? Configuration baseline? Hive version?
>  
> I am not sure I understood your setup, but might network be an issue?
> 
> On 9 Dec 2016, at 02:08, Joaquin Alzola <joaquin.alz...@lebara.com> wrote:
> 
> HI List
>  
> The transformation from a textfile table to a stored-ORC table takes quite a 
> long time.
>  
> Steps follow:
>  
> 1. Create one normal table using textfile format.
> 
> 2. Load the data normally into this table.
> 
> 3. Create one table with the schema of the expected results of your normal 
> Hive table, using stored as ORC.
> 
> 4. Run an INSERT OVERWRITE query to copy the data from the textfile table to 
> the ORC table.
> 
>  
> I have about 1.5 million records with about 550 fields in each row.
>  
> Doing step 4 takes about 30 minutes (moving from one format to the other).
>  
> I have Spark with only one worker (same for HDFS), so I am running a 
> standalone server for now, but with 25 GB and 14 cores on that worker.
>  
> BR
>  
> Joaquin
> This email is confidential and may be subject to privilege. If you are not 
> the intended recipient, please do not copy or disclose its content but 
> contact the sender immediately upon receipt.
