Indeed, if your table is big, then you DO want to parallelise, and setting
the number of reducers to 1 is clearly not a good idea.

From what you explained, I still don't understand what your exact problem
is.
Can't you just leave your query as is and be okay with many output files
(000000_0, 000001_0, etc.)?
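
If the concern is the number of small files rather than one exact file name,
a middle ground is Hive's merge settings, which add a cheap merge stage at
the end of the job instead of forcing a single reducer. A minimal sketch,
assuming Hive on MapReduce (exact properties and defaults depend on your
Hive version; on Tez the relevant flag is hive.merge.tezfiles):

SET hive.merge.mapfiles=true;                -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;             -- merge small files from map-reduce jobs
SET hive.merge.smallfiles.avgsize=128000000; -- trigger the merge below this avg file size
SET hive.merge.size.per.task=256000000;      -- target size for the merged files

This keeps the query itself fully parallel and only consolidates the output
files afterwards.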




On Wed, 8 Aug 2018 at 16:20, Sujeet Pardeshi <sujeet.parde...@sas.com>
wrote:

> Thanks Furcy. Won't the setting below hurt performance? Just one
> reducer? I am processing huge volumes of data and cannot compromise on
> performance.
>
>
>
> SET mapred.reduce.tasks=1;
>
>
>
>
>
> Regards,
>
> Sujeet Singh Pardeshi
>
> Software Specialist
>
> SAS Research and Development (India) Pvt. Ltd.
>
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar, Pune, Maharashtra,
> 411 013
> off: +91-20-30418810
>
>  *"When the solution is simple, God is answering…" *
>
>
>
> *From:* Furcy Pin <pin.fu...@gmail.com>
> *Sent:* 08 August 2018 PM 06:37
> *To:* user@hive.apache.org
> *Subject:* Re: Hive output file 000000_0
>
>
>
>
> It might sound silly, but isn't that what Hive is supposed to do, being a
> distributed computation framework and all?
>
>
>
> Hive will write one file per reducer, named 000000_0, 000001_0, etc., where
> the number corresponds to the reducer that wrote it.
>
> Sometimes the _0 suffix will be _1 or _2 or more, depending on retries or
> speculative execution.
>
>
>
> This is how MapReduce works, and this is how Tez and Spark work too: you
> can't have several executors writing simultaneously to the same file on
> HDFS (maybe you can on S3, but not through the HDFS API). So each executor
> writes to its own file.
>
> If you want to read back the data with MapReduce, Tez or Spark, you simply
> specify the folder path, and all files in it will be read. That's what Hive
> does too.
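>
> For example (a minimal sketch with a hypothetical path), any external table
> pointed at the folder will read every file inside it, whatever the names:
>
> CREATE EXTERNAL TABLE my_output (col1 STRING, col2 INT)
> LOCATION 's3://my-bucket/my-output/';
>
> -- reads 000000_0, 000001_0, ... transparently
> SELECT COUNT(*) FROM my_output;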
>
>
>
> Now, if you really want only one file (for instance to export it as a CSV
> file), I guess you can try to add this before your query
>
> (if you use MapReduce; there are probably similar configurations for Hive
> on Tez or Hive on Spark):
>
> SET mapred.reduce.tasks=1;
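>
> (Note: mapred.reduce.tasks is the old MRv1 property name; on recent Hadoop
> versions the equivalent should be, I believe:
>
> SET mapreduce.job.reduces=1;
>
> Both names are usually still accepted for backward compatibility.)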
>
>
>
> And if you really suspect that some files are not correctly overwritten
> (as happened to me once with Spark on Azure blob storage), I suggest that
> you check the file timestamps to determine which execution of your query
> wrote them.
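>
> For instance, from the Hive CLI you can list the files with their
> modification times (hypothetical path; for S3, aws s3 ls works as well):
>
> dfs -ls s3://my-bucket/my-output/;
>
> and compare the timestamps with the time your query ran.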
>
>
>
>
>
> Hope this helps,
>
>
>
> Furcy
>
>
>
> On Wed, 8 Aug 2018 at 14:25, Sujeet Pardeshi <sujeet.parde...@sas.com>
> wrote:
>
> Hi Deepak,
> Thanks for your response. The table is neither bucketed nor clustered, as
> can be seen below.
>
> DROP TABLE IF EXISTS ${SCHEMA_NM}.daily_summary;
> CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
> (
>   bouncer VARCHAR(12),
>   device_type VARCHAR(52),
>   visitor_type VARCHAR(10),
>   visit_origination_type VARCHAR(65),
>   visit_origination_name VARCHAR(260),
>   pg_domain_name VARCHAR(215),
>   class1_id VARCHAR(650),
>   class2_id VARCHAR(650),
>   bouncers INT,
>   rv_revenue DECIMAL(17,2),
>   visits INT,
>   active_page_view_time INT,
>   total_page_view_time BIGINT,
>   average_visit_duration INT,
>   co_conversions INT,
>   page_views INT,
>   landing_page_url VARCHAR(1332),
>   dt DATE
>
> )
> PARTITIONED BY (datelocal DATE)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> LOCATION '${OUTPUT_PATH}/daily_summary/'
> TBLPROPERTIES ('serialization.null.format'='');
>
> MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;
>
> Regards,
> Sujeet Singh Pardeshi
> Software Specialist
> SAS Research and Development (India) Pvt. Ltd.
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar, Pune, Maharashtra,
> 411 013
> off: +91-20-30418810
>
>  "When the solution is simple, God is answering…"
>
> -----Original Message-----
> From: Deepak Jaiswal <djais...@hortonworks.com>
> Sent: 07 August 2018 PM 11:19
> To: user@hive.apache.org
> Subject: Re: Hive output file 000000_0
>
>
> Hi Sujeet,
>
> I am assuming that the table is bucketed? If so, then the name represents
> which bucket the file belongs to, as Hive creates one file per bucket for
> each operation.
>
> In this case, the file 000003_0 belongs to bucket 3.
> To always have files named 000000_0, the table must be unbucketed.
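>
> For reference, a bucketed table is one declared with a CLUSTERED BY clause,
> something like this minimal sketch (hypothetical table; 4 buckets would
> give files 000000_0 through 000003_0):
>
> CREATE TABLE t (id INT, name STRING)
> CLUSTERED BY (id) INTO 4 BUCKETS;
>
> Without that clause the table is unbucketed.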
> I hope it helps.
>
> Regards,
> Deepak
>
> On 8/7/18, 1:33 AM, "Sujeet Pardeshi" <sujeet.parde...@sas.com> wrote:
>
>     Hi All,
>     I am doing an INSERT OVERWRITE operation through a Hive external table
> onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times
> I notice that it creates files with other names like 000003_0 etc. I always
> need to overwrite the existing file, but with inconsistent file names I am
> unable to do so. How do I force Hive to always create a consistent filename
> like 000000_0? Below is an example of what my code looks like, where
> tab_content is a Hive external table.
>
>     INSERT OVERWRITE TABLE tab_content
>     PARTITION(datekey)
>     SELECT * FROM source;
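>
>     (Side note: for a dynamic-partition insert like this, with no static
>     partition value, Hive normally also requires:
>
>     SET hive.exec.dynamic.partition=true;
>     SET hive.exec.dynamic.partition.mode=nonstrict;
>
>     which I assume is already set in the surrounding script.)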
>
>     Regards,
>     Sujeet Singh Pardeshi
>     Software Specialist
>     SAS Research and Development (India) Pvt. Ltd.
>     Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar, Pune,
> Maharashtra, 411 013
>     off: +91-20-30418810
>
>      "When the solution is simple, God is answering…"
>
>
