Hey Christopher, I'm working with Teng on this issue. Thank you for the explanation. I tried both workarounds:
1. Just leaving hive.metastore.warehouse.dir empty does not change anything: the temp data is still written to S3 and the job still attempts the rename (copy + delete) from S3 to S3. But since the desired effect of this setting was not there before either, we will discard it and leave it empty from now on.

2. I tried your approach of copying the EMR Hive jars onto the Spark classpath. With those jars in place the query did not execute at all and failed with this error message:

  errorMessage:java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(com.esotericsoftware.kryo.Kryo, java.io.InputStream, java.lang.Class)

I used a very simple query to reproduce this:

  SELECT CONCAT('some_string', some_string_col) FROM some_table;

We suspect this comes from a version mismatch between the EMR Hive jars and the Hive version that the Spark build from your GitHub project was compiled against. We would suggest recompiling that Spark build against the AWS Hive version, which already has the Hive adaptations you mentioned implemented. Or what do you think?

Cheers
Fabian
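P.S. To make the recompile suggestion a bit more concrete, below is roughly the kind of build invocation we have in mind. This is only an untested sketch: it assumes the EMR-patched Hive artifacts are resolvable from a Maven repository, and since Spark 1.3 depends on its own forked Hive artifacts, simply overriding hive.version may not be enough on its own; the version value below is just a placeholder, not a real artifact version.

  # untested sketch: rebuild Spark 1.3 against the EMR-patched Hive
  # <emr-hive-version> is a placeholder for the actual EMR Hive 0.13 artifact version
  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 \
      -Phive -Phive-thriftserver \
      -Dhive.version=<emr-hive-version> \
      -DskipTests clean package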
2015-04-02 10:19 GMT+02:00 Teng Qiu <teng...@gmail.com>:
> ---------- Forwarded message ----------
> From: Bozeman, Christopher <bozem...@amazon.com>
> Date: 2015-04-01 22:43 GMT+02:00
> Subject: RE: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
> To: chutium <teng....@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
>
> Teng,
>
> There is no need to alter hive.metastore.warehouse.dir. Leave it as is and just create external tables with the location pointing to S3. What I suspect you are seeing is that spark-sql is writing to a temp directory within S3 and then issuing a rename to the final location, as would be done with HDFS. But in S3 there is no rename operation, so there is a performance hit as S3 performs a copy and then a delete. I tested 1 TB from/to S3 external tables and it worked; there is just the additional delay for the rename (copy).
>
> EMR has modified Hive to avoid the expensive rename, and you can take advantage of this with Spark SQL too by copying the EMR Hive jars into the Spark classpath. Like:
>
> /bin/ls /home/hadoop/.versions/hive-*/lib/*.jar | xargs -n 1 -I %% cp %% ~/spark/classpath/emr
>
> Please note that since EMR Hive is 0.13 at this time, this does break some other features already supported by spark-sql when using the built-in Hive library (for example, AVRO support). So if you use this workaround to get better-performing queries when writing to S3, be sure to test your use case.
>
> Thanks
> Christopher
>
>
> -----Original Message-----
> From: chutium [mailto:teng....@gmail.com]
> Sent: Wednesday, April 01, 2015 9:34 AM
> To: user@spark.apache.org
> Subject: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
>
> Hi,
>
> we always get issues when inserting into or creating tables with the Amazon EMR Spark version: when inserting a result set of about 1 GB, the Spark SQL query never finishes.
>
> Inserting a small result set (like 500 MB) works fine.
>
> spark.sql.shuffle.partitions at its default of 200, or setting spark.sql.shuffle.partitions=1, does not help.
>
> The log stops at:
>
> 15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename
> s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-10000
> s3://hive-db/db_xxx/some_huge_table/
>
> then only metrics.MetricsSaver logs.
>
> We set
>
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>s3://hive-db</value>
>   </property>
>
> but hive.exec.scratchdir is not set; I have no idea why the tmp files were created in s3://hive-db/tmp/hive-hadoop/
>
> We just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6
> (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md),
> still does not work.
>
> Is anyone getting the same issue? Any idea how to fix it?
>
> I believe Amazon EMR's Spark version uses com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, and not the original Hadoop s3n implementation, right?
>
> /home/hadoop/spark/classpath/emr/*
> and
> /home/hadoop/spark/classpath/emrfs/*
> are in the classpath.
>
> Btw, is there any plan to use the new Hadoop s3a implementation instead of s3n?
>
> Thanks for any help.
>
> Teng
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-on-Spark-SQL-insert-or-create-table-with-Spark-running-on-AWS-EMR-s3n-S3NativeFileSystem-renamd-tp22340.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Fabian Wollert
Business Intelligence

POSTAL ADDRESS
Zalando SE
11501 Berlin

OFFICE
Zalando SE
Mollstraße 1
10178 Berlin
Germany

Phone: +49 30 20968 1819
Fax: +49 30 27594 693
E-Mail: fabian.woll...@zalando.de
Web: www.zalando.de
Jobs: jobs.zalando.de

Zalando SE, Tamara-Danz-Straße 1, 10243 Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 158855 B
Tax no. 29/560/00596 * VAT ID no. DE 260543043
Management Board: Robert Gentz, David Schneider, Rubin Ritter
Chairwoman of the Supervisory Board: Cristina Stenbeck
Registered office: Berlin