Re: Problem embedding GaussianMixtureModel in a closure
Hi Yanbo,

thanks for the info. Is it likely to change in the (near :) ) future? The ability to call this function only on local data (i.e. not inside an RDD operation) seems to be a rather serious limitation.

cheers,
Tomasz

On 02.01.2016 09:45, Yanbo Liang wrote:

Hi Tomasz,

The GMM is bound to its peer Java GMM object, so it needs a reference to the SparkContext. Some of the MLlib (not ML) models are simple objects, such as KMeansModel, LinearRegressionModel etc., but others refer to the SparkContext. The latter ones and their corresponding member functions should not be called in map().

Cheers
Yanbo

2016-01-01 4:12 GMT+08:00 Tomasz Fruboes <tomasz.frub...@ncbj.gov.pl>:

Dear All,

I'm trying to implement a procedure that iteratively updates an RDD using results from GaussianMixtureModel.predictSoft. In order to avoid problems with the local variable (the obtained GMM) being overwritten in each pass of the loop, I'm doing the following:

###
for i in xrange(10):
    gmm = GaussianMixture.train(rdd, 2)

    def getSafePredictor(unsafeGMM):
        return lambda x: \
            (unsafeGMM.predictSoft(x.features), unsafeGMM.gaussians.mu)

    safePredictor = getSafePredictor(gmm)
    predictionsRDD = (labelledpointrddselectedfeatsNansPatched
                      .map(safePredictor))
    print predictionsRDD.take(1)
    # (... rest of code - update rdd with results from predictionsRDD)
###

Unfortunately this ends with:

###
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
###

Any idea why I'm getting this behaviour? My expectation was that the GMM should be a "simple" object without a SparkContext in it. I'm using Spark 1.5.2.

Thanks,
Tomasz

ps As a workaround I'm currently doing

def getSafeGMM(unsafeGMM):
    return lambda x: unsafeGMM.predictSoft(x)

safeGMM = getSafeGMM(gmm)
predictionsRDD = \
    safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))

which works fine. If possible, I would like to avoid this approach, since it would require building another closure over gmm.gaussians later in my code.
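One way around this limitation is to pull the fitted parameters out of the model on the driver and compute the soft assignments directly on the workers. Below is a minimal sketch, assuming numpy and scipy are available on the executors and that `rdd` holds LabeledPoints; it recomputes the responsibilities r_k(x) = w_k N(x | mu_k, sigma_k) / sum_j w_j N(x | mu_j, sigma_j) itself instead of calling MLlib's predictSoft, so exact numerical agreement with MLlib is not guaranteed:

import numpy as np
from scipy.stats import multivariate_normal

# Extract plain parameters on the driver. These are ordinary numpy
# structures, so the closure below never captures the JVM-backed model
# (and therefore never needs a SparkContext on the workers).
weights = list(gmm.weights)
components = [(g.mu.toArray(), g.sigma.toArray()) for g in gmm.gaussians]

def soft_predict(x):
    # Responsibilities: r_k is proportional to w_k * N(x | mu_k, sigma_k).
    lik = np.array([w * multivariate_normal.pdf(x, mean=mu, cov=sigma)
                    for w, (mu, sigma) in zip(weights, components)])
    return lik / lik.sum()

predictionsRDD = rdd.map(lambda lp: (soft_predict(lp.features.toArray()), lp))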
Problem embedding GaussianMixtureModel in a closure
Dear All,

I'm trying to implement a procedure that iteratively updates an RDD using results from GaussianMixtureModel.predictSoft. In order to avoid problems with the local variable (the obtained GMM) being overwritten in each pass of the loop, I'm doing the following:

###
for i in xrange(10):
    gmm = GaussianMixture.train(rdd, 2)

    def getSafePredictor(unsafeGMM):
        return lambda x: \
            (unsafeGMM.predictSoft(x.features), unsafeGMM.gaussians.mu)

    safePredictor = getSafePredictor(gmm)
    predictionsRDD = (labelledpointrddselectedfeatsNansPatched
                      .map(safePredictor))
    print predictionsRDD.take(1)
    # (... rest of code - update rdd with results from predictionsRDD)
###

Unfortunately this ends with:

###
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
###

Any idea why I'm getting this behaviour? My expectation was that the GMM should be a "simple" object without a SparkContext in it. I'm using Spark 1.5.2.

Thanks,
Tomasz

ps As a workaround I'm currently doing

def getSafeGMM(unsafeGMM):
    return lambda x: unsafeGMM.predictSoft(x)

safeGMM = getSafeGMM(gmm)
predictionsRDD = \
    safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))

which works fine. If possible, I would like to avoid this approach, since it would require building another closure over gmm.gaussians later in my code.
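The workaround from the ps can be extended to also carry the component means without building another closure over gmm.gaussians: read the means once on the driver as plain arrays, then call predictSoft on the whole features RDD. A sketch under the assumption that predictSoft applied to an RDD keeps rows in order, so results can be paired back with the inputs (`labelled_rdd` stands in for the original RDD name):

# Read the means once, on the driver; plain numpy arrays are safe to
# close over later, unlike the model itself.
mus = [g.mu.toArray() for g in gmm.gaussians]

# predictSoft invoked on a whole RDD goes through the driver-side model,
# so no SparkContext is needed inside any task closure.
features = labelled_rdd.map(lambda lp: lp.features)
soft = gmm.predictSoft(features)

# Attach the means to each soft-assignment vector.
predictionsRDD = soft.map(lambda probs: (probs, mus))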
Re: Union of many RDDs taking a long time
Hi Matt,

is there a reason you need to call coalesce on every loop iteration? Most likely it forces Spark to do lots of unnecessary shuffles. Also, for a really large number of inputs this approach can lead to stack overflow errors due to too many nested RDD.union calls. A safer approach is to call union on the SparkContext once, as soon as you have all the RDDs ready. In Python it looks this way:

rdds = []
for i in xrange(cnt):
    rdd = ...
    rdds.append(rdd)
finalRDD = sparkContext.union(rdds)

HTH,
Tomasz

On 18.06.2015 at 02:53, Matt Forbes wrote:

I have multiple input paths, each containing data that needs to be mapped in a slightly different way into a common data structure. My approach boils down to:

RDD<T> rdd = null;
for (Configuration conf : configurations) {
    RDD<T> nextRdd = loadFromConfiguration(conf);
    rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
    rdd = rdd.coalesce(nextRdd.partitions().size());
}

Now, for a small number of inputs there doesn't seem to be a problem, but the full set, which is about 60 sub-RDDs coming in at around 500MM total records, takes a very long time to construct. For a simple load-then-count example job, it takes 13 minutes total, of which the count() task accounts for only 2 minutes. Is there something I should be doing differently here? If you can't tell, this is in Java, so my RDD is probably some mess of nested wrapped RDDs, but I'm not sure if that would be the real issue.

Thanks,
Matt
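For comparison, here is the same flat-union pattern as a minimal PySpark sketch, with coalesce deferred to a single call at the end; `load_from_configuration`, `configurations`, and the partition target are hypothetical stand-ins for whatever the job actually needs:

from pyspark import SparkContext

sc = SparkContext(appName="flat-union-sketch")

# Build every input RDD first; no union or coalesce inside the loop.
# load_from_configuration is a hypothetical per-input loader.
rdds = [load_from_configuration(conf) for conf in configurations]

# One flat union through the SparkContext avoids deeply nested
# UnionRDD wrappers; coalesce is then applied exactly once.
target = sum(r.getNumPartitions() for r in rdds)
finalRDD = sc.union(rdds).coalesce(target)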
Re: Multi user setup and saving a DataFrame / RDD to a network exported file system
Hi,

thanks for the answer, I'll open a ticket. In the meantime I have found a workaround. The recipe is the following:

1. Create a new account/group on all machines (let's call it sparkuser). Run Spark from this account.

2. Add your user to the sparkuser group.

3. If you decide to write an RDD/parquet file under a directory workdir, you need to execute the following (just once, before running spark-submit):

chgrp sparkuser workdir
chmod g+s workdir
setfacl -d -m g::rwx workdir

(the first two steps can also be replaced by "newgrp sparkuser", but this way all your files will be created with the sparkuser group)

Then calls like

rdd.saveAsPickleFile(workdir + "/somename")

work just fine. The above solution has one serious problem - any other user from the sparkuser group will be able to overwrite your saved data.

cheers,
Tomasz

On 20.05.2015 at 23:08, Davies Liu wrote:

Could you file a JIRA for this? The executor should run under the user who submitted the job, I think.

On Wed, May 20, 2015 at 2:40 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Thanks for the suggestion. I have tried playing with it; sc.sparkUser() gives me the expected user name, but it doesn't solve the problem. From a quick search through the Spark code it seems to me that this setting is effective only for YARN and Mesos. I think a workaround for the problem could be using --deploy-mode cluster (not 100% convenient, since it disallows any interactive work), but this is not supported for Python-based programs.

Cheers,
Tomasz

On 20.05.2015 at 10:57, Iulian Dragoș wrote:

You could try setting `SPARK_USER` to the user under which your workers are running. I couldn't find many references to this variable, but at least Yarn and Mesos take it into account when spawning executors. Chances are that standalone mode also does it.

iulian

On Wed, May 20, 2015 at 9:29 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Hi,

thanks for the answer. The rights are

drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/

I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2 in the path in the dump [1] below) is created with drwxr-xr-x rights (owned by the user submitting the job, i.e. tfruboes). The temp directories created inside namesAndAges.parquet2/_temporary/0/ (e.g. task_201505200920_0009_r_01) are owned by root, again with drwxr-xr-x access rights.

Cheers,
Tomasz

On 19.05.2015 at 23:56, Davies Liu wrote:

It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ?

On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

Cheers,
Tomasz

ps I've tried to google this problem - there are a couple of similar reports, but no clear answer/solution found.

ps2 For completeness - running the master/workers as a regular user solves the problem only for that user. For other users submitting to this master, the result is given in [2] below.

[0] Cluster details: master/workers on CentOS 6.5, Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour for the 2.6 build)

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE
Re: saveAsTextFile() part- files are missing
Hi,

it looks like you are writing to a local filesystem. Could you try writing to a location visible to all nodes (master and workers), e.g. an NFS share?

HTH,
Tomasz

On 21.05.2015 at 17:16, rroxanaioana wrote:

Hello!
I just started with Spark. I have an application which counts words in a file (a 1 MB file). The file is stored locally. I loaded the file using native code and then created the RDD from it.

JavaRDD<String> rddFromFile = context.parallelize(myFile, 2);
JavaRDD<String> words = rddFromFile.flatMap(...);
JavaPairRDD<String, Integer> pairs = words.mapToPair(...);
JavaPairRDD<String, Integer> counter = pairs.reduceByKey(...);
counter.saveAsTextFile("file:///root/output");
context.close();

I have one master and 2 slaves. I run the program from the master node. The output directory is created on the master node and on the 2 worker nodes. On the master node I have only one file, _SUCCESS (empty), and on the worker nodes I have _temporary directories. I printed the counter at the console; the result seems OK. What am I doing wrong?

Thank you!
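The same idea in a minimal PySpark sketch (the file path and HDFS URL below are hypothetical) - the point is that saveAsTextFile must target storage every executor can reach, e.g. HDFS or a network share mounted on all nodes, because each executor writes its own part- files:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-shared-output")

# Read the local file on the driver, then parallelize, as in the original post.
words = sc.parallelize(open("/root/myfile.txt").read().split(), 2)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# A URI visible to all nodes; each executor writes its part- file here.
counts.saveAsTextFile("hdfs://namenode:8020/user/roxana/output")

sc.stop()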
Re: Multi user setup and saving a DataFrame / RDD to a network exported file system
Hi,

thanks for the answer. The rights are

drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/

I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2 in the path in the dump [1] below) is created with drwxr-xr-x rights (owned by the user submitting the job, i.e. tfruboes). The temp directories created inside namesAndAges.parquet2/_temporary/0/ (e.g. task_201505200920_0009_r_01) are owned by root, again with drwxr-xr-x access rights.

Cheers,
Tomasz

On 19.05.2015 at 23:56, Davies Liu wrote:

It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ?

On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690)
    at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1181)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
##

[2]
##
15/05/19 14:45:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, wn23023.cis.gov.pl): java.io.IOException: Mkdirs failed to create file:/mnt/lustre/bigdata/med_home/tmp/test18/namesAndAges.parquet2/_temporary/0/_temporary/attempt_201505191445_0009_r_00_0
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906
Re: Multi user setup and saving a DataFrame / RDD to a network exported file system
Thanks for the suggestion. I have tried playing with it; sc.sparkUser() gives me the expected user name, but it doesn't solve the problem. From a quick search through the Spark code it seems to me that this setting is effective only for YARN and Mesos. I think a workaround for the problem could be using --deploy-mode cluster (not 100% convenient, since it disallows any interactive work), but this is not supported for Python-based programs.

Cheers,
Tomasz

On 20.05.2015 at 10:57, Iulian Dragoș wrote:

You could try setting `SPARK_USER` to the user under which your workers are running. I couldn't find many references to this variable, but at least Yarn and Mesos take it into account when spawning executors. Chances are that standalone mode also does it.

iulian

On Wed, May 20, 2015 at 9:29 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Hi,

thanks for the answer. The rights are

drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/

I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2 in the path in the dump [1] below) is created with drwxr-xr-x rights (owned by the user submitting the job, i.e. tfruboes). The temp directories created inside namesAndAges.parquet2/_temporary/0/ (e.g. task_201505200920_0009_r_01) are owned by root, again with drwxr-xr-x access rights.

Cheers,
Tomasz

On 19.05.2015 at 23:56, Davies Liu wrote:

It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ?

On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

Cheers,
Tomasz

ps I've tried to google this problem - there are a couple of similar reports, but no clear answer/solution found.

ps2 For completeness - running the master/workers as a regular user solves the problem only for that user. For other users submitting to this master, the result is given in [2] below.

[0] Cluster details: master/workers on CentOS 6.5, Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour for the 2.6 build)

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690
Multi user setup and saving a DataFrame / RDD to a network exported file system
Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

Cheers,
Tomasz

ps I've tried to google this problem - there are a couple of similar reports, but no clear answer/solution found.

ps2 For completeness - running the master/workers as a regular user solves the problem only for that user. For other users submitting to this master, the result is given in [2] below.

[0] Cluster details: master/workers on CentOS 6.5, Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour for the 2.6 build)

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690)
    at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1181)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
##

[2]
##
15/05/19 14:45:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, wn23023.cis.gov.pl): java.io.IOException: Mkdirs failed to create file:/mnt/lustre/bigdata/med_home/tmp/test18/namesAndAges.parquet2/_temporary/0/_temporary/attempt_201505191445_0009_r_00_0
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
    at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at