Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
I think you can collect the results on the driver through the RDD's toLocalIterator method and save them from the driver program, rather than writing them to files on each node's local disk and collecting those separately. If your data is small enough and you have enough cores and memory, try processing everything in local mode and writing the results locally.

-Sathish

On Fri, Aug 11, 2017 at 1:17 PM Steve Loughran wrote:

> On 10 Aug 2017, at 09:51, Hemanth Gudela wrote:
>
>> I’m now trying to split the data into multiple datasets, then collect
>> individual dataset and write it to local FS on driver node (this approach
>> slows down the spark job, but I hope it works).
>
> I doubt it. The job driver is in charge of committing work, renaming data
> under _temporary into the right place. Every operation which calls write()
> to save to an FS must have the same paths visible to all nodes in the Spark
> cluster.
>
> A cluster-wide filesystem of some form is mandatory, or you abandon
> write() and implement your own operations to save (partitioned) data.
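A minimal sketch of the toLocalIterator idea, in Python. The function streams rows into one CSV file on whichever machine runs it, writing one row at a time. It accepts any iterable of row tuples; the assumption is that in PySpark you would pass myDataFrame.toLocalIterator(), which brings rows to the driver partition by partition instead of materializing everything at once as collect() does. The helper name write_rows_locally and the paths are hypothetical.

```python
import csv
import os
import tempfile


def write_rows_locally(rows, path, header=None):
    """Stream an iterable of row tuples into a single CSV file.

    Rows are written one at a time, so memory use does not grow with
    the total number of rows. Returns the number of data rows written.
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if header is not None:
            writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


# Stand-in for a Spark result. With PySpark (an assumption, not run here)
# the call would look like:
#   write_rows_locally(myDataFrame.toLocalIterator(), "/data/out.csv",
#                      header=myDataFrame.columns)
sample_rows = iter([(1, "alice"), (2, "bob"), (3, "carol")])
out_path = os.path.join(tempfile.mkdtemp(), "result.csv")
written = write_rows_locally(sample_rows, out_path, header=["id", "name"])
```

The driver still has to receive every row over the network, so this trades speed for a bounded memory footprint; it does not remove the need for the driver's disk to hold the full result.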
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
On 10 Aug 2017, at 09:51, Hemanth Gudela <hemanth.gud...@qvantel.com> wrote:

> Yeah, installing HDFS in our environment is unfortunately going to take a lot of time (approvals/planning etc). I will have to live with local FS for now.
> The other option I had already tried is collect() and send everything to driver node. But my data volume is too huge for driver node to handle alone.

NFS cross mount.

> I’m now trying to split the data into multiple datasets, then collect individual dataset and write it to local FS on driver node (this approach slows down the spark job, but I hope it works).

I doubt it. The job driver is in charge of committing work, renaming data under _temporary into the right place. Every operation which calls write() to save to an FS must have the same paths visible to all nodes in the Spark cluster.

A cluster-wide filesystem of some form is mandatory, or you abandon write() and implement your own operations to save (partitioned) data.
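To make the commit step concrete, here is a toy model in Python. It is a simplified simulation under stated assumptions, not Hadoop's or Spark's actual committer code: each task leaves its output under _temporary/0/task_N on the disk of the node it ran on, and the driver's job commit renames whatever task output it can see into the destination, removes _temporary, and writes _SUCCESS. Two temporary directories stand in for two machines with separate local disks. The function names task_commit and job_commit are hypothetical.

```python
import os
import shutil
import tempfile


def task_commit(node_root, dest, task_id, lines):
    """Executor side: leave committed task output under _temporary/0/."""
    task_dir = os.path.join(node_root, dest, "_temporary", "0", f"task_{task_id}")
    os.makedirs(task_dir, exist_ok=True)
    with open(os.path.join(task_dir, f"part-{task_id:05d}.csv"), "w") as handle:
        handle.write("\n".join(lines) + "\n")


def job_commit(driver_root, dest):
    """Driver side: promote every task file it can see, then clean up."""
    final_dir = os.path.join(driver_root, dest)
    pending = os.path.join(final_dir, "_temporary", "0")
    if os.path.isdir(pending):
        for task in sorted(os.listdir(pending)):
            task_dir = os.path.join(pending, task)
            for name in sorted(os.listdir(task_dir)):
                os.rename(os.path.join(task_dir, name), os.path.join(final_dir, name))
        shutil.rmtree(os.path.join(final_dir, "_temporary"))
    os.makedirs(final_dir, exist_ok=True)
    open(os.path.join(final_dir, "_SUCCESS"), "w").close()


# Two directories stand in for two machines with separate local disks.
driver_disk = tempfile.mkdtemp(prefix="driver_")
worker_disk = tempfile.mkdtemp(prefix="worker_")

task_commit(driver_disk, "out", 0, ["a,1"])   # task that ran on the driver's node
task_commit(worker_disk, "out", 1, ["b,2"])   # task that ran on another node
job_commit(driver_disk, "out")                # the driver only sees its own disk

driver_files = sorted(os.listdir(os.path.join(driver_disk, "out")))
worker_files = sorted(os.listdir(os.path.join(worker_disk, "out")))
```

In this model the driver's directory ends up with the part file and _SUCCESS, while the worker's directory still holds only _temporary, which matches the symptom reported in this thread. If both tasks wrote under the same root (a shared mount), the single job commit would promote both part files.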
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Yeah, installing HDFS in our environment is unfortunately going to take a lot of time (approvals, planning, etc.). I will have to live with the local FS for now.
The other option I had already tried is collect(), sending everything to the driver node. But my data volume is too large for the driver node to handle alone.

I’m now trying to split the data into multiple datasets, then collect each dataset individually and write it to the local FS on the driver node (this approach slows down the Spark job, but I hope it works).

Thank you,
Hemanth

From: Femi Anthony
Date: Thursday, 10 August 2017 at 11.24
To: Hemanth Gudela
Cc: "user@spark.apache.org"
Subject: Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

> Also, why are you trying to write results locally if you're not using a distributed file system? Spark is geared towards writing to a distributed file system. I would suggest trying to collect() so the data is sent to the master and then do a write if the result set isn't too big, or repartition before trying to write (though I suspect this won't really help). You really should install HDFS if that is possible.
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Also, why are you trying to write results locally if you're not using a distributed file system? Spark is geared towards writing to a distributed file system. I would suggest trying collect(), so the data is sent to the master, and then doing the write there if the result set isn't too big, or repartitioning before trying to write (though I suspect this won't really help). You really should install HDFS if that is possible.

On Aug 10, 2017, at 3:58 AM, Hemanth Gudela wrote:

> I know spark.write.csv works best with HDFS, but with the current setup I
> have in my environment, I have to deal with spark write to node’s local file
> system and not to HDFS.
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Yes, I have tried with file:/// and the full path, as well as just the full path without the file:/// prefix. The Spark session has been closed as well; no luck though ☹

Regards,
Hemanth

From: Femi Anthony
Date: Thursday, 10 August 2017 at 11.06
To: Hemanth Gudela
Cc: "user@spark.apache.org"
Subject: Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

> Is your filePath prefaced with file:/// and the full path or is it relative? You might also try calling close() on the Spark context or session at the end of the program execution to try and ensure that cleanup is completed.
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Is your filePath prefaced with file:/// and the full path, or is it relative? You might also try calling close() on the Spark context or session at the end of the program execution, to try to ensure that cleanup is completed.

On Aug 10, 2017, at 3:58 AM, Hemanth Gudela wrote:

> I’m writing the file like this -->
> myDataFrame.write.mode("overwrite").csv("myFilePath")
> There absolutely are no errors/warnings after the write.
>
> _SUCCESS file is created on master node, but the problem of _temporary is
> noticed only on worker nodes.
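For the path question, a small sketch in Python of turning a possibly relative path into an absolute file:/// URI before handing it to the writer. The helper name to_file_uri and the example path are hypothetical, and the PySpark calls in the comments are assumed usage that is not executed here.

```python
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local path."""
    # resolve() makes the path absolute relative to the current directory;
    # as_uri() requires an absolute path and adds the file:// scheme.
    return Path(path).resolve().as_uri()


uri = to_file_uri("output/myFilePath")
# Assumed PySpark usage (not run here):
#   myDataFrame.write.mode("overwrite").csv(uri)
#   spark.stop()  # PySpark's way of shutting the session down
```

This only removes ambiguity about which path is meant. A file:/// URI still refers to the local disk of whichever node resolves it, so it does not by itself make the output directory visible to every node.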
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Thanks for the reply, Femi!

I’m writing the file like this --> myDataFrame.write.mode("overwrite").csv("myFilePath")
There are absolutely no errors or warnings after the write.

The _SUCCESS file is created on the master node, but the _temporary problem is seen only on the worker nodes.

I know spark.write.csv works best with HDFS, but with the current setup in my environment, I have to make Spark write to each node’s local file system and not to HDFS.

Regards,
Hemanth

From: Femi Anthony
Date: Thursday, 10 August 2017 at 10.38
To: Hemanth Gudela
Cc: "user@spark.apache.org"
Subject: Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

> Normally the _temporary directory gets deleted as part of the cleanup when the write is complete and a SUCCESS file is created. I suspect that the writes are not properly completed. How are you specifying the write? Any error messages in the logs?
Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Normally the _temporary directory gets deleted as part of the cleanup when the write is complete and a _SUCCESS file is created. I suspect that the writes are not properly completed. How are you specifying the write? Any error messages in the logs?

On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela <hemanth.gud...@qvantel.com> wrote:

> Hi,
>
> I’m running Spark in cluster mode on 4 nodes, and trying to write
> CSV files to each node’s local path (not HDFS).
> I’m using spark.write.csv to write the CSV files.
>
> On master node:
> spark.write.csv creates a folder with the csv file name and writes many files
> with part-r-000n suffix. This is okay for me, I can merge them later.
>
> But on worker nodes:
> spark.write.csv creates a folder with the csv file name and
> writes many folders and files under _temporary/0/. This is not okay for me.
>
> Could someone please suggest what could be going wrong in my
> settings, and how to write the csv files to the specified folder and not
> to subfolders (_temporary/0/task_xxx) on the worker machines?
>
> Thank you,
> Hemanth

--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.
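On the "I can merge them later" point in the original question, a minimal sketch in Python of concatenating the part files from one output folder into a single CSV. It assumes the part files are plain text whose names start with part-, as described above; the function name merge_part_files, the has_header flag, and the demo file names are hypothetical. Files such as _SUCCESS are ignored because they do not match the part-* pattern.

```python
import glob
import os
import shutil
import tempfile


def merge_part_files(folder, merged_path, has_header=False):
    """Concatenate part-* files from `folder` into one file, in name order.

    With has_header=True, the first line of every part after the first
    is skipped so the header appears only once.
    Returns the number of part files merged.
    """
    parts = sorted(glob.glob(os.path.join(folder, "part-*")))
    with open(merged_path, "w", encoding="utf-8", newline="") as out:
        for index, part in enumerate(parts):
            with open(part, "r", encoding="utf-8", newline="") as src:
                if has_header and index > 0:
                    src.readline()  # drop the repeated header line
                shutil.copyfileobj(src, out)
    return len(parts)


# Demo folder that imitates a finished output directory.
demo_dir = tempfile.mkdtemp()
for name, body in [("part-r-00000", "1,alice\n"),
                   ("part-r-00001", "2,bob\n"),
                   ("_SUCCESS", "")]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8", newline="") as handle:
        handle.write(body)

merged_file = os.path.join(demo_dir, "merged.csv")
merged_count = merge_part_files(demo_dir, merged_file)
```

This merges only what is present in one folder on one machine, so it helps on the master node as described but does not gather output left on other nodes.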