Re: Spark GraphFrame ConnectedComponents
Hi Steve,

I was able to get the application working by setting "spark.hadoop.fs.default.name". Thank you! And thank you for your input on using S3 for checkpointing. I am still working on a PoC, so I will consider using HDFS for the final implementation.

Thanks
Ankur

On Fri, Jan 6, 2017 at 9:57 AM, Steve Loughran wrote:

> On 5 Jan 2017, at 21:10, Ankur Srivastava wrote:
>
>> Yes I did try it out, and it chooses the local file system because my
>> checkpoint location starts with s3n://. I am not sure how I can make it
>> load the S3FileSystem.
>
> Set fs.default.name to s3n://whatever, or, in the Spark context,
> spark.hadoop.fs.default.name.
>
> However:
>
> 1. You should really use s3a if you have the Hadoop 2.7 JARs on your
> classpath.
> 2. Neither s3n nor s3a is a real filesystem, and certain assumptions that
> checkpointing code tends to make ("renames are O(1) atomic calls") do not
> hold. Checkpointing to S3 may not be as robust as you'd like.
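For reference, the workaround Ankur describes — making S3 the default filesystem through the Spark context — could look roughly like this. This is only a sketch against the Spark 1.6 API; the bucket name and app name are placeholders, and (per Steve's caveats) s3a is preferable if the Hadoop 2.7 JARs are available:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: point the default Hadoop filesystem at the checkpoint bucket so
// that FileSystem.get(sc.hadoopConfiguration) resolves to S3 instead of
// the local filesystem. "my-bucket" is a placeholder.
val conf = new SparkConf()
  .setAppName("ConnectedComponentsPoC")
  .set("spark.hadoop.fs.default.name", "s3n://my-bucket")
val sc = new SparkContext(conf)

// Checkpoint location on the same (now default) filesystem.
sc.setCheckpointDir("s3n://my-bucket/checkpoints")
```

Note this changes the default filesystem for the whole application, which is why isolating the checkpoint filesystem from the Path itself (as discussed later in the thread) is arguably the cleaner fix.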
Re: Spark GraphFrame ConnectedComponents
On 5 Jan 2017, at 21:10, Ankur Srivastava wrote:

> Yes I did try it out, and it chooses the local file system because my
> checkpoint location starts with s3n://. I am not sure how I can make it
> load the S3FileSystem.

Set fs.default.name to s3n://whatever, or, in the Spark context,
spark.hadoop.fs.default.name.

However:

1. You should really use s3a if you have the Hadoop 2.7 JARs on your
classpath.
2. Neither s3n nor s3a is a real filesystem, and certain assumptions that
checkpointing code tends to make ("renames are O(1) atomic calls") do not
hold. Checkpointing to S3 may not be as robust as you'd like.
Re: Spark GraphFrame ConnectedComponents
Would it be more robust to use the Path when creating the FileSystem?
https://github.com/graphframes/graphframes/issues/160

On Thu, Jan 5, 2017 at 4:57 PM, Felix Cheung wrote:

> This is likely a factor of your Hadoop config and Spark rather than
> anything specific to GraphFrames.
>
> You might have better luck getting assistance if you can isolate the code
> to a simple case that manifests the problem (without GraphFrames) and
> repost.
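The fix suggested in issue #160 — resolving the FileSystem from the checkpoint Path's own URI rather than from the default filesystem — could look roughly like this. A sketch against the Hadoop 2.x API; the checkpoint path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: derive the FileSystem from the path's URI scheme (s3n, s3a,
// hdfs, file, ...) instead of relying on fs.default.name. The path below
// is a placeholder.
val conf = new Configuration() // in a Spark job: sc.hadoopConfiguration
val checkpointPath = new Path("s3n://my-bucket/connected-components/3")

// FileSystem.get(uri, conf) picks the implementation registered for the
// path's scheme, so the delete runs against S3 rather than the local FS.
val fs = FileSystem.get(checkpointPath.toUri, conf)
fs.delete(checkpointPath, true) // recursive delete
```

This avoids the "Wrong FS: s3n://..., expected: file:///" error without changing the application-wide default filesystem.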
Re: Spark GraphFrame ConnectedComponents
This is likely a factor of your Hadoop config and Spark rather than
anything specific to GraphFrames.

You might have better luck getting assistance if you can isolate the code
to a simple case that manifests the problem (without GraphFrames) and
repost.

From: Ankur Srivastava
Sent: Thursday, January 5, 2017 3:45:59 PM
To: Felix Cheung; d...@spark.apache.org
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

> Adding the DEV mailing list to see if this is a defect in
> ConnectedComponents or if they can recommend any solution.
>
> Thanks
> Ankur
Re: Spark GraphFrame ConnectedComponents
Adding the DEV mailing list to see if this is a defect in
ConnectedComponents or if they can recommend any solution.

Thanks
Ankur

On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava wrote:

> Yes I did try it out, and it chooses the local file system because my
> checkpoint location starts with s3n://. I am not sure how I can make it
> load the S3FileSystem.
Re: Spark GraphFrame ConnectedComponents
Yes I did try it out, and it chooses the local file system because my
checkpoint location starts with s3n://.

I am not sure how I can make it load the S3FileSystem.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung wrote:

> Right, I'd agree, it seems to be only with delete.
>
> Could you by chance run just the delete to see if it fails?
>
> FileSystem.get(sc.hadoopConfiguration)
>   .delete(new Path(somepath), true)
Re: Spark GraphFrame ConnectedComponents
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails?

FileSystem.get(sc.hadoopConfiguration)
  .delete(new Path(somepath), true)

From: Ankur Srivastava
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

> Yes, it works to read the vertices and edges data from the S3 location
> and is also able to write the checkpoint files to S3. It only fails when
> deleting the data, and that is because it tries to use the default file
> system. I tried looking up how to update the default file system but
> could not find anything in that regard.
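Felix's isolation test, spelled out as a minimal driver snippet. A sketch only: the checkpoint path is a placeholder, and the comment reflects the behavior reported in this thread (with fs.default.name left at file:///):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the isolation test: run only the delete, outside GraphFrames,
// to confirm it reproduces the "Wrong FS" error on its own. The s3n path
// is a placeholder.
val sc = new SparkContext(new SparkConf().setAppName("delete-test"))
val somepath = "s3n://my-bucket/connected-components/3"

// FileSystem.get(conf) resolves via fs.default.name; with the default of
// file:/// this returns the local filesystem, so handing it an s3n path
// should throw IllegalArgumentException: Wrong FS.
FileSystem.get(sc.hadoopConfiguration)
  .delete(new Path(somepath), true)
```

If this snippet alone fails the same way, the problem is the filesystem lookup, not anything GraphFrames-specific.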
Re: Spark GraphFrame ConnectedComponents
Yes, it works to read the vertices and edges data from the S3 location, and it
is also able to write the checkpoint files to S3. It only fails when deleting
the data, and that is because it tries to use the default file system. I tried
looking up how to update the default file system but could not find anything
in that regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung wrote:

> From the stack it looks to be an error from the explicit call to
> hadoop.fs.FileSystem.
>
> Is the URL scheme for s3n registered?
> Does it work when you try to read from s3 from Spark?
>
> _
> From: Ankur Srivastava
> Sent: Wednesday, January 4, 2017 9:23 PM
> Subject: Re: Spark GraphFrame ConnectedComponents
> To: Felix Cheung
> Cc:
>
> This is the exact trace from the driver logs
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
> expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
>     at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
>     at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
>     at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
>     at GraphTest.main(GraphTest.java:31) --- Application Class
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> And I am running Spark v1.6.2 and GraphFrames v0.3.0-spark1.6-s_2.10
>
> Thanks
> Ankur
>
> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi
>>
>> I am rerunning the pipeline to generate the exact trace; I have below
>> part of the trace from the last run:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>> s3n://, expected: file:///
>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
>>     at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>>
>> Also, I think the error is happening in this part of the code,
>> "ConnectedComponents.scala:339". I am referring to the code at
>> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala
>>
>> if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>>   // TODO: remove this after DataFrame.checkpoint is implemented
>>   val out = s"${checkpointDir.get}/$iteration"
>>   ee.write.parquet(out)
>>   // may hit S3 eventually consistent issue
>>   ee = sqlContext.read.parquet(out)
>>
>>   // remove previous checkpoint
>>   if (iteration > checkpointInterval) {
>>     FileSystem.get(sc.hadoopConfiguration)
>>       .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
>>   }
>>
>>   System.gc() // hint Spark to clean shuffle directories
>> }
>>
>> Thanks
>> Ankur
>>
>> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung wrote:
>>
>>> Do you have more of the exception stack?
>>>
>>> --
>>> From: Ankur Srivastava
>>> Sent: Wednesday, J
Re: Spark GraphFrame ConnectedComponents
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered? Does it work when you try to read from
s3 from Spark?

_
From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <user@spark.apache.org>

This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
    at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
    at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
    at GraphTest.main(GraphTest.java:31) --- Application Class
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running Spark v1.6.2 and GraphFrames v0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava
<ankur.srivast...@gmail.com> wrote:

Hi

I am rerunning the pipeline to generate the exact trace; I have below part of
the trace from the last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n://, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
    at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code,
"ConnectedComponents.scala:339". I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
  // TODO: remove this after DataFrame.checkpoint is implemented
  val out = s"${checkpointDir.get}/$iteration"
  ee.write.parquet(out)
  // may hit S3 eventually consistent issue
  ee = sqlContext.read.parquet(out)

  // remove previous checkpoint
  if (iteration > checkpointInterval) {
    FileSystem.get(sc.hadoopConfiguration)
      .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
  }

  System.gc() // hint Spark to clean shuffle directories
}

Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

Do you have more of the exception stack?

From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponents algorithm of GraphFrames, but by
default it needs a checkpoint directory. As I am running my Spark cluster
with S3 as the DFS and do not have access to an HDFS file system, I tried
using an s3 directory as the checkpoint directory, but I run into the below
exception:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n://, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set the checkpoint interval to -1 to avoid checkpointing, the driver
just hangs after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark, or is
there any other option?

Thanks
Ankur
Re: Spark GraphFrame ConnectedComponents
This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
    at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
    at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
    at GraphTest.main(GraphTest.java:31) --- Application Class
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running Spark v1.6.2 and GraphFrames v0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava wrote:

> Hi
>
> I am rerunning the pipeline to generate the exact trace; I have below part
> of the trace from the last run:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n://, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
>     at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>
> Also, I think the error is happening in this part of the code,
> "ConnectedComponents.scala:339". I am referring to the code at
> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala
>
> if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>   // TODO: remove this after DataFrame.checkpoint is implemented
>   val out = s"${checkpointDir.get}/$iteration"
>   ee.write.parquet(out)
>   // may hit S3 eventually consistent issue
>   ee = sqlContext.read.parquet(out)
>
>   // remove previous checkpoint
>   if (iteration > checkpointInterval) {
>     FileSystem.get(sc.hadoopConfiguration)
>       .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
>   }
>
>   System.gc() // hint Spark to clean shuffle directories
> }
>
> Thanks
> Ankur
>
> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung wrote:
>
>> Do you have more of the exception stack?
>>
>> --
>> From: Ankur Srivastava
>> Sent: Wednesday, January 4, 2017 4:40:02 PM
>> To: user@spark.apache.org
>> Subject: Spark GraphFrame ConnectedComponents
>>
>> Hi,
>>
>> I am trying to use the ConnectedComponents algorithm of GraphFrames, but
>> by default it needs a checkpoint directory. As I am running my Spark
>> cluster with S3 as the DFS and do not have access to an HDFS file system,
>> I tried using an s3 directory as the checkpoint directory, but I run into
>> the below exception:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>> s3n://, expected: file:///
>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>>
>> If I set the checkpoint interval to -1 to avoid checkpointing, the driver
>> just hangs after 3 or 4 iterations.
>>
>> Is there some way I can set the default FileSystem to S3 for Spark, or is
>> there any other option?
>>
>> Thanks
>> Ankur
Re: Spark GraphFrame ConnectedComponents
Hi

I am rerunning the pipeline to generate the exact trace; I have below part of
the trace from the last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n://, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
    at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code,
"ConnectedComponents.scala:339". I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
  // TODO: remove this after DataFrame.checkpoint is implemented
  val out = s"${checkpointDir.get}/$iteration"
  ee.write.parquet(out)
  // may hit S3 eventually consistent issue
  ee = sqlContext.read.parquet(out)

  // remove previous checkpoint
  if (iteration > checkpointInterval) {
    FileSystem.get(sc.hadoopConfiguration)
      .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
  }

  System.gc() // hint Spark to clean shuffle directories
}

Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung wrote:

> Do you have more of the exception stack?
>
> --
> From: Ankur Srivastava
> Sent: Wednesday, January 4, 2017 4:40:02 PM
> To: user@spark.apache.org
> Subject: Spark GraphFrame ConnectedComponents
>
> Hi,
>
> I am trying to use the ConnectedComponents algorithm of GraphFrames, but
> by default it needs a checkpoint directory. As I am running my Spark
> cluster with S3 as the DFS and do not have access to an HDFS file system,
> I tried using an s3 directory as the checkpoint directory, but I run into
> the below exception:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n://, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>
> If I set the checkpoint interval to -1 to avoid checkpointing, the driver
> just hangs after 3 or 4 iterations.
>
> Is there some way I can set the default FileSystem to S3 for Spark, or is
> there any other option?
>
> Thanks
> Ankur
Re: Spark GraphFrame ConnectedComponents
Do you have more of the exception stack?

From: Ankur Srivastava
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponents algorithm of GraphFrames, but by
default it needs a checkpoint directory. As I am running my Spark cluster
with S3 as the DFS and do not have access to an HDFS file system, I tried
using an s3 directory as the checkpoint directory, but I run into the below
exception:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n://, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set the checkpoint interval to -1 to avoid checkpointing, the driver
just hangs after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark, or is
there any other option?

Thanks
Ankur
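[Editor's note] The workaround confirmed at the top of the thread is to point Hadoop's default filesystem at S3 via Spark's spark.hadoop.* configuration passthrough. A hedged sketch of that setting (the bucket name is hypothetical; with the Hadoop 2.7 JARs on the classpath, s3a is preferred over s3n, and neither is a real filesystem with O(1) atomic renames):

```properties
# spark-defaults.conf (sketch; bucket name is hypothetical)
spark.hadoop.fs.default.name   s3n://my-bucket
```

The same value can be set programmatically with SparkConf before creating the context, e.g. conf.set("spark.hadoop.fs.default.name", "s3n://my-bucket").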