Re: Exception when S3 path contains colons
You can change the names: whatever program is pushing the records must follow naming conventions that Hadoop's path parsing can handle. Try replacing ":" with "_" or something similar.

Thanks
Best Regards

On Tue, Aug 18, 2015 at 10:20 AM, Brian Stempin <brian.stem...@gmail.com> wrote:
> [original message and stack trace snipped; see the full message at the bottom of this thread]
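For illustration, a minimal sketch of that rename approach using the AWS SDK for Java from Scala (not from the thread itself; it assumes SDK v1 on the classpath, the bucket name is a placeholder, and since S3 has no rename, each offending object is copied to a sanitized key and then deleted):

    import com.amazonaws.services.s3.AmazonS3Client
    import scala.collection.JavaConverters._

    val s3 = new AmazonS3Client()       // default credential chain
    val bucket = "redactedbucketname"   // placeholder
    // listObjects returns a single page (up to ~1000 keys); a real run
    // would loop with listNextBatchOfObjects while the listing is truncated.
    for (summary <- s3.listObjects(bucket).getObjectSummaries.asScala) {
      val key = summary.getKey
      if (key.contains(":")) {
        s3.copyObject(bucket, key, bucket, key.replace(":", "_"))
        s3.deleteObject(bucket, key)
      }
    }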
Re: Exception when S3 path contains colons
I am not quite sure about this, but shouldn't the notation be s3n://redactedbucketname/* instead of s3a://redactedbucketname/*? The best way is to use s3://bucketname/path/*.

Regards,
Gourav

On Tue, Aug 25, 2015 at 10:35 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> [earlier messages and stack trace snipped]
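A small illustration of the variants being discussed (placeholder bucket; note that every scheme still funnels through Hadoop's FileInputFormat globbing, so switching schemes may not avoid the colon parsing by itself):

    // Same call, different FileSystem client behind each scheme.
    // On EMR, s3:// and s3n:// are typically served by Amazon's EMRFS,
    // while s3a:// is Hadoop's own S3A client.
    val filesS3n = sc.textFile("s3n://redactedbucketname/*")
    val filesS3  = sc.textFile("s3://redactedbucketname/*")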
Re: Exception when S3 path contains colons
Hello,

We had the same problem. I've written a blog post with the detailed explanation and workaround: http://labs.totango.com/spark-read-file-with-colon/

Greetings,
Romi K.

On Tue, Aug 25, 2015 at 2:47 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> [earlier messages and stack trace snipped]
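The linked post has the details; as a rough sketch of one workaround in that spirit (not necessarily the post's exact code, and only suitable for data that fits in driver memory): list the keys with the AWS SDK, whose listing is not bothered by colons, read each object's contents on the driver, and parallelize the lines, bypassing Hadoop's Path parsing entirely:

    import com.amazonaws.services.s3.AmazonS3Client
    import scala.collection.JavaConverters._
    import scala.io.Source

    val s3 = new AmazonS3Client()
    val bucket = "redactedbucketname"   // placeholder
    val keys = s3.listObjects(bucket).getObjectSummaries.asScala.map(_.getKey)
    // Everything is pulled through the driver; larger inputs would need
    // a distributed read instead of this collect-then-parallelize pattern.
    val lines = keys.flatMap { key =>
      val obj = s3.getObject(bucket, key)
      try Source.fromInputStream(obj.getObjectContent).getLines.toList
      finally obj.close()
    }
    val files = sc.parallelize(lines)
    files.count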
Exception when S3 path contains colons
Hi,

I'm running Spark on Amazon EMR (Spark 1.4.1, Hadoop 2.6.0). I'm seeing the exception below when encountering file names that contain colons. Any idea on how to get around this?

scala> val files = sc.textFile("s3a://redactedbucketname/*")

2015-08-18 04:38:34,567 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(242224) called with curMem=669367, maxMem=285203496
2015-08-18 04:38:34,568 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_3 stored as values in memory (estimated size 236.5 KB, free 271.1 MB)
2015-08-18 04:38:34,663 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(21533) called with curMem=911591, maxMem=285203496
2015-08-18 04:38:34,664 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.0 KB, free 271.1 MB)
2015-08-18 04:38:34,665 INFO [sparkDriver-akka.actor.default-dispatcher-19] storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Added broadcast_3_piece0 in memory on 10.182.184.26:60338 (size: 21.0 KB, free: 271.9 MB)
2015-08-18 04:38:34,667 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 3 from textFile at <console>:21
files: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:21

scala> files.count

2015-08-18 04:38:37,262 INFO [main] s3a.S3AFileSystem (S3AFileSystem.java:listStatus(533)) - List status for path: s3a://redactedbucketname/
2015-08-18 04:38:37,262 INFO [main] s3a.S3AFileSystem (S3AFileSystem.java:getFileStatus(684)) - Getting path status for s3a://redactedbucketname/ ()
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [922-212-4438]-[119]-[1]-[2015-08-13T15:43:12.346193%5D-%5B2015-01-01T00:00:00%5D-redacted.csv
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.hadoop.fs.Path.<init>(Path.java:94)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:240)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1700)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:279)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
    at $iwC$$iwC$$iwC.<init>(<console>:37)
    at $iwC$$iwC.<init>(<console>:39)
    at $iwC.<init>(<console>:41)
    at <init>(<console>:43)
    at .<init>(<console>:47)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at ...
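For what it's worth, the failure can be reproduced without Spark or S3 at all. Hadoop's Path constructor treats everything before the first colon as a URI scheme when no slash precedes it, so a bare file name containing a colon fails to parse (a minimal sketch, assuming hadoop-common on the classpath):

    import org.apache.hadoop.fs.Path

    // Everything before the first ':' is taken as a URI scheme, so the
    // remainder becomes a relative path inside an "absolute" URI and
    // the constructor throws:
    //   java.lang.IllegalArgumentException:
    //     java.net.URISyntaxException: Relative path in absolute URI: ...
    new Path("2015-08-13T15:43:12-redacted.csv")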