[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369330#comment-16369330
 ] 

Steve Loughran commented on SPARK-23420:


Note that a path containing a colon would still fail, even if the glob is 
bypassed. That is a long-outstanding issue.
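As a rough illustration of why a colon is awkward in a URI-like path string, here is a minimal sketch using only java.net.URI. It does not reproduce Hadoop's own path parsing, and the file name "report:2018.csv" is a made-up example, not one from the issue. It only shows that a generic URI parser reads the text before the first colon as a scheme.

```java
import java.net.URI;

public class ColonInPath {
    // Returns the scheme java.net.URI extracts from the string, or null if none.
    static String schemeOf(String s) {
        return URI.create(s).getScheme();
    }

    public static void main(String[] args) {
        // Hypothetical file name containing a colon.
        URI u = URI.create("report:2018.csv");
        // The text before the colon is parsed as a scheme, not as part of the name.
        System.out.println("scheme = " + u.getScheme());
        // The URI is opaque, so there is no path component at all.
        System.out.println("opaque = " + u.isOpaque() + ", path = " + u.getPath());
    }
}
```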

> Datasource loading not handling paths with regex chars.
> ---
>
> Key: SPARK-23420
> URL: https://issues.apache.org/jira/browse/SPARK-23420
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.2.1
> Reporter: Mitchell
> Priority: Major
>
> Greetings, during some recent testing I ran across an issue attempting to 
> load files with regex chars like []()* etc. in them. The files are valid in 
> the various storages and the normal hadoop APIs all function properly 
> accessing them.
> When my code is executed, I get the following stack trace.
> 18/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^
>   at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71)
>   at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50)
>   at org.apache.hadoop.fs.Globber.doGlob(Globber.java:210)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477)
>   at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234)
>   at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
>   at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   at com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
> Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^
>   at java.util.regex.Pattern.error(Pattern.java:1955)
>   at java.util.regex.Pattern.compile(Pattern.java:1700)
>   at java.util.regex.Pattern.<init>(Pattern.java:1351)
>   at java.util.regex.Pattern.compile(Pattern.java:1054)
>   at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
>   at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
>   at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
>   ... 25 more
> 18/02/14 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^)
> 18/02/14 04:52:46 INFO spark.SparkContext: Invoking
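The root cause at the bottom of the trace, a java.util.regex.PatternSyntaxException, can be reproduced in isolation. The sketch below is only an illustration: the class and helper names are made up, and the short pattern "[(?:]);" is copied from the tail of the pattern in the error message rather than from the reporter's real path. It shows that a closing parenthesis with no matching open one is rejected by the regex compiler, while escaped parentheses are accepted.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UnmatchedParen {
    // Returns the regex compiler's error description, or null if the pattern is valid.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // A character class followed by a stray ')' fails to compile.
        System.out.println(compileError("[(?:]);"));
        // Escaped parentheses are plain literals and compile without error.
        System.out.println(compileError("\\(\\)"));
    }
}
```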

[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-16 Thread Mitchell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367529#comment-16367529
 ] 

Mitchell commented on SPARK-23420:
--

Yes, I agree there currently appears to be no way for a user to distinguish a 
path to be treated literally from one to be treated as a glob. I think the fix 
is to have two separate methods for specifying the path, or an option that says 
how it should be treated. Files or paths with these characters are probably not 
common, but they are possible and it should be possible to load them.


[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364324#comment-16364324
 ] 

Sean Owen commented on SPARK-23420:
---

Hm. That's a different problem then. Those are URIs and should be interpreted 
as such. I also recall there was a fix for something like this recently. See 
https://issues.apache.org/jira/browse/SPARK-21996 or 
https://issues.apache.org/jira/browse/SPARK-22585 for example; it could be yet 
another instance.

Although encoding these characters is probably right, I don't think that's the 
issue here after all, as this check occurs on a Path object, after it has been 
parsed as an HDFS URI. I think there's no way to reliably distinguish a string 
that means "*" as a glob and "*" as a character in the path.

I think the real fix may be a separate way of specifying a glob, such as a 
"glob" parameter distinct from "path". That's what the HDFS API does. Even that, 
however, wouldn't change the fact that globbing is already the expected 
behavior of "path".

I don't see a workaround right now other than to avoid these chars. Anyone else?
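To make the "glob" versus "path" idea concrete, here is a minimal sketch of what an explicit switch could look like. Everything in it is hypothetical: PathMode, hasGlobChars and shouldExpand are illustrative names, not an existing Spark or Hadoop API. The character set is the one from the isGlobPath check quoted elsewhere in this thread.

```java
public class PathModeSketch {
    // Hypothetical switch: how the caller wants the path string interpreted.
    enum PathMode { LITERAL, GLOB }

    // True if the string contains any character the quoted check treats as glob syntax.
    static boolean hasGlobChars(String path) {
        for (char c : path.toCharArray()) {
            if ("{}[]*?\\".indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Expand only when the caller explicitly asked for glob semantics.
    static boolean shouldExpand(String path, PathMode mode) {
        return mode == PathMode.GLOB && hasGlobChars(path);
    }

    public static void main(String[] args) {
        // A literal file name with brackets is left alone in LITERAL mode.
        System.out.println(shouldExpand("/data/report[1].csv", PathMode.LITERAL));
        // The same kind of characters trigger expansion only in GLOB mode.
        System.out.println(shouldExpand("/data/*.csv", PathMode.GLOB));
    }
}
```

With a switch like this the decision no longer depends on guessing from the characters alone, which is the ambiguity described above.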


[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Mitchell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364284#comment-16364284
 ] 

Mitchell commented on SPARK-23420:
--

Sean, I'm a little confused by your response. From what I've seen, the 
datasource API does not correctly handle a fully URI-encoded path; otherwise I 
would be passing one. As such, I am stuck with an unencoded path, which in this 
case obviously has these characters in it. Even for a simple file with a space 
in its name, passing the encoded path does not work.

 

file: "/tmp/space file.csv"

// Percent-encoded path --> File not found
Dataset<Row> input = sqlContext.read()
    .option("header", "true").option("sep", ",")
    .option("quote", "\"").option("charset", "utf8")
    .option("escape", "\\")
    .csv("hdfs:///tmp/space%20file.csv");

// Unencoded path --> Works fine
Dataset<Row> input = sqlContext.read()
    .option("header", "true").option("sep", ",")
    .option("quote", "\"").option("charset", "utf8")
    .option("escape", "\\")
    .csv("hdfs:///tmp/space file.csv");
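For reference, the two spellings of that path are the raw and decoded forms of the same URI. The sketch below uses only java.net.URI to show the relationship. It says nothing about which form Spark expects, and the class and helper names are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EncodedPath {
    // Builds an hdfs URI from an unencoded path; the multi-argument
    // constructor percent-encodes characters such as spaces.
    static URI hdfsUri(String unencodedPath) {
        try {
            return new URI("hdfs", null, unencodedPath, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI u = hdfsUri("/tmp/space file.csv");
        // Raw form: the space appears as %20.
        System.out.println(u.getRawPath());
        // Decoded form: the original file name with a literal space.
        System.out.println(u.getPath());
    }
}
```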


[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364251#comment-16364251
 ] 

Sean Owen commented on SPARK-23420:
---

I think that logic is mostly correct, because those characters ought to be 
encoded in a file URI, as they're reserved either in URIs or in file names. I'm 
not entirely sure about brackets, though. At the least, encoding them should be 
a workaround, and it may actually be the answer.


[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Mitchell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364190#comment-16364190
 ] 

Mitchell commented on SPARK-23420:
--

Greetings Marco, thanks for the response. Without pulling master, I don't 
believe that this is fixed in master.

When looking in the source code, the SparkHadoopUtil class attempts a glob if 
the path contains any of the characters {}[]*?\. Our file name has these 
characters, but they are in no way meant to be treated as a glob. The failure 
occurs during the subsequent glob, so it seems to me that this would not have 
been fixed.

def isGlobPath(pattern: Path): Boolean = {
  pattern.toString.exists("{}[]*?\\".toSet.contains)
}

Please let me know if there's a specific commit, pull request, or other issue 
that I could look at which might pertain to this.
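To show the effect of that check on a literal file name, here is a small Java port of the same test. It is a sketch for illustration only: the class name and the sample paths are made up, and it operates on plain strings rather than Hadoop Path objects.

```java
public class GlobCheck {
    // Same character set as the quoted Scala string literal (the last one is a single backslash).
    private static final String GLOB_CHARS = "{}[]*?\\";

    // True if any character in the path is one the check treats as glob syntax.
    static boolean isGlobPath(String path) {
        for (int i = 0; i < path.length(); i++) {
            if (GLOB_CHARS.indexOf(path.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A real file whose name happens to contain brackets is classified as a glob.
        System.out.println(isGlobPath("/data/report[1].csv"));
        // Parentheses alone are not in the set, so this one is not.
        System.out.println(isGlobPath("/data/report(1).csv"));
    }
}
```

The check has no way to tell intent: any path containing one of these characters goes down the glob route, which is where the failure in the trace occurs.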


[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363634#comment-16363634
 ] 

Marco Gaido commented on SPARK-23420:
-

I don't remember the ticket number, but this may already be solved. Could you 
please check whether the problem still exists on the current master branch? Thanks.
