[jira] [Commented] (SPARK-19013) java.util.ConcurrentModificationException when using s3 path as checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799881#comment-15799881 ] Tim Chan commented on SPARK-19013:
--
[~zsxwing]
{code}
Error: java.util.ConcurrentModificationException: Multiple HDFSMetadataLog are using s3://lumos-emr-logs/streaming-insights-ebb-and-flow-speed-accuracy/offsets
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:162)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(HDFSMetadataLog.scala:119)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1$$anonfun$apply$mcZ$sp$1.apply(HDFSMetadataLog.scala:119)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1$$anonfun$apply$mcZ$sp$1.apply(HDFSMetadataLog.scala:119)
	at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:79)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:119)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:115)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:115)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:115)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$constructNextBatch$1.apply$mcV$sp(StreamExecution.scala:346)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$constructNextBatch$1.apply(StreamExecution.scala:345)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$constructNextBatch$1.apply(StreamExecution.scala:345)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$reportTimeTaken(StreamExecution.scala:656)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$constructNextBatch(StreamExecution.scala:345)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcZ$sp(StreamExecution.scala:219)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:213)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:213)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$reportTimeTaken(StreamExecution.scala:656)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:212)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:208)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:142)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://lumos-emr-logs/streaming-insights-ebb-and-flow-speed-accuracy/offsets/.45b98c69-6158-4434-a7b2-c3f73d27294e.tmp'
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:812)
	at org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2286)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileLinkStatus(EmrFileSystem.java:521)
	at org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:130)
	at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:705)
	at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
	at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.rename(HDFSMetadataLog.scala:309)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:150)
{code}
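The `Caused by` section shows the failure happening inside a rename of a hidden `.<uuid>.tmp` file in the `offsets/` directory. A minimal Python sketch of that write-temp-then-rename commit pattern (illustrative only; function and variable names are mine, not Spark's) shows why the pattern depends on the store making the temp file immediately visible to `rename` — a guarantee a local filesystem or HDFS gives, but S3's copy-and-delete emulation of rename historically did not:

```python
import os
import tempfile
import uuid

def write_batch(log_dir: str, batch_id: int, payload: bytes) -> None:
    """Commit a metadata batch by writing a hidden temp file and
    renaming it into place (sketch of the pattern in the trace above)."""
    os.makedirs(log_dir, exist_ok=True)
    tmp_path = os.path.join(log_dir, ".{}.tmp".format(uuid.uuid4()))
    final_path = os.path.join(log_dir, str(batch_id))
    with open(tmp_path, "wb") as f:
        f.write(payload)
    # On a POSIX filesystem or HDFS this rename is atomic and the source
    # is guaranteed visible. On S3, "rename" is a copy+delete on an
    # eventually consistent listing, so getFileStatus on the temp object
    # can fail with FileNotFoundException as in the trace above.
    os.rename(tmp_path, final_path)

log_dir = tempfile.mkdtemp()
write_batch(log_dir, 0, b"offsets")
```

On a local filesystem the rename always succeeds; the exception in this ticket is the same code path failing when the underlying store cannot see the just-written temp file.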
[jira] [Commented] (SPARK-19013) java.util.ConcurrentModificationException when using s3 path as checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15781029#comment-15781029 ] Tim Chan commented on SPARK-19013:
--
Perhaps the documentation should be revised to recommend against using s3 as a location for {{checkpointLocation}}? I will test with an hdfs location and update this ticket with my findings.

> java.util.ConcurrentModificationException when using s3 path as checkpointLocation
> ---
>
> Key: SPARK-19013
> URL: https://issues.apache.org/jira/browse/SPARK-19013
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.0.2
> Reporter: Tim Chan
>
> I have a structured stream job running on EMR. The job will fail due to this
> {code}
> Multiple HDFSMetadataLog are using s3://mybucket/myapp
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:162)
> {code}
> There is only one instance of this stream job running.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19013) java.util.ConcurrentModificationException when using s3 path as checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Chan updated SPARK-19013:
--
Description:
I have a structured stream job running on EMR. The job will fail due to this
{code}
Multiple HDFSMetadataLog are using s3://mybucket/myapp
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:162)
{code}
There is only one instance of this stream job running.

was:
I have a structured stream job running on EMR. The job will fail due to this
```
Multiple HDFSMetadataLog are using s3://mybucket/myapp
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:162)
```
There is only one instance of this stream job running.
[jira] [Created] (SPARK-19013) java.util.ConcurrentModificationException when using s3 path as checkpointLocation
Tim Chan created SPARK-19013:

Summary: java.util.ConcurrentModificationException when using s3 path as checkpointLocation
Key: SPARK-19013
URL: https://issues.apache.org/jira/browse/SPARK-19013
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.0.2
Reporter: Tim Chan

I have a structured stream job running on EMR. The job will fail due to this
```
Multiple HDFSMetadataLog are using s3://mybucket/myapp
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:162)
```
There is only one instance of this stream job running.
[jira] [Commented] (SPARK-17466) Error message is not very clear
[ https://issues.apache.org/jira/browse/SPARK-17466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477873#comment-15477873 ] Tim Chan commented on SPARK-17466:
--
Thanks [~srowen]!

> Error message is not very clear
> ---
>
> Key: SPARK-17466
> URL: https://issues.apache.org/jira/browse/SPARK-17466
> Project: Spark
> Issue Type: Improvement
> Reporter: Tim Chan
> Priority: Trivial
>
> User class threw exception: org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING;
> The same Spark SQL that throws this exception in EMR 5.0.0 works just fine in Databricks using Spark 2.0.0/Scala 2.11. I don't even understand what the error means.
[jira] [Commented] (SPARK-17466) Error message is not very clear
[ https://issues.apache.org/jira/browse/SPARK-17466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477849#comment-15477849 ] Tim Chan commented on SPARK-17466:
--
I suppose I don't understand why I'm limited to 1 preceding.
[jira] [Created] (SPARK-17466) Error message is not very clear
Tim Chan created SPARK-17466:

Summary: Error message is not very clear
Key: SPARK-17466
URL: https://issues.apache.org/jira/browse/SPARK-17466
Project: Spark
Issue Type: Improvement
Reporter: Tim Chan
Priority: Trivial

User class threw exception: org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING;

The same Spark SQL that throws this exception in EMR 5.0.0 works just fine in Databricks using Spark 2.0.0/Scala 2.11. I don't even understand what the error means.
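For context on what the required frame in that error message denotes: ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING is a one-row frame containing exactly the previous row, i.e. it behaves like LAG(value, 1). A small plain-Python sketch (illustrative only, not Spark code) of what that frame yields per row:

```python
def frame_1_preceding_1_preceding(values):
    """For each row, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING frames
    exactly the previous row (empty for the first row), so the framed
    value is the same as LAG(value, 1)."""
    return [values[i - 1] if i > 0 else None for i in range(len(values))]
```

The exception is raised because a RANGE frame (the default with ORDER BY) was supplied where the function requires this fixed single-row ROWS frame.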
[jira] [Commented] (SPARK-17423) Support IGNORE NULLS option in Window functions
[ https://issues.apache.org/jira/browse/SPARK-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472394#comment-15472394 ] Tim Chan commented on SPARK-17423:
--
[~hvanhovell] I was able to rewrite this Redshift fragment:
{code:sql}
DATEDIFF(day,
         LAG(CASE WHEN SUM(activities.activity_one + activities.activity_two) > 0
                  THEN activities.date END) IGNORE NULLS
           OVER (PARTITION BY activities.user_id ORDER BY activities.date),
         activities.date) AS days_since_last_activity
{code}
as this Spark SQL fragment:
{code:sql}
DATEDIFF(activities.date,
         LAST(CASE WHEN SUM(activities.activity_one + activities.activity_two) > 0
                   THEN activities.date END, true)
           OVER (PARTITION BY activities.user_id ORDER BY activities.date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS days_since_last_activity
{code}
Thanks for pointing me in the right direction.

> Support IGNORE NULLS option in Window functions
> ---
>
> Key: SPARK-17423
> URL: https://issues.apache.org/jira/browse/SPARK-17423
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Tim Chan
> Priority: Minor
>
> http://stackoverflow.com/questions/24338119/is-it-possible-to-ignore-null-values-when-using-lag-and-lead-functions-in-sq
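The core of the rewrite above is that `LAST(expr, true)` over a frame of `ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING` yields, per row, the last non-null value among the strictly preceding rows of the partition — which is exactly what `LAG(expr) IGNORE NULLS` computes. A plain-Python sketch of that semantics over one partition (illustrative only, not Spark code):

```python
def last_non_null_before(values):
    """Per row: the last non-None value among strictly preceding rows,
    i.e. LAST(expr, ignoreNulls=true) OVER (... ROWS BETWEEN UNBOUNDED
    PRECEDING AND 1 PRECEDING)."""
    out, last_seen = [], None
    for v in values:
        out.append(last_seen)   # frame excludes the current row
        if v is not None:
            last_seen = v       # remember the newest non-null value
    return out
```

So rows whose CASE expression was NULL are skipped, and each row sees the date of the most recent preceding row with activity.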
[jira] [Created] (SPARK-17423) Support IGNORE NULLS option in Window functions
Tim Chan created SPARK-17423:

Summary: Support IGNORE NULLS option in Window functions
Key: SPARK-17423
URL: https://issues.apache.org/jira/browse/SPARK-17423
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Tim Chan
Priority: Minor

http://stackoverflow.com/questions/24338119/is-it-possible-to-ignore-null-values-when-using-lag-and-lead-functions-in-sq