[jira] [Commented] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
[ https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580692#comment-16580692 ] Hyukjin Kwon commented on SPARK-25109: -- It would be helpful if we could narrow down this problem. > spark python should retry reading another datanode if the first one fails to > connect > > > Key: SPARK-25109 > URL: https://issues.apache.org/jira/browse/SPARK-25109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Yuanbo Liu >Priority: Major > Attachments: > WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > > > We use this code to read parquet files from HDFS: > spark.read.parquet('xxx') > and get error as below: > !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! > > What we can get is that one of the replica block cannot be read for some > reason, but spark python doesn't try to read another replica which can be > read successfully. So the application fails after throwing exception. > When I use hadoop fs -text to read the file, I can get content correctly. It > would be great that spark python can retry reading another replica block > instead of failing. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
[ https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580689#comment-16580689 ] Yuanbo Liu commented on SPARK-25109: [~hyukjin.kwon] Thanks for your comments. Not sure about that; we didn't use the Scala API in our cluster. > spark python should retry reading another datanode if the first one fails to > connect > > > Key: SPARK-25109 > URL: https://issues.apache.org/jira/browse/SPARK-25109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Yuanbo Liu >Priority: Major > Attachments: > WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > > > We use this code to read parquet files from HDFS: > spark.read.parquet('xxx') > and get error as below: > !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! > > What we can get is that one of the replica block cannot be read for some > reason, but spark python doesn't try to read another replica which can be > read successfully. So the application fails after throwing exception. > When I use hadoop fs -text to read the file, I can get content correctly. It > would be great that spark python can retry reading another replica block > instead of failing. >
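The failover requested in this issue can be sketched in plain Python. This is only an illustration of the desired behavior, not PySpark's actual HDFS read path; `read_with_failover` and the reader callables are hypothetical names:

```python
def read_with_failover(replica_readers):
    """Return the first successful result from a list of zero-argument
    callables, one per replica location.

    A connection failure on one replica falls through to the next one
    instead of failing the whole read.
    """
    errors = []
    for reader in replica_readers:
        try:
            return reader()
        except IOError as exc:
            errors.append(exc)
    # Every replica failed; surface all collected errors at once.
    raise IOError("all replicas failed: %r" % errors)
```

With this shape, one unreadable replica (as in the report above) would no longer abort the application as long as any other replica responds.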
[jira] [Commented] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event
[ https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580670#comment-16580670 ] deshanxiao commented on SPARK-25120: Sure, I find that the "Executors" tab in the HistoryServer sometimes misses the driver info in the executor-id column; it isn't convenient when we analyze a driver problem. [~hyukjin.kwon] > EventLogListener may miss driver SparkListenerBlockManagerAdded event > -- > > Key: SPARK-25120 > URL: https://issues.apache.org/jira/browse/SPARK-25120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: deshanxiao >Priority: Major > > Sometimes in spark history tab "Executors" , it couldn't find driver > information because the event of SparkListenerBlockManagerAdded is lost.
[jira] [Commented] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event
[ https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580658#comment-16580658 ] Apache Spark commented on SPARK-25120: -- User 'deshanxiao' has created a pull request for this issue: https://github.com/apache/spark/pull/22109 > EventLogListener may miss driver SparkListenerBlockManagerAdded event > -- > > Key: SPARK-25120 > URL: https://issues.apache.org/jira/browse/SPARK-25120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: deshanxiao >Priority: Major > > Sometimes in spark history tab "Executors" , it couldn't find driver > information because the event of SparkListenerBlockManagerAdded is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event
[ https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25120: Assignee: (was: Apache Spark) > EventLogListener may miss driver SparkListenerBlockManagerAdded event > -- > > Key: SPARK-25120 > URL: https://issues.apache.org/jira/browse/SPARK-25120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: deshanxiao >Priority: Major > > Sometimes in spark history tab "Executors" , it couldn't find driver > information because the event of SparkListenerBlockManagerAdded is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event
[ https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25120: Assignee: Apache Spark > EventLogListener may miss driver SparkListenerBlockManagerAdded event > -- > > Key: SPARK-25120 > URL: https://issues.apache.org/jira/browse/SPARK-25120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: deshanxiao >Assignee: Apache Spark >Priority: Major > > Sometimes in spark history tab "Executors" , it couldn't find driver > information because the event of SparkListenerBlockManagerAdded is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event
[ https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580648#comment-16580648 ] Hyukjin Kwon commented on SPARK-25120: -- Can you describe a bit more details? When does it happen? Also, can you upload screen shot and describe the expected output please? > EventLogListener may miss driver SparkListenerBlockManagerAdded event > -- > > Key: SPARK-25120 > URL: https://issues.apache.org/jira/browse/SPARK-25120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: deshanxiao >Priority: Major > > Sometimes in spark history tab "Executors" , it couldn't find driver > information because the event of SparkListenerBlockManagerAdded is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
[ https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580647#comment-16580647 ] Hyukjin Kwon commented on SPARK-25109: -- Does the same thing happen in Scala API too? > spark python should retry reading another datanode if the first one fails to > connect > > > Key: SPARK-25109 > URL: https://issues.apache.org/jira/browse/SPARK-25109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Yuanbo Liu >Priority: Major > Attachments: > WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > > > We use this code to read parquet files from HDFS: > spark.read.parquet('xxx') > and get error as below: > !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! > > What we can get is that one of the replica block cannot be read for some > reason, but spark python doesn't try to read another replica which can be > read successfully. So the application fails after throwing exception. > When I use hadoop fs -text to read the file, I can get content correctly. It > would be great that spark python can retry reading another replica block > instead of failing. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580644#comment-16580644 ] Wenchen Fan commented on SPARK-24771: - I've sent the email, let's wait for the feedback. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event
deshanxiao created SPARK-25120: -- Summary: EventLogListener may miss driver SparkListenerBlockManagerAdded event Key: SPARK-25120 URL: https://issues.apache.org/jira/browse/SPARK-25120 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: deshanxiao Sometimes in spark history tab "Executors" , it couldn't find driver information because the event of SparkListenerBlockManagerAdded is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25083) remove the type erasure hack in data source scan
[ https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25083: Target Version/s: 3.0.0 > remove the type erasure hack in data source scan > > > Key: SPARK-25083 > URL: https://issues.apache.org/jira/browse/SPARK-25083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > It's hacky to pretend a `RDD[ColumnarBatch]` to be a `RDD[InternalRow]`. We > should make the type explicit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25083) remove the type erasure hack in data source scan
[ https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580638#comment-16580638 ] Wenchen Fan commented on SPARK-25083: - I've set the target version as 3.0. It's a code refactor and has no impact to end users. > remove the type erasure hack in data source scan > > > Key: SPARK-25083 > URL: https://issues.apache.org/jira/browse/SPARK-25083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > It's hacky to pretend a `RDD[ColumnarBatch]` to be a `RDD[InternalRow]`. We > should make the type explicit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23874. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21939 [https://github.com/apache/spark/pull/21939] > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported in as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
[ https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25115. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22105 [https://github.com/apache/spark/pull/22105] > Eliminate extra memory copy done when a ByteBuf is used that is backed by > > 1 ByteBuffer. > - > > Key: SPARK-25115 > URL: https://issues.apache.org/jira/browse/SPARK-25115 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Norman Maurer >Assignee: Norman Maurer >Priority: Major > Fix For: 2.4.0 > > > Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by > more then 1 ByteBuf. In this case it makes more sense to call > ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
[ https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25115: --- Assignee: Norman Maurer > Eliminate extra memory copy done when a ByteBuf is used that is backed by > > 1 ByteBuffer. > - > > Key: SPARK-25115 > URL: https://issues.apache.org/jira/browse/SPARK-25115 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Norman Maurer >Assignee: Norman Maurer >Priority: Major > Fix For: 2.4.0 > > > Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by > more then 1 ByteBuf. In this case it makes more sense to call > ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
[ https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25113. - Resolution: Fixed Assignee: Kris Mok Fix Version/s: 2.4.0 > Add logging to CodeGenerator when any generated method's bytecode size goes > above HugeMethodLimit > - > > Key: SPARK-25113 > URL: https://issues.apache.org/jira/browse/SPARK-25113 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 2.4.0 > > > Add logging to help collect statistics on how often real world usage sees the > {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 > bytes (HugeMethodLimit) threshold. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25119) stages in wrong order within job page DAG chart
[ https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunjian Zhang updated SPARK-25119: -- Attachment: Screen Shot 2018-08-14 at 3.35.34 PM.png > stages in wrong order within job page DAG chart > --- > > Key: SPARK-25119 > URL: https://issues.apache.org/jira/browse/SPARK-25119 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Yunjian Zhang >Priority: Minor > Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png > > > {color:#33}multiple stages for same job are shown with wrong order in DAG > Visualization of job page.{color} > e.g. > stage27 stage19 stage20 stage24 stage21 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25119) stages in wrong order within job page DAG chart
[ https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunjian Zhang updated SPARK-25119: -- Description: {color:#33}multiple stages for same job are shown with wrong order in DAG Visualization of job page.{color} e.g. stage27 stage19 stage20 stage24 stage21 was: multiple stages for same job are shown with wrong order in job page. e.g. > stages in wrong order within job page DAG chart > --- > > Key: SPARK-25119 > URL: https://issues.apache.org/jira/browse/SPARK-25119 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Yunjian Zhang >Priority: Minor > Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png > > > {color:#33}multiple stages for same job are shown with wrong order in DAG > Visualization of job page.{color} > e.g. > stage27 stage19 stage20 stage24 stage21 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
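The mis-ordering above (stage27 before stage19) looks like the chart renders stages in arrival order rather than in stage-id order; sorting the labels by their numeric suffix would restore submission order. A minimal sketch, assuming the labels always look like `stageN` (`order_stages` is a hypothetical helper, not the Web UI's code):

```python
def order_stages(stage_labels):
    """Sort labels of the form 'stageN' numerically on N, so e.g.
    'stage9' sorts before 'stage10' and the chart lists stages in
    submission order regardless of the order they arrived in."""
    return sorted(stage_labels, key=lambda s: int(s[len("stage"):]))
```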
[jira] [Created] (SPARK-25119) stages in wrong order within job page DAG chart
Yunjian Zhang created SPARK-25119: - Summary: stages in wrong order within job page DAG chart Key: SPARK-25119 URL: https://issues.apache.org/jira/browse/SPARK-25119 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: Yunjian Zhang multiple stages for same job are shown with wrong order in job page. e.g. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580498#comment-16580498 ] Minh Thai commented on SPARK-20384: --- _(from my comment in SPARK-17368)_ I think the main problem is there was no way to create an encoder specifically for value classes even until today. However, I think we can make a [universal trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called {{OpaqueValue}}^1^ to be used as an upper type bound in encoder. This means: - Any user-defined value class has to mixin {{OpaqueValue}} - An encoder can be created to target those value classes. {code:java} trait OpaqueValue extends Any implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ??? case class Id(value: Int) extends AnyVal with OpaqueValue {code} this doesn't clash with the existing encoder for case class since the type constraint is more specific {code:java} implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T] {code} _I'm experimenting with this on my fork and will make a PR if it works well._ _(1) the name is inspired from [Opaque Type|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_ > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Daniel Davis >Priority: Minor > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. 
> For example, I would like to use the {{User}} case class which is using a > value-class for it's {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - function on the column (for example comparison ) should only work if > defined on the value-class and use these implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25092) Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-25092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580464#comment-16580464 ] Apache Spark commented on SPARK-25092: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22108 > Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules > -- > > Key: SPARK-25092 > URL: https://issues.apache.org/jira/browse/SPARK-25092 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.4.0 > > > Add RewriteExceptAll and RewriteIntersectAll in the list of > nonExcludableRules as the rewrites are essential for the functioning of > EXCEPT ALL and INTERSECT ALL feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25118) Need a solution to persist Spark application console outputs when running in shell/yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-25118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Gupta updated SPARK-25118: Description: We execute Spark applications in YARN Client mode a lot of time. When we do so the Spark Driver logs are printed to the console. We need a solution to persist the console outputs for later usage. This can be either for doing some troubleshooting or for some another log analysis. Ideally, we would like to persist these along with Yarn logs (when application is run in Yarn Client mode). Also, this has to be end-user agnostic, so that the logs are available for later usage without requiring the end-user to make some configuration changes. was: We execute Spark applications in YARN Client mode a lot of time. When we do so the Spark Driver logs are printed to the console. We need a solution to persist the console outputs for later usage. This can be either for doing some troubleshooting or for some another log analysis. Ideally, we would like to persist these along with Yarn logs (when application is run in Yarn Client mode). > Need a solution to persist Spark application console outputs when running in > shell/yarn client mode > --- > > Key: SPARK-25118 > URL: https://issues.apache.org/jira/browse/SPARK-25118 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Ankur Gupta >Priority: Major > > We execute Spark applications in YARN Client mode a lot of time. When we do > so the Spark Driver logs are printed to the console. > We need a solution to persist the console outputs for later usage. This can > be either for doing some troubleshooting or for some another log analysis. > Ideally, we would like to persist these along with Yarn logs (when > application is run in Yarn Client mode). 
Also, this has to be end-user > agnostic, so that the logs are available for later usage without requiring > the end-user to make some configuration changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
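One generic way to get the behavior the description asks for is to attach both a console handler and a file handler to the driver-side logger, so everything printed to the console also lands in a file that can be collected alongside the YARN logs. A hedged sketch with Python's standard `logging` module (the logger name and path are illustrative; this is not an existing Spark option):

```python
import logging

def configure_driver_logging(log_path):
    """Duplicate driver log output: one handler keeps writing to the
    console as before, the other persists a copy to `log_path`."""
    logger = logging.getLogger("driver")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler())        # console output, as before
    logger.addHandler(logging.FileHandler(log_path))  # persisted copy
    return logger
```

Because both handlers hang off the same logger, the end user does not have to change anything: the persisted copy is a side effect of the existing console logging.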
[jira] [Created] (SPARK-25118) Need a solution to persist Spark application console outputs when running in shell/yarn client mode
Ankur Gupta created SPARK-25118: --- Summary: Need a solution to persist Spark application console outputs when running in shell/yarn client mode Key: SPARK-25118 URL: https://issues.apache.org/jira/browse/SPARK-25118 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.0 Reporter: Ankur Gupta We execute Spark applications in YARN Client mode a lot of time. When we do so the Spark Driver logs are printed to the console. We need a solution to persist the console outputs for later usage. This can be either for doing some troubleshooting or for some another log analysis. Ideally, we would like to persist these along with Yarn logs (when application is run in Yarn Client mode). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16406) Reference resolution for large number of columns should be faster
[ https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580444#comment-16580444 ] antonkulaga commented on SPARK-16406: - Are you going to backport it to 2.3.2 as well? > Reference resolution for large number of columns should be faster > - > > Key: SPARK-16406 > URL: https://issues.apache.org/jira/browse/SPARK-16406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.4.0 > > > Resolving columns in a LogicalPlan on average takes n / 2 (n being the number > of columns). This gets problematic as soon as you try to resolve a large > number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
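The complexity claim in the description can be made concrete: a per-lookup linear scan over `n` columns averages `n/2` comparisons, so resolving `m` references costs `O(m * n / 2)`, while a one-time name-to-column map brings the total down to `O(n + m)`. A toy sketch of the two strategies (illustrative only, not Catalyst's implementation):

```python
def resolve_linear(columns, name):
    """O(n) per lookup: scan until the name matches (n/2 on average)."""
    for col in columns:
        if col == name:
            return col
    return None

def build_index(columns):
    """One O(n) pass that amortizes away the repeated scans."""
    return {col: col for col in columns}

def resolve_indexed(index, name):
    """O(1) per lookup against the prebuilt {name: column} map."""
    return index.get(name)
```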
[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects
[ https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580440#comment-16580440 ] Aydin Kocas commented on SPARK-23337: - [~marmbrus] Can you give a hint how to do it with "withColumn" in java? Dataset jsonRow = spark.readStream() .schema(...) .json(..).withColumn("createTime", ?? );.withWatermark("createTime", "10 minutes"); > withWatermark raises an exception on struct objects > --- > > Key: SPARK-23337 > URL: https://issues.apache.org/jira/browse/SPARK-23337 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 > Environment: Linux Ubuntu, Spark on standalone mode >Reporter: Aydin Kocas >Priority: Major > > Hi, > > when using a nested object (I mean an object within a struct, here concrete: > _source.createTime) from a json file as the parameter for the > withWatermark-method, I get an exception (see below). > Anything else works flawlessly with the nested object. > > +*{color:#14892c}works:{color}*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime", > "10 seconds").toDF();{code} > > json structure: > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- myTime: timestamp (nullable = true) > ..{code} > +*{color:#d04437}does not work - nested json{color}:*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime", > "10 seconds").toDF();{code} > > json structure: > > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- _source: struct (nullable = true) > | |-- createTime: timestamp (nullable = true) > .. 
> > Exception in thread "main" > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > 'EventTimeWatermark '_source.createTime, interval 10 seconds > +- Deduplicate [_id#0], true > +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true), > StructField(_index,StringType,true), StructField(_score,LongType,true), > StructField(_source,StructType(StructField(additionalData,StringType,true), > StructField(client,StringType,true), > StructField(clientDomain,BooleanType,true), > StructField(clientVersion,StringType,true), > StructField(country,StringType,true), > StructField(countryName,StringType,true), > StructField(createTime,TimestampType,true), > StructField(externalIP,StringType,true), > StructField(hostname,StringType,true), > StructField(internalIP,StringType,true), > StructField(location,StringType,true), > StructField(locationDestination,StringType,true), > StructField(login,StringType,true), > StructField(originalRequestString,StringType,true), > StructField(password,StringType,true), > StructField(peerIdent,StringType,true), > StructField(peerType,StringType,true), > StructField(recievedTime,TimestampType,true), > StructField(sessionEnd,StringType,true), > StructField(sessionStart,StringType,true), > StructField(sourceEntryAS,StringType,true), > StructField(sourceEntryIp,StringType,true), > StructField(sourceEntryPort,StringType,true), > StructField(targetCountry,StringType,true), > StructField(targetCountryName,StringType,true), > StructField(targetEntryAS,StringType,true), > StructField(targetEntryIp,StringType,true), > StructField(targetEntryPort,StringType,true), > StructField(targetport,StringType,true), > StructField(username,StringType,true), > StructField(vulnid,StringType,true)),true), > StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), > FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4] > at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at >
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580299#comment-16580299 ]

Imran Rashid commented on SPARK-24938:
--------------------------------------

{quote}
So perhaps the fix here is not to use the default netty pool, but to use ctx.alloc().buffer() instead of .heapBuffer()?
{quote}
Yeah, that's what I meant: {{alloc().buffer()}} could be onheap or offheap, depending on how netty's allocator is configured (in Spark via spark..io.preferDirectBufs).

> Understand usage of netty's onheap memory use, even with offheap pools
> ----------------------------------------------------------------------
>
>                 Key: SPARK-24938
>                 URL: https://issues.apache.org/jira/browse/SPARK-24938
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Priority: Major
>              Labels: memory-analysis
>
> We've observed, after adding some instrumentation (using SPARK-24918 and https://github.com/squito/spark-memory), that netty uses a large amount of onheap memory in its pools, in addition to the expected offheap memory. We should figure out why it's using that memory, and whether it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. Switching to requesting a buffer from the default pool would probably fix this.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
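The growth pattern described in the issue can be sketched with a toy model. This is not netty's actual allocator; the 16 MiB chunk size and the 8-arena count are assumptions taken from the description's "16MB per arena, 128 MB spike" numbers:

```python
# Toy model (not netty) of a pooled allocator whose arenas grow in fixed
# 16 MiB chunks. A burst of tiny header allocations, one per arena, commits
# a whole chunk each time, leaving the pool almost entirely unused.
CHUNK = 16 * 1024 * 1024  # assumed chunk size, per the issue description

class ToyArena:
    def __init__(self):
        self.committed = 0   # bytes reserved from the OS
        self.used = 0        # bytes actually handed out to callers

    def alloc(self, size):
        # Grow by a whole chunk whenever the current chunks can't satisfy
        # the request, regardless of how small the request is.
        if self.used + size > self.committed:
            self.committed += CHUNK
        self.used += size

arenas = [ToyArena() for _ in range(8)]   # assumed arena count
for arena in arenas:
    arena.alloc(16)                       # one tiny message header each

committed = sum(a.committed for a in arenas)
used = sum(a.used for a in arenas)
# committed is 8 * 16 MiB = 128 MiB, while used is only 128 bytes.
```

Under these assumed numbers, 128 bytes of headers pin 128 MiB of pool, which matches the spike the issue attributes to the MessageEncoder line.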
[jira] [Updated] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-24838:
----------------------------
    Target Version/s: 2.4.0

> Support uncorrelated IN/EXISTS subqueries for more operators
> ------------------------------------------------------------
>
>                 Key: SPARK-24838
>                 URL: https://issues.apache.org/jira/browse/SPARK-24838
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Qifan Pu
>            Priority: Major
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only for filter operators. Running the query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries can only be used in a Filter
> {code}
[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580298#comment-16580298 ]

Xiao Li commented on SPARK-24838:
---------------------------------

[~maurits] Any update? Also cc [~liwen] [~hvanhovell]
[jira] [Updated] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-24838:
----------------------------
    Labels:   (was: spree)
[jira] [Commented] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well
[ https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580285#comment-16580285 ]

Apache Spark commented on SPARK-25105:
--------------------------------------

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/22100

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25105
>                 URL: https://issues.apache.org/jira/browse/SPARK-25105
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: holdenk
>            Priority: Trivial
>
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'PandasUDFType' is not defined
> {code}
> When explicitly imported it works fine:
> {code:java}
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
> We just need to make sure it's included in __all__.
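The mechanism behind the fix is plain CPython semantics: `from module import *` only binds the names listed in the module's `__all__` when that attribute is defined. A minimal, self-contained sketch (the module here is a hypothetical stand-in, not pyspark):

```python
import sys
import types

# Build a throwaway module that mimics the reported situation: it defines
# both pandas_udf and PandasUDFType, but lists only the former in __all__.
mod = types.ModuleType("funcs_demo")
mod.pandas_udf = lambda f, t=None, kind=None: f   # stand-in, not the real API

class PandasUDFType:                               # stand-in enum
    GROUPED_MAP = 201

mod.PandasUDFType = PandasUDFType
mod.__all__ = ["pandas_udf"]                       # PandasUDFType omitted, as in the bug
sys.modules["funcs_demo"] = mod

ns = {}
exec("from funcs_demo import *", ns)
# pandas_udf is visible, PandasUDFType is not -- the reported NameError.
assert "pandas_udf" in ns and "PandasUDFType" not in ns

mod.__all__.append("PandasUDFType")                # the one-line fix
ns2 = {}
exec("from funcs_demo import *", ns2)
assert "PandasUDFType" in ns2
```

`__all__` is consulted at the time the star-import executes, so appending the missing name is all the fix needs to do.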
[jira] [Assigned] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well
[ https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25105:
------------------------------------
    Assignee:   (was: Apache Spark)
[jira] [Assigned] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well
[ https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25105:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection
[ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580275#comment-16580275 ]

Eric Wohlstadter commented on SPARK-21375:
------------------------------------------

[~bryanc] Hi Bryan,

I'm using the Spark-Arrow conversion support inside of a DataSourceV2 {{SupportsColumnBatchScan}} DataReader. It uses {{ArrowStreamReader}} to read from the external data source and converts the input from the stream to Spark's {{ArrowColumnVector}}.

I'm having trouble when the original input comes from a Hive timestamp (without time zone). It looks like {{ArrowColumnVector}} requires {{TimeStampMicroTZVector}}, so I need to fill in a time zone when creating the {{TimeStampMicroTZVector}} on the writer side of the Arrow stream. This creates some inconsistency when the two ends of the Arrow stream are in different time zones.

I'm wondering if I might be missing some other way of handling this correctly. Would you happen to know a better way to handle conversion of timestamps (without time zone) using the Spark-Arrow conversion support?

/cc [~dongjoon] [~hyukjin.kwon]

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21375
>                 URL: https://issues.apache.org/jira/browse/SPARK-21375
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 2.3.0
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using ArrowConverters. These are common types for data analysis used in both Spark and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps without a timezone specified internally. PySpark takes a UTC timestamp that is adjusted to local time, and Arrow is in UTC time. Hopefully there is a clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds
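The inconsistency Eric describes can be reproduced without Arrow at all, using only stdlib datetimes. The zones and timestamp value below are hypothetical; the point is what happens when a "timestamp without time zone" gets a writer-local zone stamped onto it and is then rendered on a reader in a different zone:

```python
from datetime import datetime, timezone, timedelta

# A Hive "timestamp without time zone": just a wall-clock value.
naive = datetime(2018, 8, 15, 12, 0, 0)

writer_zone = timezone(timedelta(hours=-8))   # hypothetical writer host zone
reader_zone = timezone(timedelta(hours=0))    # hypothetical reader host zone

# Writer side: a TZ-aware vector forces a zone to be filled in, so the
# writer stamps its own local zone onto the value.
on_wire = naive.replace(tzinfo=writer_zone)

# Reader side: the value is rendered as wall-clock time in the reader's
# zone, then the zone is dropped again. The wall-clock value has shifted.
seen = on_wire.astimezone(reader_zone).replace(tzinfo=None)

assert seen != naive
assert seen - naive == timedelta(hours=8)   # shifted by the zone offset gap
```

The shift equals the offset difference between the two hosts, which is exactly the "inconsistency when the two ends of the arrow stream are in different time zones" described above.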
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580260#comment-16580260 ]

Marcelo Vanzin commented on SPARK-24938:
----------------------------------------

I think that unless there's a measurable performance issue, we should switch to using the allocator's default mode. Otherwise, since these headers are pretty small, we could just switch to unpooled buffers.
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580229#comment-16580229 ]

Joe Pallas commented on SPARK-22236:
------------------------------------

{quote}people with preexisting datasets exported by Spark would suffer (unexpected) data loss{quote}
Yes, that is an important consideration. But is the implication that we can never change the default behavior to match the standard, *even in a major revision*, because of the impact on reading previously written data? I don't see how that would be sensible.

So, how can we minimize the risk associated with making this change?
* Documentation is certainly one way, in both the API doc and the release notes.
* We could add a {{spark2-compat}} option (equivalent to setting escape, but the name emphasizes the compatibility semantics).
* We could flag a warning or error if we see {{\"}} in the input. Since that could appear in legal RFC 4180 input, we might then want an option to suppress the warning/error when we know the input was not generated with the old settings.

I've no idea how difficult the implementation would be on that last one.

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour as set out by RFC 4180 (and adhered to by many software packages) is to escape using a second double quote.
> This piece of Python code demonstrates the issue:
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1,2])
>
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
>
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
>
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv')  # reading the file correctly
> df.write.format("csv").save('testout.csv')
>
> with open('testout.csv/part-csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91], where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would be useful if Spark had sensible defaults that conform to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change and thus if accepted, it would probably need to result in a warning first, before moving to a new default.
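Joe's third bullet, flagging input that contains {{\"}}, can be sketched as a small pre-flight check. This is a hypothetical helper, not anything in Spark; the function name and regex are illustrative assumptions:

```python
import re

# Heuristic from the comment's third bullet: a backslash immediately
# followed by a double quote suggests the file was written with Spark's
# legacy escape='\' default rather than RFC 4180 quote-doubling.
LEGACY_ESCAPE = re.compile(r'\\"')

def looks_like_legacy_spark_csv(text: str) -> bool:
    """Return True if the CSV text contains a backslash-escaped quote."""
    return bool(LEGACY_ESCAPE.search(text))

# A field escaped the legacy Spark way trips the check...
assert looks_like_legacy_spark_csv('a \\"quoted\\" string')
# ...while RFC 4180 quote-doubling does not.
assert not looks_like_legacy_spark_csv('"a ""quoted"" string"')
```

As the comment notes, {{\"}} can also appear in legal RFC 4180 data (a quoted field whose content ends with a backslash), so a check like this could only ever warn, never hard-fail, without a suppression option.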
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580225#comment-16580225 ]

Nihar Sheth commented on SPARK-24938:
-------------------------------------

His comment for that change is "The header is a very small buffer, I thought it was overkill to try to get direct byte bufs for it." Not sure why that would be a thing, though; wouldn't the already-allocated offheap memory be used?
[jira] [Assigned] (SPARK-25043) spark-sql should print the appId and master on startup
[ https://issues.apache.org/jira/browse/SPARK-25043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-25043:
-------------------------------------
    Assignee: Alessandro Bellina

> spark-sql should print the appId and master on startup
> ------------------------------------------------------
>
>                 Key: SPARK-25043
>                 URL: https://issues.apache.org/jira/browse/SPARK-25043
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alessandro Bellina
>            Assignee: Alessandro Bellina
>            Priority: Trivial
>             Fix For: 2.4.0
>
> In spark-sql, if logging is turned down all the way, it's not possible to find out what appId is running at the moment. This small change adds a print to stdout containing the master type and the appId, so that information is on screen.
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580213#comment-16580213 ]

Marcelo Vanzin commented on SPARK-24938:
----------------------------------------

BTW the change from buffer() to heapBuffer() was made in SPARK-4188, but I don't really expect Aaron to remember why at this point (if we can even reach him).
[jira] [Resolved] (SPARK-25043) spark-sql should print the appId and master on startup
[ https://issues.apache.org/jira/browse/SPARK-25043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-25043.
-----------------------------------
    Resolution: Fixed
    Fix Version/s: 2.4.0
[jira] [Commented] (SPARK-25117) Add EXEPT ALL and INTERSECT ALL support in R.
[ https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580208#comment-16580208 ]

Apache Spark commented on SPARK-25117:
--------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22107

> Add EXEPT ALL and INTERSECT ALL support in R.
> ---------------------------------------------
>
>                 Key: SPARK-25117
>                 URL: https://issues.apache.org/jira/browse/SPARK-25117
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.3.1
>            Reporter: Dilip Biswal
>            Priority: Major
[jira] [Assigned] (SPARK-25117) Add EXEPT ALL and INTERSECT ALL support in R.
[ https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25117:
------------------------------------
    Assignee:   (was: Apache Spark)
[jira] [Assigned] (SPARK-25117) Add EXEPT ALL and INTERSECT ALL support in R.
[ https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25117:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests
[ https://issues.apache.org/jira/browse/SPARK-25116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25116:
------------------------------------
    Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix the "exit code 1" error when terminating Kafka tests
> --------------------------------------------------------
>
>                 Key: SPARK-25116
>                 URL: https://issues.apache.org/jira/browse/SPARK-25116
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.4.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Major
>
> SBT may report the following error when all Kafka tests are finished:
> {code}
> sbt/sbt/0.13.17/test-interface-1.0.jar sbt.ForkMain 39359 failed with exit code 1
> [error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests unsuccessful
> {code}
> This is because we are leaking a Kafka cluster.
[jira] [Commented] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests
[ https://issues.apache.org/jira/browse/SPARK-25116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580202#comment-16580202 ]

Apache Spark commented on SPARK-25116:
--------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22106
[jira] [Assigned] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests
[ https://issues.apache.org/jira/browse/SPARK-25116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25116:
------------------------------------
    Assignee: Apache Spark  (was: Shixiong Zhu)
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580195#comment-16580195 ]

Marcelo Vanzin commented on SPARK-24938:
----------------------------------------

The line you mention is this, right?
{code}
ByteBuf header = ctx.alloc().heapBuffer(headerLength);
{code}
My understanding of that line is that it's using the allocator used when building the client or server. So perhaps the fix here is not to use the default netty pool, but to use {{ctx.alloc().buffer()}} instead of {{.heapBuffer()}}? Seems that this way you'd be actually using the shared buffers when the allocator is configured for direct buffers, instead of initializing a heap pool just for the message encoder...
[jira] [Resolved] (SPARK-25088) Rest Server default & doc updates
[ https://issues.apache.org/jira/browse/SPARK-25088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-25088.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 2.4.0

Resolved by https://github.com/apache/spark/pull/22071

> Rest Server default & doc updates
> ---------------------------------
>
>                 Key: SPARK-25088
>                 URL: https://issues.apache.org/jira/browse/SPARK-25088
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Spark Core
>    Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Major
>             Fix For: 2.4.0
>
> The rest server could use some updates on defaults & docs, both in standalone and mesos.
[jira] [Created] (SPARK-25117) Add EXEPT ALL and INTERSECT ALL support in R.
Dilip Biswal created SPARK-25117:
------------------------------------

             Summary: Add EXEPT ALL and INTERSECT ALL support in R.
                 Key: SPARK-25117
                 URL: https://issues.apache.org/jira/browse/SPARK-25117
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.3.1
            Reporter: Dilip Biswal
[jira] [Created] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests
Shixiong Zhu created SPARK-25116:
------------------------------------

             Summary: Fix the "exit code 1" error when terminating Kafka tests
                 Key: SPARK-25116
                 URL: https://issues.apache.org/jira/browse/SPARK-25116
             Project: Spark
          Issue Type: Bug
          Components: Tests
    Affects Versions: 2.4.0
            Reporter: Shixiong Zhu
            Assignee: Shixiong Zhu
[jira] [Updated] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25114: Target Version/s: 2.3.2, 2.4.0 (was: 2.4.0) > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580153#comment-16580153 ] Xiao Li commented on SPARK-25114: - [~jerryshao] Another blocker for 2.3 > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
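The hazard behind this class of comparator bug is easiest to see by emulating Java's narrowing of a 64-bit difference to a 32-bit int. This is a hypothetical sketch, not the actual RecordBinaryComparator code; the exact boundary condition is the one stated in the issue title, while the sketch below shows the general truncation failure, where a difference that is an exact multiple of 2**32 narrows to 0 and two unequal values compare as equal:

```python
def to_int32(x):
    """Emulate Java's (int) narrowing of a 64-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def subtraction_compare(a, b):
    # Hypothetical subtraction-based comparator: return the 64-bit
    # difference narrowed to an int, as a buggy Java comparator would.
    return to_int32(a - b)

a, b = 0, 1 << 32                        # differ by exactly 2**32
assert a != b
assert subtraction_compare(a, b) == 0    # wrongly reports "equal"
assert subtraction_compare(3, 5) < 0     # ordinary cases still look fine
```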
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580140#comment-16580140 ] Imran Rashid commented on SPARK-24938: -- Cool, sounds like the info we need for making this change, then. [~zsxwing] [~vanzin] do you have thoughts on this? Any reason why MessageEncoder is explicitly using onheap pools, rather than the configured default netty pool? > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses a large amount of onheap memory in its pools, in > addition to the expected offheap memory, when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why it's using that memory, and whether it's really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25051. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.3.3 > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: MIK >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.3.3 > > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25051: Fix Version/s: (was: 2.3.3) 2.3.2 > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: MIK >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.3.2 > > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580113#comment-16580113 ] Nihar Sheth commented on SPARK-24938: - Your expectation is correct: the offheap pools remained at 16 MB after adding this change. There does appear to be a tiny corresponding change in the number of offheap bytes used, so I agree that netty is only using the offheap buffers. > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses a large amount of onheap memory in its pools, in > addition to the expected offheap memory, when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why it's using that memory, and whether it's really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
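The 128 MB figure quoted in the description can be reconstructed with back-of-the-envelope arithmetic: netty allocates pool memory in fixed-size chunks per arena, so even one stray on-heap allocation per arena pins a whole chunk. In the sketch below, the 16 MiB chunk size matches netty's historical default, but the arena count of 8 is an assumption chosen only to reproduce the quoted figure (the real count depends on cores and configuration):

```python
CHUNK_SIZE_MIB = 16   # netty's historical default chunk size
NUM_ARENAS = 8        # assumed arena count, not a verified Spark/netty value

# One small burst of on-heap allocations can touch every arena, and each
# arena then grows by a whole chunk even if almost none of it is used:
spike_mib = CHUNK_SIZE_MIB * NUM_ARENAS
assert spike_mib == 128
```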
[jira] [Assigned] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
[ https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25115: Assignee: Apache Spark > Eliminate extra memory copy done when a ByteBuf is used that is backed by > > 1 ByteBuffer. > - > > Key: SPARK-25115 > URL: https://issues.apache.org/jira/browse/SPARK-25115 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Norman Maurer >Assignee: Apache Spark >Priority: Major > > Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by > more than 1 ByteBuffer. In this case it makes more sense to call > ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
[ https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580106#comment-16580106 ] Apache Spark commented on SPARK-25115: -- User 'normanmaurer' has created a pull request for this issue: https://github.com/apache/spark/pull/22105 > Eliminate extra memory copy done when a ByteBuf is used that is backed by > > 1 ByteBuffer. > - > > Key: SPARK-25115 > URL: https://issues.apache.org/jira/browse/SPARK-25115 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Norman Maurer >Priority: Major > > Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by > more than 1 ByteBuffer. In this case it makes more sense to call > ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
[ https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25115: Assignee: (was: Apache Spark) > Eliminate extra memory copy done when a ByteBuf is used that is backed by > > 1 ByteBuffer. > - > > Key: SPARK-25115 > URL: https://issues.apache.org/jira/browse/SPARK-25115 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Norman Maurer >Priority: Major > > Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by > more than 1 ByteBuffer. In this case it makes more sense to call > ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
[ https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580087#comment-16580087 ] Norman Maurer commented on SPARK-25115: --- I opened the following PR to fix it: https://github.com/apache/spark/pull/22105 > Eliminate extra memory copy done when a ByteBuf is used that is backed by > > 1 ByteBuffer. > - > > Key: SPARK-25115 > URL: https://issues.apache.org/jira/browse/SPARK-25115 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Norman Maurer >Priority: Major > > Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by > more than 1 ByteBuffer. In this case it makes more sense to call > ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
Norman Maurer created SPARK-25115: - Summary: Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer. Key: SPARK-25115 URL: https://issues.apache.org/jira/browse/SPARK-25115 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.3.1 Reporter: Norman Maurer Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by more than 1 ByteBuffer. In this case it makes more sense to call ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
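The cost difference the issue describes is the same one Python shows between concatenating byte chunks (a full copy) and handing each underlying chunk to the consumer as a view. This is only an analogy for the proposed netty change, not its API:

```python
chunks = [bytearray(b"hello "), bytearray(b"world")]

# nioBuffer()-style: merge the composite into one buffer, copying every byte.
merged = b"".join(chunks)

# nioBuffers()-style: expose the underlying buffers and iterate, copy-free.
views = [memoryview(c) for c in chunks]

out = bytearray()
for v in views:   # the consumer walks the buffer array instead of one blob
    out += v
assert bytes(out) == merged == b"hello world"
```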
[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580064#comment-16580064 ] Marcelo Vanzin commented on SPARK-24787: In that case it might be good to only use hsync in "safer" contexts (i.e. not in event storms like task updates). > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > inprogress files allowing history server being responsive. > Although we have a production job that has 6 tasks per stage and due to > hsync being slow it starts dropping events and the history server has wrong > stats due to events being dropped. > A viable solution is not to make it sync very frequently or make it > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580062#comment-16580062 ] Marcelo Vanzin commented on SPARK-24771: Asking is a good start. But I have anecdotal evidence that there are quite a few people who use Avro/RDDs... not sure whether they're planning to move to SQL any time soon. In any case, it would be good to know exactly what breaks so that we can have a proper release note, instead of just dumping the problem on the user's lap. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579974#comment-16579974 ] Ondrej Kokes commented on SPARK-22236: -- Multiline=true by default would cause some slowdown, but data quality would either increase or stay the same - it would never go down. So the discussion there is mostly about performance. With escape changes, while we would see improvements in data quality on the input side, *people with preexisting datasets exported by Spark would suffer (unexpected) data loss,* because the escaping strategy would potentially differ from the time the data was written. I think that's a bit more important aspect to consider. > CSV I/O: does not respect RFC 4180 > -- > > Key: SPARK-22236 > URL: https://issues.apache.org/jira/browse/SPARK-22236 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Ondrej Kokes >Priority: Minor > > When reading or writing CSV files with Spark, double quotes are escaped with > a backslash by default. However, the appropriate behaviour as set out by RFC > 4180 (and adhered to by many software packages) is to escape using a second > double quote. 
> This piece of Python code demonstrates the issue > {code} > import csv > with open('testfile.csv', 'w') as f: > cw = csv.writer(f) > cw.writerow(['a 2.5" drive', 'another column']) > cw.writerow(['a "quoted" string', '"quoted"']) > cw.writerow([1,2]) > with open('testfile.csv') as f: > print(f.read()) > # "a 2.5"" drive",another column > # "a ""quoted"" string","""quoted""" > # 1,2 > spark.read.csv('testfile.csv').collect() > # [Row(_c0='"a 2.5"" drive"', _c1='another column'), > # Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'), > # Row(_c0='1', _c1='2')] > # explicitly stating the escape character fixed the issue > spark.read.option('escape', '"').csv('testfile.csv').collect() > # [Row(_c0='a 2.5" drive', _c1='another column'), > # Row(_c0='a "quoted" string', _c1='"quoted"'), > # Row(_c0='1', _c1='2')] > {code} > The same applies to writes, where reading the file written by Spark may > result in garbage. > {code} > df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file > correctly > df.write.format("csv").save('testout.csv') > with open('testout.csv/part-csv') as f: > cr = csv.reader(f) > print(next(cr)) > print(next(cr)) > # ['a 2.5\\ drive"', 'another column'] > # ['a \\quoted\\" string"', '\\quoted\\""'] > {code} > The culprit is in > [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91], > where the default escape character is overridden. > While it's possible to work with CSV files in a "compatible" manner, it would > be useful if Spark had sensible defaults that conform to the above-mentioned > RFC (as well as W3C recommendations). I realise this would be a breaking > change and thus if accepted, it would probably need to result in a warning > first, before moving to a new default. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25051: Component/s: (was: Spark Core) > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Blocker > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at 
org.apache.spark.sql.Dataset.<init>(Dataset.scala:172) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579967#comment-16579967 ] Xiao Li commented on SPARK-25051: - [~mgaido] This breaks the backport rule. We are unable to remove AnalysisBarrier from 2.3. AnalysisBarrier is a nightmare to many. Sorry for that. > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Blocker > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579965#comment-16579965 ] Li Jin commented on SPARK-24561: I am looking into this. Early investigation: https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25083) remove the type erasure hack in data source scan
[ https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579961#comment-16579961 ] Ryan Blue commented on SPARK-25083: --- [~cloud_fan], what release is this targeting? > remove the type erasure hack in data source scan > > > Key: SPARK-25083 > URL: https://issues.apache.org/jira/browse/SPARK-25083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > It's hacky to pretend a `RDD[ColumnarBatch]` to be a `RDD[InternalRow]`. We > should make the type explicit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Jin updated SPARK-24721:
---------------------------
    Component/s: SQL

> Failed to use PythonUDF with literal inputs in filter with data sources
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24721
>                 URL: https://issues.apache.org/jira/browse/SPARK-24721
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.1
>            Reporter: Xiao Li
>            Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
>
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
>
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
>
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
>
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>     'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>     'random_probability',
>     random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), requires attributes from more than one child.
> 	at scala.sys.package$.error(package.scala:27)
> 	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
> 	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
> 	at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> 	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
> 	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
> 	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
> {code}
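For reference, the two UDF bodies in the reproduction are ordinary Python functions and run fine on their own; the failure happens in Catalyst's `ExtractPythonUDFs` planning rule (as the stack trace shows), not inside the Python code. A standalone sketch, runnable without a Spark session:

```python
import random

# The UDF bodies from the reproduction above, as plain Python functions
# (no Spark session needed to exercise their logic).

def random_probability(label):
    # Labels of 1.0 draw from [0.5, 1.0); all others from [0.0, 0.4999).
    if label == 1.0:
        return random.uniform(0.5, 1.0)
    else:
        return random.uniform(0.0, 0.4999)

def randomize_label(ratio):
    # Returns 1.0 with probability (1 - ratio), else 0.0.
    if random.random() >= ratio:
        return 1.0
    else:
        return 0.0
```

Note that `lit(1 - 0.1)` evaluates to the literal 0.9 before Spark ever sees it, which is why the error message names the expression `randomize_label(0.9)`: the UDF is applied to a pure literal rather than to a column attribute.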
[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Jin updated SPARK-24721:
---------------------------
     Issue Type: Bug  (was: Sub-task)
         Parent:     (was: SPARK-22216)
[jira] [Comment Edited] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579956#comment-16579956 ]

Li Jin edited comment on SPARK-24721 at 8/14/18 3:26 PM:
---------------------------------------------------------

Updated Jira title to reflect the actual issue

was (Author: icexelloss):
Updates Jira title to reflect the actual issue
[jira] [Commented] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579956#comment-16579956 ]

Li Jin commented on SPARK-24721:
--------------------------------

Updates Jira title to reflect the actual issue
[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Jin updated SPARK-24721:
---------------------------
        Summary: Failed to use PythonUDF with literal inputs in filter with data sources  (was: Failed to call PythonUDF whose input is the output of another PythonUDF)
[jira] [Commented] (SPARK-24941) Add RDDBarrier.coalesce() function
[ https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579930#comment-16579930 ]

Jiang Xingbo commented on SPARK-24941:
--------------------------------------

Shall we add something like `spark.default.parallelism`? It may not be a fixed number but rather a fraction, so that any barrier stage launches fewer tasks than fraction * totalCores?

> Add RDDBarrier.coalesce() function
> ----------------------------------
>
>                 Key: SPARK-24941
>                 URL: https://issues.apache.org/jira/browse/SPARK-24941
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Jiang Xingbo
>            Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g.
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall
> provide a way in RDDBarrier to enable users to specify the number of tasks
> in a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
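The fraction-based cap suggested in the comment above could be sketched as follows. This is purely illustrative: `barrier_task_cap` and `fraction` are hypothetical names, not an existing Spark configuration or API.

```python
import math

def barrier_task_cap(total_cores, fraction):
    """Maximum number of tasks a barrier stage may launch, expressed as a
    fraction of the cluster's total cores (illustrative sketch only)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always allow at least one task; never exceed the available cores.
    return max(1, min(total_cores, math.floor(fraction * total_cores)))
```

With 64 total cores and a fraction of 0.5, a barrier stage would be capped at 32 tasks, regardless of how many HDFS input splits the source produced.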
[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579883#comment-16579883 ]

Apache Spark commented on SPARK-24721:
--------------------------------------

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/22104
[jira] [Assigned] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24721:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24721:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25114: Assignee: (was: Apache Spark) > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579880#comment-16579880 ] Apache Spark commented on SPARK-25114: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/22101 > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25114: Assignee: Apache Spark > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579879#comment-16579879 ] Jiang Xingbo commented on SPARK-25114: -- I created https://github.com/apache/spark/pull/22101 for this. > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-25114: - Labels: correctness (was: ) > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-25114: - Priority: Blocker (was: Major) > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
Jiang Xingbo created SPARK-25114: Summary: RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE Key: SPARK-25114 URL: https://issues.apache.org/jira/browse/SPARK-25114 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Jiang Xingbo It is possible for two objects to be unequal and yet we consider them as equal within RecordBinaryComparator, if the long values are separated by Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
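The arithmetic behind this report is easy to demonstrate. The sketch below is a hypothetical simplification; it assumes the subtraction-and-modulo pattern that the issue title implies ("divisible by Integer.MAX_VALUE") and is not Spark's actual RecordBinaryComparator code, which compares 8-byte words of record binaries.

```java
// Hypothetical simplification of the bug described above: a comparator that
// reduces the difference of two long words modulo Integer.MAX_VALUE.
// This is NOT Spark's actual code, only an illustration of the arithmetic.
public class ComparatorOverflowDemo {

    // Buggy pattern: two unequal words whose difference is divisible by
    // Integer.MAX_VALUE compare as equal (result 0).
    public static int buggyCompare(long leftWord, long rightWord) {
        return (int) ((leftWord - rightWord) % Integer.MAX_VALUE);
    }

    // Safe pattern: compare directly instead of subtracting.
    public static int safeCompare(long leftWord, long rightWord) {
        return Long.compare(leftWord, rightWord);
    }

    public static void main(String[] args) {
        long left = Integer.MAX_VALUE; // 2147483647L
        long right = 0L;
        // left != right, yet the buggy comparator reports equality, because
        // 2147483647 % Integer.MAX_VALUE == 0.
        System.out.println(buggyCompare(left, right)); // 0: wrongly "equal"
        System.out.println(safeCompare(left, right));  // 1: correctly greater
    }
}
```

The fix merged for this issue replaces subtraction-based comparisons with direct ordering comparisons of the two words.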
[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579854#comment-16579854 ] Thomas Graves commented on SPARK-24787: --- Yes, it was caused by hsync: hsync has to go to the namenode in addition to the datanode, while hflush is a datanode-only operation. We saw a huge increase in dropped events with this on large jobs; we reverted the change, went back to only hflush, and the dropping stopped. I talked to one of our HDFS experts and he said hsync is expensive. It might depend on how loaded your HDFS cluster is. > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > in-progress files, allowing the history server to stay responsive. > We have a production job with 6 tasks per stage, and because hsync is slow it > starts dropping events, leaving the history server with wrong stats. > A viable solution is to sync less frequently, or to make the sync frequency > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
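The mitigation suggested in the description (sync less frequently, or make it configurable) can be sketched as a simple counter-based policy. All names below are illustrative, not Spark's actual event-log writer: the caller would issue the cheap hflush() after every event and reserve the expensive hsync() for the events this policy flags.

```java
// Hypothetical sketch of a configurable flush policy for event logging:
// hsync() only every N-th event (N configurable; 0 disables hsync entirely),
// with hflush() assumed after every event. Not Spark's real code.
public class FlushPolicy {
    private final int syncInterval; // hsync every N events; 0 = never hsync
    private long eventCount = 0;

    public FlushPolicy(int syncInterval) {
        this.syncInterval = syncInterval;
    }

    // Called once per logged event; returns true when the caller should
    // follow the cheap hflush() with a full hsync().
    public boolean shouldSync() {
        eventCount++;
        return syncInterval > 0 && eventCount % syncInterval == 0;
    }
}
```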
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579851#comment-16579851 ] Thomas Graves commented on SPARK-24918: --- Personally I prefer the explicit config (spark.executor.plugins). Opt-out is, in my opinion, easier for users to mess up: someone includes a jar from some other group and doesn't realize it has this ServiceLoader. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how it's versioned. > Does it just get created when the executor starts, and that's it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything.
> Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
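The opt-in style favored in the comment above (an explicit spark.executor.plugins list rather than ServiceLoader discovery) can be sketched as follows. The interface and class names here are hypothetical; Spark's eventual API may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of explicit, opt-in plugin loading: only classes named
// in a config value (e.g. spark.executor.plugins) are instantiated, so a jar
// that merely sits on the classpath cannot inject a plugin by itself.
public class PluginLoaderDemo {

    // Minimal stand-in for the plugin interface discussed in the issue.
    public interface ExecutorPlugin {
        default void init() {}
        default void shutdown() {}
    }

    // A no-op plugin used for demonstration.
    public static class NoOpPlugin implements ExecutorPlugin {}

    public static List<ExecutorPlugin> loadPlugins(String commaSeparatedNames) {
        List<ExecutorPlugin> plugins = new ArrayList<>();
        if (commaSeparatedNames == null || commaSeparatedNames.trim().isEmpty()) {
            return plugins; // opt-in: nothing configured, nothing loaded
        }
        for (String name : commaSeparatedNames.split(",")) {
            try {
                Class<?> cls = Class.forName(name.trim());
                plugins.add((ExecutorPlugin) cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("bad plugin class: " + name, e);
            }
        }
        return plugins;
    }
}
```

By contrast, an opt-out ServiceLoader design would instantiate every ExecutorPlugin implementation found on the classpath, which is the failure mode the comment warns about.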
[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25051: -- Priority: Blocker (was: Major) > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Blocker > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at 
org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:172) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23938) High-order function: map_zip_with(map&lt;K, V1&gt;, map&lt;K, V2&gt;, function&lt;K, V1, V2, V3&gt;) → map&lt;K, V3&gt;
[ https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23938. --- Resolution: Fixed Assignee: Marek Novotny Fix Version/s: 2.4.0 Issue resolved by pull request 22017 https://github.com/apache/spark/pull/22017 > High-order function: map_zip_with(map&lt;K, V1&gt;, map&lt;K, V2&gt;, function&lt;K, V1, > V2, V3&gt;) → map&lt;K, V3&gt; > --- > > Key: SPARK-23938 > URL: https://issues.apache.org/jira/browse/SPARK-23938 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marek Novotny >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/map.html > Merges the two given maps into a single map by applying the function to the pair > of values with the same key. For keys present in only one map, NULL will be > passed as the value for the missing key. > {noformat} > SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 > -> be, 3 -> cf} > MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']), > (k, v1, v2) -> concat(v1, v2)); > SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, > null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)} > MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]), > (k, v1, v2) -> (v1, v2)); > SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, > b -> b4, c -> c9} > MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]), > (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR)); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
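The merge semantics quoted from the Presto docs (union of the key sets, with NULL passed for a side that lacks the key) can be sketched in plain Java. This is only an illustration of the semantics, not Spark's implementation, and for brevity the merge function here takes the two values but not the key.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Illustrates the map_zip_with semantics described above: iterate the union
// of both key sets and pass null for whichever side is missing the key.
public class MapZipWithDemo {
    public static <K, V1, V2, R> Map<K, R> mapZipWith(
            Map<K, V1> m1, Map<K, V2> m2, BiFunction<V1, V2, R> merge) {
        Set<K> keys = new LinkedHashSet<>(m1.keySet());
        keys.addAll(m2.keySet());
        Map<K, R> result = new LinkedHashMap<>();
        for (K k : keys) {
            // Map.get returns null when the key is absent on that side.
            result.put(k, merge.apply(m1.get(k), m2.get(k)));
        }
        return result;
    }
}
```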
[jira] [Commented] (SPARK-24411) Adding native Java tests for `isInCollection`
[ https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579650#comment-16579650 ] Aleksei Izmalkin commented on SPARK-24411: -- I will work on this issue. > Adding native Java tests for `isInCollection` > - > > Key: SPARK-24411 > URL: https://issues.apache.org/jira/browse/SPARK-24411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Minor > Labels: starter > > In the past, some of our Java APIs have been difficult to call from Java. We > should add tests in Java directly to make sure it works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
[ https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579645#comment-16579645 ] Apache Spark commented on SPARK-25113: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/22103 > Add logging to CodeGenerator when any generated method's bytecode size goes > above HugeMethodLimit > - > > Key: SPARK-25113 > URL: https://issues.apache.org/jira/browse/SPARK-25113 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Major > > Add logging to help collect statistics on how often real world usage sees the > {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 > bytes (HugeMethodLimit) threshold. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
[ https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25113: Assignee: (was: Apache Spark) > Add logging to CodeGenerator when any generated method's bytecode size goes > above HugeMethodLimit > - > > Key: SPARK-25113 > URL: https://issues.apache.org/jira/browse/SPARK-25113 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Major > > Add logging to help collect statistics on how often real world usage sees the > {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 > bytes (HugeMethodLimit) threshold. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
[ https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25113: Assignee: Apache Spark > Add logging to CodeGenerator when any generated method's bytecode size goes > above HugeMethodLimit > - > > Key: SPARK-25113 > URL: https://issues.apache.org/jira/browse/SPARK-25113 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Major > > Add logging to help collect statistics on how often real world usage sees the > {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 > bytes (HugeMethodLimit) threshold. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
Kris Mok created SPARK-25113: Summary: Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit Key: SPARK-25113 URL: https://issues.apache.org/jira/browse/SPARK-25113 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok Add logging to help collect statistics on how often real world usage sees the {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 bytes (HugeMethodLimit) threshold. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
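The check being proposed is simple enough to sketch. The 8000-byte figure comes from HotSpot's HugeMethodLimit: with the default -XX:+DontCompileHugeMethods, the JIT skips methods whose bytecode exceeds that size. The names below are hypothetical, not Spark's CodeGenerator.

```java
// Hypothetical sketch of the logging proposed above: after codegen, compare
// each generated method's bytecode size to HotSpot's HugeMethodLimit and
// produce a warning when it is exceeded. Not Spark's actual code.
public class HugeMethodCheck {
    static final int HUGE_METHOD_LIMIT = 8000;

    // Returns the warning that should be logged, or null if within the limit.
    public static String checkByteCodeSize(String methodName, int byteCodeSize) {
        if (byteCodeSize > HUGE_METHOD_LIMIT) {
            return "Generated method " + methodName + " is " + byteCodeSize
                    + " bytes, exceeding HugeMethodLimit (" + HUGE_METHOD_LIMIT
                    + "); HotSpot will not JIT-compile it";
        }
        return null;
    }
}
```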
[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579622#comment-16579622 ] Nikita Poberezkin commented on SPARK-25102: --- I will work on this issue > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
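The change described above is essentially a one-method override. In the self-contained sketch below, a local stand-in class mimics the getName() hook of org.apache.parquet.hadoop.api.WriteSupport so the example compiles without Parquet on the classpath; the "spark" return value is an illustrative assumption, not necessarily what the eventual patch uses.

```java
// Self-contained sketch of the proposal above. WriteSupportStandIn mimics
// org.apache.parquet.hadoop.api.WriteSupport, whose getName() result Parquet
// records as "writer.model.name" in the file footer metadata.
abstract class WriteSupportStandIn<T> {
    // Parquet calls this to identify the writing object model; returning
    // null leaves the property unset.
    public String getName() {
        return null;
    }
}

public class SparkParquetWriteSupport extends WriteSupportStandIn<Object> {
    @Override
    public String getName() {
        return "spark"; // hypothetical value written as writer.model.name
    }
}
```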
[jira] [Assigned] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25051: Assignee: Apache Spark > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Assignee: Apache Spark >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:172) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579526#comment-16579526 ] Apache Spark commented on SPARK-25051: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22102 > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) 
> at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:172) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25051: Assignee: (was: Apache Spark) > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at 
org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:172) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579514#comment-16579514 ] Marco Gaido commented on SPARK-25051: - This was caused by the introduction of AnalysisBarrier. I will submit a PR for branch 2.3. On 2.4+ (current master) we no longer have this issue because AnalysisBarrier was removed. Anyway, this raises a question: shall we remove AnalysisBarrier from the 2.3 line too? In the current situation, backporting any analyzer fix to 2.3 is going to be painful. cc [~rxin] [~cloud_fan] > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579444#comment-16579444 ]

Marco Gaido commented on SPARK-25051:
-------------------------------------

cc [~jerryshao] shall we set it as a blocker for 2.3.2?
[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Gaido updated SPARK-25051:
--------------------------------
    Labels: correctness  (was: )
[jira] [Commented] (SPARK-25068) High-order function: exists(array, function) → boolean
[ https://issues.apache.org/jira/browse/SPARK-25068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579386#comment-16579386 ]

Marek Novotny commented on SPARK-25068:
---------------------------------------

That's a good point. Thanks for your answer!

> High-order function: exists(array, function) → boolean
> ------------------------------------------------------
>
>                 Key: SPARK-25068
>                 URL: https://issues.apache.org/jira/browse/SPARK-25068
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>            Priority: Major
>             Fix For: 2.4.0
>
> Tests whether an array contains at least one element for which the function returns true.
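Since the ticket marks `exists` as fixed for 2.4.0, usage on that version should look roughly like the following sketch (the session setup and alias names are illustrative, not from the ticket):

```scala
import org.apache.spark.sql.SparkSession

// Requires Spark 2.4+, where the exists() higher-order SQL function landed.
val spark = SparkSession.builder().appName("exists-example").master("local[*]").getOrCreate()

// exists(array, function) is true iff the lambda holds for at least one element.
spark.sql("SELECT exists(array(1, 3, 5), x -> x % 2 = 0) AS any_even").show()
// any_even is false: no element is even.

spark.sql("SELECT exists(array(1, 2, 3), x -> x % 2 = 0) AS any_even").show()
// any_even is true: 2 satisfies the predicate.
```

Like the other higher-order functions in that family (`transform`, `filter`, `aggregate`), the lambda is evaluated per element without exploding the array into rows.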