[jira] [Updated] (SPARK-12659) NPE when spill in CartesianProduct
[ https://issues.apache.org/jira/browse/SPARK-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12659:
-------------------------------
    Affects Version/s:     (was: 1.6.0)
                           2.0.0

> NPE when spill in CartesianProduct
> ----------------------------------
>
>     Key: SPARK-12659
>     URL: https://issues.apache.org/jira/browse/SPARK-12659
>     Project: Spark
>     Issue Type: Bug
>     Components: SQL
>     Affects Versions: 2.0.0
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>     Fix For: 2.0.0
>
> {code}
> java.lang.NullPointerException
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:54)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
> at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
> at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
> at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:231)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12542) Support intersect/except in Hive SQL
[ https://issues.apache.org/jira/browse/SPARK-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-12542:
----------------------------------
    Assignee: Davies Liu  (was: Xiao Li)

> Support intersect/except in Hive SQL
> ------------------------------------
>
>     Key: SPARK-12542
>     URL: https://issues.apache.org/jira/browse/SPARK-12542
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>
[jira] [Updated] (SPARK-12542) Support intersect/except in Hive SQL
[ https://issues.apache.org/jira/browse/SPARK-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12542:
-------------------------------
    Summary: Support intersect/except in Hive SQL  (was: Support union/intersect/except in Hive SQL)

> Support intersect/except in Hive SQL
> ------------------------------------
>
>     Key: SPARK-12542
>     URL: https://issues.apache.org/jira/browse/SPARK-12542
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Xiao Li
>
[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086056#comment-15086056 ]

Davies Liu commented on SPARK-12662:
------------------------------------

Another way to make the DataFrame deterministic is to materialize it, via cache.

> Add a local sort operator to DataFrame used by randomSplit
> ----------------------------------------------------------
>
>     Key: SPARK-12662
>     URL: https://issues.apache.org/jira/browse/SPARK-12662
>     Project: Spark
>     Issue Type: Bug
>     Components: Documentation, SQL
>     Reporter: Yin Huai
>     Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following
> code will produce overlapping rows for the two DFs returned by randomSplit.
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID : Int)
> sqlContext.createDataFrame(x.map {R(_)}).write.format("json").saveAsTable("bugsc1597")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")}} does not guarantee
> the ordering of rows within a partition. It would be good to add a local sort
> operator to make row ordering within a partition deterministic.
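The overlap described above comes from nondeterministic row order: randomSplit draws a seeded random number per row, so the same set of rows arriving in a different order ends up with different split membership. A minimal pure-Python sketch of that mechanism (not Spark code; the function and its sampling logic are illustrative assumptions):

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by a seeded per-row draw, mimicking
    how randomSplit re-samples each partition with a fixed seed."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        # the i-th draw decides the fate of the i-th row seen,
        # so row order determines split membership
        for i, b in enumerate(bounds):
            if x <= b:
                splits[i].append(row)
                break
    return splits

rows = list(range(10))
a1, b1 = random_split(rows, [0.5, 0.5], seed=1234)
# Same rows in reversed order: the draws land on different rows,
# so the resulting splits can differ between runs.
a2, b2 = random_split(list(reversed(rows)), [0.5, 0.5], seed=1234)
```

With a deterministic per-partition order (the proposed local sort, or materializing via cache as the comment suggests), the same seed always pairs the same draw with the same row, making the split reproducible.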
[jira] [Resolved] (SPARK-12681) Split IdentifiersParser.g to avoid single huge java source
[ https://issues.apache.org/jira/browse/SPARK-12681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12681.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 10624
[https://github.com/apache/spark/pull/10624]

> Split IdentifiersParser.g to avoid single huge java source
> ----------------------------------------------------------
>
>     Key: SPARK-12681
>     URL: https://issues.apache.org/jira/browse/SPARK-12681
>     Project: Spark
>     Issue Type: Sub-task
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>     Fix For: 2.0.0
>
> {code}
> [error] ~spark/sql/hive/target/scala-2.10/src_managed/main/org/apache/spark/sql/parser/SparkSqlParser_IdentifiersParser.java:11056: ERROR: too long
> [error] static final String[] DFA5_transitionS = {
> [error] ^
> {code}
[jira] [Created] (SPARK-12681) Split IdentifiersParser.g to avoid single huge java source
Davies Liu created SPARK-12681:
-------------------------------

    Summary: Split IdentifiersParser.g to avoid single huge java source
    Key: SPARK-12681
    URL: https://issues.apache.org/jira/browse/SPARK-12681
    Project: Spark
    Issue Type: Sub-task
    Reporter: Davies Liu

{code}
[error] ~spark/sql/hive/target/scala-2.10/src_managed/main/org/apache/spark/sql/parser/SparkSqlParser_IdentifiersParser.java:11056: ERROR: too long
[error] static final String[] DFA5_transitionS = {
[error] ^
{code}
[jira] [Assigned] (SPARK-12681) Split IdentifiersParser.g to avoid single huge java source
[ https://issues.apache.org/jira/browse/SPARK-12681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-12681:
----------------------------------
    Assignee: Davies Liu

> Split IdentifiersParser.g to avoid single huge java source
> ----------------------------------------------------------
>
>     Key: SPARK-12681
>     URL: https://issues.apache.org/jira/browse/SPARK-12681
>     Project: Spark
>     Issue Type: Sub-task
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>
> {code}
> [error] ~spark/sql/hive/target/scala-2.10/src_managed/main/org/apache/spark/sql/parser/SparkSqlParser_IdentifiersParser.java:11056: ERROR: too long
> [error] static final String[] DFA5_transitionS = {
> [error] ^
> {code}
[jira] [Created] (SPARK-12659) NPE when spill in CartesianProduct
Davies Liu created SPARK-12659:
-------------------------------

    Summary: NPE when spill in CartesianProduct
    Key: SPARK-12659
    URL: https://issues.apache.org/jira/browse/SPARK-12659
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.6.0
    Reporter: Davies Liu
    Assignee: Davies Liu

{code}
java.lang.NullPointerException
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:54)
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:231)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187)
{code}
[jira] [Resolved] (SPARK-12511) streaming driver with checkpointing unable to finalize leading to OOM
[ https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12511.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.6.1
                   2.0.0

Issue resolved by pull request 10514
[https://github.com/apache/spark/pull/10514]

> streaming driver with checkpointing unable to finalize leading to OOM
> ---------------------------------------------------------------------
>
>     Key: SPARK-12511
>     URL: https://issues.apache.org/jira/browse/SPARK-12511
>     Project: Spark
>     Issue Type: Bug
>     Components: PySpark, Streaming
>     Affects Versions: 1.5.2, 1.6.0
>     Environment: pyspark 1.5.2
>                  yarn 2.6.0
>                  python 2.6
>                  centos 6.5
>                  openjdk 1.8.0
>     Reporter: Antony Mayi
>     Assignee: Shixiong Zhu
>     Priority: Critical
>     Fix For: 2.0.0, 1.6.1
>
>     Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, finalizer-spark_assembly.png
>
> A Spark streaming application configured with checkpointing fills the
> driver's heap with ZipFileInputStream instances, created as spark-assembly.jar
> (and potentially other jars such as snappy-java.jar) is repeatedly
> referenced (loaded?). The Java Finalizer can't finalize these
> ZipFileInputStream instances, and they eventually consume the whole heap,
> leading the driver to an OOM crash.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Leave it running and monitor the driver java process heap
> ** with a heap dump you will primarily see growing instances of byte array data (here the cumulated zip payload of the jar refs):
> {noformat}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:         32653       32735296  [B
>    2:         48000        5135816  [C
>    3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
>    4:         11362        1261816  java.lang.Class
>    5:         47054        1129296  java.lang.String
>    6:         25460        1018400  java.lang.ref.Finalizer
>    7:          9802         789400  [Ljava.lang.Object;
> {noformat}
> ** with visualvm you can see:
> *** increasing number of objects pending for finalization
> !finalizer-pending.png!
> *** increasing number of ZipFileInputStream instances related to the spark-assembly.jar, referenced by Finalizer
> !finalizer-spark_assembly.png!
> * Depending on the heap size and running time this will lead to a driver OOM crash
> h2. Comments
> * The [^bug.py] is a lightweight proof of the problem. In production I am experiencing this as a quite rapid effect - in a few hours it eats gigs of heap and kills the app.
> * If the same [^bug.py] is run without checkpointing there is no issue whatsoever.
> * Not sure if it is just pyspark related.
> * In [^bug.py] I am using the socketTextStream input but it seems to be independent of the input type (in production having the same problem with Kafka direct stream, have seen it even with textFileStream).
> * It is happening even if the input stream doesn't produce any data.
[jira] [Resolved] (SPARK-12617) socket descriptor leak killing streaming app
[ https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12617.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.6.1
                   1.5.3
                   2.0.0

Issue resolved by pull request 10579
[https://github.com/apache/spark/pull/10579]

> socket descriptor leak killing streaming app
> --------------------------------------------
>
>     Key: SPARK-12617
>     URL: https://issues.apache.org/jira/browse/SPARK-12617
>     Project: Spark
>     Issue Type: Bug
>     Components: PySpark, Streaming
>     Affects Versions: 1.5.2
>     Environment: pyspark (python 2.6)
>     Reporter: Antony Mayi
>     Assignee: Shixiong Zhu
>     Priority: Critical
>     Fix For: 2.0.0, 1.5.3, 1.6.1
>
>     Attachments: bug.py
>
> There is a socket descriptor leak in a pyspark streaming app configured
> with a batch interval of more than 30 seconds. This is due to the default
> timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection
> after 30 seconds of inactivity and creates a new one next time. That
> connection doesn't get closed on the python CallbackServer side, and they
> keep piling up until they eventually block new connections.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback connections, of which 0 will ever be closed
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm the reality by using lsof on the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770 das 94u IPv4 138824906 0t0 TCP localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770 das 95u IPv4 138867747 0t0 TCP localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770 das 96u IPv4 138831829 0t0 TCP localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770 das 97u IPv4 138890524 0t0 TCP localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770 das 98u IPv4 138860190 0t0 TCP localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770 das 99u IPv4 138860439 0t0 TCP localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
> * If you leave it running for long enough the CallbackServer will eventually become unable to accept new connections from the gateway and the app will crash:
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 145171140 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at java.net.Socket.<init>(Socket.java:434)
> at java.net.Socket.<init>(Socket.java:244)
> at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}
[jira] [Resolved] (SPARK-12636) Expose API on UnsafeRowRecordReader to just run on files
[ https://issues.apache.org/jira/browse/SPARK-12636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12636.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 10581
[https://github.com/apache/spark/pull/10581]

> Expose API on UnsafeRowRecordReader to just run on files
> --------------------------------------------------------
>
>     Key: SPARK-12636
>     URL: https://issues.apache.org/jira/browse/SPARK-12636
>     Project: Spark
>     Issue Type: Improvement
>     Components: SQL
>     Reporter: Nong Li
>     Assignee: Nong Li
>     Fix For: 2.0.0
>
> This is beneficial from a code-testability point of view, to be able to
> exercise individual components. It also makes it easy to benchmark. The
> reader would be able to read data without needing to create all the
> associated hadoop input split, etc. components.
[jira] [Created] (SPARK-12661) Drop Python 2.6 support in PySpark
Davies Liu created SPARK-12661:
-------------------------------

    Summary: Drop Python 2.6 support in PySpark
    Key: SPARK-12661
    URL: https://issues.apache.org/jira/browse/SPARK-12661
    Project: Spark
    Issue Type: Task
    Reporter: Davies Liu

1. stop testing with 2.6
2. remove the code for python 2.6

see discussion: https://www.mail-archive.com/user@spark.apache.org/msg43423.html
[jira] [Resolved] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12470.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.6.1
                   2.0.0

Issue resolved by pull request 10421
[https://github.com/apache/spark/pull/10421]

> Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
> ---------------------------------------------------------------------------------------------------
>
>     Key: SPARK-12470
>     URL: https://issues.apache.org/jira/browse/SPARK-12470
>     Project: Spark
>     Issue Type: Bug
>     Components: SQL
>     Affects Versions: 1.5.2
>     Reporter: Pete Robbins
>     Priority: Minor
>     Fix For: 2.0.0, 1.6.1
>
> While looking into https://issues.apache.org/jira/browse/SPARK-12319 I
> noticed that the row size is incorrectly calculated.
> The "sizeReduction" value is calculated in words:
> {code}
> // The number of words we can reduce when we concat two rows together.
> // The only reduction comes from merging the bitset portion of the two rows, saving 1 word.
> val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords
> {code}
> but then it is subtracted from the size of the row in bytes:
> {code}
> |out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - $sizeReduction);
> {code}
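The bug above is a unit mismatch: the reduction is counted in 8-byte words but subtracted from a byte count, so the joined row is reported 7 bytes too large per saved word. A pure-Python sketch of the corrected arithmetic (function names are illustrative, not Spark's actual code):

```python
WORD_SIZE = 8  # an UnsafeRow word is 8 bytes

def bitset_words(num_fields):
    """One null bit per field, rounded up to whole 64-bit words."""
    return (num_fields + 63) // 64

def joined_row_size(size1_bytes, size2_bytes, fields1, fields2):
    # Merging the two null bitsets can save whole words...
    size_reduction_words = (bitset_words(fields1) + bitset_words(fields2)
                            - bitset_words(fields1 + fields2))
    # ...but row sizes are tracked in bytes, so the reduction must be
    # converted to bytes before subtracting (the original code
    # subtracted the word count directly).
    return size1_bytes + size2_bytes - size_reduction_words * WORD_SIZE
```

For example, joining a 3-field row (8-byte bitset + 24 bytes of fields = 32 bytes) with a 2-field row (8 + 16 = 24 bytes) saves one bitset word, giving 32 + 24 - 8 = 48 bytes, which matches the 5-field layout (8-byte bitset + 40 bytes of fields).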
[jira] [Resolved] (SPARK-12541) Support rollup/cube in SQL query
[ https://issues.apache.org/jira/browse/SPARK-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12541.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 10522
[https://github.com/apache/spark/pull/10522]

> Support rollup/cube in SQL query
> --------------------------------
>
>     Key: SPARK-12541
>     URL: https://issues.apache.org/jira/browse/SPARK-12541
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Apache Spark
>     Fix For: 2.0.0
>
> We have a DataFrame API for rollup/cube, but do not support them in the SQL
> parser (both SQLContext and HiveContext).
> PS: the Hive parser only supports `group by a, b, c WITH cube/rollup`
[jira] [Commented] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)
[ https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081932#comment-15081932 ]

Davies Liu commented on SPARK-12624:
------------------------------------

[~rxin] The RDD or list could just be tuples; how do we know which column is missing, the last one(s)?

> When schema is specified, we should treat undeclared fields as null (in Python)
> -------------------------------------------------------------------------------
>
>     Key: SPARK-12624
>     URL: https://issues.apache.org/jira/browse/SPARK-12624
>     Project: Spark
>     Issue Type: Bug
>     Components: PySpark, SQL
>     Reporter: Reynold Xin
>     Priority: Critical
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b
> is null.
[jira] [Resolved] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit
[ https://issues.apache.org/jira/browse/SPARK-12289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12289.
--------------------------------
    Resolution: Fixed
    Assignee: Davies Liu  (was: Apache Spark)
    Fix Version/s: 2.0.0

> Support UnsafeRow in TakeOrderedAndProject/Limit
> ------------------------------------------------
>
>     Key: SPARK-12289
>     URL: https://issues.apache.org/jira/browse/SPARK-12289
>     Project: Spark
>     Issue Type: Improvement
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>     Fix For: 2.0.0
>
[jira] [Resolved] (SPARK-12292) Support UnsafeRow in Generate
[ https://issues.apache.org/jira/browse/SPARK-12292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12292.
--------------------------------
    Resolution: Fixed
    Assignee: Davies Liu
    Fix Version/s: 2.0.0

> Support UnsafeRow in Generate
> -----------------------------
>
>     Key: SPARK-12292
>     URL: https://issues.apache.org/jira/browse/SPARK-12292
>     Project: Spark
>     Issue Type: Improvement
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>     Fix For: 2.0.0
>
[jira] [Resolved] (SPARK-12293) Support UnsafeRow in LocalTableScan
[ https://issues.apache.org/jira/browse/SPARK-12293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12293.
--------------------------------
    Resolution: Fixed
    Assignee: Davies Liu  (was: Liang-Chi Hsieh)
    Fix Version/s: 2.0.0

> Support UnsafeRow in LocalTableScan
> -----------------------------------
>
>     Key: SPARK-12293
>     URL: https://issues.apache.org/jira/browse/SPARK-12293
>     Project: Spark
>     Issue Type: Improvement
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>     Fix For: 2.0.0
>
[jira] [Resolved] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()
[ https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12585.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 10528
[https://github.com/apache/spark/pull/10528]

> The numFields of UnsafeRow should not changed by pointTo()
> ----------------------------------------------------------
>
>     Key: SPARK-12585
>     URL: https://issues.apache.org/jira/browse/SPARK-12585
>     Project: Spark
>     Issue Type: Improvement
>     Reporter: Davies Liu
>     Assignee: Apache Spark
>     Fix For: 2.0.0
>
> Right now, numFields is passed in by pointTo(), then bitSetWidthInBytes
> is calculated, making pointTo() a little bit heavy.
> It should be part of the constructor of UnsafeRow.
[jira] [Created] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()
Davies Liu created SPARK-12585:
-------------------------------

    Summary: The numFields of UnsafeRow should not changed by pointTo()
    Key: SPARK-12585
    URL: https://issues.apache.org/jira/browse/SPARK-12585
    Project: Spark
    Issue Type: Improvement
    Reporter: Davies Liu

Right now, numFields is passed in by pointTo(), then bitSetWidthInBytes
is calculated, making pointTo() a little bit heavy.
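The change proposed above can be sketched in a few lines of pure Python (illustrative, not Spark's Java code): numFields becomes immutable constructor state, so the derived bitset width is computed once and pointTo() only rebinds the backing buffer.

```python
class UnsafeRowSketch:
    """Illustrative model of the proposal: numFields is fixed at
    construction, so bitset width is computed once instead of on
    every pointTo() call."""

    def __init__(self, num_fields):
        self.num_fields = num_fields
        # one null bit per field, rounded up to whole 8-byte words
        self.bitset_width_bytes = ((num_fields + 63) // 64) * 8
        self.buf = None
        self.size_in_bytes = 0

    def point_to(self, buf, size_in_bytes):
        # Only rebinds the buffer; no longer recomputes the bitset
        # width or changes num_fields, keeping this hot path cheap.
        self.buf = buf
        self.size_in_bytes = size_in_bytes

row = UnsafeRowSketch(3)
row.point_to(bytearray(32), 32)   # reuse the same row over many buffers
row.point_to(bytearray(32), 32)
```

This matters because pointTo() is called once per row during scans and joins, while the field count is fixed per schema.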
[jira] [Updated] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()
[ https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12585:
-------------------------------
    Description:
        Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy.
        It should be part of constructor of UnsafeRow.
    (was: Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy.)

> The numFields of UnsafeRow should not changed by pointTo()
> ----------------------------------------------------------
>
>     Key: SPARK-12585
>     URL: https://issues.apache.org/jira/browse/SPARK-12585
>     Project: Spark
>     Issue Type: Improvement
>     Reporter: Davies Liu
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.
[jira] [Assigned] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()
[ https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-12585:
----------------------------------
    Assignee: Davies Liu

> The numFields of UnsafeRow should not changed by pointTo()
> ----------------------------------------------------------
>
>     Key: SPARK-12585
>     URL: https://issues.apache.org/jira/browse/SPARK-12585
>     Project: Spark
>     Issue Type: Improvement
>     Reporter: Davies Liu
>     Assignee: Davies Liu
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.
[jira] [Updated] (SPARK-12300) Fix schema inference on local collections
[ https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12300:
-------------------------------
    Fix Version/s:     (was: 1.6.0)
                       1.6.1

> Fix schema inference on local collections
> -----------------------------------------
>
>     Key: SPARK-12300
>     URL: https://issues.apache.org/jira/browse/SPARK-12300
>     Project: Spark
>     Issue Type: Bug
>     Components: PySpark, SQL
>     Reporter: holdenk
>     Priority: Minor
>     Fix For: 1.6.1, 2.0.0
>
> Current schema inference for local python collections halts as soon as there
> are no NullTypes. This is different from when we specify a sampling ratio of
> 1.0 on a distributed collection. This can result in incomplete schema
> information.
> Repro:
> {code}
> input = [{"a": 1}, {"b": "coffee"}]
> df = sqlContext.createDataFrame(input)
> print df.schema
> {code}
> Discovered while looking at SPARK-2870
[jira] [Resolved] (SPARK-12300) Fix schema inference on local collections
[ https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12300.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.6.0
                   2.0.0

Issue resolved by pull request 10275
[https://github.com/apache/spark/pull/10275]

> Fix schema inference on local collections
> -----------------------------------------
>
>     Key: SPARK-12300
>     URL: https://issues.apache.org/jira/browse/SPARK-12300
>     Project: Spark
>     Issue Type: Bug
>     Components: PySpark, SQL
>     Reporter: holdenk
>     Priority: Minor
>     Fix For: 2.0.0, 1.6.0
>
> Current schema inference for local python collections halts as soon as there
> are no NullTypes. This is different from when we specify a sampling ratio of
> 1.0 on a distributed collection. This can result in incomplete schema
> information.
> Repro:
> {code}
> input = [{"a": 1}, {"b": "coffee"}]
> df = sqlContext.createDataFrame(input)
> print df.schema
> {code}
> Discovered while looking at SPARK-2870
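The gist of the bug in the repro above: inferring only from the first row(s) misses fields that first appear later, like {{"b"}}. A pure-Python sketch of inferring from all rows instead (hypothetical helper names, not PySpark's actual internals):

```python
def infer_row(row):
    """Map each key present in one row to its Python type;
    a plain dict stands in for a schema here."""
    return {k: type(v) for k, v in row.items()}

def infer_schema(rows):
    # Merge the fields seen across *all* rows, so a key that only
    # appears in a later row still makes it into the schema -- the
    # behavior you get with samplingRatio=1.0 on an RDD.
    schema = {}
    for row in rows:
        for key, typ in infer_row(row).items():
            schema.setdefault(key, typ)
    return schema

data = [{"a": 1}, {"b": "coffee"}]
schema = infer_schema(data)
# schema now contains both "a" (int) and "b" (str); stopping after
# the first row would have dropped "b" entirely
```

Rows that lack a merged field are then treated as having a null for it, which is why halting early produces incomplete schema information rather than an error.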
[jira] [Commented] (SPARK-12542) Support union/intersect/except in Hive SQL
[ https://issues.apache.org/jira/browse/SPARK-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074368#comment-15074368 ]

Davies Liu commented on SPARK-12542:
------------------------------------

[~viirya] [~xiaol] That's true, updated the title. Once SPARK-12362 is resolved, you can start to work on it.

> Support union/intersect/except in Hive SQL
> ------------------------------------------
>
>     Key: SPARK-12542
>     URL: https://issues.apache.org/jira/browse/SPARK-12542
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Xiao Li
>
[jira] [Updated] (SPARK-12542) Support union/intersect/except in Hive SQL
[ https://issues.apache.org/jira/browse/SPARK-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12542:
-------------------------------
    Assignee: Xiao Li

> Support union/intersect/except in Hive SQL
> ------------------------------------------
>
>     Key: SPARK-12542
>     URL: https://issues.apache.org/jira/browse/SPARK-12542
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>     Assignee: Xiao Li
>
[jira] [Updated] (SPARK-12542) Support union/intersect/except in Hive SQL
[ https://issues.apache.org/jira/browse/SPARK-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12542:
-------------------------------
    Summary: Support union/intersect/except in Hive SQL  (was: Support intersect/except in SQL)

> Support union/intersect/except in Hive SQL
> ------------------------------------------
>
>     Key: SPARK-12542
>     URL: https://issues.apache.org/jira/browse/SPARK-12542
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>
> Also `UNION ALL` in the Hive SQL parser
[jira] [Updated] (SPARK-12542) Support union/intersect/except in Hive SQL
[ https://issues.apache.org/jira/browse/SPARK-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12542:
-------------------------------
    Description:     (was: Also `UNION ALL`in Hive SQL parser)

> Support union/intersect/except in Hive SQL
> ------------------------------------------
>
>     Key: SPARK-12542
>     URL: https://issues.apache.org/jira/browse/SPARK-12542
>     Project: Spark
>     Issue Type: New Feature
>     Components: SQL
>     Reporter: Davies Liu
>
[jira] [Commented] (SPARK-12385) Push projection into Join
[ https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074383#comment-15074383 ]

Davies Liu commented on SPARK-12385:
------------------------------------

[~smilegator] Thanks for working on this. Could you take Aggregation as an example? Join could have an optional resultExpression (which need not equal left.output ++ right.output); the expressions of the Projection could then be pushed into it.

> Push projection into Join
> -------------------------
>
>     Key: SPARK-12385
>     URL: https://issues.apache.org/jira/browse/SPARK-12385
>     Project: Spark
>     Issue Type: Improvement
>     Components: SQL
>     Reporter: Davies Liu
>
> We usually have a Join followed by a projection to prune some columns, but
> Join already has a result projection to produce UnsafeRow; we should combine
> them together.
[jira] [Commented] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074369#comment-15074369 ] Davies Liu commented on SPARK-12544: [~hvanhovell] Right now, the SQL parser can't parse a clause like `rank() over (partition by xxx)`, but the Hive parser can. Once we replace the Spark SQL parser with the Hive parser, we can close this JIRA. > Support window functions in SQLContext > -- > > Key: SPARK-12544 > URL: https://issues.apache.org/jira/browse/SPARK-12544 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
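The clause in question looks like the following sketch (table and column names here are hypothetical, not from the issue):

```sql
-- Accepted by HiveContext, but rejected by the 1.6 SQLContext parser:
SELECT name, dept,
       rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
FROM employees;
```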
[jira] [Assigned] (SPARK-12541) Support rollup/cube in SQL query
[ https://issues.apache.org/jira/browse/SPARK-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12541: -- Assignee: Davies Liu > Support rollup/cube in SQL query > > > Key: SPARK-12541 > URL: https://issues.apache.org/jira/browse/SPARK-12541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > We have DataFrame API for rollup/cube, but do not support them in SQL parser > (both SQLContext and HiveContext). > PS: Hive parser only support `group by a, b,c WITH cube/rollup` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
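The gap described above can be sketched as follows (hypothetical table `t` and columns `a`, `b`, `c`); per the PS, the Hive parser accepts only the trailing `WITH` form:

```sql
-- Hive-style syntax (the only form the Hive parser supports):
SELECT a, b, c, COUNT(*) FROM t GROUP BY a, b, c WITH ROLLUP;
SELECT a, b, c, COUNT(*) FROM t GROUP BY a, b, c WITH CUBE;
```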
[jira] [Updated] (SPARK-12541) Support rollup/cube in SQL query
[ https://issues.apache.org/jira/browse/SPARK-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12541: --- Description: We have DataFrame API for rollup/cube, but do not support them in SQL parser (both SQLContext and HiveContext). (was: We have DataFrame API for rollup/cube/pivot, but do not support them in SQL parser (both SQLContext and HiveContext).) > Support rollup/cube in SQL query > > > Key: SPARK-12541 > URL: https://issues.apache.org/jira/browse/SPARK-12541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > We have DataFrame API for rollup/cube, but do not support them in SQL parser > (both SQLContext and HiveContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12541) Support rollup/cube in SQL query
[ https://issues.apache.org/jira/browse/SPARK-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12541: --- Description: We have DataFrame API for rollup/cube, but do not support them in SQL parser (both SQLContext and HiveContext). PS: Hive parser only support `group by a, b,c WITH cube/rollup` was:We have DataFrame API for rollup/cube, but do not support them in SQL parser (both SQLContext and HiveContext). > Support rollup/cube in SQL query > > > Key: SPARK-12541 > URL: https://issues.apache.org/jira/browse/SPARK-12541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > We have DataFrame API for rollup/cube, but do not support them in SQL parser > (both SQLContext and HiveContext). > PS: Hive parser only support `group by a, b,c WITH cube/rollup` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12541) Support rollup/cube in SQL query
[ https://issues.apache.org/jira/browse/SPARK-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12541: --- Summary: Support rollup/cube in SQL query (was: Support rollup/cube/povit in SQL query) > Support rollup/cube in SQL query > > > Key: SPARK-12541 > URL: https://issues.apache.org/jira/browse/SPARK-12541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > We have DataFrame API for rollup/cube/pivot, but do not support them in SQL > parser (both SQLContext and HiveContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074726#comment-15074726 ] Davies Liu commented on SPARK-11437: [~maver1ck] Maybe we should provide an API to verify the schema (called manually by the user)? > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. > Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
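Lazy per-row verification, as suggested in the issue, could look like the following plain-Python sketch; `verify_schema` and its `(name, type)` schema pairs are hypothetical illustrations, not an actual PySpark API:

```python
# Sketch of lazy schema verification: each row is checked as it is
# consumed, instead of eagerly materializing rows with .take(10).
def verify_schema(rows, schema):
    """schema: list of (field_name, python_type) pairs (hypothetical)."""
    for i, row in enumerate(rows):
        if len(row) != len(schema):
            raise ValueError("row %d has %d fields, expected %d"
                             % (i, len(row), len(schema)))
        for value, (name, typ) in zip(row, schema):
            if value is not None and not isinstance(value, typ):
                raise TypeError("row %d, field %r: expected %s, got %s"
                                % (i, name, typ.__name__,
                                   type(value).__name__))
        yield row

schema = [("name", str), ("age", int)]
checked = list(verify_schema([("alice", 1), ("bob", 2)], schema))
# A mismatched row only fails when it is actually consumed:
# list(verify_schema([("x", "y")], schema)) raises TypeError
```

Because verification happens inside the generator, no job is triggered until the rows are actually needed, which is the lazy behavior the reporter asks for.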
[jira] [Closed] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash
[ https://issues.apache.org/jira/browse/SPARK-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-12291. -- Resolution: Not A Problem > Support UnsafeRow in BroadcastLeftSemiJoinHash > -- > > Key: SPARK-12291 > URL: https://issues.apache.org/jira/browse/SPARK-12291 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6
[ https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-11855. -- Resolution: Won't Fix > Catalyst breaks backwards compatibility in branch-1.6 > - > > Key: SPARK-11855 > URL: https://issues.apache.org/jira/browse/SPARK-11855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Santiago M. Mola >Priority: Critical > > There's a number of APIs broken in catalyst 1.6.0. I'm trying to compile most > cases: > *UnresolvedRelation*'s constructor has been changed from taking a Seq to a > TableIdentifier. A deprecated constructor taking Seq would be needed to be > backwards compatible. > {code} > case class UnresolvedRelation( > -tableIdentifier: Seq[String], > +tableIdentifier: TableIdentifier, > alias: Option[String] = None) extends LeafNode { > {code} > It is similar with *UnresolvedStar*: > {code} > -case class UnresolvedStar(table: Option[String]) extends Star with > Unevaluable { > +case class UnresolvedStar(target: Option[Seq[String]]) extends Star with > Unevaluable { > {code} > *Catalog* did get a lot of signatures changed too (because of > TableIdentifier). Providing the older methods as deprecated also seems viable > here. > Spark 1.5 already broke backwards compatibility of part of catalyst API with > respect to 1.4. I understand there are good reasons for some cases, but we > should try to minimize backwards compatibility breakages for 1.x. Specially > now that 2.x is on the horizon and there will be a near opportunity to remove > deprecated stuff. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12540) Support all TPCDS queries
[ https://issues.apache.org/jira/browse/SPARK-12540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12540: -- Assignee: Davies Liu > Support all TPCDS queries > - > > Key: SPARK-12540 > URL: https://issues.apache.org/jira/browse/SPARK-12540 > Project: Spark > Issue Type: Epic > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Spark SQL 1.6 can run 55 out of 99 TPCDS queries, the goal is to support all > of them -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash
[ https://issues.apache.org/jira/browse/SPARK-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074729#comment-15074729 ] Davies Liu commented on SPARK-12291: Thanks, I will close this one. > Support UnsafeRow in BroadcastLeftSemiJoinHash > -- > > Key: SPARK-12291 > URL: https://issues.apache.org/jira/browse/SPARK-12291 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12541) Support rollup/cube/povit in SQL query
Davies Liu created SPARK-12541: -- Summary: Support rollup/cube/povit in SQL query Key: SPARK-12541 URL: https://issues.apache.org/jira/browse/SPARK-12541 Project: Spark Issue Type: New Feature Reporter: Davies Liu We have DataFrame API for rollup/cube/pivot, but do not support them in SQL parser (both SQLContext and HiveContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12544) Support window functions in SQLContext
Davies Liu created SPARK-12544: -- Summary: Support window functions in SQLContext Key: SPARK-12544 URL: https://issues.apache.org/jira/browse/SPARK-12544 Project: Spark Issue Type: New Feature Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12545) Support exists condition
Davies Liu created SPARK-12545: -- Summary: Support exists condition Key: SPARK-12545 URL: https://issues.apache.org/jira/browse/SPARK-12545 Project: Spark Issue Type: New Feature Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
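The requested condition can be sketched as a correlated `EXISTS` predicate (hypothetical tables, not from the issue):

```sql
-- Correlated EXISTS subquery in the WHERE clause:
SELECT o.order_id
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_items i WHERE i.order_id = o.order_id);
```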
[jira] [Created] (SPARK-12540) Support all TPCDS queries
Davies Liu created SPARK-12540: -- Summary: Support all TPCDS queries Key: SPARK-12540 URL: https://issues.apache.org/jira/browse/SPARK-12540 Project: Spark Issue Type: Epic Components: SQL Reporter: Davies Liu Spark SQL 1.6 can run 55 out of 99 TPCDS queries; the goal is to support all of them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12543) Support subquery in select/where/having
Davies Liu created SPARK-12543: -- Summary: Support subquery in select/where/having Key: SPARK-12543 URL: https://issues.apache.org/jira/browse/SPARK-12543 Project: Spark Issue Type: New Feature Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
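The three positions named in the summary can be sketched in one query (hypothetical tables and columns):

```sql
SELECT o.customer_id,
       (SELECT MAX(price) FROM items) AS max_price          -- in SELECT
FROM orders o
WHERE o.item_id IN (SELECT id FROM items)                   -- in WHERE
GROUP BY o.customer_id
HAVING COUNT(*) > (SELECT MIN(threshold) FROM limits);      -- in HAVING
```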
[jira] [Created] (SPARK-12542) Support intersect/except in SQL
Davies Liu created SPARK-12542: -- Summary: Support intersect/except in SQL Key: SPARK-12542 URL: https://issues.apache.org/jira/browse/SPARK-12542 Project: Spark Issue Type: New Feature Reporter: Davies Liu Also support `UNION ALL` in the Hive SQL parser -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
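The set operations being requested amount to the following forms (hypothetical tables `t1`, `t2`):

```sql
SELECT id FROM t1 INTERSECT SELECT id FROM t2;
SELECT id FROM t1 EXCEPT    SELECT id FROM t2;
SELECT id FROM t1 UNION ALL SELECT id FROM t2;
```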
[jira] [Updated] (SPARK-12511) streaming driver with checkpointing unable to finalize leading to OOM
[ https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12511: --- Assignee: Shixiong Zhu > streaming driver with checkpointing unable to finalize leading to OOM > - > > Key: SPARK-12511 > URL: https://issues.apache.org/jira/browse/SPARK-12511 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.5.2 > Environment: pyspark 1.5.2 > yarn 2.6.0 > python 2.6 > centos 6.5 > openjdk 1.8.0 >Reporter: Antony Mayi >Assignee: Shixiong Zhu >Priority: Critical > Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, > finalizer-spark_assembly.png > > > Spark streaming application when configured with checkpointing is filling > driver's heap with multiple ZipFileInputStream instances as results of > spark-assembly.jar (potentially some others like for example snappy-java.jar) > getting repetitively referenced (loaded?). Java Finalizer can't finalize > these ZipFileInputStream instances and it eventually takes all heap leading > the driver to OOM crash. > h2. Steps to reproduce: > * Submit attached [^bug.py] to spark > * Leave it running and monitor the driver java process heap > ** with heap dump you will primarily see growing instances of byte array data > (here cumulated zip payload of the jar refs): > {noformat} > num #instances #bytes class name > -- >1: 32653 32735296 [B >2: 480005135816 [C >3:411344144 [Lscala.concurrent.forkjoin.ForkJoinTask; >4: 113621261816 java.lang.Class >5: 470541129296 java.lang.String >6: 254601018400 java.lang.ref.Finalizer >7: 9802 789400 [Ljava.lang.Object; > {noformat} > ** with visualvm you can see: > *** increasing number of objects pending for finalization > !finalizer-pending.png! > *** increasing number of ZipFileInputStreams instances related to the > spark-assembly.jar referenced by Finalizer > !finalizer-spark_assembly.png! 
> * Depending on the heap size and running time this will lead to driver OOM > crash > h2. Comments > * The [^bug.py] is lightweight proof of the problem. In production I am > experiencing this as quite rapid effect - in few hours it eats gigs of heap > and kills the app. > * If the same [^bug.py] is run without checkpointing there is no issue > whatsoever. > * Not sure if it is just pyspark related. > * In [^bug.py] I am using the socketTextStream input but seems to be > independent of the input type (in production having same problem with Kafka > direct stream, have seen it even with textFileStream). > * It is happening even if the input stream doesn't produce any data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12520) Python API dataframe join returns wrong results on outer join
[ https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12520. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10477 [https://github.com/apache/spark/pull/10477] > Python API dataframe join returns wrong results on outer join > - > > Key: SPARK-12520 > URL: https://issues.apache.org/jira/browse/SPARK-12520 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Aravind B > Fix For: 2.0.0 > > > Consider the following dataframes: > """ > left_table: > +++-+--+ > |head_id_left|tail_id_left|weight|joining_column| > +++-+--+ > | 1| 2|1| 1~2| > +++-+--+ > right_table: > +-+-+--+ > |head_id_right|tail_id_right|joining_column| > +-+-+--+ > +-+-+--+ > """ > The following code returns an empty dataframe: > """ > joined_table = left_table.join(right_table, "joining_column", "outer") > """ > joined_table has zero rows. > However: > """ > joined_table = left_table.join(right_table, left_table.joining_column == > right_table.joining_column, "outer") > """ > returns the correct answer with one row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12520) Python API dataframe join returns wrong results on outer join
[ https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12520: --- Assignee: Xiao Li > Python API dataframe join returns wrong results on outer join > - > > Key: SPARK-12520 > URL: https://issues.apache.org/jira/browse/SPARK-12520 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Aravind B >Assignee: Xiao Li > Fix For: 1.6.0, 2.0.0 > > > Consider the following dataframes: > """ > left_table: > +++-+--+ > |head_id_left|tail_id_left|weight|joining_column| > +++-+--+ > | 1| 2|1| 1~2| > +++-+--+ > right_table: > +-+-+--+ > |head_id_right|tail_id_right|joining_column| > +-+-+--+ > +-+-+--+ > """ > The following code returns an empty dataframe: > """ > joined_table = left_table.join(right_table, "joining_column", "outer") > """ > joined_table has zero rows. > However: > """ > joined_table = left_table.join(right_table, left_table.joining_column == > right_table.joining_column, "outer") > """ > returns the correct answer with one row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12520) Python API dataframe join returns wrong results on outer join
[ https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12520: --- Fix Version/s: 1.6.0 > Python API dataframe join returns wrong results on outer join > - > > Key: SPARK-12520 > URL: https://issues.apache.org/jira/browse/SPARK-12520 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Aravind B > Fix For: 1.6.0, 2.0.0 > > > Consider the following dataframes: > """ > left_table: > +++-+--+ > |head_id_left|tail_id_left|weight|joining_column| > +++-+--+ > | 1| 2|1| 1~2| > +++-+--+ > right_table: > +-+-+--+ > |head_id_right|tail_id_right|joining_column| > +-+-+--+ > +-+-+--+ > """ > The following code returns an empty dataframe: > """ > joined_table = left_table.join(right_table, "joining_column", "outer") > """ > joined_table has zero rows. > However: > """ > joined_table = left_table.join(right_table, left_table.joining_column == > right_table.joining_column, "outer") > """ > returns the correct answer with one row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12520) Python API dataframe join returns wrong results on outer join
[ https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12520: --- Fix Version/s: 1.5.3 > Python API dataframe join returns wrong results on outer join > - > > Key: SPARK-12520 > URL: https://issues.apache.org/jira/browse/SPARK-12520 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Aravind B >Assignee: Xiao Li > Fix For: 1.5.3, 1.6.0, 2.0.0 > > > Consider the following dataframes: > """ > left_table: > +++-+--+ > |head_id_left|tail_id_left|weight|joining_column| > +++-+--+ > | 1| 2|1| 1~2| > +++-+--+ > right_table: > +-+-+--+ > |head_id_right|tail_id_right|joining_column| > +-+-+--+ > +-+-+--+ > """ > The following code returns an empty dataframe: > """ > joined_table = left_table.join(right_table, "joining_column", "outer") > """ > joined_table has zero rows. > However: > """ > joined_table = left_table.join(right_table, left_table.joining_column == > right_table.joining_column, "outer") > """ > returns the correct answer with one row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
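The expected full-outer-join semantics from the report can be sketched in plain Python, using lists of dicts as a stand-in for DataFrames (this is an illustration of the semantics, not Spark code): with the one-row left table and the empty right table, a correct outer join must keep the unmatched left row.

```python
# Minimal sketch of full-outer-join semantics on the report's data.
def full_outer_join(left, right, key):
    joined = []
    matched_right = set()
    for l in left:
        hits = [r for r in right if r[key] == l[key]]
        if hits:
            for r in hits:
                matched_right.add(id(r))
                joined.append({**l, **r})
        else:
            # Unmatched left row survives (in this dict sketch the
            # missing right-side columns are simply absent).
            joined.append(l)
    for r in right:
        if id(r) not in matched_right:
            joined.append(r)
    return joined

left_table = [{"head_id_left": 1, "tail_id_left": 2, "weight": 1,
               "joining_column": "1~2"}]
right_table = []  # empty, as in the report

joined = full_outer_join(left_table, right_table, "joining_column")
# A correct outer join keeps the single left row; the buggy code path
# reported in SPARK-12520 returned zero rows for the string-column form.
```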
[jira] [Created] (SPARK-12472) OOM when sort a table and save as parquet
Davies Liu created SPARK-12472: -- Summary: OOM when sort a table and save as parquet Key: SPARK-12472 URL: https://issues.apache.org/jira/browse/SPARK-12472 Project: Spark Issue Type: Bug Reporter: Davies Liu {code} t = sqlContext.table('store_sales') t.unionAll(t).coalesce(2).sortWithinPartitions(t[0]).write.partitionBy('ss_sold_date_sk').parquet("/tmp/ttt") {code} {code} 15/12/21 14:35:52 WARN TaskSetManager: Lost task 1.0 in stage 25.0 (TID 96, 192.168.0.143): java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:86) at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32) at org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951) at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699) at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:226) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:327) at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:342) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12388) Change default compressor to LZ4
[ https://issues.apache.org/jira/browse/SPARK-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12388. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10342 [https://github.com/apache/spark/pull/10342] > Change default compressor to LZ4 > > > Key: SPARK-12388 > URL: https://issues.apache.org/jira/browse/SPARK-12388 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > Labels: releasenotes > Fix For: 2.0.0 > > > According the benchmark [1], LZ4-java could be 80% (or 30%) faster than > Snappy. > After changing the compressor to LZ4, I saw 20% improvement on end-to-end > time for a TPCDS query (Q4). > [1] https://github.com/ning/jvm-compressor-benchmark/wiki -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12472) OOM when sort a table and save as parquet
[ https://issues.apache.org/jira/browse/SPARK-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067413#comment-15067413 ] Davies Liu commented on SPARK-12472: This can be worked around by decreasing the memory used by both Spark and Parquet via spark.memory.fraction (for example, 0.4) and parquet.memory.pool.ratio (for example, 0.3, in core-site.xml) > OOM when sort a table and save as parquet > - > > Key: SPARK-12472 > URL: https://issues.apache.org/jira/browse/SPARK-12472 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu > > {code} > t = sqlContext.table('store_sales') > t.unionAll(t).coalesce(2).sortWithinPartitions(t[0]).write.partitionBy('ss_sold_date_sk').parquet("/tmp/ttt") > {code} > {code} > 15/12/21 14:35:52 WARN TaskSetManager: Lost task 1.0 in stage 25.0 (TID 96, > 192.168.0.143): java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:86) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32) > at > org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951) > at > org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699) > at > org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) > at > org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) > at > org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:226) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187) > at > 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:327) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:342) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
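The workaround in the comment above amounts to the following configuration sketch (the values are the examples given in the comment; suitable settings depend on the workload):

```
# spark-defaults.conf (or --conf on spark-submit):
spark.memory.fraction  0.4

# core-site.xml:
<property>
  <name>parquet.memory.pool.ratio</name>
  <value>0.3</value>
</property>
```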
[jira] [Resolved] (SPARK-12054) Consider nullable in codegen
[ https://issues.apache.org/jira/browse/SPARK-12054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12054. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10333 [https://github.com/apache/spark/pull/10333] > Consider nullable in codegen > > > Key: SPARK-12054 > URL: https://issues.apache.org/jira/browse/SPARK-12054 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Currently, we always check the nullability for results of expressions, we > could skip that if the expression is not nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12341) The "comment" field of DESCRIBE result set should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12341. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > The "comment" field of DESCRIBE result set should be nullable > - > > Key: SPARK-12341 > URL: https://issues.apache.org/jira/browse/SPARK-12341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12342) Corr (Pearson correlation) should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12342. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > Corr (Pearson correlation) should be nullable > - > > Key: SPARK-12342 > URL: https://issues.apache.org/jira/browse/SPARK-12342 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12336) Outer join using multiple columns results in wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12336. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > Outer join using multiple columns results in wrong nullability > -- > > Key: SPARK-12336 > URL: https://issues.apache.org/jira/browse/SPARK-12336 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > > When joining two DataFrames using multiple columns, a temporary inner join is > used to compute the join output. Then a real join operator is created and > projected. However, the final projection list is based on the inner join > rather than the real join operator. When the real join operator is an outer join, > the nullability of the final projection can be wrong, since an outer join may alter > the nullability of its child plan(s). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
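The nullability rule in question can be sketched with a small hypothetical helper (the name `output_nullability` and the list-of-flags representation are illustrative, not Spark APIs): the output nullability of a join depends on the join type, so a projection built from the temporary *inner* join sees flags that can be too strict.

```python
def output_nullability(left_nullable, right_nullable, join_type):
    """Per-column nullability of a join's output, given each side's schema.
    An outer join can make the other side's columns nullable regardless of
    their original nullability."""
    left = [n or join_type in ("right_outer", "full_outer") for n in left_nullable]
    right = [n or join_type in ("left_outer", "full_outer") for n in right_nullable]
    return left + right
```

A projection derived from the inner-join case (`[False, False]`) would wrongly mark a left-outer join's right-side column as non-nullable.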
[jira] [Assigned] (SPARK-12342) Corr (Pearson correlation) should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12342: -- Assignee: Davies Liu (was: Cheng Lian) > Corr (Pearson correlation) should be nullable > - > > Key: SPARK-12342 > URL: https://issues.apache.org/jira/browse/SPARK-12342 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12341) The "comment" field of DESCRIBE result set should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12341: -- Assignee: Davies Liu (was: Apache Spark) > The "comment" field of DESCRIBE result set should be nullable > - > > Key: SPARK-12341 > URL: https://issues.apache.org/jira/browse/SPARK-12341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12335) CentralMomentAgg should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12335: -- Assignee: Davies Liu (was: Apache Spark) > CentralMomentAgg should be nullable > --- > > Key: SPARK-12335 > URL: https://issues.apache.org/jira/browse/SPARK-12335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > > According to the {{getStatistics}} method overridden in all its subclasses, > {{CentralMomentAgg}} should be nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12335) CentralMomentAgg should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12335. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > CentralMomentAgg should be nullable > --- > > Key: SPARK-12335 > URL: https://issues.apache.org/jira/browse/SPARK-12335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > > According to the {{getStatistics}} method overridden in all its subclasses, > {{CentralMomentAgg}} should be nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12336) Outer join using multiple columns results in wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12336: -- Assignee: Davies Liu (was: Cheng Lian) > Outer join using multiple columns results in wrong nullability > -- > > Key: SPARK-12336 > URL: https://issues.apache.org/jira/browse/SPARK-12336 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > > When joining two DataFrames using multiple columns, a temporary inner join is > used to compute the join output. Then a real join operator is created and > projected. However, the final projection list is based on the inner join > rather than the real join operator. When the real join operator is an outer join, > the nullability of the final projection can be wrong, since an outer join may alter > the nullability of its child plan(s). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12091) [PySpark] Removal of the JAVA-specific deserialized storage levels
[ https://issues.apache.org/jira/browse/SPARK-12091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12091. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10092 [https://github.com/apache/spark/pull/10092] > [PySpark] Removal of the JAVA-specific deserialized storage levels > -- > > Key: SPARK-12091 > URL: https://issues.apache.org/jira/browse/SPARK-12091 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > Since the data is always serialized on the Python side, the Java-specific > deserialized storage levels, such as MEMORY_ONLY, are not provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12395: -- Assignee: Davies Liu > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
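A minimal pure-Python sketch of the intended `join(usingColumns)` semantics (a hypothetical helper, not the DataFrame API; it assumes unique keys per side): for right/full outer joins, the output join column has to be coalesced from both sides, which is exactly the value reported as wrongly null.

```python
def full_outer_join_using(left, right, key):
    """Full outer join of two lists of dict-rows on a shared column.
    The output join column is coalesce(left[key], right[key]); without the
    coalesce, rows present on only one side would get a null key."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    out = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        l = left_by_key.get(k, {})
        r = right_by_key.get(k, {})
        # coalesce(left[key], right[key]) for the join column:
        row = {key: l.get(key) if l.get(key) is not None else r.get(key)}
        row.update({c: v for c, v in l.items() if c != key})
        row.update({c: v for c, v in r.items() if c != key})
        out.append(row)
    return out
```

Taking the key from only the left side would reproduce the bug: right-only rows would carry `None` in the join column.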
[jira] [Created] (SPARK-12402) Memory leak in broadcast hash join
Davies Liu created SPARK-12402: -- Summary: Memory leak in broadcast hash join Key: SPARK-12402 URL: https://issues.apache.org/jira/browse/SPARK-12402 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The broadcast HashRelation is not destroyed after the query finishes (and also can't be reused). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12395. Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10353 [https://github.com/apache/spark/pull/10353] > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 2.0.0, 1.6.1 > > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12402) Memory leak in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12402: --- Description: After running some SQL queries in PySpark, the DataFrames are still referenced by the Py4J gateway; they are freed only after calling `gc.collect()` in Python. (was: The broadcast HashRelation is not destroyed after the query finishes (and also can't be reused).) > Memory leak in pyspark > -- > > Key: SPARK-12402 > URL: https://issues.apache.org/jira/browse/SPARK-12402 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > After running some SQL queries in PySpark, the DataFrames are still referenced by > the Py4J gateway; they are freed only after calling `gc.collect()` in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
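The "freed only after `gc.collect()`" behavior rests on a general Python property that this standalone sketch demonstrates (stdlib only, not PySpark code): objects kept alive solely by a reference cycle are not freed by reference counting and are reclaimed only when the cyclic garbage collector runs.

```python
import gc
import weakref

class Holder:
    """Plain object we can attach attributes and weak references to."""
    pass

def make_cycle():
    a, b = Holder(), Holder()
    a.other, b.other = b, a      # reference cycle: each keeps the other alive
    return weakref.ref(a)        # weak ref lets us observe liveness

gc.disable()                     # suppress automatic collection for determinism
ref = make_cycle()
alive_before = ref() is not None # True: the cycle holds the objects
gc.collect()                     # explicit collection reclaims the cycle
alive_after = ref() is not None  # False: freed only once the collector ran
gc.enable()
```

In the PySpark case the Python-side wrappers sit in similar cycles, so the Java-side objects they pin via the Py4J gateway stay live until `gc.collect()` runs.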
[jira] [Updated] (SPARK-12402) Memory leak in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12402: --- Summary: Memory leak in pyspark (was: Memory leak in broadcast hash join) > Memory leak in pyspark > -- > > Key: SPARK-12402 > URL: https://issues.apache.org/jira/browse/SPARK-12402 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > The broadcast HashRelation is not destroyed after the query finishes (and also > can't be reused). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8745) Remove GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8745. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10316 [https://github.com/apache/spark/pull/10316] > Remove GenerateProjection > - > > Key: SPARK-8745 > URL: https://issues.apache.org/jira/browse/SPARK-8745 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 2.0.0 > > > Based on discussion offline with [~marmbrus], we should remove > GenerateProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12380) MLLib should use existing SQLContext instead create new one
[ https://issues.apache.org/jira/browse/SPARK-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12380. Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10338 [https://github.com/apache/spark/pull/10338] > MLLib should use existing SQLContext instead create new one > --- > > Key: SPARK-12380 > URL: https://issues.apache.org/jira/browse/SPARK-12380 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0, 1.6.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12380) MLLib should use existing SQLContext instead create new one
Davies Liu created SPARK-12380: -- Summary: MLLib should use existing SQLContext instead create new one Key: SPARK-12380 URL: https://issues.apache.org/jira/browse/SPARK-12380 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12385) Push projection into Join
Davies Liu created SPARK-12385: -- Summary: Push projection into Join Key: SPARK-12385 URL: https://issues.apache.org/jira/browse/SPARK-12385 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu We usually have a Join followed by a projection to prune some columns, but Join already has a result projection to produce UnsafeRow; we should combine the two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
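The optimization idea can be sketched in a few lines (illustrative Python, not Catalyst code; `compose_projections` and `project` are hypothetical names): rather than running the join's result projection and then a separate pruning projection per output row, the two are composed into a single projection evaluated once.

```python
def compose_projections(result_proj, pruning_cols):
    """Merge the join's result projection with a downstream column-pruning
    projection into one projection (order follows the join output)."""
    keep = set(pruning_cols)
    return [c for c in result_proj if c in keep]

def project(rows, cols):
    """Apply a projection to a list of dict-rows."""
    return [{c: row[c] for c in cols} for row in rows]
```

The combined projection touches each row once, instead of materializing the full join output and then pruning it.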
[jira] [Created] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
Davies Liu created SPARK-12395: -- Summary: Result of DataFrame.join(usingColumns) could be wrong for outer join Key: SPARK-12395 URL: https://issues.apache.org/jira/browse/SPARK-12395 Project: Spark Issue Type: Bug Reporter: Davies Liu For API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12395: --- Priority: Critical (was: Major) > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Priority: Critical > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12395: --- Priority: Blocker (was: Critical) > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Priority: Blocker > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12388) Change default compressor to LZ4
Davies Liu created SPARK-12388: -- Summary: Change default compressor to LZ4 Key: SPARK-12388 URL: https://issues.apache.org/jira/browse/SPARK-12388 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy. After changing the compressor to LZ4, I saw a 25% improvement in end-to-end time for a TPCDS query (Q4). [1] https://github.com/ning/jvm-compressor-benchmark/wiki -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12388) Change default compressor to LZ4
[ https://issues.apache.org/jira/browse/SPARK-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12388: --- Description: According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy. After changing the compressor to LZ4, I saw a 20% improvement in end-to-end time for a TPCDS query (Q4). [1] https://github.com/ning/jvm-compressor-benchmark/wiki was: According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy. After changing the compressor to LZ4, I saw a 25% improvement in end-to-end time for a TPCDS query (Q4). [1] https://github.com/ning/jvm-compressor-benchmark/wiki > Change default compressor to LZ4 > > > Key: SPARK-12388 > URL: https://issues.apache.org/jira/browse/SPARK-12388 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than > Snappy. > After changing the compressor to LZ4, I saw a 20% improvement in end-to-end > time for a TPCDS query (Q4). > [1] https://github.com/ning/jvm-compressor-benchmark/wiki -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
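A comparison like the cited benchmark can be reproduced with a small harness. Since `lz4` and `python-snappy` are third-party packages, the stdlib `zlib` stands in as the codec in this sketch; with those packages installed, pass `lz4.frame.compress` and `snappy.compress` instead to compare the actual pair.

```python
import time
import zlib

def bench_compress(compress, data, repeat=5):
    """Best-of-N wall-clock time for one compression call, plus the
    compressed/uncompressed size ratio."""
    best = float("inf")
    out = b""
    for _ in range(repeat):
        start = time.perf_counter()
        out = compress(data)
        best = min(best, time.perf_counter() - start)
    return best, len(out) / len(data)

# Repetitive payload, roughly shuffle-block-sized and highly compressible.
data = b"spark sql tpcds query " * 20000
elapsed, ratio = bench_compress(zlib.compress, data)
```

Best-of-N timing reduces noise from warm-up and scheduling; ratio is reported alongside speed because the two trade off.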
[jira] [Updated] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12395: --- Affects Version/s: 1.6.0 > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Priority: Blocker > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12395: --- Priority: Blocker (was: Critical) > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Priority: Blocker > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12395: --- Component/s: SQL > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Priority: Blocker > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12395) Result of DataFrame.join(usingColumns) could be wrong for outer join
[ https://issues.apache.org/jira/browse/SPARK-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12395: --- Priority: Critical (was: Blocker) > Result of DataFrame.join(usingColumns) could be wrong for outer join > > > Key: SPARK-12395 > URL: https://issues.apache.org/jira/browse/SPARK-12395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Priority: Critical > > For API DataFrame.join(right, usingColumns, joinType), if the joinType is > right_outer or full_outer, the resulting join column could be wrong (null). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect
[ https://issues.apache.org/jira/browse/SPARK-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12288: --- Assignee: Xiao Li > Support UnsafeRow in Coalesce/Except/Intersect > -- > > Key: SPARK-12288 > URL: https://issues.apache.org/jira/browse/SPARK-12288 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Xiao Li > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12293) Support UnsafeRow in LocalTableScan
[ https://issues.apache.org/jira/browse/SPARK-12293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12293: --- Assignee: Liang-Chi Hsieh (was: Apache Spark) > Support UnsafeRow in LocalTableScan > --- > > Key: SPARK-12293 > URL: https://issues.apache.org/jira/browse/SPARK-12293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Liang-Chi Hsieh > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code
[ https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060213#comment-15060213 ] Davies Liu commented on SPARK-12179: I think this UDF is not thread-safe; rowNum and comparedColumn will be updated by multiple threads concurrently. > Spark SQL get different result with the same code > - > > Key: SPARK-12179 > URL: https://issues.apache.org/jira/browse/SPARK-12179 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, > 1.5.2, 1.5.3 > Environment: hadoop version: 2.5.0-cdh5.3.2 > spark version: 1.5.3 > run mode: yarn-client >Reporter: Tao Li >Priority: Critical > > I run the sql in yarn-client mode, but get different result each time. > As you can see the example, I get the different shuffle write with the same > shuffle read in two jobs with the same code. > Some of my spark app runs well, but some always met this problem. And I met > this problem on spark 1.3, 1.4 and 1.5 version. > Can you give me some suggestions about the possible causes or how do I figure > out the problem? > 1. First Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.8 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54934 > 2. Second Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.6 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54905 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
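The hazard described in the comment can be shown with a self-contained sketch (generic Python; `row_num` here is a stand-in for mutable fields like the reported `rowNum`/`comparedColumn`): an unsynchronized read-modify-write on shared state races across threads, while a lock restores determinism.

```python
import threading

class StatefulUdf:
    """Stand-in for a UDF carrying mutable fields shared across threads."""
    def __init__(self):
        self.row_num = 0
        self._lock = threading.Lock()

    def call_unsafe(self):
        self.row_num += 1          # read-modify-write: racy without a lock

    def call_safe(self):
        with self._lock:           # serialize the update
            self.row_num += 1

def run(method_name, threads=8, iters=10_000):
    """Invoke the chosen method from several threads; return the final count."""
    udf = StatefulUdf()
    def worker():
        method = getattr(udf, method_name)
        for _ in range(iters):
            method()
    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return udf.row_num
```

The locked version always totals `threads * iters`; the unlocked one can lose updates, which is one way "the same code" yields different results run to run.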
[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code
[ https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060217#comment-15060217 ] Davies Liu commented on SPARK-12179: Which version of Spark are you using? > Spark SQL get different result with the same code > - > > Key: SPARK-12179 > URL: https://issues.apache.org/jira/browse/SPARK-12179 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, > 1.5.2, 1.5.3 > Environment: hadoop version: 2.5.0-cdh5.3.2 > spark version: 1.5.3 > run mode: yarn-client >Reporter: Tao Li >Priority: Critical > > I run the sql in yarn-client mode, but get different result each time. > As you can see the example, I get the different shuffle write with the same > shuffle read in two jobs with the same code. > Some of my spark app runs well, but some always met this problem. And I met > this problem on spark 1.3, 1.4 and 1.5 version. > Can you give me some suggestions about the possible causes or how do I figure > out the problem? > 1. First Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.8 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54934 > 2. Second Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.6 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54905 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12179) Spark SQL get different result with the same code
[ https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060217#comment-15060217 ] Davies Liu edited comment on SPARK-12179 at 12/16/15 4:04 PM: -- Which version of Spark are you using? Can you try latest 1.5 branch or 1.6 RC? was (Author: davies): Which version of Spark are you using? > Spark SQL get different result with the same code > - > > Key: SPARK-12179 > URL: https://issues.apache.org/jira/browse/SPARK-12179 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, > 1.5.2, 1.5.3 > Environment: hadoop version: 2.5.0-cdh5.3.2 > spark version: 1.5.3 > run mode: yarn-client >Reporter: Tao Li >Priority: Critical > > I run the sql in yarn-client mode, but get different result each time. > As you can see the example, I get the different shuffle write with the same > shuffle read in two jobs with the same code. > Some of my spark app runs well, but some always met this problem. And I met > this problem on spark 1.3, 1.4 and 1.5 version. > Can you give me some suggestions about the possible causes or how do I figure > out the problem? > 1. First Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.8 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54934 > 2. Second Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.6 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54905 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8745) Remove GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8745: - Assignee: Davies Liu > Remove GenerateProjection > - > > Key: SPARK-8745 > URL: https://issues.apache.org/jira/browse/SPARK-8745 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > > Based on discussion offline with [~marmbrus], we should remove > GenerateProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect
[ https://issues.apache.org/jira/browse/SPARK-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12288. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10285 [https://github.com/apache/spark/pull/10285] > Support UnsafeRow in Coalesce/Except/Intersect > -- > > Key: SPARK-12288 > URL: https://issues.apache.org/jira/browse/SPARK-12288 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12016) word2vec load model can't use findSynonyms to get words
[ https://issues.apache.org/jira/browse/SPARK-12016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12016. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10100 [https://github.com/apache/spark/pull/10100] > word2vec load model can't use findSynonyms to get words > > > Key: SPARK-12016 > URL: https://issues.apache.org/jira/browse/SPARK-12016 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: ubuntu 14.04 >Reporter: yuangang.liu > Fix For: 2.0.0 > > > I use word2vec.fit to train a word2vecModel and then save the model to file > system. when I load the model from file system, I found I can use > transform('a') to get a vector, but I can't use findSynonyms('a', 2) to get > some words. > I use the fellow code to test word2vec > from pyspark import SparkContext > from pyspark.mllib.feature import Word2Vec, Word2VecModel > import os, tempfile > from shutil import rmtree > if __name__ == '__main__': > sc = SparkContext('local', 'test') > sentence = "a b " * 100 + "a c " * 10 > localDoc = [sentence, sentence] > doc = sc.parallelize(localDoc).map(lambda line: line.split(" ")) > model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc) > syms = model.findSynonyms("a", 2) > print [s[0] for s in syms] > path = tempfile.mkdtemp() > model.save(sc, path) > sameModel = Word2VecModel.load(sc, path) > print model.transform("a") == sameModel.transform("a") > syms = sameModel.findSynonyms("a", 2) > print [s[0] for s in syms] > try: > rmtree(path) > except OSError: > pass > I got "[u'b', u'c']" when the first printf > then the “True” and " [u'__class__'] " > I don't know how to get 'b' or 'c' with sameModel.findSynonyms("a", 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12284) Output UnsafeRow from window function
Davies Liu created SPARK-12284: -- Summary: Output UnsafeRow from window function Key: SPARK-12284 URL: https://issues.apache.org/jira/browse/SPARK-12284 Project: Spark Issue Type: Improvement Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)
Davies Liu created SPARK-12286: -- Summary: Support UnsafeRow in all SparkPlan (if possible) Key: SPARK-12286 URL: https://issues.apache.org/jira/browse/SPARK-12286 Project: Spark Issue Type: Epic Components: SQL Reporter: Davies Liu There are still some SparkPlan implementations that do not support UnsafeRow (or do not support it well).
[jira] [Created] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit
Davies Liu created SPARK-12289: -- Summary: Support UnsafeRow in TakeOrderedAndProject/Limit Key: SPARK-12289 URL: https://issues.apache.org/jira/browse/SPARK-12289 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Created] (SPARK-12290) Change the default value in SparkPlan
Davies Liu created SPARK-12290: -- Summary: Change the default value in SparkPlan Key: SPARK-12290 URL: https://issues.apache.org/jira/browse/SPARK-12290 Project: Spark Issue Type: Improvement Reporter: Davies Liu
{code}
supportUnsafeRows = true
supportSafeRows = false
// outputUnsafeRows = true
{code}
[jira] [Created] (SPARK-12283) Use UnsafeRow as the buffer in SortBasedAggregation to avoid Unsafe/Safe conversion
Davies Liu created SPARK-12283: -- Summary: Use UnsafeRow as the buffer in SortBasedAggregation to avoid Unsafe/Safe conversion Key: SPARK-12283 URL: https://issues.apache.org/jira/browse/SPARK-12283 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu SortBasedAggregation uses GenericMutableRow as the aggregation buffer and also requires that the input not be UnsafeRow, because we can't compare/evaluate UnsafeRow and GenericInternalRow at the same time. TungstenSort outputs UnsafeRow, so multiple Safe/Unsafe projections will be inserted between them. If we can make sure that all mutation happens in ascending order, an UnsafeRow buffer could be used to update var-length objects (String, Binary, Struct, etc.)
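The ascending-order constraint above can be illustrated with a toy flat-buffer row (plain Python, an illustration only — not Spark's actual UnsafeRow code; the class and field layout here are invented for the sketch): fixed-width fields live in per-field slots and can be overwritten in place at any time, while variable-length values are appended to a tail region, so writing var-length fields in ascending order keeps the tail strictly append-only.

```python
# Toy model of an UnsafeRow-like flat buffer (illustration only, not Spark code).
# Layout: one 8-byte slot per field; fixed-width values are stored inline in the
# slot, var-length values store (offset, length) pointing into a growing tail.
import struct

class FlatRow:
    def __init__(self, num_fields):
        self.slots = bytearray(8 * num_fields)  # fixed-width slot area
        self.tail = bytearray()                 # var-length data area

    def set_long(self, i, value):
        # Fixed-width: safe to overwrite in place, in any order, at any time.
        struct.pack_into("<q", self.slots, 8 * i, value)

    def get_long(self, i):
        return struct.unpack_from("<q", self.slots, 8 * i)[0]

    def set_bytes(self, i, data):
        # Var-length: append to the tail, record (offset, length) in the slot.
        # If fields are written in ascending order, the tail is append-only;
        # re-writing an earlier field later would strand its old bytes.
        offset = len(self.tail)
        self.tail.extend(data)
        struct.pack_into("<ii", self.slots, 8 * i, offset, len(data))

    def get_bytes(self, i):
        offset, length = struct.unpack_from("<ii", self.slots, 8 * i)
        return bytes(self.tail[offset:offset + length])

row = FlatRow(3)
row.set_long(0, 42)
row.set_bytes(1, b"hello")   # var-length fields written in ascending order
row.set_bytes(2, b"world!")
row.set_long(0, 43)          # fixed-width rewrite is always fine
print(row.get_long(0), row.get_bytes(1), row.get_bytes(2))
# prints: 43 b'hello' b'world!'
```

This is why the ticket can rely on sort-based aggregation: the sorted input guarantees each group's buffer fields are finalized before the next group's, so var-length updates naturally arrive in the order the buffer can absorb them.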
[jira] [Assigned] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)
[ https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12286: -- Assignee: Davies Liu > Support UnsafeRow in all SparkPlan (if possible) > > > Key: SPARK-12286 > URL: https://issues.apache.org/jira/browse/SPARK-12286 > Project: Spark > Issue Type: Epic > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > There are still some SparkPlan implementations that do not support UnsafeRow (or do not support it well).
[jira] [Updated] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup
[ https://issues.apache.org/jira/browse/SPARK-12287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12287: --- Issue Type: Improvement (was: Epic) > Support UnsafeRow in MapPartitions/MapGroups/CoGroup > > > Key: SPARK-12287 > URL: https://issues.apache.org/jira/browse/SPARK-12287 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >
[jira] [Created] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect
Davies Liu created SPARK-12288: -- Summary: Support UnsafeRow in Coalesce/Except/Intersect Key: SPARK-12288 URL: https://issues.apache.org/jira/browse/SPARK-12288 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Created] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash
Davies Liu created SPARK-12291: -- Summary: Support UnsafeRow in BroadcastLeftSemiJoinHash Key: SPARK-12291 URL: https://issues.apache.org/jira/browse/SPARK-12291 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Created] (SPARK-12293) Support UnsafeRow in LocalTableScan
Davies Liu created SPARK-12293: -- Summary: Support UnsafeRow in LocalTableScan Key: SPARK-12293 URL: https://issues.apache.org/jira/browse/SPARK-12293 Project: Spark Issue Type: Improvement Reporter: Davies Liu
[jira] [Created] (SPARK-12294) Support UnsafeRow in HiveTableScan
Davies Liu created SPARK-12294: -- Summary: Support UnsafeRow in HiveTableScan Key: SPARK-12294 URL: https://issues.apache.org/jira/browse/SPARK-12294 Project: Spark Issue Type: Improvement Reporter: Davies Liu