[jira] [Comment Edited] (SPARK-15420) Repartition and sort before Parquet writes

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627859#comment-15627859
 ] 

Reynold Xin edited comment on SPARK-15420 at 11/2/16 5:57 AM:
--

Ryan, I looked at this just now (sorry for not looking earlier). I recently 
refactored this part of the code, so I have a pretty good idea of what's 
going on here. I think a simpler solution is to move the sort out of the 
writer, and then use the planner to inject the necessary sorting and exchanges. 
It will be more obvious in the explain plan, and it also allows us to reuse 
existing operator code. WDYT?

Also, why do we need DataFrameWriter to expose partitioning? Couldn't that be 
done via the DataFrame.repartition function itself before calling the writer?
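
To illustrate the second question, here is a minimal sketch (assuming a DataFrame {{df}} with a {{date}} partition column and an active SparkSession; names and paths are illustrative) of doing the repartition and sort on the DataFrame side before invoking the writer:

{code}
// Hedged sketch: prepare the data before the write instead of relying on the writer's
// own sort. repartition clusters rows that belong to the same output partition, and
// sortWithinPartitions orders them inside each task, so each open Parquet file
// receives its rows contiguously.
import org.apache.spark.sql.functions.col

val prepared = df
  .repartition(col("date"))
  .sortWithinPartitions(col("date"), col("id"))

prepared.write
  .partitionBy("date")
  .parquet("/path/to/output")
{code}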



was (Author: rxin):
Ryan, I looked at this just now (sorry for not looking earlier). I recently 
refactored this part of the code, so I have a pretty good idea of what's 
going on here. I think a simpler solution is to move the sort out of the 
writer, and then use the planner to inject the necessary sorting and exchanges. 
It will be more obvious in the explain plan, and it also allows us to reuse 
existing operator code. WDYT?



> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627859#comment-15627859
 ] 

Reynold Xin commented on SPARK-15420:
-

Ryan, I looked at this just now (sorry for not looking earlier). I recently 
refactored this part of the code, so I have a pretty good idea of what's 
going on here. I think a simpler solution is to move the sort out of the 
writer, and then use the planner to inject the necessary sorting and exchanges. 
It will be more obvious in the explain plan, and it also allows us to reuse 
existing operator code. WDYT?



> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18133) Python ML Pipeline Example has syntax errors

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627856#comment-15627856
 ] 

Apache Spark commented on SPARK-18133:
--

User 'jagadeesanas2' has created a pull request for this issue:
https://github.com/apache/spark/pull/15729

> Python ML Pipeline Example has syntax errors
> 
>
> Key: SPARK-18133
> URL: https://issues.apache.org/jira/browse/SPARK-18133
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Nirmal Fernando
>Assignee: Jagadeesan A S
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.0
>
>
> $ ./bin/spark-submit examples/src/main/python/ml/pipeline_example.py
>   File 
> "/spark-2.0.0-bin-hadoop2.7/examples/src/main/python/ml/pipeline_example.py", 
> line 38
> (0L, "a b c d e spark", 1.0),
>   ^
> SyntaxError: invalid syntax
> Removing 'L' from all occurrences resolves the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15420:

Target Version/s: 2.2.0  (was: 2.1.0)

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-11-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627846#comment-15627846
 ] 

Felix Cheung commented on SPARK-17822:
--

I don't have a good handle on what the problem actually is. [~yhuai], could you 
give us some pointers?


> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Seems it is pretty easy to fail to remove objects from JVMObjectTracker.objMap. 
> So it seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 
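
As a rough illustration of the weak-reference idea (not the actual JVMObjectTracker code; it assumes Guava's cache with weak values, and the names are illustrative):

{code}
// Hedged sketch: hold tracked JVM objects through weak references so that entries
// whose objects are no longer strongly referenced elsewhere can be garbage collected
// instead of accumulating in the map.
import com.google.common.cache.CacheBuilder

val objMap = CacheBuilder.newBuilder()
  .weakValues()
  .build[String, AnyRef]()

objMap.put("obj-1", new Object())                   // id handed back to the R side
val tracked = Option(objMap.getIfPresent("obj-1"))  // None once the object is collected
{code}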



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-01 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627838#comment-15627838
 ] 

Cody Koeninger commented on SPARK-18212:


So here's a heavily excerpted version of what I see happening in that log:

{code}
16/11/01 14:08:46.593 pool-1-thread-1-ScalaTest-running-KafkaSourceSuite INFO 
KafkaTestUtils:   Sent 34 to partition 2, offset 3
16/11/01 14:08:46.593 pool-1-thread-1-ScalaTest-running-KafkaSourceSuite INFO 
KafkaProducer: Closing the Kafka producer with timeoutMillis = 
9223372036854775807 ms.
16/11/01 14:08:46.596 pool-1-thread-1-ScalaTest-running-KafkaSourceSuite INFO 
KafkaTestUtils: Created consumer to get latest offsets


16/11/01 14:08:47.833 Executor task launch worker-2 ERROR Executor: Exception 
in task 1.0 in stage 29.0 (TID 142)
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-kafka-source-a9485cc4-c83d-4e97-a20e-3960565b3fdb-335403166-execut\
or topic-5-2 3 after polling for 512


16/11/01 14:08:49.252 pool-1-thread-1-ScalaTest-running-KafkaSourceSuite INFO 
KafkaTestUtils: Closed consumer to get latest offsets
16/11/01 14:08:49.252 pool-1-thread-1-ScalaTest-running-KafkaSourceSuite INFO 
KafkaSourceSuite: Added data, expected offset [(topic-5-0,4), (topic-5-1,4), 
(topic-5-2,4), (topic-5-3,4), (topic-5-4,4)]
{code}


We wait on the producer's send future for up to 10 seconds, and it takes almost 
3 seconds between when the producer send finishes and when the consumer used to 
verify the post-send offsets finishes; but in the meantime we only wait half a 
second for executor fetches.

It's really ugly, but probably the easiest way to make this less flaky is to 
increase the value of kafkaConsumer.pollTimeoutMs to the same order of 
magnitude being used for the other test waits.

[~zsxwing] unless you see anything else wrong in the log or have a better idea, 
I can put in a pr tomorrow to increase that poll timeout in tests.
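
For reference, a hedged sketch of what bumping that timeout looks like where the test creates the source (assuming a SparkSession {{spark}}; broker address and topic are illustrative):

{code}
// Hedged sketch: raise the executor-side poll timeout so a slow fetch right after a
// produce doesn't fail the batch while the rest of the test waits on the order of seconds.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic-5")
  .option("kafkaConsumer.pollTimeoutMs", "10000")   // the 512 ms value shows up in the log above
  .load()
{code}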


> Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from 
> specific offsets
> ---
>
> Key: SPARK-18212
> URL: https://issues.apache.org/jira/browse/SPARK-18212
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Davies Liu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17868:

Target Version/s: 2.2.0  (was: 2.1.0)

> Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-17868
> URL: https://issues.apache.org/jira/browse/SPARK-17868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> We generate bitmasks for grouping sets during the parsing process, and use 
> these during analysis. These bitmasks are difficult to work with in practice 
> and have led to numerous bugs. I suggest that we remove these and use actual 
> sets instead; however, we would still need to generate the offsets for the 
> grouping_id.
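
As a rough illustration of the difference between the two representations (the bit convention here is illustrative, not necessarily Spark's exact one):

{code}
// Hedged sketch: for GROUP BY a, b, c with grouping set (a, c), a bitmask encodes which
// grouping columns are absent (illustrative convention: bit i set <=> column i not grouped),
// while a set representation simply names the grouped columns.
val groupByExprs = Seq("a", "b", "c")
val groupingSet  = Set("a", "c")

val bitmask = groupByExprs.zipWithIndex.foldLeft(0) {
  case (mask, (column, i)) =>
    if (groupingSet.contains(column)) mask else mask | (1 << i)
}
// bitmask == 2 (binary 010): meaningful only relative to the ordering of groupByExprs,
// whereas groupingSet carries the same information without positional bookkeeping.
{code}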



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17402) separate the management of temp views and metastore tables/views in SessionCatalog

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17402.
---
Resolution: Fixed

> separate the management of temp views and metastore tables/views in 
> SessionCatalog
> --
>
> Key: SPARK-17402
> URL: https://issues.apache.org/jira/browse/SPARK-17402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java

2016-11-01 Thread Hubert Kang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627774#comment-15627774
 ] 

Hubert Kang commented on SPARK-18193:
-

Thanks Sean.

That said, it's inconsistent with QueueStream.scala, where something is pushed 
to the queue after ssc.start().

That is expected, so that a live data stream can be handled.

Hubert
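
For reference, a minimal sketch of that pattern from QueueStream.scala (assuming a StreamingContext {{ssc}}; batch contents and sizes are illustrative): the queue is fed after {{ssc.start()}}, which is what lets it stand in for a live stream.

{code}
// Hedged sketch of the QueueStream pattern: create the stream over an initially empty
// queue, start the context, and only then push RDDs into the queue batch by batch.
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

ssc.start()
for (_ <- 1 to 30) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  }
  Thread.sleep(1000)
}
ssc.stop()
{code}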

> queueStream not updated if rddQueue.add after create queueStream in Java
> 
>
> Key: SPARK-18193
> URL: https://issues.apache.org/jira/browse/SPARK-18193
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Hubert Kang
>
> Within 
> examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java,
>  no data is detected if the code below, which puts something into rddQueue, is 
> executed after the queueStream is created (line 65).
> for (int i = 0; i < 30; i++) {
>   rddQueue.add(ssc.sparkContext().parallelize(list));
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17992.
-
   Resolution: Fixed
 Assignee: Michael Allman
Fix Version/s: 2.1.0

> HiveClient.getPartitionsByFilter throws an exception for some unsupported 
> filters when hive.metastore.try.direct.sql=false
> --
>
> Key: SPARK-17992
> URL: https://issues.apache.org/jira/browse/SPARK-17992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.1.0
>
>
> We recently added (and enabled by default) table partition pruning for 
> partitioned Hive tables converted to using {{TableFileCatalog}}. When the 
> Hive configuration option {{hive.metastore.try.direct.sql}} is set to 
> {{false}}, Hive will throw an exception for unsupported filter expressions. 
> For example, attempting to filter on an integer partition column will throw a 
> {{org.apache.hadoop.hive.metastore.api.MetaException}}.
> I discovered this behavior because VideoAmp uses the CDH version of Hive with 
> a Postgresql metastore DB. In this configuration, CDH sets 
> {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that 
> filter on a non-string partition column will fail. That would be a rather 
> rude surprise for these Spark 2.1 users...
> I'm not sure exactly what behavior we should expect, but I suggest that 
> {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and 
> return all partitions instead. This is what Spark does for Hive 0.12 users, 
> which does not support this feature at all.
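
A rough sketch of the suggested fallback (not the actual HiveClientImpl code; the fetch* helpers are hypothetical stand-ins for the real metastore calls):

{code}
// Hedged sketch: if the metastore rejects the pushed-down partition filter (e.g. because
// direct SQL is disabled), fall back to fetching all partitions and let Spark prune them.
import org.apache.hadoop.hive.metastore.api.MetaException

// Hypothetical helper: always fails here, to simulate the unsupported-filter case.
def fetchFilteredPartitions(table: String, filter: String): Seq[String] =
  throw new MetaException("filter pushdown not supported with direct SQL disabled")

// Hypothetical helper returning illustrative partition specs.
def fetchAllPartitions(table: String): Seq[String] =
  Seq("ds=2016-11-01", "ds=2016-11-02")

def getPartitionsByFilter(table: String, filter: String): Seq[String] =
  try fetchFilteredPartitions(table, filter)
  catch { case _: MetaException => fetchAllPartitions(table) }
{code}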



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17838) Strict type checking for arguments with better messages across APIs.

2016-11-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627770#comment-15627770
 ] 

Felix Cheung commented on SPARK-17838:
--

Merged to master. This should be very safe to go into branch-2.1 if it makes it, 
but it's not critical.

> Strict type checking for arguments with better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly, there seem to be three cases, as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of the other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from the JVM and format it nicely
> - Strictly check types before calling the JVM in SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17838) Strict type checking for arguments with better messages across APIs.

2016-11-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17838.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Strict type checking for arguments with better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly, there seem to be three cases, as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of the other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from the JVM and format it nicely
> - Strictly check types before calling the JVM in SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7755) MetadataCache.refresh does not take into account _SUCCESS

2016-11-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627754#comment-15627754
 ] 

Hyukjin Kwon commented on SPARK-7755:
-

Hi [~liancheng], didn't we remove {{DirectParquetOutputCommitter}} support for 
now? I just wonder whether this is still an issue to keep open on the Spark side 
if the problem comes from {{ParquetOutputCommitter}}.

> MetadataCache.refresh does not take into account _SUCCESS
> -
>
> Key: SPARK-7755
> URL: https://issues.apache.org/jira/browse/SPARK-7755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Rowan Chattaway
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you make a call to sqlc.parquetFile(path) where that path contains 
> partially written files, then refresh will fail in strange ways when it 
> attempts to read footer files.
> I would like to adjust the file discovery to take into account the presence 
> of _SUCCESS, and therefore only attempt to read if we have the success marker.
> I have made the changes locally and it doesn't appear to have any side 
> effects.
> What are people's thoughts about this?
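
A minimal sketch of the kind of check being proposed (not the actual MetadataCache code; the path is illustrative):

{code}
// Hedged sketch: only treat a directory as complete, and scan its Parquet footers,
// when the _SUCCESS marker written by the output committer is present.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def hasSuccessMarker(dir: Path, conf: Configuration): Boolean = {
  val fs = dir.getFileSystem(conf)
  fs.exists(new Path(dir, "_SUCCESS"))
}

val ready = hasSuccessMarker(new Path("/data/events/2016-11-01"), new Configuration())
{code}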



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-11-01 Thread Siddharth Ahuja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627731#comment-15627731
 ] 

Siddharth Ahuja commented on SPARK-18073:
-

Hi [~srowen], I would be happy to work on this one if you are ok with it as I 
need to "get my feet wet" in Spark commits :)

> Migrate wiki to spark.apache.org web site
> -
>
> Key: SPARK-18073
> URL: https://issues.apache.org/jira/browse/SPARK-18073
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>
> Per 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
>  , let's consider migrating all wiki pages to documents at 
> github.com/apache/spark-website (i.e. spark.apache.org).
> Some reasons:
> * No pull request system or history for changes to the wiki
> * Separate, not-so-clear system for granting write access to wiki
> * Wiki doesn't change much
> * One less place to maintain or look for docs
> The idea would be to then update all wiki pages with a message pointing to 
> the new home of the information (or message saying it's obsolete).
> Here are the current wikis and my general proposal for what to do with the 
> content:
> * Additional Language Bindings -> roll this into wherever Third Party 
> Projects ends up
> * Committers -> Migrate to a new /committers.html page, linked under 
> Community menu (already exists)
> * Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
> /contributing.html page under Community menu
> ** Jira Permissions Scheme -> obsolete
> ** Spark Code Style Guide -> roll this into new contributing.html page
> * Development Discussions -> obsolete?
> * Powered By Spark -> Make into new /powered-by.html linked by the existing 
> Community menu item
> * Preparing Spark Releases -> see below; roll into where "versioning policy" 
> goes?
> * Profiling Spark Applications -> roll into where Useful Developer Tools goes
> ** Profiling Spark Applications Using YourKit -> ditto
> * Spark Internals -> all of these look somewhat to very stale; remove?
> ** Java API Internals
> ** PySpark Internals
> ** Shuffle Internals
> ** Spark SQL Internals
> ** Web UI Internals
> * Spark QA Infrastructure -> tough one. Good info to document; does it belong 
> on the website? we can just migrate it
> * Spark Versioning Policy -> new page living under Community (?) that 
> documents release policy and process (better menu?)
> ** spark-ec2 AMI list and install file version mappings -> obsolete
> ** Spark-Shark version mapping -> obsolete
> * Third Party Projects -> new Community menu item
> * Useful Developer Tools -> new page under new Developer menu? Community?
> ** Jenkins -> obsolete, remove
> Of course, another outcome is to just remove outdated wikis, migrate some, 
> leave the rest.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6825) Data sources implementation to support `sequenceFile`

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-6825.
--
Resolution: Won't Fix

I'm marking this as won't fix for now. It's unclear what the interface would 
look like, or whether anybody still uses sequence files.


> Data sources implementation to support `sequenceFile`
> -
>
> Key: SPARK-6825
> URL: https://issues.apache.org/jira/browse/SPARK-6825
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> SequenceFiles are a widely used input format and right now they are not 
> supported in SparkR. 
> It would be good to add support for SequenceFiles by implementing a new data 
> source that can create a DataFrame from a SequenceFile. However as 
> SequenceFiles can have arbitrary types, we probably need to map them to 
> User-defined types in SQL.
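
For context, a minimal sketch of what reading a SequenceFile looks like today through the core Scala API (assuming a SparkSession {{spark}}; path and value types are illustrative), which is the gap a data source and SparkR support would fill:

{code}
// Hedged sketch: load a SequenceFile of text records via the RDD API and lift it into
// a DataFrame; a proper data source would expose this directly (including to SparkR).
import spark.implicits._

val df = spark.sparkContext
  .sequenceFile[String, String]("/data/events.seq")
  .toDF("key", "value")

df.show()
{code}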



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627715#comment-15627715
 ] 

Xiao Li commented on SPARK-18209:
-

Ok, will do it. Thanks!

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6825) Data sources implementation to support `sequenceFile`

2016-11-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627714#comment-15627714
 ] 

Hyukjin Kwon commented on SPARK-6825:
-

Hi [~shivaram], do we still need this? If so, I can maybe give it a try.

> Data sources implementation to support `sequenceFile`
> -
>
> Key: SPARK-6825
> URL: https://issues.apache.org/jira/browse/SPARK-6825
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> SequenceFiles are a widely used input format and right now they are not 
> supported in SparkR. 
> It would be good to add support for SequenceFiles by implementing a new data 
> source that can create a DataFrame from a SequenceFile. However as 
> SequenceFiles can have arbitrary types, we probably need to map them to 
> User-defined types in SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18217) Disallow creating permanent views based on temporary views

2016-11-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18217:
---

 Summary: Disallow creating permanent views based on temporary views
 Key: SPARK-18217
 URL: https://issues.apache.org/jira/browse/SPARK-18217
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


See the discussion in the parent ticket SPARK-18209. It doesn't really make 
sense to create permanent views based on temporary views.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18217) Disallow creating permanent views based on temporary views

2016-11-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18217:
---

Assignee: Xiao Li

> Disallow creating permanent views based on temporary views
> --
>
> Key: SPARK-18217
> URL: https://issues.apache.org/jira/browse/SPARK-18217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
>
> See the discussion in the parent ticket SPARK-18209. It doesn't really make 
> sense to create permanent views based on temporary views.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627710#comment-15627710
 ] 

Reynold Xin commented on SPARK-18209:
-

Actually I'd consider it a "bug" and fix this bug first in Spark 2.1. Don't 
allow permanent view creation based on temporary views.


> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16808) History Server main page does not honor APPLICATION_WEB_PROXY_BASE

2016-11-01 Thread Vinayak Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627706#comment-15627706
 ] 

Vinayak Joshi commented on SPARK-16808:
---

The same issue also affects the case where "spark.ui.proxyBase" is set and it 
does not get honored in HistoryServer 2.0. 

This is a regression in 2.0 from 1.6. 

In 2.0, the refactoring of HistoryPage.scala changed the History application 
listing to be populated by an AJAX call. The page listing is rendered using 
ui/static/historypage-template.html, which does not take into account 
"spark.ui.proxyBase" / APPLICATION_WEB_PROXY_BASE when generating the links to 
the applications.

The existing History Server test suite, 
spark/deploy/history/HistoryServerSuite.scala, has an explicit test case to 
validate this scenario: "relative links are prefixed with uiRoot 
(spark.ui.proxyBase)". But this test case is broken too, because the HTML it 
receives and validates no longer contains any application links (as noted 
earlier, they are now generated through a JS AJAX call, which does not happen 
in this test). So even though the test case passes, it fails its purpose. 

> History Server main page does not honor APPLICATION_WEB_PROXY_BASE
> --
>
> Key: SPARK-16808
> URL: https://issues.apache.org/jira/browse/SPARK-16808
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> The root of the history server is rendered dynamically with javascript, and 
> this doesn't honor APPLICATION_WEB_PROXY_BASE: 
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/historypage-template.html#L67
> Other links in the history server do honor it: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L146
> This means the links on the history server root page are broken when deployed 
> behind a proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627702#comment-15627702
 ] 

Xiao Li commented on SPARK-18209:
-

True.

{code}
  Seq((1, (1, 1))).toDF().createTempView("temp_jt")
  sql("CREATE VIEW jtv1 AS SELECT * FROM temp_jt")
{code}

This will fail with an ugly error message:
{code}
Failed to analyze the canonicalized SQL: SELECT `gen_attr_0` AS `_1`, 
`gen_attr_1` AS `_2` FROM (SELECT `gen_attr_0`, `gen_attr_1` FROM (VALUES (1, 
[0,1,1]) AS gen_subquery_0(gen_attr_0, gen_attr_1)) AS temp_jt) AS temp_jt
{code}

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL

2016-11-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627695#comment-15627695
 ] 

Hyukjin Kwon edited comment on SPARK-4549 at 11/2/16 4:41 AM:
--

Could we maybe close this for now, unless someone can explain when it is needed?


was (Author: hyukjin.kwon):
Could we maybe close this for now if no one can't explain when it is needed?

> Support BigInt -> Decimal in convertToCatalyst in SparkSQL
> --
>
> Key: SPARK-4549
> URL: https://issues.apache.org/jira/browse/SPARK-4549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>Priority: Minor
>
> Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to 
> Decimal.
> Jianshi
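
A small sketch of the requested conversion (illustrative only, not the actual convertToCatalyst change):

{code}
// Hedged sketch: a scala.math.BigInt can be widened to Spark SQL's Decimal by going
// through BigDecimal, which is what the ticket asks convertToCatalyst to do.
import org.apache.spark.sql.types.Decimal

def bigIntToDecimal(v: BigInt): Decimal = Decimal(BigDecimal(v))

val d = bigIntToDecimal(BigInt("123456789012345678901234567890"))
{code}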



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL

2016-11-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627695#comment-15627695
 ] 

Hyukjin Kwon commented on SPARK-4549:
-

Could we maybe close this for now if no one can explain when it is needed?

> Support BigInt -> Decimal in convertToCatalyst in SparkSQL
> --
>
> Key: SPARK-4549
> URL: https://issues.apache.org/jira/browse/SPARK-4549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>Priority: Minor
>
> Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to 
> Decimal.
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627698#comment-15627698
 ] 

Reynold Xin commented on SPARK-17967:
-

+1 on json arrays.

> Support for list or other types as an option for datasources
> 
>
> Key: SPARK-17967
> URL: https://issues.apache.org/jira/browse/SPARK-17967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Hyukjin Kwon
>
> This was discussed in SPARK-17878
> For other datasources, a string/long/boolean/double value seems adequate as an 
> option, but it does not seem to be enough for a datasource such as CSV. As this 
> is an interface for other external datasources, I guess it'd affect several of 
> them out there.
> I took a first look, but it seems it'd be difficult to support this (it would 
> need a lot of changes).
> One suggestion is to support this as a JSON array.
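
A hedged sketch of the JSON-array idea for a CSV read (assuming a SparkSession {{spark}}; the {{nullValues}} option name is hypothetical and used only to illustrate the shape):

{code}
// Hedged sketch: options are flat strings today, so one way to pass a list is to encode
// it as a JSON array string and have the data source parse it back into values.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValues", """["NA", "N/A", ""]""")   // hypothetical option, parsed as a JSON array
  .load("/data/input.csv")
{code}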



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627679#comment-15627679
 ] 

Reynold Xin commented on SPARK-18209:
-

Yes, both global and local temp views. The issue with temp views is more than 
just dependencies. Not all temp views support SQL generation, e.g. the ones that 
are created from DataFrames.


> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627673#comment-15627673
 ] 

Xiao Li commented on SPARK-18209:
-

Without SQL expansion, we also need to block global temp view usage in 
persistent view definitions, right?

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627664#comment-15627664
 ] 

Reynold Xin commented on SPARK-18209:
-

Yup - we will have to disallow it, for good reasons actually.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627662#comment-15627662
 ] 

Xiao Li commented on SPARK-18209:
-

Yeah, I just posted the example to show it. I mentioned [~vssrinath] in the 
comment. : )

Regarding the temp view, unfortunately, we already allow users to reference 
temp views in perm view creation. : (

{code}
  sql("CREATE TEMPORARY VIEW temp_jt AS SELECT * FROM jt WHERE id > 0")
  sql("CREATE VIEW jtv1 AS SELECT * FROM temp_jt WHERE id > 3")
  sql("CREATE VIEW jtv2 AS SELECT * FROM jtv1 WHERE id < 6")
  sql("DESC FORMATTED jtv1").show(50, false)
  sql("DESC FORMATTED jtv2").show(50, false)
{code}

{code}
|View Expanded Text: |SELECT `gen_attr_0` AS `id`, `gen_attr_1` AS 
`id1` FROM (SELECT `gen_attr_0`, `gen_attr_1` FROM (SELECT `gen_attr_2` AS 
`gen_attr_0`, `gen_attr_3` AS `gen_attr_1` FROM (SELECT `gen_attr_2`, 
`gen_attr_3` FROM (SELECT `gen_attr_2`, `gen_attr_3` FROM (SELECT `gen_attr_4` 
AS `gen_attr_2`, `gen_attr_5` AS `gen_attr_3` FROM (SELECT `id` AS 
`gen_attr_4`, `id1` AS `gen_attr_5` FROM `default`.`jt`) AS gen_subquery_0) AS 
gen_subquery_0 WHERE (`gen_attr_2` > CAST(0 AS BIGINT))) AS temp_jt WHERE 
(`gen_attr_2` > CAST(3 AS BIGINT))) AS temp_jt) AS jtv1 WHERE (`gen_attr_0` < 
CAST(6 AS BIGINT))) AS jtv1|   |
{code}

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, and just wrap the 
> original SQL with an outer SELECT clause, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18133) Python ML Pipeline Example has syntax errors

2016-11-01 Thread Nirmal Fernando (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627648#comment-15627648
 ] 

Nirmal Fernando commented on SPARK-18133:
-

Thanks All.

> Python ML Pipeline Example has syntax errors
> 
>
> Key: SPARK-18133
> URL: https://issues.apache.org/jira/browse/SPARK-18133
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Nirmal Fernando
>Assignee: Jagadeesan A S
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.0
>
>
> $ ./bin/spark-submit examples/src/main/python/ml/pipeline_example.py
>   File 
> "/spark-2.0.0-bin-hadoop2.7/examples/src/main/python/ml/pipeline_example.py", 
> line 38
> (0L, "a b c d e spark", 1.0),
>   ^
> SyntaxError: invalid syntax
> Removing 'L' from all occurrences resolves the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18216) Make Column.expr public

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18216.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Make Column.expr public
> ---
>
> Key: SPARK-18216
> URL: https://issues.apache.org/jira/browse/SPARK-18216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.
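
A tiny usage sketch of what opening it up enables (the column name is illustrative):

{code}
// Hedged sketch: with expr public, the Catalyst expression behind a Column can be
// inspected directly when debugging, much like queryExecution on a Dataset.
import org.apache.spark.sql.functions.col

val discounted = col("price") * 0.9
println(discounted.expr)   // prints the underlying unresolved expression tree
{code}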



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18133) Python ML Pipeline Example has syntax errors

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627633#comment-15627633
 ] 

Apache Spark commented on SPARK-18133:
--

User 'jagadeesanas2' has created a pull request for this issue:
https://github.com/apache/spark/pull/15728

> Python ML Pipeline Example has syntax errors
> 
>
> Key: SPARK-18133
> URL: https://issues.apache.org/jira/browse/SPARK-18133
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Nirmal Fernando
>Assignee: Jagadeesan A S
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.0
>
>
> $ ./bin/spark-submit examples/src/main/python/ml/pipeline_example.py
>   File 
> "/spark-2.0.0-bin-hadoop2.7/examples/src/main/python/ml/pipeline_example.py", 
> line 38
> (0L, "a b c d e spark", 1.0),
>   ^
> SyntaxError: invalid syntax
> Removing 'L' from all occurrences resolves the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627628#comment-15627628
 ] 

Reynold Xin commented on SPARK-18209:
-

That's what [~vssrinath] pointed out, isn't it?

Good point about temp views. I think permanent views shouldn't be allowed to 
reference temp views. Otherwise it really makes no sense.
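
For illustration, here is a minimal spark-shell sketch (with hypothetical view 
names, not taken from this discussion) of the case such a restriction would 
reject:

{code}
// A session-scoped temporary view.
sql("CREATE TEMPORARY VIEW tmp_events AS SELECT 1 AS id")

// Under the proposed restriction this CREATE VIEW would fail at creation time,
// because the permanent view would otherwise capture a name that exists only
// in the current session.
sql("CREATE VIEW perm_events AS SELECT * FROM tmp_events")
{code}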

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17816) JSON serialization of accumulators is failing with ConcurrentModificationException

2016-11-01 Thread Jonathan Alvarado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627622#comment-15627622
 ] 

Jonathan Alvarado commented on SPARK-17816:
---

The title of this bug says that "accumulators" are failing with 
ConcurrentModificationException. However, the stack trace shows the issue 
arising from the event log reporting, which I understand to be related to 
event logging for the UI. I'm hitting this error during my job, and I need 
accumulators to work correctly for my Spark job to operate properly. Can I 
assume that I can disregard this error and that I might just lose some logging 
for the UI, or are the accumulators not working correctly?


> JSON serialization of accumulators is failing with 
> ConcurrentModificationException
> --
>
> Key: SPARK-17816
> URL: https://issues.apache.org/jira/browse/SPARK-17816
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ergin Seyfe
>Assignee: Ergin Seyfe
> Fix For: 2.0.2, 2.1.0
>
>
> This is the stack trace; note the {{ConcurrentModificationException}}:
> {code}
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
> at java.util.ArrayList$Itr.next(ArrayList.java:851)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
> at scala.collection.AbstractTraversable.to(Traversable.scala:104)
> at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
> at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
> at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at 
> 

[jira] [Comment Edited] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627595#comment-15627595
 ] 

Xiao Li edited comment on SPARK-18209 at 11/2/16 4:00 AM:
--

If we do not qualify the table/persistent view names that are used in the view 
definition, we could refer to a temp view at query time. 


was (Author: smilegator):
If we do not qualify the table/persistent view name when we create a view, we 
could refer to a temp view. 

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627599#comment-15627599
 ] 

Xiao Li commented on SPARK-18209:
-

Please hold on until we finalize the design.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17895:


Assignee: (was: Apache Spark)

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
>  we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used in "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the description of "rangeBetween" and 
> "rowsBetween" like
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17895:


Assignee: Apache Spark

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Assignee: Apache Spark
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
>  we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used in "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the description of "rangeBetween" and 
> "rowsBetween" like
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627597#comment-15627597
 ] 

Apache Spark commented on SPARK-17895:
--

User 'david-weiluo-ren' has created a pull request for this issue:
https://github.com/apache/spark/pull/15727

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
>  we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used in "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the description of "rangeBetween" and 
> "rowsBetween" like
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627595#comment-15627595
 ] 

Xiao Li commented on SPARK-18209:
-

If we do not qualify the table/persistent view name when we create a view, we 
could refer to a temp view. 
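
For illustration, a minimal spark-shell sketch of that concern (hypothetical 
names, assuming the stored view text keeps the unqualified table name):

{code}
sql("CREATE TABLE t (id INT)")
// The stored view definition refers to the unqualified name "t".
sql("CREATE VIEW v AS SELECT * FROM t")

// A temporary view created later with the same unqualified name shadows the
// table, so resolving v at query time could silently bind to the temp view.
sql("CREATE TEMPORARY VIEW t AS SELECT 100 AS id")
sql("SELECT * FROM v").show()
{code}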

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627588#comment-15627588
 ] 

Xiao Li commented on SPARK-18209:
-

{code}
  sql("CREATE VIEW jtv1 AS SELECT * FROM jt WHERE id > 3")
  sql("CREATE VIEW jtv2 AS SELECT * FROM jtv1 WHERE id < 6")
  sql("DESC FORMATTED jtv1").show(50, false)
  sql("DESC FORMATTED jtv2").show(50, false)
{code}

You can see the expanded view text of {{jtv2}} is
{code}
SELECT `gen_attr_0` AS `id`, `gen_attr_1` AS `id1` FROM (SELECT `gen_attr_0`, 
`gen_attr_1` FROM (SELECT `gen_attr_2` AS `gen_attr_0`, `gen_attr_3` AS 
`gen_attr_1` FROM (SELECT `gen_attr_2`, `gen_attr_3` FROM (SELECT `gen_attr_4` 
AS `gen_attr_2`, `gen_attr_5` AS `gen_attr_3` FROM (SELECT `id` AS 
`gen_attr_4`, `id1` AS `gen_attr_5` FROM `default`.`jt`) AS gen_subquery_0) AS 
gen_subquery_0 WHERE (`gen_attr_2` > CAST(3 AS BIGINT))) AS jt) AS jtv1 WHERE 
(`gen_attr_0` < CAST(6 AS BIGINT))) AS jtv1
{code}

When we query {{jtv2}}, we are querying the original table {{jt}}. As 
[~vssrinath] said, without SQL expansion, we might not be able to query the 
original table. Even if the view {{jtv1}} is not dropped or changed, we could 
still hit an issue if a temporary view with the same name {{jtv1}} is created 
after {{jtv2}} is created. 
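
A minimal sketch of that scenario, assuming the same {{jt}}/{{jtv1}}/{{jtv2}} 
setup as above (illustrative only):

{code}
// After jtv2 has been created on top of jtv1, a temporary view reuses the name.
sql("CREATE TEMPORARY VIEW jtv1 AS SELECT 0 AS id, 0 AS id1")

// Without the fully expanded SQL, resolving jtv2 at query time could now bind
// to the temporary jtv1 instead of the persistent view, changing the result.
sql("SELECT * FROM jtv2").show()
{code}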

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18198) Highlight code snippets for Streaming integration docs

2016-11-01 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-18198:
--
Component/s: (was: SQL)
 Structured Streaming

> Highlight code snippets for Streaming integration docs
> --
>
> Key: SPARK-18198
> URL: https://issues.apache.org/jira/browse/SPARK-18198
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, DStreams, Structured Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liwei Lin
>Priority: Minor
>
> We should use {% highlight lang %} {% endhighlight %} to highlight code 
> snippets in the Structured Streaming Kafka010 integration doc and the Spark 
> Streaming Kafka010 integration doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16545) Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval

2016-11-01 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627554#comment-15627554
 ] 

Liwei Lin commented on SPARK-16545:
---

hi [~mariobriggs], per discussion on the PR, would you mind closing this? 

> Structured Streaming : foreachSink creates the Physical Plan multiple times 
> per TriggerInterval 
> 
>
> Key: SPARK-16545
> URL: https://issues.apache.org/jira/browse/SPARK-16545
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Mario Briggs
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627518#comment-15627518
 ] 

Dongjoon Hyun commented on SPARK-18209:
---

Thank you. Now I understand. I have been wondering why all the committers 
suddenly lost interest in the issue for such a long time.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627518#comment-15627518
 ] 

Dongjoon Hyun edited comment on SPARK-18209 at 11/2/16 3:21 AM:


Thank you. Now I understand. I had been wondering why all the committers 
suddenly lost interest in the issue for such a long time.


was (Author: dongjoon):
Thank you. Now I understand. I have been wondering why all the committers 
suddenly lost interest in the issue for such a long time.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627502#comment-15627502
 ] 

Reynold Xin commented on SPARK-18209:
-

The PR is still useful -- the only thing that made it very difficult to merge 
was the SQL generation. Once we don't need SQL generation (this ticket), it'd 
be much easier to merge that one.

As a matter of fact, your PR got reviewed almost immediately after it was 
created and was almost merged by other committers, but then I stopped it 
because SQL generation would have been broken by the initial set of changes. 
With SQL generation the PR became more complicated to reason about and review.


> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627497#comment-15627497
 ] 

Apache Spark commented on SPARK-18206:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15671

> Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
> ---
>
> Key: SPARK-18206
> URL: https://issues.apache.org/jira/browse/SPARK-18206
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: zhengruifeng
>Priority: Minor
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18206:


Assignee: zhengruifeng  (was: Apache Spark)

> Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
> ---
>
> Key: SPARK-18206
> URL: https://issues.apache.org/jira/browse/SPARK-18206
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: zhengruifeng
>Priority: Minor
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18206:


Assignee: Apache Spark  (was: zhengruifeng)

> Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
> ---
>
> Key: SPARK-18206
> URL: https://issues.apache.org/jira/browse/SPARK-18206
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627426#comment-15627426
 ] 

Apache Spark commented on SPARK-18107:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15726

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> I find that an insert overwrite statement run in spark-sql or spark-shell takes 
> much more time than it does in the hive client (started from 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes, while the hive 
> client takes less than 20 seconds.
> These are the steps I took.
> Test sql is :
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> there are 257128 lines of data in tbllog_login with 
> partition(pt='mix_en',dt='2016-10-21')
> PS:
> I'm sure it must be the "insert overwrite" that costs a lot of time in Spark; 
> maybe when doing the overwrite it needs to spend a lot of time on I/O or on 
> something else.
> I also compare the executing time between insert overwrite statement and 
> insert into statement.
> 1. insert overwrite statement and insert into statement in spark:
> insert overwrite statement costs about 10 minutes
> insert into statement costs about 30 seconds
> 2. insert into statement in spark and insert into statement in hive-client:
> spark costs about 30 seconds
> hive-client costs about 20 seconds
> the difference is small enough that we can ignore it
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file

2016-11-01 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627421#comment-15627421
 ] 

Burak Yavuz commented on SPARK-17879:
-

We should be doing the second. What you said makes sense; we can close this.

> Don't compact metadata logs constantly into a single compacted file
> ---
>
> Key: SPARK-17879
> URL: https://issues.apache.org/jira/browse/SPARK-17879
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>
> With metadata log compaction, we compact all files into a single file every 
> "n" batches. The problem is, over time, this single file becomes huge, and 
> could become an issue to constantly write out in the driver.
> It would be a good idea to cap the compacted file size, so that we don't end 
> up writing huge files in the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627407#comment-15627407
 ] 

Dongjoon Hyun commented on SPARK-18209:
---

BTW, [~rxin].

I didn't notice that the following was the main reason. The truth is that the 
PR didn't get any feedback for a long time.

> This is the main reason broadcast join hint has taken forever to be merged 
> because it is very difficult to guarantee correctness.

I can close the PR if you want since I don't want to be a bottleneck for the 
community. Otherwise, you can override the PR with the superior implementation 
if you want at any time.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-11-01 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627406#comment-15627406
 ] 

Franck Tago commented on SPARK-17982:
-

I wanted to mention that I was able to successfully verify my cases with the 
changes made under this request.

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>Priority: Blocker
>
> The following statement fails in the spark shell:
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627367#comment-15627367
 ] 

Jiang Xingbo commented on SPARK-18209:
--

I'm working on this, thanks!

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can do this through 
> a simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. This 
> exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627343#comment-15627343
 ] 

Apache Spark commented on SPARK-18167:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15725

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message: java.lang.reflect.InvocationTargetException: null
> Stacktrace:
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>   at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
>   at org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at ...

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17937:
-
Component/s: Structured Streaming

> Clarify Kafka offset semantics for Structured Streaming
> ---
>
> Key: SPARK-17937
> URL: https://issues.apache.org/jira/browse/SPARK-17937
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Possible events for which offsets are needed:
> # New partition is discovered
> # Offset out of range (aka, data has been lost).   It's possible to separate 
> this into offset too small and offset too large, but I'm not sure it matters 
> for us.
> Possible sources of offsets:
> # *Earliest* position in log
> # *Latest* position in log
> # *Fail* and kill the query
> # *Checkpoint* position
> # *User specified* per topicpartition
> # *Kafka commit log*.  Currently unsupported.  This means users who want to 
> migrate from existing Kafka jobs need to jump through hoops.  Even if we 
> never want to support it, as soon as we take on SPARK-17815 we need to make 
> sure Kafka commit log state is clearly documented and handled.
> # *Timestamp*.  Currently unsupported.  This could be supported with the old, 
> inaccurate Kafka time API, or the upcoming time index.
> # *X offsets* before or after latest / earliest position.  Currently 
> unsupported.  I think the semantics of this are super unclear by comparison 
> with timestamp, given that Kafka doesn't have a single range of offsets.
> Currently allowed pre-query configuration; all "ORs" are exclusive:
> # startingOffsets: *earliest* OR *latest* OR *User specified* json per 
> topicpartition  (SPARK-17812)
> # failOnDataLoss: true (which implies *Fail* above) OR false (which implies 
> *Earliest* above).  In general, I see no reason this couldn't specify Latest 
> as an option.
> Possible lifecycle times in which an offset-related event may happen:
> # At initial query start
> #* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
> startingOffsets is *User specified* per topicpartition, and the new partition 
> isn't in the map, *Fail*.  Note that this is effectively indistinguishable 
> from a new partition during query, because partitions may have changed in 
> between pre-query configuration and query start, but we treat it differently, 
> and users in this case are SOL.
> #* Offset out of range on driver: We don't technically have behavior for this 
> case yet.  We could use the value of failOnDataLoss, but it's possible people 
> may want to know at startup that something was wrong, even if they're ok with 
> earliest for a during-query out-of-range event.
> #* Offset out of range on executor: seems like it should be *Fail* or 
> *Earliest*, based on failOnDataLoss, but it looks like this setting is 
> currently ignored, and the executor will just fail...
> # During query
> #* New partition:  *Earliest*, only.  This seems to be by fiat; I see no 
> reason it can't be configurable.
> #* Offset out of range on driver:  this _probably_ doesn't happen, because 
> we're doing explicit seeks to the latest position
> #* Offset out of range on executor:  ?
> # At query restart 
> #* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason 
> this couldn't be a configurable fallback to *Latest*.
> #* Offset out of range on driver:   this _probably_ doesn't happen, because 
> we're doing explicit seeks to the specified position
> #* Offset out of range on executor:  ?
> I've probably missed something, chime in.
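
For reference, a minimal sketch of how the pre-query configuration above looks as DataStreamReader options, assuming the Kafka 0.10 source and placeholder broker/topic names:

{code}
// Sketch only: option values below are placeholders.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
  .option("subscribe", "topicFoo,topicBar")
  // "earliest" | "latest" | per-topicpartition JSON (SPARK-17812)
  .option("startingOffsets", "earliest")
  // true => Fail on out-of-range offsets; false => fall back to Earliest
  .option("failOnDataLoss", "false")
  .load()
{code}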



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18057:
-
Component/s: Structured Streaming

> Update structured streaming kafka from 10.0.1 to 10.1.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17343) Prerequisites for Kafka 0.8 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17343:
-
Component/s: (was: DStreams)
 Structured Streaming

> Prerequisites for Kafka 0.8 support in Structured Streaming
> ---
>
> Key: SPARK-17343
> URL: https://issues.apache.org/jira/browse/SPARK-17343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> This issue covers any API changes, refactoring, and utility classes/methods 
> that are necessary to make it possible to implement support for Kafka 0.8 
> sources and sinks in Structured Streaming.
> From a quick glance, it looks like some refactoring of the existing state 
> storage mechanism in the Kafka 0.8 DStream may suffice. But there might be 
> some additional groundwork in other areas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17837) Disaster recovery of offsets from WAL

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17837:
-
Component/s: Structured Streaming

> Disaster recovery of offsets from WAL
> -
>
> Key: SPARK-17837
> URL: https://issues.apache.org/jira/browse/SPARK-17837
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> "The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
> As reynold suggests though, we should change this to use a less opaque 
> format."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17834:
-
Component/s: (was: SQL)
 Structured Streaming

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17346) Kafka 0.10 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17346:
-
Component/s: (was: DStreams)
 Structured Streaming

> Kafka 0.10 support in Structured Streaming
> --
>
> Key: SPARK-17346
> URL: https://issues.apache.org/jira/browse/SPARK-17346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> Implement Kafka 0.10-based sources and sinks for Structured Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17345) Prerequisites for Kafka 0.10 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17345:
-
Component/s: (was: DStreams)
 Structured Streaming

> Prerequisites for Kafka 0.10 support in Structured Streaming
> 
>
> Key: SPARK-17345
> URL: https://issues.apache.org/jira/browse/SPARK-17345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> This issue covers any API changes, refactoring, and utility classes/methods 
> that are necessary to make it possible to implement support for Kafka 0.10 
> sources and sinks in Structured Streaming.
> At a minimum, the changes in SPARK-16963 will be needed in order for the 
> Kafka commit protocol to work. Given that KIP-33 ("Add a time based log 
> index") is not yet in place, it may be necessary to make additional API 
> changes in Spark for commit to work efficiently. Some refactoring of the 
> existing KafkaRDD class is probably in order also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18201) add toDense and toSparse into Matrix trait, like Vector design

2016-11-01 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu closed SPARK-18201.
--
Resolution: Duplicate

It will be fixed by this PR: https://github.com/apache/spark/pull/15628

> add toDense and toSparse into Matrix trait, like Vector design
> --
>
> Key: SPARK-18201
> URL: https://issues.apache.org/jira/browse/SPARK-18201
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add toDense and toSparse to the Matrix trait, following the Vector design,
> so that when we have a Matrix object `matrix: Matrix`, whether dense or sparse,
> we can call `matrix.toDense` to get a DenseMatrix
> and call `matrix.toSparse` to get a SparseMatrix.
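
A rough sketch of the requested API; the Vector calls below are the existing design being mirrored, while the Matrix calls are the proposal and do not exist on the trait yet:

{code}
import org.apache.spark.ml.linalg.{Matrices, Matrix, Vectors}

// Existing Vector design:
val v = Vectors.dense(1.0, 0.0, 3.0)
val sv = v.toSparse                    // SparseVector, no pattern matching needed

// Proposed Matrix equivalent (hypothetical until added to the Matrix trait):
val m: Matrix = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 4.0))
// val dm = m.toDense                  // would return a DenseMatrix
// val sm = m.toSparse                 // would return a SparseMatrix
{code}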



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17815) Report committed offsets

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17815:
-
Component/s: (was: SQL)
 Structured Streaming

> Report committed offsets
> 
>
> Key: SPARK-17815
> URL: https://issues.apache.org/jira/browse/SPARK-17815
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.
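
As a rough illustration only (using the plain Kafka 0.10 consumer API, not Spark internals) of what reporting back could look like once the user supplies a group.id; the broker address and offset map are assumptions:

{code}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition

// Sketch: commit the offsets a finished batch has processed under the
// user-supplied group.id, so external lag-monitoring tools can see progress.
def reportProgress(groupId: String, bootstrapServers: String,
                   processed: Map[TopicPartition, Long]): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", bootstrapServers)
  props.put("group.id", groupId)
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try {
    consumer.assign(processed.keys.toSeq.asJava)
    val toCommit = processed.map { case (tp, o) => tp -> new OffsetAndMetadata(o) }
    consumer.commitSync(toCommit.asJava)
  } finally {
    consumer.close()
  }
}
{code}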



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17812) More granular control of starting offsets (assign)

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17812:
-
Component/s: (was: SQL)
 Structured Streaming

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
> Fix For: 2.0.2, 2.1.0
>
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions
> currently agreed on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17813) Maximum data per trigger

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17813:
-
Component/s: (was: SQL)
 Structured Streaming

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
> Fix For: 2.0.2, 2.1.0
>
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.
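
A sketch of what such an option could look like for the Kafka source; the option name below is hypothetical, chosen only by analogy with {{maxFilesPerTrigger}}:

{code}
// Sketch only: "maxOffsetsPerTrigger" is a hypothetical option name here,
// capping how many records a single micro-batch may pull from Kafka.
val limited = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "events")                       // placeholder topic
  .option("maxOffsetsPerTrigger", "10000")
  .load()
{code}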



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17344:
-
Component/s: (was: DStreams)
 Structured Streaming

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17345) Prerequisites for Kafka 0.10 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-17345.

Resolution: Fixed

> Prerequisites for Kafka 0.10 support in Structured Streaming
> 
>
> Key: SPARK-17345
> URL: https://issues.apache.org/jira/browse/SPARK-17345
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Reporter: Frederick Reiss
>
> This issue covers any API changes, refactoring, and utility classes/methods 
> that are necessary to make it possible to implement support for Kafka 0.10 
> sources and sinks in Structured Streaming.
> At a minimum, the changes in SPARK-16963 will be needed in order for the 
> Kafka commit protocol to work. Given that KIP-33 ("Add a time based log 
> index") is not yet in place, it may be necessary to make additional API 
> changes in Spark for commit to work efficiently. Some refactoring of the 
> existing KafkaRDD class is probably in order also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15406:
-
Component/s: Structured Streaming

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description ==
> Structured Streaming doesn't have support for Kafka yet.  I personally feel 
> like time-based indexing would make for a much better interface, but it's 
> been pushed back to Kafka 0.10.1:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17183) put hive serde table schema to table properties like data source table

2016-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17183:
-
Priority: Blocker  (was: Major)

> put hive serde table schema to table properties like data source table
> --
>
> Key: SPARK-17183
> URL: https://issues.apache.org/jira/browse/SPARK-17183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18025) Port streaming to use the commit protocol API

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18025.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15710
[https://github.com/apache/spark/pull/15710]

> Port streaming to use the commit protocol API
> -
>
> Key: SPARK-18025
> URL: https://issues.apache.org/jira/browse/SPARK-18025
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Reporter: Reynold Xin
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18192) Support all file formats in structured streaming

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18192:

Component/s: Structured Streaming

> Support all file formats in structured streaming
> 
>
> Key: SPARK-18192
> URL: https://issues.apache.org/jira/browse/SPARK-18192
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18025) Port streaming to use the commit protocol API

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18025:
-
Component/s: Structured Streaming

> Port streaming to use the commit protocol API
> -
>
> Key: SPARK-18025
> URL: https://issues.apache.org/jira/browse/SPARK-18025
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18215) Make Column.expr public

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-18215.
---
  Resolution: Duplicate
Target Version/s:   (was: 2.1.0)

> Make Column.expr public
> ---
>
> Key: SPARK-18215
> URL: https://issues.apache.org/jira/browse/SPARK-18215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.
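
A small sketch of the debugging use case, assuming the field is made public:

{code}
import org.apache.spark.sql.functions.col

// Sketch: with Column.expr public, the Catalyst expression behind a Column
// can be inspected directly while debugging, without going through
// df.queryExecution.
val c = col("a") + 1
println(c.expr)   // prints the unresolved Catalyst expression tree
{code}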



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18216) Make Column.expr public

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-18216:
---

Assignee: Reynold Xin

> Make Column.expr public
> ---
>
> Key: SPARK-18216
> URL: https://issues.apache.org/jira/browse/SPARK-18216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18215) Make Column.expr public

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-18215:
---

Assignee: Reynold Xin

> Make Column.expr public
> ---
>
> Key: SPARK-18215
> URL: https://issues.apache.org/jira/browse/SPARK-18215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18216) Make Column.expr public

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18216:


Assignee: (was: Apache Spark)

> Make Column.expr public
> ---
>
> Key: SPARK-18216
> URL: https://issues.apache.org/jira/browse/SPARK-18216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18216) Make Column.expr public

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18216:


Assignee: Apache Spark

> Make Column.expr public
> ---
>
> Key: SPARK-18216
> URL: https://issues.apache.org/jira/browse/SPARK-18216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18216) Make Column.expr public

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627220#comment-15627220
 ] 

Apache Spark commented on SPARK-18216:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15724

> Make Column.expr public
> ---
>
> Key: SPARK-18216
> URL: https://issues.apache.org/jira/browse/SPARK-18216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Column.expr is private[sql], but it's actually a really useful field to have 
> for debugging. We should open it up, similar to how we use QueryExecution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18215) Make Column.expr public

2016-11-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18215:
---

 Summary: Make Column.expr public
 Key: SPARK-18215
 URL: https://issues.apache.org/jira/browse/SPARK-18215
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Column.expr is private[sql], but it's actually a really useful field to have 
for debugging. We should open it up, similar to how we use QueryExecution.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18216) Make Column.expr public

2016-11-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18216:
---

 Summary: Make Column.expr public
 Key: SPARK-18216
 URL: https://issues.apache.org/jira/browse/SPARK-18216
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Column.expr is private[sql], but it's actually a really useful field to have 
for debugging. We should open it up, similar to how we use QueryExecution.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18214) Simplify RuntimeReplaceable type coercion

2016-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627202#comment-15627202
 ] 

Apache Spark commented on SPARK-18214:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15723

> Simplify RuntimeReplaceable type coercion
> -
>
> Key: SPARK-18214
> URL: https://issues.apache.org/jira/browse/SPARK-18214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> RuntimeReplaceable is used to create aliases for expressions, but the way it 
> deals with type coercion is pretty weird (each expression is responsible for 
> how to handle type coercion, which does not obey the normal implicit type 
> cast rules).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18214) Simplify RuntimeReplaceable type coercion

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18214:


Assignee: Reynold Xin  (was: Apache Spark)

> Simplify RuntimeReplaceable type coercion
> -
>
> Key: SPARK-18214
> URL: https://issues.apache.org/jira/browse/SPARK-18214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> RuntimeReplaceable is used to create aliases for expressions, but the way it 
> deals with type coercion is pretty weird (each expression is responsible for 
> how to handle type coercion, which does not obey the normal implicit type 
> cast rules).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18214) Simplify RuntimeReplaceable type coercion

2016-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18214:


Assignee: Apache Spark  (was: Reynold Xin)

> Simplify RuntimeReplaceable type coercion
> -
>
> Key: SPARK-18214
> URL: https://issues.apache.org/jira/browse/SPARK-18214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> RuntimeReplaceable is used to create aliases for expressions, but the way it 
> deals with type coercion is pretty weird (each expression is responsible for 
> how to handle type coercion, which does not obey the normal implicit type 
> cast rules).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18214) Simplify RuntimeReplaceable type coercion

2016-11-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18214:
---

 Summary: Simplify RuntimeReplaceable type coercion
 Key: SPARK-18214
 URL: https://issues.apache.org/jira/browse/SPARK-18214
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


RuntimeReplaceable is used to create aliases for expressions, but the way it 
deals with type coercion is pretty weird (each expression is responsible for 
how to handle type coercion, which does not obey the normal implicit type cast 
rules).
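
A schematic sketch of the pattern, using simplified stand-in types rather than the actual Catalyst classes:

{code}
// Simplified stand-ins, not the real Catalyst API: a RuntimeReplaceable-style
// alias is rewritten to a canonical expression before execution, so only the
// replacement needs eval/codegen. The question in this ticket is where the
// replacement's type coercion should be applied.
trait Expr
case class Literal(value: Any) extends Expr
case class Coalesce(children: Seq[Expr]) extends Expr

// "Nvl" exists only as a user-facing alias; the analyzer substitutes its
// replacement and should then apply the normal implicit cast rules to Coalesce,
// rather than each alias hand-rolling its own coercion.
case class Nvl(left: Expr, right: Expr) extends Expr {
  def replacement: Expr = Coalesce(Seq(left, right))
}

val rewritten = Nvl(Literal(null), Literal(0)).replacement   // Coalesce(Seq(...))
{code}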




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-11-01 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626664#comment-15626664
 ] 

Don Drake edited comment on SPARK-16845 at 11/2/16 12:32 AM:
-

I've been struggling to duplicate this and finally came up with a strategy that 
duplicates it in a spark-shell.  It's the combination of a wide dataset with 
nested (array) structures and performing a union that seems to trigger it.

I opened SPARK-18207.


was (Author: dondrake):
I've been struggling to duplicate this and finally came up with a strategy that 
duplicates it in a spark-shell.  It's a combination of a wide dataset with 
nested (array) structures and performing a union that seem to trigger it.

I'll open a new JIRA.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the following fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15581:
--
Fix Version/s: 2.1.0

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity 
> with the Scala/Java API. You can find a [complete list here| 
> 

[jira] [Closed] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-15581.
-
Resolution: Done

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity 
> with the Scala/Java API. You can find a [complete list here| 
> 

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627180#comment-15627180
 ] 

Joseph K. Bradley commented on SPARK-15581:
---

Well, the 2.1 code freeze has come up quickly, though we did make a fair amount 
of progress on parity and cleanups.  I'm going to close this, though I want to 
copy some over for 2.2.

Apart from more limited dev time during 2.1 than usual, I do think active ML 
committers, myself included, could do a better job of messaging what we are 
going to focus on.  That will likely mean:
* Initializing the roadmap JIRA with fewer tasks.  Perhaps committers who add 
tasks can preemptively mark themselves as shepherds, i.e., willing to spend 
review cycles on those tasks.
* Maintaining the roadmap.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and 

[jira] [Updated] (SPARK-16578) Configurable hostname for RBackend

2016-11-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16578:
--
Target Version/s: 2.2.0  (was: 2.1.0)

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what the 
> pros and cons of it are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16411) Add textFile API to structured streaming.

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-16411.
--
   Resolution: Fixed
 Assignee: Prashant Sharma
Fix Version/s: 2.1.0

> Add textFile API to structured streaming.
> -
>
> Key: SPARK-16411
> URL: https://issues.apache.org/jira/browse/SPARK-16411
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15944) Make spark.ml package backward compatible with spark.mllib vectors

2016-11-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15944.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Make spark.ml package backward compatible with spark.mllib vectors
> --
>
> Key: SPARK-15944
> URL: https://issues.apache.org/jira/browse/SPARK-15944
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 2.1.0
>
>
> During QA, we found that it is not trivial to convert a DataFrame with old 
> vector columns to new vector columns. So it would be easier for users to 
> migrate their datasets and pipelines if we:
> 1) provide utils to convert DataFrames with vector columns
> 2) automatically detect and convert old vector columns in ML pipelines
> This is an umbrella JIRA to track the progress.
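
For the first point, a minimal sketch of what such a conversion utility looks like from the user side, assuming a DataFrame {{df}} whose "features" column still uses the old mllib vector type:

{code}
import org.apache.spark.mllib.util.MLUtils

// Sketch: rewrite the old mllib Vector column to the new ml Vector type.
val converted = MLUtils.convertVectorColumnsToML(df, "features")
{code}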



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-11-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-16000.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
> Fix For: 2.1.0
>
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-11-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627151#comment-15627151
 ] 

Joseph K. Bradley commented on SPARK-16000:
---

I just checked through, and the PRs cover all MLWritable types from 1.6.

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
> Fix For: 2.1.0
>
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16738) Queryable state for Spark State Store

2016-11-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627148#comment-15627148
 ] 

Michael Armbrust commented on SPARK-16738:
--

You can already query the state store today, if you write the results of a 
{{complete}} mode streaming query into the {{memory}} sink.  Now, this does 
require collecting all of the results to the driver, so it might not work for 
your use case.  Either way, it would be helpful to understand what you are 
trying to accomplish a little bit better so we can make sure we build the right 
interfaces.
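
A minimal sketch of that workaround, assuming {{events}} is an existing streaming DataFrame and {{spark}} a SparkSession:

{code}
// Sketch: write a complete-mode aggregation to the in-memory sink; the query
// name becomes a driver-side temp view that can be queried with SQL.
val counts = events.groupBy("key").count()
val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("state_snapshot")
  .start()

spark.sql("SELECT * FROM state_snapshot WHERE key = 'foo'").show()
{code}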

> Queryable state for Spark State Store
> -
>
> Key: SPARK-16738
> URL: https://issues.apache.org/jira/browse/SPARK-16738
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Mark Sumner
>  Labels: features
>
> Spark 2.0 will introduce the new State Store to allow state managment outside 
> the RDD model (see: SPARK-13809)
> This proposal seeks to include a mechanism (in a future release) to expose 
> this internal store to external applications for querying.
> This would then make it possible to interact with aggregated state without 
> needing to synchronously write (and read) to/from an external store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18187:
-
Priority: Critical  (was: Major)

> CompactibleFileStreamLog should not rely on "compactInterval" to detect a 
> compaction batch
> --
>
> Key: SPARK-18187
> URL: https://issues.apache.org/jira/browse/SPARK-18187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Priority: Critical
>
> Right now CompactibleFileStreamLog uses compactInterval to check whether a batch 
> is a compaction batch. However, since this conf is controlled by the user, it may 
> change between runs, and CompactibleFileStreamLog will then silently return the 
> wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627134#comment-15627134
 ] 

Michael Armbrust commented on SPARK-18187:
--

I think the configuration should only be used when deciding whether we should 
perform a new compaction.  The identification of a compaction vs. a delta should 
be done based on the file itself.  Today this could be done by looking for the 
{{compact}} suffix.  However, I think this mechanism also has issues, as two 
streams writing to the same log but with different configurations would fail to 
detect that they conflict.

That said, I think fixing the latter issue is going to require us to rev the 
log version.  Since that's not free, we would probably want to see if there are 
other changes we should lump into the new version.  Given that, I'd be okay 
keeping the existing format, looking at file names instead of modular 
arithmetic, and revisiting moving the compaction identifier into the log itself 
(rather than the filename) in a follow-up.
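
A rough sketch of the file-name-based check, assuming the existing layout where 
compaction batches are written with a {{.compact}} suffix and delta batches as 
plain batch ids; this is illustrative, not the actual CompactibleFileStreamLog 
code:

{code}
object CompactionDetection {
  private val CompactSuffix = ".compact"

  // A compaction batch is identified by its file name alone, independent of
  // whatever compactInterval the current run happens to use.
  def isCompactionBatch(fileName: String): Boolean =
    fileName.endsWith(CompactSuffix)

  // Batch id parsing works for both "9.compact" and plain "10".
  def batchId(fileName: String): Long =
    fileName.stripSuffix(CompactSuffix).toLong
}

// By contrast, a check like (batchId % compactInterval == compactInterval - 1)
// silently breaks as soon as the user changes compactInterval.
{code}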

> CompactibleFileStreamLog should not rely on "compactInterval" to detect a 
> compaction batch
> --
>
> Key: SPARK-18187
> URL: https://issues.apache.org/jira/browse/SPARK-18187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> Right now CompactibleFileStreamLog uses compactInterval to check whether a batch 
> is a compaction batch. However, since this conf is controlled by the user, it may 
> change between runs, and CompactibleFileStreamLog will then silently return the 
> wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-11-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16240:
--
Issue Type: Improvement  (was: Bug)

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Improvement
>Reporter: yuhao yang
>Assignee: Gayathri Murali
> Fix For: 2.0.1, 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 
> models because one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
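
A rough sketch of the kind of special-case logic that could go into the loading 
path, assuming the model metadata is stored as JSON with a {{paramMap}} field; 
{{oldName}} and {{newName}} are placeholders, not the actual LDA parameter names:

{code}
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Rewrites a renamed parameter inside the metadata's paramMap before the params
// are applied to the newly constructed model; everything else passes through unchanged.
def remapRenamedParam(metadataJson: String, oldName: String, newName: String): String = {
  val patched = parse(metadataJson).transformField {
    case JField("paramMap", JObject(fields)) =>
      ("paramMap", JObject(fields.map {
        case (`oldName`, value) => (newName, value)
        case other              => other
      }))
  }
  compact(render(patched))
}
{code}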



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16454) Consider adding a per-batch transform for structured streaming

2016-11-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627125#comment-15627125
 ] 

Michael Armbrust commented on SPARK-16454:
--

What specifically is missing from the {{foreach}} sink?

> Consider adding a per-batch transform for structured streaming
> --
>
> Key: SPARK-16454
> URL: https://issues.apache.org/jira/browse/SPARK-16454
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: holdenk
>
> The new structured streaming API lacks the DStream transform functionality 
> (which allowed one to mix in existing RDD transformation logic). It would be 
> useful to be able to do per-batch processing, as was done in the DStream API, 
> even without any specific guarantees about a batch being complete, provided you 
> eventually get called with the "catch up" records.
> This might be useful for implementing Streaming Machine Learning on 
> Structured Streaming.
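
For reference, a small sketch of the DStream-era pattern being described; the 
source, lookup table, and join are illustrative only:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("transform-sketch"), Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)

// A static RDD reused across batches, e.g. a lookup table.
val lookup: RDD[(String, Int)] = ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

// transform() exposes the underlying RDD of each micro-batch, so existing
// RDD-based logic (joins, custom partitioning, model scoring, ...) can be reused.
val enriched = lines
  .map(word => (word, 1))
  .transform(rdd => rdd.join(lookup))

enriched.print()
ssc.start()
ssc.awaitTermination()
{code}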



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's

2016-11-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627120#comment-15627120
 ] 

Tejas Patil commented on SPARK-15867:
-

Yes. I am interested in this support.

> TABLESAMPLE BUCKET semantics don't match Hive's
> ---
>
> Key: SPARK-15867
> URL: https://issues.apache.org/jira/browse/SPARK-15867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Andrew Or
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this would select the 3rd bucket out of every 16 buckets in the table. 
> E.g. if the table was clustered into 32 buckets, then this would sample the 3rd 
> and the 19th bucket. (See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the input rows.
> We should either not support this in Spark or implement it in a way that's 
> consistent with Hive.
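
A small sketch of Hive's bucket-selection arithmetic (1-indexed buckets, assuming 
the table's bucket count is a multiple of the denominator):

{code}
// Which physical buckets does TABLESAMPLE (BUCKET x OUT OF y) read from a
// table clustered into numBuckets buckets?
def sampledBuckets(x: Int, y: Int, numBuckets: Int): Seq[Int] =
  x to numBuckets by y

// sampledBuckets(3, 16, 32) == Seq(3, 19): the 3rd and 19th bucket,
// matching the example above, rather than a 3/16 row sample.
{code}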



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


