[jira] [Commented] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-06 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658
 ] 

zenglinxi commented on SPARK-16408:
---

as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two 
functions in SparkContext.scala:
{code}
def addFile(path: String): Unit = {
  addFile(path, false)
}

def addFile(path: String, recursive: Boolean): Unit = {
  ...
}
{code}
But there is no configuration to turn recursive on or off, and Spark SQL always 
calls addFile(path), so recursive defaults to false; this is why we get the 
exception.
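For illustration, here is a minimal sketch of the workaround when you control the 
SparkContext directly (the path is the example one from this ticket; whether the 
SQL ADD FILE command should expose the recursive flag is the open question here):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object AddDirectoryExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("add-dir-example").setMaster("local[*]"))
    // addFile(path) delegates to addFile(path, recursive = false) and rejects
    // directories; passing recursive = true distributes the directory contents
    // instead of throwing the SparkException above.
    sc.addFile("hdfs://xxx/user/test", true)
    sc.stop()
  }
}
{code}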

> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when using spark-sql to execute SQL like:
> {quote}
> add file hdfs://xxx/user/test;
> {quote}
> if the HDFS path (hdfs://xxx/user/test) is a directory, we will get an 
> exception like:
> {quote}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14743) Improve delegation token handling in secure clusters

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365656#comment-15365656
 ] 

Apache Spark commented on SPARK-14743:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/14065

> Improve delegation token handling in secure clusters
> 
>
> Key: SPARK-14743
> URL: https://issues.apache.org/jira/browse/SPARK-14743
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> In a way, I'd consider this a parent bug of SPARK-7252.
> Spark's current support for delegation tokens is a little all over the place:
> - for HDFS, there's support for re-creating tokens if a principal and keytab 
> are provided
> - for HBase and Hive, Spark will fetch delegation tokens so that apps can 
> work in cluster mode, but will not re-create them, so apps that need those 
> will stop working after 7 days
> - for anything else, Spark doesn't do anything. Lots of other services use 
> delegation tokens, and supporting them as data sources in Spark becomes more 
> complicated because of that. e.g., Kafka will (hopefully) soon support them.
> It would be nice if Spark had consistent support for handling delegation 
> tokens regardless of who needs them. I'd list these as the requirements:
> - Spark to provide a generic interface for fetching delegation tokens. This 
> would allow Spark's delegation token support to be extended using some plugin 
> architecture (e.g. Java services), meaning Spark itself doesn't need to 
> support every possible service out there.
> This would be used to fetch tokens when launching apps in cluster mode, and 
> when a principal and a keytab are provided to Spark.
> - A way to manually update delegation tokens in Spark. For example, a new 
> SparkContext API, or some configuration that tells Spark to monitor a file 
> for changes and load tokens from said file.
> This would allow external applications to manage tokens outside of Spark and 
> be able to update a running Spark application (think, for example, a job 
> server like Oozie, or something like Hive-on-Spark which manages Spark apps 
> running remotely).
> - A way to notify running code that new delegation tokens have been loaded.
> This may not be strictly necessary; it might be possible for code to detect 
> that, e.g., by peeking into the UserGroupInformation structure. But an event 
> sent to the listener bus would allow applications to react when new tokens 
> are available (e.g., the Hive backend could re-create connections to the 
> metastore server using the new tokens).
> Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
> mailing list recently.
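To make the first requirement concrete, here is a hypothetical sketch of what such a 
pluggable token-fetching interface might look like (names and signatures are 
illustrative only, not an existing Spark API; discovery could happen through 
java.util.ServiceLoader, in line with the "Java services" idea above):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials

// Hypothetical plugin contract: each service contributes its own token-fetching logic.
trait DelegationTokenProvider {
  def serviceName: String
  def tokensRequired(hadoopConf: Configuration): Boolean
  /** Fetch tokens into `creds`, returning the next renewal time in ms if known. */
  def obtainTokens(hadoopConf: Configuration, creds: Credentials): Option[Long]
}

// Providers would be discovered via java.util.ServiceLoader and invoked both when
// launching in cluster mode and on each scheduled renewal when a principal/keytab
// is configured; a listener-bus event could then announce the refreshed tokens.
{code}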



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14743) Improve delegation token handling in secure clusters

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14743:


Assignee: (was: Apache Spark)

> Improve delegation token handling in secure clusters
> 
>
> Key: SPARK-14743
> URL: https://issues.apache.org/jira/browse/SPARK-14743
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> In a way, I'd consider this a parent bug of SPARK-7252.
> Spark's current support for delegation tokens is a little all over the place:
> - for HDFS, there's support for re-creating tokens if a principal and keytab 
> are provided
> - for HBase and Hive, Spark will fetch delegation tokens so that apps can 
> work in cluster mode, but will not re-create them, so apps that need those 
> will stop working after 7 days
> - for anything else, Spark doesn't do anything. Lots of other services use 
> delegation tokens, and supporting them as data sources in Spark becomes more 
> complicated because of that. e.g., Kafka will (hopefully) soon support them.
> It would be nice if Spark had consistent support for handling delegation 
> tokens regardless of who needs them. I'd list these as the requirements:
> - Spark to provide a generic interface for fetching delegation tokens. This 
> would allow Spark's delegation token support to be extended using some plugin 
> architecture (e.g. Java services), meaning Spark itself doesn't need to 
> support every possible service out there.
> This would be used to fetch tokens when launching apps in cluster mode, and 
> when a principal and a keytab are provided to Spark.
> - A way to manually update delegation tokens in Spark. For example, a new 
> SparkContext API, or some configuration that tells Spark to monitor a file 
> for changes and load tokens from said file.
> This would allow external applications to manage tokens outside of Spark and 
> be able to update a running Spark application (think, for example, a job 
> server like Oozie, or something like Hive-on-Spark which manages Spark apps 
> running remotely).
> - A way to notify running code that new delegation tokens have been loaded.
> This may not be strictly necessary; it might be possible for code to detect 
> that, e.g., by peeking into the UserGroupInformation structure. But an event 
> sent to the listener bus would allow applications to react when new tokens 
> are available (e.g., the Hive backend could re-create connections to the 
> metastore server using the new tokens).
> Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
> mailing list recently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14743) Improve delegation token handling in secure clusters

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14743:


Assignee: Apache Spark

> Improve delegation token handling in secure clusters
> 
>
> Key: SPARK-14743
> URL: https://issues.apache.org/jira/browse/SPARK-14743
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> In a way, I'd consider this a parent bug of SPARK-7252.
> Spark's current support for delegation tokens is a little all over the place:
> - for HDFS, there's support for re-creating tokens if a principal and keytab 
> are provided
> - for HBase and Hive, Spark will fetch delegation tokens so that apps can 
> work in cluster mode, but will not re-create them, so apps that need those 
> will stop working after 7 days
> - for anything else, Spark doesn't do anything. Lots of other services use 
> delegation tokens, and supporting them as data sources in Spark becomes more 
> complicated because of that. e.g., Kafka will (hopefully) soon support them.
> It would be nice if Spark had consistent support for handling delegation 
> tokens regardless of who needs them. I'd list these as the requirements:
> - Spark to provide a generic interface for fetching delegation tokens. This 
> would allow Spark's delegation token support to be extended using some plugin 
> architecture (e.g. Java services), meaning Spark itself doesn't need to 
> support every possible service out there.
> This would be used to fetch tokens when launching apps in cluster mode, and 
> when a principal and a keytab are provided to Spark.
> - A way to manually update delegation tokens in Spark. For example, a new 
> SparkContext API, or some configuration that tells Spark to monitor a file 
> for changes and load tokens from said file.
> This would allow external applications to manage tokens outside of Spark and 
> be able to update a running Spark application (think, for example, a job 
> server like Oozie, or something like Hive-on-Spark which manages Spark apps 
> running remotely).
> - A way to notify running code that new delegation tokens have been loaded.
> This may not be strictly necessary; it might be possible for code to detect 
> that, e.g., by peeking into the UserGroupInformation structure. But an event 
> sent to the listener bus would allow applications to react when new tokens 
> are available (e.g., the Hive backend could re-create connections to the 
> metastore server using the new tokens).
> Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
> mailing list recently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-06 Thread zenglinxi (JIRA)
zenglinxi created SPARK-16408:
-

 Summary: SparkSQL Added file get Exception: is a directory and 
recursive is not turned on
 Key: SPARK-16408
 URL: https://issues.apache.org/jira/browse/SPARK-16408
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.6.2
Reporter: zenglinxi


when using spark-sql to execute SQL like:
{quote}
add file hdfs://xxx/user/test;
{quote}
if the HDFS path (hdfs://xxx/user/test) is a directory, we will get an 
exception like:
{quote}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory 
and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16398) Make cancelJob and cancelStage API public

2016-07-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16398.
-
   Resolution: Fixed
 Assignee: Mitesh Patel
Fix Version/s: 2.1.0

> Make cancelJob and cancelStage API public
> -
>
> Key: SPARK-16398
> URL: https://issues.apache.org/jira/browse/SPARK-16398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Mitesh
>Assignee: Mitesh Patel
>Priority: Trivial
> Fix For: 2.1.0
>
>
> Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This 
> allows applications to use {{SparkListener}} to do their own management of 
> jobs via events, but without using the REST API.
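As a rough sketch of the kind of usage this enables once the methods are public 
(the cancellation condition below is purely illustrative, and the listener class is 
hypothetical):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Cancels any job whose job description carries an illustrative "cancellable" marker.
class JobCanceller(sc: SparkContext) extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val desc = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.job.description")))
      .getOrElse("")
    if (desc.contains("cancellable")) {
      sc.cancelJob(jobStart.jobId)  // public after this change
    }
  }
}

// Registered with: sc.addSparkListener(new JobCanceller(sc))
{code}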



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365641#comment-15365641
 ] 

Apache Spark commented on SPARK-16021:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14084

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to 
> conf-flag and enable only in test.
> cc [~sameerag] [~hvanhovell]
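A minimal sketch of the on-heap idea (the class and sentinel value are illustrative, 
not Spark's actual memory-manager code; Spark's on-heap blocks are backed by long 
arrays):
{code}
// Hypothetical debug allocator: poison memory on free so that any later read of a
// "freed" block yields an obviously wrong sentinel instead of silently stale data.
class PoisoningAllocator(fillOnFree: Boolean) {
  private val Sentinel = 0xA5A5A5A5A5A5A5A5L

  def allocate(words: Int): Array[Long] = new Array[Long](words)

  def free(block: Array[Long]): Unit = {
    if (fillOnFree) {                     // would be gated by a test-only conf flag
      java.util.Arrays.fill(block, Sentinel)
    }
    // real allocator bookkeeping would follow here
  }
}
{code}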



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14743) Improve delegation token handling in secure clusters

2016-07-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-14743:

Component/s: YARN

> Improve delegation token handling in secure clusters
> 
>
> Key: SPARK-14743
> URL: https://issues.apache.org/jira/browse/SPARK-14743
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> In a way, I'd consider this a parent bug of SPARK-7252.
> Spark's current support for delegation tokens is a little all over the place:
> - for HDFS, there's support for re-creating tokens if a principal and keytab 
> are provided
> - for HBase and Hive, Spark will fetch delegation tokens so that apps can 
> work in cluster mode, but will not re-create them, so apps that need those 
> will stop working after 7 days
> - for anything else, Spark doesn't do anything. Lots of other services use 
> delegation tokens, and supporting them as data sources in Spark becomes more 
> complicated because of that. e.g., Kafka will (hopefully) soon support them.
> It would be nice if Spark had consistent support for handling delegation 
> tokens regardless of who needs them. I'd list these as the requirements:
> - Spark to provide a generic interface for fetching delegation tokens. This 
> would allow Spark's delegation token support to be extended using some plugin 
> architecture (e.g. Java services), meaning Spark itself doesn't need to 
> support every possible service out there.
> This would be used to fetch tokens when launching apps in cluster mode, and 
> when a principal and a keytab are provided to Spark.
> - A way to manually update delegation tokens in Spark. For example, a new 
> SparkContext API, or some configuration that tells Spark to monitor a file 
> for changes and load tokens from said file.
> This would allow external applications to manage tokens outside of Spark and 
> be able to update a running Spark application (think, for example, a job 
> server like Oozie, or something like Hive-on-Spark which manages Spark apps 
> running remotely).
> - A way to notify running code that new delegation tokens have been loaded.
> This may not be strictly necessary; it might be possible for code to detect 
> that, e.g., by peeking into the UserGroupInformation structure. But an event 
> sent to the listener bus would allow applications to react when new tokens 
> are available (e.g., the Hive backend could re-create connections to the 
> metastore server using the new tokens).
> Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
> mailing list recently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365599#comment-15365599
 ] 

Semet commented on SPARK-16367:
---

Wheels are tagged by OS, architecture and Python version, and that seems to be 
enough for a package compiled on one machine to work on another, compatible one. 
pip install is responsible for finding the right wheel for the requested module.

For example, on my machine, when I do a "pip install numpy" there is no 
compilation: pip directly takes the binary wheel from PyPI, so installation is 
fast. But if you have an older version of Python, for instance 2.6, since there 
are no wheels for 2.6, pip install will compile the C modules and store the 
resulting wheel in ~/.cache/pip, so future installations will not require 
compilation.

You can even take this wheel and add it to your pypi-local repository on 
Artifactory, so the package becomes available on your PyPI mirror (see the 
Artifactory documentation on its PyPI support).

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy Scala packages, the recommendation is to build big fat jar files. This 
> puts all dependencies into one package, so the only "cost" is the copy time 
> needed to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> *Previous approaches* 
> I based the current proposal on the two following tickets related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Put differently, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If a 
> wheel is not available, pip will compile it from source anyway. Mirroring of 
> PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} also provides an easy way to generate the wheels of all packages used 
> by a given project inside a "virtualenv". This is called a "wheelhouse". You can 
> even skip the compilation entirely and retrieve the wheels directly from 
> pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. 
> In this case the simplest way to deploy a project with several dependencies 
> is to build and then ship a complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and number of 
> dependencies; deploying it on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via 

[jira] [Commented] (SPARK-16381) Update SQL examples and programming guide for R language binding

2016-07-06 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365597#comment-15365597
 ] 

Xin Ren commented on SPARK-16381:
-

Hi Cheng, do you mind telling me where to find the RC date or release schedule? 

I tried 
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel,
 but didn't find much information there.

> Update SQL examples and programming guide for R language binding
> 
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xin Ren
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16380) Update SQL examples and programming guide for Python language binding

2016-07-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365591#comment-15365591
 ] 

Cheng Lian commented on SPARK-16380:


[~wm624] Considering 2.0.0 RC2 has already been cut, it's possible that we 
can't have this in 2.0.0. However, we'd like to have it in 2.0.0 if there's 
another RC.

> Update SQL examples and programming guide for Python language binding
> -
>
> Key: SPARK-16380
> URL: https://issues.apache.org/jira/browse/SPARK-16380
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Miao Wang
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16303) Update SQL examples and programming guide for Scala and Java language bindings

2016-07-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365590#comment-15365590
 ] 

Cheng Lian commented on SPARK-16303:


[~aokolnychyi] Considering 2.0.0 RC2 has already been cut, it's possible that 
we can't have this in 2.0.0. However, we'd like to have it in 2.0.0 if there's 
another RC.

> Update SQL examples and programming guide for Scala and Java language bindings
> --
>
> Key: SPARK-16303
> URL: https://issues.apache.org/jira/browse/SPARK-16303
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Anton Okolnychyi
>
> We need to update SQL examples code under the {{examples}} sub-project, and 
> then replace hard-coded snippets in the SQL programming guide with snippets 
> automatically extracted from actual source files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16381) Update SQL examples and programming guide for R language binding

2016-07-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365583#comment-15365583
 ] 

Cheng Lian commented on SPARK-16381:


Thanks for volunteering! I've assigned this ticket to you.

Considering 2.0.0 RC2 has already been cut, it's possible that we can't have 
this in 2.0.0. However, we'd like to have it if there's another RC.

> Update SQL examples and programming guide for R language binding
> 
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xin Ren
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16381) Update SQL examples and programming guide for R language binding

2016-07-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16381:
---
Assignee: Xin Ren

> Update SQL examples and programming guide for R language binding
> 
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xin Ren
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16380) Update SQL examples and programming guide for Python language binding

2016-07-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365579#comment-15365579
 ] 

Cheng Lian commented on SPARK-16380:


I just noticed that I put "Scala" into the JIRA ticket title by mistake. Please 
note that the scope of this ticket only covers Python examples.

> Update SQL examples and programming guide for Python language binding
> -
>
> Key: SPARK-16380
> URL: https://issues.apache.org/jira/browse/SPARK-16380
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Miao Wang
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16380) Update SQL examples and programming guide for Python language binding

2016-07-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16380:
---
Summary: Update SQL examples and programming guide for Python language 
binding  (was: Update SQL examples and programming guide for Scala Python 
language binding)

> Update SQL examples and programming guide for Python language binding
> -
>
> Key: SPARK-16380
> URL: https://issues.apache.org/jira/browse/SPARK-16380
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Miao Wang
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16374) Remove Alias from MetastoreRelation and SimpleCatalogRelation

2016-07-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16374.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14053
[https://github.com/apache/spark/pull/14053]

> Remove Alias from MetastoreRelation and SimpleCatalogRelation
> -
>
> Key: SPARK-16374
> URL: https://issues.apache.org/jira/browse/SPARK-16374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Different from the other leaf nodes, `MetastoreRelation` and 
> `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change 
> the qualifier of the node. However, based on the existing alias handling, 
> alias should be put in `SubqueryAlias`. 
> This PR is to separate alias handling from `MetastoreRelation` and 
> `SimpleCatalogRelation` to make it consistent with the other nodes. 
> For example, below is an example query for `MetastoreRelation`, which is 
> converted to `LogicalRelation`:
> {noformat}
> SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
> {noformat}
> Before changes, the analyzed plan is
> {noformat}
> == Analyzed Logical Plan ==
> (a + 1): int
> Project [(a#951 + 1) AS (a + 1)#952]
> +- Filter (a#951 > 2)
>+- SubqueryAlias tmp
>   +- Relation[a#951] parquet
> {noformat}
> After changes, the analyzed plan becomes
> {noformat}
> == Analyzed Logical Plan ==
> (a + 1): int
> Project [(a#951 + 1) AS (a + 1)#952]
> +- Filter (a#951 > 2)
>+- SubqueryAlias tmp
>   +- SubqueryAlias test_parquet_ctas
>  +- Relation[a#951] parquet
> {noformat}
> **Note: the optimized plans are the same.**
> For `SimpleCatalogRelation`, the existing code always generates two 
> Subqueries. Thus, no change is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16374) Remove Alias from MetastoreRelation and SimpleCatalogRelation

2016-07-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16374:

Assignee: Xiao Li

> Remove Alias from MetastoreRelation and SimpleCatalogRelation
> -
>
> Key: SPARK-16374
> URL: https://issues.apache.org/jira/browse/SPARK-16374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Different from the other leaf nodes, `MetastoreRelation` and 
> `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change 
> the qualifier of the node. However, based on the existing alias handling, 
> alias should be put in `SubqueryAlias`. 
> This PR is to separate alias handling from `MetastoreRelation` and 
> `SimpleCatalogRelation` to make it consistent with the other nodes. 
> For example, below is an example query for `MetastoreRelation`, which is 
> converted to `LogicalRelation`:
> {noformat}
> SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
> {noformat}
> Before changes, the analyzed plan is
> {noformat}
> == Analyzed Logical Plan ==
> (a + 1): int
> Project [(a#951 + 1) AS (a + 1)#952]
> +- Filter (a#951 > 2)
>+- SubqueryAlias tmp
>   +- Relation[a#951] parquet
> {noformat}
> After changes, the analyzed plan becomes
> {noformat}
> == Analyzed Logical Plan ==
> (a + 1): int
> Project [(a#951 + 1) AS (a + 1)#952]
> +- Filter (a#951 > 2)
>+- SubqueryAlias tmp
>   +- SubqueryAlias test_parquet_ctas
>  +- Relation[a#951] parquet
> {noformat}
> **Note: the optimized plans are the same.**
> For `SimpleCatalogRelation`, the existing code always generates two 
> Subqueries. Thus, no change is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14839) Support for other types as option in OPTIONS clause

2016-07-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-14839.
---
   Resolution: Resolved
 Assignee: Hyukjin Kwon
Fix Version/s: 2.1.0

> Support for other types as option in OPTIONS clause
> ---
>
> Key: SPARK-14839
> URL: https://issues.apache.org/jira/browse/SPARK-14839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, Spark SQL does not support other types or {{null}} as the value of 
> an option. 
> For example, 
> {code}
> CREATE ...
> USING csv
> OPTIONS (path "your-path", quote null)
> {code}
> throws an exception below
> {code}
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
> org.apache.spark.sql.catalyst.parser.ParseException: 
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
> ...
> {code}
> Currently, the Scala API supports options with the types {{String}}, 
> {{Long}}, {{Double}} and {{Boolean}}, and the Python API also supports other 
> types. I think in this way we can support data sources in a consistent way.
> It looks like it is okay to provide other types as arguments, just as 
> [Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx] does, 
> because the [SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] 
> standard mentions options as below:
> {quote}
> An implementation remains conforming even if it provides user op-
> tions to process nonconforming SQL language or to process conform-
> ing SQL language in a nonconforming manner.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-07-06 Thread holdenk (JIRA)
holdenk created SPARK-16407:
---

 Summary: Allow users to supply custom StreamSinkProviders
 Key: SPARK-16407
 URL: https://issues.apache.org/jira/browse/SPARK-16407
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: holdenk


The current DataStreamWriter allows users to specify a class name as the format; 
however, it could be easier for people to directly pass in a specific provider 
instance, e.g. for a user-defined equivalent of ForeachSink or another sink that 
takes non-string parameters.
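For context, a rough sketch of how a custom sink is wired up by class name today, 
together with the kind of instance-based call this ticket suggests (the 
instance-accepting method is hypothetical, and the provider/sink signatures follow 
the 2.0 streaming source API as I understand it):
{code}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Toy sink that just counts the rows of every micro-batch.
class ConsoleCountSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit =
      println(s"batch $batchId: ${data.count()} rows")
  }
}

// Today: referenced indirectly by class name.
//   df.writeStream.format(classOf[ConsoleCountSinkProvider].getName).start()
// Proposed (hypothetical API): pass the instance directly, so a sink could take
// non-string constructor parameters, e.g.
//   df.writeStream.sink(new ConsoleCountSinkProvider).start()
{code}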




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2016-07-06 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365547#comment-15365547
 ] 

SuYan commented on SPARK-3630:
--

Snappy-java has supported concatenation since 1.1.2: 
https://github.com/xerial/snappy-java/issues/103

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2016-07-06 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365543#comment-15365543
 ] 

SuYan commented on SPARK-3630:
--

Maybe the reason is that snappy 1.0.4.1 does not support concatenation? The code 
path was "UnsafeShuffleWriter -> mergeSpillsWithFastFileStream", which concatenates 
the snappy-compressed data of the same partition from different spilled files.
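A small sketch of the suspected failure mode, using snappy-java's stream classes 
(per the comment above, readers from 1.1.2 onward accept the concatenated stream, 
while older versions fail with a PARSING_ERROR-style exception):
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

object SnappyConcatCheck {
  // Compress a chunk as its own snappy stream, the way each spill file is written.
  private def compress(bytes: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new SnappyOutputStream(bos)
    out.write(bytes)
    out.close()
    bos.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Concatenate two independently compressed streams, as the fast merge path does.
    val merged = compress("spill-1 ".getBytes("UTF-8")) ++ compress("spill-2".getBytes("UTF-8"))
    val in = new SnappyInputStream(new ByteArrayInputStream(merged))
    val buf = new Array[Byte](64)
    var n = in.read(buf)
    while (n != -1) {
      print(new String(buf, 0, n, "UTF-8"))
      n = in.read(buf)
    }
    in.close()
  }
}
{code}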


> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao closed SPARK-16342.
---
Resolution: Duplicate

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has several problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase tokens can 
> be fetched. For any other third-party service that needs to talk to Spark in a 
> Kerberos-secured way, the only option today is to modify Spark code.
> 2. The token renewal and update mechanism is also hard-coded, which means 
> other third-party services cannot benefit from it and will fail once their 
> tokens expire.
> 3. At the code level, the token fetching and update logic is scattered across 
> several places without a clear structure, which makes it hard to maintain and 
> extend.
> So here I propose a new configurable token manager to solve the issues 
> mentioned above. 
> Basically this proposal has two parts:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; by default there will be hdfs, hbase and hive providers, and 
> users can add their own services through configuration. The interface offers a 
> way to retrieve tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager that manages all the registered token 
> providers and exposes APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.
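A hypothetical sketch of what the two proposed pieces could look like (illustrative 
names and signatures only, mirroring the proposal above, not existing Spark code):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials

// Proposed provider contract: fetch tokens for one service and report a renewal interval.
trait ServiceTokenProvider {
  def serviceName: String
  def obtainTokens(conf: Configuration, creds: Credentials): Unit
  def renewalIntervalMs(conf: Configuration): Option[Long]
}

// Proposed manager: holds the configured providers and refreshes all tokens on demand.
class ConfigurableTokenManager(providers: Seq[ServiceTokenProvider]) {
  def obtainAll(conf: Configuration): Credentials = {
    val creds = new Credentials()
    providers.foreach(_.obtainTokens(conf, creds))
    creds
  }

  // The earliest renewal interval across providers decides when tokens are refreshed next.
  def nextRenewalMs(conf: Configuration): Option[Long] =
    providers.flatMap(_.renewalIntervalMs(conf)).reduceOption(_ min _)
}
{code}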



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365535#comment-15365535
 ] 

Saisai Shao commented on SPARK-16342:
-

Closing this JIRA as a duplicate and moving the discussion to SPARK-14743.

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has several problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase tokens can 
> be fetched. For any other third-party service that needs to talk to Spark in a 
> Kerberos-secured way, the only option today is to modify Spark code.
> 2. The token renewal and update mechanism is also hard-coded, which means 
> other third-party services cannot benefit from it and will fail once their 
> tokens expire.
> 3. At the code level, the token fetching and update logic is scattered across 
> several places without a clear structure, which makes it hard to maintain and 
> extend.
> So here I propose a new configurable token manager to solve the issues 
> mentioned above. 
> Basically this proposal has two parts:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; by default there will be hdfs, hbase and hive providers, and 
> users can add their own services through configuration. The interface offers a 
> way to retrieve tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager that manages all the registered token 
> providers and exposes APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14743) Improve delegation token handling in secure clusters

2016-07-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365534#comment-15365534
 ] 

Saisai Shao commented on SPARK-14743:
-

Post design doc here and move SPARK-16342 to here.

> Improve delegation token handling in secure clusters
> 
>
> Key: SPARK-14743
> URL: https://issues.apache.org/jira/browse/SPARK-14743
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> In a way, I'd consider this a parent bug of SPARK-7252.
> Spark's current support for delegation tokens is a little all over the place:
> - for HDFS, there's support for re-creating tokens if a principal and keytab 
> are provided
> - for HBase and Hive, Spark will fetch delegation tokens so that apps can 
> work in cluster mode, but will not re-create them, so apps that need those 
> will stop working after 7 days
> - for anything else, Spark doesn't do anything. Lots of other services use 
> delegation tokens, and supporting them as data sources in Spark becomes more 
> complicated because of that. e.g., Kafka will (hopefully) soon support them.
> It would be nice if Spark had consistent support for handling delegation 
> tokens regardless of who needs them. I'd list these as the requirements:
> - Spark to provide a generic interface for fetching delegation tokens. This 
> would allow Spark's delegation token support to be extended using some plugin 
> architecture (e.g. Java services), meaning Spark itself doesn't need to 
> support every possible service out there.
> This would be used to fetch tokens when launching apps in cluster mode, and 
> when a principal and a keytab are provided to Spark.
> - A way to manually update delegation tokens in Spark. For example, a new 
> SparkContext API, or some configuration that tells Spark to monitor a file 
> for changes and load tokens from said file.
> This would allow external applications to manage tokens outside of Spark and 
> be able to update a running Spark application (think, for example, a job 
> server like Oozie, or something like Hive-on-Spark which manages Spark apps 
> running remotely).
> - A way to notify running code that new delegation tokens have been loaded.
> This may not be strictly necessary; it might be possible for code to detect 
> that, e.g., by peeking into the UserGroupInformation structure. But an event 
> sent to the listener bus would allow applications to react when new tokens 
> are available (e.g., the Hive backend could re-create connections to the 
> metastore server using the new tokens).
> Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
> mailing list recently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-07-06 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365533#comment-15365533
 ] 

Gayathri Murali commented on SPARK-16240:
-

I can work on this

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Priority: Minor
>
> After resolving the matrix conversion issue, the LDA model still cannot load 
> 1.6 models because one of the parameter names changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
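One possible shape for that special-casing, kept deliberately generic (the parameter 
names below are placeholders, not the actual renamed LDA param, and the real fix 
would live in the ML persistence loading path):
{code}
object LegacyParamCompat {
  // Hypothetical helper: rewrite a renamed parameter key found in 1.6-era metadata
  // before the params are applied to the 2.x model. Names are placeholders.
  def remapLegacyParams(params: Map[String, Any]): Map[String, Any] =
    params.map {
      case ("oldParamName", value) => "newParamName" -> value
      case other => other
    }
}
{code}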



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14743) Improve delegation token handling in secure clusters

2016-07-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365534#comment-15365534
 ] 

Saisai Shao edited comment on SPARK-14743 at 7/7/16 3:18 AM:
-

Post design doc and move SPARK-16342 to here.


was (Author: jerryshao):
Post design doc here and move SPARK-16342 to here.

> Improve delegation token handling in secure clusters
> 
>
> Key: SPARK-14743
> URL: https://issues.apache.org/jira/browse/SPARK-14743
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> In a way, I'd consider this a parent bug of SPARK-7252.
> Spark's current support for delegation tokens is a little all over the place:
> - for HDFS, there's support for re-creating tokens if a principal and keytab 
> are provided
> - for HBase and Hive, Spark will fetch delegation tokens so that apps can 
> work in cluster mode, but will not re-create them, so apps that need those 
> will stop working after 7 days
> - for anything else, Spark doesn't do anything. Lots of other services use 
> delegation tokens, and supporting them as data sources in Spark becomes more 
> complicated because of that. e.g., Kafka will (hopefully) soon support them.
> It would be nice if Spark had consistent support for handling delegation 
> tokens regardless of who needs them. I'd list these as the requirements:
> - Spark to provide a generic interface for fetching delegation tokens. This 
> would allow Spark's delegation token support to be extended using some plugin 
> architecture (e.g. Java services), meaning Spark itself doesn't need to 
> support every possible service out there.
> This would be used to fetch tokens when launching apps in cluster mode, and 
> when a principal and a keytab are provided to Spark.
> - A way to manually update delegation tokens in Spark. For example, a new 
> SparkContext API, or some configuration that tells Spark to monitor a file 
> for changes and load tokens from said file.
> This would allow external applications to manage tokens outside of Spark and 
> be able to update a running Spark application (think, for example, a job 
> server like Oozie, or something like Hive-on-Spark which manages Spark apps 
> running remotely).
> - A way to notify running code that new delegation tokens have been loaded.
> This may not be strictly necessary; it might be possible for code to detect 
> that, e.g., by peeking into the UserGroupInformation structure. But an event 
> sent to the listener bus would allow applications to react when new tokens 
> are available (e.g., the Hive backend could re-create connections to the 
> metastore server using the new tokens).
> Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
> mailing list recently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365483#comment-15365483
 ] 

Saisai Shao commented on SPARK-16342:
-

OK, I see. Sorry, I didn't notice your JIRA; let me consolidate things into your 
JIRA if you don't mind.

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has several problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase tokens can 
> be fetched. For any other third-party service that needs to talk to Spark in a 
> Kerberos-secured way, the only option today is to modify Spark code.
> 2. The token renewal and update mechanism is also hard-coded, which means 
> other third-party services cannot benefit from it and will fail once their 
> tokens expire.
> 3. At the code level, the token fetching and update logic is scattered across 
> several places without a clear structure, which makes it hard to maintain and 
> extend.
> So here I propose a new configurable token manager to solve the issues 
> mentioned above. 
> Basically this proposal has two parts:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; by default there will be hdfs, hbase and hive providers, and 
> users can add their own services through configuration. The interface offers a 
> way to retrieve tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager that manages all the registered token 
> providers and exposes APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.
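
For illustration only, a rough sketch of the proposed shape. ServiceTokenProvider 
and ConfigurableTokenManager are the names from the description above, but the 
method names and signatures here are assumptions, not the design doc's API:
{noformat}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials

trait ServiceTokenProvider {
  // Name used to enable/disable the provider via configuration, e.g. "hdfs".
  def serviceName: String
  // Obtain tokens, add them to `creds`, and return the renewal interval in
  // milliseconds, or None if the tokens cannot be renewed.
  def obtainTokens(conf: Configuration, creds: Credentials): Option[Long]
}

class ConfigurableTokenManager(providers: Seq[ServiceTokenProvider]) {
  private val byName: Map[String, ServiceTokenProvider] =
    providers.map(p => p.serviceName -> p).toMap

  // Refresh tokens from every registered provider and return the smallest
  // renewal interval, which would drive the renewal schedule.
  def obtainAll(conf: Configuration, creds: Credentials): Option[Long] =
    byName.values.flatMap(_.obtainTokens(conf, creds)).reduceOption(_ min _)
}
{noformat}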



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365482#comment-15365482
 ] 

Marcelo Vanzin commented on SPARK-16342:


I'm not working on it; I filed a bug because it's a missing feature that's 
needed. I'm just saying that instead of filing a bug with pretty much the same 
contents, it's better to consolidate things.

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has some problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase are 
> supported for token fetching. For other third-party services that need to 
> communicate with Spark in a Kerberized way, currently the only option is to 
> modify Spark code.
> 2. The current token renewal and update mechanism is also hard-coded, which 
> means other third-party services cannot benefit from it and will fail once 
> their tokens expire.
> 3. At the code level, the token acquisition and update logic is scattered 
> across several places without a clear structure, which makes it hard to 
> maintain and extend.
> So here we propose a new Configurable Token Manager class to solve the issues 
> mentioned above.
> Basically this proposal involves two changes:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; HDFS, HBase and Hive providers are available by default, and 
> users can add their own services through configuration. This interface offers 
> a way to retrieve the tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager to manage all registered token providers 
> and expose APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-16342:

Comment: was deleted

(was: Thanks [~vanzin] for pointing out the jira; it looks like most of the 
ideas are similar. I'm not sure what your progress is on it, but I think it 
would be great to collaborate to make it happen. Thanks a lot.)

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has some problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase are 
> supported for token fetching. For other third-party services that need to 
> communicate with Spark in a Kerberized way, currently the only option is to 
> modify Spark code.
> 2. The current token renewal and update mechanism is also hard-coded, which 
> means other third-party services cannot benefit from it and will fail once 
> their tokens expire.
> 3. At the code level, the token acquisition and update logic is scattered 
> across several places without a clear structure, which makes it hard to 
> maintain and extend.
> So here we propose a new Configurable Token Manager class to solve the issues 
> mentioned above.
> Basically this proposal involves two changes:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; HDFS, HBase and Hive providers are available by default, and 
> users can add their own services through configuration. This interface offers 
> a way to retrieve the tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager to manage all registered token providers 
> and expose APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365477#comment-15365477
 ] 

Saisai Shao commented on SPARK-16342:
-

Thanks [~vanzin] for pointing out the jira; it looks like most of the ideas are 
similar. I'm not sure what your progress is on it, but I think it would be 
great to collaborate to make it happen. Thanks a lot.

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has some problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase are 
> supported for token fetching. For other third-party services that need to 
> communicate with Spark in a Kerberized way, currently the only option is to 
> modify Spark code.
> 2. The current token renewal and update mechanism is also hard-coded, which 
> means other third-party services cannot benefit from it and will fail once 
> their tokens expire.
> 3. At the code level, the token acquisition and update logic is scattered 
> across several places without a clear structure, which makes it hard to 
> maintain and extend.
> So here we propose a new Configurable Token Manager class to solve the issues 
> mentioned above.
> Basically this proposal involves two changes:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; HDFS, HBase and Hive providers are available by default, and 
> users can add their own services through configuration. This interface offers 
> a way to retrieve the tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager to manage all registered token providers 
> and expose APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN

2016-07-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365478#comment-15365478
 ] 

Saisai Shao commented on SPARK-16342:
-

Thanks [~vanzin] for pointing out the jira; it looks like most of the ideas are 
similar. I'm not sure what your progress is on it, but I think it would be 
great to collaborate to make it happen. Thanks a lot.

> Add a new Configurable Token Manager  for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> Current Spark on YARN token management has some problems:
> 1. The supported services are hard-coded: only HDFS, Hive and HBase are 
> supported for token fetching. For other third-party services that need to 
> communicate with Spark in a Kerberized way, currently the only option is to 
> modify Spark code.
> 2. The current token renewal and update mechanism is also hard-coded, which 
> means other third-party services cannot benefit from it and will fail once 
> their tokens expire.
> 3. At the code level, the token acquisition and update logic is scattered 
> across several places without a clear structure, which makes it hard to 
> maintain and extend.
> So here we propose a new Configurable Token Manager class to solve the issues 
> mentioned above.
> Basically this proposal involves two changes:
> 1. Abstract a ServiceTokenProvider for different services. It is configurable 
> and pluggable; HDFS, HBase and Hive providers are available by default, and 
> users can add their own services through configuration. This interface offers 
> a way to retrieve the tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager to manage all registered token providers 
> and expose APIs for external modules to get and update tokens.
> Details are in the design doc 
> (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing),
>  any suggestion and comment is greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16174) Improve `OptimizeIn` optimizer to remove literal repetitions

2016-07-06 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16174:
--
Summary: Improve `OptimizeIn` optimizer to remove literal repetitions  
(was: Improve `OptimizeIn` optimizer to remove deterministic repetitions)

> Improve `OptimizeIn` optimizer to remove literal repetitions
> 
>
> Key: SPARK-16174
> URL: https://issues.apache.org/jira/browse/SPARK-16174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue improves `OptimizeIn` optimizer to remove the deterministic 
> repetitions from SQL `IN` predicates. This optimizer prevents user mistakes 
> and also can optimize some queries like 
> [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].
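
As a standalone illustration of the idea, here is a minimal sketch using a toy 
expression model rather than the actual Catalyst rule or classes:
{noformat}
sealed trait Expr
case class Lit(value: Any) extends Expr
case class Col(name: String) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr

// Remove repeated literal operands from an IN predicate; predicates with any
// non-literal operand are left untouched. `distinct` keeps the first
// occurrence of each literal, so the remaining order is unchanged.
def dedupInLiterals(e: Expr): Expr = e match {
  case In(v, list) if list.forall(_.isInstanceOf[Lit]) => In(v, list.distinct)
  case other => other
}

// Example: grade IN (1, 1, 2) becomes grade IN (1, 2).
val optimized = dedupInLiterals(In(Col("grade"), Seq(Lit(1), Lit(1), Lit(2))))
{noformat}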



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16174) Improve `OptimizeIn` optimizer to remove literal repetitions

2016-07-06 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16174:
--
Description: This issue improves `OptimizeIn` optimizer to remove the 
literal repetitions from SQL `IN` predicates. This optimizer prevents user 
mistakes and also can optimize some queries like 
[TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].
  (was: This issue improves `OptimizeIn` optimizer to remove the deterministic 
repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and 
also can optimize some queries like 
[TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].)

> Improve `OptimizeIn` optimizer to remove literal repetitions
> 
>
> Key: SPARK-16174
> URL: https://issues.apache.org/jira/browse/SPARK-16174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue improves `OptimizeIn` optimizer to remove the literal repetitions 
> from SQL `IN` predicates. This optimizer prevents user mistakes and also can 
> optimize some queries like 
> [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16406) Reference resolution for large number of columns should be faster

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365362#comment-15365362
 ] 

Apache Spark commented on SPARK-16406:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14083

> Reference resolution for large number of columns should be faster
> -
>
> Key: SPARK-16406
> URL: https://issues.apache.org/jira/browse/SPARK-16406
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> Resolving a single column reference in a LogicalPlan takes on average n / 2 
> comparisons (n being the number of columns). This gets problematic as soon as 
> you try to resolve a large number of columns (m) on a large table: O(m * n / 2)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16406) Reference resolution for large number of columns should be faster

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16406:


Assignee: Apache Spark  (was: Herman van Hovell)

> Reference resolution for large number of columns should be faster
> -
>
> Key: SPARK-16406
> URL: https://issues.apache.org/jira/browse/SPARK-16406
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> Resolving a single column reference in a LogicalPlan takes on average n / 2 
> comparisons (n being the number of columns). This gets problematic as soon as 
> you try to resolve a large number of columns (m) on a large table: O(m * n / 2)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16406) Reference resolution for large number of columns should be faster

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16406:


Assignee: Herman van Hovell  (was: Apache Spark)

> Reference resolution for large number of columns should be faster
> -
>
> Key: SPARK-16406
> URL: https://issues.apache.org/jira/browse/SPARK-16406
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> Resolving a single column reference in a LogicalPlan takes on average n / 2 
> comparisons (n being the number of columns). This gets problematic as soon as 
> you try to resolve a large number of columns (m) on a large table: O(m * n / 2)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16381) Update SQL examples and programming guide for R language binding

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16381:


Assignee: (was: Apache Spark)

> Update SQL examples and programming guide for R language binding
> 
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16381) Update SQL examples and programming guide for R language binding

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16381:


Assignee: Apache Spark

> Update SQL examples and programming guide for R language binding
> 
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16381) Update SQL examples and programming guide for R language binding

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365352#comment-15365352
 ] 

Apache Spark commented on SPARK-16381:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14082

> Update SQL examples and programming guide for R language binding
> 
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Please follow guidelines listed in this SPARK-16303 
> [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16406) Reference resolution for large number of columns should be faster

2016-07-06 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-16406:
-

 Summary: Reference resolution for large number of columns should 
be faster
 Key: SPARK-16406
 URL: https://issues.apache.org/jira/browse/SPARK-16406
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


Resolving a single column reference in a LogicalPlan takes on average n / 2 
comparisons (n being the number of columns). This gets problematic as soon as 
you try to resolve a large number of columns (m) on a large table: O(m * n / 2)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365340#comment-15365340
 ] 

Apache Spark commented on SPARK-16403:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/14081

> Example cleanup and fix minor issues
> 
>
> Key: SPARK-16403
> URL: https://issues.apache.org/jira/browse/SPARK-16403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> General cleanup of examples, focused on PySpark ML, to remove unused imports, 
> sync with Scala examples, improve consistency and fix minor issues such as 
> arg checks etc.
> * consistent appNames, most are camel case
> * fix formatting, add newlines if difficult to read - many examples are just 
> solid blocks of code
> * should use __future__ print function
> * pipeline_example is a duplicate of simple_text_classification_pipeline
> * some spelling errors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16403:


Assignee: (was: Apache Spark)

> Example cleanup and fix minor issues
> 
>
> Key: SPARK-16403
> URL: https://issues.apache.org/jira/browse/SPARK-16403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> General cleanup of examples, focused on PySpark ML, to remove unused imports, 
> sync with Scala examples, improve consistency and fix minor issues such as 
> arg checks etc.
> * consistent appNames, most are camel case
> * fix formatting, add newlines if difficult to read - many examples are just 
> solid blocks of code
> * should use __future__ print function
> * pipeline_example is a duplicate of simple_text_classification_pipeline
> * some spelling errors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16403:


Assignee: Apache Spark

> Example cleanup and fix minor issues
> 
>
> Key: SPARK-16403
> URL: https://issues.apache.org/jira/browse/SPARK-16403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>
> General cleanup of examples, focused on PySpark ML, to remove unused imports, 
> sync with Scala examples, improve consistency and fix minor issues such as 
> arg checks etc.
> * consistent appNames, most are camel case
> * fix formatting, add newlines if difficult to read - many examples are just 
> solid blocks of code
> * should use __future__ print function
> * pipeline_example is a duplicate of simple_text_classification_pipeline
> * some spelling errors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-07-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16021.
-
   Resolution: Fixed
 Assignee: Eric Liang  (was: Apache Spark)
Fix Version/s: 2.1.0

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to 
> conf-flag and enable only in test.
> cc [~sameerag] [~hvanhovell]
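
A minimal sketch of the idea for on-heap buffers; the object, flag name and 
sentinel value below are hypothetical and not Spark's MemoryAllocator API:
{noformat}
object DebugAllocator {
  // Off by default; a test-only config flag would flip this on
  // ("spark.test.fillFreedMemory" is a hypothetical name).
  private val fillOnFree: Boolean =
    sys.props.get("spark.test.fillFreedMemory").contains("true")
  private val FreedSentinel: Byte = 0x5a.toByte

  // "Free" an on-heap buffer: poison its contents first so any later read of
  // the freed region surfaces as obviously wrong data instead of silently
  // returning stale values.
  def free(buffer: Array[Byte]): Unit = {
    if (fillOnFree) java.util.Arrays.fill(buffer, FreedSentinel)
    // ... return the buffer to the pool / drop the reference ...
  }
}
{noformat}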



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-16403:
-
Description: 
General cleanup of examples, focused on PySpark ML, to remove unused imports, 
sync with Scala examples, improve consistency and fix minor issues such as arg 
checks etc.

* consistent appNames, most are camel case
* fix formatting, add newlines if difficult to read - many examples are just 
solid blocks of code
* should use __future__ print function
* pipeline_example is a duplicate of simple_text_classification_pipeline
* some spelling errors

  was:General cleanup of examples, focused on PySpark ML, to remove unused 
imports, sync with Scala examples, improve consistency and fix minor issues 
such as arg checks etc.


> Example cleanup and fix minor issues
> 
>
> Key: SPARK-16403
> URL: https://issues.apache.org/jira/browse/SPARK-16403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> General cleanup of examples, focused on PySpark ML, to remove unused imports, 
> sync with Scala examples, improve consistency and fix minor issues such as 
> arg checks etc.
> * consistent appNames, most are camel case
> * fix formatting, add newlines if difficult to read - many examples are just 
> solid blocks of code
> * should use __future__ print function
> * pipeline_example is a duplicate of simple_text_classification_pipeline
> * some spelling errors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365321#comment-15365321
 ] 

Shixiong Zhu commented on SPARK-6028:
-

We changed the default RPC to Netty mostly because we want to test it broadly 
before dropping Akka. The class version conflict is a different story anyway; 
even if we don't switch to Netty, you will probably see some other weird error.

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16382) YARN - Dynamic allocation with spark.executor.instances should increase max executors.

2016-07-06 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-16382.
---
Resolution: Won't Fix

> YARN - Dynamic allocation with spark.executor.instances should increase max 
> executors.
> --
>
> Key: SPARK-16382
> URL: https://issues.apache.org/jira/browse/SPARK-16382
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Ryan Blue
>
> SPARK-13723 changed the behavior of dynamic allocation when 
> {{--num-executors}} ({{spark.executor.instances}}) is set. Rather than 
> turning off dynamic allocation, the value is used as the initial number of 
> executors. This did not change the behavior of 
> {{spark.dynamicAllocation.maxExecutors}}. We've noticed that some users set 
> {{--num-executors}} higher than the max and the expectation is that the max 
> increases.
> I think that either max should be increased, or Spark should fail and 
> complain that the executors requested is higher than the max.
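
A hedged sketch of the two behaviours being weighed above, with hypothetical 
method and parameter names (this is not Spark's actual validation code):
{noformat}
// Reconcile spark.executor.instances with spark.dynamicAllocation.maxExecutors.
def reconcileMaxExecutors(
    numExecutors: Int,        // spark.executor.instances
    maxExecutors: Int,        // spark.dynamicAllocation.maxExecutors
    failOnConflict: Boolean): Int = {
  if (numExecutors <= maxExecutors) {
    maxExecutors
  } else if (failOnConflict) {
    // Option 1 from the description: fail loudly.
    throw new IllegalArgumentException(
      s"spark.executor.instances ($numExecutors) must not be larger than " +
        s"spark.dynamicAllocation.maxExecutors ($maxExecutors)")
  } else {
    // Option 2 from the description: silently raise the max to the request.
    numExecutors
  }
}
{noformat}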



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16382) YARN - Dynamic allocation with spark.executor.instances should increase max executors.

2016-07-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365255#comment-15365255
 ] 

Ryan Blue commented on SPARK-16382:
---

[~jerryshao], [~tgraves], I think you're both right that this is currently 
caught. The behavior I observed was in our local copy with an older patch for 
SPARK-13723 that used {{spark.executor.instances}} to increase the min rather 
than the initial number of executors. For jobs where the min was then higher 
than the max, Spark would try to acquire the min number of executors, never let 
any go, and never complain that the min exceeded the max.

I was originally suggesting that the max should be increased, which doesn't 
currently happen, but then I thought it might be better to fail, so I added 
that to the description. That's why I missed that Spark already fails. I'll 
close this. Thanks!

> YARN - Dynamic allocation with spark.executor.instances should increase max 
> executors.
> --
>
> Key: SPARK-16382
> URL: https://issues.apache.org/jira/browse/SPARK-16382
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Ryan Blue
>
> SPARK-13723 changed the behavior of dynamic allocation when 
> {{--num-executors}} ({{spark.executor.instances}}) is set. Rather than 
> turning off dynamic allocation, the value is used as the initial number of 
> executors. This did not change the behavior of 
> {{spark.dynamicAllocation.maxExecutors}}. We've noticed that some users set 
> {{--num-executors}} higher than the max and the expectation is that the max 
> increases.
> I think that either max should be increased, or Spark should fail and 
> complain that the executors requested is higher than the max.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16405) Add metrics and source for external shuffle service

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365247#comment-15365247
 ] 

Apache Spark commented on SPARK-16405:
--

User 'lovexi' has created a pull request for this issue:
https://github.com/apache/spark/pull/14080

> Add metrics and source for external shuffle service
> ---
>
> Key: SPARK-16405
> URL: https://issues.apache.org/jira/browse/SPARK-16405
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: YangyangLiu
>  Labels: Metrics, Monitoring, features
>
> ExternalShuffleService is essential for Spark. In order to better monitor the 
> shuffle service, we added various metrics to it and an 
> ExternalShuffleServiceSource for the metrics system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16405) Add metrics and source for external shuffle service

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16405:


Assignee: (was: Apache Spark)

> Add metrics and source for external shuffle service
> ---
>
> Key: SPARK-16405
> URL: https://issues.apache.org/jira/browse/SPARK-16405
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: YangyangLiu
>  Labels: Metrics, Monitoring, features
>
> ExternalShuffleService is essential for Spark. In order to better monitor the 
> shuffle service, we added various metrics to it and an 
> ExternalShuffleServiceSource for the metrics system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16405) Add metrics and source for external shuffle service

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16405:


Assignee: Apache Spark

> Add metrics and source for external shuffle service
> ---
>
> Key: SPARK-16405
> URL: https://issues.apache.org/jira/browse/SPARK-16405
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: YangyangLiu
>Assignee: Apache Spark
>  Labels: Metrics, Monitoring, features
>
> ExternalShuffleService is essential for Spark. In order to better monitor the 
> shuffle service, we added various metrics to it and an 
> ExternalShuffleServiceSource for the metrics system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8425) Add blacklist mechanism for task scheduling

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365244#comment-15365244
 ] 

Apache Spark commented on SPARK-8425:
-

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/14079

> Add blacklist mechanism for task scheduling
> ---
>
> Key: SPARK-8425
> URL: https://issues.apache.org/jira/browse/SPARK-8425
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Imran Rashid
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16405) Add metrics and source for external shuffle service

2016-07-06 Thread YangyangLiu (JIRA)
YangyangLiu created SPARK-16405:
---

 Summary: Add metrics and source for external shuffle service
 Key: SPARK-16405
 URL: https://issues.apache.org/jira/browse/SPARK-16405
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: YangyangLiu


ExternalShuffleService is essential for Spark. In order to better monitor the 
shuffle service, we added various metrics to it and an 
ExternalShuffleServiceSource for the metrics system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data

2016-07-06 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365227#comment-15365227
 ] 

Seth Hendrickson commented on SPARK-16404:
--

cc [~dbtsai] I looked into using the @transient tag, but this prevents the 
coefficients from being serialized and broadcast to the executors at all, 
resulting in a {{NullPointerException}}. I am not sure of a way around this. I 
can submit a patch using the same strategy as in LoR later this week.
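
A rough sketch of that LoR-style strategy, with hypothetical class and member 
names: keep only small Broadcast handles in the aggregator and resolve them 
lazily on the executor, so the arrays are not serialized with every task closure:
{noformat}
import org.apache.spark.broadcast.Broadcast

class LeastSquaresAggregatorSketch(
    bcCoefficients: Broadcast[Array[Double]],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // Resolved lazily on the executor; @transient keeps the resolved arrays
  // from being serialized back with the aggregator.
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value
  @transient private lazy val featuresStd: Array[Double] = bcFeaturesStd.value

  private var lossSum = 0.0

  def add(label: Double, features: Array[Double]): this.type = {
    // Dot product of the standardized features with the broadcast
    // coefficients; the real aggregator also tracks gradients, counts, etc.
    var margin = 0.0
    var i = 0
    while (i < features.length) {
      if (featuresStd(i) != 0.0) margin += coefficients(i) * features(i) / featuresStd(i)
      i += 1
    }
    lossSum += 0.5 * (margin - label) * (margin - label)
    this
  }
}
{noformat}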

> LeastSquaresAggregator in Linear Regression serializes unnecessary data
> ---
>
> Key: SPARK-16404
> URL: https://issues.apache.org/jira/browse/SPARK-16404
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> This is basically the same issue as 
> [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for 
> linear regression, where {{coefficients}} and {{featuresStd}} are 
> unnecessarily serialized between stages. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data

2016-07-06 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-16404:


 Summary: LeastSquaresAggregator in Linear Regression serializes 
unnecessary data
 Key: SPARK-16404
 URL: https://issues.apache.org/jira/browse/SPARK-16404
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson


This is basically the same issue as 
[SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for linear 
regression, where {{coefficients}} and {{featuresStd}} are unnecessarily 
serialized between stages. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11857:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove Mesos fine-grained mode subject to discussions
> -
>
> Key: SPARK-11857
> URL: https://issues.apache.org/jira/browse/SPARK-11857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> See discussions in
> http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html
> and
> http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11857:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove Mesos fine-grained mode subject to discussions
> -
>
> Key: SPARK-11857
> URL: https://issues.apache.org/jira/browse/SPARK-11857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See discussions in
> http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html
> and
> http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365190#comment-15365190
 ] 

Apache Spark commented on SPARK-11857:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14078

> Remove Mesos fine-grained mode subject to discussions
> -
>
> Key: SPARK-11857
> URL: https://issues.apache.org/jira/browse/SPARK-11857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> See discussions in
> http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html
> and
> http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365174#comment-15365174
 ] 

Bryan Cutler commented on SPARK-16403:
--

I'm working on this

> Example cleanup and fix minor issues
> 
>
> Key: SPARK-16403
> URL: https://issues.apache.org/jira/browse/SPARK-16403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, PySpark
>Reporter: Bryan Cutler
>
> General cleanup of examples, focused on PySpark ML, to remove unused imports, 
> sync with Scala examples, improve consistency and fix minor issues such as 
> arg checks etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-16403:
-
Priority: Trivial  (was: Major)

> Example cleanup and fix minor issues
> 
>
> Key: SPARK-16403
> URL: https://issues.apache.org/jira/browse/SPARK-16403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> General cleanup of examples, focused on PySpark ML, to remove unused imports, 
> sync with Scala examples, improve consistency and fix minor issues such as 
> arg checks etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16403) Example cleanup and fix minor issues

2016-07-06 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-16403:


 Summary: Example cleanup and fix minor issues
 Key: SPARK-16403
 URL: https://issues.apache.org/jira/browse/SPARK-16403
 Project: Spark
  Issue Type: Sub-task
  Components: Examples, PySpark
Reporter: Bryan Cutler


General cleanup of examples, focused on PySpark ML, to remove unused imports, 
sync with Scala examples, improve consistency and fix minor issues such as arg 
checks etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16402) JDBC source: Implement save API

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16402:


Assignee: Apache Spark

> JDBC source: Implement save API
> ---
>
> Key: SPARK-16402
> URL: https://issues.apache.org/jira/browse/SPARK-16402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, we are unable to call the `save` API of `DataFrameWriter` when the 
> source is JDBC. For example, 
> {noformat}
> df.write
>   .format("jdbc")
>   .option("url", url1)
>   .option("dbtable", "TEST.TRUNCATETEST")
>   .option("user", "testUser")
>   .option("password", "testPass")
>   .save() 
> {noformat}
> The error message users will get is like
> {noformat}
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
> allow create table as select.
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
> allow create table as select.
> {noformat}
> However, the `save` API is very common for all the data sources, like parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16402) JDBC source: Implement save API

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365150#comment-15365150
 ] 

Apache Spark commented on SPARK-16402:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14077

> JDBC source: Implement save API
> ---
>
> Key: SPARK-16402
> URL: https://issues.apache.org/jira/browse/SPARK-16402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we are unable to call the `save` API of `DataFrameWriter` when the 
> source is JDBC. For example, 
> {noformat}
> df.write
>   .format("jdbc")
>   .option("url", url1)
>   .option("dbtable", "TEST.TRUNCATETEST")
>   .option("user", "testUser")
>   .option("password", "testPass")
>   .save() 
> {noformat}
> The error message users will get is like
> {noformat}
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
> allow create table as select.
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
> allow create table as select.
> {noformat}
> However, the `save` API is very common for all the data sources, like parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16402) JDBC source: Implement save API

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16402:


Assignee: (was: Apache Spark)

> JDBC source: Implement save API
> ---
>
> Key: SPARK-16402
> URL: https://issues.apache.org/jira/browse/SPARK-16402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we are unable to call the `save` API of `DataFrameWriter` when the 
> source is JDBC. For example, 
> {noformat}
> df.write
>   .format("jdbc")
>   .option("url", url1)
>   .option("dbtable", "TEST.TRUNCATETEST")
>   .option("user", "testUser")
>   .option("password", "testPass")
>   .save() 
> {noformat}
> The error message users will get is like
> {noformat}
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
> allow create table as select.
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
> allow create table as select.
> {noformat}
> However, the `save` API is very common for all the data sources, like parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16402) JDBC source: Implement save API

2016-07-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16402:
---

 Summary: JDBC source: Implement save API
 Key: SPARK-16402
 URL: https://issues.apache.org/jira/browse/SPARK-16402
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Currently, we are unable to call the `save` API of `DataFrameWriter` when the 
source is JDBC. For example, 
{noformat}
df.write
  .format("jdbc")
  .option("url", url1)
  .option("dbtable", "TEST.TRUNCATETEST")
  .option("user", "testUser")
  .option("password", "testPass")
  .save() 
{noformat}
The error message users will get is like
{noformat}
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
allow create table as select.
java.lang.RuntimeException: 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not 
allow create table as select.
{noformat}

However, the `save` API is very common for all the data sources, like parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365117#comment-15365117
 ] 

Charles Allen commented on SPARK-16379:
---

That's great, thanks a ton!

> Spark on mesos is broken due to race condition in Logging
> -
>
> Key: SPARK-16379
> URL: https://issues.apache.org/jira/browse/SPARK-16379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that everything works fine.
> I spotted that when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing mesos on your machine and trying to 
> connect with spark-shell from the bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> (UPID=master@127.0.1.1:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> master@127.0.1.1:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I verified it also by changing @transient lazy val log to def and it works as 
> expected.
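
For reference, a minimal sketch of the two variants being compared; the traits 
and logger setup here are illustrative, not Spark's actual Logging trait:
{noformat}
import org.slf4j.{Logger, LoggerFactory}

// The variant the linked commit introduced: initialized once per (deserialized)
// instance, with the lazy-val initializer synchronizing on the instance, which
// is where the reported race/hang can bite.
trait LazyValLogging {
  @transient lazy val log: Logger = LoggerFactory.getLogger(getClass)
}

// The variant verified to work in the description: resolved on every call,
// slightly more overhead but no lazy-val initialization lock to contend on.
trait DefLogging {
  def log: Logger = LoggerFactory.getLogger(getClass)
}
{noformat}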



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging

2016-07-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365112#comment-15365112
 ] 

Sean Owen commented on SPARK-16379:
---

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20Priority%20%3D%20Blocker%20AND%20%22Target%20Version%2Fs%22%20%3D%202.0.0%20AND%20Resolution%20%3D%20Unresolved
You can filter JIRA however you like; Target Version should be pretty reliable.

> Spark on mesos is broken due to race condition in Logging
> -
>
> Key: SPARK-16379
> URL: https://issues.apache.org/jira/browse/SPARK-16379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that everything works fine.
> I spotted that when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing mesos on your machine and trying to 
> connect with spark-shell from the bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> (UPID=master@127.0.1.1:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> master@127.0.1.1:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I verified it also by changing @transient lazy val log to def and it works as 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365109#comment-15365109
 ] 

Charles Allen commented on SPARK-16379:
---

[~srowen] is there a list of blockers somewhere? I also want to get branch-2.0 
tested from our side but would like to know what sort of caveats to expect.

> Spark on mesos is broken due to race condition in Logging
> -
>
> Key: SPARK-16379
> URL: https://issues.apache.org/jira/browse/SPARK-16379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that everything works fine.
> I spotted that when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing mesos on your machine and trying to 
> connect with spark-shell from the bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> (UPID=master@127.0.1.1:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> master@127.0.1.1:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I verified it also by changing @transient lazy val log to def and it works as 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365062#comment-15365062
 ] 

Ryan Blue commented on SPARK-16344:
---

It looks like the main change is to specifically catch the 3-level name 
structure, {{list-name (LIST) -> "list" -> "element"}}. The problem with this 
approach is that it doesn't solve the problem entirely either.

Let me try to give a bit more background. In parquet-avro, there are two 
{{isElementType}} methods; one in the schema converter and one in the record 
converter. The one in the schema converter will guess whether the Parquet type 
uses a 3-level list or a 2-level list when it can't be determined according to 
the spec's backward-compatibility rules. That guess assumes a 2-level structure 
by default and at the next major release will guess a 3-level structure. (This 
can be controlled by a property.) But this is only used when the reader doesn't 
supply a read schema / expected schema and the code has to convert from 
Parquet's type to get one. Ideally, we always have a read schema from the file, 
from the reader's expected class (if using Java objects), or from the reader 
passing in the expected schema. That's why the other {{isElementType}} method 
exists: it looks at the expected schema and the file schema to determine 
whether the caller has passed in a schema with the extra single-field 
list/element struct.

That code has to distinguish between two cases for a 3-level list:
1. When the caller expects {{List}}, with the extra 
record layer that was originally returned when Avro only knew about 2-level 
lists.
2. When the caller expects {{List}}, without an extra layer.

The code currently assumes that if the element schema appears to match the 
repeated type, the caller has passed a schema indicating case 1. This issue 
points out that the matching isn't perfect: an element with a single field 
named "element" will incorrectly match case 1 when it is really case 2. The 
problem with the solution in PR #14013, if it were applied to Avro, is that it 
breaks if the caller is actually passing a schema for case 1.

I'm not sure whether Spark works like Avro and has two {{isElementType}} 
methods. If Spark can guarantee that the table schema is never case 1, then it 
is correct to use the logic in the PR. I don't think that's always the case 
because the table schema may come from user objects in a Dataset or from the 
Hive MetaStore. But, this may be a reasonable heuristic if you think case 2 is 
far more common than case 1. For parquet-avro, I think the user supplying a 
single-field record with the inner field named "element" is rare enough that it 
doesn't really matter, but it's up to you guys in the Spark community on this 
issue.

One last thing: based on the rest of the schema structure, there should be only 
one way to match the expected schema to the file schema. You could always try 
both and fall back to the other case, or have a more complicated 
{{isElementType}} method that recurses down the sub-trees to find a match. I 
didn't implement this in parquet-avro because I think it's a rare problem and 
not worth the time.
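
To make the ambiguity concrete, here is a hypothetical check written against Spark's Catalyst types rather than parquet-avro; the helper name and the exact rule are illustrative only:

{code}
import org.apache.spark.sql.types._

// Given the Catalyst type the caller expects for the elements of a 3-level
// Parquet LIST, guess whether the caller wants the extra single-field wrapper
// struct (case 1) or the bare element (case 2). The guess is ambiguous exactly
// when the expected element is a struct with a single field named "element".
def expectsWrapperStruct(expectedElement: DataType): Boolean = expectedElement match {
  case StructType(Array(StructField("element", _, _, _))) =>
    // Could be case 1 (legacy wrapper struct) or case 2 (a genuine single-field
    // struct that happens to be named "element"); assuming case 1 here is the
    // misfire this issue describes.
    true
  case _ => false
}
{code}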

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> This is a weird corner case. Users may hit this issue if they have a schema 
> that
> # has an array field whose element type is a struct, and
> # the struct has one and only one field, and
> # that field is named as "element".
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42.toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  ||-- element: struct (containsNull = true)
> //  |||-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> 

[jira] [Resolved] (SPARK-16379) Spark on mesos is broken due to race condition in Logging

2016-07-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16379.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0

> Spark on mesos is broken due to race condition in Logging
> -
>
> Key: SPARK-16379
> URL: https://issues.apache.org/jira/browse/SPARK-16379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that, everything works fine.
> I spotted this when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing Mesos on your machine and trying to 
> connect with the Spark shell from the bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> (UPID=master@127.0.1.1:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> master@127.0.1.1:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I verified it also by changing @transient lazy val log to def and it works as 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16400) Remove InSet filter pushdown from Parquet

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16400:


Assignee: Apache Spark

> Remove InSet filter pushdown from Parquet
> -
>
> Key: SPARK-16400
> URL: https://issues.apache.org/jira/browse/SPARK-16400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Filter pushdown that needs to be evaluated per row is not useful to Spark, 
> since parquet-mr's own filtering is likely to be less performant than Spark's 
> due to boxing and virtual function dispatches.
> To simplify the code base, we should remove the InSet filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16400) Remove InSet filter pushdown from Parquet

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365035#comment-15365035
 ] 

Apache Spark commented on SPARK-16400:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14076

> Remove InSet filter pushdown from Parquet
> -
>
> Key: SPARK-16400
> URL: https://issues.apache.org/jira/browse/SPARK-16400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> Filter pushdown that needs to be evaluated per row is not useful to Spark, 
> since parquet-mr's own filtering is likely to be less performant than Spark's 
> due to boxing and virtual function dispatches.
> To simplify the code base, we should remove the InSet filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16400) Remove InSet filter pushdown from Parquet

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16400:


Assignee: (was: Apache Spark)

> Remove InSet filter pushdown from Parquet
> -
>
> Key: SPARK-16400
> URL: https://issues.apache.org/jira/browse/SPARK-16400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> Filter pushdown that needs to be evaluated per row is not useful to Spark, 
> since parquet-mr's own filtering is likely to be less performant than Spark's 
> due to boxing and virtual function dispatches.
> To simplify the code base, we should remove the InSet filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-07-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15740:
--
Fix Version/s: (was: 2.0.0)
   2.1.0
   2.0.1

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Antonio Murgia
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-07-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15740:
--
Target Version/s:   (was: 2.0.0)

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Antonio Murgia
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364972#comment-15364972
 ] 

Jeff Zhang commented on SPARK-16367:


[~gae...@xeberon.net] I still don't understand how the binary wheels work on 
machines with different OSes, since you build the wheels on the client machine. 

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy packages written in Scala, it is recommended to build big fat jar 
> files. This allows you to have all dependencies in one package, so the only 
> "cost" is the copy time to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to bother IT to deploy the 
> packages into the virtualenv of each node. 
> *Previous approaches* 
> I based the current proposal on the two following bugs related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal was to merge these two, in order to support 
> wheel installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy and all the different 
> wheel versions available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install this package very quickly (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate all the wheels of all the 
> packages used for a given project inside a "virtualenv". This is called a 
> "wheelhouse". You can even skip this compilation step and retrieve the wheels 
> directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, for the case where the Spark 
> cluster does not have any internet connectivity or access to a PyPI mirror. 
> In this case the simplest way to deploy a project with several dependencies 
> is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying it on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script 
> into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend specifying all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or packages. Use 
> [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of 
> maintaining a setup.py file really easy 
> -- create a virtualenv if you are not already in one: 
> {code} 
> virtualenv env 
> {code} 
> -- Work in your environment, define the requirements you need in 
> {{requirements.txt}}, and do all the {{pip install}} you need. 
> - create 

[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-07-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15740:
--
Assignee: Antonio Murgia

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Antonio Murgia
>Priority: Critical
> Fix For: 2.0.0
>
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364970#comment-15364970
 ] 

Apache Spark commented on SPARK-16401:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14075

> Data Source APIs: Extending RelationProvider and CreatableRelationProvider 
> Without SchemaRelationProvider
> -
>
> Key: SPARK-16401
> URL: https://issues.apache.org/jira/browse/SPARK-16401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Critical
>
> When users try to implement a data source API by extending only 
> RelationProvider and CreatableRelationProvider, they will hit an error when 
> resolving the relation.
> {noformat}
> spark.read
> .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
>   .load()
>   .write.
> format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
>   .save()
> {noformat}
> The error they hit is like
> {noformat}
> xyzDataSource does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: xyzDataSource does not allow 
> user-specified schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16401:


Assignee: Apache Spark

> Data Source APIs: Extending RelationProvider and CreatableRelationProvider 
> Without SchemaRelationProvider
> -
>
> Key: SPARK-16401
> URL: https://issues.apache.org/jira/browse/SPARK-16401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> When users try to implement a data source API by extending only 
> RelationProvider and CreatableRelationProvider, they will hit an error when 
> resolving the relation.
> {noformat}
> spark.read
> .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
>   .load()
>   .write.
> format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
>   .save()
> {noformat}
> The error they hit is like
> {noformat}
> xyzDataSource does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: xyzDataSource does not allow 
> user-specified schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16401:


Assignee: (was: Apache Spark)

> Data Source APIs: Extending RelationProvider and CreatableRelationProvider 
> Without SchemaRelationProvider
> -
>
> Key: SPARK-16401
> URL: https://issues.apache.org/jira/browse/SPARK-16401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Critical
>
> When users try to implement a data source API by extending only 
> RelationProvider and CreatableRelationProvider, they will hit an error when 
> resolving the relation.
> {noformat}
> spark.read
> .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
>   .load()
>   .write.
> format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
>   .save()
> {noformat}
> The error they hit is like
> {noformat}
> xyzDataSource does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: xyzDataSource does not allow 
> user-specified schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-07-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15740.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13509
[https://github.com/apache/spark/pull/13509]

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
> Fix For: 2.0.0
>
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider

2016-07-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16401:
---

 Summary: Data Source APIs: Extending RelationProvider and 
CreatableRelationProvider Without SchemaRelationProvider
 Key: SPARK-16401
 URL: https://issues.apache.org/jira/browse/SPARK-16401
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li
Priority: Critical


When users try to implement a data source API by extending only 
RelationProvider and CreatableRelationProvider, they will hit an error when 
resolving the relation.
{noformat}
spark.read
.format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .load()
  .write.
format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .save()
{noformat}

The error they hit is like
{noformat}
xyzDataSource does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: xyzDataSource does not allow 
user-specified schemas.;
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
{noformat}
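
For reference, a minimal sketch (not Spark's code; the class name follows the one used in the snippet above) of a source that extends only RelationProvider and CreatableRelationProvider, i.e. the combination that hits this error:

{code}
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSourceWithoutUserSpecifiedSchema
  extends RelationProvider with CreatableRelationProvider {

  // Read path: with no SchemaRelationProvider, the schema must come from the
  // source itself rather than from the user.
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = new BaseRelation {
    override val sqlContext: SQLContext = ctx
    override val schema: StructType = StructType(StructField("value", StringType) :: Nil)
  }

  // Write path: a real source would persist `data` here; this sketch simply
  // returns a relation describing what was "written".
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation =
    createRelation(ctx, parameters)
}
{code}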



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364959#comment-15364959
 ] 

Apache Spark commented on SPARK-16371:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14074

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16371.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16212) code cleanup of kafka-0-8 to match review feedback on 0-10

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364926#comment-15364926
 ] 

Apache Spark commented on SPARK-16212:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/14073

> code cleanup of kafka-0-8 to match review feedback on 0-10
> --
>
> Key: SPARK-16212
> URL: https://issues.apache.org/jira/browse/SPARK-16212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16400) Remove InSet filter pushdown from Parquet

2016-07-06 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16400:
---

 Summary: Remove InSet filter pushdown from Parquet
 Key: SPARK-16400
 URL: https://issues.apache.org/jira/browse/SPARK-16400
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin


Filter pushdown that needs to be evaluated per row is not useful to Spark, 
since parquet-mr's own filtering is likely to be less performant than Spark's due 
to boxing and virtual function dispatches.

To simplify the code base, we should remove the InSet filters.
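
For context, a minimal example of the kind of query this touches (illustrative only; whether the predicate is pushed down is decided inside the Parquet data source, and Catalyst may turn a long IN list into an InSet expression):

{code}
// `spark` is the SparkSession. An IN-list filter like this is the sort of
// per-row predicate the change leaves to Spark's own evaluation instead of
// handing it to parquet-mr.
val df = spark.read.parquet("/path/to/table")
df.filter(df("user_id").isin(1, 2, 3)).count()
{code}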




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-07-06 Thread Vladimir Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363438#comment-15363438
 ] 

Vladimir Ivanov edited comment on SPARK-16334 at 7/6/16 6:54 PM:
-

Hi, we discovered a problem with the same stacktrace in Spark 2.0. In our case 
it's thrown during a {noformat}DataFrame.rdd{noformat} call. Moreover, it somehow 
depends on the volume of data, because it is not thrown when we change the filter 
criteria accordingly. We used SparkSQL to write these parquet files and didn't 
explicitly specify the WriterVersion option, so I believe whatever version is set by 
default was used.


was (Author: vivanov):
Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case 
it's thrown during DataFrame.rdd call. Moreover it somehow depends on volume of 
data, because it is not thrown when we change filter criteria accordingly. We 
used SparkSQL to write these parquet files and didn't explicitly specify 
WriterVersion option so I believe whatever version is set by default was used.

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
>  Labels: sql
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Work on 1.6.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-07-06 Thread Vladimir Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363438#comment-15363438
 ] 

Vladimir Ivanov edited comment on SPARK-16334 at 7/6/16 6:54 PM:
-

Hi, we discovered a problem with the same stacktrace in Spark 2.0. In our case 
it's thrown during a DataFrame.rdd call. Moreover, it somehow depends on the volume of 
data, because it is not thrown when we change the filter criteria accordingly. We 
used SparkSQL to write these parquet files and didn't explicitly specify the 
WriterVersion option, so I believe whatever version is set by default was used.


was (Author: vivanov):
Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case 
it's thrown during {noformat}DataFrame.rdd{noformat} call. Moreover it somehow 
depends on volume of data, because it is not thrown when we change filter 
criteria accordingly. We used SparkSQL to write these parquet files and didn't 
explicitly specify WriterVersion option so I believe whatever version is set by 
default was used.

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
>  Labels: sql
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Work on 1.6.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364870#comment-15364870
 ] 

Dongjoon Hyun commented on SPARK-16387:
---

Oh, it means Pull Request.
Since you know the `JdbcDialect` class, I think you can make a code patch for that.

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is the code (imports are omitted):
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> The resulting error is:
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word  has to be quoted
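
To illustrate what the quoting amounts to for MySQL (a sketch only, not the actual patch): identifiers are wrapped in backticks, so the generated DDL should read {{CREATE TABLE jira_test (`order` TEXT)}} rather than the failing {{CREATE TABLE jira_test (order TEXT)}}.

{code}
// Hypothetical helper, for illustration only: MySQL quotes identifiers with backticks.
def quoteMySqlIdentifier(name: String): String = s"`$name`"

val columnDdl = s"${quoteMySqlIdentifier("order")} TEXT"
println(s"CREATE TABLE jira_test ($columnDdl)")  // CREATE TABLE jira_test (`order` TEXT)
{code}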



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8425) Add blacklist mechanism for task scheduling

2016-07-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364834#comment-15364834
 ] 

Thomas Graves commented on SPARK-8425:
--

Added some questions to the design doc

> Add blacklist mechanism for task scheduling
> ---
>
> Key: SPARK-8425
> URL: https://issues.apache.org/jira/browse/SPARK-8425
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Imran Rashid
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16399:


Assignee: Apache Spark

> Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
> --
>
> Key: SPARK-16399
> URL: https://issues.apache.org/jira/browse/SPARK-16399
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Manoj Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even 
> though higher versions of Python seem to be installed.
> It would be better to force "PYSPARK_PYTHON" to "python" instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"

2016-07-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16399:


Assignee: (was: Apache Spark)

> Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
> --
>
> Key: SPARK-16399
> URL: https://issues.apache.org/jira/browse/SPARK-16399
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even 
> though higher versions of Python seem to be installed.
> It would be better to force "PYSPARK_PYTHON" to "python" instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"

2016-07-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364814#comment-15364814
 ] 

Apache Spark commented on SPARK-16399:
--

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/14016

> Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
> --
>
> Key: SPARK-16399
> URL: https://issues.apache.org/jira/browse/SPARK-16399
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even 
> though higher versions of Python seem to be installed.
> It would be better to force "PYSPARK_PYTHON" to "python" instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"

2016-07-06 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-16399:
---

 Summary: Set PYSPARK_PYTHON to point to "python" instead of 
"python2.7"
 Key: SPARK-16399
 URL: https://issues.apache.org/jira/browse/SPARK-16399
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Manoj Kumar
Priority: Minor


Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even though 
higher versions of Python seem to be installed.
It would be better to force "PYSPARK_PYTHON" to "python" instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16394) Timestamp conversion error in pyspark.sql.Row.asDict because of timezones

2016-07-06 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364792#comment-15364792
 ] 

Martin Tapp commented on SPARK-16394:
-

It seems the root problem is that the conversion from Spark's internal 
representation to a pyspark Row object already introduces the timezone 
conversion problem. Hence, the only fix we have for now is to cast the column 
to StringType.

> Timestamp conversion error in pyspark.sql.Row.asDict because of timezones
> -
>
> Key: SPARK-16394
> URL: https://issues.apache.org/jira/browse/SPARK-16394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Martin Tapp
>Priority: Minor
>
> We use DataFrame.map to convert each row to a dictionary using Row.asDict(). 
> The problem occurs when a Timestamp column is converted. It seems the 
> Timestamp gets converted to a naive Python datetime. This causes processing 
> errors since all naive datetimes get adjusted to the process' timezone. For 
> instance, a Timestamp with a time of midnight sees its time bounce based on 
> the local timezone (+/- x hours).
> The current fix is to apply the pytz.utc timezone to each datetime instance.
> The proposed solution is to make all datetime instances aware and use the 
> pytz.utc timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16394) Timestamp conversion error in pyspark.sql.Row because of timezones

2016-07-06 Thread Martin Tapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Tapp updated SPARK-16394:

Summary: Timestamp conversion error in pyspark.sql.Row because of timezones 
 (was: Timestamp conversion error in pyspark.sql.Row.asDict because of 
timezones)

> Timestamp conversion error in pyspark.sql.Row because of timezones
> --
>
> Key: SPARK-16394
> URL: https://issues.apache.org/jira/browse/SPARK-16394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Martin Tapp
>Priority: Minor
>
> We use DataFrame.map to convert each row to a dictionary using Row.asDict(). 
> The problem occurs when a Timestamp column is converted. It seems the 
> Timestamp gets converted to a naive Python datetime. This causes processing 
> errors since all naive datetimes get adjusted to the process' timezone. For 
> instance, a Timestamp with a time of midnight sees its time bounce based on 
> the local timezone (+/- x hours).
> The current fix is to apply the pytz.utc timezone to each datetime instance.
> The proposed solution is to make all datetime instances aware and use the 
> pytz.utc timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364762#comment-15364762
 ] 

Charles Allen commented on SPARK-6028:
--

It was a ClassLoader problem on my side. The loader was pulling in 1.5.2 classes 
for the driver but 1.6.1 classes in the tasks.

Ideally the default behavior would not have changed: the tasks would have 
launched, and class-version conflicts would then have shown up in the logs, 
rather than surfacing as a URI naming conflict.

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364757#comment-15364757
 ] 

Charles Allen commented on SPARK-6028:
--

It was semi-related. The patch changed the default from Akka to Netty, and I had 
an improper classloader in my app which was loading the 1.5.2 classes instead 
of the 1.6.1 classes.

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16398) Make cancelJob and cancelStage API public

2016-07-06 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-16398:
---
Description: Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs 
public. This allows applications to use {{SparkListener}} to do their own 
management of jobs via events, but without using the REST API.  (was: Make the 
SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This allows 
applications to use `SparkListener` to do their own management of jobs via 
events, but without using the REST API.)

> Make cancelJob and cancelStage API public
> -
>
> Key: SPARK-16398
> URL: https://issues.apache.org/jira/browse/SPARK-16398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Mitesh
>Priority: Trivial
>
> Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This 
> allows applications to use {{SparkListener}} to do their own management of 
> jobs via events, but without using the REST API.
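
A minimal sketch of the intended usage, assuming {{cancelJob}} is made public as requested (the listener and the job-id bookkeeping below are illustrative only):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Track job ids via listener events, then cancel a job on demand.
class JobTracker(sc: SparkContext) extends SparkListener {
  @volatile private var lastJobId: Option[Int] = None

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    lastJobId = Some(jobStart.jobId)
  }

  // Relies on the proposed public SparkContext.cancelJob(jobId) API.
  def cancelLastJob(): Unit = lastJobId.foreach(sc.cancelJob)
}

// Registration, for illustration: sc.addSparkListener(new JobTracker(sc))
{code}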



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


