[GitHub] spark pull request #12337: [SPARK-15566] Expose null checking function to Py...

2017-02-09 Thread kevincox
Github user kevincox closed the pull request at: https://github.com/apache/spark/pull/12337

[GitHub] spark pull request #12335: [SPARK-11321] [SQL] Python non null udfs

2017-02-13 Thread kevincox
Github user kevincox closed the pull request at: https://github.com/apache/spark/pull/12335

[GitHub] spark issue #12335: [SPARK-11321] [SQL] Python non null udfs

2017-02-13 Thread kevincox
Github user kevincox commented on the issue: https://github.com/apache/spark/pull/12335 I unfortunately don't have time to work on this. Anyone else should feel free to pick this up if they find it useful.

[GitHub] spark issue #12337: [SPARK-15566] Expose null checking function to Python la...

2016-10-12 Thread kevincox
Github user kevincox commented on the issue: https://github.com/apache/spark/pull/12337 @holdenk The point is that this is inline. It doesn't require evaluating the whole dataframe and counting the nulls you find. Instead you use this on a column and it asserts that every value is non-null.

[GitHub] spark issue #12337: [SPARK-15566] Expose null checking function to Python la...

2016-10-12 Thread kevincox
Github user kevincox commented on the issue: https://github.com/apache/spark/pull/12337 The thing is that if you do `take(1)` you have to evaluate the dataframe (at least partially) to see if there are any nulls. It has very significant overhead. This version, which works inline, avoids that overhead.
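
A minimal sketch of the eager check being compared against here, using only stock PySpark APIs (the tiny DataFrame is made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (None,)], ["x"])

    # Eager approach: this launches a (partial) Spark job just to look for nulls.
    if df.filter(df["x"].isNull()).take(1):
        print("column x contains nulls")

The inline expression avoids this extra pass: the assertion happens while the column is being computed, so no separate scan of the data is needed.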

[GitHub] spark pull request: [SPARK-15566] Expose null checking function to...

2016-05-26 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12337#issuecomment-221931980 Ok, I created a matching issue in JIRA.

[GitHub] spark pull request: [SPARK-15567] Refactor ExecutorAllocationManag...

2016-05-26 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/10761#issuecomment-221933006 Can this be reopened? I now have time to work on it.

[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215498477 I've added some tests but I'm having trouble getting the test suite to run locally before or after my changes. So I'm kinda just praying that everything passes.

[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215525434 @davies You mean to support non-null return values? I don't think I know enough scala to automatically infer that.

[GitHub] spark pull request: Kevincox spark timestamp

2015-08-19 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/8319 Kevincox spark timestamp You can merge this pull request into a Git repository by running: $ git pull https://github.com/Shopify/spark kevincox-spark-timestamp

[GitHub] spark pull request: Ignore

2015-08-19 Thread kevincox
Github user kevincox closed the pull request at: https://github.com/apache/spark/pull/8319

[GitHub] spark pull request: Fix datetime parsing in SparkSQL.

2015-08-24 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/8396 Fix datetime parsing in SparkSQL. This fixes https://issues.apache.org/jira/browse/SPARK-9794 by using a real ISO 8601 parser (courtesy of the XML component of the standard Java library).
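
As a rough illustration of the behaviour this targets (the patch itself is on the Scala side, in DateTimeUtils.scala), these are the kinds of ISO 8601 strings a strict parser should accept when cast to a timestamp; the DataFrame here is made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    rows = [("2015-08-24T12:34:56Z",),
            ("2015-08-24T12:34:56+02:00",),
            ("2015-08-24T12:34:56.789-07:00",)]
    df = spark.createDataFrame(rows, ["s"])

    # String-to-timestamp casts funnel through DateTimeUtils on the JVM side,
    # which is the file this pull request modifies.
    df.select(col("s").cast("timestamp").alias("ts")).show(truncate=False)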

[GitHub] spark pull request: [SPARK-9794] [SQL] Fix datetime parsing in Spa...

2015-08-24 Thread kevincox
Github user kevincox commented on a diff in the pull request: https://github.com/apache/spark/pull/8396#discussion_r37762298 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -107,30 +107,21 @@ object DateTimeUtils

[GitHub] spark pull request: [SPARK-8202] [PYSPARK] fix infinite loop durin...

2015-06-09 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/6714#issuecomment-110364692 Also, why keep the batch size once you know you are going to spill to disk? All that does is force you to draw from the iterator in batches. Once you know how big the chunks can be you could use that as the batch size.

[GitHub] spark pull request: [SPARK-8202] [PYSPARK] fix infinite loop durin...

2015-06-09 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/6714#issuecomment-110471373 @davies No, it appears that you just changed the original memory limit. I am saying once you figure out how large the chunk can be you should set that as the batch size.
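
A purely conceptual sketch of that suggestion, not PySpark's actual spill code (the names and limits here are invented): once a batch has been measured against the memory limit, grow the batch size instead of keeping the small initial one.

    import sys

    def write_in_batches(iterator, write_chunk, initial_batch=100, memory_limit=64 << 20):
        """Draw items from `iterator` in batches and hand each batch to
        `write_chunk` (a stand-in for serializing a chunk to disk)."""
        batch_size = initial_batch
        batch, batch_bytes = [], 0
        for item in iterator:
            batch.append(item)
            batch_bytes += sys.getsizeof(item)
            if len(batch) >= batch_size:
                if batch_bytes < memory_limit:
                    # The batch fit comfortably under the limit, so scale future
                    # batches up rather than keeping the initial small size.
                    batch_size = int(batch_size * memory_limit / max(batch_bytes, 1))
                write_chunk(batch)
                batch, batch_bytes = [], 0
        if batch:
            write_chunk(batch)

For example, `write_in_batches(iter(range(100_000)), lambda chunk: None)` grows the batch size after the first 100 items instead of writing a thousand tiny chunks.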

[GitHub] spark pull request: Kevincox hanging no executor fix

2015-08-10 Thread kevincox
Github user kevincox closed the pull request at: https://github.com/apache/spark/pull/8075

[GitHub] spark pull request: Kevincox hanging no executor fix

2015-08-10 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/8075 Kevincox hanging no executor fix Merge in https://github.com/apache/spark/pull/7716 @angelini @solackerman @kmtaylor-github

[GitHub] spark pull request: [SPARK-11319][SQL] Making StructField's nullab...

2016-03-20 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/11785#issuecomment-197957375 :+1:

[GitHub] spark pull request: [SPARK-10447][WIP][PYSPARK] upgrade pyspark to...

2015-10-16 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8615#issuecomment-148842137 It's a shame that this has stalled, because there is a huge performance improvement in 0.9 that I was hoping to cash in on.

[GitHub] spark pull request: [SPARK-10447][WIP][PYSPARK] upgrade pyspark to...

2015-10-17 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8615#issuecomment-148933488 Of course, I just wanted to encourage the work on this :smiley: @holdenk I'm quite busy at the moment but if I find time I will definitely try to help you.

[GitHub] spark pull request: [SPARK-10447][PYSPARK] upgrade pyspark to py4j...

2015-10-20 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8615#issuecomment-149545940 Awesome work @holdenk! I would love to see this merged.

[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-12 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/12335 [SPARK-11321] [SQL] Python non null udfs ## What changes were proposed in this pull request? This patch allows Python UDFs to return non-nullable values. ## How was this patch tested?
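
For reference, a small sketch of the behaviour this is aimed at. The nullable-by-default schema is what a stock `udf` produces today; the `nullable=False` knob mentioned in the comments below is hypothetical, since this pull request was never merged:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(3)

    # A stock Python UDF's output column is always marked nullable, even though
    # this particular function can never return None.
    inc = udf(lambda x: x + 1, IntegerType())
    df.select(inc("id").alias("inc")).printSchema()
    # |-- inc: integer (nullable = true)

    # The idea behind SPARK-11321 is to let the author declare the result
    # non-nullable, e.g. something like udf(f, IntegerType(), nullable=False)
    # (a hypothetical signature, not part of the released API).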

[GitHub] spark pull request: Expose null checking function to Python land.

2016-04-12 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/12337 Expose null checking function to Python land. This allows efficiently mapping a column that shouldn't contain any nulls to a column that Spark knows doesn't have any nulls.
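
A hedged sketch of the intended usage. The name `assert_not_null` is hypothetical (the patch adds a wrapper in python/pyspark/sql/functions.py, but it was never merged); the point is the schema effect, not the name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])
    df.printSchema()   # |-- x: long (nullable = true), even though no value is null

    # Hypothetical usage per this pull request:
    # from pyspark.sql.functions import assert_not_null   (not a released API)
    # clean = df.withColumn("x", assert_not_null(df["x"]))
    # clean.printSchema()   # |-- x: long (nullable = false)
    # A null hit at runtime would raise an error instead of propagating, so no
    # separate pass over the data is needed to establish the guarantee.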

[GitHub] spark pull request: Expose null checking function to Python land.

2016-04-12 Thread kevincox
Github user kevincox commented on a diff in the pull request: https://github.com/apache/spark/pull/12337#discussion_r59466558 --- Diff: python/pyspark/sql/functions.py --- @@ -138,6 +138,10 @@ def _(): ' eliminated.' } +_fun

[GitHub] spark pull request: [SPARK-11319][SQL] Making StructField's nullab...

2016-04-14 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/11785#issuecomment-210144539 I would love for this to be a constraint.

[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-18 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-211574940 Sure thing. It'll be a while until I get around to it but I will make sure to do that.

[GitHub] spark pull request: Refactor ExecutorAllocationManager.

2016-04-21 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/10761#issuecomment-213100491 @HyukjinKwon The maintainers didn't seem interested and I was busy. I might pursue this further this summer.

[GitHub] spark pull request: Only add hive to classpath if HIVE_HOME is set...

2015-10-06 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/8994 Only add hive to classpath if HIVE_HOME is set. Currently, if it isn't set, it scans `/lib/*` and adds every dir to the classpath, which makes the env too large and every command crashes.
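
The change itself is to the build shell scripts, but the guard is simple; here is a rough Python rendering of the logic (paths and variable names taken from the description above, not from the script verbatim):

    import glob
    import os

    classpath = []
    hive_home = os.environ.get("HIVE_HOME")
    if hive_home:
        # Only pull in Hive's jars when the user has actually pointed at a Hive
        # install; with HIVE_HOME unset the old behaviour effectively scanned /lib/*.
        classpath.extend(sorted(glob.glob(os.path.join(hive_home, "lib", "*.jar"))))
    print(os.pathsep.join(classpath))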

[GitHub] spark pull request: [SPARK-10952] Only add hive to classpath if HI...

2015-10-06 Thread kevincox
Github user kevincox commented on a diff in the pull request: https://github.com/apache/spark/pull/8994#discussion_r41319534 --- Diff: build/sbt --- @@ -20,10 +20,12 @@ # When creating new tests for Spark SQL Hive, the HADOOP_CLASSPATH must contain the hive jars so

[GitHub] spark pull request: [SPARK-10952] Only add hive to classpath if HI...

2015-10-06 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8994#issuecomment-145995462 If you would prefer the change I can do it, just let me know.

[GitHub] spark pull request: [SPARK-10952] Only add hive to classpath if HI...

2015-10-06 Thread kevincox
Github user kevincox commented on a diff in the pull request: https://github.com/apache/spark/pull/8994#discussion_r41335754 --- Diff: build/sbt --- @@ -20,10 +20,12 @@ # When creating new tests for Spark SQL Hive, the HADOOP_CLASSPATH must contain the hive jars so

[GitHub] spark pull request: [SPARK-10306][WIP] Add dependency so sbt can r...

2015-10-08 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8471#issuecomment-146510965 Hello, I was running into the same issue building Spark and this patch fixed my issue. It appears that multiple people are running into this so it would be nice to get it merged.

[GitHub] spark pull request: Refactor ExecutorAllocationManager.

2016-01-14 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/10761 Refactor ExecutorAllocationManager. This changes ExecutorAllocationManager from a tree of if statements run on an interval to a clean event-driven state machine.
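
Purely illustrative of the shape of that refactor, not Spark's actual ExecutorAllocationManager code: state changes happen in event handlers rather than in a periodic scan of nested if statements.

    from enum import Enum, auto

    class State(Enum):
        IDLE = auto()
        SCALING_UP = auto()
        SCALING_DOWN = auto()

    class AllocationStateMachine:
        """Toy event-driven allocator: each scheduler event updates explicit state."""

        def __init__(self):
            self.state = State.IDLE
            self.pending_tasks = 0

        def on_tasks_submitted(self, n):
            self.pending_tasks += n
            self.state = State.SCALING_UP

        def on_tasks_finished(self, n):
            self.pending_tasks = max(0, self.pending_tasks - n)
            if self.pending_tasks == 0:
                self.state = State.SCALING_DOWN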

[GitHub] spark pull request: Refactor ExecutorAllocationManager.

2016-01-14 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/10761#issuecomment-171835801 Note that this refactoring was performed to make way for future features in the allocation manager.

[GitHub] spark pull request: Refactor ExecutorAllocationManager.

2016-01-14 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/10761#issuecomment-171850023 I'm implementing a system where Spark can reduce the number of executors in low-resource situations. This allows jobs to utilize an entire cluster when it is unn

[GitHub] spark pull request: Refactor ExecutorAllocationManager.

2016-01-14 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/10761#issuecomment-171851889 Fair enough. I tried to keep the tests as similar as possible so that it is quite clear that the functionality hasn't changed. I have also done some testing on m

[GitHub] spark pull request: Refactor ExecutorAllocationManager.

2016-01-18 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/10761#issuecomment-172707827 It is similar but it has a number of advantages including that it can be cleaner (you can wait until after you finish a task and prepare it for shuffling) and you get

[GitHub] spark pull request: [SPARK-9794] [SQL] Fix datetime parsing in Spa...

2015-09-08 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8396#issuecomment-138570024 Sorry, had a busy week. I can definitely add a couple of tests. The existing tests also verify that this is working with the previously tested formats (assuming that they still pass).

[GitHub] spark pull request: [SPARK-9794] [SQL] Fix datetime parsing in Spa...

2015-09-08 Thread kevincox
Github user kevincox commented on a diff in the pull request: https://github.com/apache/spark/pull/8396#discussion_r38929835 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -107,30 +107,21 @@ object DateTimeUtils

[GitHub] spark pull request: [SPARK-9794] [SQL] Fix datetime parsing in Spa...

2015-09-08 Thread kevincox
Github user kevincox commented on a diff in the pull request: https://github.com/apache/spark/pull/8396#discussion_r38939186 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -107,30 +107,21 @@ object DateTimeUtils

[GitHub] spark pull request: [SPARK-9794] [SQL] Fix datetime parsing in Spa...

2015-09-08 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8396#issuecomment-138599541 Tests added (there weren't any previously). I tried to ensure they covered all supported cases.

[GitHub] spark pull request: Update py4j to 0.9.

2015-09-11 Thread kevincox
GitHub user kevincox opened a pull request: https://github.com/apache/spark/pull/8722 Update py4j to 0.9. Py4J 0.9 has a performance improvement for `unescape_new_line` which significantly reduces the time required to transfer large strings. cc @angelini

[GitHub] spark pull request: Update py4j to 0.9.

2015-09-11 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8722#issuecomment-139636661 Updated to reference the right package.

[GitHub] spark pull request: Update py4j to 0.9.

2015-09-11 Thread kevincox
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/8722#issuecomment-139644930 It looks like it is.

[GitHub] spark pull request: Update py4j to 0.9.

2015-09-12 Thread kevincox
Github user kevincox closed the pull request at: https://github.com/apache/spark/pull/8722