Github user kevincox commented on the issue:
https://github.com/apache/spark/pull/12335
I unfortunately don't have time to work on this. Anyone else should feel
free to pick this up if they find it useful.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
Github user kevincox closed the pull request at:
https://github.com/apache/spark/pull/12335
Github user kevincox closed the pull request at:
https://github.com/apache/spark/pull/12337
Github user kevincox commented on the issue:
https://github.com/apache/spark/pull/12337
The thing is that if you do `take(1)` you have to evaluate the dataframe
(at least partially) to see if there are any nulls, which has very significant
overhead. This version, which works inline
Github user kevincox commented on the issue:
https://github.com/apache/spark/pull/12337
@holdenk The point is that this is inline. It doesn't require evaluating
the whole dataframe and counting the nulls you find. Instead you use this on a
column and it asserts that every
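The distinction being argued here can be sketched in plain Python (this is an illustrative analogy, not the actual Spark API; the function names are hypothetical): a separate pass that scans the data for nulls versus an inline assertion that piggybacks on work the pipeline is already doing.

```python
# Illustrative sketch only: contrast a separate null-checking pass with an
# inline per-value assertion. Names here are hypothetical, not Spark's API.

def check_then_map(rows, fn):
    # Separate-pass approach: evaluate the data once just to look for nulls,
    # then map. The extra scan is the overhead being discussed above.
    if any(r is None for r in rows):
        raise ValueError("null value found")
    return [fn(r) for r in rows]

def assert_not_null(value):
    # Inline approach: assert while each value flows through the existing
    # computation, so no extra evaluation pass is needed.
    if value is None:
        raise ValueError("null value found")
    return value

def map_with_inline_assert(rows, fn):
    # The assertion rides along with the map itself.
    return [fn(assert_not_null(r)) for r in rows]
```

Both fail on a null, but only the first one pays for an extra traversal of the data.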
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/10761#issuecomment-221933006
Can this be reopened? I now have time to work on it.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/12337#issuecomment-221931980
Ok, I created a matching issue in JIRA.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/12335#issuecomment-215525434
@davies You mean to support non-null return values? I don't think I know
enough scala to automatically infer that.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/12335#issuecomment-215498477
I've added some tests, but I'm having trouble getting the test suite to run
locally, either before or after my changes, so I'm kinda just praying that ev
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/10761#issuecomment-213100491
@HyukjinKwon The maintainers didn't seem interested and I was busy. I might
pursue this further this summer.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/12335#issuecomment-211574940
Sure thing. It'll be a while until I get around to it but I will make sure
to do that.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/11785#issuecomment-210144539
I would love for this to be a constraint.
Github user kevincox commented on a diff in the pull request:
https://github.com/apache/spark/pull/12337#discussion_r59466558
--- Diff: python/pyspark/sql/functions.py ---
@@ -138,6 +138,10 @@ def _():
' eliminated.'
}
+_fun
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/12337
Expose null checking function to Python land.
This allows efficiently mapping a column that shouldn't contain any nulls
to a column that Spark knows doesn't have any nulls.
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/12335
[SPARK-11321] [SQL] Python non null udfs
## What changes were proposed in this pull request?
This patch allows Python UDFs to return non-nullable values.
## How was this patch
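One way to picture what "non-null UDFs" buys you is a wrapper that promises a Python function never returns null, so a consumer could treat the output as non-nullable. This is a hypothetical plain-Python sketch, not the actual patch; `non_null` is an invented name for illustration.

```python
# Hypothetical sketch (not the real patch): wrap a UDF-style function so a
# None return value fails fast instead of silently producing a nullable
# result downstream.

def non_null(fn):
    def wrapper(*args):
        result = fn(*args)
        if result is None:
            raise ValueError("UDF declared non-null returned None")
        return result
    return wrapper

@non_null
def plus_one(x):
    # Example UDF body: never returns None for valid input.
    return x + 1
```

With such a guarantee in place, a query planner could skip null checks on the resulting column.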
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/11785#issuecomment-197957375
:+1:
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/10761#issuecomment-172707827
It is similar, but it has a number of advantages, including that it can be
cleaner (you can wait until after you finish a task and prepare it for
shuffling) and you get
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/10761#issuecomment-171851889
Fair enough. I tried to keep the tests as similar as possible so that it is
quite clear that the functionality hasn't changed. I have also done some
testing on m
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/10761#issuecomment-171850023
I'm implementing a system where Spark can reduce the number of executors in
low-resource situations. This allows jobs to utilize an entire cluster when it
is unn
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/10761#issuecomment-171835801
Note that this refactoring was performed to make way for future features in
the allocation manager.
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/10761
Refactor ExecutorAllocationManager.
This changes ExecutorAllocationManager from a tree of if
statements run on an interval to a clean event-driven state machine.
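The shape of that refactoring can be sketched in plain Python (illustrative only; the real code is Scala, and the states and events here are hypothetical): instead of re-running a tree of if statements on a timer, each event drives exactly one transition in an explicit state machine.

```python
# Illustrative sketch: an event-driven state machine replacing interval
# polling. States, events, and transitions are invented for illustration.

IDLE, SCALING_UP, SCALING_DOWN = "idle", "scaling_up", "scaling_down"

# Transition table: (current state, event) -> next state.
TRANSITIONS = {
    (IDLE, "backlog"): SCALING_UP,
    (SCALING_UP, "satisfied"): IDLE,
    (IDLE, "executors_idle"): SCALING_DOWN,
    (SCALING_DOWN, "satisfied"): IDLE,
}

class AllocationStateMachine:
    def __init__(self):
        self.state = IDLE

    def on_event(self, event):
        # Unknown (state, event) pairs leave the state unchanged; no
        # periodic re-evaluation is needed.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

The transition table makes the legal moves explicit, which is what makes the state machine easier to extend than nested conditionals.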
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8615#issuecomment-149545940
Awesome work @holdenk! I would love to see this merged.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8615#issuecomment-148933488
Of course, I just wanted to encourage the work on this :smiley:
@holdenk I'm quite busy at the moment but if I find time I will definitely
try to help yo
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8615#issuecomment-148842137
It's a shame that this has stalled, because there is a huge performance
improvement in 0.9 that I was hoping to cash in on.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8471#issuecomment-146510965
Hello, I was running into the same issue building Spark, and this patch
fixed it. It appears that multiple people are running into this, so it
would be nice to
Github user kevincox commented on a diff in the pull request:
https://github.com/apache/spark/pull/8994#discussion_r41335754
--- Diff: build/sbt ---
@@ -20,10 +20,12 @@
# When creating new tests for Spark SQL Hive, the HADOOP_CLASSPATH must
contain the hive jars so
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8994#issuecomment-145995462
If you would prefer the change I can do it, just let me know.
Github user kevincox commented on a diff in the pull request:
https://github.com/apache/spark/pull/8994#discussion_r41319534
--- Diff: build/sbt ---
@@ -20,10 +20,12 @@
# When creating new tests for Spark SQL Hive, the HADOOP_CLASSPATH must
contain the hive jars so
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/8994
Only add hive to classpath if HIVE_HOME is set.
Currently if it isn't set it scans `/lib/*` and adds every dir to the
classpath, which makes the env too large and every command c
Github user kevincox closed the pull request at:
https://github.com/apache/spark/pull/8722
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8722#issuecomment-139644930
It looks like it is.
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8722#issuecomment-139636661
Updated to reference the right package.
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/8722
Update py4j to 0.9.
Py4J 0.9 has a performance improvement for `unescape_new_line` which
significantly reduces the time required to transfer large strings.
cc @angelini
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8396#issuecomment-138599541
Tests added (there weren't any previously). I tried to ensure they covered
all supported cases.
Github user kevincox commented on a diff in the pull request:
https://github.com/apache/spark/pull/8396#discussion_r38939186
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
---
@@ -107,30 +107,21 @@ object DateTimeUtils
Github user kevincox commented on a diff in the pull request:
https://github.com/apache/spark/pull/8396#discussion_r38929835
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
---
@@ -107,30 +107,21 @@ object DateTimeUtils
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/8396#issuecomment-138570024
Sorry, had a busy week. I can definitely add a couple of tests. The
existing tests also verify that this is working with the previously tested
formats (assuming that
Github user kevincox commented on a diff in the pull request:
https://github.com/apache/spark/pull/8396#discussion_r37762298
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
---
@@ -107,30 +107,21 @@ object DateTimeUtils
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/8396
Fix datetime parsing in SparkSQL.
This fixes https://issues.apache.org/jira/browse/SPARK-9794 by using a real
ISO 8601 parser (courtesy of the xml component of the standard Java library
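The actual patch is Scala/Java, but the idea of leaning on a real standard-library ISO 8601 parser instead of hand-rolled string slicing can be illustrated in Python (an analogy only, not the patch's code):

```python
# Analogy in Python: use the standard library's ISO 8601 parsing rather
# than hand-written substring logic.
from datetime import datetime

def parse_iso8601(s):
    # datetime.fromisoformat handles common ISO 8601 forms; normalize a
    # trailing 'Z' (UTC designator) for Python versions before 3.11,
    # which do not accept it directly.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))
```

A dedicated parser handles the format's many variants (date-only, offsets, fractional seconds) consistently, which is exactly where ad-hoc slicing tends to break.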
Github user kevincox closed the pull request at:
https://github.com/apache/spark/pull/8319
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/8319
Kevincox spark timestamp
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Shopify/spark kevincox-spark-timestamp
Alternatively you can review
GitHub user kevincox opened a pull request:
https://github.com/apache/spark/pull/8075
Kevincox hanging no executor fix
Merge in https://github.com/apache/spark/pull/7716
@angelini @solackerman @kmtaylor-github
Github user kevincox closed the pull request at:
https://github.com/apache/spark/pull/8075
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/6714#issuecomment-110471373
@davies No, it appears that you just changed the original memory limit. I
am saying that once you figure out how large the chunk can be, you should set
that as the batch
Github user kevincox commented on the pull request:
https://github.com/apache/spark/pull/6714#issuecomment-110364692
Also why keep the batch size once you know you are going to spill to disk.
All that does is force you to draw from the iterator in batches. Once you know
how big
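The "forced to draw in batches" point can be sketched in plain Python (illustrative only, not PySpark's serializer code): a fixed batch size materializes a buffer on every draw, while streaming hands items through one at a time.

```python
# Illustrative sketch: fixed-size batched draws versus one-at-a-time
# streaming from the same iterator.
from itertools import islice

def batched(it, size):
    # Fixed batches: each draw materializes `size` items at once.
    it = iter(it)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def streamed(it):
    # Streaming: one item at a time, no batch buffering.
    yield from iter(it)
```

Once spilling is already decided, the batch buffer no longer bounds memory usefully; it just constrains how the iterator can be consumed.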