Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
asked. From: Maciej Szymkiewicz / Sent: Tuesday, August 4, 2020 12:59 PM / To: Sean Owen / Cc: Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark Dev List / Subject: Re: [PySpark] Revisiting PySpark type annotations

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
tubs/graphs/contributors) and at least some use cases (https://stackoverflow.com/q/40163106/). So, subjectively speaking, it seems we're already beyond POC. -- Best regards, Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
separate git repo? From: Hyukjin Kwon / Sent: Monday, August 3, 2020 1:58:55 AM / To: Maciej Szymkiewicz / Cc: Driesprong, Fokko; Holden Karau; Spark Dev List / Subject: Re: [PySpark] Revisiting PySpark type annotations

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
nt stubs for different versions of Python? I had to look up the literals: https://www.python.org/dev/peps/pep-0586/ I think it is more about portability between Spark versions. Cheers, Fokko. On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz < mszy

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

Re: Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-18 Thread Maciej Szymkiewicz
treated as private. Is this intentional? If so, what's the rationale? If not, then it feels like a bug and DataFrame should have some form of public access back to the context/session. I'm happy to log the bug but thought I would ask here first. Thanks! -- Best regards, Maciej Szymkiewicz

Re: Apache Spark Docker image repository

2020-02-06 Thread Maciej Szymkiewicz
Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stable environments). Bests, Dongjoon.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Maciej Szymkiewicz
I think it is important to distinguish between two different concepts: * Adherence to standards and their well-established implementations. * Enabling migrations from some product X to Spark. While these two problems are related, they are independent and one can be achieved without the

Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-30 Thread Maciej Szymkiewicz
Apache Spark 3.0.0 RC1 will start next January > (https://spark.apache.org/versioning-policy.html), > I'm +1 for the deprecation (Python < 3.6) > at Apache Spark 3.0.0. > > It's just a deprecation to p

[DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-24 Thread Maciej Szymkiewicz
Hi everyone, While the deprecation of Python 2 in 3.0.0 has been announced, there is no clear statement about continuing support for specific Python 3 versions. Specifically: * Python 3.4 has been retired this year.

Is SPARK-9961 is still relevant?

2019-10-05 Thread Maciej Szymkiewicz
Hi everyone, I just encountered SPARK-9961, which seems to be largely outdated today. In the latest releases the majority of models compute different evaluation metrics, exposed later through corresponding summaries. At the same time such
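As an illustration, a minimal sketch of one such summary in the PySpark ML API (assumes an active SparkSession named spark; the data is a toy example):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # Toy training data using the default "features" / "label" column names.
    df = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0)],
        ["features", "label"])

    model = LogisticRegression().fit(df)
    # Evaluation metrics are exposed through the model's training summary.
    print(model.summary.areaUnderROC)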

Re: Introduce FORMAT clause to CAST with SQL:2016 datetime patterns

2019-03-20 Thread Maciej Szymkiewicz
One concern here is the introduction of a second formatting convention. This can not only cause confusion among users, but also result in some hard-to-spot bugs when a wrong format, with a different meaning, is used. This is already a problem for Python and R users, with week year and months / minutes
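To make the week-year pitfall concrete, a sketch against Spark 2.x's SimpleDateFormat-based patterns (newer releases restrict week-based fields, so treat this as illustrative):

    from pyspark.sql import functions as F

    df = spark.range(1).select(
        F.date_format(F.to_date(F.lit("2018-12-31")), "yyyy").alias("year"),
        F.date_format(F.to_date(F.lit("2018-12-31")), "YYYY").alias("week_year"))
    # "yyyy" yields 2018, while the week-year pattern "YYYY" yields 2019 for
    # the same date: exactly the hard-to-spot bug described above.
    df.show()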

Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not you can also check Silex's implementation muxPartitions (see https://stackoverflow.com/a/37956034), but the applications are rather limited, due to high
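A minimal sketch of the partitionBy approach, assuming the split condition can be expressed as a column (names and path are illustrative):

    from pyspark.sql import functions as F

    df = spark.range(10).withColumn("even", (F.col("id") % 2 == 0).cast("string"))
    # Each distinct value of "even" lands in its own subdirectory,
    # e.g. .../even=true and .../even=false.
    df.write.partitionBy("even").parquet("/tmp/split-output")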

[PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Maciej Szymkiewicz
Hello everyone, I'd like to revisit the topic of adding PySpark type annotations in 3.0. It has been discussed before ( http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html and

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by default (with the exception of __init__). There is a :special-members: option which could be passed to, for example, autoclass. On Tue, 23 Oct 2018 at 21:32, Sean Owen wrote: > (& and | are both logical and bitwise operators in
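For example, one way to opt in globally via conf.py (autodoc_default_options needs Sphinx 1.8+; option names as I recall them, worth verifying):

    # conf.py: make autodoc emit selected dunder methods by default.
    extensions = ["sphinx.ext.autodoc"]
    autodoc_default_options = {
        "members": True,
        "special-members": "__and__, __or__, __invert__",
    }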

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
But do we want to take that baggage into the Apache Spark 3.x era? The next time you may drop it would be only the 4.0 release, because of the breaking change. On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz wrote: > There is no need to ditch Python 2.

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options - Use stub files and limit yourself to Python 3 support only. Python 3 users benefit from type hints, Python 2 users don't, but no core functionality is affected. This is the approach I've used with

Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Maciej Szymkiewicz
Hi Imran, On Wed, 29 Aug 2018 at 22:26, Imran Rashid wrote: > Hi Li, > > yes that makes perfect sense. That more-or-less is the same as my view, > though I framed it differently. I guess in that case, I'm really asking: > > Can pyspark changes please be accompanied by more unit tests, and not

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given the popularity of related SO questions: - https://stackoverflow.com/q/41670103/1560062 - https://stackoverflow.com/q/42465568/1560062 it is probably more "nobody thought about asking" than "it is not used often". On Wed, 22 Aug 2018
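Until a built-in UNPIVOT lands, the usual workaround is SQL's stack, e.g. (a sketch; column names are hypothetical):

    df = spark.createDataFrame([(1, 10.0, 20.0)], ["id", "a", "b"])
    # stack(n, key1, val1, key2, val2, ...) emits one row per (key, value) pair.
    long = df.selectExpr("id", "stack(2, 'a', a, 'b', b) as (key, value)")
    long.show()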

Re: Increase Timeout or optimize Spark UT?

2017-08-24 Thread Maciej Szymkiewicz
/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala#L60-L61> ? On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Hi, > From my experience it is possible to cut quite a lot by reducing s

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Maciej Szymkiewicz
Hi, From my experience it is possible to cut quite a lot by reducing spark.sql.shuffle.partitions to some reasonable value (let's say comparable to the number of cores). 200 is serious overkill for most of the test cases anyway. Best, Maciej On 21 August 2017 at 03:00, Dong Joon Hyun
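A sketch of what that looks like in a test fixture (the values are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[4]")
             # The default of 200 shuffle partitions is overkill for tiny
             # test datasets; a handful is plenty.
             .config("spark.sql.shuffle.partitions", 4)
             .getOrCreate())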

Re: Possible bug: inconsistent timestamp behavior

2017-08-15 Thread Maciej Szymkiewicz
ere? Thanks, Assaf -- Best regards, Maciej Szymkiewicz

Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer: https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py which should at least partially address the problem. On 06/22/2017 03:03 AM, Franklyn D'souza wrote: > I just wanted to highlight some of the rough edges around using >
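A minimal Imputer sketch (Spark 2.2+), imputing a numeric column before it is assembled into a vector:

    from pyspark.ml.feature import Imputer

    df = spark.createDataFrame([(1.0,), (float("nan"),), (3.0,)], ["x"])
    imputer = Imputer(strategy="mean", inputCols=["x"], outputCols=["x_imputed"])
    # NaN (the default missingValue) is replaced with the column mean, 2.0.
    imputer.fit(df).transform(df).show()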

Re: spark messing up handling of native dependency code?

2017-06-02 Thread Maciej Szymkiewicz
Maybe not related, but in general geotools are not thread-safe, so using them from workers is most likely a gamble. On 06/03/2017 01:26 AM, Georg Heiler wrote: > Hi, > There is a weird problem with spark when handling native dependency code: I want to use a library (JAI) with spark to parse some

Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz
pyspark, they just have to be run with a compatible packaging (e.g. mypy). Meaning that porting for Python 2 would provide a very small advantage over the immediate advantages (IDE usage and testing for most cases). Am I missing something?

Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz
metaclasses), which could be resolved without significant loss of function. On 05/23/2017 12:08 PM, Reynold Xin wrote: > Seems useful to do. Is there a way to do this so it doesn't break Python 2.x? On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz <m

[PYTHON] PySpark typing hints

2017-05-14 Thread Maciej Szymkiewicz
Hi everyone, For the last few months I've been working on static type annotations for PySpark. For those of you, who are not familiar with the idea, typing hints have been introduced by PEP 484 (https://www.python.org/dev/peps/pep-0484/) and further extended with PEP 526
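For context, a tiny illustration of what a stub (.pyi) fragment looks like; the signatures below are simplified for the example, not the actual stub content:

    # functions.pyi: annotations live in a separate file next to the code.
    from typing import Union
    from pyspark.sql.column import Column

    ColumnOrName = Union[Column, str]

    def col(col: str) -> Column: ...
    def upper(col: ColumnOrName) -> Column: ...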

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Maciej Szymkiewicz
I am not sure if it is relevant but explode_outer and posexplode_outer seem to be broken: SPARK-20534 On 04/28/2017 12:49 AM, Sean Owen wrote: > By the way the RC looks good. Sigs and license are OK, tests pass with > -Phive -Pyarn

[SQL] Unresolved reference with chained window functions.

2017-03-24 Thread Maciej Szymkiewicz
errors.package$.attachTree(package.scala:56) ... Caused by: java.lang.RuntimeException: Couldn't find AmtPaidCumSum#366 in [sum#385,max#386,x#360,AmtPaid#361] ... Is it a known issue or do we need a JIRA? -- Best, Maciej Szymkiewicz
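A condensed sketch of the kind of chaining the error message suggests (column names taken from the snippet above; whether this exact form reproduces the issue is untested):

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("x").orderBy("AmtPaid")
    df2 = (df
           .withColumn("AmtPaidCumSum", F.sum("AmtPaid").over(w))
           # A second window function defined over the result of the first.
           .withColumn("AmtPaidCumSumMax",
                       F.max("AmtPaidCumSum").over(Window.partitionBy("x"))))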

[ML][PYTHON] Collecting data in a class extending SparkSessionTestCase causes AttributeError:

2017-03-06 Thread Maciej Szymkiewicz
Hi everyone, It is either too late or too early for me to think straight, so please forgive me if it is something trivial. I am trying to add a test case extending SparkSessionTestCase to pyspark.ml.tests (example patch attached). If a test collects data, and there is another TestCase extending

Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-14 Thread Maciej Szymkiewicz
py4j in our repo but could instead have a pinned version required. While we do depend on a lot of py4j internal APIs, version pinning should be sufficient to ensure functionality (and simplify the update process). Cheers,

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Maciej Szymkiewicz
Congratulations! On 02/13/2017 08:16 PM, Reynold Xin wrote: > Hi all, > > Takuya-san has recently been elected an Apache Spark committer. He's > been active in the SQL area and writes very small, surgical patches > that are high quality. Please join me in congratulating Takuya-san! >

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-03 Thread Maciej Szymkiewicz
zero323 wrote: > Hi everyone, > While experimenting with ML pipelines I experience a significant performance regression when switching from 1.6.x to 2.x. > import org.apache.spark.ml.{

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Maciej Szymkiewicz
, that could lead to this behavior? -- Best, Maciej - Liang-Chi Hsieh | @viirya | Spark Technology Center | http://www.spark.tc/ -- Maciej Szymkiewicz

[SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-01-31 Thread Maciej Szymkiewicz
Hi everyone, While experimenting with ML pipelines I experience a significant performance regression when switching from 1.6.x to 2.x. import org.apache.spark.ml.{Pipeline, PipelineStage} import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler} val df = (1 to

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
nd less, even though it could be 365 days, and fix the documentation. 2) Explicitly disallow it as there may be a lot of data for a given window, but partial aggregations should help with that. My thoughts are to go with 1. What do you think? Best, Burak

[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Hi, Can I ask for some clarifications regarding the intended behavior of window / TimeWindow? PySpark documentation states that "Windows in the order of months are not supported". This is further confirmed by the checks in TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l). Surprisingly
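The check in question makes the first of these work and the second fail at analysis time (a sketch; assumes a DataFrame df with a timestamp column ts):

    from pyspark.sql import functions as F

    df.groupBy(F.window("ts", "7 days")).count()   # fine
    df.groupBy(F.window("ts", "1 month")).count()  # rejected: month-based
                                                   # intervals are not supported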

Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz
Following up, any thoughts on next steps for this? From: Maciej Szymkiewicz <mszymkiew...@gmail

Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz
1...@hotmail.com <mailto:sxk1...@hotmail.com>> wrote: > Following up, any thoughts on next steps for this?

[PYSPARK] Python tests organization

2017-01-11 Thread Maciej Szymkiewicz
Hi, I can't help but wonder if there is any practical reason for keeping monolithic test modules. These things are already pretty large (1500 - 2200 LOCs) and can only grow. Development aside, I assume that many users use tests the same way as me, to check the intended behavior, and largish

Re: [SQL][PYTHON] UDF improvements.

2017-01-10 Thread Maciej Szymkiewicz
rom the gist? Thanks! rb On Sat, Jan 7, 2017 at 12:39 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Hi, > I've been looking at the PySpark UserDefinedFunction and I have a co

[SQL][PYTHON] UDF improvements.

2017-01-07 Thread Maciej Szymkiewicz
Hi, I've been looking at the PySpark UserDefinedFunction and I have a couple of suggestions for how it could be improved, including: * Full featured decorator syntax. * Docstring handling improvements. * Lazy initialization. I summarized all suggestions, with links to possible solutions, in a gist
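For reference, a sketch of the decorator form, close to what pyspark.sql.functions.udf ended up supporting in later releases:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    @udf(IntegerType())
    def plus_one(x):
        """Docstring that the wrapper should preserve."""
        return None if x is None else x + 1

    df = spark.range(3).select(plus_one("id").alias("y"))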

Re: shapeless in spark 2.1.0

2016-12-29 Thread Maciej Szymkiewicz
re current versions. So this means a Spark user that uses shapeless in his own development cannot upgrade safely from 2.0.0 to 2.1.0, I think. Wish I had noticed this sooner. -- Maciej Szymkiewicz

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Maciej Szymkiewicz
scala> testUnion(5000) 822305 milliseconds res8: Long = 822305
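A common workaround is to drop to a single RDD-level union instead of N nested Dataset unions, e.g. (a sketch):

    dfs = [spark.range(10) for _ in range(100)]

    # sc.union builds one flat union instead of 100 nested plans, sidestepping
    # the growing analysis cost described in the thread.
    combined = spark.createDataFrame(
        spark.sparkContext.union([df.rdd for df in dfs]),
        dfs[0].schema)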

Re: [MLLIB] RankingMetrics.precisionAt

2016-12-06 Thread Maciej Szymkiewicz
On Tue, Dec 6, 2016 at 9:43 PM Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Thank you Sean. > Maybe I am just confused about the language. When I read that it returns "the average precision at th

Re: [MLLIB] RankingMetrics.precisionAt

2016-12-06 Thread Maciej Szymkiewicz
not enough sleep. On 12/06/2016 02:45 AM, Sean Owen wrote: > I read it again and that looks like it implements mean precision@k as I would expect. What is the issue? On Tue, Dec 6, 2016, 07:30 Maciej Szymkiewicz <mszymkiew...@gmail.com>

[MLLIB] RankingMetrics.precisionAt

2016-12-05 Thread Maciej Szymkiewicz
Hi, Could I ask for a fresh pair of eyes on this piece of code: https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80 @Since("1.2.0") def precisionAt(k: Int): Double = { require(k
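For anyone who wants to poke at the behavior from Python, a quick sketch:

    from pyspark.mllib.evaluation import RankingMetrics

    # (predicted ranking, ground truth) pairs.
    pred_and_labels = spark.sparkContext.parallelize([
        ([1, 2, 3, 4], [1, 3]),
        ([5, 6], [6]),
    ])
    metrics = RankingMetrics(pred_and_labels)
    print(metrics.precisionAt(2))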

Re: Future of the Python 2 support.

2016-12-05 Thread Maciej Szymkiewicz
ously and merit a discussion about dropping support, but I > think at this point it's premature to discuss that and we should > just wait and see. > > Nick > > > On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz > <mszymkiew...@gmail.com <mailto:msz

Future of the Python 2 support.

2016-12-04 Thread Maciej Szymkiewicz
Hi, I am aware there was a previous discussion about dropping support for different platforms (http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html) but somehow it has been dominated by Scala and JVM and never touched the

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-02 Thread Maciej Szymkiewicz
Sure, here you are: https://issues.apache.org/jira/browse/SPARK-18690 To be fair I am not fully convinced it is worth it. On 12/02/2016 12:51 AM, Reynold Xin wrote: > Can you submit a pull request with test cases based on that change? > > > On Dec 1, 2016, 9:39 AM -0800, Maciej

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
boundary API. Yes, I'd define unboundedPreceding to -sys.maxsize, but also any value less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered unboundedPreceding too. We need to be careful with long overflow when transferring data over to Java.

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
axsize, _JAVA_MIN_LONG) are considered unboundedPreceding too. We need to be careful with long overflow when transferring data over to Java. On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
backwards compatibility. On 11/30/2016 06:52 PM, Reynold Xin wrote: > Ah ok, for some reason when I did the pull request sys.maxsize was much larger than 2^63. Do you want to submit a patch to fix this? On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <mszymkiew...@gmail.

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Hi, > I've been looking at SPARK-17845 and I am curious if there is any reason to make it a breaking change. In Spark 2.0

[SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
ce incorrect results (ROWS BETWEEN -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we use Window.unboundedPreceding equal to -sys.maxsize to ensure backward compatibility? -- Maciej Szymkiewicz
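With the named constants, the API under discussion reads e.g. (a sketch; the columns k, ts and v are hypothetical):

    from pyspark.sql import Window, functions as F

    # Running total over all rows up to and including the current one.
    w = (Window.partitionBy("k").orderBy("ts")
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df.withColumn("running_total", F.sum("v").over(w))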

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Maciej Szymkiewicz
her major > release. > > I agree that that issue is a major one since it relates to > correctness, but since it's not a regression it technically does not > merit a -1 vote on the release. > > Nick > > On Wed, Nov 30, 2016 at 11:00 AM Maciej Szymkiewicz > <mszymkiew..

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Maciej Szymkiewicz
Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0. -- Marcelo -- Maciej Szymkiewicz

Re: [SQL][JDBC] Possible regression in JDBC reader

2016-11-25 Thread Maciej Szymkiewicz
02ee8d2c7e995#diff-f70bda59304588cc3abfa3a9840653f4L237 // maropu On Fri, Nov 25, 2016 at 9:50 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Hi, > I've been reviewing my notes to https://git.io/v1UVC using Spark buil

[SQL][JDBC] Possible regression in JDBC reader

2016-11-25 Thread Maciej Szymkiewicz
Hi, I've been reviewing my notes to https://git.io/v1UVC using Spark built from 51b1c1551d3a7147403b9e821fcc7c8f57b4824c and it looks like JDBC ignores both: * (columnName, lowerBound, upperBound, numPartitions) * predicates and loads everything into a single partition. Can anyone confirm
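For reference, the reader call under discussion (a sketch; URL, bounds and credentials are placeholders):

    df = spark.read.jdbc(
        url="jdbc:postgresql://example:5432/db",
        table="some_table",
        column="id", lowerBound=1, upperBound=1000000, numPartitions=8,
        properties={"user": "...", "password": "..."})
    # With a working partitioned read this should print 8, not 1.
    print(df.rdd.getNumPartitions())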

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-22 Thread Maciej Szymkiewicz
ch partition is sorted and the order of partitions defines the global ordering. All collect does is preserve this order by creating an array of results for each partition and flattening it. Best On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-21 Thread Maciej Szymkiewicz
cala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277 -- Niranda Perera -- Best regards, Maciej Szymkiewicz

Re: Handling questions in the mailing lists

2016-11-09 Thread Maciej Szymkiewicz
verbiage for the Spark community page and welcome email jump started, here's a working document for us to work with: https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit#

Re: Handling questions in the mailing lists

2016-11-07 Thread Maciej Szymkiewicz
es to SO? Sure, I'll be happy to help if I can. On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Damn, I always thought that mailing list is only for nice and welcoming pe

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
tially underestimated how opinionated people can be on mailing lists too :) On Sunday, November 6, 2016, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > You have to remember that Stack Overflow crowd (like me) is highly

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
You have to remember that the Stack Overflow crowd (like me) is highly opinionated, so many questions which could be just fine on the mailing list will be quickly downvoted and / or closed as off-topic. Just saying... -- Best, Maciej On 11/07/2016 04:03 AM, Reynold Xin wrote: > OK I've checked

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-30 Thread Maciej Szymkiewicz
nk you could register a custom serializer that handles this case. Or work around it in your client code. I know there have been other issues with Kryo and Map because, for example, sometimes a Map in an application is actually some non-serializable wrapper view.

java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
Hi everyone, I suspect there is no point in submitting a JIRA to fix this (not a Spark issue?) but I would like to know if this problem is documented anywhere. Somehow Kryo is losing the default value during serialization: scala> import org.apache.spark.{SparkContext, SparkConf} import

Re: What happens in Dataset limit followed by rdd

2016-08-03 Thread Maciej Szymkiewicz
pushes down across mapping functions, because the number of rows may change across functions, for example flatMap(). It seems that limit can be pushed across map(), which won't change the number of rows. Maybe this is room for Spark optimisation. On Aug 2

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Maciej Szymkiewicz
r, in the second case, the optimisation in the CollectLimitExec does not help, because the previous limit operation involves a shuffle operation. All partitions will be computed, running LocalLimit(1) on each partition to get 1 row, and then all partitions are shuffled into a

What happens in Dataset limit followed by rdd

2016-08-01 Thread Maciej Szymkiewicz
Hi everyone, This doesn't look like something expected, does it? http://stackoverflow.com/q/38710018/1560062 A quick glance at the UI suggests that there is a shuffle involved and the input for first is ShuffledRowRDD.
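The difference in question, sketched against the 2.x planner:

    df = spark.range(100000).repartition(20)

    df.limit(1).collect()    # planned as CollectLimit: cheap
    df.limit(1).rdd.first()  # switching to the RDD API instead plans
                             # LocalLimit/GlobalLimit with a shuffle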

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-27 Thread Maciej Szymkiewicz
Hi Jacek, In this context, don't you think it would be useful if at least some traits from org.apache.spark.ml.param.shared.sharedParams were public? HasInputCol(s) and HasOutputCol, for example. These are useful pretty much every time you create a custom Transformer. -- Regards, Maciej
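In PySpark the equivalent mixins are importable, which gives roughly the ergonomics being asked about; an illustrative sketch:

    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol

    class NullFlagger(Transformer, HasInputCol, HasOutputCol):
        """Flags null values in inputCol; illustrative only."""

        def __init__(self, inputCol="input", outputCol="output"):
            super(NullFlagger, self).__init__()
            self._set(inputCol=inputCol, outputCol=outputCol)

        def _transform(self, dataset):
            return dataset.withColumn(
                self.getOutputCol(),
                dataset[self.getInputCol()].isNull())

    # Usage: NullFlagger(inputCol="x", outputCol="x_missing").transform(df)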

ML ALS API

2016-03-07 Thread Maciej Szymkiewicz
/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436) is using float instead of double, unlike its MLlib counterpart. Is this going to be the default encoding in 2.0+? -- Best, Maciej Szymkiewicz

Re: DataFrame API and Ordering

2016-02-19 Thread Maciej Szymkiewicz
we should document that. Any suggestions on where we should document this? In DoubleType and FloatType? On Tuesday, February 16, 2016, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > I am not sure if I've misse

DataFrame API and Ordering

2016-02-16 Thread Maciej Szymkiewicz
I am not sure if I've missed something obvious, but as far as I can tell the DataFrame API doesn't provide clearly defined ordering rules, NaN handling excluded. Methods like DataFrame.sort or sql.functions like min / max provide only a general description. Discrepancy between functions.max (min) and
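The NaN part is the one corner that is spelled out: Spark SQL treats NaN as greater than any other numeric value, and NaN equals NaN. A quick sketch:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1.0,), (float("nan"),), (None,)], ["x"])
    df.orderBy("x").show()  # ascending: null first, then 1.0, then NaN
    df.where(F.col("x") == float("nan")).show()  # matches the NaN row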