Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-24 Thread Bryan Cutler
Thanks Joris for clearing that up! It's correct that PySpark will allow the user to do operations on the resulting DataFrame, so it doesn't sound like I should set `split_blocks=True` in the conversion. You're right that the unnecessary assignments can be easily avoided if the columns are not timestamps, so that…
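A minimal sketch of the concern (standard pyarrow API; whether zero-copy actually applies depends on the pyarrow/pandas versions and on the column having no nulls):

    import pyarrow as pa

    table = pa.table({"x": [1.0, 2.0, 3.0]})

    # The default conversion copies, so users can mutate the result freely.
    df = table.to_pandas()

    # With split_blocks=True the column may instead be a zero-copy view of
    # Arrow memory; such views are read-only, so a user's in-place operation
    # on the DataFrame (e.g. df_zc["x"] *= 2) could raise.
    df_zc = table.to_pandas(split_blocks=True)
    print(df_zc["x"].to_numpy(copy=False).flags.writeable)  # False when zero-copy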

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-24 Thread Joris Van den Bossche
Hi Bryan, for the case where the column is not a timestamp and was not modified: I don't think it will take copies of the full DataFrame by assigning columns in a loop like that. But it is still doing work (it will copy the data for that column into the 2D array that holds the data for that block), and which…
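To make that point concrete, a rough sketch (illustrative only; exact copying behavior varies across pandas versions):

    import numpy as np
    import pandas as pd

    # Two float64 columns end up consolidated in one 2-D float64 block.
    df = pd.DataFrame({"a": np.arange(3.0), "b": np.arange(3.0, 6.0)})

    # Assigning a series back does not copy the whole frame, but pandas
    # still writes that column's values into the shared 2-D block, which
    # is per-column copy work even when the values are unchanged.
    df["a"] = df["a"]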

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-23 Thread Bryan Cutler
Thanks for investigating this and the quick fix, Joris and Wes! I just have a couple of questions about the behavior observed here. The PySpark code assigns either the same series back to the pandas.DataFrame or makes some modifications if it is a timestamp. In the case where there are no timestamps, is…
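For context, the pattern being discussed looks roughly like the following (an illustrative reconstruction, not the actual PySpark source; the function name, the `timezone` parameter, and the localization calls are assumptions):

    import pandas as pd
    from pandas.api.types import is_datetime64_dtype

    def postprocess_toPandas(pdf: pd.DataFrame, timezone: str) -> pd.DataFrame:
        for name in pdf.columns:
            series = pdf[name]
            if is_datetime64_dtype(series.dtype):
                # timestamps are adjusted before being stored back
                pdf[name] = series.dt.tz_localize("UTC").dt.tz_convert(timezone)
            else:
                # otherwise the same series is assigned straight back
                pdf[name] = series
        return pdf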

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Joris Van den Bossche
That sounds like a good solution. Having the zero-copy behavior depend on whether you have only one column of a certain type might lead to surprising results. To avoid yet another keyword, only doing it when `split_blocks=True` sounds good to me (in practice, that's also when it will happen…
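A sketch of the surprise being avoided (this describes the unconditional zero-copy optimization before ARROW-7596; with the agreed fix, the default calls below always copy):

    import pyarrow as pa

    # A lone float64 column occupies its own block, so it could come back
    # as a read-only zero-copy view of Arrow memory...
    lone = pa.table({"x": [1.0, 2.0]}).to_pandas()

    # ...while the same column next to a second float64 column is copied
    # into a shared 2-D block and comes back writable. Whether users could
    # mutate the result would thus depend on the schema, not on any flag.
    pair = pa.table({"x": [1.0, 2.0], "y": [3.0, 4.0]}).to_pandas()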

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Wes McKinney
I created https://issues.apache.org/jira/browse/ARROW-7596 and made it a blocker for 0.16.0 so this does not get lost in the shuffle.

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Wes McKinney
Hi Joris, thanks for investigating this. It seems there were some unintended consequences of the zero-copy optimizations from ARROW-3789. Another way forward might be to "opt in" to this behavior, or to only do the zero-copy optimizations when `split_blocks=True`. What do you think? - Wes

PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Joris Van den Bossche
So the Spark integration build started to fail with the following test error: ERROR: test_toPandas_batch_order (pyspark.sql.tests.test_arrow.EncryptionArrowTests) …

[NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0

2020-01-15 Thread Crossbow
Arrow Build Report for Job nightly-2020-01-15-0
All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0
Failed Tasks:
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-travis-gandiva-jar-osx
- test-conda-py…