Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Reynold Xin

Hi Wes, I agree it is difficult to do this design case by case, but what I was pointing out was "it is difficult to generalize without seeing a lot more cases". I do think we need to see a lot of these cases and then make a call. My intuition is that we can just have config options that control b

Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Wes McKinney

Hi Reynold, It's of course possible to find solutions to specific issues, but what I'm curious about is a general decision-making framework around building strong user experiences for programmers using each of the Spark APIs. Right now, the semantics of using Spark are very tied to the semantics o

Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Reynold Xin

Hi Wes, Thanks for the email. It is difficult to generalize without seeing a lot more cases, but the boolean issue is simply a query analysis rule. I can see us having a config option that makes analysis more Python/R-like, which changes the behavior of implicit type coercion and allow

PySpark API divergence + improving pandas interoperability

2016-03-19 Thread Wes McKinney

Hi everyone, I've recently gotten moving on solving some of the low-level data interoperability problems between Python's NumPy-focused scientific computing and data libraries like pandas and the rest of the big data ecosystem, Spark being a very important part of that. One of the major efforts h