> On Apr 23, 2014, at 12:38 AM, "Ondřej Čertík" <[email protected]> wrote:
>
>> On Tue, Apr 22, 2014 at 9:52 PM, Aaron Meurer <[email protected]> wrote:
>>> On Tue, Apr 22, 2014 at 10:21 PM, Ondřej Čertík <[email protected]> wrote:
>>>> On Tue, Apr 22, 2014 at 6:06 PM, Aaron Meurer <[email protected]> wrote:
>>>>> On Tue, Apr 22, 2014 at 12:05 PM, Ondřej Čertík <[email protected]> wrote:
>>>>> Hi Aaron,
>>>>>
>>>>> Those are good questions. Here are the answers:
>>>>>
>>>>>> On Tue, Apr 22, 2014 at 10:13 AM, Aaron Meurer <[email protected]> wrote:
>>>>>> I have some high-level questions about CSymPy.
>>>>>>
>>>>>> - What are the goals of the project?
>>>>>
>>>>> The goals of the project are:
>>>>>
>>>>> * Fastest symbolic manipulation library, compared to other codes,
>>>>> commercial or open source (Sage, GiNaC, Mathematica, ...).
>>>>>
>>>>> * Extension/complement to SymPy
>>>>>
>>>>> * If the above two goals allow, be able to also call it from other
>>>>> languages easily and efficiently (Julia, Ruby, Mathematica, ...)
>>>>>
>>>>> As to the technical solution: the core should be a C++ library, which
>>>>> can depend on other compiled libraries if needed. The core should not
>>>>> depend on Python or Ruby or Julia, but rather be just one language,
>>>>> C++. That lowers the barrier to entry significantly, compared to a big
>>>>> mix of C++, Cython and Python, and makes it easier to make things fast
>>>>> (you don't need to worry about Python at all). The Python (and other
>>>>> languages) wrappers should be just thin wrappers around the C++ core
>>>>> (= just better syntax).
>>>>>
>>>>> There might be other technical solutions to this, but I know that I
>>>>> can deliver the above goals with this solution (and I failed to
>>>>> deliver with other solutions, like writing the core in Cython). So
>>>>> that's why we do it this way.
>>>>>
>>>>> Also, by being "just a C++ library", other people can use it in their
>>>>> projects.
>>>>> I hope to get the interest of a much broader community that way, who
>>>>> can contribute back (somebody will need fast symbolic manipulation in
>>>>> Julia, so they can just use CSymPy with Julia wrappers, and contribute
>>>>> improvements back).
>>>>>
>>>>>>
>>>>>> - What are the things that should definitely go in CSymPy?
>>>>>
>>>>> At the moment: all the things needed to make specific applications
>>>>> fast, in particular PyDy. For that, it needs basic manipulation,
>>>>> differentiation, series expansion (I think) and matrices. That's all
>>>>> roughly either done, or on the way. Of course, lots of polishing is
>>>>> needed.
>>>>
>>>> I think that's already too much. Why is the series expansion slow in
>>>> SymPy? Is it because the algorithms are slow? If so, then implementing
>>>> the same inefficient algorithms in CSymPy won't help. They will be
>>>> faster, but for large enough expressions they will still slow down. Is
>>>> it because the expression manipulation is slow? In that case, if
>>>> CSymPy has faster expression manipulation, then just use those
>>>> expressions, but use the SymPy series algorithms.
>>>
>>> My experience is that it will actually help to implement the same
>>> algorithm, because there is a little overhead with any Python operation.
>>
>> Exactly. There is a "little" overhead, not a huge overhead. It matters
>> for the stuff that is in the inner loops, like addition and
>> multiplication of terms, but for whole algorithms, which might be
>> called only a few times (as opposed to a few hundred thousand times),
>> it doesn't make a difference.
>>
>> This is all hypothetical without numbers (and by the way, it would be
>> awesome if you could provide real numbers here), but suppose these
>> imaginary numbers were true:
>>
>> SymPy: 1x
>> CSymPy with Python wrappers: 4x
>> Raw CSymPy: 5x
>>
>> Then using CSymPy with Python would already be 4x faster than SymPy,
>> and doing everything in raw CSymPy would only be 1.25x faster than
>> that.
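Spelled out, the arithmetic behind those imaginary numbers is just two ratios (a trivial sketch; the 1x/4x/5x figures are the hypothetical ones from the message, not measurements):

```python
# Hypothetical relative speeds from the message above (not measurements):
# SymPy = 1x baseline, CSymPy via Python wrappers = 4x, raw CSymPy = 5x.
sympy_speed = 1.0
wrapped_speed = 4.0
raw_speed = 5.0

wrapped_over_sympy = wrapped_speed / sympy_speed  # wrappers vs. SymPy
raw_over_wrapped = raw_speed / wrapped_speed      # raw C++ vs. wrappers

print(wrapped_over_sympy, raw_over_wrapped)  # 4.0 1.25
```

The second ratio is the one at issue: under these numbers, moving from wrapped CSymPy to pure C++ buys only a further 25%.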
>>
>> Now, if CSymPy integrates flawlessly, so that it just works (at least
>> as far as the user is concerned), there is little complexity cost to
>> CSymPy + Python. Definitely little enough to warrant the 4x speedup.
>> But as soon as you take that away, i.e., you implement more and more
>> in C++, or CSymPy differs enough from SymPy that the user needs to
>> care about it (and the more that is in CSymPy, the more likely this is
>> to happen), then the complexity cost skyrockets. Maybe 4x would still
>> be worth it here. But not 1.25x.
>>
>>> So if you do a lot of them (like in series expansion, where you create
>>> intermediate expressions and so on --- and even if we use CSymPy,
>>> there is overhead in the Python wrappers, so essentially you don't
>>> want to be calling them too often, if you *really* care about
>>> performance), they will add up.
>>
>> Sure, you could just implement a whole CAS in C++. That's what some
>> people have done already. But you have to factor in the costs of
>> everything, not just the speed. The costs of:
>>
>> - How much more complicated the code is
>> - Code duplication (and all the associated issues that come with it)
>> - The additional overhead needed for CSymPy to interoperate with
>>   SymPy. The more CSymPy does, the harder this is.
>>
>> Is series expansion slow? Is it an inner loop (i.e., will it matter to
>> people if it is slow)? Is it simple to implement ('simple' being a
>> relative term, of course; obviously no part of a CAS is completely
>> simple)? If the answer to any of those is "no", I think you should
>> seriously consider whether it's worth implementing.
>>
>>>
>>>>
>>>> My points are:
>>>>
>>>> - I think CSymPy should focus on making expression manipulation fast
>>>> (i.e., the things that are the inner loop of any symbolic algorithm).
>>>> It should not reimplement the symbolic algorithms themselves. Those
>>>> are implemented in SymPy.
>>>> If they use the CSymPy objects instead of the SymPy objects, they
>>>> will be faster.
>>>>
>>>> - I would focus more on making CSymPy interoperate with SymPy and
>>>> less on reimplementing things that are in SymPy in CSymPy. Once
>>>> there is interoperation, we can see what is still slow, and then
>>>> (and only then) implement it in C++.
>>>
>>> The interoperability is important, and that is mostly the job of the
>>> Python wrappers. The C++ API underneath can change a lot, i.e. if we
>>> figure out a faster way to represent things, we'll switch. I will
>>> spend lots of time making the interoperability work. This summer,
>>> though, I want to think hard about raw speed and making it work for
>>> PyDy.
>>>
>>>>>>
>>>>>> - What are the things that should definitely not go in CSymPy?
>>>>>
>>>>> Things that don't need to be fast. Things like limits. Also things
>>>>> that are in SymPy, where CSymPy can be used as a drop-in
>>>>> replacement for the engine: PyDy, some stuff in physics, and so on.
>>>>> There is no need to rewrite PyDy in C++. Also, most users' code
>>>>> would stay in Python. They can just optionally change to CSymPy for
>>>>> some intensive calculation, then finish the thing with SymPy.
>>>>>
>>>>>>
>>>>>> - How will CSymPy be architected to allow things to happen in
>>>>>> CSymPy when they can, but fall back to SymPy when they cannot?
>>>>>
>>>>> Currently you can simply mix and match SymPy and CSymPy
>>>>> expressions. So you simply convert an expression to SymPy to do
>>>>> some advanced manipulation, and convert to CSymPy to do some fast
>>>>> manipulation. I am open to suggestions for how to improve this.
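The mix-and-match workflow described here might look roughly like the sketch below. The conversion API is an assumption on my part (a `csympy.sympify()` that accepts SymPy expressions, and a `_sympy_()` method on CSymPy objects, in the spirit of the discussion in issue #153; the real names may differ), so the import is guarded and the snippet falls back to plain SymPy when csympy is not installed:

```python
# Sketch only: the csympy conversion API used here (csympy.sympify and
# ._sympy_()) is assumed, not confirmed; see
# https://github.com/sympy/csympy/issues/153 for the actual discussion.
import sympy

x = sympy.Symbol("x")
expr = sympy.sin(2 * x)

try:
    import csympy
    fast = csympy.sympify(expr)            # SymPy -> CSymPy (assumed API)
    fast = fast.diff(csympy.Symbol("x"))   # fast manipulation in C++
    result = fast._sympy_()                # CSymPy -> SymPy (assumed API)
except ImportError:
    result = expr.diff(x)                  # csympy not installed: stay in SymPy

# Advanced manipulation (simplification, rewriting, ...) stays in SymPy:
print(result)  # 2*cos(2*x)
```

The point of the round trip is that only the hot manipulation crosses into C++; anything needing SymPy-only features is converted back first.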
>>>>>
>>>>>>
>>>>>> My main concern here is that CSymPy has no clear separation from
>>>>>> SymPy, and as a result it will end up growing larger and larger,
>>>>>> until it becomes an independent CAS (which is fine if that's the
>>>>>> goal, but my understanding was that it was supposed to be just a
>>>>>> small fast core).
>>>>>
>>>>> The goals are written above. I am myself concentrating on speed;
>>>>> that's what I really want to nail down, and then enough features so
>>>>> that it's useful for all the people who found SymPy slow. However,
>>>>> let's say somebody comes and reimplements the Gruntz algorithm in
>>>>> CSymPy. Should we reject such a PR? My answer is that if the code is
>>>>> nice, maintainable, much faster than SymPy and has the same or
>>>>> similar features, I am OK with merging it. If the code is a mess,
>>>>> then not. But as I said, I am spending my own time on things which
>>>>> people need, and faster limits don't seem to be it.
>>>>>
>>>>>>
>>>>>> In particular, if there is some feature of SymPy functions, how will
>>>>>> CSymPy be architected so that it can take advantage of it without
>>>>>> having to completely reimplement that function in C++?
>>>>>
>>>>> You can convert any expression back and forth, so you keep it in
>>>>> SymPy if you want to have some particular feature. See also the
>>>>> conversation and specific examples here:
>>>>>
>>>>> https://github.com/sympy/csympy/issues/153
>>>>>
>>>>>>
>>>>>> For instance, a current goal of CSymPy is to implement trig
>>>>>> functions. But this can be quite complicated if you consider all the
>>>>>> different things you can do with trig functions. Without even
>>>>>> thinking about trig simplification, there are complicated evaluation
>>>>>> issues (e.g., consider sin(pi/7).rewrite(sqrt) in SymPy). It would
>>>>>> be a shame to implement all this logic twice, especially since it is
>>>>>> not needed for performance.
>>>>>
>>>>> Agreed.
>>>>> On the other hand, we really need very fast trig functions. The
>>>>> functionality that we need is simplifications like sin(2*pi) -> 0,
>>>>> differentiation and series expansion.
>>>>
>>>> Why do you need to implement those in C++? If the expression
>>>> manipulation is fast, then won't it be fine to have the actual
>>>> formulas/algorithms in SymPy?
>>>
>>> Maybe; that depends on this issue:
>>>
>>> https://github.com/sympy/csympy/issues/153
>>>
>>> The problem is that once you start thinking about Python+C++ at once,
>>> plus performance, things get complex quickly. It's much easier to
>>> think in terms of C++ only and how to write the fastest possible
>>> algorithm (that is hard enough!). This sets the bar. Then one should
>>> try to see if it is possible to match this with Python. Not the other
>>> way round, because you need to set the bar first.
>>>
>>>> On Tue, Apr 22, 2014 at 6:16 PM, Aaron Meurer <[email protected]> wrote:
>>>>> On Tue, Apr 22, 2014 at 4:58 PM, Joachim Durchholz <[email protected]> wrote:
>>>>> Hm.
>>>>>
>>>>> One:
>>>>>
>>>>>> * Extension/complement to SymPy
>>>>>
>>>>> Two:
>>>>>
>>>>>> That lowers the barrier to entry significantly, compared to a big
>>>>>> mix of C++, Cython and Python, makes it easier to make things fast
>>>>>> (you don't need to worry about Python at all).
>>>>>
>>>>> That's not an extension nor a complement; it's a replacement.
>>>>>
>>>>>> The Python (and other languages) wrappers should be just thin
>>>>>> wrappers around the C++ core (= just better syntax).
>>>>>
>>>>> I.e., replace the engine if not the API.
>>>>>
>>>>> Not that I'm judging. I'm just pointing out perceived inconsistencies.
>>>>>
>>>>> The more SymPy itself turns into a set of simplification rules, the
>>>>> less significance this will have in the end.
>>>>>
>>>>>
>>>>>>> - How will CSymPy be architected to allow things to happen in
>>>>>>> CSymPy when they can, but fall back to SymPy when they cannot?
>>>>>>
>>>>>> Currently you can simply mix and match SymPy and CSymPy
>>>>>> expressions. So you simply convert an expression to SymPy to do
>>>>>> some advanced manipulation, and convert to CSymPy to do some fast
>>>>>> manipulation. I am open to suggestions for how to improve this.
>>>>>
>>>>> The alternative would be to have a data structure that can be
>>>>> manipulated from both the C++ and the Python side, but that's going
>>>>> to be unnatural for at least one of the sides.
>>>>>
>>>>> Note that the data structures can become large-ish, and if the
>>>>> simplification becomes complicated there may be a lot of
>>>>> back-and-forth. It's possible that mixed execution will be slow for
>>>>> some algorithms or use cases for that reason.
>>>>>
>>>>> I do not think that this can be determined in advance; it's
>>>>> something to keep an eye out for during benchmarks.
>>>>>
>>>>>>> My main concern here is that CSymPy has no clear separation from
>>>>>>> SymPy, and as a result it will end up growing larger and larger,
>>>>>>> until it becomes an independent CAS (which is fine if that's the
>>>>>>> goal, but my understanding was that it was supposed to be just a
>>>>>>> small fast core).
>>>>>>
>>>>>> The goals are written above. I am myself concentrating on speed;
>>>>>> that's what I really want to nail down.
>>>>>
>>>>> I'm somewhat sceptical about this. A conversion to C++ will give a
>>>>> linear improvement. Better algorithms can improve the big-O class.
>>>>> Unless algorithmic improvements have been exhausted, this would be
>>>>> premature optimization. (Are algorithmic improvements exhausted
>>>>> yet?)
>>>>
>>>> That is true. I usually prefer to use faster algorithms.
>>>>
>>>> But to play the devil's advocate, there are two issues with this
>>>> line of thinking:
>>>>
>>>> - Big O is basically useless. Consider the extreme effectiveness of
>>>> SAT solvers (which solve an NP-complete problem), or the difference
>>>> between the simplex and Khachiyan's algorithms, or AKS vs. more
>>>> efficient deterministic primality testing algorithms. Asymptotic
>>>> complexity is all fine, but at the end of the day, you don't care
>>>> how fast your algorithm is for increasingly large inputs; you care
>>>> how fast it is for *your* input.
>>>>
>>>> - Faster algorithms have a complexity cost. You can get closer to
>>>> the metal in Python by being very careful about your use of data
>>>> structures, and avoiding things that are slow in Python (like
>>>> function calls), but the cost is high, because you end up with code
>>>> that is not only harder to read and maintain, but harder to keep in
>>>> its fast state, because someone else who doesn't know all the little
>>>> tricks might come along and change things in a way that seems
>>>> equivalent but makes things slower.
>>>
>>> Precisely. That's why it's good to stick to just one language, C++,
>>> and nail the speed. That sets the bar. Then one can try to match the
>>> speed with Python, which sometimes is possible.
>>>
>>>> With that being said, C++ is itself enough of a complexity cost that
>>>> it outweighs the benefit in many (most?) cases. (That's not just a
>>>> knock on C++; using any second language alongside Python brings a
>>>> cost, both because there are now two languages to think about, and
>>>> because of interoperability questions.)
>>>
>>> Yes, again the reason to stick to just one language and only do thin
>>> wrappers that allow using it as a black box.
>>>
>>>> On Tue, Apr 22, 2014 at 6:23 PM, Brian Granger <[email protected]> wrote:
>>>> I too feel that csympy should implement the absolute minimum
>>>> possible for it to be fast.
>>>> All actual mathematical algorithms should remain in sympy. Trying to
>>>> pull lots of algorithms into csympy will fail - not that many people
>>>> want to write complex mathematical algorithms in C++.
>>>
>>> I understand your worries. Both yours and Aaron's and other people's
>>> in the SymPy community. Let's be frank about this; here they are:
>>
>> Thanks for your frankness. So I think you do understand the issues.
>>
>>> * keep things simple, maintainable, in Python, not introducing other
>>> languages
>>>
>>> * especially not C++, which is notorious for being hilariously
>>> complex. That's why we use Python, because it's better.
>>
>> Well, there are also languages that are fast and not C++, but we can
>> have that discussion separately.
>>
>>> * danger of splitting the community by introducing a separate CAS
>>> (= wasting our resources, attention, developers, and so on), and
>>> we've been down this path before, e.g. with sympycore, or even Sage
>>> --- any improvement to those codes does not benefit SymPy. That does
>>> not mean that there is anything bad about those, but one has to try
>>> to focus, in our case on SymPy, and just get things done: make it a
>>> useful library so that people can use it and benefit from it.
>>>
>>> * that we should concentrate on features, and sacrifice some
>>> reasonable speed (maybe 100x or so).
>>>
>>> * We should not be reimplementing SymPy in C++; that would be a big
>>> waste.
>>
>> One of my worries is not listed here, which is that with CSymPy you
>> are doing things completely backwards from good software design:
>> getting speed first, and something that works later.
>
> The goal is to get something that works first, polished API later.
> By "work" I mean fast and working for PyDy (at first).
>
>>
>>> I thought about all these very deeply. And I can promise that I will
>>> do my absolute best to make this work, with the SymPy community.
>>> I might not have the best answer to all the worries, but I know that
>>> we need to set the bar:
>>>
>>> * So implement trig functions in C++
>>> * Benchmark against Mathematica, Sage, GiNaC
>>> * Make sure it is as fast or faster
>>> * See if we can match it with Python
>>> * If so, great! If not, what is the penalty? 2x? 10x? 100x?
>>
>> That is exactly what I want to know.
>>
>>> If we don't have this bar, then we miss out on speed. And if we miss
>>> out on speed, then there will always be a reason why people would use
>>> other software, because of speed. If, on the other hand, csympy is as
>>> fast as the state of the art, then it fixes the problem. And it will
>>> be integrated with SymPy well; people can keep using SymPy.
>>>
>>> Ondrej
>>
>> I feel like without real numbers we aren't going to get anywhere, so
>> maybe you could provide some benchmarks. I'm not convinced about
>> expand, because that relies pretty heavily on other things like
>> multinomial coefficient generation. I'd rather see a benchmark that
>> does the exact same expression manipulations everywhere.
>>
>> Feel free to suggest a better one. I'm just coming up with this from
>> the seat of my pants, but something like
>>
>> a = x
>> c = 1
>> for i in range(1000):  # Replace with larger numbers if necessary
>>     a += c*i*x  # If CSymPy expressions are mutable, modify this accordingly
>>     c *= -1
>
> Sure. Here is the code:
>
> from csympy import var, Integer
> #from sympy import var, Integer
> var("x")
> a = x
> c = Integer(1)
> N = 10**5
> for i in range(N):
>     a += c*i*x
>     c *= -1
> print a
>
> SymPy:
>
> $ time python a.py
> -49999*x
>
> real    0m35.262s
> user    0m34.870s
> sys     0m0.300s
>
> CSymPy:
>
> $ time python a.py
> -49999x
>
> real    0m0.860s
> user    0m0.852s
> sys     0m0.004s
>
> So this particular one is 41x faster in CSymPy.
> You can modify this to generate some long expressions, e.g.:
>
> from csympy import var, Integer
> #from sympy import var, Integer
> var("x")
> a = x
> c = Integer(1)
> N = 3000
> for i in range(N):
>     a += c*x**i
>     c *= -1
>
> SymPy:
>
> $ time python a.py
>
> real    0m37.890s
> user    0m37.626s
> sys     0m0.152s
>
> CSymPy:
>
> $ time python a.py
>
> real    0m1.032s
> user    0m1.020s
> sys     0m0.012s
>
> This one is 37x faster. Sage takes about 3s on this benchmark. And so on.
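One way to run the first benchmark over a range of N, as a sketch (SymPy only; swap in the csympy imports to compare the two; the `bench` helper and the particular N values are illustrative, not from the thread):

```python
# Time the alternating-sum benchmark from above for several N.
# SymPy-only sketch; the helper name and N values are illustrative.
import time
from sympy import Symbol, Integer

def bench(N):
    """Run the a += c*i*x loop for a given N; return (seconds, result)."""
    x = Symbol("x")
    a = x
    c = Integer(1)
    start = time.perf_counter()
    for i in range(N):
        a += c * i * x
        c *= -1
    return time.perf_counter() - start, a

for N in (100, 1000, 3000):
    seconds, result = bench(N)
    # For even N the result collapses to (1 - N//2)*x, e.g. -49*x for N=100.
    print(N, round(seconds, 3), result)
```

Plotting seconds against N (e.g. on a log-log scale) then shows the scaling directly rather than a single data point.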
And how fast is the raw C++, without the Python wrappers?

Aaron Meurer

> We need to also benchmark Mathematica and GiNaC.
>
>>
>> This is a very basic test of how well terms are combined with
>> addition. It also creates a lot of terms, which might penalize a
>> system with too much caching (maybe it should be modified to not do
>> that; I don't know).
>>
>> 1000 should really be replaced with a much larger number, say
>> 10000000. Basically, if a benchmark runs in under a second, I'm wary
>> of it. That means you're benchmarking things that might be irrelevant
>> for larger expressions, when the speed really matters. I suppose
>> really you should benchmark 10**N for some range of N and plot the
>> results.
>
> Exactly; a plot depending on N is the best.
>
> Ondrej
>
>>
>> That's just one benchmark, and it tests one thing. We need to
>> benchmark all the major aspects of the core. A few others that come
>> to mind:
>>
>> - x1*...*xn + x2*...*xn*x1 + ... (basically, test that large
>> multiplications in different orders are efficiently canonicalized to
>> the same thing)
>> - (some large addition) - (the same addition, in a different order)
>> - x**x**x**x**...**x (do some stuff with it; basically, test that
>> highly nested expressions don't kill things. You could also test
>> sin(sin(sin(sin(...(x)...) or something)
>> - (Huge complicated expression in x).subs(x, y) (CSymPy implements
>> some kind of subs or xreplace natively, right?)
>>
>> It actually would be nice to have some kind of benchmarking site for
>> SymPy, similar to http://speed.pypy.org/.
>>
>> Aaron Meurer
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "sympy" group.
>> To unsubscribe from this group and stop receiving emails from it,
>> send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/sympy.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/sympy/CAKgW%3D6KX8h7FRnKWa%2BjFihSfiu-9rrexYZmjXymOimLsUQ7BeQ%40mail.gmail.com.
>> For more options, visit https://groups.google.com/d/optout.
