Another thing you could look at is what should be done at the parsing stage and what should be done after parsing. For example, "2 x", "x y", and "tan x" are all the same syntax as far as the parser is concerned (unless you want to put all predefined names in the grammar itself), but the first two are implicit multiplication and the third is an implicit function call. So maybe those should be parsed to the same object and then differentiated in software somehow. Then come questions of how to interpret things like "tan x y" (tan(x)*y, or tan(x*y), or fail).
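To make that split concrete, here is a minimal sketch in plain Python. It is only an illustration, not how SymPy does it: the names Juxtaposition, parse, resolve, and KNOWN_FUNCTIONS are all made up for this example, and it assumes the atoms are separated by whitespace. The parsing stage records the atoms without deciding what they mean; the post-parsing stage looks up which names are functions and picks one of the two readings of "tan x y".

```python
import re
from dataclasses import dataclass

# Hypothetical table of predefined function names. A real parser would
# take these from its namespace rather than hard-coding them.
KNOWN_FUNCTIONS = {"sin", "cos", "tan", "log"}

# An atom is either a number literal or a name.
ATOM = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]\w*")


@dataclass(frozen=True)
class Juxtaposition:
    """Neutral parse result: atoms written next to each other."""
    atoms: tuple


def parse(text):
    """Parsing stage: collect the atoms, decide nothing about meaning."""
    atoms = tuple(text.split())
    for atom in atoms:
        if not ATOM.fullmatch(atom):
            raise ValueError(f"unexpected token: {atom!r}")
    return Juxtaposition(atoms)


def _factor(atoms, i, greedy):
    """Resolve one factor starting at index i; return (source, next index)."""
    atom = atoms[i]
    if atom not in KNOWN_FUNCTIONS:
        return atom, i + 1
    if i + 1 == len(atoms):
        raise ValueError(f"function {atom!r} has no argument")
    if greedy:
        # The function swallows everything to its right: tan x y -> tan(x*y)
        return f"{atom}({_product(atoms, i + 1, greedy)})", len(atoms)
    # The function takes only the next factor: tan x y -> tan(x)*y
    arg, j = _factor(atoms, i + 1, greedy)
    return f"{atom}({arg})", j


def _product(atoms, i, greedy):
    """Join all factors from index i onward with explicit multiplication."""
    factors = []
    while i < len(atoms):
        factor, i = _factor(atoms, i, greedy)
        factors.append(factor)
    return "*".join(factors)


def resolve(node, greedy=False):
    """Post-parsing stage: turn a Juxtaposition into Python-style source."""
    return _product(node.atoms, 0, greedy)


if __name__ == "__main__":
    for text in ("2 x", "x y", "tan x", "tan x y"):
        print(text, "->", resolve(parse(text)))
    print("tan x y (greedy) ->", resolve(parse("tan x y"), greedy=True))
```

The point of the greedy flag is just that the choice between tan(x)*y and tan(x*y) lives entirely in the post-parsing step, so changing the policy (or making it fail instead) doesn't touch the grammar.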
Another interesting example that I thought of is something like sin^2(x) for sin(x)**2 (the former is common notation for this, and indeed SymPy even pretty prints it that way). Parsing the one like the other would require changing the precedence order, as it would normally be parsed as sin^(2(x)). So you might think of ways to make that work, and whether those ways work at the parsing stage, the post-parsing stage, or both.

So what I would do is try things in order of easiest to hardest (and natural language heuristics are one of the hardest), and stop working when you either run out of time or feel that you've done enough. You almost certainly won't get to do it all, but it's not clear just how far you will get, so set yourself up to do as much as you can.

By the way, the parser in SymPy is exactly the same as the standard library tokenize module, except we've extended ours to do some other stuff (e.g., parse "x!" as factorial(x), wrap all undefined names in Symbol, wrap all number literals in Integer or Float, etc.). So the parts that are just extending tokenize should go there. The rest should go in the parsing module (another good thing to think about, by the way, is a good way of organizing the parsing code; that was discussed a little bit on that other thread).

Aaron Meurer

On Tue, Sep 4, 2012 at 3:29 AM, Joachim Durchholz <[email protected]> wrote:
> On 04.09.2012 00:11, David Li wrote:
>
>> So perhaps some heuristic for differentiating
>> between various input languages and then interpreting them as Python
>> (Python, TeX, "English-like", etc.) could also be an interesting task.
>
>
> Heh. That's simple:
> - Have a grammar for each syntax that we have,
> - run the input through all grammars,
> - use the grammar that doesn't return an error.
>
> The fun begins when considering the following cases:
> 1) No grammar matches.
> 2) More than one grammar matches.
>
> For (1), you'd want to somehow rank the grammars according to how close the
> input is to each grammar, and assume the user really meant the closest one.
>
> For (2), you'd want to check if the different grammars all really mean the
> same. E.g. "1*1" should parse the same for all math grammars. If they do,
> just continue processing.
> Otherwise, you'll have to ask the user. Or randomly guess one and let the
> user explicitly select grammars.
>
> There's also a slight complication for case (2): You may get different parse
> trees but they'd boil down to the same operations. For example, grammars
> with different numbers of precedence levels tend to end up that way; 1*2
> could end as
>
>   op: *
>     int: 1
>     int: 2
>
> or as
>
>   op: *
>     literal
>       int: 1
>     literal
>       int: 2
>
> where the second grammar would for some reason differentiate between
> literals, names, and other representations, while the first does not.
>
> You'll either need a pass that normalizes grammars, or require that
> commonalities between grammars are handled by identical rules.
> The first approach probably requires less work because SymPy already has
> routines for simplifying expressions; however, that makes error reporting
> more difficult because the transformations aren't built for keeping track of
> input line/column numbers.
>
> You see, there's enough to do :-)
>
> Not all aspects need to be addressed on the first round though. Just choose
> how much of this all you want to deal with, and code in a way that the rest
> can be added later without rewriting everything.
>
>
>> Since Gamma only deals with mathematical expressions (which is more
>> limited
>> than Wolfram|Alpha) I believe at least some basic English-like queries can
>> be interpreted.
>
>> ...
>
>> Given how
>> difficult it is, though, I guess just being able to interpret 2x, sin
>> x, and integral of x^2 would be a nice step up in functionality.
>
> Indeed, that's easy enough.
> You can always write a grammar that accepts a
> subset of English.
> Main points:
> - Do not require parentheses for function parameters; a function call is
>   just: name {expr}
> - Make name {expr} bind weaker than all operators, so sin x+y is equivalent
>   to sin (x+y).
>
>
>> I should've been more specific about that. I thought that
>> natural language could help somewhat with the task, or at least point me
>> towards algorithms and ideas, which is why I mentioned it.
>
>
> That wouldn't have worked. Parsing natural language is really hard. And the
> algorithms beyond parsing aren't related much to natural language.
>
> Still, the natural language parsers should be suitable.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "sympy" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/sympy?hl=en.
>
