#include <cwhyyoushouldbedoingunittestinginstead>

only having integration tests is why it's so hard to test toybox ps,
and why it's going to be hard to fuzz the code: we're missing the
boundaries that would let us test individual pieces. it's one of the
major problems with the toybox design/coding style. sure, it's
something all the existing competition in this space gets wrong too,
but that's the most obvious argument for the creation of the _next_
generation tool...
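to make that concrete: if the logic in the middle of a command were a
pure function, you could test it with canned values instead of a live
system. a minimal sketch of the kind of seam i mean -- these names are
made up for illustration, not actual toybox functions:

  #include <stdio.h>
  #include <string.h>
  #include <assert.h>

  // hypothetical seam: a pure function, no /proc access, no globals.
  // takes a jiffies count and renders the "TIME" column.
  static void format_time(char *buf, size_t len, long jiffies, long hz)
  {
    long secs = jiffies/hz;

    if (secs >= 24*60*60)
      snprintf(buf, len, "%ld-%02ld:%02ld:%02ld", secs/(24*60*60),
               (secs/3600)%24, (secs/60)%60, secs%60);
    else snprintf(buf, len, "%02ld:%02ld:%02ld", secs/3600,
                  (secs/60)%60, secs%60);
  }

  // now the test is deterministic and doesn't need a live pid:
  int main(void)
  {
    char buf[32];

    format_time(buf, sizeof(buf), 0, 100);
    assert(!strcmp(buf, "00:00:00"));
    format_time(buf, sizeof(buf), 100*90061, 100);  // 1 day 1h 1m 1s
    assert(!strcmp(buf, "1-01:01:01"));
    printf("ok\n");
    return 0;
  }

the shell-level tests would still cover the plumbing, but the logic in
the middle stops being welded to whatever's in /proc this second.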
On Sun, Mar 13, 2016 at 12:34 AM, Andy Chu <[email protected]> wrote:
>> Unfortunately, the test suite needs as much work as the command
>> implementations do. :(
>>
>> Ok, backstory!
>
> OK, thanks a lot for all the information! That helps. I will work on
> this. I think a good initial goal is just to triage the tests that
> pass and make sure they don't regress (i.e. make it easy to run the
> tests, keep them green, and perhaps have a simple buildbot). For
> example, the factor bug is trivial, but it's a lot easier to fix if
> you get feedback in an hour or so rather than a month later, when you
> have to load it back into your head.
>
>> Really, I need a tests/pending. :(
>
> Yeah, I have some ideas about this. I will try them out and send a
> patch. I do think there need to be more than two categories, as you
> say, and perhaps more than one kind of categorization.
>
>> Building scripts to test each individual input is what the test
>> suite is all about. Figuring out what those inputs should _be_ (and
>> the results to expect) is, alas, work.
>
> Right, and it's work that fuzzing should be able to piggyback on...
> so I was trying to find a way to leverage the existing test cases,
> pretty much like this:
>
> http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-easy-way.html
>
> But the difference is that unlike sqlite, fuzzing toybox could do
> arbitrarily bad things to your system, so it really needs to be
> sandboxed. A fuzzer generates really nasty inputs -- I wouldn't be
> surprised if it could crash the kernel too.
>
> Parsers in C are definitely the most likely successful targets for a
> fuzzer, and sed seems to be the most complex parser in toybox so far.
> The regex parsing seems to be handled by libraries, and I don't think
> those are instrumented (because they're in a shared library, not
> compiled with afl-gcc). I'm sure we can find a few more bugs, though.
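fwiw the mechanics of that are roughly this -- a sketch with made-up
paths, and the sandboxing (container/chroot) is still very much on you:

  # build an instrumented toybox (assumes afl-gcc is on $PATH)
  make clean
  make CC=afl-gcc defconfig
  make CC=afl-gcc toybox

  # seed the corpus from whatever small inputs the tests already have
  mkdir -p fuzz/in fuzz/out
  cp tests/files/* fuzz/in/

  # fuzz sed's script parser: afl replaces @@ with each mutated file.
  # -m caps memory; mutated commands can scribble anywhere the process
  # can write, so run all of this inside a throwaway container.
  afl-fuzz -i fuzz/in -o fuzz/out -m 64 -- ./toybox sed -f @@ /dev/null

seeding from the test suite's own inputs is exactly the sqlite trick
from the link above.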
>> There's also the fact that either the correct output or the input to
>> use is non-obvious. It's really easy for me to test things like grep
>> by going "grep -r xopen toys/pending". There's a lot of data for it
>> to bite on, and I can test ubuntu's version vs mine trivially and
>> see where they diverge.
>
> Yeah, there are definitely a lot of inputs besides the argv values,
> like the file system state and kernel state. Those are harder to
> test, but I like that you are testing with Aboriginal Linux and LFS.
> That is already a great torture test.
>
> FWIW I think the test harness is missing a few concepts:
>
> - exit code
> - stderr
> - file system state -- the current method of putting setup at the
>   beginning of foo.test *might* be good enough for some commands, but
>   probably not all
>
> But this doesn't need to be addressed initially.

(a rough sketch of what checking exit code and stderr could look like
is at the bottom of this mail.)

> By the way, is there a target language/style for shell and make? It
> looks like POSIX shell, and I'm not sure about the Makefile -- is it
> just GNU make or something more restrictive? I like how you put most
> stuff in scripts/make.sh -- that's also how I like to do it.
>
> What about C? Clang is flagging a lot of warnings that GCC doesn't,
> mainly -Wuninitialized.
>
>> But putting that in the test suite, I need to come up with a set of
>> test files (the source changes each commit, and source changes
>> shouldn't cause test case regressions). I've done a start of
>> tests/files with some utf8 code in there, but it hasn't got nearly
>> enough complexity yet, and there's "standard test load that doesn't
>> change" vs "I thought of a new utf8 torture test and added it, but
>> that broke the ls -lR test."
>
> Some code coverage stats might help? I can probably set that up, as
> it's similar to making an ASAN build. (Perhaps something like this:
> http://llvm.org/docs/CoverageMappingFormat.html)
>
> The build patch I sent yesterday will help with that as well, since
> you need to set CFLAGS.
>
>> Or with testing "top", the output is based on the current system
>> load. Even in a controlled environment, it's butterfly effects all
>> the way down. I could look at the source files under /proc that I
>> calculated the values from, but A) that's hugely complex, B) it's a
>> giant race condition, and C) is implementing two parallel code paths
>> that do the same thing a valid test? If I'm calculating the wrong
>> value because I didn't understand what that field should mean, my
>> test would also be wrong...
>>
>> In theory testing "ps" is easier, in that "ps" with no arguments is
>> the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
>> pid of the "ps" binary changes, and the "TIME" of the shell might
>> tick over to the next second. You can't "head -n 2" it, because the
>> output is sorted by pid, which wraps, so if your ps pid is lower
>> than your bash pid it would come first. Oh, and there's no guarantee
>> the shell you're running is "bash" unless you're in a controlled
>> environment... And that's just testing the output with no arguments.
>
> Those are definitely hard ones... I agree with the strategy of
> classifying the tests, so we can see how many of the hard cases there
> are. I think detecting trivial breakages will be an easy first step,
> and it should allow others to contribute more easily.
>
> thanks,
> Andy

--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a
reviewer.
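p.s. the exit code/stderr sketch: something like this, roughly. the
function name and arguments here are invented, so treat it as
pseudocode next to whatever scripts/runtest.sh's testing() actually
does:

  # like testing(), but also compares exit status and stderr:
  # testing3 name command expected-stdout expected-stderr expected-exit
  testing3 ()
  {
    OUT=$(eval "$2" 2>"/tmp/err.$$"); RET=$?
    ERR=$(cat "/tmp/err.$$"); rm -f "/tmp/err.$$"

    if [ "$OUT" = "$3" ] && [ "$ERR" = "$4" ] && [ "$RET" -eq "$5" ]
    then echo "PASS: $1"
    else echo "FAIL: $1 (out='$OUT' err='$ERR' ret=$RET)"
    fi
  }

  # usage:
  testing3 "true exits 0" "true" "" "" 0
  testing3 "false exits 1" "false" "" "" 1

comparing stderr byte-for-byte is probably too strict for messages we
don't control, but checking the exit status alone would already catch
a lot.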
