http://bugzilla.meego.com/show_bug.cgi?id=1007

           Summary: nightly testing + valgrind results
    Classification: MeeGo Projects
           Product: SyncEvolution
           Version: unspecified
          Platform: Netbook
        OS/Version: IA
            Status: ASSIGNED
          Severity: enhancement
          Priority: High
         Component: SyncEvolution
        AssignedTo: [email protected]
        ReportedBy: [email protected]
         QAContact: [email protected]
                CC: [email protected],
                    [email protected]
   Estimated Hours: 0.0


This is from http://bugzilla.moblin.org/show_bug.cgi?id=6457

   Description  From  pohly   2009-09-27 13:32:12 PST   (-) [reply]

While running some routine valgrind checks of the dbus-api branch I noticed a
regression imported from "master" (invalid memory access to result of
SyncSourceBackendDebug()). The code went into "master" earlier last week. The
nightly testing should have found this problem, because syncs should be run
under valgrind control.

Do we still do that? If we do, did valgrind find the problem? It should be
reported at the end of the client-test run to stdout/stderr and result in a
non-zero return code of client-test.

I suspect that since the migration to the HTML format for test results this
particular aspect of testing is no longer included in the output and thus might
have been missed last week.

Yongsheng, can you check this?

------- Comment #1 From yongsheng zhu 2009-09-27 18:04:59 PST (-) [reply]
-------

ok, we'll investigate it.

------- Comment #2 From yongsheng zhu 2009-09-27 18:16:16 PST (-) [reply]
-------

funambol: CLIENT_TEST_SERVER=funambol
CLIENT_TEST_SOURCES=vcard21,ical20,itodo20,text
CLIENT_TEST_SKIP=Client::Sync::ical20::Retry,Client::Sync::ical20::Suspend,Client::Sync::ical20::Resend,Client::Sync::vcard21::Retry,Client::Sync::vcard21::Suspend,Client::Sync::vcard21::Resend,Client::Sync::itodo20::Retry,Client::Sync::itodo20::Suspend,Client::Sync::itodo20::Resend,Client::Sync::text::Retry,Client::Sync::text::Suspend,Client::Sync::text::Resend,Client::Sync::vcard21_ical20_itodo20_text::Retry,Client::Sync::vcard21_ical20_itodo20_text::Suspend,Client::Sync::vcard21_ical20_itodo20_text::Resend,Client::Sync::text_itodo20_ical20_vcard21::Retry,Client::Sync::text_itodo20_ical20_vcard21::Suspend,Client::Sync::text_itodo20_ical20_vcard21::Resend
CLIENT_TEST_XML=1 CLIENT_TEST_MAX_ITEMSIZE=2048  CLIENT_TEST_DELAY=10
CLIENT_TEST_FAILURES= CLIENT_TEST_COMPARE_LOG=T CLIENT_TEST_RESEND_TIMEOUT=5
CLIENT_TEST_INTERRUPT_AT=1 SYNC_EVOLUTION_EVO_CALENDAR_DELAY=1
CLIENT_TEST_ALARM=1200 CLIENT_TEST_LOG=
CLIENT_TEST_EVOLUTION_PREFIX=file:///work/runtests/head/databases schroot -p -c
testing-amd64 -- /work/runtests/dbus-session.sh /work/runtests/valgrindcheck.sh
env LD_LIBRARY_PATH=build-synthesis/src/.libs ./client-test Client::Sync:
failed (return code 256)

Do you mean this return code 256? If yes, we'll add it soon.

------- Comment #3 From pohly 2009-09-27 22:51:59 PST (-) [reply] -------

(In reply to comment #2)
> Do you mean this return code 256? If yes, we'll add it soon.

Yes, please. Any non-zero return indicates an error.

To detect whether valgrind ran and reported errors, you would have to scan the
text output.

------- Comment #4 From yongsheng zhu 2009-09-28 19:02:39 PST (-) [reply]
-------

the fix is done and should go into effect today.

------- Comment #5 From pohly 2009-09-29 13:04:41 PST (-) [reply] -------

(In reply to comment #3)
> To detect whether valgrind ran and reported errors, you would have to scan the
> text output.

(In reply to comment #4)
> the fix is done and should go into effect today.

How do you detect whether valgrind reported an error? The output from today's
test says "Valgrind: failed (return code 256)" for some servers, for example
Memotoo. But the output for valgrind doesn't show anything from valgrind:

http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing/13-memotoo/output.txt

If there was an error from valgrind, can this be made available as a link in
the report? Either link to the output.txt or some reduced version of it with
just the valgrind reports.

Speaking of the report and links, today's report has some inconsistent counts
(negative number for skipped, zero total), no information about which tests
failed for a server and the copy of the result files have permission issues on
runtests.syncevolution.org:
http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing/9-funambol/
=> permission denied

------- Comment #6 From yongsheng zhu 2009-09-29 18:15:21 PST (-) [reply]
-------

(In reply to comment #5)
> (In reply to comment #3)
> > To detect whether valgrind ran and reported errors, you would have to scan 
> > the
> > text output.
> 
> (In reply to comment #4)
> > the fix is done and should go into effect today.
> 
> How do you detect whether valgrind reported an error? The output from today's
> test says "Valgrind: failed (return code 256)" for some servers, for example
> Memotoo. But the output for valgrind doesn't show anything from valgrind:
> 
> http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing/13-memotoo/output.txt
Here is the result from this text output:
schroot -p -c testing-amd64 -- /work/runtests/dbus-session.sh
/work/runtests/valgrindcheck.sh env LD_LIBRARY_PATH=build-synthesis/src/.libs
./client-test Client::Sync: failed (return code 256)


> If there was an error from valgrind, can this be made available as a link in
> the report? Either link to the output.txt or some reduced version of it with
> just the valgrind reports.
I'll add it.
> Speaking of the report and links, today's report has some inconsistent counts
> (negative number for skipped, zero total), no information about which tests
> failed for a server and the copy of the result files have permission issues on
> runtests.syncevolution.org:
> http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing/9-funambol/
> => permission denied
strange. I could access it. The negative number is because Congwu made some
changes on collecting all cases. we'll work this to solve it. thanks

------- Comment #7 From pohly 2009-09-29 23:49:28 PST (-) [reply] -------

(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> > > To detect whether valgrind ran and reported errors, you would have to 
> > > scan the
> > > text output.
> > 
> > (In reply to comment #4)
> > > the fix is done and should go into effect today.
> > 
> > How do you detect whether valgrind reported an error? The output from 
> > today's
> > test says "Valgrind: failed (return code 256)" for some servers, for example
> > Memotoo. But the output for valgrind doesn't show anything from valgrind:
> > 
> > http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing/13-memotoo/output.txt
> Here is the result from this text output:
> schroot -p -c testing-amd64 -- /work/runtests/dbus-session.sh
> /work/runtests/valgrindcheck.sh env LD_LIBRARY_PATH=build-synthesis/src/.libs
> ./client-test Client::Sync: failed (return code 256)

True, but the 256 return code wasn't caused by valgrindcheck.sh. There is no
error output from valgrind.

My apologies for a probably misleading comment earlier: 256 indicates that
there was a problem, which *might* have been an error found by valgrind.
Scanning the error output is necessary to determine whether it really was
valgrind or just some regular CPPUnit test failure.

Perhaps this can be changed: valgrindcheck.sh could return RET=100 if it found
something *and* override the return code even if the original application also
failed. Here's the patch:

--- a/test/valgrindcheck.sh
+++ b/test/valgrindcheck.sh
@@ -27,15 +27,13 @@ killvalgrind () {
 ( set -x; env GLIBCXX_FORCE_NEW=1 G_SLICE=always-malloc G_DEBUG=gc-friendly
valgrind $VALGRIND_ARGS --l
 RET=$?

-if [ $RET == 0 ]; then
-    # give other valgrind instances some time to settle down, then kill them
-    sleep 1
-    killvalgrind -15
-    sleep 5
-    killvalgrind -9
-    if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
-        RET=1
-    fi
+# give other valgrind instances some time to settle down, then kill them
+sleep 1
+killvalgrind -15
+sleep 5
+killvalgrind -9
+if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
+    RET=100
 fi

 exit $RET

------- Comment #8 From yongsheng zhu 2009-09-30 00:42:15 PST (-) [reply]
-------

> --- a/test/valgrindcheck.sh
> +++ b/test/valgrindcheck.sh
> @@ -27,15 +27,13 @@ killvalgrind () {
>  ( set -x; env GLIBCXX_FORCE_NEW=1 G_SLICE=always-malloc G_DEBUG=gc-friendly
> valgrind $VALGRIND_ARGS --l
>  RET=$?
> 
> -if [ $RET == 0 ]; then
> -    # give other valgrind instances some time to settle down, then kill them
> -    sleep 1
> -    killvalgrind -15
> -    sleep 5
> -    killvalgrind -9
> -    if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
> -        RET=1
> -    fi
> +# give other valgrind instances some time to settle down, then kill them
> +sleep 1
> +killvalgrind -15
> +sleep 5
> +killvalgrind -9
> +if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
> +    RET=100
>  fi
> 
>  exit $RET
ok, I'll add this and check the return code. thanks

------- Comment #9 From yongsheng zhu 2009-10-12 22:03:19 PST (-) [reply]
-------

> ok, I'll add this and check the return code. thanks
I have submitted the changes. waiting for test results.

------- Comment #10 From yongsheng zhu 2009-10-14 20:05:41 PST (-) [reply]
-------

(In reply to comment #9)
> > ok, I'll add this and check the return code. thanks
> I have submitted the changes. waiting for test results.

These should be done according to test results.

------- Comment #11 From pohly 2009-10-16 09:16:07 PST (-) [reply] -------

(In reply to comment #10)
> (In reply to comment #9)
> > > ok, I'll add this and check the return code. thanks
> > I have submitted the changes. waiting for test results.
> 
> These should be done according to test results.

The test results only showed "valgrind ok", they didn't show that valgrind
errors are found. So I set out to trigger that by adding an intentional memory
leak into SyncContext::sync().

I found several problems:
- valgrind is run on the libtool shell wrapper with --trace-children=no,
  so it never checked our code; fixed
- in
~tester/results/2009-10-16-17-33/head-evolution-testing/6-scheduleworld/output.txt
  one can see that valgrind found something, but the return code isn't 100

I suspect that the return code has to be decoded to distinguish integer error
codes from "killed by signal" results. Yongsheng, can you check this?

To reproduce this problem more or less quickly I used:

diff --git a/src/syncevo/SyncContext.cpp b/src/syncevo/SyncContext.cpp
index 499f54b..aa791f7 100644
--- a/src/syncevo/SyncContext.cpp
+++ b/src/syncevo/SyncContext.cpp
@@ -1462,6 +1462,9 @@ SyncMLStatus SyncContext::sync(SyncReport *report)
 {
     SyncMLStatus status = STATUS_OK;

+    // intentional memory leak
+    malloc(100);
+
     if (!exists()) {
         SE_LOG_ERROR(NULL, NULL, "No configuration for server \"%s\" found.",
m_server.c_str());
         throwError("cannot proceed without configuration");

and

runtests testing scheduleworld sanity

The malloc() is removed again and not needed, because valgrind finds problems
in some of the non-SyncEvolution processes that we end up tracing. Either we
write suppressions for these problems or we must configure valgrind to only
trace our binary (might be hard).

------- Comment #12 From yongsheng zhu 2009-10-18 22:44:51 PST (-) [reply]
-------

> I suspect that the return code has to be decoded to distinguish integer error
> codes from "killed by signal" results. Yongsheng, can you check this?
On Unix, the return value of os.system() is the exit status of the process
encoded in the format specified for os.wait().
For os.wait(),the return value is a 16-bit number, whose low byte is the signal
number that killed the process, and whose high byte is the exit status (if the
signal number is zero);
Thus the return codes printed in the result are always multiples of 256

------- Comment #13 From yongsheng zhu 2009-10-19 00:29:05 PST (-) [reply]
-------

(In reply to comment #12)
> > I suspect that the return code has to be decoded to distinguish integer 
> > error
> > codes from "killed by signal" results. Yongsheng, can you check this?
> On Unix, the return value of os.system() is the exit status of the process
> encoded in the format specified for os.wait().
> For os.wait(),the return value is a 16-bit number, whose low byte is the 
> signal
> number that killed the process, and whose high byte is the exit status (if the
> signal number is zero);
> Thus the return codes printed in the result are always multiples of 256

The patch for this issue was also comitted in the branch 'nightly' in the
directory '~tester/syncevolution-testing/syncevolution'

------- Comment #14 From yongsheng zhu 2009-10-19 00:33:05 PST (-) [reply]
-------

> Either we write suppressions for these problems or we must configure valgrind 
> to only
> trace our binary (might be hard).
We have to add *too many* suppressions if using suppressions. But it seems to
me harder to configure valgrind to only trace our binary.

------- Comment #15 From yongsheng zhu 2009-10-19 18:20:26 PST (-) [reply]
-------

> I found several problems:
> - valgrind is run on the libtool shell wrapper with --trace-children=no,
>   so it never checked our code; fixed
> - in
A problem is that the nightly test could not be completed in the same day it
starts because valgrind make apps running slowly.

------- Comment #16 From pohly 2009-10-20 00:53:17 PST (-) [reply] -------

(In reply to comment #14)
> > Either we write suppressions for these problems or we must configure 
> > valgrind to only
> > trace our binary (might be hard).
> We have to add *too many* suppressions if using suppressions.

How many errors did you find? Looking at tonight's "scheduleworld" run I see
one error for /usr/lib/bonobo-activation/bonobo-activation-server and one for
an unknown binary. Everything else is in SyncEvolution. "Everything" luckily
isn't that much: one leak related to libical.

This doesn't look that bad.

What is more annoying is that multiple processes writing into the same valgrind
file seem to add nul bytes to the file. This is visible when viewing with
"less" and makes me wonder whether the output is really complete.

So, here are two ideas how we can avoid this:
* compile the "testing" binaries with --disable-shared --enable-static to
  avoid the libtool wrapper script, then stop using --trace-children=yes
* run the bonobo-activation-server in advance, so that valgrind doesn't
  catch it

I'll have a look at this.

(In reply to comment #15)
> > I found several problems:
> > - valgrind is run on the libtool shell wrapper with --trace-children=no,
> >   so it never checked our code; fixed
> > - in
> A problem is that the nightly test could not be completed in the same day it
> starts because valgrind make apps running slowly.

Is that a real problem for us or hypothetical? Tonight's run hasn't completed
yet, but that isn't unusual. valgrind is a indeed a problem for CPU bound code,
but we are mostly limited by IO performance and the obligatory sleeps() that we
need to not overload servers and for the time-based change tracking.

------- Comment #17 From yongsheng zhu 2009-10-20 18:25:47 PST (-) [reply]
-------

> Is that a real problem for us or hypothetical? Tonight's run hasn't completed
> yet, but that isn't unusual. valgrind is a indeed a problem for CPU bound 
> code,
> but we are mostly limited by IO performance and the obligatory sleeps() that 
> we
> need to not overload servers and for the time-based change tracking.

We have such this case. Friday's head-evolution-testing was completed in the
sunday morning. The most important concern is that we'll share some resources,
such as valgrind generation output file(/work/runtests/valgrind.out). Two
processes reading/writing it concurrently could make valgrind checking
unexpected.

------- Comment #18 From yongsheng zhu 2009-10-20 18:37:39 PST (-) [reply]
-------

> How many errors did you find? Looking at tonight's "scheduleworld" run I see
> one error for /usr/lib/bonobo-activation/bonobo-activation-server and one for
> an unknown binary. Everything else is in SyncEvolution. "Everything" luckily
> isn't that much: one leak related to libical.
> 
> This doesn't look that bad.
Yes, you are right. each 'interoperability server' has several error, some of
them are the same. so I'll add them to the suppression file if they are not
sync mem errors.

scheduleword: 4
synthesis:11
funambol: 6
zyb:7
google: 7
mobical: 14
memotoo:12

------- Comment #19 From yongsheng zhu 2009-10-20 20:17:52 PST (-) [reply]
-------

> Yes, you are right. each 'interoperability server' has several error, some of
> them are the same. so I'll add them to the suppression file if they are not
> sync mem errors.
> 
> scheduleword: 4
> synthesis:11
> funambol: 6
> zyb:7
> google: 7
> mobical: 14
> memotoo:12

I've added memory errors of other libraries into evo.supp. And there are 3
memory errors from synthesis. A bug #7121 is opened to track them.

------- Comment #20 From pohly 2009-10-21 03:17:35 PST (-) [reply] -------

(In reply to comment #19)
> > Yes, you are right. each 'interoperability server' has several error, some 
> > of
> > them are the same. so I'll add them to the suppression file if they are not
> > sync mem errors.
> > 
> > scheduleword: 4
> > synthesis:11
> > funambol: 6
> > zyb:7
> > google: 7
> > mobical: 14
> > memotoo:12
> 
> I've added memory errors of other libraries into evo.supp.

Careful here, please. For example, you added this one here:

+# ==30565== 6,564 bytes in 144 blocks are definitely lost in loss record 48 of
72
+# ==30565==    at 0x4C2391E: malloc (vg_replace_malloc.c:207)
+# ==30565==    by 0x5B1FA21: strdup (in /lib/libc-2.9.so)
+# ==30565==    by 0xD9E080C: icalparameter_new_from_value_string (in
/usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xD9F6AF0: icalparser_add_line (in
/usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xD9F7088: icalparser_parse (in /usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xD9F7230: icalparser_parse_string (in
/usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xA2CB224:
SyncEvo::EvolutionCalendarSource::insertItem(std::string const&, std::stri

The suppression for it contains the full stack. That has the effect that the
slightest change in how we call the function will no longer match the
suppression, causing the error to pop up again. Other calls into
icalparser_parse_string() are not matched by the suppression either.

When adding a new suppression, think carefully about which stack entries to
include in the match. In the example above,
  malloc
  strdup
  icalparameter_new_from_value_string
would be much to broad. Including SyncEvolution as caller is too narrow. The
right approach IMHO would be to identify the leak with
icalparser_parse_string() as top-level call. Because it does not return the
string which is leaked here, we can be reasonable sure that our code is not at
fault.

A similar problem occurs when specifying object files:
   obj:/usr/lib/libgnome-2.so.0.2600.0
This won't match anymore when the library version or install path changes. It's
better to use wildcards:
   obj:*libgnome-2*

As before, use common sense to choose between "too limited" and "too broad".

All of the new suppressions should be reviewed taking that into account.
However, perhaps we don't need them at all. I'm about to try out compiling with
--disable-shared for valgrind testing.

------- Comment #21 From pohly 2009-10-21 05:58:17 PST (-) [reply] -------

(In reply to comment #16)
> * compile the "testing" binaries with --disable-shared --enable-static to
>   avoid the libtool wrapper script, then stop using --trace-children=yes

I've made that change. I'm going to run the full test once to check how slow it
really is. cron job is disabled and needs to be reactivated.

------- Comment #22 From yongsheng zhu 2009-10-21 18:24:41 PST (-) [reply]
-------

> The suppression for it contains the full stack. That has the effect that the
> slightest change in how we call the function will no longer match the
> suppression, causing the error to pop up again. Other calls into
> icalparser_parse_string() are not matched by the suppression either.
> 
> When adding a new suppression, think carefully about which stack entries to
> include in the match. In the example above,
>   malloc
>   strdup
>   icalparameter_new_from_value_string
> would be much to broad. Including SyncEvolution as caller is too narrow. The
> right approach IMHO would be to identify the leak with
> icalparser_parse_string() as top-level call. Because it does not return the
> string which is leaked here, we can be reasonable sure that our code is not at
> fault.
> 
> A similar problem occurs when specifying object files:
>    obj:/usr/lib/libgnome-2.so.0.2600.0
> This won't match anymore when the library version or install path changes. 
> It's
> better to use wildcards:
>    obj:*libgnome-2*
> 
> As before, use common sense to choose between "too limited" and "too broad".
Yes, reasonable. I'll refine the suppressions.

------- Comment #23 From yongsheng zhu 2009-10-21 18:28:40 PST (-) [reply]
-------

> However, perhaps we don't need them at all. I'm about to try out compiling 
> with
> --disable-shared for valgrind testing.
ok,I'd hold on to fix up suppressions and see your experimental results.

------- Comment #24 From pohly 2009-10-23 04:35:01 PST (-) [reply] -------

(In reply to comment #21)
> (In reply to comment #16)
> > * compile the "testing" binaries with --disable-shared --enable-static to
> >   avoid the libtool wrapper script, then stop using --trace-children=yes
> 
> I've made that change. I'm going to run the full test once to check how slow 
> it
> really is.

The whole run completed in 9 hours. This is fast enough to not overlap, but to
be sure I added a check to runtests.

It seems that someone has already enabled the cron job already.

(In reply to comment #23)
> > However, perhaps we don't need them at all. I'm about to try out compiling 
> > with
> > --disable-shared for valgrind testing.
> ok,I'd hold on to fix up suppressions and see your experimental results.

I removed them from the "nightly" branch. They are in bug6547-valgrind, so if
we find that we need them, we can merge and refine them.

------- Comment #25 From pohly 2009-12-09 08:39:35 PST (-) [reply] -------

(In reply to comment #24)
> (In reply to comment #21)
> > (In reply to comment #16)
> > > * compile the "testing" binaries with --disable-shared --enable-static to
> > >   avoid the libtool wrapper script, then stop using --trace-children=yes
> > 
> > I've made that change. I'm going to run the full test once to check how 
> > slow it
> > really is.
> 
> The whole run completed in 9 hours. This is fast enough to not overlap, but to
> be sure I added a check to runtests.

I'm not sure whether I ran the right binary at that time.

When rechecking on this issue I found that an intentional memory leak was not
reported. Looking further I found that this was because valgrind was invoked
like this:
/work/runtests/dbus-session.sh /work/runtests/valgrindcheck.sh env
LD_LIBRARY_PATH=build-synthesis/src/.libs ./client-test
Client::Sync::vcard30::testItems

In other words, valgrind was checking the "env" command and due to not tracing
children, not client-test.

I've changed the order. Expect "valgrind" failed errors. I'll rebase on the
latest upstream libsynthesis, then we'll need to work on fixing or supressing
issues.

-- 
Configure bugmail: http://bugzilla.meego.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.
_______________________________________________
Syncevolution-issues mailing list
[email protected]
http://lists.syncevolution.org/listinfo/syncevolution-issues

Reply via email to