With the current code base I can easily trigger a bug in farm during a test case that creates a cluster with four nodes, then shuts the cluster down, restarts two of the sheep, starts a new sheep and then restarts the other two original sheep.
The stack trace looks like:

#0  0x00007f1b526af3a5 in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f1b526b2b0b in __GI_abort () at abort.c:92
#2  0x0000000000425562 in strbuf_grow (sb=0x7ffffc3599b0, extra=18446744073709551615) at strbuf.c:54
#3  0x0000000000425934 in strbuf_add (sb=0x7ffffc3599b0, data=0x7ffffc3539b0, len=18446744073709551615) at strbuf.c:101
#4  0x000000000041f4be in snap_file_write (epoch=1, trunksha1=0x7ffffc359a60 "\204\344Ú\307\006\v\234\354\025\233w*m\364\341A\353\367\226\377\177", outsha1=0x7ffffc359a40 "", user=0) at farm/snap.c:171
#5  0x0000000000420c42 in farm_end_recover (iocb=0x7ffffc359aa0) at farm/farm.c:543
#6  0x000000000041396a in do_recover_main (work=0x6266340) at recovery.c:415
#7  0x000000000040fd73 in bs_thread_request_done (fd=11, events=1, data=0x0) at work.c:159
#8  0x00000000004219b8 in event_loop (timeout=-1) at event.c:181
#9  0x00000000004049ce in main (argc=10, argv=0x7ffffc35b308) at sheep.c:285

This series avoids reading the epoch file during the recovery process entirely.
-- 
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog
