Yes, true. Back to the retry/delay for dlr's: I think that, for this particular problem at least, adding a configurable sleep() on dlr_find would do the trick.
A more elaborate solution would involve retrying the missing dlr's, but I'm not sure it's worth it: the code would be more complex and would double the requests on the missing records (though it would save a few milliseconds on all dlr's).

Opinions? Stipe? Alex?

Would something like this make it into the main tree, or shall I keep it for my personal list of dirty hacks? ;)

Regards,

Alejandro

2009/4/29 Nikos Balkanas <[email protected]>

> Maybe a more robust mechanism should be in place for DLRs. On and off
> various tickets have surfaced about it. Consider that the 2nd server that
> receives the DLR loses its connection to the DB. It should store somewhere
> the DLRs until the connectivity is recovered.
>
> BR,
> Nikos
>
> ----- Original Message -----
> *From:* Nikos Balkanas <[email protected]>
> *To:* Alejandro Guerrieri <[email protected]> ;
> [email protected]
> *Sent:* Thursday, April 30, 2009 12:44 AM
> *Subject:* Re: Possible race condition with dlr-mysql
>
> Definitely. DLRs are not synchronous and therefore a little extra delay
> wouldn't hurt. To make it better I would suggest the delay only in the case
> of DB storage for DLRs.
>
> Nikos
>
> ----- Original Message -----
> *From:* Alejandro Guerrieri <[email protected]>
> *To:* [email protected]
> *Sent:* Thursday, April 30, 2009 12:28 AM
> *Subject:* Possible race condition with dlr-mysql
>
> Hi,
> I'm doing some tests with DLR's (mysql storage) and I've come across a
> weird problem.
>
> I'm using 2 Kannel servers, each one having an SMPP with a carrier.
> Messages may come and go over either link, so maybe an MT goes from server
> #1 and a DLR comes back on server #2. To solve that issue, I'm using a
> central DB and the mysql storage for DLR's.
>
> The problem is, sometimes (about 1 in 5-6 messages) the DLR arrives
> *before* the row is inserted, so kannel ignores it and the record then
> remains untouched forever.
> This usually happens when the MT and the DLR are
> processed on different servers, though most of the time it just works (even
> when the MT and DLR are processed on different servers, the DLR is found,
> processed and deleted).
>
> Here's an example:
>
> *Server #1:*
>
> 2009-04-29 16:44:45 [14318] [7] DEBUG: DLR[mysql]: Adding DLR smsc=my-smsc,
> ts=5073a07e, src=OOOO, dst=XXXXXXXXXXX, mask=31, boxc=
>
> 2009-04-29 16:44:45 [14318] [7] DEBUG: sql: INSERT INTO dlr (smsc, ts,
> source, destination, service, url, mask, boxc, status) VALUES ('my-smsc',
> '5073a07e', 'OOOO', 'XXXXXXXXXXX', 'kannel', '
> http://my-host-name/dlr?id=f59d4249-65d8-4969-a2d9-636c881b9de7&code=%d&scode=%B',
> '31', '', '0');
>
> *Server #2:*
>
> 2009-04-29 16:44:45 [8395] [9] DEBUG: DLR[mysql]: Looking for DLR
> smsc=my-smsc, ts=5073a07e, dst=XXXXXXXXXXX, type=2
>
> 2009-04-29 16:44:45 [8395] [9] DEBUG: sql: SELECT mask, service, url,
> source, destination, boxc FROM dlr WHERE smsc='my-smsc' AND ts='5073a07e';
>
> 2009-04-29 16:44:45 [8395] [9] ERROR: SMPP[my-smsc]: got DLR but could not
> find message or was not interested in it id<5073a07e> dst<XXXXXXXXXXX>
>
> I think that's because there's a possible race condition inherent in SQL
> latency: the dlr can only be inserted after the submit_sm_resp is
> received, but perhaps the smsc starts delivering the message on a separate
> thread right after receiving the submit_sm. Add some SQL latency and there's
> a possible race condition:
>
> 1. Kannel sends a submit_sm
> 2. SMSC starts delivering the message on another thread
> 3. SMSC starts delivering the submit_sm_resp
> 4. The SMSC ends delivering the DLR.
> 5. Kannel receives the DLR and searches for it on the DB. Not found - DLR
>    is ignored.
> 6. The SMSC ends delivering the submit_sm_resp
> 7. Kannel parses the receipted_message_id and inserts the DLR.
> 8. The DLR row is not searched again and remains forever on the queue.
>
> A possible solution would be to implement a (configurable/disabled by
> default) retry mechanism for missing DLR's. For example, retrying one or two
> times after a few milliseconds if the dlr is not found.
>
> Opinions? Insights?
>
> Regards,
>
> Alejandro
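
For reference, here is a minimal sketch of the retry idea from this thread. It is written in Python rather than Kannel's C, purely to illustrate the control flow. Everything in it is an assumption made for the example: `find_dlr_with_retry`, its `retries` and `delay_ms` parameters, and the in-memory `store` are hypothetical names, not Kannel functions or configuration directives. The race is simulated by letting the "insert" land while the lookup is waiting.

```python
import time
from typing import Callable, Optional


def find_dlr_with_retry(
    lookup: Callable[[str, str], Optional[dict]],
    smsc: str,
    ts: str,
    retries: int = 0,
    delay_ms: int = 50,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Look up a DLR row, retrying a bounded number of times if it is missing.

    retries=0 means a single lookup and no waiting, i.e. the mechanism
    is disabled by default (hypothetical default, chosen for this sketch).
    """
    for attempt in range(retries + 1):
        row = lookup(smsc, ts)
        if row is not None:
            return row
        if attempt < retries:
            # Give the other server a moment to finish its INSERT.
            sleep(delay_ms / 1000.0)
    return None


# --- Simulated race (stand-in for the shared MySQL table) ---
store = {}
slept = []


def lookup(smsc, ts):
    # Stand-in for: SELECT ... FROM dlr WHERE smsc=... AND ts=...
    return store.get((smsc, ts))


def insert_lands_during_wait(seconds):
    # Stand-in for server #1 completing its INSERT while server #2 waits.
    slept.append(seconds)
    store[("my-smsc", "5073a07e")] = {"mask": 31, "service": "kannel"}


# Without retries the early DLR is missed, as in the log above.
no_retry = find_dlr_with_retry(lookup, "my-smsc", "5073a07e")

# With retries the second lookup finds the row after one short wait.
with_retry = find_dlr_with_retry(
    lookup, "my-smsc", "5073a07e",
    retries=2, delay_ms=20, sleep=insert_lands_during_wait,
)
```

Only lookups that miss pay the extra delay and the extra query, which is the trade-off against an unconditional sleep() before every lookup: the fixed sleep is simpler but delays all DLRs, while the retry doubles the requests for the missing ones.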
