Short, simple question: does this combination ring a bell for anyone?
- Net+OS 7.1.1
- 71 minutes (or 1 hour 11 minutes), give or take a few seconds
I’ve got a Net+OS 7.1.1 project on the CC7U that crashes if an SSL connection is open at the moment the device’s uptime reaches a multiple of 71 minutes after boot.
Without an SSL connection it will run for hours and hours. I can connect and disconnect as often as I want, but I cannot be connected when tx_time_get reaches a multiple of 71 minutes, or the device will freeze.
My timing is measured through debug logging: the device broadcasts its uptime over plain unencrypted UDP once a minute, plus event logging the same way when/if necessary.
When “it” happens, those broadcasts stop coming and the device stops responding to pings.
By adding more logging I got as far as noticing that it usually begins with a send() or recv() call that never returns, freezing one thread. The other threads keep running for a while longer, but before long they die as well.
Is it 71 minutes, 35 seconds? Because that is almost exactly 0xFFFFFFFF microseconds.
I wonder if there is an internal tick count that overflows at that point, and Digi’s SSL code uses it for a delta but cannot handle the overflow. Sounds like an OS bug to me.
Followup info: I sent in a demo project that illustrates the problem, and today I received a fix.
They say it’s actually a bug in the Treck stack, but they managed to work around it in their SSL code.
The most important aspect: I just tested it and it works, so I assume the new version of libssl.a they mailed me will be made available for download soon.
Either that, or in the worst case they’ll report the bug to Treck and wait for a fix from them.
Yes!!! Definitely try this patch: http://ftp1.digi.com/support/patches/SSLFixes_71.htm
It’s a poor description, but I know there was a wrap around issue with timestamps.
Moreover, that site contains several other 7.1 patches that could be interesting.
Such an internal wrap-around is what I suspected too, especially because I had already added code to monitor stack use and free memory, so I know it isn’t something like a memory leak.
But I hadn’t thought of counting in nanoseconds - thanks for the tip.
OTOH, I just discovered something that partly contradicts it. It’s probably still some counter that wraps around, but by default it would count in units of 10 µsec.
I had almost forgotten that I had changed BSP_TICKS_PER_SECOND to 1000 (one tick per millisecond) for this project. Last night I changed it back to the default of 100 (10 msec ticks) and left it running overnight. Because I had always expected I might want to change it back some day, all my own timing is relative to BSP_TICKS_PER_SECOND, so the functionality remained basically the same.
The problem now occurred between 11h56 and 11h57 uptime. Divide 11h56 by 10, and you get 71 min and 36 seconds.