I have an ATmega328 application CPU connected to an XBee Series 2 Pro running API router firmware (23a7), primarily monitoring and reporting back on a collection of thermometers. Communication is via 9600 bps serial and API-mode point-to-point messages.
Power is via a switching PSU supplying up to 1 A, fed from a trickle-charged (i.e. always full) 12 V SLA battery. Decoupling capacitors follow the XBee manual's recommendations.
This device has been running stably, except that the message it wants to send exceeds the maximum API transmission message size (110 bytes) when all the thermometer details are requested at once.
To cope with that, I changed the application to support sending two smaller messages, rather than one oversized message.
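For illustration, the split can be sketched as below. This is a minimal sketch, not the actual firmware: the fixed-size record layout, the per-message header, and the names `records_in_next_message` and `messages_needed` are all assumptions made for the example. Only the 110-byte limit comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Payload limit as stated in the post. */
#define MAX_PAYLOAD 110u

/* Hypothetical helper: how many whole thermometer records fit in the
 * next message, given a fixed record size and a per-message header. */
size_t records_in_next_message(size_t records_left, size_t record_len,
                               size_t header_len)
{
    /* At least one record must fit, otherwise splitting cannot help. */
    assert(record_len > 0 && header_len + record_len <= MAX_PAYLOAD);

    size_t fit = (MAX_PAYLOAD - header_len) / record_len;
    return records_left < fit ? records_left : fit;
}

/* Hypothetical helper: number of messages needed to report all records
 * without ever exceeding MAX_PAYLOAD or splitting a record in two. */
size_t messages_needed(size_t records, size_t record_len, size_t header_len)
{
    size_t count = 0;
    while (records > 0) {
        records -= records_in_next_message(records, record_len, header_len);
        count++;
    }
    return count;
}
```

With an assumed 12-byte record and 2-byte header, nine records fit per message, so twelve thermometers would need two messages.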
Now, two days in a row after flashing the ATmega with the new two-message firmware, the XBee has gone completely off the air after running for a few hours, requiring a power cycle.
The XBee itself stops routing (causing an unrelated endpoint device to lose connectivity), and the XBee will no longer accept AT commands over the air from another XBee. It appears to have completely hung, at least as far as radio connectivity is concerned.
I do check CTS, but only once before writing out an entire message, not before each byte; obviously that is insufficient to guarantee avoiding XBee-side buffer overflows. So an input buffer bug in the XBee firmware is a possible contender. Some kind of race condition due to multiple transmit requests in quick succession might be a factor too. The XBee may also be routing mesh messages at the same time, though it shouldn’t be doing a lot of that.
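For comparison, checking CTS before every byte would look roughly like this. It is a minimal, hardware-independent sketch: the `xbee_port` struct, `send_with_flow_control`, and the fake port functions are hypothetical names invented for the example, and the two callbacks stand in for whatever pin read and UART write the real firmware uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical abstraction over the serial link to the XBee. */
typedef struct {
    bool (*cts_asserted)(void);    /* true when the XBee can accept data */
    void (*write_byte)(uint8_t b); /* blocking write of one byte */
} xbee_port;

/* Send len bytes, polling CTS before each byte rather than once per
 * message. Gives up if CTS stays de-asserted for max_polls polls on any
 * single byte. Returns the number of bytes actually written. */
size_t send_with_flow_control(const xbee_port *port, const uint8_t *data,
                              size_t len, unsigned long max_polls)
{
    assert(port && port->cts_asserted && port->write_byte);

    size_t sent = 0;
    while (sent < len) {
        unsigned long polls = 0;
        while (!port->cts_asserted()) {
            if (++polls >= max_polls) {
                return sent; /* XBee not ready: let the caller decide */
            }
        }
        port->write_byte(data[sent++]);
    }
    return sent;
}

/* Host-side stand-ins so the logic can be exercised without hardware. */
unsigned fake_polls;
size_t fake_written;
bool fake_cts(void) { return (++fake_polls % 3u) == 0; } /* ready every 3rd poll */
bool stuck_cts(void) { return false; }                   /* never ready */
void fake_write(uint8_t b) { (void)b; fake_written++; }
```

A short return value tells the caller that the XBee stopped accepting data mid-message, which is itself useful for diagnosing whether an overflow precedes the hang.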
I’ve seen other mentions of hangs, such as http://www.digi.com/support/forum/44629/xbee-series-2-module-enter-undefined-state-freeze and http://stackoverflow.com/questions/24434536/digis-xbee-series-2-firmware-freezes-crashes-when-sending-too-much-serial-data but it really isn’t clear if they are the same or different. In particular, it’s an XBee router that’s crashing, so poll times are irrelevant.
But really, this seems like a problem that just shouldn’t happen, particularly if it’s just the input buffer overrunning. Or at a bare minimum, the manual suggests that the XBee has a hardware watchdog that presumably should be rebooting it on catastrophic failures.
Is this a known problem? Does it have a known solution or workaround?