Series 2 XBee (API mode, router, latest firmware) becomes uncontactable OTA within a few hours, following a minor usage change

I have an Atmega328 application CPU connected to an XBee series 2 Pro API router (firmware 23a7), primarily monitoring and reporting back on a collection of thermometers. Communication is via 9600bps serial and API mode point-to-point messages.

Power is via a switching PSU supplying up to 1A, from a trickle-charged (i.e. always full) 12V SLA. Decoupling capacitors follow XBee manual recommendations.

This application/device has been running fine and stably, with the exception that the message size it wants to send has exceeded the maximum API transmission message size (110 bytes) when asking for all the thermometer details at once.

To cope with that, I changed the application to support sending two smaller messages, rather than one oversized message.
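For illustration only, here is a minimal sketch of one way such a split could be computed. It is not the code actually running on the device: `MAX_PAYLOAD`, `chunk_count` and `chunk_len` are hypothetical names, and the 110-byte figure is simply the limit quoted above, taken as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-message limit, taken from the figure quoted in the post. */
#define MAX_PAYLOAD 110u

/* Number of messages needed to carry `total` bytes (0 for an empty report). */
static size_t chunk_count(size_t total)
{
    return (total + MAX_PAYLOAD - 1) / MAX_PAYLOAD;
}

/* Length of message number `index` (0-based) when `total` bytes are split. */
static size_t chunk_len(size_t total, size_t index)
{
    size_t offset = index * MAX_PAYLOAD;
    size_t left;

    assert(offset < total); /* index must refer to an existing chunk */
    left = total - offset;
    return left < MAX_PAYLOAD ? left : MAX_PAYLOAD;
}
```

With these helpers a 150-byte report would go out as one 110-byte message followed by one 40-byte message.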

Now, two days in a row after updating the application firmware to the new two-message version, the XBee device has gone completely off the air after running for a few hours, requiring a power cycle.

The XBee itself stops routing (causing an unrelated endpoint device to lose connectivity), and the XBee will no longer accept AT commands over the air from another XBee. It appears to have completely hung, at least as far as radio connectivity is concerned.

I do check CTS, but only before buffering an entire message - obviously that is insufficient to guarantee avoiding XBee-side buffer overflows. So an input buffer bug in the XBee firmware is a possible contender. Some kind of race condition due to multiple transmit requests in quick succession might be a factor too. It’s possible that it might be dealing with mesh message routing at the same time too - though it shouldn’t be doing a lot.

I’ve seen other mentions of hangs, such as http://www.digi.com/support/forum/44629/xbee-series-2-module-enter-undefined-state-freeze and http://stackoverflow.com/questions/24434536/digis-xbee-series-2-firmware-freezes-crashes-when-sending-too-much-serial-data but it really isn’t clear if they are the same or different. In particular, it’s an XBee router that’s crashing, so poll times are irrelevant.

But really, this seems like a problem that just shouldn’t happen, particularly if it’s just the input buffer overrunning. Or at a bare minimum, the manual suggests that the XBee has a hardware watchdog that presumably should be rebooting it on catastrophic failures.

Is this a known problem? Does it have a known solution or workaround?

How are you determining that the module has stopped responding? Are you trying to query a setting from it or just not receiving data?

If it is a buffer overrun and you are losing part of the API frame, the module will be expecting the rest of the API frame along with a valid checksum. Instead, it gets a NEW API frame, which keeps you in the same bad cycle of faulty frames until you power cycle the system.

To get around this, hardware flow control must be used. That is to say, the buffer on your processor must be big enough to hold more than one full packet, and it must only pass data out to the XBee while CTS is low. It must also stop sending data as soon as CTS changes state.
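As a rough sketch of what checking CTS before every byte could look like (rather than once per message), the following runs on a PC against a simulated port. Everything here is hypothetical: `sim_port`, `tx_queue` and `tx_pump` are made-up names, and on the Atmega328 the two `port_*` helpers would instead read the CTS pin and write the UART data register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated XBee serial port (stand-in for real pin and UART access). */
typedef struct {
    size_t budget;     /* bytes the simulated XBee accepts before raising CTS */
    uint8_t wire[64];  /* bytes written so far */
    size_t wire_len;
} sim_port;

/* True while the simulated CTS line is low, i.e. the XBee can take data. */
static bool port_cts_ready(const sim_port *p)
{
    return p->budget > 0;
}

/* Write one byte to the simulated UART. */
static void port_write(sim_port *p, uint8_t b)
{
    assert(p->wire_len < sizeof p->wire);
    p->wire[p->wire_len++] = b;
    p->budget--;
}

/* Host-side queue of bytes still waiting to be sent. */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
} tx_queue;

/* Send queued bytes, re-checking CTS before each one.
 * Stops as soon as CTS goes high; returns the number of bytes sent. */
static size_t tx_pump(tx_queue *q, sim_port *port)
{
    size_t sent = 0;

    while (q->pos < q->len && port_cts_ready(port)) {
        port_write(port, q->data[q->pos++]);
        sent++;
    }
    return sent;
}
```

Calling `tx_pump` again later (from the main loop or a timer) resumes where it stopped, so a frame that is interrupted by CTS is completed rather than abandoned.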

I’m determining it’s not responding by attempting a remote AT command from other XBee devices, and also because the hung XBee stops working as a router for other devices, i.e. the hung XBee has dropped off the network.

There is no other reason it should drop off the network; it has a healthy link strength. From the coordinator: receiving at -63dBm, transmissions received at -64dBm, making it the second strongest signal in the mesh. Until my slight change in the application software talking to it, the XBee had been rock solid.

(However, this particular XBee lives in the roof space, making it the least accessible XBee in the mesh too, so power cycling it isn’t fun.)

I’m aware API message loss is a predictable consequence of overrunning the buffer, but my application can cope with dropped messages. And escaped API mode allows the XBee to recover API framing immediately. So that’s expected to be a minor inconvenience if it happens, but shouldn’t be fatal.
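For anyone unfamiliar with why escaped mode resynchronises so quickly: the start delimiter 0x7E never appears unescaped inside a frame, so the receiver can restart on the next 0x7E it sees. A minimal sketch of the framing, as I understand it from the Digi documentation, follows. The function names are hypothetical, and this is not my device's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append one byte, escaping the reserved values 0x7E, 0x7D, 0x11 and 0x13
 * as 0x7D followed by the byte XOR 0x20. Returns the new output length. */
static size_t put_escaped(uint8_t *out, size_t n, uint8_t b)
{
    if (b == 0x7E || b == 0x7D || b == 0x11 || b == 0x13) {
        out[n++] = 0x7D;
        out[n++] = (uint8_t)(b ^ 0x20);
    } else {
        out[n++] = b;
    }
    return n;
}

/* Wrap `len` bytes of frame data in an escaped API frame.
 * `out` must have room for 2 * len + 7 bytes (worst case, all escaped).
 * Returns the number of bytes written. */
static size_t build_escaped_frame(const uint8_t *data, uint16_t len,
                                  uint8_t *out)
{
    size_t n = 0;
    uint8_t sum = 0;
    uint16_t i;

    assert(data != NULL && out != NULL);

    out[n++] = 0x7E; /* start delimiter, the only byte never escaped */
    n = put_escaped(out, n, (uint8_t)(len >> 8));
    n = put_escaped(out, n, (uint8_t)(len & 0xFF));
    for (i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]); /* checksum uses unescaped bytes */
        n = put_escaped(out, n, data[i]);
    }
    n = put_escaped(out, n, (uint8_t)(0xFF - sum));
    return n;
}
```

For frame data `23 11` this produces `7E 00 02 23 7D 31 CB`: the length and checksum are computed on the unescaped data, and only the 0x11 byte is escaped.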

This sounds like something you really need to report to Digi’s technical support.