The 7107 ENOTCONN occurs between 1 minute and 1 day after powering on and the only way to recover is by power cycling the module, which is not a solution for my needs.
Odd facts
Upon starting there is a loop checking if xbee.atcmd('AI') == 0 before trying to send any message.
Immediately after confirming it is connected a first message is sent. This happens a few milliseconds after the check, and in many cases the first message already fails with ENOTCONN.
When it does get past the first message, the code runs in a loop that catches any exception (BaseException) and sends its repr(exception) string via Zigbee to the coordinator.
Although rare, sometimes OSError: [Errno 7107] ENOTCONN is caught and the message carrying the error is still sent successfully over the network, just a few milliseconds after the error was raised.
I have a watchdog set for 30 seconds, so when the problem happens the XBee is soft rebooted 30 s later. After the reboot the first message fails with ENOTCONN again, as described above, except that it now fails always rather than “in many cases”.
In this state it stays in a loop of being rebooted by the WDT and is never able to send a message. It is irrecoverable unless power cycled.
If I connect another XBee to XCTU and scan the network, the problematic XBee shows up there, even when it is in the irrecoverable loop described above.
Power cycling always solves the problem, at least until it fails again
This is an XB3-24 in non-sleeping (Router) mode
I can put it close to the coordinator or any other stable node and the problem still happens
My mesh is moderate in size, 65 nodes, 61 being routers
My mesh is stable, I don’t have problems with disconnecting devices (except the XBee)
The power source to the XBee is stable
RSSI doesn’t seem to be related. I report it every 5 minutes: sometimes it is close to -90 dBm and there is no problem, and other times the last reported RSSI is close to -70 dBm and the problem happens
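For readers trying to reproduce this, the startup check and send loop described in the list above can be sketched roughly as follows. This is not the original poster's code: the helper names wait_until_joined and send_with_retry, the timeouts, and the retry counts are made up for illustration. The functions take the AT-command and transmit callables as arguments so the logic can also be exercised off-device.

```python
import time

ENOTCONN = 7107  # errno value taken from the error message quoted above


def wait_until_joined(atcmd, poll_s=0.5, timeout_s=60, sleep=time.sleep):
    """Poll AI until it reads 0 (associated). Return False on timeout."""
    waited = 0
    while atcmd('AI') != 0:
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
    return True


def send_with_retry(transmit, dest, payload, retries=3, backoff_s=1,
                    sleep=time.sleep):
    """Call transmit(dest, payload), retrying only on ENOTCONN.

    Any other exception is re-raised unchanged. Returns True once a
    send succeeds, False if every attempt failed with ENOTCONN.
    """
    for _ in range(retries):
        try:
            transmit(dest, payload)
            return True
        except OSError as exc:
            code = exc.args[0] if exc.args else None
            if code != ENOTCONN:
                raise
            sleep(backoff_s)
    return False
```

On the module this would be called with the real functions, for example wait_until_joined(xbee.atcmd) followed by send_with_retry(xbee.transmit, xbee.ADDR_COORDINATOR, 'hello'). A retry alone does not cure the stuck state described above; it only distinguishes the rare transient failure from the persistent one.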
What are the options that I have to deal with this problem?
What is the proper way of doing a Network Reset?
I’ve seen a command for that, but it says it resets all network parameters, so not what I need.
I’ve also read about the Network Watchdog, but I’m not sure if it would work together with the Watchdog Timer.
“Resets network layer parameters on one or more modules within a PAN. Responds immediately with an OK then causes a network restart. The device loses all network configuration and routing information.”
If the device loses all network configuration, it seems to me that it wouldn’t be able to rejoin the network without adjusting the parameters again and re-pairing it.
That would depend on what functions you are using. If your Joining is enabled and open, then I would not expect it to be an issue.
If you prefer, you can issue an ATFR command. This command triggers the module to reset. It is the same as triggering the reset line, but from the AT command interface instead.
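From MicroPython, one way to use this is to count consecutive ENOTCONN failures and issue FR once a threshold is reached. The sketch below is only an illustration: the class name ResetOnDisconnect and the threshold are hypothetical, and it assumes that xbee.atcmd('FR') is accepted from MicroPython on your firmware.

```python
ENOTCONN = 7107  # errno value taken from the error reported in this thread


class ResetOnDisconnect:
    """Request a module reset (AT command FR) after repeated ENOTCONN."""

    def __init__(self, atcmd, max_failures=3):
        self.atcmd = atcmd              # e.g. xbee.atcmd on the device
        self.max_failures = max_failures
        self.failures = 0

    def record_success(self):
        # A successful send clears the failure streak.
        self.failures = 0

    def record_error(self, exc):
        """Count ENOTCONN errors; return True if FR was issued."""
        code = exc.args[0] if exc.args else None
        if code != ENOTCONN:
            return False                # unrelated errors are ignored here
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0
            self.atcmd('FR')            # assumption: FR is allowed via atcmd
            return True
        return False
```

In the main loop you would call record_success() after each successful transmit and record_error(exc) inside the except block. Whether to keep the 30 second watchdog alongside this is a separate choice; the counter only decides when to ask for the reset.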
The ATFR solved the problem of soft rebooting and immediately getting the same ENOTCONN error.
But I was still not happy. These devices should not have such problems, and a reboot at least once a day is not a solution for my needs; I need them to be stable.
Then I decided to downgrade the firmware from 1012 to 100D, and voilà! It has been running for more than 24 hours without errors!
The firmware version 1012 seems to have introduced a bug.
I guess I should open a support case, right @mvut?
Edit: 100D is not the previous version; testing now with 1010. But it does work properly with 100D and not with 1012.
Thank you for reporting this bug. At this time, it has been reported by others as well and the only current option for resolution is to downgrade from 1012 to 100D or 1010.
Our firmware engineering group is working on a fix that should be included in the next firmware release. We do not have an estimate at this time for when that release will happen.
For anyone wondering, this problem has not been fixed on firmware version 1013. Still waiting for a solution that doesn’t involve using a very old firmware.
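If you manage many nodes, a quick check of which ones run an affected firmware may help. This is a minimal sketch that assumes xbee.atcmd('VR') returns the firmware version as an integer (so 1012 reads as 0x1012). The function name and the AFFECTED tuple are illustrative and list only the versions reported as failing in this thread.

```python
# Versions reported in this thread as showing the ENOTCONN problem.
AFFECTED = (0x1012, 0x1013)


def firmware_is_affected(atcmd):
    """Return True if the VR value matches a version reported as failing."""
    return atcmd('VR') in AFFECTED
```

On the module this would be called as firmware_is_affected(xbee.atcmd); the result could be included in a status message to the coordinator.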