Xbee3 Digimesh sleep bug

There is a bug in the Xbee3 Digimesh sleep protocol. Sometimes, after receiving an updated SP parameter in a sync message, or after transitioning from SM=8 to SM=0, the receiver of a non-sleep coordinator (a node that is not the sleep coordinator) will jam, and the node is no longer able to receive anything until the module is reset.

I have an open case regarding this issue with tech support. It has been open now for over four months and I am still trying to get them to acknowledge that this is a legitimate problem and that it needs to be addressed by the ‘engineering’ department.

If you have experienced the same bug, or if you have had similar problems being roadblocked by tech support, please comment.

My network is solar/battery powered and uses the Digimesh sleep protocol to reduce power consumption to an acceptable level. The network may be idle for days, weeks, even months, and for these periods the sleep coordinator broadcasts a transmit request telling the network to set SM=8 (go to sleep). When the sleep coordinator needs the network to be responsive it broadcasts a transmit request telling the network to set SM=0.

All nodes are running micropython and are in API-4. The sleep parameters are SP=420 i.e. sleeps for 4200ms, ST=600, i.e. wakes for 600ms. Apart from CH=0x19 and AP=4 the config is factory standard. I first encountered the bug while running firmware 300B and it persists with 300D.
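The timing arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the helper name `sleep_cycle` is made up for illustration, and it assumes, as the figures in this post imply, that SP counts units of 10 ms and ST counts units of 1 ms.

```python
def sleep_cycle(sp, st):
    """Return (sleep_ms, wake_ms, cycle_ms, awake_fraction) for given SP and ST.

    Assumes SP is in units of 10 ms and ST is in units of 1 ms.
    """
    sleep_ms = sp * 10
    wake_ms = st
    cycle_ms = sleep_ms + wake_ms
    return sleep_ms, wake_ms, cycle_ms, wake_ms / cycle_ms


sleep_ms, wake_ms, cycle_ms, awake = sleep_cycle(420, 600)
print("sleep %d ms, wake %d ms, cycle %d ms, awake %.1f%%"
      % (sleep_ms, wake_ms, cycle_ms, awake * 100))
```

With SP=420 and ST=600 the cycle is 4800 ms, of which the radio is awake for 600 ms, i.e. 12.5% of the time.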

Often, after receiving a ‘wake-up’ broadcast and setting SM=0, a non-sleep coordinator is no longer able to receive anything. Its receiver, in effect, becomes jammed, and the only way to recover it is to reset the module. While the receiver is jammed, its micropython continues to run OK and I can establish serial comms with XCTU OK. The jammed node can still transmit a broadcast successfully, but it cannot transmit an addressed unicast: the attempt returns ‘Transmit failure: [Errno 7107] ENOTCONN’ in API-4, or a delivery status of 0x25 ‘Route not found’ in API-1 or 2.

I tried a different approach where the network remained in SM=8 permanently. When the sleep coordinator wanted the network to sleep it set SP=420, ST=600. And when it needed the network to be responsive it changed to SP=1, ST=4790, i.e. only sleep for 10ms in every 4800ms. The same bug occurred, however: every so often a non-sleep coordinator would jam. It would receive the updated parameters, as evidenced by OS=1, but after that it would continually miss syncs and would not receive. Again, micropython and serial with XCTU still worked.

I have written a simple micropython sketch that will trap the bug, so that ‘engineering’ can isolate a failed node and troubleshoot it. If you have time, it would be great for you to try it out and comment whether nodes fail for you as well.
I have included two main.py files below. One for the sleep coordinator, and the other for non-sleep coordinators. I had 5 non-sleep coordinators in my test network, all within direct range of each other (12-40 meters).
I have run this test a dozen times or more. The most loops completed before a node failed was 850; the fewest was 140.
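For anyone planning a test run, those loop counts translate into rough run times. The sketch below assumes each loop takes at least 51 seconds, which is the sum of the two time.sleep() delays in the sleep coordinator's main loop; the reply round trips add a little more, so these are lower bounds rather than measured durations. The helper name `min_hours` is made up for illustration.

```python
LOOP_SECONDS = 50 + 1  # the two time.sleep() delays per loop in the sleep coordinator's script


def min_hours(loops, loop_seconds=LOOP_SECONDS):
    """Lower bound on run time, in hours, for a given number of loops."""
    return loops * loop_seconds / 3600


for loops in (140, 850):
    print("%d loops: at least %.1f hours" % (loops, min_hours(loops)))
```

So a failure can take anywhere from roughly 2 hours to roughly 12 hours to show up, and a run should be left going for at least that long before concluding that nothing failed.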

main.py for the sleep coordinator:

import time
import xbee

REPLY = bytes([0x06]) # request that receiver reply to sender
WAKEUP = bytes([0x0C]) # broadcast command requesting receivers to set SM = 0
SLEEP = bytes([0x0D]) # broadcast command requesting receivers to set SM = 8

RECEIVE_TIMEOUT = 500 # in millis
sleepMode = 0
loop = 0

broadcast = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF])

node0 = bytes([0x00, 0x13, 0xA2, 0x00, 0x41, 0xB7, 0x3C, 0xE2]) # 0013A20041B73CE2
node1 = bytes([0x00, 0x13, 0xA2, 0x00, 0x41, 0xB7, 0x3C, 0xE8]) # 0013A20041B73CE8
node2 = bytes([0x00, 0x13, 0xA2, 0x00, 0x41, 0xB7, 0x3D, 0x2F]) # 0013A20041B73D2F
node3 = bytes([0x00, 0x13, 0xA2, 0x00, 0x41, 0xB7, 0x3C, 0x21]) # 0013A20041B73C21
node4 = bytes([0x00, 0x13, 0xA2, 0x00, 0x41, 0xB7, 0x3C, 0xF0]) # 0013A20041B73CF0
nodes = [node0, node1, node2, node3, node4]

def setSleepMode(mode):
    global sleepMode

    xbee.atcmd('SM', mode)
    sleepMode = mode
    # print("sleepMode ", sleepMode)

def transmit(address, message):
    # try up to three times before giving up
    for x in range(3):
        try:
            xbee.transmit(address, message)
            return True
        except Exception as e:
            print("Transmit failure: %s" % str(e))
    return False

def request(address, message):
    # send a request, then wait up to RECEIVE_TIMEOUT ms for a matching reply from the same node
    if transmit(address, message):
        sendTime = time.ticks_ms()
        while time.ticks_diff(time.ticks_ms(), sendTime) < RECEIVE_TIMEOUT:
            received_msg = xbee.receive()
            if received_msg:
                sender = received_msg['sender_eui64']
                payload = received_msg['payload']
                if sender == address and payload[0:1] == message[0:1]:
                    print("received reply from %s" % (''.join('{:02x}'.format(x).upper() for x in sender)))
                    return True
    return False

xbee.atcmd("AP", 4)
xbee.atcmd('SP', 420) # sleep for 4200 ms
xbee.atcmd('ST', 600) # wake for 600 ms
xbee.atcmd('SO', 1) # always sleep coordinator

while True:
    setSleepMode(7)
    transmit(broadcast, SLEEP)
    time.sleep(50) # delay 50 seconds, time enough for 10 syncs
    transmit(broadcast, WAKEUP)
    setSleepMode(0)
    time.sleep(1) # time enough for broadcast to propagate
    loop += 1
    print("loop", loop)
    for x in range(len(nodes)):
        if request(nodes[x], REPLY) is True:
            continue
        else:
            # This node failed to reply. Halt here and keep reporting it so the node
            # can be checked to see if it has jammed, i.e. can no longer receive.
            while True:
                time.sleep(1)
                print("node%d failed to reply in loop %d" % (x, loop))

main.py for the non-sleep coordinators:

import time
import xbee

REPLY = bytes([0x06]) # uni-cast command requesting a reply
WAKEUP = bytes([0x0C]) # broadcast command requesting receivers to set SM = 0
SLEEP = bytes([0x0D]) # broadcast command requesting receivers to set SM = 8

sleepMode = 0
sleepCycles = 0
refMillis = 0

def setSleepMode(mode):
    global sleepMode
    global sleepCycles

    xbee.atcmd('SM', mode)
    sleepMode = mode
    sleepCycles = 0
    print("sleepMode ", sleepMode)

def transmit(address, message):
    # try up to three times before giving up
    for x in range(3):
        try:
            xbee.transmit(address, message)
            return True
        except Exception as e:
            print("Transmit failure: %s" % str(e))
    return False

xbee.atcmd("AP", 4)
xbee.atcmd("SO", 6) # 0b110: never act as sleep coordinator, receive API sleep status message
xbee.atcmd('SM', 0)

while True:
    received_msg = xbee.receive()
    if received_msg:
        sender = received_msg['sender_eui64']
        payload = received_msg['payload']
        print("received from %s" % (''.join('{:02x}'.format(x).upper() for x in sender)),
              (' x'.join('{:02x}'.format(x).upper() for x in payload)))

        # as posted: payload[0] is an int, so these two comparisons never match
        # (see the correction later in this thread)
        if payload[0] == WAKEUP:
            setSleepMode(0)

        elif payload[0] == SLEEP:
            setSleepMode(8)

        elif payload[0:1] == REPLY:
            transmit(sender, REPLY)

    if sleepMode == 8:
        status = xbee.modem_status.receive()
        if status == 0x0B:  # we just woke up
            sleepCycles += 1  # gets reset to 0 in setSleepMode()
            MS = xbee.atcmd("MS")
            print("MS", MS)
            if MS > 2 and sleepCycles > 2:  # sometimes MS isn't reset immediately to 0 upon SM = 0, hence sleepCycles.
                setSleepMode(0)  # if we are in SM = 8 and we miss three syncs, then set SM = 0

    else:
        if time.ticks_diff(time.ticks_ms(), refMillis) > 4800:  # 4800ms sleep period
            refMillis = time.ticks_ms()
            MS = xbee.atcmd("MS")
            print("MS", MS)  # print something periodically so we can see that micropython is running

This is not a place to have possible bugs addressed. You should take this up on your case or with your account manager.

The big question is why are you switching between synchronous sleep and always on? If you want to send data, then you should be changing the sleep/wake times instead. Then send the data while the network is awake.

That was my original strategy, changing SP from 420 to 1 when I wanted the network to be awake. But nodes were jamming up and I assumed it was because of the changing sleep parameters. That’s why I tried the other strategy of leaving SP and ST as they were and broadcasting a command to the network to set SM=0 when I wanted the network to wake up. It didn’t help, however; nodes were still jamming up.

I prefer this strategy anyway, because if you change the SP parameter the network still has to go through an entire sleep cycle before the change takes effect. In my case that’s 4.8 seconds before the network becomes fully responsive. If I broadcast a command telling the network to set SM=0, however, it takes effect immediately and the network does not go back to sleep for another cycle.

Interestingly, there is a mistake in the non-coordinator’s code above:

if payload[0] == WAKEUP:
    setSleepMode(0)

elif payload[0] == SLEEP:
    setSleepMode(8)

should be:

if payload[0:1] == WAKEUP:
    setSleepMode(0)

elif payload[0:1] == SLEEP:
    setSleepMode(8)
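The difference between the two comparisons can be shown in plain Python, without any XBee hardware. Indexing a bytes object with a single integer returns an int, whereas a one-byte slice returns bytes, and an int never compares equal to a bytes object. A minimal illustration, using a made-up payload:

```python
SLEEP = bytes([0x0D])           # same constant as in the scripts above
payload = bytes([0x0D, 0x01])   # hypothetical received payload that starts with the SLEEP command

by_index = payload[0] == SLEEP        # int compared with bytes
by_slice = payload[0:1] == SLEEP      # bytes compared with bytes
by_int = payload[0] == SLEEP[0]       # int compared with int

print(type(payload[0]).__name__, type(payload[0:1]).__name__)
print("index:", by_index, "slice:", by_slice, "int:", by_int)
```

Only the slice comparison and the int-to-int comparison match; the indexed comparison against the bytes constant is always False, whatever the payload contains.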

With the mistake, the non-coordinator nodes never actually set SM=8. But they still frequently jam.
I suspected that the nodes were jamming when they exited SM=8, but this is clearly not the case.
I thought then, perhaps the problem is related to the sync messages, so I commented out:

setSleepMode(7)

in the sleep coordinator’s code, so that it never leaves SM=0 either.
I’m testing it like this now, and although the jamming events are much less frequent, they still happen. I have had one jam so far after 1000 loops.

So this problem of jamming receivers may not be directly related to the sleep protocol after all.
It’s causing me a lot of stress, however. I need to get past this jamming issue to get a reliable product to market.

I believe this IS the place to have bugs addressed. As I said, I have had this case open for over four months with tech support, and to date they have provided little to no support on this issue. So I have turned to the community for support. If other people comment that they too have had reliability issues, then tech support might take it seriously.

Thanks for your comments,

cheers.

Let’s start with what you mean by “jamming”.

Next, this is NOT the proper place to get possible bugs addressed unless they are in your MicroPython or external code. The proper place for that is the support case.

What I mean by jamming is that the node in question can no longer receive anything, and it remains this way until a power-on reset or a forced reset, i.e. xbee.atcmd('FR'). Everything else, i.e. micropython and the API, continues to run OK. For instance, when running the supplied code, initially all the nodes respond to the sleep coordinator and it prints:

loop 1
0013A20041B73CE2 06
0013A20041B73CE8 06
0013A20041B73D2F 06
0013A20041B73C21 06
0013A20041B73CF0 06

After a while, say after a few hundred loops, a node will jam and the sleep coordinator will print:

loop 236
0013A20041B73CE2 06
0013A20041B73CE8 06
Transmit failure: [Errno 7107] ENOTCONN
Transmit failure: [Errno 7107] ENOTCONN
Transmit failure: [Errno 7107] ENOTCONN
0013A20041B73C21 06
0013A20041B73CF0 06

node2 has jammed. If I try to send it an addressed unicast, i.e. a transmit request (frame 0x10) in API=1, the transmit status frame returns a delivery status of 0x25 Route not found.
When I retrieve the jammed node from the field and plug it into the REPL, I can see that the micropython code is running normally, as it continually prints out the missed sync parameter. I can also connect to XCTU ‘locally’ (via serial) and see all its parameters. If I try to transmit a broadcast from the jammed node, it works, and all the other nodes receive it. If I try a network discover, no nodes are found. If I try to send an addressed unicast from the jammed node to another node, the transmit status response returns 0x25 Route not found. This all tells me that the jammed node is not able to receive at the MAC level.

Matthew, you need to get everything you can to Digi so they can reproduce the issue.