Well I had about 2 pages typed and basically deleted most of it for being too long. So hopefully this is succinct.
Module: XBee3 Global - Low Power variant (GM1) North America
Mode: Micropython REPL
Cell Network: AT&T
I am getting seemingly random ENOTCONN (not connected) errors even though every indication says the module is connected. I poll the Cellular.isconnected() method, it returns True, the code then connects to an MQTT broker and throws the error. All settings in XBee Studio look good: IP address and time given from network, good signal strength, AT+AI = 0 (connected).
I added some sleep times after the cellular network connect and the MQTT broker connect to see if that would help stabilize things. It seems to maybe help, but then again these errors come in batches and only after the modem has been connected for a while (~20 to 30 minutes). If I manually put the modem in Airplane mode via XBee Studio, let it dwell for a few minutes, then turn it back on and run the code, it seems to connect fine. My code fires data about every 10 minutes and after about 30 minutes it seems to barf. I do not disconnect from the cell network in between intervals most of the time* - so the connection can be stale by about 10 minutes tops.
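For reference, the wait-and-retry idea around the broker connect can be sketched roughly like this. It is only a minimal sketch, not the code running on the devices: connect_with_retry, the attempt count and the delay are made-up names and values, and it assumes ENOTCONN reaches the code as an OSError.

```python
import time


def connect_with_retry(connect, attempts=3, delay_s=5, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect  -- zero-argument callable, e.g. lambda: client.connect()
    attempts -- hypothetical retry budget, tune for your network
    delay_s  -- pause between attempts, in seconds
    sleep    -- injectable so the helper can be exercised without waiting
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as exc:  # assumption: ENOTCONN is raised as an OSError
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)  # give the modem/broker time before retrying
    raise last_error
```

Usage would be something like connect_with_retry(lambda: client.connect()), where client is whatever MQTT client object the script already builds. Keep the attempt count low, since reconnecting too often has its own cost on a cellular plan.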
Same code is on our XB3 Global GM2’s. The GM2’s have been running like champs.
Any ideas why the modem is indicating it’s connected but still throws the ENOTCONN error?
*The cell network does not like when you connect more than about 3 or 4 times per hour and will block that ICCID for a period of time. I usually just close all connections (including cell) once or twice per hour under normal conditions. I do this to conserve as much power as possible.
I would suggest submitting a support ticket for this issue by sending an email to email@example.com. I think what you have listed will be good enough if you can provide a copy of the radio's firmware settings as well.
Something very strange happened…it all went away. I suspect the errors were on our service provider's side or in the MQTT client on the cloud service we use. The reason I suspect this is that about 2 days after I posted, our provider logged an event saying there was an issue with their MQTT endpoint servers. The platform we use provides both our SIMs and our cloud services, so I am guessing something weird was going on. I had noticed a lot of delay between when data was sent and when it showed up on their web portal. Much longer delays than I normally expect.
The other weird thing that was occurring was that there was a mismatch between the ack reply messages. As if it was so delayed that the reply was for an MQTT packet sent almost 10 minutes before.
One thing I will say as a positive…despite a lot of errors coming up on the new Global XBee3’s, they continued to operate. On the previous gen XBee3s (Cat-M/NB-IoT) that many errors would have caused something to stick in the uPython code, processor, or modem. I have a lot of error handling in the code along with the watchdog and despite my best efforts some would freeze up randomly. I could not track down the cause. This was problematic for us since our devices sit in locations that are not easily physically accessible and they are scattered all over the US. The new generation of XBee3 Cellular have been under our testing for about a month now and I have not had an instance with one randomly freezing up. So, big thanks to Digi for those improvements.
Thank you for the response and I will update the post if I find any further information regarding the issue.
I think I found the issue. I figured I’d make a post to summarize my findings. I believe it was an issue on my MQTT broker/server side, but it was being caused by the XBee python code. For some period of time, I kept getting ENOTCONN (#7107) errors. These errors occurred during the MQTT.connect() method in the simple.py library provided by Digi. They were quite random and it only affected some devices on some days.
The problem occurs (I believe) due to a mismatch of connection status on the server. If the device has been off for, say, a few hours, the connection state on the server side is “clean”. The device connects to the MQTT server, does its thing sending data from our sensors, then issues the MQTT.disconnect() and goes to sleep. The next cycle I would get the error. After careful inspection, I noticed on the server side that the disconnect timestamp did NOT align with the timestamp at which the XBee code issued the disconnect message. It appeared the server was only showing disconnected after the keepalive timed out. I suspected the disconnect message from the client was not getting through to the server. This caused a mismatch of connection status on the server, and the client (XBee) was essentially being denied when it tried to connect again about 10 minutes later. Our cloud storage provider doesn’t seem to like concurrent connections, and I made sure to use the True flag for clean connection as well. I also double-checked to make sure I was properly disconnecting from the MQTT server instance.
The solution that has been working great was a small modification to the simple.py library file - adding a short wait between the sock.write() and sock.close() calls. To put it in bonehead Nick terms, it was like the socket was being closed before the entire disconnect message was sent thru the modem, so the message was never received by the server. The line below is the short wait I added to the simple.py mqtt library file, in between those two calls.
time.sleep(0.5) # 20240130 - added to allow transmission before the socket is closed.
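For context, here is roughly where that line sits. This is a sketch modeled on the upstream MicroPython umqtt.simple disconnect() (write the 2-byte DISCONNECT packet, then close the socket); I am assuming Digi's simple.py has the same shape, so check your own copy. The class name and the drain_delay_s parameter are made up for the example.

```python
import time


class MQTTDisconnectSketch:
    """Illustrative stand-in for the client class in simple.py (hypothetical name)."""

    def __init__(self, sock, drain_delay_s=0.5):
        self.sock = sock                    # an already-connected socket-like object
        self.drain_delay_s = drain_delay_s  # wait before closing, in seconds

    def disconnect(self):
        # MQTT DISCONNECT: fixed header 0xE0, remaining length 0
        self.sock.write(b"\xe0\x00")
        # Give the modem time to actually transmit the packet
        # before the socket is torn down.
        time.sleep(self.drain_delay_s)
        self.sock.close()
```

The 0.5 s value is just what worked here; a device on a weak link may need longer, and a fixed sleep does not guarantee delivery.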
I think the reason it affected some devices and not others depended on the signal quality of the cellular connection. From what I saw, devices that had great signal quality saw very few issues, if any. The devices with poor connections were having a lot of the above mentioned errors. I’m no expert, but I am wondering if latency decreases with increased signal quality - something like this. The modification seems to be working great and I’ll update if I find something else.
Please feel free to make any suggestions or maybe confirm/disconfirm my line of reasoning and solution.