I am running into a recurring problem with deployed CP X2s crashing. This only happens with X2s that are on a “noisy” Ethernet network. By crash I mean the unit just hangs without throwing any exceptions, and the only recovery is to power cycle it. It occurs sporadically at the point where my Python process is sending data files to my server using httplib (HTTPS POST), specifically in conn.request() and conn.getresponse(). I have exception handling around everything, and it has caught and dealt with many an error. This problem, however, freezes the whole unit, although sometimes just the Python process dies and I can recover without a power cycle. I chased it down into the calls to the socket module as far as I could. Specifically, it seems to hang at the call to self.sock.sendall(str) in HTTPConnection.send().
I was thinking it may have to do with blocking mode of the sockets.
I have tried adding sock.settimeout(3.0) in the HTTPSConnection connect() method, but that does not seem to make a difference. (Setting a timeout may only be relevant on systems that support non-blocking sockets, and the Digi CP X2 may not support them. Does anybody know?)
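One way to check whether a timeout is honored at all is to test it on a bare socket, outside httplib. The following is a minimal sketch, assuming standard CPython socket behavior; send_with_timeout is a hypothetical helper name, and Digi's customized socket module may not behave the same way.

```python
import socket


def send_with_timeout(sock, data, timeout_s):
    """Try to send all of data; return True on success, False on timeout.

    Hypothetical helper for testing. On standard CPython a socket with a
    timeout raises socket.timeout instead of blocking forever in sendall().
    """
    sock.settimeout(timeout_s)
    try:
        sock.sendall(data)
        return True
    except socket.timeout:
        # The peer stopped draining data, so the send buffer stayed full.
        return False
```

If a call like send_with_timeout(sock, payload, 3.0) still hangs on the gateway when the far end stops reading, that would suggest the timeout is not being applied by the platform rather than by httplib.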
Have others had to deal with their gateway freezing up when sending data over sockets? If so, how have you worked with or around it?
I have the same problem on a CPX4, but I’m using urllib (which itself uses httplib). It’s strange: sometimes my script runs for a week, and another time it blocks after an hour. As you mentioned, no exception is thrown, and when I run the script on my computer everything works fine, so I can’t find where the problem is.
I did some more research into this. It does seem to be the sock.sendall(str) call down in HTTPConnection.send() that hangs. Sometimes it hangs sending the header, sometimes sending the body.
This was found to be an issue with older Python versions (before 2.6) in general. The timeout call I added was an attempt to jury-rig a fix for it, but I still see the same problem.
Note: newer Python versions (2.6 and later) have connect() call a new function, socket.create_connection(), which supposedly corrects this same issue.
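For reference, here is roughly what that looks like from the caller's side. This is a sketch only: open_with_timeout is a hypothetical helper name, and the fallback branch assumes an older interpreter that lacks socket.create_connection(). Whether the timeout is actually enforced on the Digi build is untested here.

```python
import socket


def open_with_timeout(host, port, timeout_s=3.0):
    """Open a TCP connection with a timeout set before connecting."""
    if hasattr(socket, "create_connection"):
        # Standard library helper: the timeout covers the connect and
        # stays set on the returned socket for later send/recv calls.
        return socket.create_connection((host, port), timeout_s)
    # Fallback for interpreters without create_connection().
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout_s)
        sock.connect((host, port))
    except socket.error:
        sock.close()
        raise
    return sock
```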
Digi made their own customizations to the socket module. It does not look like they made any changes to alleviate this problem, though.
It also helps to enable TCP keepalives with a short interval, something like 4 minutes. The standard Python code does NOT use keepalives, so you need to edit (hack) the module yourself.
What can happen is this: I assume your X2/X4 is talking through various NAT devices/routers (your corporate or home router). These create temporary firewall rules which live for only 5 minutes. So if during the upload your remote HTTP server pauses to PROCESS the data, by the time it tries to send the HTTP Success, the TCP socket may be dead. I don’t mean closed; it is gone, dead. The host gets NO reset, and its TCP packets simply all time out.
Meanwhile, your standard httplib client is patiently waiting until the end of eternity for an HTTP SUCCESS/FAILURE that will never arrive on a chopped-off TCP socket. The NAT/router is silently tossing all of the incoming packets.
Anyway, TCP Keepalive will eventually cause your client to poke the dead socket and get no response, understanding eventually (after dozens of minutes) that the connection is gone.
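Turning keepalives on for a socket might look like the sketch below. It assumes standard CPython; enable_keepalive is a hypothetical helper name. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT constants are Linux-style names that may not exist on the Digi platform, so they are applied only when present.

```python
import socket


def enable_keepalive(sock, idle_s=240, interval_s=30, probes=4):
    """Turn on TCP keepalive; return the names of the options applied."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    # Optional tuning: idle time before the first probe (240 s = 4 min),
    # gap between probes, and number of unanswered probes tolerated.
    tuning = (("TCP_KEEPIDLE", idle_s),
              ("TCP_KEEPINTVL", interval_s),
              ("TCP_KEEPCNT", probes))
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is None:
            continue  # constant not provided on this platform
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied.append(name)
        except socket.error:
            pass  # platform refused the option; keep the defaults
    return applied
```

With stock httplib, one possible way to avoid editing the module is to call conn.connect() explicitly and then enable_keepalive(conn.sock) before conn.request(). Whether Digi's build exposes conn.sock in the same way is an assumption.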
Are you using multiple threads in this script (for socket communication) and blocking events (locks)? You could have a thread that is blocking on an event that hasn’t happened yet, if your code crashed somewhere else before the lock was released. You can always change socket options to help, but often those will bring up other problems. Also, if you’re debugging with try/except statements, be sure that you’re printing out the exception description. You can also consider adding print statements for the time being in various areas, so that your script leaves a trail as it executes. If you still have access to the CLI and the device isn’t frozen, consider using various CLI commands to help debug. I use “threads ?” quite a bit when developing on a Transport model, and noticed I had some threads waiting and some blocking, which was causing hangups.
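A trail like that could be sketched as follows. The helper names log and run_step are hypothetical, and the stream defaults to stderr. Each line is flushed immediately, so the last message before a hang is still visible.

```python
import sys
import time
import traceback


def log(msg, stream=None):
    """Write one timestamped line and flush it right away."""
    stream = stream or sys.stderr
    stream.write("%s %s\n" % (time.strftime("%H:%M:%S"), msg))
    stream.flush()


def run_step(name, func, stream=None):
    """Run func(), logging start/end and the full exception description."""
    log("start " + name, stream)
    try:
        result = func()
    except Exception:
        # format_exc() gives the exception type, message and traceback.
        log("FAILED %s\n%s" % (name, traceback.format_exc()), stream)
        return None
    log("done " + name, stream)
    return result
```

For example, wrapping the upload as run_step("post", lambda: conn.request("POST", path, body)) would show “start post” with no matching “done post” if the call never returns. Here conn, path and body stand in for whatever the script already uses.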