I’ve got a ConnectPort X4 that I’ve been running Dia on; I’ve got several drivers and presentations that I’ve developed for it. I’ve noticed that after a few hours of uptime, I can no longer telnet to the gateway, nor does the CLI presentation (running on TCP port 4146) accept connections. When I attempt to telnet to either, I get “Connection closed by foreign host.” Actually, on the 2nd attempt to connect to port 23 (the default telnet port), it did allow me in, but when I tried to fire up a Python interpreter, I get “Unable to start interpreter thread”. I’ve also configured the logging module to send UDP packets to my workstation and that has appeared to stop working, as well. The management web interface seems to be working just fine, and the presentation that I wrote to upload data to my server continues to work in the background, but the system has obviously degraded. There’s nothing in the event log page of the CP.
Any thoughts on what might be happening? I’m guessing it’s some kind of resource starvation (I’m still showing < 100% CPU usage and 400KB+ free memory remaining) but there really aren’t any meaningful diagnostics on the device.
Memory is the likely culprit and you will need to seriously control your design. Unfortunately, many things Python enables chew up memory like crazy.
I was part of a team creating a Python app which supported, and was tested long-term running, 50 XBee devices, monitoring 250 data points with hourly logging etc. In an X4 with 16MB of RAM we had to rewrite our Python code several times to make it fit. The X4 (even with 32MB of RAM) is NOT a PC with nearly unlimited virtual resources.
Dia runs garbage collection by default every 6 hours, but if you open/close sockets a lot you MUST manually trigger it sooner. In our project we literally ran garbage collection after every client disconnected as it freed from 50 to 600K of RAM (it depended on what the presentation did). There is information here: http://www.digi.com/wiki/developer/index.php/Python_Garbage_Collection
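A minimal sketch of the "collect after every client disconnect" idea described above. The function name handle_client and the serve callback are made up for illustration and are not part of Dia; the only real call here is the standard library's gc.collect(), which returns the number of unreachable objects it found.

```python
import gc


def handle_client(conn, serve):
    """Serve one client, then close its socket and force a collection.

    conn  -- any object with a close() method (for example a socket)
    serve -- hypothetical callback that does the per-client work
    """
    try:
        serve(conn)
    finally:
        # Close first so the socket object becomes garbage, then collect.
        conn.close()
        found = gc.collect()
    return found
```

Logging the returned count next to the free-memory figure should show whether the forced collection is actually recovering anything in your presentation.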
Bottom line, in our project we had to attempt to keep at least 1MB of RAM free at all times, as various Python functions expect short-term use of several hundred K of RAM.
I’ve been tracking free memory on my CPX4 for the last 10 days or so. I enabled the system stats driver and wrote a Python script to pull the free mem channel every 30 seconds. Attached is the chart, starting late on 7/26 and going through this morning. Yesterday morning there was some sort of network outage with my web host and my Dia presentation wasn’t able to upload the data. That caused about a 400K leak, but there is also a much slower, but very noticeable, leak. I would love to hear thoughts on tracking these down.
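One way to put a number on the slow leak is to fit a straight line through the free-memory samples. This is only a sketch: the function name leak_rate and the (seconds, free_bytes) sample layout are assumptions for illustration, not anything the system stats driver provides.

```python
def leak_rate(samples):
    """Least-squares slope of free memory over time.

    samples -- list of (seconds, free_bytes) pairs, e.g. one per 30 s poll
    Returns bytes per second; a negative value means free memory is shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum([t for t, _ in samples]) / float(n)
    mean_m = sum([m for _, m in samples]) / float(n)
    num = sum([(t - mean_t) * (m - mean_m) for t, m in samples])
    den = sum([(t - mean_t) ** 2 for t, _ in samples])
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den
```

Comparing the slope over a quiet stretch with the slope over a stretch where uploads were failing should help separate the slow leak from the one-off 400K drop.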
Thanks. Not sure this is just a Python issue, however; I’ve had instances where just trying to connect to the gateway via telnet (not the CLI presentation generally running on port 4146) results in a connection refused.
On the WebUI of the X4 under Administration > System Information, keep an eye on the CPU Utilization and the Free Memory available and see whether CPU usage keeps climbing or free memory keeps dropping.
Running out of resources will cause this sort of behavior, and the likely cause is the Python app or the way the X4 is configured. Make sure you’re using the latest CP-X4 EOS firmware as well, in case any bugs were fixed which may be related to the behavior you’re seeing.
While memory usage can be a concern, the specific case described (telnet returning “refused” rather than connecting and then immediately disconnecting) could point to another resource: the number of available sockets. The “display sockets” CLI command reports on that resource.
The firmware itself is limited to something like 128 sockets. You won’t be able to use that many however. Some are going to be used for internal purposes. 64 is a safe bet for you.
If the gateway’s free memory gets below 1 meg, I’ve seen some of the strange behavior you are describing. Generally you don’t want it to go below that.
They are going to release a version of the X4 with 32 meg RAM sometime in the future but I don’t know when.
At least part of my Dia config died a few hours ago; I could still telnet into the CLI presentation, but my home-grown presentation stopped uploading data to my server. I rebooted the gateway and everything’s working again, and there’s more free memory (~700kB), but I’m going to have to dig into what’s using up nearly 5MB of memory.
socket info:
#> display sockets
Socket Status:
Total Sockets : 128
Free Sockets : 90
Used Sockets : 38
IP Network Sockets : 33
Device Sockets : 3
File Sockets : 1
Mesh Sockets : 1
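If you capture that output periodically (how you capture it is up to you; this sketch just assumes you already have the text as a string), a few lines of Python can turn it into numbers you can chart alongside free memory. The function name parse_socket_status is made up for illustration.

```python
def parse_socket_status(text):
    """Turn 'display sockets' output into a dict of name -> count.

    Lines without a colon, or without a plain integer after it
    (such as the 'Socket Status:' heading), are ignored.
    """
    counts = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        value = value.strip()
        if value.isdigit():
            counts[name.strip()] = int(value)
    return counts
```

Watching "Free Sockets" fall over time would confirm whether sockets, rather than memory, are what is running out.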
Thanks, Lynn. I was wondering how I was going to scale this thing up. Combined, my two device drivers generate 16 samples per minute. The samples are collected by a file logger, which is then dumped by my presentation every two minutes and uploaded to my server via HTTP POST. I see a lot of sockets still sitting around after that interaction has completed; do those not get cleaned up until GC runs? Could explain a lot…
Yah, all “trash” sits around until some magic sauce says clean it, as the Python wiki explains. The default Python GC paradigm is driven by the number of objects more than their size. So a few big objects like socket buffers tend NOT to be cleaned up fast enough, yet there is no one-size-fits-all solution.
Each socket structure includes buffer space, so they can be quite large. Saving & reusing sockets should help. Even your HTTP POST “string” will be disposed of by the system and sit waiting for cleanup - so 30 times a few K bytes per hour is being deallocated for that.
In my case the client comes in only a few times per day so we free up the old sockets & force gc.collect() after every access.
I wouldn’t try that every 2 minutes, but some simple experiments (looking at the gc.collect() return value and free system RAM) should give you a better feel for the rate at which you consume memory. A good start might be calling gc.collect() once every 20 to 30 times your POST is triggered.
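A rough sketch of that "every 20 to 30 POSTs" experiment. The class name PeriodicCollector and the default of 25 are illustrative choices, not anything from Dia; gc.collect() is the standard library call and returns the number of unreachable objects found.

```python
import gc


class PeriodicCollector(object):
    """Force a garbage collection once every `every` calls to tick()."""

    def __init__(self, every=25):
        self.every = every
        self.count = 0

    def tick(self):
        """Call once per POST. Returns None on most calls, or the
        gc.collect() result on the call that triggers a collection."""
        self.count += 1
        if self.count < self.every:
            return None
        self.count = 0
        return gc.collect()
```

Call tick() right after each upload completes, and log any non-None result together with the free-memory reading; tune `every` up or down depending on how much each collection recovers.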
Interesting graph, and doing a little digging unearths rumors of a DNS lookup leak (not something in Python), so your iDigi uploads are losing memory. New CP firmware would be needed to fix; not sure the timing of such a fix.
Okay, the fixed firmwares are up (or going up) on the support.digi.com site now. You need:
X2 = 82001596_E2.bin
X2-WiFi = 82001630_E1.bin
X4 = 82001536_E3.bin
X8 = 82001115_F2.bin
They should be dated August-24-2009 (or 25th). If dated March or May 2009, those are older.
The cellular products also have new images, but I won’t list all of those. The memory issue is only seen with code using DNS name lookup, so a cellular customer putting a DNS name in Surelink or other settings will trigger the problem. The iDigi Dia systems tend to see it since most presentations use DNS names to define the server.