ConnectPort becoming unresponsive

blalor · July 23, 2009, 10:22am

I’ve got a ConnectPort X4 that I’ve been running Dia on; I’ve got several drivers and presentations that I’ve developed for it. I’ve noticed that after a few hours of uptime, I can no longer telnet to the gateway, nor does the CLI presentation (running on TCP port 4146) accept connections. When I attempt to telnet to either, I get “Connection closed by foreign host.” Actually, on the 2nd attempt to connect to port 23 (the default telnet port), it did allow me in, but when I tried to fire up a Python interpreter, I get “Unable to start interpreter thread”. I’ve also configured the logging module to send UDP packets to my workstation and that has appeared to stop working, as well. The management web interface seems to be working just fine, and the presentation that I wrote to upload data to my server continues to work in the background, but the system has obviously degraded. There’s nothing in the event log page of the CP.

Any thoughts on what might be happening? I’m guessing it’s some kind of resource starvation (I’m still showing < 100% CPU usage and 400KB+ free memory remaining) but there really aren’t any meaningful diagnostics on the device.

Thanks,
Brian

Admin1 · July 23, 2009, 3:01pm

FYI, I moved this to a more appropriate forum to get your question answered.

lynnl · July 24, 2009, 5:11pm

Memory is the likely culprit and you will need to seriously control your design. Unfortunately, many things Python enables chew up memory like crazy.

I was part of a team creating a Python app which supported, and was tested long-term running, 50 XBee devices, monitoring 250 data points with hourly logging etc. In an X4 with 16MB of RAM we had to rewrite our Python code several times to make it fit. The X4 (even with 32MB of RAM) is NOT a PC with nearly unlimited virtual resources.

As for system logging, you can read the memory stats in real-time from within your application & react to changes. Example #3 on this page shows how:
http://www.digi.com/wiki/developer/index.php/Module:rci

Dia runs garbage collection by default every 6 hours, but if you open/close sockets a lot you MUST manually trigger it sooner. In our project we literally ran garbage collection after every client disconnected as it freed from 50 to 600K of RAM (it depended on what the presentation did). There is information here:
http://www.digi.com/wiki/developer/index.php/Python_Garbage_Collection

Bottom line, in our project we had to attempt to keep at least 1MB of RAM free at all times, as various Python functions expect short-term use of several hundred K of RAM.
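A minimal sketch of the "collect after every client disconnects" pattern described above. The helper name close_and_collect is made up for illustration and is not part of Dia; it only assumes the connection object has a close() method and uses the standard gc module.

```python
import gc


def close_and_collect(conn):
    """Close a finished client connection, then force a full collection.

    Hypothetical helper, not a Dia API. Returns the number of
    unreachable objects the collector found on this pass.
    """
    try:
        conn.close()
    finally:
        # Run the collector even if close() raised, so buffers that are
        # only kept alive by reference cycles are released promptly.
        found = gc.collect()
    return found
```

Logging the return value next to the gateway's free-memory figure should show whether the forced collection is actually recovering anything in a given application.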

blalor · August 6, 2009, 10:32am

I’ve been tracking free memory on my CPX4 for the last 10 days or so. I enabled the system stats driver and wrote a Python script to pull the free mem channel every 30 seconds. Attached is the chart, starting late on 7/26 and going through this morning. Yesterday morning there was some sort of network outage with my web host and my Dia presentation wasn’t able to upload the data. That caused about a 400K leak, but there is also a much slower, but very noticeable, leak. I would love to hear thoughts on tracking these down.
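One way to put a number on the slow leak is to fit a straight line through the logged samples. This is only a sketch: it assumes the samples are available as (seconds, free_bytes) pairs, and leak_rate is a made-up name, not something from Dia or the stats driver.

```python
def leak_rate(samples):
    """Least-squares slope of free memory over time.

    samples: sequence of (seconds, free_bytes) pairs (assumed format).
    Returns bytes per second; a negative value means free memory is
    shrinking on average.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / float(n)
    mean_m = sum(m for _, m in samples) / float(n)
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share one timestamp")
    return num / den
```

Fitting the period before the outage separately from the period after it would keep the one-off 400K step from distorting the slope of the slow leak.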

blalor · July 23, 2009, 3:08pm

Thanks. Not sure this is just a Python issue, however; I’ve had instances where just trying to connect to the gateway via telnet (not the CLI presentation generally running on port 4146) results in a connection refused.

Admin1 · July 23, 2009, 3:58pm

On the WebUI of the X4 under Administration > System Information, keep an eye on the CPU Utilization and the Free Memory available and see if either of these are getting continually higher.

Running out of resources will cause this sort of behavior, and that’s likely the python app or the way the X4 is configured. Make sure you’re using the latest CP-X4 EOS firmware as well, in case any bugs were fixed which may be related to the behavior you’re seeing.

52637p · July 23, 2009, 4:37pm

While memory usage can be a concern, the specific case described (where telnet returned a “refused” rather than connecting and then immediately disconnecting) could be affected by another data point: the number of available sockets. The “display sockets” CLI command can provide feedback related to that resource.

blalor · July 23, 2009, 5:08pm

That’s interesting. What’s the limit on the number of sockets?

kjensen8 · July 23, 2009, 5:51pm

The firmware itself is limited to something like 128 sockets. You won’t be able to use that many however. Some are going to be used for internal purposes. 64 is a safe bet for you.

If the gateway free memory gets below 1 meg I’ve seen some of the strange behavior you are describing. Generally you don’t want it to go below that.

They are going to release a version of the X4 with 32 meg RAM sometime in the future but I don’t know when.

52637p · July 23, 2009, 6:16pm

The number is not smaller than 128, in my experience. The “display sockets” command provides information for the running product.

blalor · July 23, 2009, 5:59pm

Below 1 meg? Gak. I’m running at around 400k free.

clohfink · July 23, 2009, 7:00pm

With 400 KB free, an X4 would very likely have problems similar to this. Under 1 MB tends to cause issues.

blalor · July 24, 2009, 3:50am

At least part of my Dia config died a few hours ago; I could still telnet into the CLI presentation, but my home-grown presentation stopped uploading data to my server. I rebooted the gateway and everything’s working again, and there’s more free memory (~700kB), but I’m going to have to dig into what’s using up nearly 5MB of memory.

socket info:

#> display sockets

Socket Status:

Total Sockets : 128
Free Sockets : 90
Used Sockets : 38
IP Network Sockets : 33
Device Sockets : 3
File Sockets : 1
Mesh Sockets : 1

Maximum TCP Sockets : 128
Used TCP Sockets : 23

There really needs to be better logging.
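In the absence of better logging, the listing above could at least be captured and turned into numbers for a home-grown log. A small sketch, assuming the text of "display sockets" has already been captured as a string by some means (how to capture it is not shown here); parse_socket_status is a hypothetical name.

```python
def parse_socket_status(text):
    """Turn 'Name : number' lines into a dict of ints.

    Lines without a colon, or without a purely numeric value after it
    (such as the 'Socket Status:' heading), are ignored.
    """
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        value = value.strip()
        if value.isdigit():
            stats[name.strip()] = int(value)
    return stats
```

For the listing above, stats["Free Sockets"] would be 90 and stats["Used TCP Sockets"] would be 23, which is enough to log the values over time and see whether sockets pile up after each upload.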

blalor · July 24, 2009, 6:31pm

Thanks, Lynn. I was wondering how I was going to scale this thing up. Combined, my two device drivers generate 16 samples per minute. The samples are collected by a file logger, which is then dumped by my presentation every two minutes and uploaded to my server via HTTP POST. I see a lot of sockets still sitting around after that interaction has completed; do those not get cleaned up until GC runs? Could explain a lot…

lynnl · July 27, 2009, 1:39pm

Yah, all “trash” sits around until some magic sauce says to clean it up, as the Python wiki explains. The default Python GC paradigm is driven by the number of objects more than their size. So a few big objects like socket buffers tend NOT to be cleaned up fast enough, yet there is no one-size-fits-all solution.

Each socket structure includes buffer space, so they can be quite large. Saving & reusing sockets should help. Even your HTTP POST “string” will be disposed of by the system and sit waiting for cleanup - so 30 times a few K bytes per hour is being deallocated for that.

In my case the client comes in only a few times per day so we free up the old sockets & force gc.collect() after every access.

I wouldn’t try that every 2 minutes, but some simple experiments (looking at the gc.collect() return value and the system’s free RAM) should give you a better feel for the rate at which you consume memory. A good start might be calling gc.collect() once every 20 to 30 times your POST is triggered.
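The "every 20 to 30 POSTs" idea could be sketched roughly like this. PeriodicCollector is a made-up helper, not part of Dia, and the default of 20 is just the low end of the range suggested above; tune it from your own measurements.

```python
import gc


class PeriodicCollector(object):
    """Force a garbage collection once every `every` calls to tick()."""

    def __init__(self, every=20):
        if every < 1:
            raise ValueError("every must be at least 1")
        self.every = every
        self.count = 0

    def tick(self):
        """Call once after each POST.

        Returns the gc.collect() result on the calls where a collection
        ran, and None on all other calls.
        """
        self.count += 1
        if self.count < self.every:
            return None
        self.count = 0
        return gc.collect()
```

Calling tick() at the end of the upload routine and logging any non-None result alongside free RAM gives the kind of simple experiment described above.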

lynnl · August 8, 2009, 3:21pm

Interesting graph, and doing a little digging unearths rumors of a DNS lookup leak (not something in Python), so your iDigi uploads are losing memory. New CP firmware would be needed to fix; not sure the timing of such a fix.

lynnl · August 28, 2009, 8:45pm

Okay, the fixed firmwares are up (or going up) on the support.digi.com site now. You need:
X2 = 82001596_E2.bin
X2-WiFi = 82001630_E1.bin
X4 = 82001536_E3.bin
X8 = 82001115_F2.bin

They should be dated August-24-2009 (or 25th). If dated March or May 2009, those are older.

The cellular products also have new images, but I won’t list all of those. The memory issue is only seen with code using DNS name lookup, so a cellular customer putting a DNS name in Surelink or other settings will trigger the problem. The iDigi Dia systems tend to see it since most presentations use DNS names to define the server.

blalor · August 8, 2009, 4:02pm

Interesting. Well, I happen to have a static IP for my server, so I could perhaps work around that issue. Or at least see if there’s a connection.
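If hard-coding the address is not an option, a middle ground might be to resolve the name once and reuse the result. This is only a sketch: it assumes the leak is tied to each lookup (not confirmed in this thread), resolve_once is a made-up name, and it never refreshes the cached address.

```python
import socket

_resolved = {}  # hostname -> IP string, kept until the process restarts


def resolve_once(hostname, resolver=socket.gethostbyname):
    """Look up hostname on first use only and return the cached IP after.

    The resolver argument exists so the lookup can be swapped out,
    for example in tests; by default it is socket.gethostbyname.
    """
    if hostname not in _resolved:
        _resolved[hostname] = resolver(hostname)
    return _resolved[hostname]
```

The trade-off is that a server whose address changes would not be picked up until the gateway restarts, which is fine for a static IP but not in general.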

lynnl · August 28, 2009, 8:53pm

Actually, it should be listed as version (aka: build) 2.8.4.16, with the last value 16 (not 14 or 15)
