RealPort Linux driver v1.9-34 not properly recovering from network loss

My company is working on a project that involves a Linux server collecting serial data through a couple of Digi One SP devices. We are experiencing unexpected behavior from the RealPort Linux driver v1.9-34 when the network connection between a RealPort device and the Linux server is interrupted.

It is my understanding that the RealPort driver should automatically recover from the network interruption and that the affected tty devices should be restored to working order fairly quickly once network communication is restored. However, this is not happening for us.

What we are seeing is that after an interruption in network communication the RealPort tty device becomes unresponsive and cannot be read from or written to. Applications that attempt to do so will block indefinitely. This applies to both interactive applications like cu and programmatic interaction through the Python PySerial module. To restore the tty to working order it has to be re-initialized by running dgrp_cfg_node.
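
As a stopgap on the application side, a read with a deadline would at least keep the caller from blocking forever. Below is a minimal sketch in plain Python (standard library only, not PySerial). The names `read_with_deadline` and `StallWatchdog` are made up for illustration, and the sketch assumes the tty is opened non-blocking and that a wedged RealPort tty still honors `select()`, which has not been verified here.

```python
import os
import select
import time


def read_with_deadline(fd, max_bytes=1024, timeout=5.0):
    """Return bytes read from fd, or None if nothing is readable within timeout seconds."""
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return None  # nothing arrived before the deadline
    return os.read(fd, max_bytes)


class StallWatchdog:
    """Tracks the time since data last arrived (hypothetical helper)."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock
        self.last_data = clock()

    def feed(self):
        # Call whenever data is received.
        self.last_data = self.clock()

    def stalled(self):
        # True once no data has been seen for longer than the limit.
        return self.clock() - self.last_data > self.limit
```

The descriptor would come from something like `os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)`, with `path` being the RealPort tty. If the `open()` call itself hangs on a wedged device, this approach does not help and the read would have to be moved into a separate process that can be killed.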

We have reproduced this behavior on both Ubuntu 12.04.05 and CentOS 6.6. The Windows driver v4.5.373.0 running on Windows 8.1 x64 does not have this problem. COM port devices behave as expected and automatically recover after a loss of network communication.

When testing using dinc, I cannot replicate this:

dinc 9600 /dev/ttyxyz

-Reboot the Digi device
-Communication restarts within dinc once the driver reconnects with the Digi device.

I suspect a watchdog timer is needed; the following article explains this:

Thank you for replying, it is much appreciated. I should have indicated in my original post that the unexpected behavior can NOT be reproduced by rebooting the Digi device from the web interface. While I am not 100 % certain about the underlying mechanism here, it seems that when you reboot via the web interface or CLI the Digi device signals the driver that it is going down. If I am monitoring serial output through cu when the Digi device reboots I get a message saying “Got hangup signal. Disconnected.”. Communication between the driver and Digi device is then properly restored once the Digi device comes back up.

To reproduce the unexpected behavior you need to interrupt network connectivity between the Digi device and the server without rebooting any of the involved devices and without shutting down/disabling any network interfaces directly on either device.

For instance, if I am running the driver on a CentOS VM under VMware ESXi and interrupt network connectivity by disconnecting the VM’s virtual network card (pretty much equal to pulling out the network cable from a real NIC), the driver behaves as expected and reconnects once the NIC is reconnected.

However, if I instead interrupt network connectivity through firewall rules or by simply shutting down the WAN interface on the router that the server uses to get to the internet (and thus to the Digi device, which is on a mobile broadband connection), I get the unexpected behavior where the tty device becomes unusable until re-initialized. If I’m watching serial output through cu when this happens I do NOT get the message saying “Got hangup signal. Disconnected.”; cu just hangs until I kill it. Restarting cu and attempting to read from the tty device again causes it to immediately hang again.

The tty device also breaks if nothing is actively reading from or writing to it when the network connection drops. For instance, I can open up cu, see serial data come in, shut down cu, interrupt network connectivity between the server and Digi device, restore it, and cu will then block forever if I attempt to read from the affected tty device. The same goes for the application we’re using to gather serial data. It does not matter whether it was running at the time the network connection dropped; once it attempts to read from a tty device that has had a network drop, it simply hangs.

Since all our Digi devices are on mobile broadband connections, and most are in areas with mediocre cellular coverage, they do lose internet connectivity every now and then. Every time this happens we have to go in, shut down the application that is gathering serial data (the tty device cannot be re-initialized while that application is blocking), re-initialize the device with dgrp_cfg_node, and restart the monitoring application. Automating the re-initialization would be a possibility, but it is pretty painful to implement in practice since we have to kill the monitoring application every time it happens.
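
For what it is worth, the manual sequence above (stop the monitor, re-initialize, restart the monitor) could be scripted roughly as follows. This is only a sketch: `reinit_command` and `recover` are hypothetical names, the stop/start callbacks are whatever your monitoring application needs, and the command line simply follows the `dgrp_cfg_node init -s <speed> <ID> <IP> <ports>` form that appears in this thread. Whether `init` alone is enough to revive a wedged tty is an assumption.

```python
import subprocess


def reinit_command(node_id, ip, ports, speed=None):
    """Build the dgrp_cfg_node init command line as an argument list."""
    cmd = ["dgrp_cfg_node", "init"]
    if speed is not None:
        cmd += ["-s", str(speed)]  # optional WAN link speed
    return cmd + [node_id, ip, str(ports)]


def recover(stop_monitor, start_monitor, node_id, ip, ports,
            speed=None, run=subprocess.check_call):
    """Stop the monitor, re-initialize the node, then restart the monitor."""
    stop_monitor()  # the tty cannot be re-initialized while the monitor blocks on it
    run(reinit_command(node_id, ip, ports, speed))  # raises if the command fails
    start_monitor()  # only reached when re-initialization succeeded
```

The `run` parameter is injectable so the sequence can be dry-run without touching the driver. If `dgrp_cfg_node` fails, the exception propagates and the monitor is deliberately not restarted, since it would just hang again.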

Under Windows the driver reconnects just fine no matter how the network connection is interrupted, which is the behavior that I believe is intended. I have tried interrupting the Windows machine’s internet connectivity while monitoring serial output from the Digi device, and the output picks up within seconds of internet connectivity coming back. Doing the exact same thing under Linux breaks the tty device until it is re-initialized.

The monitoring application is written in Python, so for now we’re simply moving it and the Digi driver to a Windows server. I would really like to get this resolved, though, as the database that the monitoring application is dumping data to is running on a Linux server and I’d rather just keep everything on Linux.

I can think of a few things you might want to check:

*Do you have the WAN link speed configured on the daemons? For example:

dgrp_cfg_node init -s 57600 (ID) (IP) (#_of_ports)

*Is there a keepalive configured on the Digi One SP to kill stale connections?

How long and often are the network disconnects? There might be other settings which might help with recovery.

Thank you for your suggestions and followup. The WAN link speed was initially not configured on the daemons. I have tried setting it as per your instructions and adjusting it up and down, both to approximate the actual mobile broadband WAN speed and just to try more arbitrary high and low values.

As for keepalive on the Digi One SP, yes, it was enabled and set to (a fairly long) 4 minutes and 30 seconds, which I believe is the factory default. I have tried adjusting it down to 10 seconds, which is the lowest possible.

Behavior remains the same regardless of these changes, with the tty device being rendered unusable after a network loss and remaining that way until re-initialized. I have tried waiting for 5+ minutes to see if automatic recovery would happen, but it never does.

Regarding the duration and frequency of the real-life network losses, it varies a bit, but generally they’re in the ballpark of 5-30 seconds happening anywhere from once or twice a day to once or twice an hour. Devices in areas with poor cell coverage are of course most unstable, but we get the occasional brief interrupt even in areas with strong cell signal. In my experience that must be expected from these kinds of mobile broadband connections and we don’t really have any viable options that would give us a 100 % 24/7 stable connection.

During testing I have tried interrupting network connectivity for anything from 1-2 seconds to a couple of minutes, but it seems to make no difference for driver recovery.

After some more testing on the Windows server we decided to just migrate the database as well. Everything is now running on Windows (driver, monitoring application and database), and the driver recovers perfectly from network interruptions of any duration.

We are satisfied with this solution, but I’m still puzzled as to why we were unable to make the driver behave as expected under Linux. It is of course possible that there has been some mistake on my side with regard to configuration of the daemon/drivers or Digi One SP devices, but I have read and believe I have understood the applicable man pages and documentation. Under Windows everything works perfectly with the exact same Digi One SP configuration and no changes whatsoever to driver settings during or after install. I merely install the driver and add each Digi One SP by IP address, and everything is good.

I will continue experimenting with the driver under Linux since we would rather use Linux for these types of applications. The system that’s now on Windows won’t be touched, though, as it has been classified as stable and been put into production.

EDIT: fixed some typos

If you look in /var/log/messages do you see the drpd daemon disconnect and reconnect when the WAN connection is restored?
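
If it helps, here is a small sketch for pulling the relevant lines out of the log. It only does a substring match on "drpd" and makes no assumption about the actual message format. The function name `daemon_lines` is made up, and on Ubuntu the file to check is typically /var/log/syslog rather than /var/log/messages.

```python
def daemon_lines(path="/var/log/messages", needle="drpd"):
    """Return the log lines that mention the daemon name (plain substring match)."""
    with open(path, errors="replace") as log:
        return [line.rstrip("\n") for line in log if needle in line]
```

Running it right after a test outage, for example `print("\n".join(daemon_lines()))` as root, should show whether the daemon logged anything around the time the WAN link dropped and came back.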