write(2) blocks indefinitely on dgrp-1.9 x86-64 linux

We are running dgrp-1.9-24 on an x86-64 linux-2.6.31 kernel (Intel Xeon X3430 processor), connected to an Etherlite 80.

Following a successful call to select(2) which indicates that the file descriptor is writable, a call to write(2) to write 10 bytes sometimes blocks, seemingly indefinitely (for at least 24 hours). The application is multi-threaded, but all accesses to the serial port are protected by a mutex, which the thread concerned holds. Other ports on the Etherlite (used by the same application) continue to work normally. Immediately prior to issuing the select and write, the application sets DTR to on using the TIOCMSET ioctl.
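
For reference, the DTR step looks roughly like the sketch below. This is not the application's actual code: the helper name raise_dtr is made up for illustration, and it assumes the modem bits are read with TIOCMGET and written back with TIOCMSET so that the other lines are left untouched.

```c
#include <sys/ioctl.h>
#include <termios.h>

/* Hypothetical helper: raise DTR while leaving the other modem lines alone.
 * Returns 0 on success, -1 on error (errno is set by ioctl). */
int raise_dtr(int fd)
{
    int bits;

    /* Read the current modem control bits. */
    if (ioctl(fd, TIOCMGET, &bits) == -1)
        return -1;

    /* Set DTR and write the whole bit set back. */
    bits |= TIOCM_DTR;
    return ioctl(fd, TIOCMSET, &bits);
}
```

The select and write calls follow immediately after this returns.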

What firmware version is being used?

What is shown within dpa.dgrp when the port is in this condition for the active port signals?

How is the port recovered (i.e. killing the application, rebooting the Etherlite, etc…)?

Is it possible there was a short network outage that went undetected by the application? Check the /var/log/messages file for any corresponding messages.

Firmware version V1.6

In this condition, dpa.dgrp crashes with a floating point exception when I try to view the details of the ‘hung’ port (other ports display correctly).

Using the /sys/ interface:
‘cat baud_info’ generates a segfault
‘cat state_info’ shows Open
‘cat msignals_info’ shows RTS CTS DTR DSR
‘cat cflag_info’ shows 0
‘cat digiflag_info’ shows 80

As a result of the ‘cat baud_info’, there is a stack trace in the kernel log:

 divide error: 0000 [#1] SMP 
 last sysfs file: /sys/devices/virtual/tty/tty_dgrp_aa_2/baud_info
 CPU 1 
 Modules linked in: dgrp
 Pid: 1294, comm: cat Not tainted 2.6.31-gentoo-r10 #1 PowerEdge R210
 RIP: 0010:[]  [] dgrp_tty_baud_show+0x59/0x70 [dgrp]
 RSP: 0018:ffff88009b8cfe88  EFLAGS: 00010246
 RAX: 00000000001c2000 RBX: fffffffffffffffb RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88023c4ba800
 RBP: ffff88009b8cfe88 R08: ffff88009ba85000 R09: 0000000000000017
 R10: ffffea000220cd18 R11: 0000000000000246 R12: ffffffffa00185e0
 R13: ffff88023d93d320 R14: ffff8801aa42d5c0 R15: ffff88023c4ba810
 FS:  00007f8bfc63e6f0(0000) GS:ffff88002804d000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f8bfd61d038 CR3: 000000009da67000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process cat (pid: 1294, threadinfo ffff88009b8ce000, task ffff88023a6dc230)
  ffff88009b8cfea8 ffffffff812edfba ffffffffffffffed ffff8801aa42d5a0
 <0> ffff88009b8cff08 ffffffff8111aeb0 ffff88009b8cfee8 ffff88009b8cff48
 <0> 0000000000008000 00007f8bfd615000 ffffffff8176cd10 ffff88023dd5cd80
 Call Trace:
  [] dev_attr_show+0x2a/0x60
  [] sysfs_read_file+0xb0/0x170
  [] vfs_read+0xc8/0x1a0
  [] sys_read+0x50/0x90
  [] system_call_fastpath+0x16/0x1b
 Code: 18 01 a0 be 00 10 00 00 4c 89 c7 31 c0 e8 30 71 21 e1 c9 48 98 c3 0f 1f 40 00 0f b7 b2 70 01 00 00 ba 00 20 1c 00 89 d0 c1 fa 1f  fe 89 c1 eb cb 90 31 c0 c9 c3 66 66 66 2e 0f 1f 84 00 00 00 
 RIP  [] dgrp_tty_baud_show+0x59/0x70 [dgrp]
 ---[ end trace 9af1b5ef220e0966 ]---

The port is recovered by restarting the application. In every case where it has failed, it was while sending the AT command to the attached modem in preparation for answering an incoming call. Because this is the only place where DTR is raised before sending data, I suspected a race condition, so I tried adding a 0.5s delay after the ioctl that raises DTR. After this change it ran longer before failing, but that may just have been a coincidence.

A wake-up in select() on a write fd only guarantees that at least 1 byte can be written.

If you send 10 bytes, the write could still block, possibly forever, depending on whether the tty is in nonblocking mode or not.

From the WRITE(P) man page, which gives a better description of blocking/nonblocking on write IO:

   When  attempting  to write to a file descriptor (other than a pipe or FIFO) that supports non-
   blocking writes and cannot accept the data immediately:

    * If the O_NONBLOCK flag is clear, write() shall block the calling thread until the data  can
      be accepted.

    * If  the  O_NONBLOCK  flag  is set, write() shall not block the thread.  If some data can be
      written without blocking the thread, write() shall write what it can and return the  number
      of bytes written. Otherwise, it shall return -1 and set errno to [EAGAIN].

Did you open the tty in nonblocking mode (O_NONBLOCK), or set nonblocking mode on it after the open?
If not, could you add this and try again?
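
Something along these lines might be a starting point. It is only a minimal sketch, not code taken from your application or from the driver: the function names set_nonblocking and write_with_timeout and the timeout parameter are made up for illustration. It sets O_NONBLOCK with fcntl(), then loops on select() and write(), handling short writes and EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch an already-open descriptor to non-blocking
 * mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: try to write len bytes, waiting at most timeout_sec
 * seconds in each select() call. Returns the number of bytes written
 * (short if a select() timed out), or -1 on error. */
ssize_t write_with_timeout(int fd, const char *buf, size_t len, int timeout_sec)
{
    size_t done = 0;

    while (done < len) {
        fd_set wfds;
        struct timeval tv;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        int rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;
        }
        if (rc == 0)
            break;                 /* timed out: report the short count */

        ssize_t n = write(fd, buf + done, len - done);
        if (n == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
                continue;          /* no room after all, wait again */
            return -1;
        }
        done += (size_t)n;         /* partial writes are normal here */
    }
    return (ssize_t)done;
}
```

With the descriptor in nonblocking mode, a stuck port would show up as a short return value after the timeout rather than as a thread parked inside write(), which should also make it easier to log what state the port is in when it happens.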