Why does a misbehaving SMTP server cause the Connect SP web server thread to crash?

Wireshark logs show that an SMTP server is sending TCP/IP resets on connection attempts when the Connect SP sends email; this is unexpected behavior from the SMTP server, but it eventually causes the Connect SP web server thread to terminate with an error. The Connect SP then becomes “dead” to all TCP/IP traffic. JTAG shows that all the other threads appear to be running properly, and various queues and memory pools appear to be intact. Has anyone else run into this problem? We suspect that the Connect SP is trying to reuse a closed port, but we can’t explain why the web server thread is dying, or what we should do to prevent this from occurring.

The web server stops running because the mail client is built on top of the web server engine.

When you state that “the Connect SP then becomes ‘dead’ to all TCP/IP traffic,” can you still ping the board, or do pings also fail?

Have you ever seen this when the mail server does not RST the connections, or only when it RSTs them?

Are you running the -C type module or the -S type module? That is, is this a “plug-and-play” module or a module running NET+OS?

If NET+OS (the -C type), what version of NET+OS?

Thanks for your response! We suspected that the mail client used the web server engine, so we weren’t too surprised that was the thread that died. Pinging the board in this state elicits no response; Wireshark logs show the incoming ping, but nothing coming back. We have never experienced this except when the mail server issues RSTs. We’re running NET+OS v7.42 on the Connect SP. I have Wireshark logs available if anyone is interested.

Do you have a way of placing the module on a network that does NOT have access to a mail server? In that case, the module won’t get RSTs; it just won’t get any answers. Do you get a similar error profile, or does the API that starts the mail client just return an error stating that no server is available?

Or, maybe better, set up the email client to point to a PC (or other machine) that does NOT run a mail server. When the NET+OS email client tries to connect to that PC, the PC should respond with RSTs (NET+OS will try to open a port on the PC, but since no process on the PC is listening on that port, the PC’s stack will respond with RSTs). Does this scenario cause the issue to occur?
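If it helps, that second scenario is easy to sanity-check from another PC first. A few lines of ordinary BSD-sockets code will show the SYN being answered with an RST, which the application just sees as a refused connection. This is a sketch only: it is plain POSIX code for a test PC, not NET+OS code, and the IP address and port below are placeholders.

```c
/* PC-side sketch: attempt a TCP connect to a host/port with no listener
 * and confirm the peer's stack answers the SYN with an RST, which the
 * application sees as ECONNREFUSED.  Address and port are placeholders. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(25);                       /* SMTP port, no listener   */
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);  /* placeholder test address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        if (errno == ECONNREFUSED)
            printf("SYN was answered with RST (connection refused)\n");
        else
            printf("connect failed: %s\n", strerror(errno));
    } else {
        printf("unexpected: something is listening on that port\n");
    }

    close(fd);   /* release the descriptor either way */
    return 0;
}
```

If Wireshark on that PC shows the same RST pattern your SMTP server produces, the two environments should be comparable.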

We have indeed run those experiments; we cannot get the Connect SP to fail in any environment other than with the specific SMTP server that sends the resets.

Hmmmm, this is an interesting one.
If you take a trace of the device communicating with a PC that has no mail server running on it, and a trace of the device communicating with this mail server that RSTs, what are the differences between the traces? RSTs alone should just cause the APIs to return a status indicating either a TCP error or a system error.

What PORT is the mail server RSTing? I am assuming that it is either 25 or 110.

ESP has a sample application called the email services sample. Can you replicate this issue with that sample project?

We are using the standard port 25 for SMTP.

On a device communicating with a PC that is not running a server, connection attempts simply fail and everything continues to work properly. On a device communicating with the specific server that generates the RSTs, everything seems to operate normally at first (mail is being delivered, no failures). Then there are several SMTP SYN attempts (five in the example I’m looking at) with no response from the SMTP server, and the SMTP server finally responds to a SYN after roughly 40 seconds. The Connect SP attempts to close (RST) some of the earlier SYN attempts, the RST then RST, ACK dance begins, and after a period of 18 to 24 hours the Connect SP web server thread dies.

We have not attempted to replicate with the sample project yet.
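For reference, the pattern we understand the client side to follow for each attempt, sketched with generic BSD sockets (a non-blocking connect with a timeout), looks roughly like the following. This is illustrative only, not the actual NET+OS mail client internals, and the address, port, and timeout are placeholders.

```c
/* Generic sketch of one SMTP connection attempt with a timeout:
 * non-blocking connect(), wait for writability, check SO_ERROR, and
 * always release the socket so abandoned attempts do not leak resources.
 * Illustrative BSD-sockets code, not the NET+OS mail client. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Returns 0 on success, -1 on refusal or timeout.  ip/port are placeholders. */
static int smtp_connect_with_timeout(const char *ip, unsigned short port,
                                     int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &peer.sin_addr);

    int rc = connect(fd, (struct sockaddr *)&peer, sizeof(peer));
    if (rc < 0 && errno == EINPROGRESS) {
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

        if (select(fd + 1, NULL, &wfds, NULL, &tv) == 1) {
            int err = 0;
            socklen_t len = sizeof(err);
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
            rc = (err == 0) ? 0 : -1;   /* err is ECONNREFUSED on an RST */
        } else {
            rc = -1;                    /* timed out: no SYN-ACK arrived */
        }
    }

    /* In a real client the caller would keep fd open and speak SMTP here;
     * for this sketch we just report the result and clean up. */
    close(fd);
    return rc;
}

int main(void)
{
    /* 192.0.2.25 and port 25 are placeholders for the SMTP server. */
    return smtp_connect_with_timeout("192.0.2.25", 25, 10) == 0 ? 0 : 1;
}
```

The point of the sketch is only that an unanswered or reset attempt should end with the socket being released, which is why the eventual thread death surprised us.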

Digi support has replicated this issue and has opened an internal case (NETOS-53). When a mail server RSTs connections on acceptance, it causes some heap growth; eventually the heap is exhausted and the device crashes: a classic memory leak. Hopefully posting this to the community will help raise visibility on the issue.
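For anyone who wants to reproduce the trigger without access to the misbehaving mail server, a throwaway PC-side listener that accepts each connection and then immediately resets it should emulate the “RST on acceptance” behavior: closing a socket with SO_LINGER set to a zero timeout makes the stack send an RST instead of a normal FIN. This is a sketch for an ordinary POSIX test PC, not NET+OS code, and the port number is a placeholder.

```c
/* PC-side sketch: a TCP listener that accepts each connection and then
 * immediately resets it.  SO_LINGER with a zero timeout makes close()
 * send an RST rather than a graceful FIN, emulating a mail server that
 * RSTs connections on acceptance.  POSIX code for a test PC, not NET+OS. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) {
        perror("socket");
        return 1;
    }

    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(2525);   /* placeholder port; 25 needs root */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 8) < 0) {
        perror("bind/listen");
        return 1;
    }

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0)
            continue;

        /* Arrange for close() to send an RST instead of a graceful FIN. */
        struct linger lg = { .l_onoff = 1, .l_linger = 0 };
        setsockopt(cfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));

        close(cfd);            /* connection is reset immediately */
        printf("accepted and reset one connection\n");
    }
}
```

Pointing the Connect SP email client at a listener like this and leaving it running for an extended period should show the same gradual heap growth, which may help when verifying a fix for NETOS-53.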