DICOM Basics using .NET and C# - Handling Transient Errors during Communications

Apr 26, 2019

Introduction

This is part of my series of articles on the DICOM standard that I am currently working on (a number of them have already been completed). If you are totally new to DICOM, please have a quick look at my earlier article titled "Introduction to the DICOM Standard" for a quick introduction to the standard. This tutorial assumes that you know the basics of C# or any equivalent object-oriented language such as Java or C++.

One issue that often arises in DICOM communications, whether peer devices are connecting with one another within the enterprise or with external partner systems, is what is known as a "transient error". These errors are temporary in nature and often occur due to brief glitches in network connectivity or when the peer or external services are temporarily unavailable or overloaded. To provide the seamless end-user experience that highly critical medical environments require, one needs to design these applications to be resilient and "self-healing" so that they retry a specified number of times before giving up completely. Applications designed this way can make a huge difference, especially where applications and devices often go unattended while personnel are busy attending to more critical problems. These users expect the applications to keep performing when temporary issues occur, without frequent intervention. I have witnessed so many problems in this area that I feel this topic is worth writing about.

Before we go any further, I think it is important to recognize that the behavior of the application should differ when handling non-critical errors and fatal errors. Fatal or critical errors pertain to conditions that will result in complete or near-complete degradation of application functionality if not acted on soon. Errors in this category include extremely low disk space, access or authorization issues, configuration issues, etc. These errors will not normally go away on their own, and the technicians should be alerted to them right away. Transient errors, on the other hand, are caused by conditions that are temporary in nature and will often go away after a few seconds (or a few minutes at most). They occur due to things such as a restart of a peer device, network connectivity drops, etc. It is these kinds of errors that are worth not giving up on right away, and we should ensure that the program logic retries a few times at regular or exponentially increasing intervals before fully giving up.
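
The distinction above could be captured in code as a small classifier. What follows is only a minimal sketch: the `ErrorClassifier` and `ErrorKind` names are hypothetical, and the choice of which .NET exception types count as transient is an assumption that you would need to adjust for the DICOM toolkit and environment you actually use.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

namespace TransientErrorClassificationSketch
{
    public enum ErrorKind { Transient, Fatal }

    // Hypothetical helper (not part of any DICOM library)
    public static class ErrorClassifier
    {
        public static ErrorKind Classify(Exception e)
        {
            switch (e)
            {
                // Assumption: timeouts and socket-level failures are worth retrying
                case TimeoutException _:
                case SocketException _:
                    return ErrorKind.Transient;

                // Network streams usually wrap socket failures in an IOException
                case IOException io when io.InnerException is SocketException:
                    return ErrorKind.Transient;

                // Everything else (authorization, configuration, disk space, ...)
                // is treated as fatal so that technicians are alerted right away
                default:
                    return ErrorKind.Fatal;
            }
        }
    }
}
```

Treating unknown errors as fatal by default is a deliberately cautious choice here: retrying an error that will never go away only delays the alert.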

“Never cut a tree down in the wintertime. Never make a negative decision in the low time. Never make your most important decisions when you are in your worst moods. Wait. Be patient. The storm will pass. The spring will come.” ~ Robert H. Schuller

I have seen many good strategies that have been employed over the years when dealing with transient errors. These include:

use of loops (approach often seen in old procedural programs)
use of delay and other command functions when failures occur (also seen in older programs)
use of exception hierarchies in combination with retry policies and "circuit breakers" (modern approach)

No matter what style of programming or approach is used, employing such a strategy benefits the users of the application. When retry logic is enabled, the retry intervals (time between retries) may be equally spaced, linearly increasing or even exponentially increasing until the maximum number of retries has been reached. Let us take a look at three short code examples showing program logic with and without transient error handling to understand what I mean.
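
Before moving on to those examples, here is a minimal sketch of how the three interval styles mentioned above could be computed. The `RetryIntervals` class and its method names are hypothetical, and the retry attempt number is assumed to start at 1.

```csharp
using System;

namespace RetryIntervalSketch
{
    // Hypothetical helper; retryAttempt is assumed to be 1 for the first retry
    public static class RetryIntervals
    {
        // Equally spaced: the same wait before every retry
        public static TimeSpan Fixed(int retryAttempt, TimeSpan baseDelay)
            => baseDelay;

        // Linearly increasing: 1x, 2x, 3x, ... the base delay
        public static TimeSpan Linear(int retryAttempt, TimeSpan baseDelay)
            => TimeSpan.FromTicks(baseDelay.Ticks * retryAttempt);

        // Exponentially increasing: 1x, 2x, 4x, 8x, ... the base delay
        public static TimeSpan Exponential(int retryAttempt, TimeSpan baseDelay)
            => TimeSpan.FromTicks(baseDelay.Ticks * (1L << (retryAttempt - 1)));
    }
}
```

With a base delay of 2 seconds, the third retry would wait 2, 6 and 8 seconds respectively. A production version would probably also cap the exponential delay at some maximum, since it grows very quickly.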

Approach 1 - Without Transient Error Handling Logic - Bad Approach Seen in Many Applications (do not use this)

In the example below, you will notice that I am simulating a DICOM network operation that throws a transient exception. This is an example of a badly written DICOM (or any healthcare) application that ceases operation as soon as this type of error is raised. The end user then has to needlessly restart the program or manually retry to re-establish connectivity with the DICOM peer device.

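    using System;

    namespace DealingWithTransientErrorsTheBadWay
    {
        // Placeholder exception type standing in for whatever transient error
        // your DICOM toolkit raises (for example, a dropped connection)
        public class SomeTransientDicomException : Exception
        {
            public SomeTransientDicomException(string message) : base(message) { }
        }

        public class Program
        {
            private const int Success = 0;
            private const int Failure = -1;

            static void Main(string[] args)
            {
                try
                {
                    // Some operation that is capable of throwing
                    // both critical/fatal errors as well as transient errors
                    SomeOperationInvolvingDICOMRemotePeer();
                    Environment.Exit(Success);
                }
                catch (Exception e)
                {
                    // Here all exceptions are treated the same and the application gives up too quickly
                    // DO NOT EMPLOY such an approach when writing DICOM applications

                    // in real life, do something about this exception

                    LogToDebugConsole(e.ToString()); // print the error message and stack trace
                    Environment.Exit(Failure); // exit the application
                }
            }

            // Simulates a DICOM network operation that fails with a transient error
            private static void SomeOperationInvolvingDICOMRemotePeer()
            {
                throw new SomeTransientDicomException("Simulated transient failure");
            }

            // Stand-in for whatever logging facility your application uses
            private static void LogToDebugConsole(string message)
            {
                Console.Error.WriteLine(message);
            }
        }
    }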

Approach 2 - Modified with Transient Error Handling Logic - Slightly Better Approach (ugly but will work)

In the example below, I have made the previous example slightly better by using loop-based retry logic. This is an approach that I see in a lot of applications. It is at least better than not handling transient errors at all, but I consider this style to be more procedural, and it will require quite a bit of upkeep and maintenance, especially when you have to handle different errors in the future.

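    using System;
    using System.Threading;

    namespace DealingWithTransientErrorsSlightlyBetterApproach
    {
        // Placeholder exception type standing in for whatever transient error
        // your DICOM toolkit raises (for example, a dropped connection)
        public class SomeTransientDicomException : Exception
        {
            public SomeTransientDicomException(string message) : base(message) { }
        }

        public class Program
        {
            private const int MaxNumRetriesForTransientException = 10;
            private const int TimeBetweenRetriesInSeconds = 10;
            private const int Failure = -1;

            private static int _attempts;

            static void Main(string[] args)
            {
                // Here the application tries a bit harder before giving up.
                // This is a slightly better option compared to giving up at the first sign of trouble
                // I will show you an even better option in the next example below

                for (var retries = 0; ; retries++)
                {
                    try
                    {
                        // Some operation that is capable of throwing
                        // both critical/fatal errors as well as transient errors
                        SomeOperationInvolvingDICOMRemotePeer();

                        // exit from the loop since you are done
                        break;
                    }
                    catch (SomeTransientDicomException)
                    {
                        // retry this exception for the maximum number of times specified
                        if (retries >= MaxNumRetriesForTransientException)
                        {
                            throw; // if maximum retries is reached, then give up
                        }

                        // pause so the peer has a chance to recover before the next attempt
                        Thread.Sleep(TimeSpan.FromSeconds(TimeBetweenRetriesInSeconds));
                    }
                    catch (Exception e)
                    {
                        // in real life, do something about this exception
                        LogToDebugConsole(e.ToString()); // print the error message and stack trace
                        Environment.Exit(Failure); // exit the application
                    }
                }
            }

            // Simulates a DICOM network operation that fails twice and then succeeds
            private static void SomeOperationInvolvingDICOMRemotePeer()
            {
                if (++_attempts <= 2)
                {
                    throw new SomeTransientDicomException("Simulated transient failure");
                }
            }

            // Stand-in for whatever logging facility your application uses
            private static void LogToDebugConsole(string message)
            {
                Console.Error.WriteLine(message);
            }
        }
    }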

“There is a pleasure in the pathless woods. There is a rapture on the lonely shore. There is society where none intrudes. By the deep sea, and music in its roar. I love not man the less, but Nature more.” ~ Lord Byron

Approach 3 - Error Handling Logic Using an Error Handling Framework (best option)

The approach below provides the best solution to the problem of error handling, especially when dealing with transient errors. Good error handling frameworks enable a clear separation between the retry logic or policy you want to employ and the code whose invocation actually causes these errors. This helps the programmer understand the code paths better, which makes the code easier to extend and to debug when needed. Other styles not shown here that I like include the use of annotations to mark methods with the retry policy and the use of "circuit breakers" for even more fine-grained control of error handling when abnormal situations arise within the system. However, those patterns are beyond the scope of this tutorial.

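    using System;
    using Polly;

    namespace DealingWithTransientErrorsBestApproach
    {
        // Placeholder exception type standing in for whatever transient error
        // your DICOM toolkit raises (for example, a dropped connection)
        public class SomeTransientDicomException : Exception
        {
            public SomeTransientDicomException(string message) : base(message) { }
        }

        public class Program
        {
            private const int MaxNumRetriesForTransientException = 10;
            private const int TimeBetweenRetriesInSeconds = 10;

            private static int _attempts;

            public static void Main(string[] args)
            {
                // define a retry policy for how you want to handle a transient exception
                // Use a good library such as Polly (see https://github.com/App-vNext/Polly )

                var retryPolicy = Policy
                    .Handle<SomeTransientDicomException>()
                    .WaitAndRetry(
                        MaxNumRetriesForTransientException,
                        retryAttempt => TimeSpan.FromSeconds(TimeBetweenRetriesInSeconds),
                        (exception, timeSpan, retryCount, context) =>
                        {
                            // this callback runs after each failed attempt, before the next retry;
                            // use it for logging or any other logic to perform before retrying
                            Console.Error.WriteLine(
                                $"Retry {retryCount} in {timeSpan.TotalSeconds}s: {exception.Message}");
                        }
                    );

                // do some operation against the DICOM remote peer under the policy;
                // once the retries are exhausted, the last exception is rethrown
                retryPolicy.Execute(() => SomeOperationInvolvingDICOMRemotePeer());

                // you can also specify gradually increasing retry interval periods (not shown here)
            }

            // Simulates a DICOM network operation that fails twice and then succeeds
            private static void SomeOperationInvolvingDICOMRemotePeer()
            {
                if (++_attempts <= 2)
                {
                    throw new SomeTransientDicomException("Simulated transient failure");
                }
            }
        }
    }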
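
To make the idea of separating the retry policy from the operation more concrete, here is a hand-rolled sketch that uses only the standard library. It is not how Polly is implemented, and the `SimpleRetryPolicy` class is hypothetical. It only illustrates that the policy (how many retries, how long to wait, which errors qualify) can be defined once and then applied to any operation.

```csharp
using System;
using System.Threading;

namespace RetryPolicySketch
{
    // Hypothetical, simplified stand-in for a retry framework
    public sealed class SimpleRetryPolicy
    {
        private readonly int _maxRetries;
        private readonly Func<int, TimeSpan> _delayForAttempt;
        private readonly Func<Exception, bool> _isTransient;

        public SimpleRetryPolicy(
            int maxRetries,
            Func<int, TimeSpan> delayForAttempt,
            Func<Exception, bool> isTransient)
        {
            _maxRetries = maxRetries;
            _delayForAttempt = delayForAttempt;
            _isTransient = isTransient;
        }

        // Runs the operation, retrying only errors the policy considers transient
        public void Execute(Action operation)
        {
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    operation();
                    return;
                }
                catch (Exception e) when (_isTransient(e) && attempt <= _maxRetries)
                {
                    // wait before the next attempt; any other exception, or a
                    // transient one after the last retry, propagates to the caller
                    Thread.Sleep(_delayForAttempt(attempt));
                }
            }
        }
    }
}
```

A policy created as `new SimpleRetryPolicy(10, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)), e => e is TimeoutException)` would retry timeouts up to ten times with exponentially increasing waits, and the code passed to `Execute` would not need to know any of that.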

Conclusion

This concludes my short tutorial on handling the transient errors that you will encounter when working with DICOM devices. Transient errors are temporary in nature: the conditions that cause them exist only momentarily and often disappear very quickly. A well-designed application will have logic in place to retry the failed operation when these errors occur, so that it rarely requires intervention from end users. When implementing the retry logic, please do ensure that any side effects of retrying are minimized. For example, release any resources used when establishing a connection before attempting again, as leaving them allocated will put more strain on your application with every retry. In the next tutorial in my series of articles on the DICOM standard, we will dive more deeply into query and retrieve operations in DICOM. See you then!