DICOM Basics - Handling Transient Errors during Communications

This is part of my series of articles on the DICOM standard that I am currently working on (a number of them have already been completed). If you are totally new to DICOM, please have a look at my earlier article titled "Introduction to the DICOM Standard" for a quick overview of the standard. This tutorial assumes that you know the basics of Java or an equivalent object-oriented language such as C# or C++.

Introduction

One issue that often arises in DICOM communications, whether peer devices are connecting to one another within the enterprise or to external partner systems, is what is known as a "transient error". These errors are temporary in nature and usually occur because of brief glitches in network connectivity, or because the peer or external service is temporarily unavailable or overloaded. To provide the seamless end-user experience that critical medical environments demand, applications need to be resilient and "self-healing": they should retry a failed operation a specified number of times before giving up completely. This makes a huge difference where applications and devices run unattended while personnel are busy attending to more critical problems; users expect these applications to keep working through temporary issues without frequent intervention. I have witnessed so many problems in this area that I feel the topic is worth writing about.

Before we go any further, it is important to recognize that an application should behave differently when handling fatal errors and transient errors. Fatal (or critical) errors are conditions that result in complete or near-complete loss of application functionality if not acted on soon. Examples include extremely low disk space, access or authorization failures, and configuration problems. These errors will not normally go away on their own, so technicians should be alerted to them right away. Transient errors, on the other hand, are caused by temporary conditions and will often go away after a few seconds (or a few minutes at most). Typical causes are a peer device restarting or a brief drop in network connectivity. These are the errors worth not giving up on right away: the program logic should retry a few times, at regular or exponentially increasing intervals, before giving up for good.
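To make the distinction concrete, here is a minimal sketch of how an application might sort errors into these two categories. The `ErrorClassifier` class and its `Category` enum are hypothetical names of my own, and the choice of which Java exception types count as transient is an assumption; a real DICOM toolkit will raise its own exception types that you would map in the same way.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.nio.file.AccessDeniedException;

// Hypothetical helper: the mapping from exception type to category is an
// assumption made for illustration, not something defined by DICOM.
final class ErrorClassifier {

    enum Category { TRANSIENT, FATAL }

    static Category classify(Throwable error) {
        // A refused or timed-out connection often means the peer is restarting
        // or the network dropped briefly, so it is worth retrying.
        if (error instanceof ConnectException || error instanceof SocketTimeoutException) {
            return Category.TRANSIENT;
        }
        // Everything else (access denied, bad configuration, and so on) is
        // treated as fatal here and should be reported to a technician.
        return Category.FATAL;
    }

    public static void main(String[] args) {
        System.out.println(classify(new ConnectException("Connection refused")));        // TRANSIENT
        System.out.println(classify(new AccessDeniedException("/archive")));             // FATAL
        System.out.println(classify(new IllegalStateException("AE title not configured"))); // FATAL
    }
}
```

Defaulting unknown errors to fatal is a deliberately conservative choice: it is usually safer to alert someone about an unexpected error than to retry it silently.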

“Every child begins the world again.” ~ Henry David Thoreau

I have seen many good strategies that have been employed over the years when dealing with transient errors. These include:

  • use of loops (approach often seen in old procedural programs)
  • use of delay and other command functions when failures occur (also seen in older programs)
  • use of exception hierarchies in combination with retry policies and "circuit breakers" (modern approach)

No matter what style of programming or approach is used, employing such a strategy benefits the users of the application. When retry logic is enabled, the retry intervals (the time between retries) may be equally spaced, linearly increasing, or exponentially increasing until the maximum number of retries is reached. Let us take a look at three short code examples that show program logic with and without transient error handling to understand what I mean.
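Before getting to those examples, the three interval styles mentioned above can be sketched as simple delay calculations. This is only an illustration: the `RetryDelays` class, its method names, and the 2-second base and 20-second cap are all assumptions of mine. Capping the exponential delay is a common precaution so that the wait does not grow without bound.

```java
import java.time.Duration;

// Hypothetical helper showing three ways to space out retries.
// retryNumber is 1-based: 1 is the wait before the first retry.
final class RetryDelays {

    // Equally spaced: the same wait before every retry.
    static Duration fixed(Duration base, int retryNumber) {
        return base;
    }

    // Linearly increasing: base, 2 x base, 3 x base, ...
    static Duration linear(Duration base, int retryNumber) {
        return base.multipliedBy(retryNumber);
    }

    // Exponentially increasing: base, 2 x base, 4 x base, ... limited to cap.
    static Duration exponential(Duration base, int retryNumber, Duration cap) {
        int shift = Math.min(retryNumber - 1, 30); // keep the multiplier within a sane range
        Duration delay = base.multipliedBy(1L << shift);
        return delay.compareTo(cap) > 0 ? cap : delay;
    }

    public static void main(String[] args) {
        Duration base = Duration.ofSeconds(2); // assumed base interval
        Duration cap = Duration.ofSeconds(20); // assumed upper limit
        for (int retry = 1; retry <= 5; retry++) {
            System.out.println("retry " + retry
                    + ": fixed=" + fixed(base, retry).getSeconds() + "s"
                    + " linear=" + linear(base, retry).getSeconds() + "s"
                    + " exponential=" + exponential(base, retry, cap).getSeconds() + "s");
        }
    }
}
```

With these values the exponential schedule waits 2, 4, 8, 16 and then 20 seconds (the fifth value is capped), while the linear schedule waits 2, 4, 6, 8 and 10 seconds.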

Approach 1 - Without Transient Error Handling Logic - Bad Approach Seen in Many Applications (do not use this)

In the example below, I simulate a DICOM network operation that throws a transient exception. This is typical of a badly written DICOM (or any healthcare) application that simply ceases operation once such an error is raised. The end user then has to needlessly restart the program or manually retry the connection to the DICOM peer device.

    package com.saravanansubramanian.dicom.pixelmedtutorial;

    public class DealingWithTransientErrorsTheBadWay {
        
        public static void main(String[] args) {
            try {

                // Some operation that is capable of throwing 
                // both critical/fatal errors as well as transient errors
                SomeOperationInvolvingDICOMRemotePeer();
            }
            catch (Throwable e) {
                
                // Here all exceptions are treated the same and the application gives up too quickly
                // DO NOT EMPLOY such an approach when writing DICOM applications
                
                e.printStackTrace(); //print the stack trace of the error
                System.exit(1); //exit the application
            }
        }

    }

Approach 2 - Modified with Transient Error Handling Logic - Slightly Better Approach (ugly but will work)

In the example below, I have improved the previous example slightly by adding loop-based retry logic. This is an approach that I see in a lot of applications. It is at least better than not handling transient errors at all, but I consider this style to be more procedural, and it will require quite a bit of upkeep and maintenance, especially when you have to handle different errors in the future.

    package com.saravanansubramanian.dicom.pixelmedtutorial;

    public class DealingWithTransientErrorsSlightlyBetterApproach {
        
        private static final int MAX_NUM_RETRIES_FOR_TRANSIENT_EXCEPTION = 10;

        public static void main(String[] args) {
            
            //Here the application tries a bit harder before giving up. 
            //This is a slightly better option compared to giving up at the first sign of trouble
            //I will show you an even better option in the next example below
            
            for (int retries = 0;; retries++) {
                try {
                    // Some operation that is capable of throwing 
                    // both critical/fatal errors as well as transient errors
                    SomeOperationInvolvingDICOMRemotePeer();
                    
                    //exit from the loop since you are done
                    break;
                } 
                catch (SomeTransientDicomException e) { 
                    //retry this exception for the maximum number of times specified
                    if (retries < MAX_NUM_RETRIES_FOR_TRANSIENT_EXCEPTION) {
                        continue;
                    } else {
                        throw e; //if maximum retries is reached, then give up
                    }
                }
                catch (Exception e) {
                    e.printStackTrace(); //print the stack trace of the error
                    System.exit(1); //exit the application
                }
            }
        }

    }
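The loop above retries immediately, which can hammer a peer that is still restarting. Below is a minimal, self-contained sketch of the same loop-based idea with a pause between attempts. The `RetryLoop` class, its `TransientException` type (standing in for `SomeTransientDicomException`), and the simulated operation are all hypothetical and exist only so that the example can run on its own.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of a loop-based retry with a fixed pause between attempts.
final class RetryLoop {

    // Stand-in for whatever transient exception your DICOM toolkit raises.
    static class TransientException extends Exception {
        TransientException(String message) {
            super(message);
        }
    }

    // Runs the operation, retrying only on TransientException.
    // Any other exception propagates to the caller immediately.
    static <T> T callWithRetries(Callable<T> operation, int maxRetries, long delayMillis)
            throws Exception {
        for (int retries = 0; ; retries++) {
            try {
                return operation.call();
            } catch (TransientException e) {
                if (retries >= maxRetries) {
                    throw e; // retry budget exhausted, give up
                }
                Thread.sleep(delayMillis); // wait before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulated operation: fails twice with a transient error, then succeeds.
        String result = callWithRetries(() -> {
            if (++attempts[0] < 3) {
                throw new TransientException("peer busy");
            }
            return "association established";
        }, 5, 100);
        System.out.println(result + " after " + attempts[0] + " attempts");
        // prints "association established after 3 attempts"
    }
}
```

Pulling the loop into a helper keeps the retry bookkeeping out of the calling code, but every new policy (different delays, different exception types) still means editing this method by hand.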

Approach 3 - Error Handling Logic Using an Error Handling Framework (best option)

The approach below provides the cleanest solution of the three to the problem of error handling, especially when dealing with transient errors. Good error handling frameworks enable a clear separation between the retry logic or policy you want to employ and the code that is actually invoked and may cause these errors. This helps the programmer understand the code paths better, which makes the code easier to extend and to debug. Other styles not shown here that I like include the use of annotations to mark methods with a retry policy and the use of "circuit breakers" for even more fine-grained control over error handling when abnormal situations arise within the system. However, those patterns are beyond the scope of this tutorial.

    package com.saravanansubramanian.dicom.pixelmedtutorial;

    import java.util.concurrent.Callable;
    import java.util.concurrent.TimeUnit;
    import net.jodah.failsafe.Failsafe;
    import net.jodah.failsafe.RetryPolicy;

    public class DealingWithTransientErrorsBestApproach {

        private static final int MAX_NUM_RETRIES_FOR_TRANSIENT_EXCEPTION = 10;
        private static final int TIME_BETWEEN_RETRIES_IN_SECONDS = 10;
        
        public static void main(String[] args) {

            //define a retry policy for how you want to handle a transient exception
            //Use a good library such as one of the following:
            
            //1. https://github.com/jhalterman/failsafe 
            //2. https://github.com/elennick/retry4j
            
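            //I am using the FailSafe library here (option #1 above)
            //Note: retryOn() and withDelay(long, TimeUnit) belong to the older 1.x API of the library;
            //later major versions changed these calls, so check the documentation for your version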
            RetryPolicy retryPolicy = new RetryPolicy()
                    .retryOn(SomeTransientDicomException.class)
                    .withDelay(TIME_BETWEEN_RETRIES_IN_SECONDS, TimeUnit.SECONDS)
                    .withMaxRetries(MAX_NUM_RETRIES_FOR_TRANSIENT_EXCEPTION);
            
            // Invoke the DICOM operation with the retry policy 
            // this approach makes the logic easier to read, debug and also extend in the future
              
            Failsafe.with(retryPolicy).run(() -> SomeOperationInvolvingDICOMRemotePeer());

            // you can specify additional logic to perform before the next retry (not shown here)
            // you can also specify gradually increasing retry interval periods (not shown here)
        }

    }

This concludes my short tutorial on handling the transient errors that you will encounter when working with DICOM devices. Transient errors are temporary in nature: the conditions that cause them exist only momentarily and often disappear very quickly. A well-designed application will have logic in place to retry the failed operation when these errors occur, so that it rarely requires intervention from end users. When implementing the retry logic, please ensure that side effects of retrying are minimized, for example by releasing any resources acquired while establishing a connection before attempting again; otherwise each retry will put more strain on your application. In the next tutorial in my series of articles on the DICOM standard, we will dive more deeply into query and retrieve operations in DICOM. See you then!