# Data Smoothing

SigmaPlot provides seven data smoothing algorithms that should satisfy most smoothing needs: negative exponential, loess, running average, running median, bisquare, inverse square and inverse distance. Each smoother has options that make it very flexible. For example, unequally spaced data that occurs in clumps is better analyzed using the nearest neighbor bandwidth method than a fixed bandwidth method. Outlier rejection is also available in some smoothers.

**Two Dimensional Smoothing**

Smoothing is used to elicit trends from noisy data. The three examples in Tukey’s book Exploratory Data Analysis (Addison-Wesley, 1977) show the need for smoothing beautifully. The trends in the U.S. gold production from 1872 to 1956, Figure 1A, are fairly clear.

Figure 1. Data with trends that are increasingly more difficult to visualize

The peaks and valleys in U.S. wheat production, Figure 1B, are less clear. I challenge you to find the trends visually in the annual New York City precipitation data shown in Figure 1C. The loess algorithm, whose name means locally weighted regression, will be used to smooth these data sets.

Each point along the smooth curve is obtained from a regression of the data points close to that curve point, with the closest points weighted most heavily. The amount of smoothing, which affects the number of points in each regression, is set by the user. A separate weighted regression is performed for each point along the smooth curve.
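The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not SigmaPlot's implementation: it assumes the tricube weight function from the classic loess formulation, a local linear fit, and a nearest neighbor bandwidth. The names `tricube`, `loess_point`, `loess_curve`, `proportion` and `intervals` are hypothetical.

```python
def tricube(u):
    """Tricube weight: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, proportion=0.3):
    """Estimate the smooth at x0 from a weighted local linear regression."""
    # Number of data points in each regression, set by the sampling proportion.
    k = min(len(xs), max(2, round(proportion * len(xs))))
    # Nearest neighbor bandwidth: distance to the k-th closest point,
    # inflated slightly so that point keeps a small nonzero weight.
    h = max(sorted(abs(x - x0) for x in xs)[k - 1], 1e-12) * 1.001
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    # Degenerate window (all weighted x values equal): use the weighted mean.
    slope = sxy / sxx if sxx > 1e-12 * sw * h * h else 0.0
    return my + slope * (x0 - mx)


def loess_curve(xs, ys, proportion=0.3, intervals=100):
    """Run one weighted regression per point along the smooth curve."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * i / intervals for i in range(intervals + 1)]
    return grid, [loess_point(g, xs, ys, proportion) for g in grid]
```

Calling `loess_curve(xs, ys, proportion=0.1)` returns the curve x positions and the smoothed values. A smaller `proportion` uses fewer points per regression and follows the data more closely; a larger one smooths more.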

Loess smoothed curves for the three examples in Figure 1 are shown in Figure 2. The smoothed curves in Figures 2A and 2B make the trends in the gold and wheat data very clear. Even with the smooth curve in Figure 2C as a guide, the precipitation trend remains difficult to see in the raw data.

To confirm the loess smoothed curve, a histogram of average rainfall in ten year intervals was computed and superimposed on the smooth curve. The histogram agrees well with the loess smooth.
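The ten year averages used for such a check are simple to compute. The sketch below is one possible way to do it and assumes that intervals are aligned to calendar decades (1950 to 1959, 1960 to 1969, and so on), which the text does not specify. The function name `interval_means` is hypothetical.

```python
from collections import defaultdict


def interval_means(years, values, width=10):
    """Average the values that fall in consecutive intervals of `width` years."""
    groups = defaultdict(list)
    for year, value in zip(years, values):
        start = (year // width) * width  # e.g. 1953 falls in the 1950 interval
        groups[start].append(value)
    # Map each interval's starting year to the mean of its values.
    return {start: sum(v) / len(v) for start, v in sorted(groups.items())}
```

Plotting the returned means as bars over the smooth curve gives an independent, if coarse, view of the same trend.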

The loess smoothing parameters were varied to achieve the best visualization. A polynomial degree of one was used in all cases. A sampling proportion of 0.1 was used in Figures 2A and 2B and 0.3 in Figure 2C. Because the data was unequally spaced along the x axis, the nearest neighbor bandwidth method was used.

The default number of intervals (100) for generation of the smooth curve was found to be the best. The curve is drawn with straight line segments between curve points, which sometimes produces sharp corners in the smooth, so the spline interpolation line type (Smoothed (spline)) was used.

Figure 2. Smoothed curves for data in Figure 1. A ten year average rainfall histogram is also shown in C.

Several of the smoothing methods, including loess, are based on local polynomial regression and the polynomial order is selectable. Increasing the order tends to include more high frequency components in the smooth. The effect of increasing the order from 1 (local linear regressions) to 2 (local quadratic regressions) is shown in Figure 3. The effect is to increase peak height magnitude and introduce additional high frequency components (wiggles) in B. A subsequent increase of the sampling proportion in C results in a smooth very much like the original for order 1 in A.

Figure 3. Effect of increasing the regression polynomial order. The order is 1 and sampling proportion is 0.1 in A. The order is increased to 2 in B and then the sampling proportion is increased to 0.2 in C.
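The effect of the polynomial order can be reproduced with a small local polynomial fit. The sketch below is an illustration under stated assumptions, not SigmaPlot's code: it uses tricube weights, a nearest neighbor bandwidth, and a synthetic noise-free Gaussian peak rather than the data in Figure 3. The names `solve`, `local_poly` and `proportion` are hypothetical.

```python
import math


def tricube(u):
    """Tricube weight: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def local_poly(x0, xs, ys, degree, proportion):
    """Fitted value at x0 from a weighted local polynomial of given degree."""
    k = min(len(xs), max(degree + 2, round(proportion * len(xs))))
    h = sorted(abs(x - x0) for x in xs)[k - 1] * 1.001
    w = [tricube((x - x0) / h) for x in xs]
    p = degree + 1
    # Weighted normal equations for the basis 1, (x - x0), (x - x0)^2, ...
    a = [[sum(wi * (x - x0) ** (i + j) for wi, x in zip(w, xs))
          for j in range(p)] for i in range(p)]
    b = [sum(wi * (x - x0) ** i * y for wi, x, y in zip(w, xs, ys))
         for i in range(p)]
    return solve(a, b)[0]  # the intercept is the fitted value at x0


# Synthetic peak of height 1 at x = 50 (illustrative data only).
xs = [float(i) for i in range(101)]
ys = [math.exp(-0.5 * ((x - 50.0) / 4.0) ** 2) for x in xs]

peak_order1 = local_poly(50.0, xs, ys, degree=1, proportion=0.1)
peak_order2 = local_poly(50.0, xs, ys, degree=2, proportion=0.1)
peak_order2_wide = local_poly(50.0, xs, ys, degree=2, proportion=0.2)
```

On this synthetic peak the order 2 fit recovers more of the true peak height than the order 1 fit at the same sampling proportion, and widening the sampling proportion pulls the order 2 estimate back down, which mirrors the behavior described for Figure 3.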

**Three Dimensional Smoothing**

Visualizing spatial relationships in a three dimensional scatter plot can be very difficult. The strongest three dimensional cue is provided by an animated rotation of the data. Since this is not possible in paper publications, we must resort to devices such as drop lines or enclosing the graph with additional axes. Figure 4 shows that a smooth surface also helps.

This data describes the reaction characteristics of an isomer of hexane. The smooth surface in B clearly shows the trends with respect to temperature and reaction rate, whereas visualizing them in the scatter plot in A is difficult.

Figure 4. The data trend in A is easily visualized with a loess smoothed surface, B.

This data is relatively sparse, so a large sampling proportion of 0.6 was required to avoid oscillations and spikes in the loess surface. A polynomial degree of 1 and the nearest neighbor bandwidth method were used. The Preview feature allows a quick comparison of smoothing methods on a given data set. For this data, essentially equivalent smooth surfaces were obtained with the negative exponential and bisquare methods.

The bandwidth method option is also very useful. The nearest neighbor method works well for unequally spaced data. The data in Figure 4 is unequally spaced in both X and Y. Compare the smoothing results using the nearest neighbor and fixed methods shown in Figure 5. The fixed method result shown, with a sampling proportion of 0.8, is about the best that could be obtained by varying the sampling proportion.

Figure 5. Comparison of bandwidth methods for unequally spaced data. Nearest neighbor on the left and fixed on the right.
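The difference between the two bandwidth methods is easiest to see in one dimension. The sketch below is a simplified illustration on synthetic clumped data, not the data in Figure 5. It assumes that the fixed bandwidth is a fraction of the x range and that the nearest neighbor bandwidth is the distance to the k-th closest point; SigmaPlot's exact definitions may differ. All function names are hypothetical.

```python
def fixed_bandwidth(xs, proportion):
    """Same half-width everywhere: a fraction of the x range (assumed)."""
    return proportion * (max(xs) - min(xs))


def nearest_neighbor_bandwidth(x0, xs, proportion):
    """Half-width that adapts so the window always reaches k data points."""
    k = min(len(xs), max(2, round(proportion * len(xs))))
    return sorted(abs(x - x0) for x in xs)[k - 1]


def points_within(x0, xs, h):
    """Count the data points that fall inside the window around x0."""
    return sum(1 for x in xs if abs(x - x0) <= h)


# Two clumps of data with a wide empty gap between them (synthetic).
xs = [i / 10 for i in range(10)] + [9 + i / 10 for i in range(10)]
x0 = 5.0  # evaluation point in the middle of the gap

n_fixed = points_within(x0, xs, fixed_bandwidth(xs, 0.2))
n_nearest = points_within(x0, xs, nearest_neighbor_bandwidth(x0, xs, 0.2))
```

In the gap the fixed window contains no data at all, so a local regression there has nothing to fit, while the nearest neighbor window widens until it reaches the requested number of points. This is one reason a fixed bandwidth needs a large sampling proportion on clumped data.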