Ride that Vector Wave and You Might Get Epsilon Tubed
Every now and then something comes along that is kind of different. Support Vector Regression (SVR) is one of those things. Unlike most regression methods, SVR comes out of the data science field as an offshoot of Support Vector Machines (SVM). We wanted to take a look at how SVR stacks up when applied to time-series prediction problems, like modeling daily energy use.
Like Ordinary Least Squares Regression (OLS), Support Vector Regression starts out with an equation that generates predicted values. So, if Y is the thing we are trying to explain, X1 to Xk are explanatory factors, e is the model error, and t is the time period, we can write the SVR equation as:

$$Y_t = a + b_1 X_{1,t} + b_2 X_{2,t} + \dots + b_k X_{k,t} + e_t$$
So, the difference between linear SVR (i.e., SVR that uses a linear Kernel function) and other linear methods is not about how we compute the predicted value. The difference lies in the way we estimate the unknown parameters that are used to calculate the predicted value.
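To make that concrete, here is a minimal sketch of the prediction step in Python. The intercept a, slopes b, and data matrix X are illustrative placeholders, not values from a fitted model:

```python
import numpy as np

# A linear SVR prediction is computed exactly like any other linear
# model. The numbers here are made up for illustration.
a = 2.0                      # intercept
b = np.array([0.5, -1.2])    # slope parameters b1..bk
X = np.array([[10.0, 3.0],   # one row per time period t,
              [12.0, 4.0]])  # one column per explanatory factor X1..Xk

y_hat = a + X @ b            # predicted values: a + b1*X1 + ... + bk*Xk
print(y_hat)                 # [3.4, 3.2]
```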
As you probably know, in OLS the constant and slope parameters are estimated by minimizing the sum of squared errors. The objective function in SVR is a bit more complicated and is based on two main ideas:
Idea 1: Small slope parameters (b values) are better than large slope parameters.
Idea 2: We only care about the absolute value of errors above a specified threshold.
The first idea is present in other regression methods like Ridge, Lasso and Elastic Net. These methods impose a penalty on the size of the slope parameters. In SVR, as in Ridge, the sum of squared slope parameters is used, and part of the SVR objective function is to make this sum small.
The second idea is best explained by a picture. The picture below shows a simple linear model with only one explanatory factor (X). The black line labeled “a+bX” is the linear equation. The red and blue points show the Y values for each of the X observations. The green lines are parallel to the black line and the distance up and down is the same amount (epsilon or eps for short). Now for some nomenclature:
- Epsilon tube – the area between the green lines
- Support Vectors – the red points – outside the epsilon tube
- Epsilon Insensitive – the blue points – inside the epsilon tube
In OLS, the errors for all the data points matter and these errors are measured as the vertical distance from the black line to the data point. In SVR, only the red points count. And the error quantity that is used is the absolute value of the vertical distance from the epsilon tube boundary (green line) to the support vector data point (see the vertical distances labeled “s” in the diagram).
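In code, this error rule is easy to state. The sketch below, with made-up numbers, computes the s values and flags which points are support vectors:

```python
import numpy as np

# Points inside the tube contribute nothing; support vectors contribute
# their distance beyond the tube boundary. All values are illustrative.
eps = 2.0
y     = np.array([100.0, 104.0,  97.0, 110.0])   # actual values
y_hat = np.array([101.0,  99.0,  98.0, 103.0])   # values on the black line

abs_err = np.abs(y - y_hat)            # vertical distance to the line
s = np.maximum(0.0, abs_err - eps)     # distance beyond the green line
is_support_vector = s > 0              # the "red points"
print(s)                   # [0. 3. 0. 5.]
print(is_support_vector)   # [False  True False  True]
```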
Putting these two ideas together, the SVR objective function can be written as follows:

$$\min_{a,\,b} \; \sum_{j=1}^{k} b_j^2 \; + \; C \sum_{i} s_i$$
We can hear you thinking that this is quite a bit different from OLS. You can see the sum of the squared coefficients on the left. And you can see the sum of the absolute support vector (s) values on the right. You can also see that we are trying to find the slope parameters (b) that minimize a combination of the two sums, given values for eps and C. Just think of C as a number that determines how important small errors are relative to small parameters.
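Here is a hedged sketch of that objective as a Python function, written to match the wording above. (Textbook treatments often put a factor of 1/2 on the first term, which amounts to rescaling C.)

```python
import numpy as np

def svr_objective(b, a, X, y, eps, C):
    """Sum of squared slopes plus C times the sum of the s values."""
    residuals = y - (a + X @ b)
    s = np.maximum(0.0, np.abs(residuals) - eps)  # support vector distances
    return np.sum(b ** 2) + C * np.sum(s)
```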
In data science terms, eps and C are called hyperparameters. These are values we have to set before we can optimize. So, the first thing we need to do is figure out good values for these two parameters. Good values are ones that work well in out-of-sample prediction or forecasting. The practical way to find them is to withhold observations from estimation and see which combinations of eps and C best predict the withheld values. This is called cross-validation, and to do it we need a model and some data.
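One plausible way to run this search is with scikit-learn's GridSearchCV; the sketch below is an example setup, and the candidate grids and scoring metric are our assumptions, not the exact configuration we used:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "epsilon": [0.5, 1.0, 2.0, 3.5, 5.0],   # candidate eps values
    "C":       [0.5, 1.0, 1.5, 2.0, 5.0],   # candidate C values
}
search = GridSearchCV(
    SVR(kernel="linear"),
    param_grid,
    cv=10,                                          # 10-fold cross-validation
    scoring="neg_mean_absolute_percentage_error",   # pick by out-of-sample MAPE
)
# search.fit(X, y)              # X, y: explanatory variables and daily energy
# print(search.best_params_)    # e.g., {"C": 1.5, "epsilon": 3.5}
```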
To proceed, we used three years of daily energy data for a major utility. Explanatory variables included calendar variables (month, day of week, holidays) and weather variables (temperature, wind speed, cloud cover, including lags and interactions). We first estimated an OLS model, which had very strong estimation statistics (a mean absolute percent error, or MAPE, of 1.42%). With the same specification, we then applied SVR using 10-fold cross-validation and found that the hyperparameter values that worked best were eps = 3.5 and C = 1.5.
The last thing we did was compare the SVR model with these optimized hyperparameters against OLS. This was done by running 1,000 out-of-sample tests, withholding a random subset of 30% of the observations in each test. The MAPE values for the observations included in estimation (Train) and the withheld observations (Test) are shown below.
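For reference, a repeated random holdout comparison like this can be sketched as follows; X and y stand in for the model matrix and the daily energy series, and the details are illustrative rather than our exact test harness:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error as mape
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def holdout_mapes(X, y, n_tests=1000, test_size=0.30):
    """Average Train/Test MAPE for OLS and SVR over repeated random holdouts."""
    results = {"ols_train": [], "ols_test": [], "svr_train": [], "svr_test": []}
    for seed in range(n_tests):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        ols = LinearRegression().fit(X_tr, y_tr)
        svr = SVR(kernel="linear", epsilon=3.5, C=1.5).fit(X_tr, y_tr)
        results["ols_train"].append(mape(y_tr, ols.predict(X_tr)))
        results["ols_test"].append(mape(y_te, ols.predict(X_te)))
        results["svr_train"].append(mape(y_tr, svr.predict(X_tr)))
        results["svr_test"].append(mape(y_te, svr.predict(X_te)))
    return {name: 100 * np.mean(vals) for name, vals in results.items()}  # in %
```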
As you can see, the differences are small. OLS does a little bit better in both the estimation (Train) statistics and the out-of-sample (Test) statistics. But again, the differences are negligible.
We concluded that SVR is an interesting and different method. In the end, however, the estimated models performed about the same as OLS for modeling daily energy data. For both methods, the important thing is to get the model specification right. As we say around the shop, “Feature Engineering is Everything.” Apparently, the estimation method you use once the model is specified is not so important in this case.
This was only a brief summary of what we found, and you probably have a few questions. If you are interested in a deeper dive, check out the new white paper entitled “Q&A on Support Vector Regression” in the forecasting section of the Itron website, or visit the forecasting workshops page to register for our Forecasting Brown Bag events and review the recording on this topic from June 17, 2021. Both provide more juicy details. Thanks for reading. Questions or comments are always welcome (forecasting@itron.com)!