This is the start of a new series that I’m excited about diving into. Back in college one of my degrees included a concentration in Quantitative Analysis. I loved statistics since my first exposure to it in high school and found that I did pretty well at the subject. Statistics has always come naturally for me but found it hard to get a job in the area without at least a Masters degree. I started getting excited about 2-3 years ago when statistics, namely advanced analytics, started becoming a staple in many IT and functional organizations. Data visualization is part of this staple and it’s not just…data art. It’s very much a science that plays well to the artists inside us.
When I teach people Data Visualization, I find they try to jump into visualizations too fast without understanding why or how they work. They want to see pictures, quickly and making their data make sense on the first try. …I hate to break it to you, but it almost never works like that. While the incoming college hires are more data savvy than the ‘old guard’, there is still a fundamental education that needs to occur about the different types of data out there. While this post (and series, at least initially) will not be diving into the different data types, it is important to understand the difference between facts/measures, attributes, dates, and spatial data.
This series’ focus is on explaining the options you have available to you in Oracle Data Visualization. You have considerably more statistical power available to you than you might imagine!
The first stop on the analysis train is Trend. You may think you already understand this concept since we throw the word around as common language – “The stock market trend is going up”, for example. But within DV, you have different Trend options available to you whether it be the Method (Linear, Exponential, or Polynomial), the Degree, or the Confidence Interval (90%, 95%, 99%, off).
Trend can be found on the Advanced Analytics tab to be added to a visualization containing a date (as shown by right-clicking on “Trend” to add it to my visualization.
My data is set is the time of the male winners of the Boston Marathon since 1900 (I’ll add women later, but because of sexism, they were not added until the late 1960s) in minutes. When I plot the data with Trend added, I get the following result:
The arrow shows the automatic trend line that was added to my visualization. I have put a box around the details behind my trend line. By default, the trend line is a Linear line with a 95% confidence interval. But what are the other options? If you open the Method drop down list, you will see options for Linear, Exponential, and Polynomial. But what do each of these mean? Let’s take a look…
Let’s start with our default, the Linear trend line. Not to go into the math behind the line, but (yep, I’m going there) the formula is one that you will recognize from junior high:
Y(t) = a + bt
Or, more commonly,
y = mx + b
y = ȳ – slope * x̄
a = b = intercept
b = m = ( S((x – x̄)/(y – ȳ))) / (x -x̄)2
y = dependent variable (time is required for y to have a value)
t = time; independent variable (will occur with or without y)
So, why is (any of) this important?
A linear trend creates a line. While somewhat helpful when the data points seem to follow a straight line, it’s not often appropriate for most types of data. This is especially true if, especially towards the end of the line, the actual data points are not even in the Confidence Interval band (wait for it…). If it is true that the end of the data points are not in the CI band, then how can you accurately forecast?? This brings me to the 2 other types of trend methods available…
The next on the list is Exponential:
The Exponential method is especially useful if you see sharp growth or decay in your numbers (but cannot be used if there are zero or null data points in your set!). If I change to this method, you will see that my trend line is starting to get a curve:
Again, we are at our default 95% CI. Because we have no sharp incline or decline in our data, we won’t see the real value of this method.
The formula for Exponential Trend is:
Y = Arx
r = 10m where m is the same formula as in Linear for slope
A = 10b where b is the same formula as in Linear for intercept
In our example, if we take the first 10 years and math it all out in Excel, here are the results:
The bottom line is the regression. If we filter for only the first 10 years, we can see our trend line change because there was some volatility in 1909 (which from research I found that it was 100*…YES, ONE HUNDRED DEGREES…on the day of the race!).
Let’s move onto the Polynomial trend line… When I add this Method to my graph, I get a more true trend over time. We also get a new option: Degree. We will get to that…
We can see that more data points fall in the 95% CI band. This is because this method allows for more data fluctuations over a large data set. For financial data, this would be my recommendation. Actually, in general, I opt for a Polynomial trend.
The formula is more complicated…
Yi = b0 + b1ti + … + bptip + ei
Sigh. I could go into more details, but suffice it say that the formula is an extrapolation of the Linear trend to account for variations in all the data points around a 3 dimensional axis (vectors for those who have had Linear Algebra). In fact, if you have had Linear Algebra (sneaked by, myself) you might notice you can create matrices to do the math by using the formula:
bhat= (X’X)-1 X’y
Where b hat = (transpose of the matrix times the original matrix) to the matrix inverse times the transposed matrix times y. Yeah, I’ll spare you.
This is where the Degree gets important…it determines how many iterations of the above formula you will do and to what degree (aha!). A first degree polynomial is called a straight line, or linear regression. A second degree polynomial is call a quadratic regression (or exponential for our purposes). The next degree, 3rd …which is why the Degree option starts at 3… is called a cubic polynomial. The 4th, quartic and so on.
But why would you want to change the Degree? It depends on how many “valleys and peaks” you want in your trend line. You will have n-1 valleys and peaks for the Degree you choose. Since it defaulted to 3, I have 2:
If you have a full year of data, you may want to choose a Degree of 4 to show how each season is represented in your graph. See, it is all making sense! If you’ll permit me to get a bit nuts, I’m going to change the Degree to 7. I will expect 6 valleys and peaks in my data:
Although subtle at the end because the times have, more or less, normalized, I can still pick out my 6 waves:
You’ll never going to settle for the default Trend options again, are you?
Let’s go back to Confidence Intervals… A CI is the probability that a given recorded value will fall within a certain range. The calculation is pretty simple:
CI = x̄ ± z (s / Ön)
If we want to choose the CI for the first 10 years of the Boston Marathon finish times, we will get
n = sample = 10 years
x̄ = mean = 158.316 minutes
s = standard deviation = 9.2311
z-score for 95% = table lookup = 1.96
CI = 158.316 ± 1.96 (9.2311 / Ö10)
CI = 158.316 ± 5.7215
There’s our CI for the first 10 years – 95% sure that the numbers will fall between 152.5945 and 164.0375. You might also recognize this as the margin of error you keep hearing about…
If you want to increase the CI, you will notice the band get larger. You can also turn it off to focus on the trend and not trend and CI.
Hopefully you have a better appreciation for what the Trend Methods and Degrees can do for you. Do NOT just stick with the Trend defaults! You are missing out on key data analysis results!