Table of Contents
What is Q-Q Plots?
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential. Please note that it’s just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at-a-glance if our assumption is plausible, and if not, how the assumption is violated and what data points contribute to the violation.
To give a concrete example, assume the actual data values have a mean of 10 and a standard deviation of 3. Then, assuming a normal distribution, we would expect a data point ranked at the 50th percentile to lie at position 10 (the mean), a data point at the 84th percentile to lie at position 13 (one standard deviation above from the mean).
The gist of Q-Q Plots
For $n$ i.i.d observations ${y_i}$, let us define the order statistics from ${y_i}$ such that
\[y_{\{1\}} \leq y_{\{2\}} \leq \cdots \leq y_{\{n\}}\]Let $q_i$ be the $i/(n+1)$ quantile of some theoretical distribution, for $i = 1, 2, \cdots, n$.
When ${y_i}$ come from the same theoretical distirbution, the pair of the order statistics and $q_i$, i.e.,
\[(q_1, y_{\{1\}}), (q_2, y_{\{2\}}), \cdots, (q_n, y_{\{n\}})\]should be approximately on the straight line, more closely so when $n$ is large.
Make Q-Q Plots with Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
N = 1000
mu = 0
sd = 1
np.random.seed(42)
## GDP
x = np.random.normal(mu, sd, N)
x_sorted = sorted(x)
x_theoretical = norm.ppf((np.arange(1, N+1))/(N+1))
abline_pair = [min(x_theoretical), max(x_theoretical)]
plt.scatter(x_theoretical, x_sorted)
plt.plot(abline_pair ,abline_pair , color='r')
Scipy: probplot
- Note that the red line is not “45 degree” line, but just the fitting line
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import numpy as np
from scipy.stats import norm
from scipy.stats import probplot
N = 1000
mu = 0
sd = 1
np.random.seed(42)
## GDP
x = np.random.normal(mu, sd, N)
## scipy plot
fig, ax = plt.subplots()
res = probplot(x, dist=norm, plot=ax)
Filliben’s estimate 1975
When you look closely at scipy.stats.probplot
, you will notice that the module is using Filliben’s Estimate.
The docstring says the followings:
1
2
3
4
5
6
7
8
9
10
11
The formula used for the theoretical quantiles (horizontal axis of the
probability plot) is Filliben's estimate::
quantiles = dist.ppf(val), for
0.5**(1/n), for i = n
val = (i - 0.3175) / (n + 0.365), for i = 2, ..., n-1
1 - 0.5**(1/n), for i = 1
where ``i`` indicates the i-th ordered value and ``n`` is the total number
of values.
The above magic numbers, 0.3175, 0.365, 0.5**(1/n)
, are based on Filliben’s simulation for calculating the median of the order statistic.
The simple intuition is that when you want to estimate the parameters of the uniform distibution, you have to correct the min and max of the sample for getting the unbiased estimators (even though the simple mim and max estimators are consistent ones).
Detail: Q-Q Plots with Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
import matplotlib.pyplot as plt
import numpy as np
def _add_axis_labels_title(plot, xlabel, ylabel, title):
"""Helper function to add axes labels and a title to stats plots."""
try:
if hasattr(plot, 'set_title'):
# Matplotlib Axes instance or something that looks like it
plot.set_title(title)
plot.set_xlabel(xlabel)
plot.set_ylabel(ylabel)
else:
# matplotlib.pyplot module
plot.title(title)
plot.xlabel(xlabel)
plot.ylabel(ylabel)
except Exception:
# Not an MPL object or something that looks (enough) like it.
# Don't crash on adding labels or title
pass
def _add_x_and_y_lim(plot, x_range, y_range):
"""Helper function to add axes labels and a title to stats plots."""
try:
if hasattr(plot, 'set_title'):
# Matplotlib Axes instance or something that looks like it
plot.set_xlim(x_range)
plot.set_ylim(y_range)
else:
# matplotlib.pyplot module
plot.xlim(x_range)
plot.ylim(y_range)
except Exception:
# Not an MPL object or something that looks (enough) like it.
# Don't crash on adding labels or title
pass
def qqplot(x, y=[], sparams=(), dist='norm', filliben=True, percentile=False, plot=None):
x = np.asarray(x)
y = np.asarray(y)
x_size = x.size
if x_size < 20:
raise ValueError("x need at least 20 data-points.")
if filliben:
v = np.empty(x_size, dtype=np.float64)
v[-1] = 0.5**(1.0 / x_size)
v[0] = 1 - v[-1]
i = np.arange(2, x_size)
v[1:-1] = (i - 0.3175) / (x_size + 0.365)
else:
v = (np.arange(1, x_size+1))/(x_size+1)
if y.size == 0:
try:
from scipy.stats import distributions as scipy_dist
dist = getattr(scipy_dist, dist)
if sparams is None:
sparams = ()
if np.isscalar(sparams):
sparams = (sparams,)
if not isinstance(sparams, tuple):
sparams = tuple(sparams)
## get theoretical values
if percentile:
y_cdf = dist.cdf
osm = v
else:
osm = dist.ppf(v, *sparams)
except AttributeError as e:
raise ValueError("%s is not a valid distribution name" % dist) from e
else:
if y.size < 20:
raise ValueError("y need at least 20 data-points.")
if percentile:
from statsmodels.distributions.empirical_distribution import ECDF
y_cdf = ECDF(y)
osm = v
else:
osm = np.quantile(y, v)
if percentile:
osr = y_cdf(sorted(x))
else:
osr = sorted(x)
if plot is not None:
plot.plot(osm, osr, 'bo')
else:
plot = plt
plot.plot(osm, osr, 'bo')
if percentile:
_add_axis_labels_title(plot, xlabel='Theoretical percentile',
ylabel='Ordered Values',
title='Percentile-Percentile Plot')
_add_x_and_y_lim(plot, x_range=[-0.05,1.05], y_range=[-0.05,1.05])
plot.plot([0, 1], [0, 1], 'k--')
else:
_add_axis_labels_title(plot, xlabel='Theoretical quantiles values',
ylabel='Ordered Values',
title='Quantile-Quantile Plot')
_range = min(osm), max(osm)
plot.plot(_range, _range, 'k--')
#return osm, osr
Appendix: Percentile vs quantile vs quartile
min | 0 quartile | 0 quantile | 0 percentile |
1 quartile | 0.25 quantile | 25 percentile | |
median | 2 quartile | .5 quantile | 50 percentile |
3 quartile | .75 quantile | 75 percentile | |
max | 4 quartile | 1 quantile | 100 percentile |
References
scipy
Q-Q Plot
- Research Data Services + Sciences, University of Virginia Library > Understanding Q-Q Plots
- Filliben’s estimate 1975
- The Probability Plot Correlation Coefficient Test for the Normal,Lognormal, and Gumbel Distributional Hypothese
(注意:GitHub Accountが必要となります)