Dive into Distributions
In this post, let’s dive into distributions.
Motivation
When studying machine learning, one inevitably encounters many different distributions. I learned about the normal, binomial, Bernoulli, and χ² distributions in high school AP Statistics, but I recently came across some unfamiliar ones. After getting tired of looking up each distribution individually every time I ran into it, I decided to create my own distributions reference sheet based on what I studied.
This blog post was originally intended to cover seven different types of distributions. Because of the sheer length of the intended post, however, I chose to first deal with the three that I saw most frequently but never grasped thoroughly: the Bernoulli, Poisson, and gamma distributions. For the latter two, I took notes on deriving the mean and variance using the moment generating function. The following contents are my own explanations of what I learned from one of MIT OpenCourseWare's courses.
Bernoulli Trials and the Bernoulli Distribution
A Bernoulli trial is a trial in which the only outcomes are success or failure. When Bernoulli trials are repeated, the probability of success stays the same across trials. When we let X be the random variable for a Bernoulli trial, we usually denote success by X = 1 and failure by X = 0. If we define the probabilities of failure and success to be P(X = 0) = 1 − p and P(X = 1) = p, we can combine those two expressions into a single form:
p(x) = p^x · (1 − p)^(1−x),  for x = 0, 1

In this case, it is easy to see that p(0) = 1 − p and p(1) = p. Therefore, the expected value E(X) is

E(X) = 0 · (1 − p) + 1 · p = p

Not surprisingly, X, the random variable for a Bernoulli trial, follows a Bernoulli distribution.
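As a quick sanity check, here is a minimal simulation sketch (my own addition, with an arbitrary p = 0.3) that compares the sample mean against p, and the sample variance against the p(1 − p) derived just below:

import numpy as np

p = 0.3
rng = np.random.default_rng(0)
# A Bernoulli(p) variable is a binomial variable with a single trial (n = 1).
samples = rng.binomial(1, p, size=100_000)

print("sample mean:    ", samples.mean(), " vs p           =", p)
print("sample variance:", samples.var(), " vs p * (1 - p) =", p * (1 - p))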
For the variance of X, we use the variance formula:
Var(X) = σ² = Σ_x (x − μ)² · P(X = x)

Since μ = p,

σ² = Var(X) = P(X = 1) · (1 − p)² + P(X = 0) · (0 − p)²
            = p · (1 − p)² + (1 − p) · p²
            = p(1 − p)

∴ σ = √(p(1 − p))

Poisson Distribution
Before we get into what a Poisson distribution is, let's first imagine a simple scenario. Suppose there is a cafe called "Cafe A," which customers visit between 7 PM and 9 PM every night. One observes that the cafe's visitors follow these rules:
- Each visiting customer does not affect the probability of the other customers visiting.
- Customers do not visit sporadically; that is, one can reasonably assume that fifteen customers would visit in half an hour if thirty visit in a full hour.
- No two customers visit within small time intervals, such as seconds, of one another.
Now suppose that the random variable X describes the number of visitors per night, and that P(X = n) denotes the probability of n people visiting between 7 and 9 PM. Let λ denote the average number of visitors per night as the population parameter, and suppose that the PDF of X is as follows:

P(X = x) = λ^x · e^(−λ) / x!
In such a case, X is called the random variable of a Poisson distribution. In fact, such a random variable retains three key characteristics:
- Independence
- Consistency
- Non-clusteredness
The first condition, independence, means that events in one time interval or space are independent from, and do not affect the probabilities of, events in other time intervals and spaces. The second condition, consistency, implies that the rate of the events’ occurrences is consistent. Finally, the third condition, non-clusteredness, signifies that the events must not be clustered together in a short timeframe.
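To make the cafe example concrete, a short sketch (my own illustration; the numbers are made up) computes the probability of seeing exactly k customers on a night whose average is λ:

import math

lam, k_visitors = 30, 25  # hypothetical nightly average and observed count
prob = lam**k_visitors * math.exp(-lam) / math.factorial(k_visitors)
print(f"P(X = {k_visitors}) = {prob:.4f}")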
The coffee shop example above is a rather oversimplified real-life setting where a Poisson distribution might be seen. In reality, a Poisson distribution is essentially a binomial distribution under special circumstances: the number of trials must approach infinity and the probability of success must approach zero, while their product np stays fixed at λ. In other words, if a binomial distribution describes the probability of observing k successes among n trials with p as the probability of success,
p(k) = nCk · p^k · (1 − p)^(n−k),  if n → ∞ and p → 0,

then P(X = k) = λ^k · e^(−λ) / k!. The proof is shown below.
We start with the basic binomial distribution formula:
p(x) = nCx · p^x · (1 − p)^(n−x)

In the case of Poisson distributions, λ = np, and substituting λ/n in place of p yields:

p(x) = nCx · (λ/n)^x · (1 − λ/n)^(n−x)

Then, after manipulating the form in three steps,

p(x) = [n! / ((n − x)! · x!)] · (λ/n)^x · (1 − λ/n)^(n−x)
p(x) = [n(n−1)(n−2)···(n−x+1) / x!] · (λ^x / n^x) · (1 − λ/n)^(n−x)
p(x) = [n(n−1)(n−2)···(n−x+1) / n^x] · (λ^x / x!) · (1 − λ/n)^(n−x)

As described above, take the limit of p(x) as n approaches infinity:

lim_{n→∞} p(x) = lim_{n→∞} [n(n−1)(n−2)···(n−x+1) / n^x] · (λ^x / x!) · (1 − λ/n)^n · (1 − λ/n)^(−x)

Since the limits of the first and last factors are both one,

lim_{n→∞} p(x) = lim_{n→∞} (λ^x / x!) · (1 − λ/n)^n
lim_{n→∞} p(x) = (λ^x / x!) · lim_{n→∞} {(1 − λ/n)^(−n/λ)}^(−λ) = λ^x · e^(−λ) / x!  ∎
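To see this convergence numerically, here is a small sketch (not from the original post; λ = 4 and k = 3 are arbitrary choices) that compares the binomial PMF with p = λ/n against the Poisson PMF as n grows:

import math

lam, k = 4, 3
poisson = lam**k * math.exp(-lam) / math.factorial(k)

for n in (10, 100, 1_000, 10_000):
    p = lam / n
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"n = {n:>6}: binomial = {binom:.6f}, poisson = {poisson:.6f}")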
Using the moment generating function, we can also find the mean and the standard deviation of a Poisson distribution.

M(t) = E[e^(tX)] = Σ_{x=0}^{∞} e^(tx) · p(x)
     = Σ_{x=0}^{∞} e^(tx) · λ^x · e^(−λ) / x!
     = e^(−λ) · Σ_{x=0}^{∞} (e^t · λ)^x / x!

Given the Maclaurin series of e^x,

e^x = Σ_{n=0}^{∞} x^n / n!

e^(−λ) · Σ_{x=0}^{∞} (e^t · λ)^x / x! = e^(−λ) · e^(λ·e^t) = e^(λ(e^t − 1))

∴ M(t) = e^(λ(e^t − 1))

The k-th moment of random variable X is given by:

μ_k = E(X^k)

And the mean and variance expressed in terms of moments are:

μ = E(X),  σ² = E(X²) − [E(X)]²

To see why M(t) = E[e^(tX)] generates these moments, expand e^(tX) with its Maclaurin series and take the expectation term by term:

M(t) = E[e^(tX)] = E[ Σ_{k=0}^{∞} (tX)^k / k! ]
     = E[ 1 + tX + (tX)²/2! + ··· + (tX)^k/k! + ··· ]
     = 1 + t·E(X) + t²·E(X²)/2! + ··· + t^k·E(X^k)/k! + ···
     = 1 + μ₁·t + μ₂·t²/2! + ··· + μ_k·t^k/k! + ···

Therefore, the k-th derivative of M(t) at t = 0 is the k-th moment:

M^(k)(0) = E[X^k]

Applying this conclusion,

μ = E(X),  σ² = E(X²) − [E(X)]²

μ = M′(t) = [e^(λ(e^t − 1))]′ = e^(λ(e^t − 1)) · λe^t,  (with t = 0)

∴ μ = λ

σ² = M″(t) − [M′(t)]² = [e^(λ(e^t − 1))]″ − λ²,  (with t = 0)

With some basic algebra,
σ² = λ,  σ = √λ

The script below plots the Poisson PMF for a few values of λ:

import math

import matplotlib.pyplot as plt
import numpy as np

# Support of the plot: k = 0, 1, ..., 40 (kept as floats to avoid integer overflow).
k = np.arange(0, 41, dtype=float)
factorials = np.array([math.factorial(int(i)) for i in k], dtype=float)

# Poisson PMF: P(X = k) = lambda^k * e^(-lambda) / k!
for lam in (4, 10, 20):
    y = lam**k * np.exp(-lam) / factorials
    plt.plot(k, y, label=f"lambda = {lam}")

plt.xlabel("k")
plt.ylabel("P(X = k)")
plt.title("Poisson Distributions")
plt.legend()
plt.show()
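To double-check the derived mean and variance, a quick simulation sketch (my own addition; λ = 4 is arbitrary) compares sample statistics against λ:

import numpy as np

lam = 4
rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=1_000_000)

print("sample mean:    ", samples.mean(), " vs lambda =", lam)
print("sample variance:", samples.var(), " vs lambda =", lam)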

The Γ Distribution
The definition of the gamma function is as follows:
Definition
Γ(x) = ∫₀^∞ t^(x−1) · e^(−t) dt

With integration by parts, the second property listed below can be proven in less than three lines. By substituting one in place of x, the third property can also be shown easily. With the second and third properties, the first one is then trivial.
Properties
Γ(n) = (n − 1)!
Γ(n + 1) = n · Γ(n)
Γ(1) = 1
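For completeness, here is that integration-by-parts calculation for the second property (my own addition, not in the original notes):

Γ(n + 1) = ∫₀^∞ t^n · e^(−t) dt
         = [−t^n · e^(−t)]₀^∞ + n · ∫₀^∞ t^(n−1) · e^(−t) dt
         = 0 + n · Γ(n)
         = n · Γ(n)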
Gamma distribution PDF

Replacing x with α and t with x in the definition above,
Γ(α) = ∫₀^∞ x^(α−1) · e^(−x) dx

1 = (1/Γ(α)) · ∫₀^∞ x^(α−1) · e^(−x) dx

Substituting x = β·y,

∫₀^∞ x^(α−1) · e^(−x) dx = ∫₀^∞ (β·y)^(α−1) · e^(−βy) · β dy

1 = ∫₀^∞ (1/Γ(α)) · x^(α−1) · e^(−x) dx = ∫₀^∞ (β^α / Γ(α)) · y^(α−1) · e^(−βy) dy

Renaming y back to x, this gives the gamma distribution PDF:

f(x | α, β) = (β^α / Γ(α)) · x^(α−1) · e^(−βx),  (x ≥ 0)
f(x | α, β) = 0,  (x < 0)
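By analogy with the Poisson plot above, here is a short sketch (my own addition; the (α, β) pairs are arbitrary) that plots this PDF for a few parameter choices:

import math

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.01, 20, 500)

# Gamma PDF with shape alpha and rate beta: beta^alpha / Gamma(alpha) * x^(alpha - 1) * e^(-beta * x)
for alpha, beta in [(1, 0.5), (2, 0.5), (5, 1.0)]:
    y = beta**alpha / math.gamma(alpha) * x**(alpha - 1) * np.exp(-beta * x)
    plt.plot(x, y, label=f"alpha = {alpha}, beta = {beta}")

plt.title("Gamma Distributions")
plt.legend()
plt.show()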
Mean, Variance of Gamma Distribution

μ = α/β,  σ² = α/β²

Proof:
We start with the moment generating function:
M(t) = ∫₀^∞ (β^α / Γ(α)) · x^(α−1) · e^((t−β)x) dx

Substituting −βy = (t − β)x, that is, x = βy/(β − t),

M(t) = ∫₀^∞ (β^α / Γ(α)) · (βy/(β − t))^(α−1) · e^(−βy) · (β/(β − t)) dy
     = (β^α / Γ(α)) · ∫₀^∞ (β^α · y^(α−1) / (β − t)^α) · e^(−βy) dy
     = (β^α / (β − t)^α) · (1/Γ(α)) · ∫₀^∞ β^α · y^(α−1) · e^(−βy) dy

and the last factor, (1/Γ(α)) · ∫₀^∞ β^α · y^(α−1) · e^(−βy) dy, equals one. Since

f(x | α, β) := (β^α / Γ(α)) · x^(α−1) · e^(−βx),  (x ≥ 0),

∫₀^∞ f(x | α, β) dx = ∫₀^∞ (β^α / Γ(α)) · x^(α−1) · e^(−βx) dx = 1

∴ M(t) = β^α / (β − t)^α

The rest of the steps then become trivial.

M′(t) = α · β^α / (β − t)^(α+1) = α/β,  (with t = 0)
M″(t) = α · (α + 1) · β^α / (β − t)^(α+2) = α(α + 1)/β²,  (with t = 0)

∴ μ = α/β,  σ² = M″(0) − [M′(0)]² = α(α + 1)/β² − α²/β² = α/β²
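As a final numerical sanity check (my own addition; the α and β values are arbitrary), one can draw gamma samples with NumPy and compare the sample mean and variance against α/β and α/β². Note that NumPy's gamma sampler takes a shape parameter and a scale parameter, where the scale equals 1/β under the rate parameterization used here:

import numpy as np

alpha, beta = 5.0, 2.0
rng = np.random.default_rng(0)
# Arguments: shape alpha, scale 1/beta.
samples = rng.gamma(alpha, 1.0 / beta, size=1_000_000)

print("sample mean:    ", samples.mean(), " vs alpha / beta    =", alpha / beta)
print("sample variance:", samples.var(), " vs alpha / beta**2 =", alpha / beta**2)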
Conclusion

Distributions, and the statistical inferences based on them, are key to parametric statistics. In this post, I mainly went over the Bernoulli, Poisson, and gamma distributions, although my original intention was to cover several others. I will likely create a separate post for the binomial, multinomial, χ², and β distributions. Although this blog post took longer to create than I had anticipated, I learned a lot from the experience. Thanks for reading.