The post below will help you understand random variables and the
distributions they follow. These concepts, along with hypothesis testing, cover a
good part of core analytics. Feel free to comment.
Understanding Random Experiments and Random Variables
• Certain variables are generated as some activities are carried out. A flight
takes a certain amount of time; a machine, when operated, consumes a certain
amount of power per unit time; a car travels a certain number of miles before
encountering a fault. The activities giving rise to the data are called random
experiments. These are the data-generating mechanisms. The generated data are
often referred to as random variables.
• Technically, a random variable is a mapping that assigns a real number to every
outcome of the activity. We often write this as $X: \Omega \to \mathbb{R}$, where
Ω is the sample space containing all possible outcomes of the random experiment.
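As a small illustration of the mapping idea, the sketch below enumerates a hypothetical sample space (two tosses of a fair coin, which is not an example from this post) and defines X as the number of heads. The names `sample_space`, `X` and `distribution` are made up for this sketch.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical random experiment: toss a fair coin twice.
# The sample space lists every possible outcome, e.g. ('H', 'T').
sample_space = list(product("HT", repeat=2))

def X(outcome):
    """Random variable: maps an outcome to a real number (the number of heads)."""
    return outcome.count("H")

# All outcomes are equally likely here, so P(X = x) is the share of outcomes mapped to x.
counts = Counter(X(outcome) for outcome in sample_space)
distribution = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, prob in distribution.items():
    print(f"P(X = {x}) = {prob}")
```

The set of possible values {0, 1, 2} is known in advance; which one occurs on the next pair of tosses is not.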
Random Variables
• Random variables take values from a predefined set. Thus all possible values
of the random variable are known in advance. However, the exact value that the
variable will take on its next occurrence is unknown.
• We do not know when a machine will fail, what the power consumption of a device
will be in a given period, what the efficiency of a machine will be, how many
defective products will be produced during the next production run…
Concept of Distributions
• It is often assumed
that the process that generates the random variable can be modeled using a
mathematical function. This function enables us to calculate the probability
that the values of a random variable will fall within any given range
• The mathematical
function is defined in terms of the random variable and some unknown but fixed
constants called parameters
• The distribution
(mathematical function) characterizes the pattern of variation of the random
variable
• We may also construct the frequency distribution to visualize the pattern of
variation. When we have a very large number of observations drawn randomly from
a population, the frequency distribution may be a good approximation of the
distribution.
• The frequency
distribution is a non-parametric way of looking at a distribution and it has
its own advantages and disadvantages.
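The claim that a frequency distribution approximates the underlying distribution for large samples can be checked with a short simulation. This is only a sketch under assumed conditions: the experiment (rolling a fair six-sided die) and the helper name `relative_frequencies` are hypothetical and not taken from the post.

```python
import random
from collections import Counter

def relative_frequencies(samples):
    """Return {value: relative frequency} for a list of observations."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in sorted(counts.items())}

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical experiment: roll a fair six-sided die 60,000 times.
rolls = [random.randint(1, 6) for _ in range(60_000)]
freq = relative_frequencies(rolls)

# Compare the observed relative frequency with the model probability 1/6.
for face, f in freq.items():
    print(f"face {face}: observed {f:.4f}, model {1 / 6:.4f}")
```

With fewer rolls the observed frequencies wander further from 1/6, which is one of the disadvantages of relying on the frequency distribution alone.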
Understanding Density and Mass Functions
• Random variables may be discrete or continuous. A random variable is said to be
discrete if it takes values from a set X = {x1, x2, …}, where X is countable (a
set is said to be countable if it is either finite or can be put in a one-to-one
correspondence with the integers).
• Discrete random variables have probability mass functions (pmf), which place
point masses on the individual values.
• Continuous random variables have probability density functions (pdf). At any
point x the pdf gives the height of the curve, and the area under the curve from
x0 to x1 gives the probability that the random variable assumes a value in that
range. In this sense the pdf behaves like a physical density, where the mass can
be computed from density and volume.
Properties of Mass and Density Functions
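• Let $f_X(x)$ be a probability mass function (pmf). Then $f_X(x) \ge 0$ for all
$x \in \mathbb{R}$ and $\sum_i f_X(x_i) = 1$, where the sum runs over all possible
values $x_i$.
• In case the random variable X is continuous, we have $f_X(x) \ge 0$ for all
$x \in \mathbb{R}$, and integrating over the entire real line gives a total area
under the curve of 1: $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
• For a continuous distribution we integrate the pdf over the range x0 to x1 to
get the probability that X falls in that range. For a pmf we take the sum over
the values in the range.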
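The properties above can be verified numerically. The sketch below uses two hypothetical examples that are not from the post: a fair die as the pmf, and the density f(x) = 2x on [0, 1] as the pdf. The `integrate` helper is a plain midpoint-rule approximation written for this illustration.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Hypothetical pmf: a fair six-sided die.
pmf = {x: 1 / 6 for x in range(1, 7)}

# Hypothetical pdf: f(x) = 2x on [0, 1] and 0 elsewhere.
def pdf(x):
    return 2 * x if 0 <= x <= 1 else 0.0

total_mass = sum(pmf.values())           # pmf values sum to 1
total_area = integrate(pdf, 0.0, 1.0)    # pdf integrates to 1 over its support

# Probability of a range: sum for the pmf, integral for the pdf.
p_discrete = sum(p for x, p in pmf.items() if 2 <= x <= 4)  # P(2 <= X <= 4)
p_continuous = integrate(pdf, 0.2, 0.5)                     # P(0.2 <= X <= 0.5)

print(total_mass, total_area, p_discrete, p_continuous)
```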
Cumulative Distribution Function
• The Cumulative Distribution Function (often referred to as the Distribution
Function) is the function $F_X: \mathbb{R} \to [0, 1]$ defined by
$F_X(x) = P(X \le x)$.
• Recall that the ogive (cumulative frequency curve) is the way we usually see
the CDF (often abbreviated DF) in practice.
• When either f (pdf or pmf) or F (CDF) is available, we can compute the
probability of all events concerning the random experiment or, equivalently, the
probability of all subsets of values of the random variable.
• Knowledge of the density, mass or distribution function is therefore a
convenient way to understand the pattern of variation of a random variable from
a quantitative perspective.
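To make the link between a pmf and its CDF concrete, here is a minimal sketch. The pmf used (number of heads in three fair coin tosses) is a hypothetical example, and the names `values`, `pmf` and `cdf` are chosen only for this illustration.

```python
# Hypothetical pmf: number of heads in three tosses of a fair coin.
values = [0, 1, 2, 3]
pmf = [0.125, 0.375, 0.375, 0.125]

def cdf(x):
    """F(x) = P(X <= x): total mass on the support points not exceeding x."""
    return sum(p for v, p in zip(values, pmf) if v <= x)

# Once F is known, the probability of a range follows by subtraction:
# P(1 <= X <= 2) = F(2) - F(0)
p_range = cdf(2) - cdf(0)

for v in values:
    print(f"F({v}) = {cdf(v)}")
print("P(1 <= X <= 2) =", p_range)
```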
1) Some important discrete distributions
1.1 Binomial Distribution
In
many real life situations it is important to determine the number or proportion
of successes/failures when trials are conducted independently of each other.
For example one may count the number of successful bids, the proportion of automobiles
that pass a test in the first go, the
number of program units found to be defect free when tested for the first time
and so on. These random variables can be modeled using the Binomial
distribution.
The Binomial distribution works as follows:
– Let us assume that a trial has been conducted n times. Note that the trials
must be independent of each other.
– Let p be the probability (long-run proportion) of success in a single trial.
Note that p must remain reasonably constant over the period of experimentation.
– Let the random variable X be the number of successes.
– Then the probability mass function is given by
$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n$
– In case we are interested in the proportion of successes x/n rather than the
number of successes, its probability mass function is the same:
$P(X/n = x/n) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n$
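The binomial formula translates directly into code. The sketch below is one possible way to evaluate it with the standard library; the values n = 10 and p = 0.2 are hypothetical and chosen only to have something to compute.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical setting: 10 independent trials, success probability 0.2 each.
n, p = 10, 0.2
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

p_at_most_2 = sum(probs[:3])                       # P(X <= 2)
mean = sum(x * pr for x, pr in enumerate(probs))   # should equal n * p

print(f"P(X <= 2) = {p_at_most_2:.4f}")
print(f"mean = {mean:.4f}, n*p = {n * p:.4f}")
```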
Usages of Binomial Distribution
– With the number of trials n known, the Binomial distribution is a
one-parameter distribution. The probability of success p is the unknown
parameter. The independent trials of the binomial distribution are called
Bernoulli trials.
– We estimate p from data (how?)
– Given n and p, we can compute the probability of any event.
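One common answer to the "how?" is the sample proportion: the number of observed successes divided by the number of trials. The simulation below is a sketch of that idea; `true_p` is a hypothetical value that would be unknown in practice and is used here only to generate data.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

true_p = 0.3    # hypothetical; in a real study this is the unknown parameter
n = 5000        # number of Bernoulli trials observed

# Each trial is 1 (success) with probability true_p, else 0 (failure).
trials = [1 if random.random() < true_p else 0 for _ in range(n)]

# Estimate p by the observed proportion of successes.
p_hat = sum(trials) / n
print(f"estimated p = {p_hat:.3f}")
```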
1.2 Geometric Distribution
• Let the random variable X be the number of Bernoulli trials required to get the
first success, where the trials are conducted independently of each other and
the probability of success remains constant.
• X takes values 1, 2, 3, … (at least one trial is needed).
• $P(X = k) = p (1 - p)^{k - 1}, \quad k \ge 1$
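A short sketch of the geometric pmf follows, counting trials from k = 1 as in the formula. The success probability 0.25 is a hypothetical value used only for illustration.

```python
def geometric_pmf(k, p):
    """P(X = k): first success on trial k, i.e. k - 1 failures then one success."""
    if k < 1:
        return 0.0
    return p * (1 - p) ** (k - 1)

p = 0.25  # hypothetical probability of success on each trial

# Probability that the first success arrives within the first 4 trials.
p_within_4 = sum(geometric_pmf(k, p) for k in range(1, 5))

# Approximate the mean by truncating the infinite sum; it should be close to 1 / p.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 1000))

print(f"P(X <= 4) = {p_within_4:.4f}")
print(f"mean ~ {approx_mean:.4f}, 1/p = {1 / p:.4f}")
```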
1.3 Negative Binomial Distribution
• This is an extension of the geometric distribution.
• In this distribution we try to model the number of trials required to get r
successes. The random variable is denoted as X, where X is the number of trials
required in excess of r to get r successes (that is, the number of failures
before the r-th success).
• $P(X = x) = \frac{(x + r - 1)!}{(r - 1)!\, x!}\, p^r (1 - p)^x, \quad x = 0, 1, \ldots; \; 0 < p < 1$
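The factorial ratio in the negative binomial formula is the binomial coefficient C(x + r - 1, x), which makes the pmf easy to sketch in code. The values r = 3 and p = 0.4 are hypothetical, and the infinite sums are truncated for the check.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """P(X = x): x failures occur before the r-th success."""
    return comb(x + r - 1, x) * p**r * (1 - p) ** x

r, p = 3, 0.4  # hypothetical: 3 successes needed, success probability 0.4

# Truncated sums standing in for the infinite series.
total = sum(negative_binomial_pmf(x, r, p) for x in range(500))
mean = sum(x * negative_binomial_pmf(x, r, p) for x in range(500))

print(f"total probability ~ {total:.6f}")
print(f"mean excess trials ~ {mean:.4f}, r(1-p)/p = {r * (1 - p) / p:.4f}")
```

With r = 1 the formula reduces to p(1 - p)^x, the geometric distribution counted in failures rather than trials.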
Some Examples
- Suppose
in a manufacturing environment defects are rather rare. We may use the
geometric distribution to estimate the number of attempts required to get
the first defective item.
- We
may use the geometric distribution to estimate the number of good products
produced before getting the first defective product
- The
negative binomial distribution may be used to find the number of sales
calls required to close a fixed number of orders
1.4 Poisson Distribution
• In
a real life situation Poisson distribution is applicable whenever a particular
trial is conducted many times, independently of each other and the probability
of a particular event in any one trial is very small.
• Take,
for instance, the number of defects found in a specification document. It may
be assumed that the probability of making a mistake in any one line (or a small
segment) is very small and the errors occur independently of each other.
• Going
by the same logic, number of accidents, number of breakdowns, number of
unreadable messages and similar random variables will follow Poisson
Distribution. (Why?)
• When p is small but n is large so that np remains finite, the Binomial
distribution approaches the Poisson distribution. Note that large n typically
means n > 30.
• Let λ be the average number of occurrences of an event.
• Let X be the random variable. Then
$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$
• Note that the only parameter of a Poisson distribution is λ, the average number
of occurrences of the event.
• Poisson
distribution is typically used for count data – number of breakdowns, number of
accidents, number of defects on a product, number of faults generated during a
production run of a machine, number of customers who come to a store
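The Poisson pmf, and the way the Binomial approaches it for large n and small p, can be sketched as follows. Here `lam` stands for the average number of occurrences, and the values n = 1000 and p = 0.002 are hypothetical, chosen so that np = 2.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical setting: many trials, each with a small chance of the event.
n, p = 1000, 0.002
lam = n * p  # average number of occurrences

for x in range(6):
    print(f"x={x}: binomial {binomial_pmf(x, n, p):.5f}, poisson {poisson_pmf(x, lam):.5f}")

# Largest pointwise gap between the two pmfs over x = 0..10.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11))
```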
Some Examples
• Suppose
in the tool crib of a machine shop, on an average 2 requests for issuing tools
are made in an hour. What is the chance that in a particular hour
– 5
or more requests will come
– No
request will come
• Suppose you are in charge of manpower planning for a team responsible for
maintaining software. Consider a simplistic environment where only one type of
request comes in. From past observation you have seen that on average 3 requests
come in a day and it usually takes a full day to service a request. Considering
this, you have allocated three persons.
– What is the chance that at least one person will remain unutilized on any
given day?
– What is the chance that you will not be able to service customers, assuming
that customer service will be adversely impacted in case 4 or more requests
arrive on any given day?
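Both examples can be worked through with the Poisson pmf. The sketch below assumes that requests follow a Poisson distribution with the stated averages, and, for the manpower question, that each person handles exactly one request per day with no backlog carried over; those modelling assumptions are mine, not stated in the post.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x), summing the pmf from 0 to x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# Tool crib: on average 2 requests per hour.
p_five_or_more = 1 - poisson_cdf(4, 2)   # P(X >= 5)
p_no_request = poisson_pmf(0, 2)         # P(X = 0)

# Maintenance team: on average 3 requests per day, three persons allocated.
p_someone_idle = poisson_cdf(2, 3)       # at most 2 requests, so someone is free
p_overloaded = 1 - poisson_cdf(3, 3)     # 4 or more requests arrive

print(f"tool crib, 5 or more requests: {p_five_or_more:.4f}")
print(f"tool crib, no request:         {p_no_request:.4f}")
print(f"team, at least one idle:       {p_someone_idle:.4f}")
print(f"team, 4 or more requests:      {p_overloaded:.4f}")
```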
1.5 Discrete Uniform Distribution
• This
is often referred to as the equally likely outcome distribution and is
equivalent to assuming minimum knowledge about the distribution
• A random variable X is said to follow a discrete uniform distribution in case
it takes values 1, 2, …, k such that P(X = j) = 1 / k for all j = 1, 2, …, k
2) Continuous distributions
2.1 Normal Distribution
• One
of the most common distributions for continuous random variables is the Normal
distribution.
• When a random variable is expected to be roughly symmetrical around its mean
and tends to cluster around the mean value, it can often be modeled by a Normal
distribution.
• Many
continuous variables like length, weight or other dimensions / characteristics
of a manufactured product; the difference between planned and actual effort or
time; amount of material consumed during a production run and so on are
expected to follow normal distribution
PDF of Normal Distribution
• Let X be a random variable which follows a Normal distribution.
• Let μ be the mean value of X.
• Let σ be the standard deviation of X.
• Then the probability density function is given by
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$
• The two parameters of the Normal distribution are μ and σ, where μ and σ²
represent the mean and variance of the random variable.
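The normal density can be evaluated directly from its formula, writing `mu` for the mean and `sigma` for the standard deviation. The sketch below also includes the normal CDF expressed through the error function, a standard identity that is not given in the post; the mean 50 and standard deviation 4 are hypothetical values.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x), written with the error function from the standard library."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 50.0, 4.0  # hypothetical mean and standard deviation

# The density peaks at the mean and is symmetric around it.
peak = normal_pdf(mu, mu, sigma)

# Probability of falling within one standard deviation of the mean.
p_within_one_sigma = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)

print(f"density at the mean: {peak:.5f}")
print(f"P(mu - sigma <= X <= mu + sigma) = {p_within_one_sigma:.4f}")
```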
Importance of Normal Distribution
• Two main reasons for the importance of the Normal distribution are:
– The tendency of sums or averages of independently drawn random observations,
even from markedly non-normal distributions, to closely approximate Normal
distributions (see the example that follows), and
– The robustness, or insensitivity, of many commonly used statistical procedures
to departures from theoretical normality.
Convergence to Normal Distribution – An Example
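A small simulation can illustrate this convergence. The sketch below repeatedly averages draws from an exponential distribution, which is strongly skewed, and checks that the averages behave roughly like a Normal variable. The sample size, number of repetitions and rate are hypothetical choices made for this illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

sample_size = 50    # observations averaged in each repetition
repetitions = 4000  # number of averages collected

# Each average is taken over draws from a skewed exponential distribution (mean 1).
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(repetitions)
]

center = statistics.fmean(means)   # expected near 1
spread = statistics.stdev(means)   # expected near 1 / sqrt(50), about 0.141

# For a Normal distribution roughly 68% of values lie within one standard deviation.
share_within_one_sd = sum(abs(m - center) <= spread for m in means) / repetitions

print(f"center {center:.3f}, spread {spread:.3f}, share within 1 sd {share_within_one_sd:.3f}")
```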
2.2 Uniform Distribution
• Let X be a random variable that takes values between two real numbers a and b
(b > a) with the pdf $f(x) = 1 / (b - a)$ whenever $a \le x \le b$ and 0 otherwise
• It
is important to note that in a uniform distribution the maximum and minimum
values are fixed and all values between the minimum and maximum occur with
equal likelihood
Usage of Uniform Distribution
• This distribution is very important from the perspective of random number
generation. Computer-generated random numbers are drawn from a uniform
distribution. In general these random numbers follow the U(0,1) distribution.
• The distribution function of any continuous random variable, evaluated at the
variable itself, follows a uniform distribution; i.e. if F is the distribution
function of a continuous random variable and x is any random observation of it,
then y = F(x) ~ U(0,1). This is a very important property and we will see its
usage later.
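The y = F(x) ~ U(0,1) property can be checked by simulation, and running it in reverse gives a way to generate random numbers from U(0,1) draws. The sketch below assumes an exponential variable with distribution function F(x) = 1 - exp(-λx) and a hypothetical rate of 2; the exponential distribution is used purely as a convenient example.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable
lam = 2.0       # hypothetical rate of the exponential distribution
n = 20_000

# Forward direction: apply F to exponential draws; the results should look U(0,1).
xs = [random.expovariate(lam) for _ in range(n)]
ys = [1 - math.exp(-lam * x) for x in xs]
y_mean = statistics.fmean(ys)          # U(0,1) has mean 1/2
y_variance = statistics.variance(ys)   # U(0,1) has variance 1/12

# Reverse direction: apply the inverse of F to U(0,1) draws to get exponential values.
us = [random.random() for _ in range(n)]
generated = [-math.log(1 - u) / lam for u in us]
generated_mean = statistics.fmean(generated)  # exponential mean is 1 / lam

print(f"F(x): mean {y_mean:.3f}, variance {y_variance:.4f}")
print(f"generated exponential mean {generated_mean:.3f}")
```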
2.3 Exponential Distribution
• This
distribution is widely used to describe events recurring at random points in
time. Typical examples include time between arrival of customers at a service
booth; time between failures of a machine and so on.
• The exponential and Poisson distributions are related. In case the times
between successive occurrences of an event are independent and follow an
exponential distribution, then the number of occurrences in a given period
follows a Poisson distribution.
• The probability density function of the exponential distribution is given by
$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0, \; \lambda > 0$
• The mean of the exponential distribution is 1/λ
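The relationship between exponential gaps and Poisson counts can be illustrated with a simulation. In this sketch the rate of 3 events per unit period and the helper name `count_events_in_unit_period` are hypothetical; the check relies on the fact that a Poisson variable has equal mean and variance.

```python
import random
import statistics

random.seed(11)  # fixed seed so the run is repeatable

def count_events_in_unit_period(rate):
    """Count events in one unit of time when gaps between events are exponential."""
    elapsed = random.expovariate(rate)  # time of the first event
    count = 0
    while elapsed <= 1.0:
        count += 1
        elapsed += random.expovariate(rate)  # add the gap to the next event
    return count

rate = 3.0      # hypothetical average number of events per period
periods = 5000  # number of simulated periods

counts = [count_events_in_unit_period(rate) for _ in range(periods)]

# For Poisson counts, both the mean and the variance should be close to the rate.
count_mean = statistics.fmean(counts)
count_variance = statistics.variance(counts)
print(f"mean {count_mean:.3f}, variance {count_variance:.3f}, rate {rate}")
```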
2.4 Weibull Distribution
• This
distribution describes data resulting from life and fatigue tests. It is often
used to describe failure times in reliability studies as well as breaking
strength of materials. Weibull distributions are also used to represent various
physical quantities like wind speed
• Weibull pdf: $f(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha - 1} \exp\left[-(x / \beta)^{\alpha}\right]$,
where x ≥ 0, α > 0, β > 0
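As a rough numerical check of the Weibull pdf, the sketch below integrates it with a simple midpoint rule and compares the resulting mean with the closed form βΓ(1 + 1/α) listed in the summary table. The parameter values α = 1.5 and β = 2 are hypothetical, and the integration range is truncated at 20β, beyond which the density is negligible for these values.

```python
from math import exp, gamma

def weibull_pdf(x, alpha, beta):
    """Weibull density with shape alpha and scale beta.

    Sketch only: when alpha < 1 the density is unbounded at x = 0, so call it with x > 0.
    """
    if x < 0:
        return 0.0
    return (alpha / beta**alpha) * x ** (alpha - 1) * exp(-((x / beta) ** alpha))

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

alpha, beta = 1.5, 2.0  # hypothetical shape and scale
upper = 20 * beta       # truncation point for the numerical integrals

area = integrate(lambda x: weibull_pdf(x, alpha, beta), 0.0, upper)
numeric_mean = integrate(lambda x: x * weibull_pdf(x, alpha, beta), 0.0, upper)
formula_mean = beta * gamma(1 + 1 / alpha)

print(f"area under pdf ~ {area:.4f}")
print(f"numeric mean ~ {numeric_mean:.4f}, formula mean = {formula_mean:.4f}")
```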
Summary of the Distributions
| Name | Parameters | Mean | Variance |
|---|---|---|---|
| Binomial | p (n known) | $np$ | $np(1 - p)$ |
| Poisson | λ | $\lambda$ | $\lambda$ |
| Geometric | p | $1 / p$ | $(1 - p) / p^2$ |
| Negative Binomial | p (r fixed) | $r(1 - p) / p$ | $r(1 - p) / p^2$ |
| Discrete Uniform | k | $(k + 1) / 2$ | $(k - 1)(k + 1) / 12$ |
| Uniform | a, b | $(a + b) / 2$ | $(b - a)^2 / 12$ |
| Normal | μ, σ² | $\mu$ | $\sigma^2$ |
| Exponential | λ | $1 / \lambda$ | $1 / \lambda^2$ |
| Weibull | α, β | $\beta\,\Gamma(1 + 1/\alpha)$ | $\beta^2 \{\Gamma(1 + 2/\alpha) - [\Gamma(1 + 1/\alpha)]^2\}$ |
Three Usages of Distributions
• Distributions of random variables may be used from three perspectives, namely
– To find the probability of any event
– To compare two or more distributions, or to compare the effectiveness of some
actions
– To simulate real-life scenarios using the distributional models, in order to
develop an understanding of the phenomenon