This post will help you understand the concept of random variables and the distributions they follow. These concepts, along with hypothesis testing, cover a good amount of core analytics. Feel free to comment.

Understanding Random Experiments and Random Variables

• Certain variables are generated as some activities are carried out. A flight takes a certain amount of time; a machine, when operated, consumes a certain amount of power per unit time; a car travels a certain number of miles before encountering a fault. The activities giving rise to the data are called random experiments. These are the data generating mechanisms. The generated data are often referred to as random variables.

• Technically, a random variable is a mapping that assigns a real number to every outcome of the activity. We often write this as X: Ω → R, where Ω is the sample space containing all possible outcomes of the random experiment
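As a small illustration of the mapping idea, the sketch below uses a made-up experiment (two tosses of a fair coin, which is not an example from this post) and a hypothetical random variable X that counts the heads. The names omega, X and pmf are chosen here only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space of the assumed experiment: two tosses of a fair coin.
omega = list(product("HT", repeat=2))  # HH, HT, TH, TT as tuples

def X(outcome):
    """Random variable: assigns to each outcome its number of heads."""
    return outcome.count("H")

# All four outcomes are equally likely, so each contributes 1/4 to P(X = x).
pmf = {}
for outcome in omega:
    pmf[X(outcome)] = pmf.get(X(outcome), Fraction(0)) + Fraction(1, len(omega))

for value in sorted(pmf):
    print(f"P(X = {value}) = {pmf[value]}")
```

The function X is the mapping from outcomes to real numbers; the probabilities of its values are inherited from the probabilities of the outcomes.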

Random Variables

• Random variables take values from a predefined set. Thus all possible values of the random variable are known in advance. However, the exact value that will occur in the next occurrence is unknown.

• We do not know when a machine will fail, what the power consumption of a device will be in a given period, what the efficiency of a machine will be, how many defective products will be produced during the next production run…

Concept of Distributions

• It is often assumed that the process that generates the random variable can be modeled using a mathematical function. This function enables us to calculate the probability that the values of a random variable will fall within any given range

• The mathematical function is defined in terms of the random variable and some unknown but fixed constants called parameters

• The distribution (mathematical function) characterizes the pattern of variation of the random variable

• We may construct the frequency distribution as well to visualize the pattern of variation. When we have a very large number of observations drawn randomly from a population, the frequency distribution may be a good approximation of the distribution.

• The frequency distribution is a non-parametric way of looking at a distribution and it has its own advantages and disadvantages.
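To see how a frequency distribution approximates the underlying distribution, here is a minimal simulation sketch. It assumes a fair six-sided die as the data generating mechanism (an illustrative choice, not taken from the post) and a fixed seed so the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(42)   # fixed seed, chosen arbitrarily
n = 100_000               # number of simulated rolls

rolls = [rng.randint(1, 6) for _ in range(n)]
counts = Counter(rolls)

# Relative frequency of each face; the true probability of each face is 1/6.
rel_freq = {face: counts[face] / n for face in range(1, 7)}

for face, freq in rel_freq.items():
    print(f"face {face}: relative frequency {freq:.4f}, probability {1/6:.4f}")
```

With a small n the relative frequencies can be far from 1/6; they settle close to it only as the number of observations grows.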

Understanding Density and Mass Functions

• Random variables may be discrete or continuous. A random variable is said to be discrete if it takes values from a set X = {x₁, x₂,…} , where X is countable (a set is said to be countable if either it is finite or can be put in a one-to-one correspondence with the integers)

• Discrete random variables have probability mass functions (pmf), which place point masses of probability on the individual values

• Continuous random variables have probability density functions (pdf). At any point x the pdf gives the height of the curve, and the area under the curve from x₀ to x₁ gives the probability that the random variable assumes a value in that range. In this sense the pdf behaves like a density, where the mass can be computed from density and volume

Properties of Mass and Density Functions

• Let f_X(x) be a probability mass function (pmf). Then f_X(x) ≥ 0 for all x ∈ R and Σ f_X(x_i) = 1

• In case the random variable X is continuous, we have f_X(x) ≥ 0 for all x ∈ R, and integrating f_X over the entire real line gives a total area under the curve equal to 1.

• For a continuous distribution we integrate the pdf over the range x₀ to x₁ to get the probability that X falls in that range. For a pmf we take the sum of the point masses in the range.

Cumulative Distribution Function

• The Cumulative Distribution Function (often referred to as Distribution Function) is the function F_X: R → [0, 1] defined by F_X(x) = P(X ≤ x)

• Recall that the ogive (the cumulative frequency curve) is how we use the CDF (often referred to as DF) with sample data

• When either f (pdf or pmf) or F (CDF) is available, we can compute probability of all events concerning the random experiment or equivalently probability of all subsets of values of the random variable

• Knowledge of the density, mass or distribution function is, therefore, a convenient way to understand the pattern of variation of a random variable from a quantitative perspective
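The properties above can be checked numerically. The sketch below is only an illustration under assumed examples: a pmf for the number of heads in two tosses of a fair coin, and a made-up density f(x) = 2x on [0, 1]. The helper names cdf_from_pmf and integrate are hypothetical.

```python
def cdf_from_pmf(pmf, x):
    """F(x) = P(X <= x): add up the point masses at values not exceeding x."""
    return sum(prob for value, prob in pmf.items() if value <= x)

def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the area under f between lo and hi."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

# Assumed discrete example: number of heads in two tosses of a fair coin.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
pmf_total = sum(pmf.values())                                 # should be 1
prob_discrete = cdf_from_pmf(pmf, 2) - cdf_from_pmf(pmf, 0)   # P(0 < X <= 2)

# Assumed continuous example: density f(x) = 2x on [0, 1], zero elsewhere.
def pdf(x):
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

total_area = integrate(pdf, 0.0, 1.0)        # should be close to 1
prob_continuous = integrate(pdf, 0.2, 0.5)   # P(0.2 <= X <= 0.5)

print(pmf_total, prob_discrete, total_area, prob_continuous)
```

For the discrete case the probability of a range is a difference of two CDF values; for the continuous case it is an area under the pdf.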

1) Some important discrete distributions

1.1 Binomial Distribution

In many real life situations it is important to determine the number or proportion of successes/failures when trials are conducted independently of each other. For example one may count the number of successful bids, the proportion of automobiles that pass a test in the first go, the number of program units found to be defect free when tested for the first time and so on. These random variables can be modeled using the Binomial distribution.

The Binomial distribution works as follows:

– Let us assume that a trial has been conducted n times. Note that the trials must be independent of each other.

– Let p be the proportion of success, that is, the probability of success in a single trial. Note that this proportion must remain reasonably constant over the period of experimentation.

– Let the random variable, number of successes, be X.

– Then the probability mass function is given by

P(X = x) = ⁿC_x p^x (1 – p)^{n – x}; x = 0, 1, 2, …, n

– In case we are interested in the probability mass function of the proportion of successes x/n rather than the number of successes, it is given by the same expression

P(X/n = x/n) = ⁿC_x p^x (1 – p)^{n – x}; x = 0, 1, 2, …, n

Usages of Binomial Distribution

– With the number of trials n known, the Binomial distribution is a one parameter distribution. The proportion of successes p is the unknown parameter. The independent trials of the binomial distribution are called Bernoulli trials

– We estimate p from data (how?)

– Given n and p, we can compute probability of any event
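A minimal sketch of these steps follows. The trial outcomes are invented for illustration, and p is estimated by the sample proportion of successes, which is one common answer to the "how?" above. The function name binomial_pmf is hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Invented data: 1 marks a success, 0 a failure.
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
n = len(outcomes)

# Estimate p by the observed proportion of successes.
p_hat = sum(outcomes) / n

# Given n and the estimated p, any event probability follows from the pmf.
pmf_total = sum(binomial_pmf(x, n, p_hat) for x in range(n + 1))
prob_at_least_8 = sum(binomial_pmf(x, n, p_hat) for x in range(8, n + 1))

print(f"estimated p = {p_hat}, P(X >= 8) = {prob_at_least_8:.4f}")
```

Ten observations give only a rough estimate of p; the point here is the mechanics, not the precision.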

1.2 Geometric Distribution

• Let the random variable X be the number of Bernoulli trials required to get the first success, where the trials are conducted independently of each other and the probability of success remains constant

• X takes values 1, 2, 3, …

• P(X = k) = p (1 – p)^{k – 1}, k ≥ 1
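The pmf above can be explored with a short sketch. The value p = 0.2 is an assumption made for illustration, and the infinite sums are truncated at a point where the remaining tail is negligible.

```python
def geometric_pmf(k, p):
    """P(X = k): k - 1 failures followed by the first success on trial k."""
    return p * (1 - p) ** (k - 1)

p = 0.2         # assumed success probability
k_max = 1_000   # truncation point; the tail beyond it is negligible for p = 0.2

total = sum(geometric_pmf(k, p) for k in range(1, k_max + 1))
mean = sum(k * geometric_pmf(k, p) for k in range(1, k_max + 1))

# Probability that the first success arrives within the first three trials.
prob_within_3 = sum(geometric_pmf(k, p) for k in range(1, 4))

print(f"total = {total:.6f}, mean = {mean:.4f}, P(X <= 3) = {prob_within_3:.4f}")
```

The computed mean agrees with 1 / p, the value listed for the geometric distribution in the summary table.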

1.3 Negative Binomial Distribution

• This is an extension of the geometric distribution

• In this distribution we try to model the number of trials required to get r successes. The random variable is denoted as X, where X is the number of trials required in excess of r to get r successes, i.e. the number of failures before the r-th success

• P(X = x) = ((x + r – 1)! / (r – 1)! x!)p^r(1 – p)^x, x = 0,1,…; 0 < p < 1

Some Examples

• Suppose in a manufacturing environment defects are rather rare. We may use the geometric distribution to estimate the number of attempts required to get the first defective item.

• We may use the geometric distribution to estimate the number of good products produced before getting the first defective product

• The negative binomial distribution may be used to find the number of sales calls required to close a fixed number of orders
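The sales call example can be sketched as follows. The closing probability p = 0.3 and the target of r = 3 orders are assumed values, and negative_binomial_pmf is a hypothetical helper that follows the pmf given above, with x counting the unsuccessful calls.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """P(X = x): x failures occur before the r-th success."""
    return comb(x + r - 1, x) * p**r * (1 - p) ** x

p = 0.3   # assumed chance that a single sales call closes an order
r = 3     # assumed number of orders to close

# Total calls = r + x, so "at most 10 calls" means at most 7 unsuccessful calls.
prob_at_most_10_calls = sum(negative_binomial_pmf(x, r, p) for x in range(0, 8))

# Expected total calls: r successes plus the mean number of failures r(1 - p) / p.
expected_calls = r + r * (1 - p) / p

print(f"P(at most 10 calls) = {prob_at_most_10_calls:.4f}")
print(f"expected number of calls = {expected_calls:.1f}")
```

With r = 1 the same function gives the geometric case, counted as failures before the first success rather than as trials.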

1.4 Poisson Distribution

• In a real life situation Poisson distribution is applicable whenever a particular trial is conducted many times, independently of each other and the probability of a particular event in any one trial is very small.

• Take, for instance, the number of defects found in a specification document. It may be assumed that the probability of making a mistake in any one line (or a small segment) is very small and the errors occur independently of each other.

• Going by the same logic, number of accidents, number of breakdowns, number of unreadable messages and similar random variables will follow Poisson Distribution. (Why?)

• When p is small but n is large so that np remains finite, the Binomial distribution approaches the Poisson distribution with mean np. As a rough rule of thumb, large n typically means n > 30.

• Let λ be the average number of occurrences of an event.

• Let X be the random variable. Then P(X = x) = (e^{-λ} λ^x) / x!; x = 0, 1, 2, …

• Note that the only parameter of a Poisson distribution is λ, the average number of occurrences of the event.

• Poisson distribution is typically used for count data – number of breakdowns, number of accidents, number of defects on a product, number of faults generated during a production run of a machine, number of customers who come to a store
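The closeness of the Binomial and Poisson distributions for large n and small p can be checked directly. The values n = 1000 and p = 0.002 below are assumptions picked for illustration; the function names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

n, p = 1_000, 0.002   # assumed: many trials, small chance per trial
lam = n * p           # matching Poisson mean

rows = [(x, binomial_pmf(x, n, p), poisson_pmf(x, lam)) for x in range(7)]
max_gap = max(abs(b - q) for _, b, q in rows)

for x, b, q in rows:
    print(f"x = {x}: binomial {b:.5f}, poisson {q:.5f}")
print(f"largest difference over these values: {max_gap:.5f}")
```

Repeating the comparison with a larger p for the same np (for example n = 10, p = 0.2) shows a visibly worse match.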

Some Examples

• Suppose in the tool crib of a machine shop, on an average 2 requests for issuing tools are made in an hour. What is the chance that in a particular hour

– 5 or more requests will come

– No request will come

• Suppose you are in charge of manpower planning for a team responsible for maintaining software. Consider a simplistic environment where only one type of request comes in. From your past observations you have seen that on an average 3 requests come in a day and it usually takes a full day to service a request. Considering this, you have allocated three persons.

– What is the chance that at least one person will remain unutilized on any given day?

– What is the chance that you will not be able to service customers, assuming that customer service will be adversely impacted in case 4 or more requests arrive on any given day?
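One way to work through these questions is sketched below. It assumes the requests follow a Poisson distribution with the stated averages, and it reads "at least one person unutilized" as "fewer than 3 requests arrive", on the assumption that each request occupies one person for the whole day. The helper names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x): sum of the pmf from 0 up to x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# Tool crib: on average 2 requests per hour.
crib_rate = 2.0
prob_five_or_more = 1 - poisson_cdf(4, crib_rate)
prob_no_request = poisson_pmf(0, crib_rate)

# Maintenance team: on average 3 requests per day, three persons allocated.
team_rate = 3.0
prob_someone_idle = poisson_cdf(2, team_rate)          # 0, 1 or 2 requests
prob_service_impacted = 1 - poisson_cdf(3, team_rate)  # 4 or more requests

print(f"P(5 or more requests in an hour) = {prob_five_or_more:.4f}")
print(f"P(no request in an hour)         = {prob_no_request:.4f}")
print(f"P(at least one person idle)      = {prob_someone_idle:.4f}")
print(f"P(4 or more requests in a day)   = {prob_service_impacted:.4f}")
```

Events of the form "k or more" are handled through the complement, since the Poisson variable has no upper limit.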

1.5 Discrete Uniform Distribution

• This is often referred to as the equally likely outcome distribution and is equivalent to assuming minimum knowledge about the distribution

• A random variable X is said to follow a discrete uniform distribution in case it takes values 1, 2, …, k such that P(X = j) = 1 / k for all j = 1, 2, …, k
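As a quick check of this definition, the sketch below computes the mean and variance directly from the pmf for an assumed k = 6 (a fair die) and compares them with the closed forms (k + 1) / 2 and (k – 1)(k + 1) / 12. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

k = 6  # assumed number of equally likely values, e.g. the faces of a die

# P(X = j) = 1 / k for every j from 1 to k.
pmf = {j: Fraction(1, k) for j in range(1, k + 1)}

mean = sum(j * prob for j, prob in pmf.items())
variance = sum((j - mean) ** 2 * prob for j, prob in pmf.items())

print(f"mean = {mean}, variance = {variance}")
print(f"closed forms: {Fraction(k + 1, 2)}, {Fraction((k - 1) * (k + 1), 12)}")
```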

2) Continuous distributions

2.1 Normal Distribution

• One of the most common distributions for continuous random variables is the Normal distribution.

• When the values of a random variable are expected to be roughly symmetrical around the mean and tend to cluster around it, the variable is often modeled with a Normal distribution.

• Many continuous variables like length, weight or other dimensions / characteristics of a manufactured product; the difference between planned and actual effort or time; amount of material consumed during a production run and so on are expected to follow normal distribution

PDF of Normal Distribution

• Let X be a random variable that follows the Normal distribution.

• Let μ be the mean value of X.

• Let σ be the standard deviation of X.

• Then the probability density function is given by
f(x) = {1 / (σ√(2π))} exp[-(x - μ)² / (2σ²)]

• The two parameters of the Normal distribution are μ and σ respectively, where μ and σ² represent the mean and variance of the random variable
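The Normal pdf and the corresponding CDF can be sketched with the standard library alone. The mean of 50 and standard deviation of 2 are assumed values (say, a dimension of a manufactured part), and the CDF is written in terms of the error function math.erf.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x), expressed through the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 50.0, 2.0   # assumed mean and standard deviation

# Probability of falling within one and two standard deviations of the mean.
within_1sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
within_2sd = normal_cdf(mu + 2 * sigma, mu, sigma) - normal_cdf(mu - 2 * sigma, mu, sigma)

print(f"pdf at the mean = {normal_pdf(mu, mu, sigma):.4f}")
print(f"P(within 1 sd) = {within_1sd:.4f}, P(within 2 sd) = {within_2sd:.4f}")
```

The two probabilities do not depend on the particular mean and standard deviation chosen; they are the same for every Normal distribution.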

Importance of Normal Distribution

• Two main reasons for the importance of Normal distribution are:

– The tendency of sums or averages of independently drawn random observations from markedly non-normal distributions to closely approximate Normal distributions, and

– The robustness or insensitivity of many commonly used statistical procedures to departures from theoretical normality.

Convergence to Normal Distribution – An Example
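The convergence can be illustrated with a small simulation. This is only a sketch under assumptions: the observations are drawn from an exponential distribution with mean 1 (a markedly skewed distribution), each average is taken over 30 observations, and the seed and number of replicates are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(2016)   # fixed seed, chosen arbitrarily
sample_size = 30            # observations averaged in each replicate
replicates = 20_000         # number of averages simulated

# Each replicate: the average of 30 draws from a skewed (exponential) distribution.
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(replicates)
]

center = fmean(sample_means)
spread = pstdev(sample_means)
expected_spread = 1 / sqrt(sample_size)   # sd of a single draw is 1

# For a Normal distribution about 68% of values lie within one sd of the mean.
share_within_1sd = sum(
    1 for m in sample_means if abs(m - 1.0) <= expected_spread
) / replicates

print(f"mean of averages = {center:.4f}, sd of averages = {spread:.4f}")
print(f"share within one sd of the mean = {share_within_1sd:.4f}")
```

Although the individual observations are far from Normal, the averages cluster symmetrically around the mean with roughly the 68% coverage expected of a Normal distribution.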

2.2 Uniform Distribution

• Let X be a random variable that takes values between two real numbers a and b (b > a) with the pdf f(x) = 1 / (b – a) whenever a ≤ x ≤ b and 0 otherwise

• It is important to note that in a uniform distribution the maximum and minimum values are fixed and all values between the minimum and maximum occur with equal likelihood

Usage of Uniform Distribution

• This distribution is very important from the perspective of random number generation. The computer generated random numbers are drawn from uniform distribution. In general these random numbers follow the U(0,1) distribution

• The distribution function of any continuous random variable, evaluated at the variable itself, follows a uniform distribution, i.e. if F is the distribution function of a continuous random variable X, then Y = F(X) ~ U(0,1). This is a very important property and we will see its usage later
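The property Y = F(X) ~ U(0,1) can be checked by simulation. The sketch below assumes X is standard Normal (any continuous distribution with a known CDF would do) and verifies that the transformed values spread evenly over four equal bins of [0, 1].

```python
import random
from math import erf, sqrt
from statistics import fmean

rng = random.Random(11)   # fixed seed, chosen arbitrarily

def standard_normal_cdf(x):
    """F(x) = P(X <= x) for a standard Normal random variable."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Draw observations of X, then transform each one through its own CDF.
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
ys = [standard_normal_cdf(x) for x in xs]

# If Y ~ U(0,1), each quarter of [0, 1] should hold about 25% of the values.
bin_counts = [0, 0, 0, 0]
for y in ys:
    bin_counts[min(int(y * 4), 3)] += 1
bin_shares = [count / len(ys) for count in bin_counts]

print(f"mean of Y = {fmean(ys):.4f} (0.5 for U(0,1))")
print("shares per quarter:", [round(share, 4) for share in bin_shares])
```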

2.3 Exponential Distribution

• This distribution is widely used to describe events recurring at random points in time. Typical examples include time between arrival of customers at a service booth; time between failures of a machine and so on.

• The exponential and Poisson distributions are related. In case the times between successive occurrences of an event are independent and follow the same exponential distribution, then the number of occurrences in a given period follows a Poisson distribution.

• The probability density function of the exponential distribution is given by f(x) = λe^{-λx}, x ≥ 0, λ > 0

• The mean of exponential distribution is 1/λ
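The link between the exponential and Poisson distributions can be illustrated by simulation. In this sketch the rate λ = 3 per period is an assumed value, and exponential gaps are generated from U(0,1) numbers by inverting the exponential distribution function F(x) = 1 - e^{-λx}.

```python
import math
import random
from statistics import fmean

rng = random.Random(7)   # fixed seed, chosen arbitrarily
lam = 3.0                # assumed rate: average occurrences per period
periods = 20_000         # number of unit-length periods simulated

def exponential_draw(rate):
    """Solve u = 1 - exp(-rate * x) for x, with u drawn from U(0,1)."""
    return -math.log(1.0 - rng.random()) / rate

counts = [0] * periods   # occurrences observed in each period
gaps = []                # times between successive occurrences

gap = exponential_draw(lam)
t = gap
while t < periods:
    gaps.append(gap)
    counts[int(t)] += 1          # the occurrence falls in period int(t)
    gap = exponential_draw(lam)
    t += gap

mean_gap = fmean(gaps)
mean_count = fmean(counts)
share_zero = counts.count(0) / periods

print(f"mean gap = {mean_gap:.4f} (1/λ = {1 / lam:.4f})")
print(f"mean count per period = {mean_count:.4f} (λ = {lam})")
print(f"share of empty periods = {share_zero:.4f} (e^-λ = {math.exp(-lam):.4f})")
```

The average gap is close to 1/λ, while the counts per period behave like a Poisson variable with mean λ, including the share of periods with no occurrence.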

2.4 Weibull Distribution

• This distribution describes data resulting from life and fatigue tests. It is often used to describe failure times in reliability studies as well as breaking strength of materials. Weibull distributions are also used to represent various physical quantities like wind speed

• Weibull pdf f(x) = (α/ β^α)x^{α – 1}exp[ – (x / β)^α], where x ≥ 0, α > 0, β > 0
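A short numerical sketch of the Weibull pdf follows. The shape α = 2 and scale β = 3 are assumed values; the mean obtained by numerical integration is compared with the closed form βΓ(1 + 1/α), and the helper integrate is a hypothetical midpoint-rule routine.

```python
from math import exp, gamma

def weibull_pdf(x, alpha, beta):
    """Weibull density with shape alpha and scale beta; zero for negative x."""
    if x < 0:
        return 0.0
    return (alpha / beta**alpha) * x ** (alpha - 1) * exp(-((x / beta) ** alpha))

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the area under f between lo and hi."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

alpha, beta = 2.0, 3.0   # assumed shape and scale
upper = 30.0             # the density is negligible beyond this point

area = integrate(lambda x: weibull_pdf(x, alpha, beta), 0.0, upper)
numeric_mean = integrate(lambda x: x * weibull_pdf(x, alpha, beta), 0.0, upper)
closed_form_mean = beta * gamma(1 + 1 / alpha)

print(f"area under the pdf = {area:.6f}")
print(f"mean: numeric {numeric_mean:.5f}, closed form {closed_form_mean:.5f}")
```

Setting α = 1 in weibull_pdf gives the exponential density with rate 1/β, so the exponential distribution is a special case of the Weibull.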

Summary of the Distributions

Name	Parameters	Mean	Variance
Binomial	p (n known)	np	np(1 – p)
Poisson	λ	λ	λ
Geometric	p	1 / p	(1 – p) / p²
Negative Binomial	p (r known)	[r(1 – p)] / p	r(1 – p) / p²
Discrete Uniform	K	(K + 1) / 2	(K – 1)(K + 1) / 12
Uniform	a, b	(a + b) / 2	(b – a)² / 12
Normal	μ, σ²	μ	σ²
Exponential	λ	1 / λ	1 / λ²
Weibull	α, β	βΓ(1 + 1/α)	β² {Γ(1 + 2/α) – (Γ(1 + 1/α))² }

Three Usages of Distributions

• Distributions of random variables may be used from three perspectives, namely

– To find the probability of any event

– To compare two or more distributions; or to compare effectiveness of some actions

– To simulate real life scenarios using the distributional models to develop an understanding of the phenomenon

Tuesday, December 6, 2016

Introduction to Random Variables and Distributions ( Category Concepts, Level : Intermediate)
