Tuesday, December 6, 2016

Introduction to Random Variables and Distributions (Category: Concepts, Level: Intermediate)




This post will help you understand the concept of random variables and the distributions they follow. Together with hypothesis testing, these concepts cover a good part of core analytics. Feel free to comment.

 Understanding Random Experiments and Random Variables

      Certain variables are generated as activities are carried out. A flight takes a certain amount of time; a machine consumes a certain amount of power per unit time when operated; a car travels a certain number of miles before encountering a fault. The activities giving rise to the data are called random experiments. These are the data-generating mechanisms. The generated data are often referred to as random variables.

      Technically, a random variable is a mapping that assigns a real number to every outcome of the activity. We often write this as $X: \Omega \to \mathbb{R}$, where $\Omega$ is the sample space containing all possible outcomes of the random experiment.
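To make the idea of a mapping concrete, here is a minimal sketch. The two-coin-toss experiment and the name `number_of_heads` are illustrative assumptions, not part of the original discussion.

```python
from itertools import product

# Sample space for two coin tosses: each outcome is a tuple such as ("H", "T").
sample_space = list(product("HT", repeat=2))

def number_of_heads(outcome):
    """Random variable X: maps an outcome to a real number (the count of heads)."""
    return sum(1 for toss in outcome if toss == "H")

# The mapping X: Omega -> R written out explicitly as a dictionary.
X = {outcome: number_of_heads(outcome) for outcome in sample_space}
for outcome, value in X.items():
    print(outcome, "->", value)
```

Every outcome in the sample space gets exactly one number, which is all that the notation $X: \Omega \to \mathbb{R}$ says.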

 

Random Variables

      Random variables take  values from a predefined set. Thus all possible values of the random variable are known in advance. However, the exact value that will occur in the next occurrence is unknown.

      We do not know when a machine will fail, what the power consumption of a device will be in a given period, what the efficiency of a machine will be, how many defective products will be produced during the next production run…

 

Concept of Distributions

 
      It is often assumed that the process that generates the random variable can be modeled using a mathematical function. This function enables us to calculate the probability that the values of a random variable will fall within any given range.

      The mathematical function is defined in terms of the random variable and some unknown but fixed constants called parameters.

      The distribution (mathematical function) characterizes the pattern of variation of the random variable.

      We may also construct the frequency distribution to visualize the pattern of variation. When we have a very large number of observations drawn randomly from a population, the frequency distribution may be a good approximation of the distribution.

      The frequency distribution is a non-parametric way of looking at a distribution, and it has its own advantages and disadvantages.
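A small simulation can show how a frequency distribution approaches the underlying distribution as the number of observations grows. This is only a sketch: the fair die, the function name `relative_frequencies` and the fixed seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

def relative_frequencies(n_rolls, seed=0):
    """Roll a fair die n_rolls times and return the relative frequency of each face."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

# Compare a small and a large sample against the true probability 1/6 per face.
for n in (60, 60_000):
    freqs = relative_frequencies(n)
    print(n, {face: round(f, 3) for face, f in freqs.items()})
```

With only 60 rolls the relative frequencies can be noticeably uneven; with 60,000 rolls they should sit close to 1/6 for every face.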

 

 

Understanding  Density and Mass Functions

 

      Random variables may be discrete or continuous. A random variable is said to be discrete if it takes values from a countable set $\{x_1, x_2, \ldots\}$ (a set is said to be countable if it is either finite or can be put in one-to-one correspondence with the integers).

      Discrete random variables have probability mass functions (pmf), which place point masses on the individual values.

      Continuous random variables have probability density functions (pdf). At any point x the pdf gives the height of the curve, and the area under the curve from $x_0$ to $x_1$ gives the probability that the random variable assumes a value in that range. In this sense the pdf behaves like a physical density, where mass is computed from density and volume.

 

 

 

 
 
 
 
 
 
 
 
 
 
 
 
Properties of Mass and Density Functions



 


 

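       Let $f_X(x)$ be a probability mass function (pmf). Then $f_X(x) \ge 0$ for all $x \in \mathbb{R}$ and $\sum_i f_X(x_i) = 1$.

       In case the random variable X is continuous, we have $f_X(x) \ge 0$ for all $x \in \mathbb{R}$, and integrating $f_X$ over the entire real line gives a total area under the curve of 1, i.e. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

       For a continuous distribution we integrate the pdf over the range $x_0$ to $x_1$ to get the probability $P(x_0 \le X \le x_1)$. For a pmf we sum the point masses of the values in that range.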
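These properties can be checked numerically. The sketch below uses two assumed examples that are not from the text: the pmf of a fair die, and the pdf f(x) = 2x on [0, 1]. The helper `integrate` is a simple midpoint rule written for this illustration.

```python
def integrate(f, lo, hi, steps=10_000):
    """Approximate the integral of f from lo to hi with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

# Discrete case: pmf of a fair die (assumed example).
pmf = {face: 1 / 6 for face in range(1, 7)}

# Continuous case: pdf f(x) = 2x on [0, 1] and 0 elsewhere (assumed example).
def pdf(x):
    return 2 * x if 0 <= x <= 1 else 0.0

total_mass = sum(pmf.values())                        # should be 1
total_area = integrate(pdf, 0, 1)                     # should be 1
p_discrete = sum(pmf[face] for face in (2, 3, 4))     # P(2 <= X <= 4), a sum
p_continuous = integrate(pdf, 0.2, 0.5)               # P(0.2 <= X <= 0.5), an area
print(total_mass, total_area, p_discrete, p_continuous)
```

For this pdf the exact area between 0.2 and 0.5 is 0.5² − 0.2² = 0.21, which the numerical integral should reproduce closely.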

Cumulative Distribution Function

       The Cumulative Distribution Function (often referred to as the Distribution Function) is the function $F_X: \mathbb{R} \to [0, 1]$ defined by $F_X(x) = P(X \le x)$.

       Recall that the ogive (the cumulative frequency curve) is how we usually see the CDF (often abbreviated as DF) in sample data.

       When either f (pdf or pmf) or F (CDF) is available, we can compute the probability of all events concerning the random experiment or, equivalently, the probability of all subsets of values of the random variable.

       Knowledge of the density, mass or distribution function is, therefore, a convenient way to understand the pattern of variation of a random variable from a quantitative perspective.
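As a brief illustration of computing event probabilities from the CDF, recall that P(a < X ≤ b) = F(b) − F(a). The fair die and the function names below are assumptions made for this sketch.

```python
import math

def die_cdf(x):
    """F(x) = P(X <= x) for a fair six-sided die (assumed example)."""
    if x < 1:
        return 0.0
    return min(math.floor(x), 6) / 6

def prob_between(a, b):
    """P(a < X <= b), computed as F(b) - F(a)."""
    return die_cdf(b) - die_cdf(a)

print(die_cdf(3.7))        # P(X <= 3.7) is the same as P(X <= 3)
print(prob_between(2, 5))  # P(X in {3, 4, 5})
```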

 

  1)    Some important discrete distributions

 

  1.1)  Binomial Distribution

         In many real life situations it is important to determine the number or proportion of successes/failures when trials are conducted independently of each other. For example one may count the number of successful bids, the proportion of automobiles that pass a  test in the first go, the number of program units found to be defect free when tested for the first time and so on. These random variables can be modeled using the Binomial distribution.

The Binomial distribution works as follows:

      Let us assume that a trial has been conducted n times. Note that the trials must be independent of each other.

      Let p be the proportion (probability) of success in a single trial. Note that this proportion must remain reasonably constant over the period of experimentation.

      Let the random variable, number of successes, be X.

      Then the probability mass function is given by

$P(X = x) = \binom{n}{x}\, p^{x} (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n$

      In case we are interested in the proportion of successes X/n rather than the number of successes, its probability mass function is given by the same expression:

$P(X/n = x/n) = \binom{n}{x}\, p^{x} (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n$
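The binomial formula translates directly into code. The following is a minimal sketch; the scenario of 10 bids with success probability 0.3 is a hypothetical example, and `binomial_pmf` is a name chosen for this illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3  # hypothetical: 10 bids, each succeeding with probability 0.3
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))

# Probability of at least 3 successful bids: sum the pmf over x = 3, ..., n.
p_at_least_3 = sum(binomial_pmf(x, n, p) for x in range(3, n + 1))
print(round(p_at_least_3, 4))
```

Any event concerning the number of successes can be handled the same way, by summing the pmf over the values of x that make up the event.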

Usages of Binomial Distribution

      For a known number of trials n, the Binomial distribution is a one-parameter distribution. The proportion of successes p is the unknown parameter. The independent trials of the binomial distribution are called Bernoulli trials.

      We estimate p from data (how?)

      Given n and p, we can compute probability of any event


 1.2 Geometric Distribution

       Let the random variable X be the number of Bernoulli trials required to get the first success, where the trials are conducted independently of each other and the probability of success remains constant

       X takes values 1, 2, 3, … (at least one trial is needed to get the first success).

       $P(X = k) = p\,(1 - p)^{k-1}, \quad k \ge 1$
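A short sketch of the geometric pmf follows. The value p = 0.2 is a hypothetical choice, and the truncated sum is only a numerical check that the mean number of trials is close to 1/p.

```python
def geometric_pmf(k, p):
    """P(X = k): the first success occurs on trial k (k = 1, 2, ...)."""
    if k < 1:
        return 0.0
    return p * (1 - p) ** (k - 1)

p = 0.2  # hypothetical probability of success in a single trial
print([round(geometric_pmf(k, p), 4) for k in range(1, 6)])

# Truncated sum approximating the mean; it should be close to 1 / p.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 2001))
print(approx_mean)
```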

 1.3 Negative Binomial Distribution

       This is an extension of the geometric distribution

       In this distribution we model the number of trials required to get r successes. The random variable X is the excess number of trials over r required to get r successes, i.e. the number of failures before the r-th success.

       $P(X = x) = \frac{(x + r - 1)!}{(r - 1)!\, x!}\, p^{r} (1 - p)^{x}, \quad x = 0, 1, \ldots;\ 0 < p < 1$

Some Examples

  1. Suppose in a manufacturing environment defects are rather rare. We may use the geometric distribution to estimate the number of attempts required to get the first defective item.
  2. We may use the geometric distribution to estimate the number of good products produced before getting the first defective product
  3. The negative binomial distribution may be used to find the number of sales calls required to close a fixed number of orders
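The sales-call example can be sketched as follows. The numbers (3 orders to close, success probability 0.2 per call) are hypothetical, and the random variable counts unsuccessful calls, so "exactly 10 calls" corresponds to x = 10 − r failures.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """P(X = x): x failures occur before the r-th success."""
    return comb(x + r - 1, x) * p**r * (1 - p) ** x

r, p = 3, 0.2  # hypothetical: close 3 orders, each call succeeds with probability 0.2

# Probability that exactly 10 calls are needed, i.e. 7 unsuccessful calls.
p_exactly_10 = negative_binomial_pmf(10 - r, r, p)

# Probability that at most 15 calls are needed, i.e. at most 12 unsuccessful calls.
p_at_most_15 = sum(negative_binomial_pmf(x, r, p) for x in range(15 - r + 1))

print(p_exactly_10, p_at_most_15)
```

Setting r = 1 gives p(1 − p)^x, the geometric distribution counted in failures rather than trials, which is why the negative binomial is described as its extension.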

 
 1.4 Poisson Distribution

       In a real life situation Poisson distribution is applicable whenever a particular trial is conducted many times, independently of each other and the probability of a particular event in any one trial is very small.

       Take, for instance, the number of defects found in a specification document. It may be assumed that the probability of making a mistake in any one line (or a small segment) is very small and the errors occur independently of each other.

       Going by the same logic, number of accidents, number of breakdowns, number of unreadable messages and similar random variables will follow Poisson Distribution. (Why?)

       When p is small but n is large so that np remains finite, the Binomial distribution approaches the Poisson distribution with $\lambda = np$. As a rough rule of thumb, large n typically means n > 30.

       Let λ be the average number of occurrences of an event.

       Let X be the random variable. Then $P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$

       Note that the only parameter of a Poisson distribution is λ, the average number of occurrences of the event.

       Poisson distribution is typically used for count data – number of breakdowns, number of accidents, number of defects on a product, number of faults generated during a production run of a machine, number of customers who come to a store
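The closeness of the Binomial and Poisson distributions for large n and small p can be seen numerically. This sketch assumes n = 1000 and p = 0.002, values picked only for illustration, so that the Poisson mean is np = 2.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 1000, 0.002  # hypothetical: many trials, rare event
lam = n * p         # matching Poisson mean

# Print the two pmfs side by side for the first few counts.
for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
```

The two columns should agree to about three decimal places, which is why the Poisson distribution is a convenient stand-in when n is large and p is small.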

Some Examples

       Suppose in the tool crib of a machine shop, on average 2 requests for issuing tools are made in an hour. What is the chance that in a particular hour

      5 or more requests will come?
      No request will come?
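One way to work the tool crib example, assuming the number of requests in an hour follows a Poisson distribution with λ = 2 (the function name `poisson_pmf` is chosen for this sketch):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2  # average number of requests per hour, from the example

# P(no request) is the pmf at 0.
p_none = poisson_pmf(0, lam)

# P(5 or more requests) = 1 - P(0) - P(1) - P(2) - P(3) - P(4).
p_five_or_more = 1 - sum(poisson_pmf(x, lam) for x in range(5))

print(round(p_none, 4), round(p_five_or_more, 4))
```

Under this assumption the chance of no request is about 13.5%, and the chance of 5 or more requests is about 5.3%.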

       Suppose you are in charge of manpower planning for a team responsible for maintaining software. Consider a simplistic environment where only one type of request comes. From your past observations you have seen that on average 3 requests come in a day and it usually takes a full day to service a request. Considering this, you have allocated three persons.

      What is the chance that at least one person will remain unutilized on any given day?

      What is the chance that you will not be able to service customers, assuming that customer service will be adversely impacted in case 4 or more requests arrive on any given day?
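The manpower example can be worked in the same way, under two assumptions that the text does not state explicitly: daily requests follow a Poisson distribution with λ = 3, and there is no backlog carried over from earlier days. At least one of the three persons is idle when 2 or fewer requests arrive.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3      # average number of requests per day, from the example
persons = 3  # staff allocated, one request per person per day

# At least one person idle: fewer requests than persons, i.e. X <= 2.
p_someone_idle = sum(poisson_pmf(x, lam) for x in range(persons))

# Service impacted: 4 or more requests, i.e. 1 - P(X <= 3).
p_overloaded = 1 - sum(poisson_pmf(x, lam) for x in range(persons + 1))

print(round(p_someone_idle, 4), round(p_overloaded, 4))
```

Under these assumptions someone is idle on roughly 42% of days, while service is impacted on roughly 35% of days, so staffing exactly to the average leaves both risks fairly high.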

 

1.5 Discrete Uniform Distribution


       This is often referred to as the equally likely outcome distribution and is equivalent to assuming minimum knowledge about the distribution

       A random variable X is said to follow a discrete uniform distribution in case it takes values 1, 2, …, k such that P(X = j) = 1 / k for all j = 1, 2, …, k

  2) Continuous distributions

   2.1 Normal Distribution

       One of the most common distributions for continuous random variables is the Normal distribution.

       When a random variable is expected to be roughly symmetrical around its mean and to cluster around the mean value, it can often be modeled by a Normal distribution.

       Many continuous variables like length, weight or other dimensions / characteristics of a manufactured product; the difference between planned and actual effort or time; amount of material consumed during a production run and so on are often expected to follow a Normal distribution.

     PDF of Normal Distribution

       Let X be a random variable that follows a Normal distribution.
       Let μ be the mean value of X.
       Let σ be the standard deviation of X.
       Then the probability density function is given by
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]$
       The two parameters of the Normal distribution are μ and σ, where μ and σ² represent the mean and variance of the random variable.
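The Normal pdf, together with its CDF expressed through the error function, can be sketched as follows. The values μ = 50 and σ = 2 are hypothetical, chosen only to show how probabilities around the mean are computed.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the Normal density at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 50.0, 2.0  # hypothetical mean and standard deviation

# Probability of falling within k standard deviations of the mean.
for k in (1, 2, 3):
    prob = normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)
    print(k, round(prob, 4))
```

The probabilities depend only on k, not on the particular μ and σ, which is the clustering around the mean described above.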

   Importance of Normal Distribution 

       Two main reasons for the importance of Normal distribution are:

      The tendency of sums or averages of independently drawn random observations from markedly non-normal distributions to closely approximate Normal distributions (see the example below), and
      The robustness, or insensitivity, of many commonly used statistical procedures to departures from theoretical normality.
 

Convergence to Normal Distribution – An Example
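A small simulation gives one way to see this convergence. The sketch below assumes observations drawn from an exponential distribution with mean 1 (a markedly skewed, non-normal distribution), samples of size 30, and a fixed seed; all of these are illustrative choices.

```python
import random
import statistics
from math import sqrt

def sample_means(n_per_sample, n_samples, seed=0):
    """Return the averages of n_samples samples, each of n_per_sample exponential draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(n_samples):
        draws = [rng.expovariate(1.0) for _ in range(n_per_sample)]
        means.append(sum(draws) / n_per_sample)
    return means

n = 30
means = sample_means(n, 5000)

center = statistics.mean(means)    # should be near the population mean, 1
spread = statistics.pstdev(means)  # should be near 1 / sqrt(n)

# Share of averages within one theoretical standard deviation of the mean;
# for a Normal distribution this share is about 0.68.
within_one_sd = sum(abs(m - 1.0) <= 1 / sqrt(n) for m in means) / len(means)

print(center, spread, within_one_sd)
```

Although individual exponential observations are strongly skewed, the averages behave much like draws from a Normal distribution with mean 1 and standard deviation 1/√30.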



 2.2 Uniform Distribution


       Let X be a random variable that takes values between two real numbers a and b (b > a) with the pdf f(x) = 1 / (b – a) whenever a ≤ x ≤ b and 0 otherwise.

       It is important to note that in a uniform distribution the maximum and minimum values are fixed and all values between the minimum and maximum occur with equal likelihood

Usage of Uniform Distribution

       This distribution is very important from the perspective of random number generation. Computer-generated random numbers are typically drawn from a uniform distribution. In general these random numbers follow the U(0,1) distribution.

       For any continuous random variable, the distribution function evaluated at the variable follows a uniform distribution, i.e. if F is the distribution function of a continuous random variable and x is any random observation from it, then y = F(x) ~ U(0,1). This is a very important property and we will see its usage later.
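This property works in both directions, and the reverse direction is one common way to generate random numbers from other distributions. The sketch below assumes an exponential distribution with rate λ = 2, whose distribution function is F(x) = 1 − e^(−λx); the function names and the seed are illustrative.

```python
import math
import random

def exponential_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

def exponential_inverse_cdf(u, lam):
    """Solve F(x) = u for x, with u in [0, 1)."""
    return -math.log(1 - u) / lam

rng = random.Random(0)  # fixed seed so the run is repeatable
lam = 2.0               # hypothetical rate

# Forward direction: applying F to exponential observations gives values
# that behave like U(0,1) draws.
ys = [exponential_cdf(rng.expovariate(lam), lam) for _ in range(20_000)]

# Reverse direction: applying the inverse of F to U(0,1) draws gives values
# that behave like exponential observations.
xs = [exponential_inverse_cdf(rng.random(), lam) for _ in range(20_000)]

print(sum(ys) / len(ys))  # compare with 0.5, the mean of U(0,1)
print(sum(xs) / len(xs))  # compare with 1 / lam, the exponential mean
```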

2.3 Exponential Distribution


       This distribution is widely used to describe events recurring at random points in time. Typical examples include time between arrival of customers at a service booth; time between failures of a machine and so on.
       The exponential and Poisson distributions are related. In case the times between successive occurrences of an event are independent and follow an exponential distribution with rate λ, the number of occurrences in a given period of length t follows a Poisson distribution with mean λt.
       The probability density function of the exponential distribution is given by $f(x) = \lambda e^{-\lambda x}, \quad x \ge 0,\ \lambda > 0$
       The mean of the exponential distribution is 1/λ.
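The link between the two distributions can be illustrated by simulation. This sketch assumes a rate of λ = 3 events per unit time and a fixed seed, and it counts how many events land in each unit-length interval when the gaps between events are independent exponential draws.

```python
import random
import statistics

def counts_per_unit_interval(rate, n_intervals, seed=0):
    """Count events in each unit-length interval when gaps between events are exponential."""
    rng = random.Random(seed)
    counts = [0] * n_intervals
    t = rng.expovariate(rate)       # time of the first event
    while t < n_intervals:
        counts[int(t)] += 1         # the event falls in interval [int(t), int(t) + 1)
        t += rng.expovariate(rate)  # add an independent exponential gap
    return counts

rate = 3.0  # hypothetical average number of events per unit time
counts = counts_per_unit_interval(rate, 10_000)

# For a Poisson(3) count, the mean and variance are both 3,
# and the share of empty intervals is exp(-3), about 0.05.
print(statistics.mean(counts), statistics.pvariance(counts))
print(counts.count(0) / len(counts))
```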

2.4 Weibull Distribution


       This distribution describes data resulting from life and fatigue tests. It is often used to describe failure times in reliability studies as well as breaking strength of materials. Weibull distributions are also used to represent various physical quantities like wind speed
       Weibull pdf: $f(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha - 1} \exp\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right]$, where x ≥ 0, α > 0, β > 0
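The Weibull pdf and its mean and variance, βΓ(1 + 1/α) and β²{Γ(1 + 2/α) − (Γ(1 + 1/α))²}, can be evaluated with the gamma function from the standard library. The parameter values used below are hypothetical.

```python
from math import exp, gamma

def weibull_pdf(x, alpha, beta):
    """Weibull density with shape alpha and scale beta (use x > 0 when alpha < 1)."""
    if x < 0:
        return 0.0
    return (alpha / beta**alpha) * x ** (alpha - 1) * exp(-((x / beta) ** alpha))

def weibull_mean(alpha, beta):
    return beta * gamma(1 + 1 / alpha)

def weibull_variance(alpha, beta):
    return beta**2 * (gamma(1 + 2 / alpha) - gamma(1 + 1 / alpha) ** 2)

# With alpha = 1 the Weibull reduces to an exponential distribution with mean beta.
print(weibull_mean(1, 3.0), weibull_variance(1, 3.0))

# A hypothetical shape of 2 and scale of 1.
print(weibull_mean(2, 1.0), weibull_variance(2, 1.0))
```

The shape parameter α controls how the failure rate changes over time, which is what makes the distribution flexible enough for life and fatigue data.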

Summary of the Distributions

 

| Name | Parameters | Mean | Variance |
|---|---|---|---|
| Binomial | p (n fixed) | np | np(1 – p) |
| Poisson | λ | λ | λ |
| Geometric | p | 1 / p | (1 – p) / p² |
| Negative Binomial | p (r fixed) | r(1 – p) / p | r(1 – p) / p² |
| Discrete Uniform | k | (k + 1) / 2 | (k – 1)(k + 1) / 12 |
| Uniform | a, b | (a + b) / 2 | (b – a)² / 12 |
| Normal | μ, σ² | μ | σ² |
| Exponential | λ | 1 / λ | 1 / λ² |
| Weibull | α, β | βΓ(1 + 1/α) | β² {Γ(1 + 2/α) – (Γ(1 + 1/α))²} |

  
Three Usages of Distributions


       Distributions of random variables may be used from three perspectives, namely:
      To find the probability of any event
      To compare two or more distributions, or to compare the effectiveness of some actions
      To simulate real-life scenarios using distributional models, in order to develop an understanding of the phenomenon
 
