
Tuesday, December 27, 2016

Descriptive and Inferential Statistics (Category:Concept, Level:Basic)


Descriptive statistics organize, describe, and summarize data using numbers and graphical techniques. This branch of statistics uses a set of standard measures such as percent, averages, and variability, as well as simple graphs, charts, and tables. Descriptive statistics help you to better understand your data by describing and summarizing its basic features. You learn how to generate and understand numerical summaries. These include frequency; measures of location, including minimum, maximum, percentiles, quartiles, and central tendency (mean, median, and mode); and measures of dispersion or variability, including range, interquartile range, variance, and standard deviation. The graphical summaries you learn include the histogram, normal probability plot, and box plot. The goals when you're describing data are to:
  • screen for unusual data values,
  • inspect the spread and shape of your data,
  • characterize the central tendency, and
  • draw preliminary conclusions about your data.
Is the data as error free as possible? What unique features can you identify? Are there data values that cluster or show some unusual shape? Does your data include any possible outliers? When you have a basic understanding of your data, then you can use inferential statistics.
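In SAS, for example, most of these numerical and graphical summaries can be produced with PROC MEANS, PROC UNIVARIATE, and PROC SGPLOT. Below is a minimal sketch using the SASHELP.CARS sample data that ships with SAS; substitute your own data set and variable.

proc means data=sashelp.cars n mean median std min p25 p75 max;
   var MPG_City;                  /* numerical summaries: location and dispersion */
run;

proc univariate data=sashelp.cars;
   var MPG_City;
   histogram MPG_City / normal;   /* histogram with a normal curve overlay */
   qqplot MPG_City;               /* normal probability (Q-Q) plot */
run;

proc sgplot data=sashelp.cars;
   vbox MPG_City;                 /* box plot to screen for outliers */
run;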

Inferential statistics is the branch of statistics concerned with drawing conclusions about a population from analysis of a random sample drawn from that population. It is also concerned with the precision and reliability of those inferences. Inferential statistics generalize from data you observe to the population that you have not observed. Descriptive statistics describe your sample data, but inferential statistics help you draw conclusions about the entire population of data. Descriptive statistics can also be referred to as exploratory data analysis, or EDA. Inferential statistics can also be called explanatory modeling.

Explanatory versus Predictive Modeling


Before beginning your analysis, you should use descriptive statistics to explore your data. After getting familiar with your data, you can use inferential statistics, or explanatory modeling, to describe your data. You can also use predictive modeling to make predictions about future observations. Let’s briefly compare explanatory and predictive modeling.

In explanatory modeling, the goal is to develop a model that answers the question, how is X related to Y? Sample sizes are typically small and include few variables. The focus is on the parameters of the model. To assess the model, you use p-values and confidence intervals.

The goal of predictive modeling is to answer the question, if you know X, can you predict Y? Sample sizes are typically quite large and include many predictor variables, also called input variables. The focus is on the predictions of observations, rather than the parameters of the model. To assess a predictive model, you validate predictions using holdout sample data.

Populations and Samples

This post is a follow-up to the earlier post on Sampling with SAS (November 2016). Let us revisit and understand the terms.

Suppose the profile data of 5,000 randomly selected students is captured to get a sense of how technology research programs are run in India. These 5,000 students are selected from over half a million students enrolled over the last 10 years. In this scenario, the half million students are called the population and the 5,000 randomly selected, representative students are called the sample. In the same example, if around 250 random students out of the 5,000 are interviewed to capture various geo-personal details, then for that study the 250 students are the sample and the 5,000 students are the population.

Let us try to further understand this through some formal definitions.

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. You gather a sample so that you don't have to obtain data for the entire population. The sample should be representative of the population, meaning that the sample's characteristics are similar to the population's characteristics. One way to obtain a representative sample is to collect a simple random sample. With this sampling method, every possible sample of a given size in the population has an equal chance of being selected. Random sampling can help to ensure that the sample is representative of the population. You should avoid collecting your sample from a section of the population that is easily available to you. This is called convenience sampling, and it can lead to a biased sample that is not representative of the population from which it is drawn. A sample that's not representative can cause you to draw incorrect conclusions. Let's look at an example.

Suppose a university wants to estimate the percent of its freshmen who plan to return for their sophomore year. The population for this study is the entire set of 2,500 freshmen in attendance. Researchers gathered a representative sample of 100 freshmen by selecting 100 student ID numbers at random from the entire set of 2,500 freshmen. If the researchers had simply selected the first 100 freshmen who responded to an e-mail questionnaire, this would have resulted in a biased sample, which could lead to an incorrect estimate of the number who plan to return for their sophomore year. If you have a representative sample, you can make correct inferences about the entire population. In this post, we always assume that the sample is representative.
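A simple random sample like the one above could be drawn in SAS with PROC SURVEYSELECT. This is only a sketch; the data set name freshmen is hypothetical (one row per student), and the seed is arbitrary.

proc surveyselect data=freshmen      /* hypothetical data set: one row per freshman */
                  method=srs         /* simple random sampling                      */
                  n=100              /* select 100 students                         */
                  seed=27513         /* fixed seed so the selection is reproducible */
                  out=freshmen_sample;
run;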

Types of Variables

Please refer to the post on "Scale of Measurement" along with this one to get a better understanding of the variables used in statistical modeling.


Quantitative and Categorical:


Variables are also classified according to their characteristics. They can be quantitative or categorical. In order to plan a statistical analysis or interpret your results, you need to know which types of variables you have. Data that consists of counts or measurements is called quantitative data. You also hear this type of data referred to as numerical data. If you can perform arithmetic operations, like addition and subtraction, or take a sample average of your data, then you know that it is quantitative. Suppose you take a survey of the buying habits of families. An example of quantitative data in your survey is the age in years of the respondents. Age is a quantitative variable because it would make sense to compute the average age of individuals in a sample.

Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data consists of variables that can have only a countable number of values within a measurement range. That is, the values can be 0, 1, 2, 3, and so on. An example of discrete data is the number of children in a family. A family can have two or three children, but not 2.65 children. Continuous data consists of variables that are measured on a scale that has an infinite number of values and has no breaks or jumps. An example of a continuous variable is gas mileage. The gas mileage for a particular car might be 19 miles per gallon or 19.1 miles per gallon or 19.191034 miles per gallon, and so on. Remember that practical limitations can affect the precision of the measurement.

Categorical data consists of variables that denote groupings or labels. This type of data is also called attribute data. Categorical data can be distinguished from quantitative, because it does not make sense to perform arithmetic operations on categorical variables. For example, your survey includes a variable for the political party affiliation of survey respondents (Democrat, Republican, Independent, other). It doesn't make sense to try to add or average the responses Republican and Democrat.

There are two main types of categorical variables: nominal and ordinal. A nominal categorical variable exhibits no ordering within its observed levels, groups, or categories. Gender is an example of a nominal variable. There is no ordering to the groups male and female. The type of beverage you can order from a menu, such as soda, coffee, or juice, has no logical ordering to it, so it is also a nominal variable. Nominal categorical variables can be coded to appear numeric, but their numbers are meaningless. For example, the variable Gender can be coded 1 for male and 2 for female. These numbers are not inherently meaningful: they could be reversed, or replaced, by any random set of numbers. A variable that lies on a nominal scale is sometimes called a qualitative or classification variable.

With ordinal categorical variables, the observed levels of the variable can be ordered in some meaningful way that implies that the differences between the groups or categories are due to magnitude. Disease condition divided into categories of low, moderate, or severe is an example of an ordinal variable. The size of beverage you can order from a menu being small, medium, or large does have a logical order to it, so it is also an ordinal variable.
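To make the distinction concrete, here is a small sketch in SAS (the data values are made up for illustration): averaging a quantitative variable such as Age is meaningful, while a nominal variable such as Gender, even when coded 1/2, should be summarized with frequencies rather than a mean.

data survey;                  /* hypothetical survey responses */
   input Age Gender Party :$12.;
   datalines;
34 1 Democrat
52 2 Republican
41 2 Independent
29 1 Democrat
;
run;

proc means data=survey mean;  /* sensible: the average age of respondents */
   var Age;
run;

proc freq data=survey;        /* sensible: counts and percents for categorical data */
   tables Gender Party;
run;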

Statistical Methods (Category:Concepts, Level:Basic )

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Univariate analysis reveals patterns in the data, by looking at the range of values, measures of dispersion, the central tendency of the values, and frequency distribution. It also summarizes large amounts of data and organizes data into graphs and tables so that it is more easily understood.

Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. It includes techniques such as correlation analysis and chi-square tests of independence.

Multivariate or multivariable analysis examines two or more variables at the same time, in order to understand the relationships among them. Techniques such as multiple linear regression and n-way ANOVA are typically called multivariable analyses because there is only one response variable. Techniques such as factor analysis and clustering are typically called multivariate analyses because they consider more than one response variable. Multivariate linear regression and multivariate ANOVA (MANOVA) are extensions of these techniques when there is more than one response variable. You learn many of these statistical methods in this course.

Scales of Measurement (Category: Concept, Level:Basic)

You can refer to this post along with the "Types of Variables" post.

Key points/notes:

Nominal: Variables having categories with no ordering to the levels
Ordinal: Variables having categories with a meaningful ordering to the levels
Interval: No true zero point; example: the Fahrenheit scale
Ratio: True zero point, so the scale can accurately indicate the ratio of difference between two points on the measurement scale; example: the Kelvin scale


Nominal, Ordinal, Interval & Ratio:

Variables are classified differently depending on the characteristics of that variable. We often refer to a variable's classification as its scale of measurement. You need to know the scale of measurement for each variable in order to determine the statistical procedures appropriate for use with that variable. You already know two scales of measurement for categorical variables: nominal and ordinal. The nominal scale enables you to categorize or label variables such as gender or beverage type where there is no ordering to the levels of those variables. The ordinal scale indicates categories that can be ordered in a meaningful way, as in size of beverage or severity of disease.

There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered like ordinal data, but it also has a sensible spacing of observations such that differences between measurements are meaningful. For example, in measuring patient temperature, you can indicate specific differences in temperature, between the standard measurement of normal body temperature, 98.6 degrees F, and an observed body temperature of 98.2. Interval scales lack, however, the ability to calculate ratios between numbers on the scale. In the case of the Fahrenheit scale, for example, there is no true zero point. Zero does not imply the lack of temperature. Another example of an interval scale is pH value. Sea water, which has a pH of 8, is not twice as alkaline as tomato juice, which has a pH of 4.

Data on a ratio scale is not only rank-ordered with meaningful spacing, but it also includes a true zero point and can therefore accurately indicate the ratio of difference between two spaces on the measurement scale. For example, the Kelvin temperature scale has a true zero point. A temperature of 50 Kelvin is half as hot as 100 Kelvin. Another example of a ratio scale is money. If an individual has zero dollars, this does imply an absence of money. And one individual can have twice as much money as another.

Saturday, December 17, 2016

Scoring for Prediction : SAS

Introduction to Predictive Modeling:
Before you can predict values, you must first build a predictive model. Predictive modeling uses historical data to predict future outcomes. These predictions can then be used to make sound strategic decisions for the future. The process of building and scoring a predictive model has two main parts: building the predictive model on existing data, and then deploying the model to make predictions on new data (using a process called scoring). A predictive model consists of either a formula or rules based on a set of input variables that are most likely to predict the values of a target variable. Some common business applications of predictive modeling are: target marketing, credit scoring, and fraud detection.

Whether you are doing predictive modeling or inferential modeling, you want to select a model that generalizes well – that is, the model that best fits the entire population. You assume that the sample used to fit the model is representative of the population. However, any given sample typically has idiosyncrasies that are not found in the population. The model that best fits both the sample and the population is the model that has the right complexity.

An overly complex model might be too flexible. This leads to overfitting – that is, accommodating nuances of the random noise (the chance relationships) in the particular sample. Overfitting leads to models that have higher variance when applied to a population. For regression, including more terms in the model increases complexity.

On the other hand, an insufficiently complex model might not be flexible enough. This leads to underfitting – that is, systematically missing the signal (the true relationships). This leads to biased inferences, which are inferences that are not true of the population. A model with just enough complexity—which also means just enough flexibility—gives the best generalization. The important thing to realize is that there is no one perfect model; there is always a balance between too much flexibility (overfitting) and not enough flexibility (underfitting).

The first part of the predictive modeling process is building the model. There are two steps to building the model: fitting the model and then assessing model performance in order to select the model that will be deployed. To build a predictive model, a method called honest assessment is commonly used to ensure that the best model is selected. Honest assessment involves partitioning (that is, splitting) the available data—typically into two data sets: a training data set and a validation data set. Both data sets contain the inputs and the target. The training data set is used to fit the model. In the training data set, an observation is called a training case. Other synonyms for observation are example, instance, and record. The validation data set is a holdout sample that is used to assess model performance and select the model that has the best generalization. Honest assessment means that the assessment is done on a different data set than the one that was used to build the model. Using a holdout sample is an honest way of assessing how well the model generalizes to the population. Sometimes, the data is partitioned into three data sets. The third data set is a test data set that is used to perform a final test on the model before the model is used for scoring.

PROC GLMSELECT can build a model using honest assessment with a holdout data set (that is, a validation data set) in two ways. The method that you use depends on the state of your data before model building begins. If your data is already partitioned into a training data set and a validation data set, you can simply reference both data sets in the procedure. If you start with a single data set, PROC GLMSELECT can partition the data for you. The PARTITION statement specifies how PROC GLMSELECT logically partitions the cases in the input data set into holdout samples for model validation and, if desired, testing. You use the FRACTION option to specify the fraction (that is, the proportion) of cases in the input data set that are randomly assigned a testing role and a validation role.

The PARTITION statement requires a pseudorandom number generator to start the random selection process, and an integer is required to start that process. If you need to be able to reproduce your results in the future, specify an integer greater than zero in the SEED= option. Then, whenever you run the PROC GLMSELECT step using that seed value, the pseudorandom selection process is replicated and you get the same results. In most situations, it is recommended that you use the SEED= option and specify an integer greater than zero.
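As a sketch of this approach (the data set and variable names here are placeholders, and the fraction and seed are arbitrary):

proc glmselect data=work.mydata seed=27513;   /* a seed greater than zero makes the split reproducible */
   model y = x1 x2 x3;
   partition fraction(validate=0.3);          /* hold out 30% of the cases for validation              */
run;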

Scoring Predictive Models
After you build a predictive model, you are ready to deploy it. To score new data—referred to here as scoring data—you can use PROC GLMSELECT and PROC PLM. Before you start using a newly built model to score data, some preparation of the model, the data, or both is usually required.
For example, in some applications, such as fraud detection, the model might need to be integrated into an online monitoring system before it can be deployed. It is essential for the scoring data to be comparable to the training data and validation data that were used to build the model. The same modifications that were made to the training data must be made to the validation data before validating the model and to the scoring data before scoring.

When you score, you do not rerun the algorithm that was used to build the model. Instead, you apply the score code—that is, the equations obtained from the final model—to the scoring data. There are three methods for scoring your data:

Method 1
Use a SCORE statement in PROC GLMSELECT.

The first method is useful because you can build and score a model in one step. However, this method is inefficient if you want to score more than once or use a large data set to build a model. With this method, the model must be built from the training data each time the program is run.
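A sketch of method 1 (the data set and variable names are placeholders):

proc glmselect data=work.train;
   model y = x1 x2 x3;
   score data=work.new_data out=work.scored;  /* build the model and score new data in one step */
run;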

Method 2
Use a STORE statement in PROC GLMSELECT.
Use a SCORE statement in PROC PLM.

The second method enables you to build the model only once, along with an item store, using PROC GLMSELECT. You can then use PROC PLM to score new data using the item store. Separating the code for model building and model scoring is especially helpful if your model is based on a very large training data set or if you want to score more than once. Potential problems with this method are that others might not be able to use this code with earlier versions of SAS, or you might not want to share the entire item store.

Method 3
Use a STORE statement in PROC GLMSELECT.
Use a CODE statement in PROC PLM to output SAS code.
Use a DATA step for scoring.

The third method uses PROC PLM to write detailed scoring code, based on the item store, that is compatible with earlier versions of SAS. You can provide this code to others without having to share other information that is in the item store. The DATA step is then used for scoring.

Syntax

PROC GLMSELECT DATA=trainingdataset
       VALDATA=validationdataset;
       MODEL target(s)=input(s) </ options>;
RUN;

PROC GLMSELECT DATA=trainingdataset
                 <SEED=number>;
        MODEL target(s)=input(s) </ options>;
        PARTITION FRACTION(<TEST=fraction><VALIDATE=fraction>);
RUN;

Sample Programs

Building a Predictive Model

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
                       Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces
                      Season_Sold Garage_Type_2 Foundation_2 Heating_QC
                     Masonry_Veneer Lot_Shape_2 Central_Air;
ods graphics;

proc glmselect data=statdata.ameshousing3
         plots=all
         valdata=statdata.ameshousing4;
         class &categorical / param=glm ref=first;
         model SalePrice=&categorical &interval /
                 selection=backward
                 select=sbc
                 choose=validate;
          store out=work.amesstore;
          title "Selecting the Best Model using Honest Assessment";
run;

title;


Scoring Data
Example:

proc plm restore=work.amesstore;
    score data=statdata.ameshousing4 out=scored;
    code file="myfilepath/scoring.sas";
run;

data scored2;
    set statdata.ameshousing4;
    %include "myfilepath/scoring.sas";
run;

proc compare base=scored compare=scored2 criterion=0.0001;
    var Predicted;
    with P_SalePrice;
run;

Knowing your SAS Installation/Licenses/Version- Quick method


Ever wondered how you can find the SAS products installed at your site? Below are a couple of quick tricks you can use; the results appear in the log.



PROC SETINIT;
QUIT;


PROC PRODUCT_STATUS;
QUIT;


These two SAS procedures are very useful for finding information about:


  • SAS products Installed
  • Information about version
  • Information about license expiry dates



Remember that you usually have a 45-day warning period and another 45-day grace period. Those who have worked on migration projects will appreciate that all migration activities are planned to finish before the grace period starts. Keep the grace period for contingency plans; SAS admins may not volunteer that information.

Tuesday, December 13, 2016

Problem : Counterfeit Currency



Currency notes are received in a bank in large numbers. The notes are checked, and if a note is counterfeit, it is detected with some probability p (< 1). Notes (currencies) enter the bank on average once every 2 months. Given that the bank detected x fake notes out of N submitted notes, how many counterfeit notes did it actually encounter?




Thanks
Learner

Quick Question - Sherlock Holmes

According to Sherlock Holmes,


While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will be up to, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician.


The statistician doesn't quite say that. What is Sherlock Holmes forgetting?








Thanks
Learner

Tuesday, December 6, 2016

Introduction to Random Variables and Distributions ( Category Concepts, Level : Intermediate)




The post below will help you understand the concept of random variables and the distributions they follow. These concepts, along with hypothesis testing, cover a good part of core analytics. Feel free to comment.

 Understanding Random Experiments and Random Variables

      Certain variables are generated as some activities are carried out. A flight takes a certain amount of time; a machine, when operated, consumes a certain amount of power per unit time; a car travels a certain number of miles before encountering a fault. The activities giving rise to the data are called random experiments. These are the data-generating mechanisms. The generated data are often referred to as random variables.

      Technically, random variables are mappings that assign a real number to every outcome of the activity. We often write this as X: Ω → R, where Ω is the sample space containing all possible outcomes of the random experiment

 

Random Variables

      Random variables take  values from a predefined set. Thus all possible values of the random variable are known in advance. However, the exact value that will occur in the next occurrence is unknown.

      We do not know when a machine will fail, what the power consumption of a device will be in a given period, what the efficiency of a machine will be, or how many defective products will be produced during the next production run…

 

Concept of Distributions

 
      It is often assumed that the process that generates the random variable can be modeled using a mathematical function. This function enables us to calculate the probability that the values of a random variable will fall within any given range

      The mathematical function is defined in terms of the random variable and some unknown but fixed constants called parameters

      The distribution (mathematical function) characterizes the pattern of variation of the random variable

      We may also construct the frequency distribution to visualize the pattern of variation. When we have a very large number of observations drawn randomly from a population, the frequency distribution may be a good approximation of the distribution.

      The frequency distribution is a non-parametric way of looking at a distribution and it has its own advantages and disadvantages.

 

 

Understanding  Density and Mass Functions

 

      Random variables may be discrete or continuous. A random variable is said to be discrete if it takes values from a set X = {x1, x2, …}, where X is countable (a set is said to be countable if it is either finite or can be put in a one-to-one correspondence with the integers)

      Discrete random variables have probability mass functions (pmf) that place point masses on the possible values

      Continuous random variables have density functions (pdf) as given below. At any point x the pdf gives the height of the curve, and the area under the curve from x0 to x1 gives the probability that the random variable assumes a value in that range. In this sense the pdf behaves like a density, where mass can be computed from density and volume
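In symbols, for a continuous random variable X with pdf fX (a standard statement of this fact):

\[
P(x_0 \le X \le x_1) = \int_{x_0}^{x_1} f_X(x)\,dx,
\qquad f_X(x) \ge 0,
\qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1.
\]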

Properties of Mass and Density Functions


       Let fX(x) be a probability mass function (pmf). Then fX(x) ≥ 0 for all x ∈ R and Σi fX(xi) = 1

       In case the random variable X is continuous, we have fX(x) ≥ 0 for all x ∈ R, and the integral of fX over the entire real line (the total area under the curve) equals 1

       For a continuous distribution we integrate over the range x0 to x1 to get the probability. For a pmf we take the sum.

Cumulative Distribution Function

       The Cumulative Distribution Function (often referred to as the Distribution Function) is the function FX : R → [0, 1] defined by FX(x) = P(X ≤ x)

       Recall that we visualize the CDF (often referred to as the DF) through the ogive

       When either f (pdf or pmf) or F (CDF) is available, we can compute the probability of all events concerning the random experiment or, equivalently, the probability of all subsets of values of the random variable

       Knowledge of the density, mass, or distribution function is, therefore, a convenient way to understand the pattern of variation of a random variable from a quantitative perspective

 

  1)    Some important discrete distributions

 

  1.1)  Binomial Distribution

         In many real life situations it is important to determine the number or proportion of successes/failures when trials are conducted independently of each other. For example one may count the number of successful bids, the proportion of automobiles that pass a  test in the first go, the number of program units found to be defect free when tested for the first time and so on. These random variables can be modeled using the Binomial distribution.

The Binomial distribution works as follows:

      Let us assume that a trial has been conducted n times. Note that the trials must be independent of each other.

      Let p be the proportion of success. Note that this proportion must remain reasonably the same over the period of experimentation.

      Let the random variable, number of successes, be X.

      Then the probability mass function is given by

P(X = x) = nCx p^x (1 − p)^(n−x);  x = 0, 1, 2, …, n

      In case we are interested in the probability mass function of the proportion of successes x/n rather than the number of successes, the same probabilities apply:

P(X/n = x/n) = nCx p^x (1 − p)^(n−x);  x = 0, 1, 2, …, n

Usages of Binomial Distribution

      The Binomial distribution is a one-parameter distribution (the number of trials n is fixed by the design of the experiment). The proportion of successes p is the unknown parameter. The independent trials of the binomial distribution are called Bernoulli trials

      We estimate p from data (how?)

      Given n and p, we can compute probability of any event
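For example, a sketch of such a computation in SAS using the built-in PDF and CDF functions (the values n = 10 and p = 0.3 are made up for illustration):

data binomial_example;
   n = 10;  p = 0.3;                            /* assumed number of trials and success probability */
   p_exactly_3  = pdf('BINOMIAL', 3, p, n);     /* P(X = 3)  */
   p_at_most_3  = cdf('BINOMIAL', 3, p, n);     /* P(X <= 3) */
   p_at_least_4 = 1 - cdf('BINOMIAL', 3, p, n); /* P(X >= 4) */
   put p_exactly_3= p_at_most_3= p_at_least_4=;
run;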


 1.2 Geometric Distribution

       Let the random variable X be the number of Bernoulli trials required to get the first success, where the trials are conducted independently of each other and the probability of success remains constant

       X takes values 1, 2, 3, ….

       P(X = k) = p (1 − p)^(k − 1),  k ≥ 1

 1.3 Negative Binomial Distribution

       This is an extension of the geometric distribution

       In this distribution we try to model the number of trials required to get r successes. The random variable is denoted as x where x is the excess number of trials required over r to get r successes

       P(X = x) = [(x + r − 1)! / ((r − 1)! x!)] p^r (1 − p)^x,  x = 0, 1, …;  0 < p < 1

Some Examples

  1. Suppose in a manufacturing environment defects are rather rare. We may use the geometric distribution to estimate the number of attempts required to get the first defective item.
  2. We may use the geometric distribution to estimate the number of good products produced before getting the first defective product
  3. The negative binomial distribution may be used to find the number of sales calls required to close a fixed number of orders
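As a numeric sketch of examples 1 and 3 (the rates p = 0.02 and p = 0.2, and the values of k, r, and x, are assumptions for illustration), the formulas above can be evaluated directly in a DATA step:

data geom_negbin_example;
   /* Geometric: probability the first defective item appears on the k-th inspection */
   p = 0.02;   k = 10;
   p_first_defect_at_k = p * (1 - p)**(k - 1);

   /* Negative binomial: probability of x extra calls beyond r to close r orders     */
   p_call = 0.2;   r = 5;   x = 12;
   p_extra_calls = (fact(x + r - 1) / (fact(r - 1) * fact(x)))
                   * p_call**r * (1 - p_call)**x;

   put p_first_defect_at_k= p_extra_calls=;
run;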

 
 1.4 Poisson Distribution

       In real-life situations the Poisson distribution is applicable whenever a particular trial is conducted many times, independently of the others, and the probability of the event of interest in any one trial is very small.

       Take, for instance, the number of defects found in a specification document. It may be assumed that the probability of making a mistake in any one line (or a small segment) is very small and the errors occur independently of each other.

       Going by the same logic, number of accidents, number of breakdowns, number of unreadable messages and similar random variables will follow Poisson Distribution. (Why?)

       When p is small but n is large so that np remains finite, the Binomial distribution approaches the Poisson distribution. Note that large n typically means n > 30.

       Let λ be the average number of occurrences of an event.

       Let X be the random variable. Then P(X = x) = (e^(−λ) λ^x) / x!;  x = 0, 1, 2, …

       Note that the only parameter of a Poisson distribution is λ, the average number of occurrences of the event.

       Poisson distribution is typically used for count data – number of breakdowns, number of accidents, number of defects on a product, number of faults generated during a production run of a machine, number of customers who come to a store

Some Examples

       Suppose that in the tool crib of a machine shop, on average 2 requests for issuing tools are made in an hour. What is the chance that in a particular hour

      5 or more requests will come
      No request will come

       Suppose you are in charge of manpower planning for a team responsible for maintaining software. Consider a simplistic environment where only one type of request comes. From your past observations you have seen that on average 3 requests come in a day and it usually takes a full day to service a request. Considering this, you have allocated three persons.

      What is the chance that at least one person will remain unutilized on any given day?

      What is the chance that you will not be able to service customers, assuming that customer service will be adversely impacted in case 4 or more requests arrive on any given day? (A quick way to compute these probabilities is sketched after this list.)
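A sketch of how these probabilities could be computed with SAS's Poisson PDF and CDF functions (here "at least one person unutilized" is read as fewer than 3 requests arriving):

data poisson_example;
   /* Tool crib: requests arrive at an average rate of 2 per hour        */
   lambda1 = 2;
   p_5_or_more = 1 - cdf('POISSON', 4, lambda1);    /* P(X >= 5) */
   p_none      = pdf('POISSON', 0, lambda1);        /* P(X = 0)  */

   /* Maintenance team: on average 3 requests per day, 3 persons on hand */
   lambda2 = 3;
   p_idle_person  = cdf('POISSON', 2, lambda2);     /* P(X <= 2) */
   p_cannot_serve = 1 - cdf('POISSON', 3, lambda2); /* P(X >= 4) */

   put p_5_or_more= p_none= p_idle_person= p_cannot_serve=;
run;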

 

1.5 Discrete Uniform Distribution


       This is often referred to as the equally likely outcome distribution and is equivalent to assuming minimum knowledge about the distribution

       A random variable X is said to follow a discrete uniform distribution in case it takes values 1, 2, …, k such that P(X = j) = 1 / k for all j = 1, 2, …, k

  2) Continuous distributions

   2.1 Normal Distribution

       One of the most common distributions for continuous random variables is the Normal distribution.

       When a random variable is expected to be roughly symmetrical around its mean and tend to cluster around the mean value, the variable follows a Normal distribution.

       Many continuous variables like length, weight or other dimensions / characteristics of a manufactured product; the difference between planned and actual effort or time; amount of material consumed during a production run and so on are expected to follow normal distribution

     PDF of Normal Distribution

       Let X be a random variable, which follows a Normal distribution.
       Let μ be the mean value of X.
       Let σ be the standard deviation of X.
        Then the probability density function is given by

p(x) = {1 / (σ√(2π))} exp[−(x − μ)² / (2σ²)]

       The two parameters of the Normal distribution are μ and σ respectively, where μ and σ² represent the mean and variance of the random variable

   Importance of Normal Distribution 

       Two main reasons for the importance of Normal distribution are:

      The tendency of sums or averages of independently drawn random observations from markedly non-normal distributions to closely approximate Normal distributions (see the example below), and
      The robustness or insensitivity of many commonly used statistical procedures to departures from theoretical normality.
 

Convergence to Normal Distribution – An Example
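A simulation sketch of this convergence in SAS (the sample size of 30 observations and the 2,000 simulated samples are arbitrary choices): draw many samples from a markedly non-normal (uniform) distribution and look at the distribution of their means, which is close to normal.

data clt_demo;
   call streaminit(27513);                /* fixed seed for reproducibility   */
   do sample = 1 to 2000;                 /* simulate 2000 samples            */
      total = 0;
      do i = 1 to 30;                     /* each sample has 30 uniform draws */
         total = total + rand('UNIFORM');
      end;
      mean_x = total / 30;                /* the sample mean                  */
      output;
   end;
   keep sample mean_x;
run;

proc univariate data=clt_demo;
   var mean_x;
   histogram mean_x / normal;             /* the sample means look approximately normal */
run;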



 2.2 Uniform Distribution


       Let X be a random variable that takes values between two real numbers a and b (b > a), with the pdf f(x) = 1 / (b − a) whenever a ≤ x ≤ b and 0 otherwise

       It is important to note that in a uniform distribution the maximum and minimum values are fixed and all values between the minimum and maximum occur with equal likelihood

Usage of Uniform Distribution

       This distribution is very important from the perspective of random number generation. The computer generated random numbers are drawn from uniform distribution. In general these random numbers follow the U(0,1) distribution

       The distribution functions of all continuous distributions follow the uniform distribution, i.e. if F is the distribution function of a random variable X, then Y = F(X) ~ U(0,1). This is a very important property and we will see its usage later

2.3 Exponential Distribution


       This distribution is widely used to describe events recurring at random points in time. Typical examples include time between arrival of customers at a service booth; time between failures of a machine and so on.
       The exponential and Poisson distributions are related. In case the time between successive occurrences of an event follows an exponential distribution then the number of occurrences in a given period follows a Poisson distribution.
       The probability density function of the exponential distribution is given by f(x) = λ e^(−λx),  x ≥ 0, λ > 0
       The mean of exponential distribution is 1/λ
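For example (a sketch with an assumed arrival rate): if arrivals occur at a rate of λ = 3 per hour, the probability that the gap between two consecutive arrivals exceeds half an hour is exp(−λx) = exp(−1.5) ≈ 0.22.

data exponential_example;
   lambda = 3;                           /* assumed arrivals per hour          */
   x = 0.5;                              /* gap of half an hour                */
   p_gap_exceeds_x = exp(-lambda * x);   /* P(time between arrivals > x)       */
   mean_gap = 1 / lambda;                /* mean time between arrivals (hours) */
   put p_gap_exceeds_x= mean_gap=;
run;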

2.4 Weibull Distribution


       This distribution describes data resulting from life and fatigue tests. It is often used to describe failure times in reliability studies as well as breaking strength of materials. Weibull distributions are also used to represent various physical quantities like wind speed
       Weibull pdf: f(x) = (α / β^α) x^(α − 1) exp[−(x / β)^α], where x ≥ 0, α > 0, β > 0

Summary of the Distributions

 

Name                 Parameters    Mean               Variance
Binomial             p             np                 np(1 − p)
Poisson              λ             λ                  λ
Geometric            p             1 / p              (1 − p) / p²
Negative Binomial    p             r(1 − p) / p       r(1 − p) / p²
Discrete Uniform     K             (K + 1) / 2        (K − 1)(K + 1) / 12
Uniform              a, b          (a + b) / 2        (b − a)² / 12
Normal               μ, σ²         μ                  σ²
Exponential          λ             1 / λ              1 / λ²
Weibull              α, β          β Γ(1 + 1/α)       β² {Γ(1 + 2/α) − (Γ(1 + 1/α))²}

  
Three Usages of Distributions


       Distributions of random variables may be used from three perspectives, namely:
      To find the probability of any event
      To compare two or more distributions, or to compare the effectiveness of some actions
      To simulate real-life scenarios using the distributional models, in order to develop an understanding of the phenomenon