
Tuesday, December 27, 2016

Descriptive and Inferential Statistics (Category:Concept, Level:Basic)


Descriptive statistics organize, describe, and summarize data using numbers and graphical techniques. This branch of statistics uses a set of standard measures such as percent, averages, and variability, as well as simple graphs, charts, and tables. Descriptive statistics help you to better understand your data by describing and summarizing its basic features. You learn how to generate and understand numerical summaries. These include frequency; measures of location, including minimum, maximum, percentiles, quartiles, and central tendency (mean, median, and mode); and measures of dispersion or variability, including range, interquartile range, variance, and standard deviation. The graphical summaries you learn include the histogram, normal probability plot, and box plot. The goals when you're describing data are to:
  • screen for unusual data values,
  • inspect the spread and shape of your data,
  • characterize the central tendency, and
  • draw preliminary conclusions about your data.
Is the data as error-free as possible? What unique features can you identify? Are there data values that cluster or show some unusual shape? Does your data include any possible outliers? Once you have a basic understanding of your data, you can use inferential statistics.
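Before moving on to inference, here is a minimal SAS sketch of the numerical and graphical summaries described above. The data set work.survey and the numeric variable Age are hypothetical placeholders.

proc means data=work.survey n mean median min max q1 q3 range qrange std;
   var Age;
run;

proc univariate data=work.survey;
   var Age;
   histogram Age;                            /* shape of the distribution */
   probplot Age / normal(mu=est sigma=est);  /* normal probability plot */
run;

proc sgplot data=work.survey;
   vbox Age;                                 /* box plot to screen for outliers */
run;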

Inferential statistics is the branch of statistics concerned with drawing conclusions about a population from analysis of a random sample drawn from that population. It is also concerned with the precision and reliability of those inferences. Inferential statistics generalize from data you observe to the population that you have not observed. Descriptive statistics describe your sample data, but inferential statistics help you draw conclusions about the entire population of data. Descriptive statistics can also be referred to as exploratory data analysis, or EDA. Inferential statistics can also be called explanatory modeling.

Explanatory versus Predictive Modeling


Before beginning your analysis, you should use descriptive statistics to explore your data. After getting familiar with your data, you can use inferential statistics, or explanatory modeling, to explain relationships in your data and draw conclusions about the population. You can also use predictive modeling to make predictions about future observations. Let's briefly compare explanatory and predictive modeling.

In explanatory modeling, the goal is to develop a model that answers the question, how is X related to Y? Sample sizes are typically small and include few variables. The focus is on the parameters of the model. To assess the model, you use p-values and confidence intervals.

The goal of predictive modeling is to answer the question, if you know X, can you predict Y? Sample sizes are typically quite large and include many predictor variables, also called input variables. The focus is on the predictions of observations, rather than the parameters of the model. To assess a predictive model, you validate predictions using holdout sample data.
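As a rough sketch of the explanatory side, the step below fits a simple model and requests the p-values and confidence limits used to assess it. The data set work.study and the variables Y and X are hypothetical placeholders.

proc reg data=work.study;
   model Y = X / clb;   /* parameter estimates with p-values and 95% confidence limits */
run;
quit;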

Populations and Samples

This post is a follow-up to the earlier post on Sampling with SAS (November 2016). Let us revisit and understand the terms.

Suppose profile data is captured for 5,000 randomly selected students to get a sense of how technology research programs are run in India. These 5,000 students are drawn from over half a million students enrolled over the last 10 years. In this scenario, the half a million students are the population and the 5,000 randomly selected, representative students are the sample. In the same example, if around 250 random students out of the 5,000 are interviewed to collect various geo-personal data, then for that study the 250 students are the sample and the 5,000 students are the population.

Let us try to further understand this through some formal definitions.

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. You gather a sample so that you don't have to obtain data for the entire population. The sample should be representative of the population, meaning that the sample's characteristics are similar to the population's characteristics. One way to obtain a representative sample is to collect a simple random sample. With this sampling method, every possible sample of a given size in the population has an equal chance of being selected. Random sampling can help to ensure that the sample is representative of the population. You should avoid collecting your sample from a section of the population that is easily available to you. This is called convenience sampling, and it can lead to a biased sample that is not representative of the population from which it is drawn. A sample that's not representative can cause you to draw incorrect conclusions. Let's look at an example.

Suppose a university wants to estimate the percent of its freshmen who plan to return for their sophomore year. The population for this study is the entire set of 2,500 freshmen in attendance. Researchers gathered a representative sample of 100 freshmen by selecting 100 student ID numbers at random from the entire set of 2,500 freshmen. If the researchers had simply selected the first 100 freshmen who responded to an e-mail questionnaire, this would have resulted in a biased sample, which could lead to an incorrect estimate of the number who plan to return for their sophomore year. If you have a representative sample, you can make correct inferences about the entire population. In this post, we always assume that the sample is representative.
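Following the freshman example, here is a minimal sketch of drawing a simple random sample with PROC SURVEYSELECT. The data set name work.freshmen is a hypothetical placeholder, and the seed value is arbitrary.

proc surveyselect data=work.freshmen
                  method=srs               /* simple random sampling */
                  n=100                    /* sample size */
                  seed=2016                /* makes the selection reproducible */
                  out=work.freshmen_sample;
run;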

Types of Variables

Please refer to the post on "Scale of Measurement" along with this one to get a better understanding of the variables used in statistical modeling.


Quantitative and Categorical:


Variables are also classified according to their characteristics. They can be quantitative or categorical. In order to plan a statistical analysis or interpret your results, you need to know which types of variables you have. Data that consists of counts or measurements is called quantitative data. You also hear this type of data referred to as numerical data. If you can perform arithmetic operations, like addition and subtraction, or take a sample average of your data, then you know that it is quantitative. Suppose you take a survey of the buying habits of families. An example of quantitative data in your survey is the age in years of the respondents. Age is a quantitative variable because it would make sense to compute the average age of individuals in a sample.

Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data consists of variables that can have only a countable number of values within a measurement range. That is, the values can be 0, 1, 2, 3, and so on. An example of discrete data is the number of children in a family. A family can have two or three children, but not 2.65 children. Continuous data consists of variables that are measured on a scale that has an infinite number of values and has no breaks or jumps. An example of a continuous variable is gas mileage. The gas mileage for a particular car might be 19 miles per gallon or 19.1 miles per gallon or 19.191034 miles per gallon, and so on. Remember that practical limitations can affect the precision of the measurement.

Categorical data consists of variables that denote groupings or labels. This type of data is also called attribute data. Categorical data can be distinguished from quantitative, because it does not make sense to perform arithmetic operations on categorical variables. For example, your survey includes a variable for the political party affiliation of survey respondents (Democrat, Republican, Independent, other). It doesn't make sense to try to add or average the responses Republican and Democrat.

There are two main types of categorical variables: nominal and ordinal. A nominal categorical variable exhibits no ordering within its observed levels, groups, or categories. Gender is an example of a nominal variable. There is no ordering to the groups male and female. The type of beverage you can order from a menu, such as soda, coffee, or juice, has no logical ordering to it, so it is also a nominal variable. Nominal categorical variables can be coded to appear numeric, but their numbers are meaningless. For example, the variable Gender can be coded 1 for male and 2 for female. These numbers are not inherently meaningful: they could be reversed, or replaced, by any random set of numbers. A variable that lies on a nominal scale is sometimes called a qualitative or classification variable.

With ordinal categorical variables, the observed levels of the variable can be ordered in some meaningful way that implies that the differences between the groups or categories are due to magnitude. Disease condition divided into categories of low, moderate, or severe is an example of an ordinal variable. The size of beverage you can order from a menu being small, medium, or large does have a logical order to it, so it is also an ordinal variable.
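A small sketch of the point about coded categorical variables: even when a nominal variable such as Gender is stored as 1 and 2, it should be summarized with counts rather than averages. The data set work.survey and the variables Gender and Party are hypothetical placeholders, and the 1/2 coding follows the example above.

proc format;
   value gendfmt 1='Male' 2='Female';   /* labels for the coded nominal variable */
run;

proc freq data=work.survey;
   tables Gender Party;                 /* counts and percents per level */
   format Gender gendfmt.;
run;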

Statistical Methods (Category: Concepts, Level: Basic)

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Univariate analysis reveals patterns in the data, by looking at the range of values, measures of dispersion, the central tendency of the values, and frequency distribution. It also summarizes large amounts of data and organizes data into graphs and tables so that it is more easily understood.

Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. It includes techniques such as correlation analysis and chi-square tests of independence.
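As a minimal sketch of these two bivariate techniques in SAS, assuming hypothetical data set and variable names:

proc corr data=work.survey;
   var Age Income;                 /* correlation between two quantitative variables */
run;

proc freq data=work.survey;
   tables Gender*Party / chisq;    /* chi-square test of independence for two categorical variables */
run;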

Multivariate or multivariable analysis examines two or more variables at the same time, in order to understand the relationships among them. Techniques such as multiple linear regression and n-way ANOVA are typically called multivariable analyses because there is only one response variable. Techniques such as factor analysis and clustering are typically called multivariate analyses because they consider more than one response variable. Multivariate linear regression and multivariate ANOVA (MANOVA) are extensions of these techniques when there is more than one response variable. Many of these statistical methods are covered in other posts on this blog.
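A minimal multivariable sketch with one response and several predictors, mixing ANOVA-style class effects and a continuous predictor. The data set and variable names are hypothetical placeholders.

proc glm data=work.survey;
   class Gender Region;                             /* categorical predictors */
   model Income = Gender Region Gender*Region Age;  /* two-way ANOVA terms plus a continuous predictor */
run;
quit;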

Scales of Measurement (Category: Concept, Level:Basic)

You can refer to this post along with the "Types of Variables" post.

Key points/notes:

Nominal: Variables having categories without any ordering to the levels
Ordinal: Variables having categories with a meaningful ordering to the levels
Interval: No true zero point; differences between values are meaningful. Example: the Fahrenheit scale
Ratio: True zero point; can accurately indicate the ratio between two points on the measurement scale. Example: the Kelvin scale


Nominal, Ordinal, Interval & Ratio:

Variables are classified differently depending on the characteristics of that variable. We often refer to a variable's classification as its scale of measurement. You need to know the scale of measurement for each variable in order to determine the statistical procedures appropriate for use with that variable. You already know two scales of measurement for categorical variables: nominal and ordinal. The nominal scale enables you to categorize or label variables such as gender or beverage type where there is no ordering to the levels of those variables. The ordinal scale indicates categories that can be ordered in a meaningful way, as in size of beverage or severity of disease.

There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered like ordinal data, but it also has a sensible spacing of observations, such that differences between measurements are meaningful. For example, when measuring patient temperature, you can quantify the specific difference between the standard normal body temperature of 98.6 degrees F and an observed body temperature of 98.2 degrees F. Interval scales, however, lack the ability to calculate ratios between numbers on the scale. In the case of the Fahrenheit scale, for example, there is no true zero point: zero does not imply the absence of temperature. Another example of an interval scale is pH value. Sea water, which has a pH of 8, is not twice as alkaline as tomato juice, which has a pH of 4.

Data on a ratio scale is not only rank-ordered with meaningful spacing, but it also includes a true zero point and can therefore accurately indicate the ratio of difference between two spaces on the measurement scale. For example, the Kelvin temperature scale has a true zero point. A temperature of 50 Kelvin is half as hot as 100 Kelvin. Another example of a ratio scale is money. If an individual has zero dollars, this does imply an absence of money. And one individual can have twice as much money as another.

Saturday, December 17, 2016

Scoring for Prediction: SAS

Introduction to Predictive Modeling:
Before you can predict values, you must first build a predictive model. Predictive modeling uses historical data to predict future outcomes. These predictions can then be used to make sound strategic decisions for the future. The process of building and scoring a predictive model has two main parts: building the predictive model on existing data, and then deploying the model to make predictions on new data (using a process called scoring). A predictive model consists of either a formula or rules based on a set of input variables that are most likely to predict the values of a target variable. Some common business applications of predictive modeling are: target marketing, credit scoring, and fraud detection.

Whether you are doing predictive modeling or inferential modeling, you want to select a model that generalizes well – that is, the model that best fits the entire population. You assume that the sample used to fit the model is representative of the population. However, any given sample typically has idiosyncrasies that are not found in the population. The model that best fits both the sample and the population is the model that has the right complexity.

An overly complex model might be too flexible. This leads to overfitting – that is, accommodating nuances of the random noise (the chance relationships) in the particular sample. Overfitting leads to models that have higher variance when applied to a population. For regression, including more terms in the model increases complexity.

On the other hand, an insufficiently complex model might not be flexible enough. This leads to underfitting – that is, systematically missing the signal (the true relationships). This leads to biased inferences, which are inferences that are not true of the population. A model with just enough complexity—which also means just enough flexibility—gives the best generalization. The important thing to realize is that there is no one perfect model; there is always a balance between too much flexibility (overfitting) and not enough flexibility (underfitting).

The first part of the predictive modeling process is building the model. There are two steps to building the model: fitting the model and then assessing model performance in order to select the model that will be deployed. To build a predictive model, a method called honest assessment is commonly used to ensure that the best model is selected. Honest assessment involves partitioning (that is, splitting) the available data—typically into two data sets: a training data set and a validation data set. Both data sets contain the inputs and the target. The training data set is used to fit the model. In the training data set, an observation is called a training case. Other synonyms for observation are example, instance, and record. The validation data set is a holdout sample that is used to assess model performance and select the model that has the best generalization. Honest assessment means that the assessment is done on a different data set than the one that was used to build the model. Using a holdout sample is an honest way of assessing how well the model generalizes to the population. Sometimes, the data is partitioned into three data sets. The third data set is a test data set that is used to perform a final test on the model before the model is used for scoring.

PROC GLMSELECT can build a model using honest assessment with a holdout data set (that is, a validation data set) in two ways. The method that you use depends on the state of your data before model building begins. If your data is already partitioned into a training data set and a validation data set, you can simply reference both data sets in the procedure. If you start with a single data set, PROC GLMSELECT can partition the data for you. The PARTITION statement specifies how PROC GLMSELECT logically partitions the cases in the input data set into holdout samples for model validation and, if desired, testing. You use the FRACTION option to specify the fraction (that is, the proportion) of cases in the input data set that are randomly assigned a testing role and a validation role.

The PARTITION statement uses a pseudorandom number generator to perform the random selection, and an integer is required to start that process. If you need to be able to reproduce your results in the future, specify an integer greater than zero in the SEED= option. Then, whenever you run the PROC GLMSELECT step using that seed value, the pseudorandom selection process is replicated and you get the same results. In most situations, it is recommended that you use the SEED= option and specify an integer greater than zero.

Scoring Predictive Models
After you build a predictive model, you are ready to deploy it. To score new data (referred to here as scoring data), you can use PROC GLMSELECT and PROC PLM. Before you start using a newly built model to score data, some preparation of the model, the data, or both is usually required.
For example, in some applications, such as fraud detection, the model might need to be integrated into an online monitoring system before it can be deployed. It is essential for the scoring data to be comparable to the training data and validation data that were used to build the model. The same modifications that were made to the training data must be made to the validation data before validating the model and to the scoring data before scoring.

When you score, you do not rerun the algorithm that was used to build the model. Instead, you apply the score code—that is, the equations obtained from the final model—to the scoring data. There are three methods for scoring your data:

Method 1
Use a SCORE statement in PROC GLMSELECT.

The first method is useful because you can build and score a model in one step. However, this method is inefficient if you want to score more than once or use a large data set to build a model. With this method, the model must be built from the training data each time the program is run.
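A minimal sketch of Method 1, building and scoring in a single PROC GLMSELECT step. The macro variables &categorical and &interval are the ones defined in the sample program later in this post; the scoring data set statdata.ameshousing_new is a hypothetical placeholder for new data to be scored.

proc glmselect data=statdata.ameshousing3
               valdata=statdata.ameshousing4;
   class &categorical / param=glm ref=first;
   model SalePrice=&categorical &interval /
         selection=backward select=sbc choose=validate;
   score data=statdata.ameshousing_new out=work.scored_method1;  /* build and score in one step */
run;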

Method 2
Use a STORE statement in PROC GLMSELECT.
Use a SCORE statement in PROC PLM.

The second method enables you to build the model only once, along with an item store, using PROC GLMSELECT. You can then use PROC PLM to score new data using the item store. Separating the code for model building and model scoring is especially helpful if your model is based on a very large training data set or if you want to score more than once. Potential problems with this method are that others might not be able to use this code with earlier versions of SAS, or that you might not want to share the entire item store.

Method 3
Use a STORE statement in PROC GLMSELECT.
Use a CODE statement in PROC PLM to output SAS code.
Use a DATA step for scoring.

The third method uses PROC PLM to write detailed scoring code, based on the item store, that is compatible with earlier versions of SAS. You can provide this code to others without having to share other information that is in the item store. The DATA step is then used for scoring.

Syntax

PROC GLMSELECT DATA=trainingdataset
       VALDATA=validationdataset;
       MODEL target(s)=input(s) </ options>;
RUN;

PROC GLMSELECT DATA=trainingdataset
                 <SEED=number>;
        MODEL target(s)=input(s) </ options>;
        PARTITION FRACTION(<TEST=fraction><VALIDATE=fraction>);
RUN;

Sample Programs

Building a Predictive Model

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces
                 Season_Sold Garage_Type_2 Foundation_2 Heating_QC
                 Masonry_Veneer Lot_Shape_2 Central_Air;

ods graphics;

proc glmselect data=statdata.ameshousing3
               plots=all
               valdata=statdata.ameshousing4;
   class &categorical / param=glm ref=first;
   model SalePrice=&categorical &interval /
         selection=backward
         select=sbc
         choose=validate;
   store out=work.amesstore;
   title "Selecting the Best Model using Honest Assessment";
run;

title;


Scoring Data
Example:

proc plm restore=work.amesstore;
    score data=statdata.ameshousing4 out=scored;
    code file="myfilepath/scoring.sas";
run;

data scored2;
    set statdata.ameshousing4;
    %include "myfilepath/scoring.sas";
run;

proc compare base=scored compare=scored2 criterion=0.0001;
    var Predicted;
    with P_SalePrice;
run;