Wednesday, November 30, 2016

SAMPLING WITH SAS ( Category Concepts, Level: Basic)

Sampling with SAS
 
Terms you need to understand first
Population: The full set of data available for sampling. An investigator usually wants to generalize about a class of individuals; this class is called the population.
Eligible Universe: The target data available for sampling after applying the overall business goals, that is, a few selection or filter criteria.
There is a very thin line between the population and the eligible universe, and they often overlap. For example, the population may be all policies in the Policy database, while the eligible universe is all policies enrolled in the last year in that database.
Other definitions:
A sample is part of a population. A parameter is a numerical fact about a population. Usually a parameter cannot be determined exactly; it can only be estimated.
A statistic can be computed from a sample and used to estimate a parameter. A statistic is what the investigator knows. A parameter is what the investigator wants to know.
 
Some methods for choosing samples are likely to produce accurate estimates. Others are spoiled by selection bias or non-response bias. When thinking about a sample survey, ask yourself:
            What is the population? What do you want to estimate, i.e. what is the parameter?
            How was the sample selected: at random or otherwise?
            How many people did not respond at all? What was the response rate?
A large sample offers no protection against bias. If the selection procedure is biased, taking more observations only repeats the same bias on a larger scale.
Two Sampling methods:
Quota Sampling:
In quota sampling, the sample is handpicked by the investigators to resemble the population in some key ways. This method seems logical, but it often gives bad results. The reason: unintentional bias on the part of the interviewers when they choose subjects to interview.
 
Probability Sampling:
Probability methods for sampling use an objective chance process to pick the sample and leave no discretion to the interviewer. The hallmark of a probability method: the investigator can compute the chance that any particular individual in the population will be selected for the sample. Probability methods guard against bias because blind chance is impartial.
 
One probability method is simple random sampling. This means drawing subjects at random without replacement. Even with probability methods, bias may creep in, for example through non-response. The estimate then differs from the parameter because of both bias and chance error:
            Estimate = parameter + bias + chance error
Chance error is also called sampling error. Bias is also called non-sampling error.
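To make the difference between a parameter and a statistic concrete, here is a minimal sketch in SAS. Everything in it is an assumption made for illustration: the dataset names (population, srs_sample), the variable premium, the sizes and the seed values are all hypothetical. It simulates a population, draws a simple random sample without replacement, and prints both means so they can be compared.

```sas
/* Hypothetical population: 10,000 policies with a simulated premium */
data population;
    call streaminit(2016);                   /* arbitrary seed, for reproducibility */
    do policy_id = 1 to 10000;
        premium = 500 + 100 * rand('normal');
        output;
    end;
run;

/* Simple random sample of 200 rows, drawn without replacement */
proc surveyselect data=population method=srs n=200
                  seed=2016 out=srs_sample noprint;
run;

/* Parameter: the mean over the whole population */
proc means data=population mean;
    var premium;
run;

/* Statistic: the mean over the sample, used to estimate the parameter */
proc means data=srs_sample mean;
    var premium;
run;
```

Because the sample here is drawn by a chance process from the full population, the gap between the two means should reflect chance error only. Changing the seed changes the sample and therefore the size of that gap.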
 
Sampling with SAS:
There are various ways of extracting a sample from a large SAS dataset. Two of the most common methods for random sampling are listed below.
 
i) Using RANUNI function:
The RANUNI function returns a pseudo-random number from the uniform distribution on the interval (0,1). We use it to assign a random number to every observation of a SAS dataset. We can then sort the dataset on the newly added variable and keep the top n observations (assuming we need a sample of n out of N observations).
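The RANUNI approach can be sketched as follows. This is only an illustration under stated assumptions: it reuses the SASUSER.BIDW dataset from the example later in this post, while the variable name rand_key, the output dataset names, the sample size of 100 and the seed 12345 are all hypothetical choices.

```sas
/* Step 1: attach a uniform(0,1) random number to every observation */
data with_rand;
    set SASUSER.BIDW;
    rand_key = ranuni(12345);   /* 12345 is an arbitrary seed */
run;

/* Step 2: sort on the random number, which shuffles the rows */
proc sort data=with_rand;
    by rand_key;
run;

/* Step 3: keep the first n rows (n = 100 is a hypothetical sample size) */
data sample_n;
    set with_rand(obs=100);
    drop rand_key;
run;
```

Using a fixed positive seed makes the sample reproducible from run to run. Since each row appears once in the sorted dataset, this gives a sample without replacement.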
 
ii) Using Surveyselect Proc:
PROC SURVEYSELECT supports sampling methods beyond simple random sampling. It has many options, including sampling with replacement and without replacement.
For predictive modeling, when we split our data into 3 parts (training, validation and test), we sample without replacement. This means the same row can never be found in more than one sample (Training/Validation/Test). If we sampled with replacement, we would not be able to assess model performance correctly, because data points used to train the model could also appear in the validation or test datasets.
Another method is stratified sampling, which keeps the initial ratio of events to non-events in both the training and the testing datasets. This is useful for rare-event models.
Below is an example that selects a 60% sample stratified by skill. Because of the OUTALL option, the output dataset contains every row of BIDW plus a Selected flag, so it marks a 60% part (Selected=1) and a 40% part (Selected=0). The ratio of skills in BIDW will be approximately the same in both parts.




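/* PROC SURVEYSELECT expects the input sorted by the STRATA variable */
proc sort data=SASUSER.BIDW;
    by skill;
run;

/* Select 60% of the rows within each level of skill.               */
/* METHOD=SRS is simple random sampling without replacement.        */
/* OUTALL keeps every row and adds a Selected flag (1 = sampled).   */
proc surveyselect data=SASUSER.BIDW method=srs samprate=0.6 out=sample outall;
    strata skill;
run;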
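Since OUTALL writes a single dataset with a Selected flag, one more step is needed to get two physical datasets. A minimal sketch, assuming the output dataset sample produced above; the names train and test are hypothetical:

```sas
/* Split the OUTALL output into the 60% and 40% parts */
data train test;
    set sample;
    if Selected = 1 then output train;   /* rows picked by PROC SURVEYSELECT */
    else output test;                    /* remaining rows */
run;

/* Check that the skill mix is similar in both parts */
proc freq data=sample;
    tables skill * Selected / nocol nopercent;
run;
```

The row percentages in the PROC FREQ table should be close to 60% selected for every skill level. Adding a SEED= value to the PROC SURVEYSELECT statement would make the split reproducible.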




Thanks
Learner

 
 
