Sampling with SAS
Terms you need to understand first
Population: The overall data available for sampling. An investigator usually
wants to generalize about a class of individuals; this class is called the
population.
Eligible Universe: The target data available for sampling after the overall
business goals have been applied, that is, after a few selection or filter
criteria.
There is a very thin line between the population and the eligible universe, and
most of the time they overlap. For example, the population is all policies
available in the policy database, and the eligible universe is all policies in
that database that were enrolled in the last 1 year.
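As a rough sketch of that example, the eligible universe can be carved out of the population with a subsetting IF in a DATA step. The library policydb, the dataset policies and the date variable enroll_date are hypothetical names used only for illustration; substitute whatever your policy database actually uses.

```sas
/* Assumed names: policydb.policies (the population) and enroll_date
   (a SAS date variable). Neither comes from a real system. */
data eligible_universe;
    set policydb.policies;
    /* keep only policies enrolled on or after this day one year ago */
    if enroll_date >= intnx('year', today(), -1, 'same');
run;
```

Samples would then be drawn from eligible_universe rather than from the full policy table.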
Other definitions:
A sample is part of the population. A parameter is a numerical fact about a
population. Usually a parameter cannot be determined exactly; it can only be
estimated.
A statistic can be computed from a sample and used to estimate a parameter. A
statistic is what the investigator knows. A parameter is what the investigator
wants to know.
Some methods for choosing samples are likely to produce accurate estimates.
Others are spoiled by selection bias or non-response bias. When thinking about
a sample survey, ask yourself:
What is the population?
What do you want to estimate, i.e. what is the parameter?
How was the sample selected: at random or otherwise?
How many people did not respond at all, i.e. what was the response rate?
Do not rely on sample size alone: when the selection procedure is biased, a
larger sample offers no protection against that bias.
Two Sampling methods:
Quota Sampling:
In quota sampling, the sample is handpicked by the investigators to resemble
the population in some key ways. This method seems logical, but often gives bad
results. The reason: unintentional bias on the part of the interviewers when
they choose subjects to interview.
Probability Sampling:
Probability methods for sampling use an objective chance process to pick the
sample, and leave no discretion to the interviewer. The hallmark of a
probability method: the investigator can compute the chance that any particular
individual in the population will be selected for the sample. Probability
methods guard against bias, because blind chance is impartial.
One probability method is simple random sampling. This means drawing subjects
at random without replacement. Even when using probability methods, bias may
come in. Then the estimate differs from the parameter, due to both bias and
chance error:
Estimate = parameter + bias + chance error.
Chance error is also called sampling error. Bias is non-sampling error.
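To make chance error concrete, here is a small simulation sketch. Everything in it is made up for illustration: a toy population of 10,000 normally distributed values, 200 repeated samples of 100, and the dataset and variable names. Because each sample is a simple random sample and the statistic is the sample mean, the bias term should be zero here, so the difference between each estimate and the parameter is chance error alone.

```sas
/* Toy population; all names and numbers are illustrative assumptions. */
data population;
    call streaminit(2024);
    do id = 1 to 10000;
        x = rand('normal', 50, 10);
        output;
    end;
run;

/* Parameter: the population mean. */
proc means data=population noprint;
    var x;
    output out=param mean=pop_mean;
run;

/* 200 simple random samples of 100 rows each, without replacement. */
proc surveyselect data=population method=srs n=100 reps=200
                  seed=2024 out=samples noprint;
run;

/* Statistic: the mean of each sample (Replicate identifies the sample). */
proc means data=samples noprint;
    by replicate;
    var x;
    output out=estimates mean=sample_mean;
run;

/* Chance error = estimate - parameter for each sample. */
data errors;
    if _n_ = 1 then set param(keep=pop_mean);
    set estimates;
    chance_error = sample_mean - pop_mean;
run;

proc means data=errors mean std min max;
    var chance_error;
run;
```

The chance errors should scatter around zero. A biased selection rule would instead shift all the estimates in one direction, and a larger sample would not remove that shift.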
Sampling with SAS:
There are various ways of extracting samples from a large SAS dataset. Two of
the most common methods used for random sampling are listed below.
i) Using the RANUNI function:
The RANUNI function returns a random number from the uniform distribution
between 0 and 1, so it can be used to assign a random number to every
observation of a SAS dataset. We can then sort the dataset on the newly added
variable and pick the top n observations (supposing we need a sample of n out
of N observations).
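A minimal sketch of the RANUNI approach follows. The input dataset work.big, the variable rand_key, the seed 12345 and the sample size of 100 are all placeholders chosen for illustration.

```sas
%let n = 100;   /* desired sample size (illustrative value) */

/* Step 1: attach a uniform(0,1) random number to every observation. */
data with_rand;
    set work.big;                /* hypothetical input dataset */
    rand_key = ranuni(12345);    /* fixed seed so the draw is repeatable */
run;

/* Step 2: sort on the random number, which shuffles the rows. */
proc sort data=with_rand;
    by rand_key;
run;

/* Step 3: keep the first n rows of the shuffled data. */
data sample_n;
    set with_rand(obs=&n);
    drop rand_key;
run;
```

Since every ordering of the rows is equally likely after the sort, the first n rows form a simple random sample drawn without replacement.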
ii) Using PROC SURVEYSELECT:
PROC SURVEYSELECT can be used for simple random sampling and for other
sampling designs as well. The procedure has many sampling options, including
sampling with replacement and without replacement.
For predictive modeling, when we split our data into 3 parts (training, validation and test), we perform sampling without replacement. This means the same row can never be found in more than one sample (training/validation/test). If we sampled with replacement, we would not be able to assess model performance correctly, because data points that were used to train the model could also exist in the validation or test datasets.
Another method is stratified sampling, which helps to keep the initial ratio of events to non-events in both the training and the testing datasets. This is useful in the case of a rare-event model.
The example below draws a 60% sample stratified by skill. Because of the OUTALL option, the output dataset contains every row of BIDW along with a Selected flag (1 for the 60% that were sampled, 0 for the remaining 40%), so it can be divided into two datasets. Since sampling is done within each skill, the ratio of skills in both parts will be approximately the same as in BIDW.
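/* STRATA needs the input sorted by the stratification variable */
proc sort data=SASUSER.BIDW;
    by skill;
run;
/* Select 60% of each skill group; OUTALL keeps all rows and adds a Selected flag */
proc surveyselect data=SASUSER.BIDW samprate=0.6 out=sample outall;
    strata skill;
run;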
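Since OUTALL writes a single dataset with a Selected flag, one more step is needed to get two separate datasets. A possible sketch is shown here; the output names train and test are only illustrative choices.

```sas
/* Split the PROC SURVEYSELECT output on the Selected flag. */
data train test;
    set sample;
    if selected = 1 then output train;   /* about 60% of each skill group */
    else output test;                    /* the remaining rows */
    drop selected;
run;

/* Check that the skill mix is similar in both parts. */
proc freq data=train;
    tables skill;
run;
proc freq data=test;
    tables skill;
run;
```

The two frequency tables should show roughly the same percentage for each skill, with small differences caused by rounding within each stratum.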
Thanks
Learner