
Sunday, April 22, 2018

SAS Regular Expression Example



Below is an example of SAS Perl regular expression (PRX) functions to help you understand how they work.

Two Perl Regular Expression (PRX) Functions

1. PRXPARSE
Description – Defines (compiles) a Perl regular expression and returns a pattern identifier that other PRX functions, such as PRXMATCH, can then use.

Syntax – PRXPARSE("/Perl Regular Expression/i")
                  " " – part of the SAS syntax
                  / – default Perl delimiter
                  i – ignore case (case-insensitive matching)

Example – PRXPARSE("/sas/i")

2. PRXMATCH
Description – Locates the position in a string where a regular expression matches. The function returns the first position in the string of the pattern described by the regular expression; if the pattern is not found, it returns zero.

Syntax – PRXMATCH("/Perl Regular Expression/i" or pattern_id, string)
                  " " – part of the SAS syntax
                  / – default Perl delimiter
                  i – ignore case (case-insensitive matching)
                  pattern_id – the value returned by the PRXPARSE function

Example – PRXMATCH("/sas/i", string)
or
                  if _N_ = 1 then pattern = PRXPARSE("/sas/i");
                  retain pattern;
                  position = PRXMATCH(pattern, string);

Code
To find the word “SAS” anywhere in the string.

DATA Test;
IF _N_ = 1 THEN PATTERN_NUM = PRXPARSE("/sas/i"); /* compile once: match 'SAS' anywhere in the string, ignoring case */
RETAIN PATTERN_NUM; /* keep the pattern id across data step iterations */
INPUT STRING $30.;
POSITION = PRXMATCH(PATTERN_NUM, STRING); /* first position of the match; 0 if not found */
FILE PRINT; /* send PUT output to the print destination */
PUT PATTERN_NUM= STRING= POSITION=;
DATALINES;
Welcome to SAS india
SAS with Perl regular expression
Enjoy SAS with PRX
Perl Regular expression
;
run;

Output –
PATTERN_NUM=1 STRING=Welcome to SAS india POSITION=12
PATTERN_NUM=1 STRING=SAS with Perl regular expressi POSITION=1
PATTERN_NUM=1 STRING=Enjoy SAS with PRX POSITION=7
PATTERN_NUM=1 STRING=Perl Regular expression POSITION=0

(Note: the second string is truncated to 30 characters by the $30. informat.)


Sunday, April 8, 2018

SAS Functions for File Operations: Basic Level


The code below illustrates SAS functions related to directories. You might not use them if you work with SAS metadata-based tools, but it is always advantageous to understand them.

This code provides the list of files and folders available within a specific directory (the list of members within the directory).

Note: Files can have any extension (.sql, .sas, .txt, .xls, etc.).


Data Work.Test / view=work.Test;
/*Data _Null_;*/
Drop RC DID i;
RC = filename("Mydir", "G:\Test"); /* Filename - assigns the fileref Mydir to the directory path */
put RC;
did = dopen("Mydir");   /* DOpen - opens the directory and returns a directory identifier (0 if it fails) */
Put did;
if did > 0 then
      do i=1 to dnum(did); /* DNum - returns the number of members in the directory; member numbers start at 1 */
      dset = dread(did, i);   /* DRead - returns the name of a directory member: a file name with its extension, or a folder name */

      dset1 = scan(dset,1,'.'); /* Dset1 holds the file name without its extension (or the folder name) */

      Ext = scan(dset,-1,'.'); /* Ext holds the file extension (or the folder name, when a member has no extension) */

      output;
end;
RC = dclose(did); /* DClose - closes the directory opened by the DOpen function */

run;


Contributed by Shoaib Ansari

Friday, February 16, 2018

Baseline Result for a Model (Category: Concept, Level: Intermediate)



There are a few common ways to calculate a baseline result.
A baseline result is the simplest possible prediction: for some problems this may be a random result, and for others it may be the most common prediction.
  • Classification: If you have a classification problem, you can select the class that has the most observations and use that class as the result for all predictions. If the number of observations is equal for all classes in your training dataset, you can select a specific class, or enumerate each class and see which gives the better result in your test harness.
  • Regression: If you are working on a regression problem, you can use a central tendency measure as the result for all predictions, such as the mean or the median.
  • Optimization: If you are working on an optimization problem, you can use a fixed number of random samples in the domain.
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes what you compare any other machine learning algorithm against.
In more detail:
A machine learning algorithm tries to learn a function that models the relationship between the input (feature) data and the target variable (or label). When you test it, you will typically measure performance in one way or another. For example, your algorithm may be 75% accurate. But what does this mean? You can infer this meaning by comparing with a baseline's performance.
Typical baselines include those supported by scikit-learn's "dummy" estimators:
Classification baselines:
• “stratified”: generates predictions by respecting the training set’s class distribution.
• “most_frequent”: always predicts the most frequent label in the training set.
• “prior”: always predicts the class that maximizes the class prior.
• “uniform”: generates predictions uniformly at random.
• “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.
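As a minimal sketch of the classification case (assuming scikit-learn is installed; the iris dataset and the logistic regression model are purely illustrative choices), a "most_frequent" baseline can be compared against a real model like this:

# Minimal sketch: a "most_frequent" classification baseline vs. a real model
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Candidate model judged against the baseline
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:", model.score(X_test, y_test))

Any model worth keeping should beat the baseline accuracy by a clear margin.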


Regression baselines:
• “mean”: always predicts the mean of the training set.
• “median”: always predicts the median of the training set.
• “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.
• “constant”: always predicts a constant value that is provided by the user.
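Similarly, a minimal sketch for the regression case (the synthetic data is purely illustrative):

# Minimal sketch: a "median" regression baseline
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=100)

# Baseline: always predict the median of the training targets
baseline = DummyRegressor(strategy="median").fit(X, y)
print(baseline.predict(X[:3]))  # the same value (the training median) for every input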

Problem of Inference & Prediction (Category: Concepts, Level: Basics)

Before understanding the problems of inference and prediction, let us understand how input and output variables are defined.

The input variables are typically denoted using the symbol X, with a subscript to distinguish them. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable is often called the response or dependent variable, and is typically denoted using the symbol Y.

More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

Y = f(X) + e




Here f is some fixed but unknown function of X1, ..., Xp, and e is a random error term, which is independent of X and has mean zero. These two properties of the error term are very important: it should have no relationship with any input variable, and the average of the error term across all observations should be approximately zero.


In this formulation, f represents the systematic information that X provides about Y. Statistical learning is all about methods and approaches for estimating this function f.


Now, when we estimate f, we are usually solving one of two kinds of problems: prediction and inference.




Prediction
In many situations, a set of inputs X is readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

Ŷ = f̂(X)

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
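As a minimal sketch of this idea in Python (the simulated data, the true f, and the choice of a linear estimator are illustrative assumptions, not part of the text above):

# Minimal sketch: simulate Y = f(X) + e, estimate f-hat, and predict
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
e = rng.normal(0, 1, size=200)        # error term: mean zero, independent of X
y = 3 + 2 * X[:, 0] + e               # illustrative true f(X) = 3 + 2X

f_hat = LinearRegression().fit(X, y)  # f-hat, treated here as a black box
y_hat = f_hat.predict([[5.0]])        # prediction Y-hat = f-hat(X) at X = 5
print(y_hat)                          # close to the true f(5) = 13

Here we only care that the prediction is accurate, not what f̂ looks like inside.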



Inference
We are often interested in understanding the way that Y is affected as X1, ..., Xp change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y, or more specifically, to understand how Y changes as a function of X1, ..., Xp. Now f̂ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

• Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.


• What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y, in the sense that increasing the predictor is associated with increasing values of Y. Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors.


• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
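As a minimal sketch of the inference setting (again with illustrative simulated data), a linear fit makes the form of f̂ explicit, so its fitted coefficients directly answer questions about how Y changes with each predictor:

# Minimal sketch: for inference, the exact form of f-hat matters
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
# Illustrative true f: positive relationship with X1, negative with X2
y = 3 + 2 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, size=200)

f_hat = LinearRegression().fit(X, y)
print(f_hat.intercept_, f_hat.coef_)  # approximately 3 and [2, -0.5]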



Thanks
Learner


Reference: Introduction to Statistical Learning, Statistics, Wiki

Thursday, February 15, 2018

Designing Data warehouse: Good Practices




We always aspire (though it is difficult) to a flawless data warehouse design for a successful BI system. Listed below are some good practices to help make your data warehouse design as sound as possible.
A data warehouse lies at the foundation of any business intelligence (BI) system. But if this foundation is flawed, the towering BI system cannot possibly be stable. Given BI’s importance as a decision enabler today, such flaws are undesirable. For a flawless data warehouse design and process, keep the following good practices in mind.

1) Design the data warehouse for future needs, not only current ones
A data warehouse (in some cases a common data model) needs to be designed for the long term. Designing it only for short-term benefit would not reap much. While designing the data warehouse, keep the organization's three-to-five-year business roadmap in mind. The data warehouse designer should pay as much attention to business strategy as to technical aspects.
2) Avoid shortcuts and negligence when creating the metadata layer
While designing a data warehouse, poor design of the metadata has far-reaching implications. Metadata is the integrator between data models, extract, transform, load (ETL) and BI. However, the metadata layer often is created only to fit short-sighted data criteria and its documentation is haphazard.
It is necessary to add descriptions to the tables or columns at the data warehouse design stage itself. When business users reject BI reports they cannot decipher, the problem can usually be traced to poorly designed data models that lack easy-to-understand descriptions, and have inconsistent naming conventions. Prevent this by setting the appropriate metadata strategy at the data modeling stage of data warehouse design.

3) Give due weight to ad hoc querying and self-service BI
Generating a simple report can sometimes expend considerable bandwidth and be a drain on productivity for the IT team. However, self-service BI can simplify the task by relying on the metadata layer to generate the reports, without affecting the sanctity of the underlying data model. For example, to produce a report on top-performing branches, the user simply selects hierarchical columns with the titles “region,” “branch,” “number of customers” and “relationship packages.” Thus, incorporating a self-service or ad hoc query layer in the data warehouse process can help you gain user acceptance.

4) Focus on speed rather than visual appeal
When developing the reporting layer of a data warehouse, the design should focus on ease of use and speedy response. Although business users tend to be swayed by fancy charts and reports, do not succumb to the temptation of sacrificing speed at the altar of beautification. Indeed, in the course of data warehouse project implementations, IT teams have noted that speedy response time in report generation is of prime importance for the BI system to gain popularity amongst users. As a comparison, where a simple report takes a couple of seconds to load, a chart may take three minutes. Speed should therefore be given priority over visual appeal when designing the data warehouse process.

5) Establish clarity on data quality before finalizing the data warehouse design
A large amount of aggregation takes place at the data mart level. The data warehouse is the source of data, and the data contained therein should be clean and accurate. If not, the output from the system is likely to show discrepancies, and the data warehouse design, as well as the process, is unfairly thought of as the culprit.
Rather, active monitoring of dimensional data should be incorporated right at the data warehouse design stage. Quality control is governed by usage too, and the business could point out faults that the technology may overlook; hence, the data warehouse design must provide for strong data governance processes in order to maintain clean data.

6) Do not treat data warehousing as only an IT initiative
Focusing on data warehouse implementation as a pure IT project can dilute its essence. Ideally, the benefits of a data warehouse design and process should be quantified; if not as monetary returns on investment, then at least in terms of growth in business (defined by key performance indicators). This growth can be viewed and measured in terms of increased usage of the BI system and efficient data management, leading to smoother operations and increased efficiency for top-level management.
It must be understood that discrepancies usually creep in at transactional data levels. For instance, consider a bank customer whose transaction details are captured correctly, but whose mobile number on record is incorrect. A change in the mobile number could have occurred after the original form was recorded. This example relates to dimensional data. In such cases, it may not be feasible to stop the data warehouse process and cleanse the data.