
Friday, February 16, 2018

Baseline Result for a Model (Category: Concept, Level: Intermediate)



A baseline result is the simplest possible prediction: for some problems it may be a random result, and for others it may be the most common prediction. There are a few common ways to calculate a baseline result, depending on the type of problem (a minimal hand-rolled sketch follows the list below).
  • Classification: If you have a classification problem, you can select the class that has the most observations and use that class as the result for all predictions. If the number of observations is equal for all classes in your training dataset, you can select a specific class, or enumerate each class and see which gives the better result in your test harness.
  • Regression: If you are working on a regression problem, you can use a central tendency measure as the result for all predictions, such as the mean or the median.
  • Optimization: If you are working on an optimization problem, you can use a fixed number of random samples in the domain.
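As a quick illustration, here is a minimal hand-rolled sketch of the classification and regression baselines above, assuming a tiny made-up dataset (all names and values are purely illustrative):

import numpy as np

# Hypothetical training labels for a classification problem
y_train_cls = np.array(["spam", "ham", "ham", "ham", "spam", "ham"])

# Classification baseline: always predict the class with the most observations
classes, counts = np.unique(y_train_cls, return_counts=True)
majority_class = classes[np.argmax(counts)]
print("Majority-class baseline prediction:", majority_class)  # "ham"

# Hypothetical training targets for a regression problem
y_train_reg = np.array([10.0, 12.5, 11.0, 50.0, 13.0])

# Regression baselines: always predict a central tendency measure
print("Mean baseline:", y_train_reg.mean())        # pulled up by the outlier 50.0
print("Median baseline:", np.median(y_train_reg))  # more robust to the outlier

Whichever baseline you choose, the point is the same: every prediction is the same constant, and any real model must beat that constant to justify its complexity.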
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes what you compare any other machine learning algorithm against.
In more detail:
A machine learning algorithm tries to learn a function that models the relationship between the input (feature) data and the target variable (or label). When you test it, you will typically measure performance in one way or another. For example, your algorithm may be 75% accurate. But what does this mean? You can infer this meaning by comparing with a baseline's performance.
Typical baselines include those supported by scikit-learn's "dummy" estimators (a minimal usage sketch follows the lists below).
Classification baselines:
  • “stratified”: generates predictions by respecting the training set’s class distribution.
  • “most_frequent”: always predicts the most frequent label in the training set.
  • “prior”: always predicts the class that maximizes the class prior.
  • “uniform”: generates predictions uniformly at random.
  • “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.


Regression baselines:
  • “median”: always predicts the median of the training set.
  • “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.
  • “constant”: always predicts a constant value that is provided by the user.
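Here is a minimal usage sketch of these dummy estimators, assuming scikit-learn is installed and using small synthetic datasets purely for illustration:

from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

# --- Classification baseline ---
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "most_frequent": always predicts the majority class seen during fit()
clf_baseline = DummyClassifier(strategy="most_frequent")
clf_baseline.fit(X_train, y_train)
print("Baseline accuracy:", clf_baseline.score(X_test, y_test))

# --- Regression baseline ---
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)

# "median": always predicts the median of the training targets
reg_baseline = DummyRegressor(strategy="median")
reg_baseline.fit(Xr_train, yr_train)
print("Baseline R^2:", reg_baseline.score(Xr_test, yr_test))

Any model you train afterwards should be compared against these baseline scores before its performance is taken seriously.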

Problem of Inference & Prediction (Category: Concepts, Level: Basics)

Before understanding the problems of inference and prediction, let us first understand how input and output variables are defined.

The input variables are typically denoted using the symbol X, with a subscript to distinguish them. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable is often called the response or dependent variable, and is typically denoted using the symbol Y.

More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

Y = f(X) + e




Here f is some fixed but unknown function of X1, ..., Xp, and e is a random error term, which is independent of X and has mean zero. These two properties of the error term are very important: it should not have any relationship with any input variable, and the average of the error term over all observations should be approximately zero.


In this formulation, f represents the systematic information that X provides about Y. Statistical learning is all about methods and approaches for estimating this function f.
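To make the two properties of the error term concrete, here is a small simulation sketch; the choice of f and the noise level are arbitrary assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Simulate inputs X and an error term e that is independent of X and has mean zero
X = rng.uniform(0, 10, size=10_000)
e = rng.normal(loc=0.0, scale=1.0, size=10_000)

# An assumed "true" (in practice unknown) function f
def f(x):
    return 2.0 * x + 5.0

Y = f(X) + e

# Empirically check the two properties of the error term
print("Mean of e (should be close to 0):", e.mean())
print("Correlation of e with X (should be close to 0):", np.corrcoef(X, e)[0, 1])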


Now, when we estimate f, we are usually solving one of two kinds of problems: prediction and inference.




Prediction
In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

Ŷ = f̂(X)

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
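As an illustration, here is a sketch (assuming scikit-learn and the same kind of synthetic data as above) in which a flexible model is used as a black-box f̂ purely for prediction; we never inspect its internal form, only its accuracy:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
Y = 2.0 * X[:, 0] + 5.0 + rng.normal(size=1000)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# f_hat is treated as a black box: we only care whether its predictions are accurate
f_hat = RandomForestRegressor(random_state=0)
f_hat.fit(X_train, Y_train)
Y_hat = f_hat.predict(X_test)

print("Test R^2:", f_hat.score(X_test, Y_test))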



Inference
We are often interested in understanding the way that Y is affected as X1, ..., Xp change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y, or more specifically, to understand how Y changes as a function of X1, ..., Xp. Now f̂ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

• Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.


• What is the relationship between the response and each predictor?
Some predictors may have a positive relationship with Y , in the sense that increasing the predictor is associated with increasing values of Y . Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors.


• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
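For inference, by contrast, here is a sketch using an interpretable linear model on synthetic data (the predictor names and true coefficients are assumptions made up for illustration); the fitted coefficients directly answer how Y changes with each predictor and which predictors matter:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Three hypothetical predictors; by construction only X1 and X2 influence Y
X = rng.normal(size=(1000, 3))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=1000)

model = LinearRegression()
model.fit(X, Y)

# The fitted coefficients answer "how does Y change as each predictor changes?"
for name, coef in zip(["X1", "X2", "X3"], model.coef_):
    print(name, round(coef, 2))  # roughly +3, -2, and ~0 for the irrelevant X3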



Thanks
Learner


References: An Introduction to Statistical Learning; Wikipedia

Thursday, February 15, 2018

Designing a Data Warehouse: Good Practices




We always aspire to a flawless data warehouse design for a successful BI system, though that is difficult to achieve. Listed below are some best practices to make your data warehouse design as sound as possible.
A data warehouse lies at the foundation of any business intelligence (BI) system. But if this foundation is flawed, the towering BI system cannot possibly be stable. Given BI’s importance as a decision enabler today, such flaws are undesirable. For a sound data warehouse design and process, keep the following practices in mind.

1) Design the data warehouse for future needs, not only current needs
A data warehouse (or, in some cases, a common data model) needs to be designed for the long term. Designing it only for short-term benefit will not reap much reward. While designing the data warehouse, keep the organization's three- to five-year business roadmap in mind. The data warehouse designer should pay as much attention to business strategy as to technical aspects.
2) Do not neglect or cut corners while creating the metadata layer
While designing a data warehouse, poor design of the metadata has far-reaching implications. Metadata is the integrator between data models, extract, transform, load (ETL) and BI. However, the metadata layer often is created only to fit short-sighted data criteria and its documentation is haphazard.
It is necessary to add descriptions to the tables or columns at the data warehouse design stage itself. When business users reject BI reports they cannot decipher, the problem can usually be traced to poorly designed data models that lack easy-to-understand descriptions, and have inconsistent naming conventions. Prevent this by setting the appropriate metadata strategy at the data modeling stage of data warehouse design.

3) Give due weight to ad hoc querying and self-service BI
Generating a simple report can sometimes expend considerable bandwidth and be a drain on productivity for the IT team. However, a self-service BI layer can simplify the task by relying on the metadata layer to generate the reports, without affecting the sanctity of the underlying data model. For example, to produce a report on top-performing branches, the user simply selects hierarchical columns with the titles “region,” “branch,” “number of customers” and “relationship packages.” Thus, incorporating a self-service or ad hoc query layer in the data warehouse process can help you gain user acceptance.

4) Do not get carried away by visual appeal; focus on speed
When developing the reporting layer of a data warehouse, the design should focus on ease of use and speedy response. Although business users tend to be swayed by fancy charts and reports, do not succumb to the temptation of sacrificing speed at the altar of beautification. Indeed, in the course of data warehouse project implementations, IT teams have noted that speedy response time in report generation is of prime importance for the BI system to gain popularity amongst users. To give a comparison, if a simple report takes a couple of seconds to load, a fancy chart may take three minutes. Speed should therefore be given priority over visual appeal when designing the data warehouse process.

5) Get clarity on data quality before finalizing the data warehouse design
A large amount of aggregation takes place at the data mart level. The data warehouse is the source of data, and the data contained therein should be clean and accurate. If not, the output from the system is likely to show discrepancies, and the data warehouse design, as well as the process, is unfairly thought of as the culprit.
Rather, active monitoring of dimensional data should be incorporated right at the data warehouse design stage. Quality control is governed by usage too, and the business could point out faults that the technology may overlook; hence, the data warehouse design must provide for strong data governance processes in order to maintain clean data.

6) Do not treat data warehousing as only an IT initiative
Focusing on data warehouse implementation as a pure IT project can amount to diluting its essence. Ideally, the benefits of a data warehouse design and process have to be quantified; if not as monetary returns on investments, then at least in terms of growth in business (defined by key performance indicators). This growth can be viewed and measured in terms of increased usage of the BI system and efficient data management, leading to smoother operations, and increased efficiency for top-level management.
It must be understood that discrepancies usually creep in at transactional data levels. For instance, consider a bank customer whose transaction details are captured correctly, but whose mobile number on record is incorrect. A change in the mobile number could have occurred subsequent to recording the original form. This example relates to dimensional data. In such cases, it may not be feasible to stop the data warehouse process and cleanse the data.