For Constant Learners & Data Analysts

This blog is for those who see their career and passion in data analysis and data science. The focus is on concepts, with examples discussed in Excel, SAS, R and Python along the way. Happy Learning DataOps :)

Featured Post

Reference Books and material for Analytics

Website for practising R and learning statistical concepts: https://statlearning.com  Reference Books & Materials: 1) Statis...

Wednesday, October 30, 2019

Mistagging of information when you don't know your data

Finding relevant articles for an entity is an interesting task. It becomes complex when the entity is known by various acronyms and short forms, and more complex still when multiple entities have similar names, short names or acronyms.

The whole effort of a complex web crawling and web scraping framework built with Python Scrapy, Selenium, etc., including tagging and presentation, goes for a toss if articles and documents are not entity-tagged properly.

If you search for South Indian Bank stock on https://moneycontrol.com/ today (as of 30 Oct 2019) and go to News & Research, the most recent and relevant articles you find for this stock are actually not related to the South Indian Bank entity at all. Forget the title; you will not even find a mention of South Indian Bank anywhere inside the article. The articles actually relate to a completely different entity with a similar name: Indian Bank.







Though the moneycontrol website and mobile application are excellent in many respects, and are among the good sources of information for those of us active in the share market, this kind of blunder occurs when you do not understand your data well.
Matching articles to entities has to be improved, especially in these cases.

My suggestion to the moneycontrol application and AI architects would be to follow these simple steps while tagging:

  • Tag articles whose entity name directly matches the title text.
  • Tag articles whose entity name directly matches the body text.
  • Tag articles whose entity name partially but sufficiently matches the title text:
    ◦ complex fuzzy match
    ◦ acronym match
    ◦ other short-form match
  • Tag articles whose entity name partially but sufficiently matches the body text:
    ◦ complex fuzzy match
    ◦ acronym match
    ◦ other short-form match
  • Save the names of the matched and matching entities along with article IDs, the matching step, etc.
  • Exclude an article if its matched text is also a direct match for a different entity (as with "Indian Bank" inside "South Indian Bank").
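The steps above can be sketched as a minimal Python tagger. This is a hypothetical sketch: the entity list and short forms are made up for illustration, and difflib's SequenceMatcher stands in for a real "complex fuzzy match":

```python
from difflib import SequenceMatcher

# Hypothetical entity list; real acronyms/short forms would come from a curated mapping.
ENTITIES = {
    "South Indian Bank": ["SIB"],
    "Indian Bank": ["IB"],
}

def tag_article(title, body, fuzzy_threshold=0.85):
    """Tag an article with entities per the steps above, then apply the
    exclusion rule for near-name collisions."""
    text = f"{title} {body}".lower()
    words = text.split()
    hits = []
    for entity, shorts in ENTITIES.items():
        direct = entity.lower() in text                       # direct title/body match
        short = any(s.lower() in words for s in shorts)       # acronym / short form
        fuzzy = SequenceMatcher(None, entity.lower(),         # stand-in fuzzy match
                                title.lower()).ratio() >= fuzzy_threshold
        if direct or short or fuzzy:
            hits.append(entity)
    # Exclusion: drop an entity whose name is contained in another matched
    # entity's name ("Indian Bank" inside "South Indian Bank").
    return sorted(e for e in hits
                  if not any(e != o and e.lower() in o.lower() for o in hits))
```

With this rule, an article titled "South Indian Bank Q2 net profit rises" is tagged only with South Indian Bank, even though "Indian Bank" also appears as a substring.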


Posted by Ashutosh at 9:56 PM No comments:

Tuesday, September 3, 2019

AWS Solution Architect Associate Exam (Read Time - 4 Mins)

I passed my AWS Solutions Architect Associate exam a couple of months back. Please find below some useful tips.

Before the exam:
  • Go through the official AWS learning library: https://www.aws.training/LearningLibrary. It is entirely free & has the most updated information about AWS services.
  • Complete the official AWS Exam Readiness: AWS Certified Solutions Architect (Associate) - Digital training (Free) :https://www.aws.training/learningobject/curriculum?id=20685
  • Read the FAQ of each AWS Service. e.g., https://aws.amazon.com/vpc/faqs/
  • Understand the AWS Well-Architected Framework & read each whitepaper from here: https://aws.amazon.com/architecture/well-architected/
  • Take handwritten notes & make personalized cheat sheets whenever possible.
  • Do plenty of hands-on practice. I used Qwiklabs and it helped me a lot (https://www.qwiklabs.com).
  • You need to understand how each AWS service can be tweaked for Cost, Quality, and Performance. How can you make S3 cheaper? How can you make it more redundant/secure? How can you make it more performant? What about DynamoDB or EBS? EC2? Etc.
  • Take plenty of practice tests; it will give you confidence for the actual exam.

During the exam:
  • Get plenty of rest before the exam day. It's very challenging to maintain concentration for 130 minutes, without any breaks.
  • Read the answers first to understand what to focus on in the question.
  • Read each question twice & make sure you have found the "keywords." It's the part of the question that tells you exactly what they want. e.g., "Which option provides the MOST COST EFFECTIVE solution."
  • If you have no clue at first, eliminate wrong answers, then guess. Mark it for review and revisit it if you have time.


I hope you find this useful, and all the best for your exam!

Tuesday, August 27, 2019

Apache Spark in Google Colaboratory

This is from my learning notes!!!

1.1    Setting up Spark on Google Colab


Google Colaboratory is a perfect cloud platform for someone starting to learn Python: you can access what you have practised from anywhere.

It can also be used to learn Spark. Please follow the steps below, and make sure you check the file version and modify the commands as needed (e.g., look for the latest .tgz file).

1.1.1    Install Java, Spark, and Findspark

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

1.1.2    Set Environment Variables

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"

1.1.3    Start a SparkSession

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

1.1.4    Use Spark!

df = spark.createDataFrame([{"winner": "Humanity"} for x in range(100)])

df.show(2)



Sunday, April 22, 2018

SAS Regular Expression Example



Below is an example of the SAS regular expression functions to help you understand them.

Two Perl Regular Expression (PRX) functions:
1.       PRXPARSE
Description – defines a Perl regular expression, which is then used by other PRX functions such as PRXMATCH.

Syntax – PRXPARSE("/Perl Regular Expression/i")
                  " " → part of the SAS syntax
                  / → default Perl delimiter
                  i → case-insensitive match

Example → PRXPARSE("/sas/i")

2.       PRXMATCH
Description – locates the position in a string where a regular expression matches. This function returns the first position in the string of the pattern described by the regular expression. If the pattern is not found, it returns zero.

Syntax – PRXMATCH("/Perl Regular Expression/i" or pattern_id, string)
                  " " → part of the SAS syntax
                  / → default Perl delimiter
                  i → case-insensitive match
                  pattern_id → the value returned by the PRXPARSE function

Example → PRXMATCH("/sas/i", String)
or
                       If _N_ = 1 then Pattern = PRXPARSE("/sas/i");
                       Retain Pattern;
                       PRXMATCH(Pattern, String)

Code –
To find the word “SAS” anywhere in the string.

DATA Test;   
IF _N_ = 1 THEN PATTERN_NUM = PRXPARSE("/sas/i");   
* match for the word 'SAS' anywhere in the string;   
RETAIN PATTERN_NUM;
INPUT STRING $30.;   
POSITION = PRXMATCH(PATTERN_NUM,STRING);   
FILE PRINT;   
PUT PATTERN_NUM= STRING= POSITION=;
DATALINES;
Welcome to SAS india
SAS with Perl regular expression
Enjoy SAS with PRX
Perl Regular expression
;
run;

Output –
For the four input lines, POSITION is 12, 1, 7 and 0 respectively (0 means the pattern was not found). (The original post showed the output listing as an image.)
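For comparison (not part of SAS), the same 1-based match position can be sketched with Python's standard re module:

```python
import re

# Case-insensitive pattern, analogous to PRXPARSE("/sas/i").
pattern = re.compile("sas", re.IGNORECASE)

def prxmatch_like(pat, string):
    """Return the 1-based position of the first match, or 0 if not found,
    mirroring the behaviour of SAS PRXMATCH."""
    m = pat.search(string)
    return m.start() + 1 if m else 0

for s in ["Welcome to SAS india",
          "SAS with Perl regular expression",
          "Enjoy SAS with PRX",
          "Perl Regular expression"]:
    print(s, "->", prxmatch_like(pattern, s))
```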



Sunday, April 8, 2018

SAS Functions for File Operation : Basic Level


The code below demonstrates SAS functions related to directories. You might not use them if you work with SAS metadata-based tools, but it is always advantageous to understand them.

This code lists the files and folders available within a specific directory (the members of the directory).

Note: the files can have any extension (.sql, .sas, .txt, .xls, etc.).


Data Work.Test / view=work.Test;
/*Data _Null_;*/
Drop RC DID i;
RC = filename("Mydir", "G:\Test");
put RC;
did = dopen("Mydir");   /* DOPEN - opens the directory and returns a directory identifier */
Put did;
if did > 0 then
      do i=1 to dnum(did);    /* DNUM - returns the number of members in a directory; member numbers are 1-based */
      dset = dread(did, i);   /* DREAD - returns the name of the i-th member: a file name with extension, or a folder name */

      dset1 = scan(dset,1,'.'); /* dset1 holds the file name without extension (or the folder name, if there is no '.') */

      Ext = scan(dset,-1,'.'); /* Ext holds the file extension (or the folder name, if there is no '.') */

      output;
end;
RC = dclose(did); /* DCLOSE - closes the directory opened by DOPEN */

run;
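For comparison, the same directory listing can be sketched in Python with only the standard library. The split mirrors the two SCAN calls above, including the quirk that a dot-less folder name shows up as its own "extension":

```python
import os

def split_member(name):
    """Mimic scan(name,1,'.') and scan(name,-1,'.'):
    the first and last dot-separated tokens of a member name."""
    parts = name.split(".")
    return parts[0], parts[-1]

def list_members(directory):
    """Return (name, stem, extension) for each member of a directory,
    like the DOPEN/DNUM/DREAD loop above."""
    return [(n, *split_member(n)) for n in os.listdir(directory)]
```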


Contributed by Shoaib Ansari

Friday, February 16, 2018

Baseline Result for a Model (Category: Concept, Level: Intermediate)



There are common ways to calculate a baseline result.
A baseline result is the simplest possible prediction. For some problems, this may be a random result; in others, it may be the most common prediction.
  • Classification: If you have a classification problem, you can select the class that has the most observations and use that class as the result for all predictions. If the number of observations are equal for all classes in your training dataset, you can select a specific class or enumerate each class and see which gives the better result in your test harness.
  • Regression: If you are working on a regression problem, you can use a central tendency measure as the result for all predictions, such as the mean or the median.
  • Optimization: If you are working on an optimization problem, you can use a fixed number of random samples in the domain.
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes what you compare any other machine learning algorithm against.
In more detail:
A machine learning algorithm tries to learn a function that models the relationship between the input (feature) data and the target variable (or label). When you test it, you will typically measure performance in one way or another. For example, your algorithm may be 75% accurate. But what does this mean? You can infer this meaning by comparing with a baseline's performance.
Typical baselines include those supported by scikit-learn's "dummy" estimators:
Classification baselines:
  • "stratified": generates predictions by respecting the training set's class distribution.
  • "most_frequent": always predicts the most frequent label in the training set.
  • "prior": always predicts the class that maximizes the class prior.
  • "uniform": generates predictions uniformly at random.
  • "constant": always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.


Regression baselines:
  • "mean": always predicts the mean of the training set.
  • "median": always predicts the median of the training set.
  • "quantile": always predicts a specified quantile of the training set, provided with the quantile parameter.
  • "constant": always predicts a constant value that is provided by the user.
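As an illustration, the "most_frequent" baseline can be sketched in a few lines of plain Python (scikit-learn's DummyClassifier does the same job):

```python
from collections import Counter

def most_frequent_baseline(y_train):
    """Return a predict function that always outputs the majority label."""
    majority, _ = Counter(y_train).most_common(1)[0]
    return lambda X: [majority for _ in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 70% of the training labels are 0, so on similarly distributed data
# the baseline scores about 0.7 accuracy.
predict = most_frequent_baseline([0]*7 + [1]*3)
print(accuracy([0]*7 + [1]*3, predict(range(10))))  # → 0.7
```

Any real model you build should beat this number; if it does not, the model has learned nothing useful.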

Problems of Inference & Prediction (Category: Concepts, Level: Basics)

Before understanding the problems of inference and prediction, let us understand how input and output variables are defined.

The input variables are typically denoted using the symbol X, with a subscript to distinguish them. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable is often called the response or dependent variable, and is typically denoted using the symbol Y.

More generally, suppose that we observe a quantitative response Y and p different predictors,
X1,X2, . . .,Xp. We assume that there is some relationship between Y and X = (X1,X2, . . .,Xp), which can be written in the very general form

Y = f(X) + e




Here f is some fixed but unknown function of X1, . . . , Xp, and e is a random error term, which is independent of X and has mean zero. These two properties of the error term are very important: it should have no relationship with any input variable, and at the same time its average over all observations should be approximately zero.


 In this formulation, f represents the systematic information that X provides about Y .  Statistical Learning is all about methods or approaches of estimating this function f.


Now when we estimate f we are usually solving two kinds of problems.




Prediction
In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

Ŷ = f̂(X)

where f̂ represents our estimate of f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
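As a tiny illustration (not from the book), we can estimate f̂ by simple least squares in plain Python and use it to produce a prediction Ŷ at a new input:

```python
def fit_linear(xs, ys):
    """Least-squares fit of f̂(x) = a + b*x, our estimate of f."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

# Noisy observations of roughly Y = 2X + e, where e averages to about zero.
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]
f_hat = fit_linear(X, Y)
y_pred = f_hat(6)   # Ŷ at the new input x = 6, ≈ 11.91
```

Here f_hat is exactly the "black box" of the prediction setting: we care only that y_pred is accurate, not about the fitted coefficients themselves.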



Inference
We are often interested in understanding the way that Y is affected as X1, . . . , Xp change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y, or more specifically, to understand how
Y changes as a function of X1, . . . , Xp. Now f̂ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

• Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y . Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on
the application.


• What is the relationship between the response and each predictor?
Some predictors may have a positive relationship with Y , in the sense that increasing the predictor is associated with increasing values of Y . Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors.


• Can the relationship between Y and each predictor be adequately summarized
using a linear equation, or is the relationship more complicated?
Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.



Thanks
Learner


Reference: An Introduction to Statistical Learning; Wikipedia
