For a constant Learner & Data Analysts

Thursday, May 13, 2021

Machine Learning System Monitoring

Monitoring Machine learning system is a cumbersome process that involves quite a lot of skills other than constant business feedback.

Broadly there are 3 kinds of monitoring that one need to focus on:

1) Data : Data Monitoring includes drift monitoring, monitoring data variance and feature monitoring

2) Model : Monitoring of model accuracy over a period of time to make decisions for retraining and A/B testing, It also include model health check

3) System : Monitoring system health parameters such as average response time, Capacity utilization, Number of service requests, down time etc.

Friday, May 7, 2021

Machine Learning Model Governance Process

Like Data Governance, predictive model also has its own governance process. There are multiple teams like but not limited to Core team, extended team, decision making/Steering committee & implementation team. This Governance process typically requires following steps.

1) Inputs : The generation of a request for a new or updated version of model

2) Model Need, Design and Direction: Technical process to validate the requirement, scope and high level implementation

3) Model Build: Creates the model and develops implementation requirements (along with legal and regulatory considerations)

4) Model Approval: Multistep approval process (technical, business, risk, legal) to affirm and ascertain the model

5) Model Implementation: Data integrity, end to end testing and detailed implementation

6) Monitoring: This process is done for post implementation monitoring and understanding the data drift.

In addition to this, there is a model review process at a regular frequency to decision on refreshing the model.

Saturday, January 2, 2021

Unify Data Warehousing and Advanced Analytics

Most of the Data warehouses in today's world still deals with only structured data. Portion of it alos utilizes unstructured data from Data Lake or some landing layer before the warehouse. Data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse, which will (i) be based on open direct-access data formats, such as Apache Parquet, (ii) have firstclass support for machine learning and data science, and (iii) offer state-of-the-art performance. Lakehouses can help address several major challenges with data warehouses, including data staleness, reliability, total cost of ownership, data lock-in, and limited use-case support. The industry is already moving toward Lakehouses and how this shift may affect work in data management. We also report results from a Lakehouse system using Parquet that is competitive with popular cloud data warehouses on TPC-DS.

Please refer below architecture for the evolution of Data Warehouses. With the increased focus now on

Data Science & Machine learning Lakehouse platform is the future.

Reference: This article has referred the CIDR paper. For more details please refer following link.

http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf?utm_source=bambu&utm_medium=social&utm_campaign=advocacy&blaid=1066676

Tuesday, April 14, 2020

Bayes’ theorem and rare disease

Bayes' Theorem is used to reverse the direction of conditioning. Suppose we want to ask what's the P(A|B) but we know it in terms of P(B|A).

So we can write the P(A|B) = P(B|A) P(A) / P(B|A) P(A) + P(B| not A) P(not A)

This is same as P(A and B) / P(B)

This example is from an early test for HIV antibodies known as the ELISA test in North America.

Just for the example sake, I have replaced HIV with Covid19.

It's because this is a rare disease (see the probability of Covid19 in the screenshot ) and

this is actually fairly common a problem for rare diseases. The number of false positives,

greatly outnumbers the true positives because it's a rare disease. So even though the test is very accurate, we get more false positives than we get true positives. This obviously has important policy implications for things like mandatory testing. It makes much more sense to test in a sub population where the prevalence of Covid19 is higher, rather than in a general population where it's quite rare.

Monday, April 13, 2020

Need to apply Deep Learning but don't have enough data, what to do next ?

Often it has been observed that analysts and data scientists want to apply deep learning models to solve the problem but they don't have enough data to train. There are three main ways to improve data: collecting more data, synthesizing new data, or augmenting existing data.. But what if not much academic work is there on the problem you want to solve. Convolutional Neural Networks have worked pretty well on most of the Computer Vision tasks. But all the CNN's (particularly deep CNN's) are heavily dependent on availability of very large training data to avoid overfitting. So in almost all computer vision tasks having more data help. In today’s world for the majority of Computer Vision tasks we don’t have enough data. So when you are training the computer vision model, often data augmentation is must.

Some of the common data augmentation used in Computer vision models are as given below.

a) Mirroring

Below is the example of mirroring on the vertical axis.

b) Random Cropping

It is not the ideal method for data augmentation but works well as long as your cropped images are reasonable subset of original image.

c) Other Techniques like Rotation, Shearing and Local warping

d) Color Shifting

Color shifting is about adding different distortion to RGB channels of an image.

Implementing distortions during training

We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk.

In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.

A quick taxonomy of data augmentation method in general is depicted below for a big picture

References:

1) Andrew NG Deep Learning deeplearning.ai

2) Data Augmentation on Workera.ai

3) AlexNet Paper on PCA color augmentation

4) Datahackers.rs blog

Wednesday, October 30, 2019

Mistagging of information when you don't know your data

Finding out relevant articles related to an entity is an interesting task. It becomes complex when an entity is known with various acronyms and short forms. It becomes further complex when you have multiple entities with similar names, short names or acronyms.

The whole effort of complex web crawling and web scrapping framework using python scrappy, selenium etc. including tagging and presentation will go for a toss if articles and documents are not entity-tagged properly.

If you search South Indian Bank Stock on https:/moneycontrol.com/ website today (as on 30th Oct 19) and go to News & Research, the most recent and relevant articles you find for this stock is actually not related at all with South Indian Bank entity. Forget about title, you would not even find a mention of the entity, South Indian Bank anywhere inside the article. Actually it is related with an entity which is completely different but similar in name called Indian Bank.

Though money control website and mobile application are amazing in various aspects and it is one of the good sources of information for most of us who are active in share market but this kind of blunder does occur when you do not understand you data well.

Finding relevant articles related to an entity through matching has to be improved specially in these cases.

My Suggestion to moneycontrol Application-cum-AI architect would be to follow following simple steps while tagging.

Ø Tag articles with entity name matches directly with Title text

Ø Tag articles with entity name matches directly with Body text

Ø Tag articles with entity name matches Partially but sufficient with Title text

o Complex Fuzzy Match

o Matches with Acronym

o Matches with other short form

Ø Tag articles with entity name matches Partially but sufficient with Body text