
Tuesday, April 14, 2020

Bayes’ theorem and rare disease

Bayes' Theorem is used to reverse the direction of conditioning. Suppose we want to know P(A|B), but we only know it in terms of P(B|A).
Then we can write P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|not A) P(not A)].
This is the same as P(A and B) / P(B).
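As a minimal sketch (the function and argument names below are my own labels, not from the original post), the formula translates directly into Python:

```python
def prob_disease_given_positive(prior, sensitivity, specificity):
    """Bayes' theorem: P(A|B) for A = has the disease, B = tests positive.

    prior       = P(A), the prevalence of the disease
    sensitivity = P(B|A), probability of a positive test given disease
    specificity = P(not B|not A), probability of a negative test given no disease
    """
    numerator = sensitivity * prior                             # P(B|A) P(A)
    denominator = numerator + (1 - specificity) * (1 - prior)   # + P(B|not A) P(not A)
    return numerator / denominator
```

The denominator is just P(B), the overall probability of testing positive, which is why the whole expression equals P(A and B) / P(B).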

This example is based on an early test for HIV antibodies known in North America as the ELISA test.
Just for the example's sake, I have replaced HIV with Covid19.



It's because this is a rare disease (see the probability of Covid19 in the screenshot), and this is actually a fairly common problem for rare diseases: the false positives greatly outnumber the true positives. So even though the test is very accurate, we get more false positives than true positives. This obviously has important policy implications for things like mandatory testing. It makes much more sense to test in a sub-population where the prevalence of Covid19 is higher, rather than in a general population where it's quite rare.
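To make the false-positive effect concrete, here is a rough back-of-the-envelope calculation in Python. The prevalence, sensitivity and specificity are illustrative assumptions, not the values from the screenshot:

```python
# Illustrative assumptions only (not the values from the screenshot).
prevalence  = 0.001   # P(Covid19): 1 in 1,000 people actually has the disease
sensitivity = 0.98    # P(positive test | Covid19)
specificity = 0.98    # P(negative test | no Covid19)

population = 1_000_000
true_positives  = population * prevalence * sensitivity              # ~980
false_positives = population * (1 - prevalence) * (1 - specificity)  # ~19,980

# P(Covid19 | positive test) = true positives / all positives
posterior = true_positives / (true_positives + false_positives)
print(round(true_positives), round(false_positives), round(posterior, 3))  # 980 19980 0.047
```

Even with a test that is 98% accurate in both directions, fewer than 5% of the people who test positive under these assumed numbers actually have the disease, which is exactly the policy point above.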

Monday, April 13, 2020

Need to apply Deep Learning but don't have enough data, what to do next?

Often analysts and data scientists want to apply deep learning models to solve a problem but don't have enough data to train them. There are three main ways to improve the data situation: collecting more data, synthesizing new data, or augmenting existing data. But what if there is not much academic work on the problem you want to solve?

Convolutional Neural Networks (CNNs) have worked pretty well on most computer vision tasks, but CNNs (particularly deep CNNs) are heavily dependent on the availability of very large training sets to avoid overfitting. So in almost all computer vision tasks, having more data helps, and in today's world we don't have enough data for the majority of tasks. When you are training a computer vision model, data augmentation is therefore often a must.
Some of the common data augmentation techniques used in computer vision models are given below; a combined code sketch follows the list.
a)    Mirroring
A typical example is mirroring the image about the vertical axis (a horizontal flip).
b)    Random Cropping
It is not always the ideal method for data augmentation, but it works well as long as the cropped images are a reasonable subset of the original image.
c)     Other techniques like rotation, shearing and local warping
d)    Color Shifting
Color shifting adds different distortions to the RGB channels of an image.
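Here is a combined sketch of the techniques above using torchvision transforms (the parameter values are illustrative; Albumentations, Keras preprocessing layers or tf.image offer equivalent operations):

```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),         # a) mirroring about the vertical axis
    T.RandomResizedCrop(224),              # b) random cropping to a subset of the image
    T.RandomAffine(degrees=15, shear=10),  # c) rotation and shearing
    T.ColorJitter(brightness=0.2,          # d) color shifting on the RGB channels
                  contrast=0.2,
                  saturation=0.2),
    T.ToTensor(),
])
```

The PCA color augmentation used in the AlexNet paper (see the references) is a particular form of color shifting that perturbs the RGB channels along their principal components.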

Implementing distortions during training
The AlexNet authors employed two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk.
In their implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images, so these data augmentation schemes are, in effect, computationally free.
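A minimal PyTorch sketch of the same idea (the dataset path is hypothetical, and the transform pipeline stands in for whichever augmentations you choose): CPU worker processes apply the transforms on the fly while the GPU trains on the previous batch.

```python
import torch
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Any augmentation pipeline works here; transforms run on the fly,
# so augmented images are never written to disk.
train_transforms = T.Compose([T.RandomResizedCrop(224),
                              T.RandomHorizontalFlip(),
                              T.ToTensor()])

train_set = ImageFolder("data/train", transform=train_transforms)  # hypothetical path

# num_workers > 0 prepares the next batches in CPU worker processes
# while the GPU is busy with the current one.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    # ... forward pass, loss computation, backward pass ...
```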
For the big picture, a quick taxonomy of data augmentation methods in general is given in the figure below.
[Figure: taxonomy of data augmentation methods]

References:
1)    Andrew Ng, Deep Learning (deeplearning.ai)
2)    Data Augmentation on Workera.ai
3)    AlexNet paper (Krizhevsky et al., 2012) on PCA color augmentation
4)    Datahackers.rs blog