Wednesday, October 30, 2019

Mistagging of information when you don't know your data

Finding articles relevant to an entity is an interesting task. It gets complex when the entity is known by various acronyms and short forms, and more complex still when several entities share similar names, short names or acronyms.

The whole effort of building a complex web crawling and web scraping framework with Python tools such as Scrapy and Selenium, including tagging and presentation, goes for a toss if articles and documents are not entity-tagged properly.

If you search for the South Indian Bank stock on the https://moneycontrol.com/ website today (as on 30th Oct 19) and go to News & Research, the most recent and relevant articles listed for this stock are actually not related to South Indian Bank at all. Forget about the title; you would not find a mention of South Indian Bank anywhere inside the article. The articles are actually about a completely different entity with a similar name: Indian Bank.







The moneycontrol website and mobile application are amazing in various aspects, and they are a good source of information for most of us who are active in the share market, but this kind of blunder does occur when you do not understand your data well.
Matching articles to an entity has to be improved, especially in cases like this.
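
To see how such a mistag can arise, here is a minimal Python sketch of a naive word-overlap matcher. How moneycontrol actually matches articles is not known; the function `token_overlap`, the headline and the 0.6 threshold are all hypothetical and only illustrate the failure mode.

```python
import re


def token_overlap(entity, text):
    """Fraction of the entity's words that appear anywhere in the text."""
    entity_words = re.findall(r"[a-z0-9]+", entity.lower())
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    if not entity_words:
        return 0.0
    hits = sum(1 for word in entity_words if word in text_words)
    return hits / len(entity_words)


# Hypothetical headline that mentions Indian Bank only.
headline = "Indian Bank Q2 profit rises"
THRESHOLD = 0.6  # assumed cut-off, not a known moneycontrol setting

score = token_overlap("South Indian Bank", headline)
tagged = score >= THRESHOLD
print(round(score, 2), tagged)
```

Two of the three words in "South Indian Bank" occur in the headline, so the article clears the threshold and is tagged to the wrong entity even though South Indian Bank is never mentioned.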

My suggestion to the moneycontrol application-cum-AI architect would be to follow these simple steps while tagging.

- Tag articles where the entity name directly matches the title text
- Tag articles where the entity name directly matches the body text
- Tag articles where the entity name partially, but sufficiently, matches the title text
    - Complex fuzzy match
    - Match with an acronym
    - Match with another short form
- Tag articles where the entity name partially, but sufficiently, matches the body text
    - Complex fuzzy match
    - Match with an acronym
    - Match with another short form
- Save the names of the matched and matching entities along with the article IDs, the step that produced the match, etc.
- Exclude an article from an entity if the text it matched on directly matches a different entity (for example, an article tagged to South Indian Bank only because it contains "Indian Bank").
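
The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `ENTITIES` catalogue, the alias "SIB", the 0.75 threshold, the sample article and all function names are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever "complex fuzzy match" a real system would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical catalogue: canonical entity name -> acronyms and short forms.
ENTITIES = {
    "South Indian Bank": ["SIB"],
    "Indian Bank": [],
}
FUZZY_THRESHOLD = 0.75  # assumed cut-off for a "partial but sufficient" match


def tokens(text):
    """Lower-case word tokens with punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(words, n):
    """All runs of n consecutive words, joined back into strings."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def best_match(entity, aliases, text):
    """Return (step, matched_text) for the strongest match in text, or None."""
    words = tokens(text)
    name = " ".join(tokens(entity))
    n = len(name.split())
    if name in ngrams(words, n):
        return ("direct", name)
    for alias in aliases:
        short = " ".join(tokens(alias))
        if short and short in ngrams(words, len(short.split())):
            return ("alias", short)
    # Fuzzy step: compare the name with windows of n and n-1 words.
    windows = ngrams(words, n) + (ngrams(words, n - 1) if n > 1 else [])
    scored = [(SequenceMatcher(None, name, w).ratio(), w) for w in windows]
    if scored:
        score, window = max(scored)
        if score >= FUZZY_THRESHOLD:
            return ("fuzzy", window)
    return None


def tag_article(article_id, title, body):
    """Run every step on title and body, then apply the exclusion rule."""
    records = []
    for field, text in (("title", title), ("body", body)):
        for entity, aliases in ENTITIES.items():
            match = best_match(entity, aliases, text)
            if match:
                records.append({"article_id": article_id, "entity": entity,
                                "field": field, "step": match[0],
                                "matched_text": match[1]})
    # Exclusion: drop a fuzzy match whose matched text is exactly
    # the name of a different entity.
    names = {" ".join(tokens(e)): e for e in ENTITIES}
    return [r for r in records
            if not (r["step"] == "fuzzy"
                    and names.get(r["matched_text"], r["entity"]) != r["entity"])]


# Hypothetical article that is only about Indian Bank.
tags = tag_article("A1", "Indian Bank Q2 profit rises",
                   "Indian Bank reported higher profit for the quarter.")
for record in tags:
    print(record)
```

In this sample, "South Indian Bank" matches the article only fuzzily, through the text "indian bank". Because that text is the exact name of another entity, the exclusion step drops the tag and the article stays with Indian Bank alone. The sketch does not cover the reverse case, where an article about South Indian Bank also contains the phrase "Indian Bank"; a real tagger would need an extra rule for that, such as preferring the longest matching name.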

