Finding the articles relevant to an entity is an interesting task. It
gets harder when the entity is known by several acronyms and short forms,
and harder still when multiple entities share similar names, short names
or acronyms.
The whole effort of building a complex web crawling and web scraping
framework with Python tools such as Scrapy and Selenium, including
tagging and presentation, goes to waste if articles and documents are not
entity-tagged properly.
If you search for the South Indian Bank stock on the
https://moneycontrol.com/
website today (as of 30th Oct 2019) and go to News & Research, the most
recent and relevant articles you find for this stock are actually not related
to the South Indian Bank entity at all. Forget about the title; you would not
find a mention of South Indian Bank anywhere inside the article. The articles
are actually about a completely different entity with a similar name:
Indian Bank.
The moneycontrol website and mobile application are amazing in various
respects, and they are among the good sources of information for most of us
who are active in the share market, but this kind of blunder does occur when
you do not understand your data well.
Finding the articles relevant to an entity through matching has to be
improved, especially in cases like this.
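To see why similar names are risky for matching, here is a minimal sketch using Python's standard `difflib`. How moneycontrol actually tags its articles is not known; the `similarity` helper is a hypothetical name, and the example only illustrates that a generic string-similarity score rates the two bank names as close.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# "indian bank" is fully contained in "south indian bank", so the
# score is high even though these are two different companies.
score = similarity("South Indian Bank", "Indian Bank")
print(round(score, 2))
```

A tagger that accepts anything above a loose threshold on a score like this would treat the two entities as the same, which is why a fuzzy score alone is not enough.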
My suggestion to the moneycontrol application-cum-AI architect would
be to follow these simple steps while tagging.
- Tag articles where the entity name matches the title text directly.
- Tag articles where the entity name matches the body text directly.
- Tag articles where the entity name matches the title text partially
  but sufficiently:
    - complex fuzzy match
    - match on an acronym
    - match on another short form
- Tag articles where the entity name matches the body text partially
  but sufficiently:
    - complex fuzzy match
    - match on an acronym
    - match on another short form
- Save the names of the matched and matching entities along with the
  article IDs, the step that produced the match, etc.
- Exclude an article from an entity if the text it matched is a direct
  match for another matching entity (for example, an article matched to
  South Indian Bank only through the words "Indian Bank").
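
The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the `Entity` class, the function names, the `SIB` acronym, the 0.6 threshold and the sample article are all hypothetical, and a simple word-overlap score stands in for the "complex fuzzy match".

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    aliases: tuple = ()  # acronyms / short forms (hypothetical sample data)


def has_phrase(phrase, text, ignore_case=True):
    """True if `phrase` occurs in `text` as whole words."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, flags) is not None


def token_overlap(name, text):
    """Fraction of the words in `name` that also appear in `text`."""
    name_tokens = re.findall(r"\w+", name.lower())
    text_tokens = set(re.findall(r"\w+", text.lower()))
    return sum(t in text_tokens for t in name_tokens) / len(name_tokens)


def match_step(entity, title, body, threshold=0.6):
    """Return the first tagging step that fires for `entity`, or None."""
    fields = (("title", title), ("body", body))
    # Steps 1 and 2: direct match on title, then on body.
    for field, text in fields:
        if has_phrase(entity.name, text):
            return field + "_direct"
    # Steps 3 and 4: partial match on title, then on body.
    for field, text in fields:
        # Acronyms are matched case-sensitively to avoid common words.
        if any(has_phrase(a, text, ignore_case=False) for a in entity.aliases):
            return field + "_alias"
        if token_overlap(entity.name, text) >= threshold:
            return field + "_fuzzy"
    return None


def tag_article(article_id, title, body, entities):
    """Tag one article and record which step produced each tag."""
    matches = {}
    for entity in entities:
        step = match_step(entity, title, body)
        if step:
            matches[entity.name] = step
    direct = [n for n, s in matches.items() if s.endswith("_direct")]
    records = []
    for name, step in matches.items():
        # Exclusion rule: drop a fuzzy-only match when every word of a
        # directly matched *other* entity is contained in this name.
        excluded = step.endswith("_fuzzy") and any(
            token_overlap(d, name) == 1.0 for d in direct if d != name
        )
        records.append(
            {"article_id": article_id, "entity": name,
             "step": step, "excluded": excluded}
        )
    return records


entities = [Entity("South Indian Bank", ("SIB",)), Entity("Indian Bank")]
records = tag_article(
    101,
    "Indian Bank Q2 profit rises",
    "Indian Bank reported higher net profit for the quarter.",
    entities,
)
for record in records:
    print(record)
```

For the sample article, South Indian Bank is matched only at the fuzzy step and is marked as excluded, because Indian Bank matches the title directly. Saving the step with each record keeps the decision auditable. The reverse case, where the shorter name "Indian Bank" appears inside a genuine mention of "South Indian Bank", would need an extra longest-name-first rule that this sketch does not include.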