Customer Onboarding Journey with LDA Topic Modeling

Business Problem :

The customer journey from document submission to issuance is complex: besides document verification and risk analysis, each application also passes through regulatory checks. As a result, journeys often move slowly between stages, and pending tickets with free-text reasons are raised. The aim is to understand where and why these journeys get stuck so that the process can be reformed.

Data :

| Submission Date | Pending Stage | Ticket Text | Pending Date |
| --- | --- | --- | --- |
| Date on which documents were submitted | Stage at which the journey is stuck | Reason, in plain text, why it is stuck | Current date |
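As a rough illustration of how these four fields could be used together, the sketch below computes the average number of days a journey has been pending at each stage. The rows, stage names, and ticket texts are invented for the example and are not taken from the real data; the function name `days_pending_by_stage` is likewise hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (submission_date, pending_stage, ticket_text, pending_date).
# Stage names and ticket texts are invented for illustration only.
tickets = [
    (date(2023, 1, 2), "Document Verification", "ID proof image is blurred", date(2023, 1, 12)),
    (date(2023, 1, 5), "Regulatory Check", "Awaiting sanctions screening result", date(2023, 1, 12)),
    (date(2023, 1, 8), "Document Verification", "Address proof missing", date(2023, 1, 12)),
]


def days_pending_by_stage(rows):
    """Return the average number of days pending for each stage."""
    days_per_stage = defaultdict(list)
    for submitted, stage, _text, pending in rows:
        # Pending duration = Pending Date minus Submission Date.
        days_per_stage[stage].append((pending - submitted).days)
    return {stage: sum(days) / len(days) for stage, days in days_per_stage.items()}


avg_days = days_pending_by_stage(tickets)
for stage, avg in avg_days.items():
    print(f"{stage}: {avg:.1f} days pending on average")
```

A summary like this shows where journeys are stuck; the ticket text is what the topic model uses to explain why.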

Solution :

Topic modeling using Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for extracting topics from a given corpus. The term “latent” means hidden or concealed: something that exists but has not yet been observed.

The topics we want to extract from the data are likewise hidden and yet to be discovered, hence the “latent” in LDA. The “Dirichlet allocation” part is named after the Dirichlet distribution and the Dirichlet process.

Named after the German mathematician, Peter Gustav Lejeune Dirichlet, Dirichlet processes in probability theory are “a family of stochastic processes whose realizations are probability distributions.”

A Dirichlet process is therefore a distribution over distributions: each draw from it is itself a probability distribution. In LDA, the closely related Dirichlet distribution plays this role: each draw is a vector of probabilities that sums to 1, such as the mix of topics in a document or the mix of words in a topic.
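The "distribution over distributions" idea is easier to see with the finite Dirichlet distribution. The sketch below draws a few samples using only the Python standard library, relying on the standard construction of a Dirichlet draw from normalised Gamma variables. The concentration values in `alpha` and the helper name `sample_dirichlet` are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha).

    Standard construction: draw independent Gamma(alpha_i, 1) variables
    and normalise them so that they sum to 1.
    """
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]  # assumed concentration parameters for 3 topics

# Each draw is itself a probability distribution over the 3 topics.
draws = [sample_dirichlet(alpha, rng) for _ in range(3)]
for d in draws:
    print([round(p, 3) for p in d], "sum =", round(sum(d), 3))
```

Every printed vector is non-negative and sums to 1, so each one can serve as the topic mix of a single document.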

To start with, let’s randomly assign weights to both matrices (the topic weights for each document and the word weights for each topic) and assume that our data is generated by the following steps:

  1. Randomly choose a topic from the document’s distribution of topics, based on the assigned weights. In the previous example, let’s say we chose the pink topic.
  2. Next, based on the distribution of words for the chosen topic, select a word at random and put it in the document.
  3. Repeat steps 1 and 2 for the entire document.

If our guess of the weights is wrong, the data we actually observe will be very unlikely under the assumed weights and data-generating process. Fitting the model therefore means adjusting the weights until the observed data becomes as likely as possible.
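The generative steps above, and the point about wrong weights making the observed data unlikely, can be sketched in a few lines of standard-library Python. All weights, topic names, and words below are invented for illustration, and the function names are hypothetical.

```python
import math
import random

# Hypothetical weights for one document and two topics.
doc_topic = {"pink": 0.7, "blue": 0.3}  # P(topic | document)
topic_word = {                          # P(word | topic)
    "pink": {"document": 0.5, "missing": 0.4, "screening": 0.1},
    "blue": {"document": 0.1, "missing": 0.1, "screening": 0.8},
}


def generate_document(n_words, doc_topic, topic_word, rng):
    """Generate words by repeating step 1 (pick topic) and step 2 (pick word)."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        word_dist = topic_word[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


def log_likelihood(words, doc_topic, topic_word):
    """Log-probability of the words, summing over the hidden topic of each word."""
    total = 0.0
    for w in words:
        p_word = sum(doc_topic[t] * topic_word[t][w] for t in doc_topic)
        total += math.log(p_word)
    return total


sampled = generate_document(8, doc_topic, topic_word, random.Random(0))
print("Sampled document:", sampled)

# A fixed "observed" document, scored under the weights above and a poor guess.
observed = ["document", "missing", "document", "missing", "screening"]
wrong_doc_topic = {"pink": 0.1, "blue": 0.9}
ll_good = log_likelihood(observed, doc_topic, topic_word)
ll_wrong = log_likelihood(observed, wrong_doc_topic, topic_word)
print("Log-likelihood, original weights:", round(ll_good, 3))
print("Log-likelihood, poor guess:      ", round(ll_wrong, 3))
```

The observed document is dominated by words the pink topic favours, so the guess that puts most weight on the blue topic scores a lower log-likelihood. Real LDA inference searches over the weights far more efficiently, but it is driven by the same comparison.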

Topics Found in the Tickets :


If you would like to discuss this project in more detail, let’s connect!