LDA Topic modeling on pending tickets text data

Customer onboarding Journey with LDA Topic modeling


Business Problem :

The customer journey from document submission to issuance is complex: besides document verification and risk analysis, it also passes through regulatory checks. This often slows movement between stages and leads to pending tickets being raised as free text. The aim is to understand where and why these journeys get stuck so that the process can be reformed.

Data :

Submission Date	Pending Stage	Ticket Text	Pending Date
Date on which documents were submitted	Stage at which the journey is stuck	Reason, in plain text, why it is stuck	Current date
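
As a rough illustration of how the columns above can yield a turnaround figure, the sketch below averages the days between Submission Date and Pending Date for each Pending Stage. The sample rows, stage names, and the `average_days_pending` helper are hypothetical, and the project may compute TAT between stages differently.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (submission_date, pending_stage, pending_date).
tickets = [
    (date(2023, 1, 2), "Document Verification", date(2023, 1, 12)),
    (date(2023, 1, 5), "Document Verification", date(2023, 1, 9)),
    (date(2023, 1, 3), "Regulatory Check", date(2023, 1, 24)),
]


def average_days_pending(rows):
    """Return {stage: mean days between submission date and pending date}."""
    days_by_stage = defaultdict(list)
    for submitted, stage, pending in rows:
        days_by_stage[stage].append((pending - submitted).days)
    return {stage: sum(days) / len(days) for stage, days in days_by_stage.items()}


avg_tat = average_days_pending(tickets)
for stage, days in avg_tat.items():
    print(f"{stage}: {days:.1f} days")
```

Per-stage averages like these are the kind of values that could label the links of a Sankey diagram.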

Solution :

Visualize the onboarding journey with a Sankey diagram that shows the stage each customer is in and the average turnaround time (TAT) in days between stages.
Understand why these journeys are stuck by applying natural language processing, specifically gensim LDA topic modeling, to the pending ticket text.
Topic modeling suits this problem because it gives a probability distribution over topics for each document (ticket), and a ticket may be stuck for more than one reason.

Topic modeling using Latent Dirichlet Allocation (LDA)


Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus. The term latent conveys something that exists but is not yet developed. In other words, latent means hidden or concealed.

The topics that we want to extract from the data are likewise hidden: they are yet to be discovered. Hence the term “latent” in LDA. The “Dirichlet allocation” part is named after the Dirichlet distribution and the Dirichlet process.

Named after the German mathematician, Peter Gustav Lejeune Dirichlet, Dirichlet processes in probability theory are “a family of stochastic processes whose realizations are probability distributions.”

A Dirichlet process is a distribution over distributions: each draw from it is itself a probability distribution. LDA uses the finite-dimensional counterpart, the Dirichlet distribution, whose draws are probability vectors such as the topic proportions of a single document.
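
To make "each draw is itself a distribution" concrete, the sketch below samples from a Dirichlet distribution, the finite-dimensional case, using only the Python standard library. It relies on the standard construction of normalizing independent Gamma draws; the concentration parameters and the `sample_dirichlet` helper are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha).

    Uses the standard construction: draw Gamma(alpha_i, 1) for each
    component, then normalize so the components sum to 1.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]  # hypothetical concentration parameters for 3 topics

# Every sample is a different probability distribution over the 3 topics.
samples = [sample_dirichlet(alpha, rng) for _ in range(3)]
for s in samples:
    print([round(p, 3) for p in s])
```

Every printed vector is non-negative and sums to 1, so each one could serve as the topic mixture of one ticket.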


To start with, let’s randomly assign weights to both matrices (the topic weights per document and the word weights per topic) and assume that our data is generated by the following steps:

Randomly choose a topic from the document's distribution of topics, based on the assigned weights. In the previous example, let’s say we chose the pink topic.
Next, based on the distribution of words for the chosen topic, select a word at random and put it in the document.
Repeat these steps for the entire document.

In this process, if our guess of the weights is wrong, then the data we actually observe will be very unlikely under our assumed weights and data-generating process.
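
The generative steps above can be sketched in a few lines of standard-library Python. The topic names, words, weights, and the `generate_document` helper are hypothetical and only illustrate the sampling procedure; they are not topics found in the tickets.

```python
import random

# Hypothetical weights: topic mixture of one document, and word
# distribution of each topic.
doc_topic_weights = {"documents": 0.7, "compliance": 0.3}
topic_word_weights = {
    "documents": {"passport": 0.5, "address": 0.3, "proof": 0.2},
    "compliance": {"risk": 0.4, "sanctions": 0.4, "review": 0.2},
}


def generate_document(doc_topics, topic_words, length, rng):
    """Generate `length` words by repeating: pick a topic, then pick a word."""
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's topic weights.
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        # Step 2: choose a word according to that topic's word weights.
        word_dist = topic_words[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(1)
document = generate_document(doc_topic_weights, topic_word_weights, 8, rng)
print(document)
```

Training LDA runs this logic in reverse: it searches for the weights under which the observed tickets would be likely outputs of such a process.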

Topics Found in the Tickets :

Use the '<' and '>' controls to move between the diagrams.

If you would like to discuss this project in more detail, let's connect!