Topic modeling

In this tutorial, we briefly demonstrate topic modeling on text fields of the URL Attributes table in the private environment.

Query for relevant URL records

Querying for relevant URL records typically involves combining a number of constraints. The following example extracts records from cnn.com that were shared mostly in the United States and whose share_main_blurb field begins with one of the keywords pertaining to partisan politics listed in the regular expression in the following query:

from fbri.private.sql.query import execute
import pandas as pd

database = "fbri_prod_private"
table = "erc_condor_url_attributes_dp_final_v3"

sql = f"""
SELECT *
FROM {database}.{table}
WHERE public_shares_top_country = 'US'
AND regexp_like(lower(share_main_blurb), '^(2016|2020|election|democrat|president|trump|nominee|candidate|republican|vote|super|tuesday|congress)')
AND parent_domain='cnn.com'
LIMIT 5000
"""

result = execute(sql, "5000_US_CNN.tsv")
df = pd.read_csv('5000_US_CNN.tsv', delimiter = '\t', error_bad_lines=False)
df

The result of running this code is a table similar to the following example:

Simulated data

The values in the example table below are simulated. They do not represent actual data.
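
Note that the error_bad_lines argument is deprecated in recent pandas releases. If the pandas version in your environment is 1.3 or later (an assumption; check pd.__version__), an equivalent read of the TSV file would be:

# For pandas >= 1.3, error_bad_lines is replaced by on_bad_lines
df = pd.read_csv('5000_US_CNN.tsv', delimiter='\t', on_bad_lines='skip')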

Helper functions

You can define a few helper functions to ease the remainder of development.

Enter the following code into a new cell:

import re

def preprocess(preprocessed_df, text_column='share_main_blurb'):
    processed_df = preprocessed_df.copy()
    # Strip basic punctuation, then lowercase the text
    processed_df['text_processed'] = processed_df[text_column].map(lambda x: re.sub(r'[,\.!?]', '', str(x)))
    processed_df['text_processed'] = processed_df['text_processed'].map(lambda x: x.lower())

    return processed_df

Running this cell does not produce any output; it only defines the helper function.
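
To check what the helper does, you can optionally run it on a small throwaway dataframe first. The two blurbs below are invented for illustration and are not dataset values:

# Hypothetical sanity check with made-up strings
sample = pd.DataFrame({'share_main_blurb': ["Election Day, 2020!", "Who won Super Tuesday?"]})
preprocess(sample)['text_processed'].tolist()

This returns ['election day 2020', 'who won super tuesday']: punctuation removed and text lowercased.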

Enter the following code in the next cell:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_10_most_common_words(count_data, count_vectorizer):
    # Sum the count of each vocabulary term across all documents
    words = count_vectorizer.get_feature_names()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts += t.toarray()[0]

    # Keep the 10 most frequent terms and their counts
    count_dict = sorted(zip(words, total_counts), key=lambda x: x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words))

    # Plot the term frequencies as a bar chart
    plt.figure(2, figsize=(15, 15/1.6180))
    plt.subplot(title='10 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    sns.barplot(x=x_pos, y=counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90)
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()

This code defines a helper that totals the count of each vocabulary term across the corpus, selects the 10 most frequent words, and plots them as a bar chart, setting the figure size, subplot title, xlabel, and ylabel within plot_10_most_common_words. Running this cell does not return any result.

In the next cell, enter:

def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

This code defines a helper that prints, for each topic in a fitted model, the n_top_words words with the highest weights. Running this cell does not return any result.

Enter the following code in the next cell:

import matplotlib.pyplot as plt

def plot_coherence(blurb_coherence):
    plt.figure(figsize=(10,5))
    plt.plot(range(1,31), blurb_coherence)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence Score");

This code defines a helper that plots coherence score against the number of topics, setting the figure size, xlabel, and ylabel within plot_coherence. Running this cell does not return any result.

Preprocessing data

In this step, you call the helper function that preprocesses a given column of the dataframe by removing punctuation and lowercasing the entire field. Some preprocessing tasks can also be specified as parameters in the vectorizing steps that follow.

df = preprocess(df)
df['text_processed'].head()

The results should look similar to:

In the next cell, enter:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

count_vectorizer = CountVectorizer(max_features=3500, min_df=5, max_df=0.8, stop_words='english')
count_data = count_vectorizer.fit_transform(df['text_processed'])
plot_10_most_common_words(count_data, count_vectorizer)

Running this code vectorizes the preprocessed text with scikit-learn's CountVectorizer and returns a bar graph of the 10 most common words, ordered from most common to least common.
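
If you want to inspect what CountVectorizer produced before modeling, count_data is a sparse document-term matrix with one row per blurb and one column per retained term. For example:

# Optional inspection: (number of blurbs, vocabulary size capped at max_features)
print(count_data.shape)
# Vocabulary actually kept (use get_feature_names_out() on scikit-learn >= 1.0)
print(len(count_vectorizer.get_feature_names()))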

Topic modeling and text analysis

You have prepared the data and defined a few helper functions to support your analysis. Now you are ready to explore topic modeling through Latent Dirichlet Allocation (LDA), a probabilistic topic modeling algorithm that views each document as a mixture of a small number of topics (Blei et al., 2003).

This section consists of two examples, one that uses LDA from scikit-learn and the other from gensim.

Scikit-learn LDA

The following parameters specify how many topics to extract from the corpus (number_topics) and how many words to print for each resulting topic (number_words). As with any hyperparameters, these need to be tuned according to the questions you are investigating in your research.

number_topics = 10
number_words = 20

In the next cell, enter:

%%time
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)

print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)

The result of running the code above should provide results similar to:
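
Because LDA represents each document as a mixture of topics, you can also inspect the per-document topic distributions of the fitted model. The following optional snippet is only a sketch and assumes the lda and count_data objects from the cells above are still in memory:

# Each row is one blurb's distribution over the 10 topics (rows sum to 1)
doc_topic = lda.transform(count_data)
print(doc_topic.shape)
print(doc_topic[0].round(3))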

Gensim LDA

This example uses the LDA model from gensim, another Python Natural Language Processing (NLP)/text analysis library. In order to make an appropriate choice for the number_topics parameter, you measure topic coherence as a function of topic numbers.

In the next cell, enter:

%%time
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
from nltk.tokenize import RegexpTokenizer

# Tokenize into words of at least two characters that start with a letter
tokenizer = RegexpTokenizer(r"[a-zA-Z]\w+'?\w*")
df['tokenized_text'] = df['text_processed'].apply(lambda row: tokenizer.tokenize(row))

# Build the gensim dictionary and bag-of-words corpus from the tokenized blurbs
blurb_dictionary = Dictionary(df['tokenized_text'])
blurb_corpus = [blurb_dictionary.doc2bow(blurb) for blurb in df['tokenized_text']]
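
Optionally, before computing coherence you can prune very rare and very common tokens from the dictionary, roughly mirroring the min_df and max_df settings used with CountVectorizer earlier. This pruning step is an addition to the workflow shown here; if you use it, rebuild the corpus afterward:

# Optional pruning (not part of the original workflow)
blurb_dictionary.filter_extremes(no_below=5, no_above=0.8)
blurb_corpus = [blurb_dictionary.doc2bow(blurb) for blurb in df['tokenized_text']]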

In the following cell, enter:

blurb_coherence = []
for nb_topics in range(1,31):
    lda = LdaModel(blurb_corpus, num_topics = nb_topics, id2word = blurb_dictionary, passes=10)
    cohm = CoherenceModel(model=lda, corpus=blurb_corpus, dictionary=blurb_dictionary, coherence='u_mass')
    coh = cohm.get_coherence()
    blurb_coherence.append(coh)

When this code runs, it trains an LDA model for each number of topics from 1 to 30 and records the corresponding coherence score, which you can then use to choose an appropriate value for the number_topics parameter. The code can take 10 to 11 minutes to complete.

To create the plot, run the following code:

plot_coherence(blurb_coherence)
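
Once the coherence plot suggests a reasonable number of topics, you can fit a final gensim model and print its topics. The value of 10 below is only a placeholder; choose it from your own coherence curve:

# Fit a final model with the chosen number of topics (10 is a placeholder)
final_lda = LdaModel(blurb_corpus, num_topics=10, id2word=blurb_dictionary, passes=10)
for topic in final_lda.print_topics(num_topics=10, num_words=number_words):
    print(topic)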

This is by no means an exhaustive analysis; it is only intended to briefly demonstrate how elementary NLP and text mining techniques can be applied to the URL Shares dataset. Additional text preprocessing steps such as stemming or lemmatization and language filtering would be appropriate in a typical workflow, and more visualizations can be generated to support the analysis. Alternative topic modeling algorithms such as matrix factorization, as well as additional text mining techniques such as sentiment analysis, are also popular approaches to unsupervised learning on text data.
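
As one illustration of the matrix factorization alternative mentioned above, the following sketch applies scikit-learn's NMF to TF-IDF features of the same preprocessed text and reuses the print_topics helper. It is a starting point rather than part of the tutorial workflow:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features with the same vocabulary constraints used earlier
tfidf_vectorizer = TfidfVectorizer(max_features=3500, min_df=5, max_df=0.8, stop_words='english')
tfidf_data = tfidf_vectorizer.fit_transform(df['text_processed'])

# Non-negative matrix factorization as an alternative topic model
nmf = NMF(n_components=number_topics, random_state=0)
nmf.fit(tfidf_data)

print("Topics found via NMF:")
print_topics(nmf, tfidf_vectorizer, number_words)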