Topic modeling URLs

This tutorial demonstrates topic modeling on text fields of the URL Attributes table.

Query for relevant URL records

Querying for relevant URL records might involve a number of constraints. This example extracts records from cnn.com that were shared mostly in the United States and whose main blurb matches keywords pertaining to partisan politics, as listed in the query below:

from fbri.public.sql.query import execute
import pandas as pd

database = "fbri_prod_public"
table = "erc_condor_url_attributes_dp_final_public_v3"

sql = f"""
SELECT *
FROM {database}.{table}
WHERE public_shares_top_country = 'US'
AND regexp_like(lower(share_main_blurb), '^(2016|2020|election|democrat|president|trump|nominee|candidate|republican|vote|super|tuesday|congress)')
AND parent_domain='cnn.com'
LIMIT 5000
"""

result = execute(sql, "5000_US_CNN.tsv")
df = pd.read_csv('5000_US_CNN.tsv', delimiter='\t', error_bad_lines=False)
df

The result from running this code should be similar to the following example:

Helper functions

You can define a few helper functions to simplify the rest of the development.

In the first cell:

import re

def preprocess(preprocessed_df, text_column='share_main_blurb'):
    processed_df = preprocessed_df.copy()
    processed_df['text_processed'] = processed_df[text_column].map(lambda x: re.sub('[,\.!?]', '', str(x)))
    processed_df['text_processed'] = processed_df['text_processed'].map(lambda x: x.lower())
    
    return processed_df

The code above defines a preprocess function that strips commas, periods, exclamation points, and question marks from the share_main_blurb column (or any column passed as text_column), lowercases the text, and returns a copy of the dataframe with the result in a new text_processed column. Running this cell only defines the function and does not return anything.
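To see what the function does, you can call it on a small dataframe. The example below uses made-up blurbs purely for illustration; the tutorial applies preprocess to the query results in the Preprocessing data section:

# Hypothetical sample data, only to illustrate punctuation removal and lowercasing
sample_df = pd.DataFrame({'share_main_blurb': [
    "Election Day 2020: Who Will Win?",
    "Congress votes, again!"
]})
preprocess(sample_df)['text_processed'].tolist()
# ['election day 2020: who will win', 'congress votes again']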

In the next cell:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_10_most_common_words(count_data, count_vectorizer):
    words = count_vectorizer.get_feature_names()
    # Tally how often each vocabulary word appears across all documents
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts += t.toarray()[0]
    
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 
    
    plt.figure(2, figsize=(15, 15/1.6180))
    plt.subplot(title='10 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    sns.barplot(x=x_pos, y=counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()

The code above sums the count of each word across every document in count_data, selects the 10 most common words, and plots them as a bar chart, with the figure size, subplot title, x-label, and y-label set inside plot_10_most_common_words. Running this cell does not return anything.

In the next cell:

def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

The code above prints each topic discovered by the model as the topic number followed by its n_top_words highest-weighted words. Running this cell does not return anything.
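The slicing expression topic.argsort()[:-n_top_words - 1:-1] can be hard to read; it selects the indices of the n_top_words largest topic weights. Here is a quick, standalone illustration using made-up numbers:

import numpy as np

topic = np.array([0.1, 0.7, 0.3, 0.9])        # hypothetical topic-word weights
n_top_words = 2
print(topic.argsort()[:-n_top_words - 1:-1])  # indices of the 2 largest weights: [3 1]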

In the next cell:

import matplotlib.pyplot as plt

def plot_coherence(blurb_coherence):
    plt.figure(figsize=(10,5))
    plt.plot(range(1,31), blurb_coherence)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence Score");

The code above defines the plot settings, such as figure size, x-label, and y-label, within plot_coherence, which plots a coherence score for each number of topics from 1 to 30. Running this cell does not return anything.

Preprocessing data

Start by calling a helper function that preprocesses a given column of the dataframe by removing punctuation and lowercasing the entire field. Note that some preprocessing tasks can also be specified as parameters in the vectorizing steps that follow.

df = preprocess(df)
df['text_processed'].head()

Running this code after the query from the Query for relevant URL records section returns:

Running the code below vectorizes the processed text with scikit-learn's CountVectorizer and returns a bar graph of the 10 most common words, ordered from most to least common, using the numpy and scikit-learn libraries.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

count_vectorizer = CountVectorizer(max_features=3500, min_df=5, max_df=0.8, stop_words='english')
count_data = count_vectorizer.fit_transform(df['text_processed'])
plot_10_most_common_words(count_data, count_vectorizer)

The result should be similar to this:

Topic modeling and text analysis

Now that you have prepared the data and defined a few helper functions to support your analysis, you're ready to explore topic modeling through Latent Dirichlet Allocation (LDA). LDA is a probabilistic topic modeling algorithm that views each document as a mixture of a small number of topics (Blei et al., 2003).

This section consists of two examples, one that uses LDA from Scikit-learn and the other from Gensim.

Scikit-learn LDA

The following parameters specify how many topics to extract from the corpus (number_topics) and how many words to print from each resulting topic (number_words). As with any hyperparameters, these values need to be tuned to the problem at hand to achieve good results. For an example, run the two cells of code below.

number_topics = 10
number_words = 20

%%time
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)

print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)

Running the code above should provide results similar to:
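Because LDA treats each document as a mixture of topics, you can also look at the per-document topic distribution. The following optional sketch reuses the lda and count_data variables from the cells above:

# Each row of doc_topics is one blurb's distribution over the fitted topics
doc_topics = lda.transform(count_data)
print(doc_topics.shape)         # (number of documents, number_topics)
print(doc_topics[0].round(3))   # topic proportions for the first blurb
print(doc_topics[0].argmax())   # most probable topic for the first blurb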

Gensim LDA

The next example uses the LDA model from Gensim, another Python NLP/text analysis library.

In the first cell:

%%time
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
from nltk.tokenize import RegexpTokenizer

# Tokenize each processed blurb into word tokens
tokenizer = RegexpTokenizer(r"[a-zA-Z]\w+\'?\w*")
df['tokenized_text'] = df['text_processed'].apply(lambda row: tokenizer.tokenize(row))

# Build the Gensim dictionary and bag-of-words corpus
blurb_dictionary = Dictionary(df['tokenized_text'])
blurb_corpus = [blurb_dictionary.doc2bow(blurb) for blurb in df['tokenized_text']]

# Fit an LDA model for each candidate number of topics and record its u_mass coherence
blurb_coherence = []
for nb_topics in range(1, 31):
    lda = LdaModel(blurb_corpus, num_topics=nb_topics, id2word=blurb_dictionary, passes=10)
    cohm = CoherenceModel(model=lda, corpus=blurb_corpus, dictionary=blurb_dictionary, coherence='u_mass')
    coh = cohm.get_coherence()
    blurb_coherence.append(coh)

This code searches for an appropriate number of topics by fitting an LDA model for every topic count from 1 to 30 and measuring topic coherence (u_mass) for each. In this example, the cell took 13 minutes and 24 seconds to complete; it could run more quickly or slowly for you.

To create the plot, run the code below as a final step:

plot_coherence(blurb_coherence)

The results produced should look similar to:
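Once you have examined the coherence plot, one natural follow-up is to pick a number of topics (for example, near where the curve flattens) and fit a final Gensim model. The sketch below reuses blurb_corpus and blurb_dictionary from the earlier cell; the choice of 10 topics is only an illustration:

# Fit a final model with a chosen number of topics (10 is only an example;
# base the choice on the coherence plot above)
final_num_topics = 10
final_lda = LdaModel(blurb_corpus, num_topics=final_num_topics, id2word=blurb_dictionary, passes=10)

# Print each topic's index and its 10 highest-weighted words
for topic_id, topic_terms in final_lda.print_topics(num_topics=final_num_topics, num_words=10):
    print(f"Topic #{topic_id}: {topic_terms}")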