😅 How repetitive are you? Using Python to reveal word patterns in tweets.

Data analysis of a brand’s tweets: frequency distributions, bigrams, trigrams & everyone’s fav visualisation.

Marta
4 min read · Sep 30, 2020

You write for social media. Do you feel like you’re being repetitive? You probably are.

We’ve looked at hashtags, mentions, sentiment, and emojis; now it’s time to look at words in general.

Can we tell what these tweets are about without having to read all of them?

Don’t take this as discouragement from reading them, though: some are quite entertaining (as assessed by me, the hi-la-rious individual who wrote them).

📝 Word source: The data set

The dataset was downloaded from Twitter Analytics and covers the period from April to September 2020. As much as I’d love to have more, Twitter doesn’t store data going further back or, if it does, it doesn’t make it available to the account owner.

The dataset includes all the tweets sent in this period from the @makingjam (JAM) account. It has 529 rows and 40 columns, most of which we won’t need.

I’ll spare you a report on data cleaning. All the steps and the full analysis of this data are in the notebook.

😸 Step 1: Frequency list from a looong John Johnson*

We want to plot the frequency of words used in all the tweets. It sounds like a lot of counting, but Python makes the process very easy.

To start off we create a column with tokenised text.

Now, we join all cells into a long string, remove punctuation, then remove stop words and numbers.
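For readers who don’t want to open the notebook, here is a minimal sketch of those two steps using only Python’s standard library. The sample tweets and the tiny stop word list are made up for illustration, and the notebook’s actual tokeniser and stop word list may well differ.

```python
import string

# Hypothetical sample tweets standing in for the Twitter Analytics export.
tweets = [
    "Join us for The Remote PM on 14 May! https://t.co/abc #prodmgmt",
    "Product leaders &amp; remote teams, join the event",
]

# A toy stop word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"us", "for", "the", "on"}


def tokenise(text):
    """Lowercase a tweet and split it on whitespace."""
    return text.lower().split()


# One list of tokens per tweet, playing the role of the tokenised column.
tokenised = [tokenise(tweet) for tweet in tweets]

# Join every cell into one long string.
long_string = " ".join(token for row in tokenised for token in row)

# Swap each punctuation character for a space, so URLs and HTML entities
# fall apart into separate pieces instead of gluing together.
punctuation_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
no_punctuation = long_string.translate(punctuation_to_space)

# Drop stop words and pure numbers.
words = [w for w in no_punctuation.split() if w not in STOP_WORDS and not w.isdigit()]

print(words)
```

Replacing punctuation with spaces (rather than deleting it) is a design choice in this sketch: it keeps words on either side of a slash or ampersand apart, at the cost of leaving URL fragments in the list.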

We’re left with a loooong string, which we can plot using FreqDist.
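FreqDist comes from NLTK and is a subclass of Python’s collections.Counter, so the counting itself can be sketched with Counter alone. The word list below is a made-up stand-in for the cleaned text, not the real data.

```python
from collections import Counter

# Hypothetical cleaned tokens (an assumption, not the notebook's data).
words = ["product", "join", "remote", "product", "event", "join", "product", "amp"]

# Count how often each word appears.
freq = Counter(words)

# The three most frequent words with their counts.
top = freq.most_common(3)
for word, count in top:
    print(f"{word:<10}{count}")
```

With NLTK and matplotlib installed, swapping Counter for nltk.FreqDist and calling freq.plot(20) should draw the chart for the 20 most frequent words.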

Output:

Ooops, there are some words here that are clearly useless. What’s that? ‘oo’? ‘amp’? Let’s see a list.

Most common words in the corpus.

We’d better remove some of these words. We can do it by updating the stop word list, recreating the loooong string and plotting it all again.
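A rough sketch of that loop, with made-up tokens and a toy stop word list (both are assumptions, not the notebook’s data):

```python
from collections import Counter

# Toy stop word list and hypothetical tokens from a first cleaning pass.
stop_words = {"the", "for", "on", "us"}
tokens = ["product", "https", "co", "amp", "join", "product", "oo", "remote", "https", "co"]

# Add the leftovers spotted in the first frequency list.
stop_words.update({"https", "co", "amp", "oo"})

# Recreate the cleaned word list with the extended stop words and recount.
filtered = [t for t in tokens if t not in stop_words]
freq = Counter(filtered)

print(freq.most_common(5))
```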

Output:

This now makes more sense: we have actual words here, no ‘https’ or ‘co’. 👌

What do we see here? Some context first. The brand, makingjam, organises online and offline events for people working in product management (and related fields).

This explains the frequency of words like product, remote, join, event. The Remote PM and JAM London are the names of two events organised by the brand; we used them as hashtags too.

🤷‍♀️ No one likes isolation, not even words

Words in isolation are not very informative. To get a more detailed picture we can also check for the most common phrases, using bi- and trigrams. Thankfully, we can get those by passing an extra parameter, ngram_range, to scikit-learn’s CountVectorizer.

Here’s what we’ll do:

  • we’ll define a function with a CountVectorizer specifying we want to get only the bi- and trigrams,
  • turn the results into a data frame, and
  • return the specified number of top results.

We get the top 10 and plot them.
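To show what actually gets counted, here is a hand-rolled sketch of the same idea using only the standard library. It is a stand-in for the CountVectorizer version, not the notebook’s code: it returns a plain list of (phrase, count) pairs instead of a data frame, and the sample tweets are invented. In scikit-learn, the equivalent setting would be ngram_range=(2, 3).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, sizes=(2, 3), top=10):
    """Count bi- and trigrams within each text and return the most common."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for n in sizes:
            # Phrases never span two tweets: each text is counted on its own.
            counts.update(ngrams(tokens, n))
    return counts.most_common(top)


# Hypothetical, already-cleaned tweets (an assumption for illustration).
corpus = [
    "product leaders join remote event",
    "product leaders love jam london",
    "join remote event today",
]

result = top_ngrams(corpus, top=10)
for phrase, count in result:
    print(f"{count}  {phrase}")
```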

Output:

The most common phrase is “product leaders”. Three of the top 10 values are lists of industry hashtags (e.g. prodmgmt pmot), and jamlondon 2019 is the name of one of the events makingjam organised.

☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️

No word analysis is complete without a word cloud. FACT.

A picture is worth 1000 words, they say. But those who say that clearly don’t make word clouds, because with 1000 words it would be a really unclear one. We’ll stick to 50.
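Here is a hedged sketch of how a 50-word cloud could be produced. The counting part uses only the standard library; the rendering part assumes the third-party wordcloud package is installed, and the colormap is a placeholder rather than the brand’s palette.

```python
from collections import Counter


def top_frequencies(words, limit=50):
    """Return a {word: count} dict for the `limit` most frequent words."""
    return dict(Counter(words).most_common(limit))


def save_word_cloud(frequencies, path="wordcloud.png", limit=50):
    """Render the frequencies as a word cloud image and save it to `path`."""
    # Imported here so the counting above still runs without the package.
    from wordcloud import WordCloud

    cloud = WordCloud(
        width=800,
        height=400,
        max_words=limit,           # cap the cloud at `limit` words
        background_color="white",
        colormap="autumn",         # placeholder palette (an assumption)
    )
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Hypothetical cleaned tokens (an assumption, not the notebook's data).
words = ["product"] * 3 + ["remote"] * 2 + ["event"]
freqs = top_frequencies(words, limit=2)
print(freqs)
```

Calling save_word_cloud(top_frequencies(words)) would then write the image to disk, provided wordcloud is available.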

Output:

Take a moment to appreciate the choice of the colour palette that approximates the brand’s colours. 💁‍♀️

With all this information about top mentions, hashtags, emojis, and words, you’d already do a great impersonation of me on Twitter. You’d probably write something like:

“Product leaders! Join a remote product event with @mattlemay #pmot 👉”

I’ll accept your submissions; it’ll save me time writing company tweets! 😝

*If you don’t get the reference, treat yourself to some YT education.

Marta

📈 Aspiring data scientist. Rationality fan. EA. Vegan. Working to improve global mental health at MindEase.io