Histogram of Tweets per User in 2022: SQL Interview Problem

When preparing for data science interviews, it’s common to encounter questions that require you to analyze user behavior, especially when it comes to social media platforms like Twitter. In this problem, we’ll write a query to create a histogram of tweets posted by each user in 2022. Let’s dive into the details of this problem, break down the requirements, and then craft a solution using SQL.

Problem Statement

We are given a table, tweets, that contains information about tweets posted on Twitter. The table has the following schema:

Table: tweets

Column NameType
tweet_idinteger
user_idinteger
msgstring
tweet_datetimestamp

Our goal is to find out how many tweets each user posted in the year 2022 and then group users by the number of tweets they posted. The output should be a histogram where:

  • The bucket represents the number of tweets a user posted in 2022.
  • The number of users in each bucket represents the count of users who posted that many tweets.

Example Input:

tweet_iduser_idmsgtweet_date
214252111Am considering taking Tesla private at $420.12/30/2021 00:00:00
739252111Despite the constant negative press covfefe01/01/2022 00:00:00
846402111Following @NickSinghTech on Twitter changed my life!02/14/2022 00:00:00
241425254If the salary is so competitive why won’t you tell me what it is?03/01/2022 00:00:00
231574148I no longer have a manager. I can’t be managed03/23/2022 00:00:00

Example Output:

tweet_bucketusers_num
12
21

In this output:

  • There are two users who posted exactly 1 tweet in 2022.
  • There is one user who posted exactly 2 tweets in 2022.

Step-by-Step Solution

  1. Filter the Tweets for the Year 2022: Since we are only interested in tweets posted in 2022, we need to filter the tweets table to include only those rows where the tweet_date falls within the year 2022.
  2. Count Tweets per User: For each user, count the number of tweets they posted in 2022.
  3. Create a Histogram: Group the users by the number of tweets they posted and then count the number of users in each group.

SQL Query

Here’s how you can write the SQL query to achieve this:

SELECT tweet_count AS tweet_bucket, COUNT(user_id) AS users_num
FROM (
    SELECT user_id, COUNT(tweet_id) AS tweet_count
    FROM tweets
    WHERE YEAR(tweet_date) = 2022
    GROUP BY user_id
) AS user_tweet_counts
GROUP BY tweet_count
ORDER BY tweet_bucket;

Explanation of the Query

  1. Subquery (Inner Query):
  • This subquery filters tweets posted in the year 2022 using WHERE YEAR(tweet_date) = 2022.
  • It then groups the tweets by user_id and counts the number of tweets for each user using COUNT(tweet_id) AS tweet_count.
  • The result is a table where each row contains a user_id and the number of tweets (tweet_count) they posted in 2022.
  1. Main Query (Outer Query):
  • The outer query takes the result of the subquery and groups the users by tweet_count.
  • It counts how many users fall into each tweet_count bucket using COUNT(user_id).
  • The result is a histogram of tweet counts per user, showing how many users posted a specific number of tweets in 2022.

Making It Interesting: Understanding User Engagement

This query gives valuable insights into user engagement on Twitter in a specific year. For example, if you notice that a large number of users fall into the lower tweet count buckets, it might indicate that most users are casual tweeters who don’t post frequently. Conversely, if there are significant numbers in higher tweet buckets, it could show a more engaged user base. These insights could help Twitter or similar platforms tailor their user engagement strategies effectively.

Conclusion

This SQL query is an excellent example of how data scientists can derive meaningful insights from raw data. It requires a good understanding of SQL functions like GROUP BY, COUNT(), and how to work with dates in SQL. Whether you’re preparing for an interview or analyzing real-world data, this exercise demonstrates the importance of breaking down the problem into manageable steps to achieve the desired result.