When preparing for data science interviews, it’s common to encounter questions that require you to analyze user behavior, especially when it comes to social media platforms like Twitter. In this problem, we’ll write a query to create a histogram of tweets posted by each user in 2022. Let’s dive into the details of this problem, break down the requirements, and then craft a solution using SQL.
Problem Statement
We are given a table, tweets
, that contains information about tweets posted on Twitter. The table has the following schema:
Table: tweets
Column Name | Type |
---|---|
tweet_id | integer |
user_id | integer |
msg | string |
tweet_date | timestamp |
Our goal is to find out how many tweets each user posted in the year 2022 and then group users by the number of tweets they posted. The output should be a histogram where:
- The bucket represents the number of tweets a user posted in 2022.
- The number of users in each bucket represents the count of users who posted that many tweets.
Example Input:
tweet_id | user_id | msg | tweet_date |
---|---|---|---|
214252 | 111 | Am considering taking Tesla private at $420. | 12/30/2021 00:00:00 |
739252 | 111 | Despite the constant negative press covfefe | 01/01/2022 00:00:00 |
846402 | 111 | Following @NickSinghTech on Twitter changed my life! | 02/14/2022 00:00:00 |
241425 | 254 | If the salary is so competitive why won’t you tell me what it is? | 03/01/2022 00:00:00 |
231574 | 148 | I no longer have a manager. I can’t be managed | 03/23/2022 00:00:00 |
Example Output:
tweet_bucket | users_num |
---|---|
1 | 2 |
2 | 1 |
In this output:
- There are two users who posted exactly 1 tweet in 2022.
- There is one user who posted exactly 2 tweets in 2022.
Step-by-Step Solution
- Filter the Tweets for the Year 2022: Since we are only interested in tweets posted in 2022, we need to filter the
tweets
table to include only those rows where thetweet_date
falls within the year 2022. - Count Tweets per User: For each user, count the number of tweets they posted in 2022.
- Create a Histogram: Group the users by the number of tweets they posted and then count the number of users in each group.
SQL Query
Here’s how you can write the SQL query to achieve this:
SELECT tweet_count AS tweet_bucket, COUNT(user_id) AS users_num
FROM (
SELECT user_id, COUNT(tweet_id) AS tweet_count
FROM tweets
WHERE YEAR(tweet_date) = 2022
GROUP BY user_id
) AS user_tweet_counts
GROUP BY tweet_count
ORDER BY tweet_bucket;
Explanation of the Query
- Subquery (Inner Query):
- This subquery filters tweets posted in the year 2022 using
WHERE YEAR(tweet_date) = 2022
. - It then groups the tweets by
user_id
and counts the number of tweets for each user usingCOUNT(tweet_id) AS tweet_count
. - The result is a table where each row contains a
user_id
and the number of tweets (tweet_count
) they posted in 2022.
- Main Query (Outer Query):
- The outer query takes the result of the subquery and groups the users by
tweet_count
. - It counts how many users fall into each
tweet_count
bucket usingCOUNT(user_id)
. - The result is a histogram of tweet counts per user, showing how many users posted a specific number of tweets in 2022.
Making It Interesting: Understanding User Engagement
This query gives valuable insights into user engagement on Twitter in a specific year. For example, if you notice that a large number of users fall into the lower tweet count buckets, it might indicate that most users are casual tweeters who don’t post frequently. Conversely, if there are significant numbers in higher tweet buckets, it could show a more engaged user base. These insights could help Twitter or similar platforms tailor their user engagement strategies effectively.
Conclusion
This SQL query is an excellent example of how data scientists can derive meaningful insights from raw data. It requires a good understanding of SQL functions like GROUP BY
, COUNT()
, and how to work with dates in SQL. Whether you’re preparing for an interview or analyzing real-world data, this exercise demonstrates the importance of breaking down the problem into manageable steps to achieve the desired result.