SQL | 15 days of learning SQL
Challenge introduction
This challenge offers a valuable exercise in constructing complex queries, aggregating and transforming data, working with dates and generating running statistics. This problem enhances SQL skills, especially in areas like CTEs, JOIN operations, window functions and data aggregation. These skills are useful in real-world scenarios, where data from multiple sources needs consolidation and analysis.
The problem is sourced from HackerRank and is available at the following link. I recommend attempting to solve the challenge independently before reviewing the provided solution.
Problem statement
Julia conducted a 15 days of learning SQL contest. The start date of the contest was March 01, 2016 and the end date was March 15, 2016.
Write a query to print total number of unique hackers who made at least one submission each day (starting on the first day of the contest), and find the hacker_id and name of the hacker who made maximum number of submissions each day. If more than one such hacker has a maximum number of submissions, print the lowest hacker_id. The query should print this information for each day of the contest, sorted by the date.
Input format.The data needed for this challenge is spread across the following tables:
Column | Type |
---|---|
hacker_id | Integer |
name | String |
hackers
The hacker_id is the id of the hacker, and name is the name of the hacker
Column | Type |
---|---|
submission_date | Date |
submission_id | Integer |
hacker_id | Integer |
score | Integer |
submissions
The submission_date is the date of the submission, submission_id is the id of the submission, hacker_id is the id of the hacker who made the submission, and score is the score of the submission.
Sample input.
Take as an example (assuming that the end date of the contest was March 06, 2016) the data contained in the following tables:
hacker_id | name |
---|---|
15758 | Rose |
20703 | Angela |
36396 | Frank |
38289 | Patrick |
44065 | Lisa |
53473 | Kimberly |
62529 | Bonnie |
79722 | Michael |
submission_date | submission_id | hacker_id | score |
---|---|---|---|
2016-03-01 | 8494 | 20703 | 0 |
2016-03-01 | 22403 | 53473 | 15 |
2016-03-01 | 23965 | 79722 | 60 |
2016-03-01 | 30173 | 36396 | 70 |
2016-03-02 | 34928 | 20703 | 0 |
2016-03-02 | 38740 | 15758 | 60 |
2016-03-02 | 42769 | 79722 | 25 |
2016-03-02 | 44364 | 79722 | 60 |
2016-03-03 | 45440 | 20703 | 0 |
2016-03-03 | 49050 | 36396 | 70 |
2016-03-03 | 50273 | 79722 | 5 |
2016-03-04 | 50344 | 20703 | 0 |
2016-03-04 | 51360 | 44065 | 90 |
2016-03-04 | 54404 | 53473 | 65 |
2016-03-04 | 61533 | 79722 | 45 |
2016-03-05 | 72852 | 20703 | 0 |
2016-03-05 | 74546 | 38289 | 0 |
2016-03-05 | 76487 | 62529 | 0 |
2016-03-05 | 82439 | 36396 | 10 |
2016-03-05 | 90006 | 36396 | 40 |
2016-03-06 | 90404 | 20703 | 0 |
Expected output.
The expected output given the sample input is the following:
2016-03-01 | 4 | 20703 | Angela | ||||
2016-03-02 | 2 | 79722 | Michael | ||||
2016-03-03 | 2 | 20703 | Angela | ||||
2016-03-04 | 2 | 20703 | Angela | ||||
2016-03-05 | 1 | 36396 | Frank | ||||
2016-03-06 | 1 | 20703 | Angela |
Explanation.
On March 01, 2016 hackers 20703, 36396, 53473, and 79722 made submissions. There are 4 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made maximum number of submissions on this day. The name of the hacker is Angela.
On March 02, 2016 hackers 15758, 20703, and 79722 made submissions. Now 20703 and 79722 were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. 79722 made two submissions, and name of the hacker is Michael.
On March 03, 2016 hackers 20703, 36396, and 79722 made submissions. Now 20703 and 79722 were the only ones, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made maximum number of submissions on this day. The name of the hacker is Angela.
On March 04, 2016 hackers 20703, 36396, 38289, and 65529made submissions. Now only 20703 and 79722 submitted each day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made maximum number of submissions on this day. The name of the hacker is Angela.
On March 05, 2016 hackers 20703, 36396, 38289 and 65529 made submissions. Now only 20703 submitted each day, so there is only 1 unique hacker who made at least one submission each day. 36396 made two submissions and name of the hacker is Frank.
On March 06, 2016 only 20703 made a submission, so there is only 1 unique hacker who made at least one submission each day. 20703 made one submission and the name of the hacker is Angela.
Solution
To approach this problem effectively, it's crucial to recognize that it involves two distinct challenges, each requiring a different set of logical skills. However, both complexities can be simplified by initially aggregating the submission data by date and hacker, creating a foundational auxiliary table for subsequent operations.
The first, more straightforward challenge involves identifying the hacker with the most submissions each day. This can be accomplished by sorting hackers within each date based on submission counts descending and hacker IDs ascending, consistently selecting the top-ranked hacker for each submission date block.
Conversely, the second challenge, which demands tracking the count of hackers submitting daily, is more intricate. To tackle this, the chosen approach utilizes window functions to streamline the task of calculating consecutive submission days for each hacker. This involves employing DENSE_RANK to determine the index of the submission day's and comparing it to the ROW_NUMBER for a given hacker_id for this date. It's important to note that this approach assumes at least one hacker submits every day; otherwise, DENSE_RANK may not function as intended. Additionally, while DATEPART could replace DENSE_RANK under the specific conditions of this problem where dates begin at the month's first day, it would result in a less versatile solution that would not work if start or end dates change.
Finally, the solution integrates all components and incorporates the hackers' table to extract their names, completing the task.
-- Generate auxiliary CTEs
WITH daily_subs AS (
SELECT hacker_id, submission_date, COUNT(*) AS num_subs
FROM submissions
GROUP BY hacker_id, submission_date),
all_days_subs AS (
SELECT submission_date, hacker_id
FROM (SELECT submission_date,
hacker_id,
DENSE_RANK() OVER(ORDER BY submission_date ASC) AS day_num,
ROW_NUMBER() OVER (PARTITION BY hacker_id
ORDER BY submission_date ASC) AS streak
FROM daily_subs) AS streak_check
WHERE day_num = streak),
num_all_day_subs AS (
SELECT submission_date, COUNT(*) AS num_subs
FROM all_days_subs
GROUP BY submission_date),
daily_ranking AS (
SELECT submission_date, hacker_id,
ROW_NUMBER() OVER(PARTITION BY submission_date
ORDER BY num_subs DESC, hacker_id ASC) AS rank
FROM daily_subs),
daily_winners AS (
SELECT submission_date, hacker_id
FROM daily_ranking
WHERE rank = 1)
-- Main query
SELECT ads.submission_date, num_subs, h.hacker_id, name
FROM num_all_day_subs AS ads
LEFT JOIN daily_winners AS dw
ON ads.submission_date = dw.submission_date
LEFT JOIN hackers as h
ON dw.hacker_id = h.hacker_id
ORDER BY ads.submission_date ASC;