SQL | Interviews

Challenge introduction

This challenge offers a valuable exercise in constructing complex queries, aggregating and transforming data, and handling missing data. This problem enhances SQL skills, especially in areas like CTEs, JOIN operations, and data aggregation. These skills are useful in real-world scenarios, where data from multiple sources needs consolidation and analysis.

The problem is sourced from HackerRank and is available at the following link. I recommend attempting to solve the challenge independently before reviewing the provided solution.

Problem statement

Samantha interviews many candidates from different colleges using coding challenges and contests. Write a query to print the contest_id, hacker_id, name, and the sums of total_submissions, total_accepted_submissions, total_views, and total_unique_views for each contest sorted by contest_id. Exclude the contest from the result if all four sums are zero.

Note: A specific contest can be used to screen candidates at more than one college, but each college only holds one screening contest.

Input format.
The data needed for this challenge is spread across the following tables:

ColumnType
contest_idInteger
hacker_idInteger
nameString

contests

The contest_id is the id of the contest, hacker_id is the id of the hacker who created the contest, and name is the name of the hacker.

ColumnType
college_idInteger
contest_idInteger

colleges

The college_id is the id of the college, and contest_id is the id of the contest that Samantha used to screen the candidates.

ColumnType
challenge_idInteger
college_idInteger

challenges

The challenge_id is the id of the challenge that belongs to one of the contests whose contest_id Samantha used, and college_id is the id of the college where the challenge was given to candidates.

ColumnType
challenge_idInteger
total_viewsInteger
total_unique_viewsInteger

view_stats

The challenge_id is the id of the challenge, total_views is the number of times the challenge was viewed by candidates, and total_unique_views is the number of times the challenge was viewed by unique candidates.

ColumnType
challenge_idInteger
total_submissionsInteger
total_accepted_submissionsInteger

submission_stats

The challenge_id is the id of the challenge, total_submissions is the number of submissions for the challenge, and total_accepted_submission is the number of submissions that achieved full scores.

Sample input.

Take as an example the data contained in the following tables:

contest_idhacker_idname
6640617973Rose
6655679153Angela
9482880275Frank
college_idcontest_id
1121966406
3247366556
5668594828
challenge_idcollege_id
1876511219
4712711219
6029232473
7297456685
challenge_idtotal_viewstotal_unique_views
471272619
471271514
187654310
187657213
755163517
602921110
729744115
755167511
challenge_idtotal_submissionstotal_accepted_submissions
755163412
471272710
471275618
755167412
75516838
729746824
729748214
471272811

Expected output.

The expected output given the sample input is the following:

6640617973Rose1113915656
6655679153Angela001110
9482880275Frank150384115

Explanation.

The contest 66406 is used in the college 11219. In this college challenges 18765 and 47127 are asked, so from the view and submission stats we get:

  • Sum of total submissions = 27 + 56 + 28 = 111
  • Sum of total accepted submissions = 10 + 18 + 11 = 39
  • Sum of total views = 43 + 72 + 26 + 15 = 156
  • Sum of total unique views = 10 + 13 + 19 + 14 = 56

Similarly, we can find the sums for contests 66556 and 94828.

Solution

The critical aspect of solving this problem lies in recognizing the specific sequence of aggregation operations that must be executed. With the exception of the view_stats and submission_stats tables, all other tables primarily serve as intermediaries connecting the contest_id (our primary identifier) with the challenge_id, which, in turn, links to view and submission statistics.

The initial step involves acknowledging that views and submissions are distinct aspects stemming from the challenge. Therefore, the aggregation of both tables by challenge_id should be carried out independently before merging them based on this identifier. Subsequently, the final step entails joining the tables as interconnections and conducting the overarching aggregation based on contest_id.


-- Generate auxiliary CTEs
WITH submissions_agg AS (
    SELECT challenge_id,
           SUM(total_submissions) AS sum_subs,
           SUM(total_accepted_submissions) AS sum_acc_subs
    FROM submission_stats
    GROUP BY challenge_id),
    
views_agg AS (
    SELECT challenge_id,
           SUM(total_views) AS sum_views,
           SUM(total_unique_views) AS sum_unique_views
    FROM view_stats
    GROUP BY challenge_id),

subs_views_agg AS (
    SELECT COALESCE(s.challenge_id, v.challenge_id) AS challenge_id,
           COALESCE(sum_subs, 0) AS sum_subs,
           COALESCE(sum_acc_subs, 0) AS sum_acc_subs,
           COALESCE(sum_views, 0) AS sum_views,
           COALESCE(sum_unique_views, 0) AS sum_unique_views
    FROM submissions_agg AS s
    FULL JOIN views_agg AS v
        ON s.challenge_id = v.challenge_id)

-- Main query
SELECT cont.contest_id, hacker_id, name, 
       SUM(sum_subs) AS sum_subs,
       SUM(sum_acc_subs) AS sum_acc_subs,
       SUM(sum_views) AS sum_views,
       SUM(sum_unique_views) AS sum_unique_views
FROM subs_views_agg AS sv     
LEFT JOIN challenges as chall
    ON sv.challenge_id = chall.challenge_id
LEFT JOIN colleges AS coll
    ON chall.college_id = coll.college_id
LEFT JOIN contests as cont
    ON coll.contest_id = cont.contest_id
GROUP BY cont.contest_id, hacker_id, name
ORDER BY cont.contest_id ASC;