1. JOINs
query = """
SELECT o.Name AS Owner_Name, p.Name AS Pet_Name
FROM `bigquery-public-data.pet_records.owners` AS o
INNER JOIN `bigquery-public-data.pet_records.pets` AS p
ON p.ID = o.Pet_ID
"""
JOIN 연산에는 FULL JOIN, INNER JOIN, LEFT JOIN, RIGHT JOIN이 있다. Pandas 라이브러리의 merge() 함수의 left, right와 비슷한 결과를 츨력하며, RIGHT JOIN은 JOIN 구문을 기준으로 오른쪽 데이터 셋의 key를 중심으로 교집합을, LEFT JOIN은 왼쪽 데이터셋의 key를 중심으로 데이터를 합친다. FULL JOIN의 경우 전체 key를 기준으로 통합하며 결측값이 발생할 수 있다.
2. UNIONs
UNION은 concat() 메소드와 마찬가지로 행을 기준으로 데이터를 이어 붙인다.
query = """
SELECT Age FROM `bigquery-public-data.pet_records.pets`
UNION ALL
SELECT Age FROM `bigquery-public-data.pet_records.owners`
"""
UNION ALL은 중복값을 포함하여 데이터셋을 합치고, UNION DISTINCT는 중복값을 제거하여 데이터를 합친다.
# Query to select all stories posted on January 1, 2012, with number of comments
job_query = """
WITH c AS
(
SELECT parent, COUNT(*) AS num_comments
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
)
SELECT s.id AS story_id, s.by, s.title, c.num_comments
FROM `bigquery-public-data.hacker_news.stories` AS s
LEFT JOIN c
ON s.id = c.parent
WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'
ORDER BY c.num_comments DESC
"""
# Run the query, and return a pandas DataFrame
join_result = client.query(job_query).result().to_dataframe()
3. Exercise : JOINs and UNIONs
Step 1 : How long does it take for question to receive answers?
first_query = """
SELECT q.id AS q_id
MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers AS p
ON q.id = a.parent_id
WHERE q.creation_data >= '2018
"""
# Your code here
correct_query = """
SELECT q.id AS q_id,
MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON q.id = a.parent_id
WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
GROUP BY q_id
ORDER BY time_to_answer
"""
Step 2 : Initial questions and answers
# Your code here
q_and_a_query = """
SELECT q.owner_user_id AS owner_user_id,
MIN(q.creation_date) AS q_creation_date,
MIN(a.creation_date) AS a_creation_date
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON q.owner_user_id = a.owner_user_id
WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-02-01'
AND a.creation_date >= '2019-01-01' AND a.creation_date < '2019-02-01'
GROUP BY owner_user_id
"""
Step 3 : Initial questions and answers
# Your code here
three_tables_query = """
SELECT u.id AS id,
MIN(q.creation_date) AS q_creation_date,
MIN(a.creation_date) AS a_creation_date
FROM `bigquery-public-data.stackoverflow.users` AS u
LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
ON u.id = q.owner_user_id
LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON u.id = a.owner_user_id
WHERE u.creation_date >= '2019-01-01' AND u.creation_date < '2019-02-01'
GROUP BY id
"""
Step 4 : How many distinct users posted on January 1, 2019?
# Your code here
all_users_query = """
SELECT q.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-01-02'
UNION DISTINCT
SELECT a.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
WHERE a.creation_date >= '2019-01-01' AND a.creation_date < '2019-01-02'
"""
Source of the course : Kaggle Course _ JOINs and UNIONs
'Course > [Kaggle] Data Science' 카테고리의 다른 글
[SQL] Nested and Repeated Data (0) | 2022.02.24 |
---|---|
[SQL] Analytic Functions (0) | 2022.02.24 |
[SQL] Joining data (0) | 2022.02.24 |
[SQL] AS & With (0) | 2022.02.24 |
[SQL] Order By (0) | 2022.02.24 |