1. JOINs

query = """
        SELECT o.Name AS Owner_Name, p.Name AS Pet_Name
        FROM `bigquery-public-data.pet_records.owners` AS o
        INNER JOIN `bigquery-public-data.pet_records.pets` AS p
            ON p.ID = o.Pet_ID
        """

JOIN 연산에는 FULL JOIN, INNER JOIN, LEFT JOIN, RIGHT JOIN이 있다. Pandas 라이브러리의 merge() 함수의 left, right와 비슷한 결과를 츨력하며, RIGHT JOIN은 JOIN 구문을 기준으로 오른쪽 데이터 셋의 key를 중심으로 교집합을, LEFT JOIN은 왼쪽 데이터셋의 key를 중심으로 데이터를 합친다. FULL JOIN의 경우 전체 key를 기준으로 통합하며 결측값이 발생할 수 있다.

2. UNIONs

UNION은 concat() 메소드와 마찬가지로 행을 기준으로 데이터를 이어 붙인다.

query = """
        SELECT Age FROM `bigquery-public-data.pet_records.pets`
        UNION ALL
        SELECT Age FROM `bigquery-public-data.pet_records.owners`
        """

UNION ALL은 중복값을 포함하여 데이터셋을 합치고, UNION DISTINCT는 중복값을 제거하여 데이터를 합친다.

# Query to select all stories posted on January 1, 2012, with number of comments

job_query = """
            WITH c AS
            (
            SELECT parent, COUNT(*) AS num_comments
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            )
            SELECT s.id AS story_id, s.by, s.title, c.num_comments
            FROM `bigquery-public-data.hacker_news.stories` AS s 
            LEFT JOIN c
            ON s.id = c.parent
            WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'
            ORDER BY c.num_comments DESC
            """

# Run the query, and return a pandas DataFrame

join_result = client.query(job_query).result().to_dataframe()

3. Exercise : JOINs and UNIONs

Step 1 : How long does it take for question to receive answers?

first_query = """
              SELECT q.id AS q_id
                  MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
              FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                  INNER JOIN `bigquery-public-data.stackoverflow.posts_answers AS p
                    ON q.id = a.parent_id
              WHERE q.creation_data >= '2018
              """

# Your code here
correct_query = """
                SELECT q.id AS q_id,
                MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                ON q.id = a.parent_id
                WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
                GROUP BY q_id
                ORDER BY time_to_answer
                """

Step 2 : Initial questions and answers

# Your code here
q_and_a_query = """
                SELECT q.owner_user_id AS owner_user_id,
                    MIN(q.creation_date) AS q_creation_date,
                    MIN(a.creation_date) AS a_creation_date
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                    FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                ON q.owner_user_id = a.owner_user_id 
                WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-02-01' 
                    AND a.creation_date >= '2019-01-01' AND a.creation_date < '2019-02-01'
                GROUP BY owner_user_id
                """

Step 3 : Initial questions and answers

# Your code here
three_tables_query = """
                     SELECT u.id AS id,
                        MIN(q.creation_date) AS q_creation_date,
                        MIN(a.creation_date) AS a_creation_date
                     FROM `bigquery-public-data.stackoverflow.users` AS u
                     LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
                        ON u.id = q.owner_user_id
                     LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                        ON u.id = a.owner_user_id
                     WHERE u.creation_date >= '2019-01-01' AND u.creation_date < '2019-02-01'
                     GROUP BY id
                     """

Step 4 : How many distinct users posted on January 1, 2019?

# Your code here
all_users_query = """
                  SELECT q.owner_user_id 
                  FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-01-02'
                  UNION DISTINCT
                  SELECT a.owner_user_id 
                  FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                  WHERE a.creation_date >= '2019-01-01' AND a.creation_date < '2019-01-02'
                  """

Source of the course : Kaggle Course _ JOINs and UNIONs

'Course > [Kaggle] Data Science' 카테고리의 다른 글

[SQL] Nested and Repeated Data (0)	2022.02.24
[SQL] Analytic Functions (0)	2022.02.24
[SQL] Joining data (0)	2022.02.24
[SQL] AS & With (0)	2022.02.24
[SQL] Order By (0)	2022.02.24

[SQL] JOINs and UNIONs

1. JOINs

2. UNIONs

3. Exercise : JOINs and UNIONs

'Course > [Kaggle] Data Science' 카테고리의 다른 글

티스토리툴바