1. COUNT()

COUNT()는 데이터의 개수를 측정한다. 대표적인 aggregate funcion(SUM(), AVG(), MIN(), MAX()) 종류 중 하나로 결과로 f0__의 컬럼명을 가지는 데이터를 출력한다.

query = """
        SELECT COUNT(ID)
        FROM `bigquery-public-data.pet_records.pets`
        """

2. GROUP BY

GROUP BY 는 하나 이상의 컬럼명을 입력받아 고유의 값에 따른 행을 하나의 그룹으로 묶은 뒤 aggregat function을 적용해 데이터 셋을 계산한다.

query = """
        SELECT Animal, COUNT(ID)
        FROM `bigquery-public-data.pet_records.pets`
        GROUP BY Animal
        """

3. GROUP BY ... HAVING

HAVING은 GROUP BY와 함꼐 쓰여 특정한 aggregate function의 조건의 결과만을 보여준다. WHERE의 경우 테이블안에 존재하는 컬럼명에 대해서 데이터셋을 추출한다면 HAVING은 GROUP BY된 데이터 중 aggregate function의 결과에 대해서 조건을 적용한다.

query = """
        SELECT Animal, COUNT(ID)
        FROM `bigquery-public-data.pet_records.pets`
        GROUP BY Animal
        HAVING COUNT(ID) > 1
        """

4. Example of Applying SELECT, FROM, GROUP BY, HAVING

# Example : Which Hacker News comments generated the most discussion? 

from google.cloud import bigquery

# Create a "Client" object

client = bigquery.Client()

# Construct a reference to the "hacker_news" dataset

dataset_ref = client.dataset("hakcer_news", project = 'bigquery-public-data')

# API request - fetch the dataset

dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "comments" table

table_ref = dataset_ref.table("comments")

# API request - fetch the table

table = client.get_table(table_ref)

# Preview the first five lines of the "comments" table

client.list_rows(table, max_results = 5).to_dataframe()

# Query to select comments that received more than 10 replies

query_popular = """
                SELECT parent, COUNT(id)
                FROM `bigquery-public-data.hacker_news.comments`
                GROUP BY parent
                HAVING COUNT(id) > 10
                """

# Set up the query (cancel the query if it would use too much of your qouta, with the limit set to 10GB)

safe_config = bigquery.QueryJobConfig(maximum_bytes_build = 10**10)
query_job = client.query(query_popular, job_config = safe_config)

# API request - run the query, and convert the results to a pandas DataFrame

popular_comments = query_job.to_dataframe()

# Print the first five orws of the DataFrame

popular_comments.head()

5. Ailasing and other improvments

aggregate function의 결과는 f0__와 같이 출력되는데 이는 묘사되는 이름이 없다. 이름을 표기하기 위해서는 축약문인 AS를 사용해 컬럼명르 생성해줄 수 있다. COUNT(1), COUNT(*) 같은 경우 특정 컬럼을 지정하기 어려울 경우와 확실하지 않을 경우 자동으로 컬럼을 대입하여 aggregate 하게 된다.

# Improved version of earlier query, now with aliasing & improved readability

query_imporved = """
                 SELECT parent, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.comments`
                 GROUP BY parent
                 HAVING COUNT(1) > 10 
                 """

6. Note on using Group BY

# Good using of GROUP BY

query_good = """
             SELECT parent, COUNT(id)
             FROM `bigquery-public-data.hacker_news.comments`
             GROUP BY parent
             """

# Bad using of GROUP BY

query_bad = """
            SELECT author, parent, COUNT(id)
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            """

7. Exercise : Group By, Having & Count

Step 1 : Prolific commenters

# Query to select prolific commenters and post counts
prolific_commenters_query = """
                            SELECT author, COUNT(1) AS NumPosts
                            FROM `bigquery-public-data.hacker_news.comments`
                            GROUP BY author
                            HAVING NumPosts > 10000
                            """ # Your code goes here

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()

# View top few rows of results
print(prolific_commenters.head())

Step 2 : Deleted comments

# Write your query here and figure out the answer

query_deleted = """
                SELECT COUNT(1) AS count
                FROM `bigquery-public-data.hacker_news.comments`
                WHERE deleted = True
                """

query_job = client.query(query_deleted)

deleted_posts = query_job.to_dataframe()

num_deleted_posts = 227736

Source of the course : Kaggle Course _ Group By, Having & Count

Group By, Having & Count

Explore and run machine learning code with Kaggle Notebooks | Using data from Hacker News

www.kaggle.com

'Course > [Kaggle] Data Science' 카테고리의 다른 글

[SQL] AS & With (0)	2022.02.24
[SQL] Order By (0)	2022.02.24
[SQL] Select, From & Where (0)	2022.02.23
[SQL] Getting Started with SQL and BigQuery (0)	2022.02.23
[FE] Target Encoding (0)	2022.02.21

[SQL] Group By, Having & Count