1. COUNT()
COUNT()는 데이터의 개수를 측정한다. 대표적인 aggregate funcion(SUM(), AVG(), MIN(), MAX()) 종류 중 하나로 결과로 f0__의 컬럼명을 가지는 데이터를 출력한다.
query = """
SELECT COUNT(ID)
FROM `bigquery-public-data.pet_records.pets`
"""
2. GROUP BY
GROUP BY 는 하나 이상의 컬럼명을 입력받아 고유의 값에 따른 행을 하나의 그룹으로 묶은 뒤 aggregat function을 적용해 데이터 셋을 계산한다.
query = """
SELECT Animal, COUNT(ID)
FROM `bigquery-public-data.pet_records.pets`
GROUP BY Animal
"""
3. GROUP BY ... HAVING
HAVING은 GROUP BY와 함꼐 쓰여 특정한 aggregate function의 조건의 결과만을 보여준다. WHERE의 경우 테이블안에 존재하는 컬럼명에 대해서 데이터셋을 추출한다면 HAVING은 GROUP BY된 데이터 중 aggregate function의 결과에 대해서 조건을 적용한다.
query = """
SELECT Animal, COUNT(ID)
FROM `bigquery-public-data.pet_records.pets`
GROUP BY Animal
HAVING COUNT(ID) > 1
"""
4. Example of Applying SELECT, FROM, GROUP BY, HAVING
# Example : Which Hacker News comments generated the most discussion?
from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client()
# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hakcer_news", project = 'bigquery-public-data')
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
# Construct a reference to the "comments" table
table_ref = dataset_ref.table("comments")
# API request - fetch the table
table = client.get_table(table_ref)
# Preview the first five lines of the "comments" table
client.list_rows(table, max_results = 5).to_dataframe()
# Query to select comments that received more than 10 replies
query_popular = """
SELECT parent, COUNT(id)
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
HAVING COUNT(id) > 10
"""
# Set up the query (cancel the query if it would use too much of your qouta, with the limit set to 10GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_build = 10**10)
query_job = client.query(query_popular, job_config = safe_config)
# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()
# Print the first five orws of the DataFrame
popular_comments.head()
5. Ailasing and other improvments
aggregate function의 결과는 f0__와 같이 출력되는데 이는 묘사되는 이름이 없다. 이름을 표기하기 위해서는 축약문인 AS를 사용해 컬럼명르 생성해줄 수 있다. COUNT(1), COUNT(*) 같은 경우 특정 컬럼을 지정하기 어려울 경우와 확실하지 않을 경우 자동으로 컬럼을 대입하여 aggregate 하게 된다.
# Improved version of earlier query, now with aliasing & improved readability
query_imporved = """
SELECT parent, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
HAVING COUNT(1) > 10
"""
6. Note on using Group BY
# Good using of GROUP BY
query_good = """
SELECT parent, COUNT(id)
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
"""
# Bad using of GROUP BY
query_bad = """
SELECT author, parent, COUNT(id)
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
"""
7. Exercise : Group By, Having & Count
Step 1 : Prolific commenters
# Query to select prolific commenters and post counts
prolific_commenters_query = """
SELECT author, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY author
HAVING NumPosts > 10000
""" # Your code goes here
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()
# View top few rows of results
print(prolific_commenters.head())
Step 2 : Deleted comments
# Write your query here and figure out the answer
query_deleted = """
SELECT COUNT(1) AS count
FROM `bigquery-public-data.hacker_news.comments`
WHERE deleted = True
"""
query_job = client.query(query_deleted)
deleted_posts = query_job.to_dataframe()
num_deleted_posts = 227736
Source of the course : Kaggle Course _ Group By, Having & Count
Group By, Having & Count
Explore and run machine learning code with Kaggle Notebooks | Using data from Hacker News
www.kaggle.com
'Course > [Kaggle] Data Science' 카테고리의 다른 글
[SQL] AS & With (0) | 2022.02.24 |
---|---|
[SQL] Order By (0) | 2022.02.24 |
[SQL] Select, From & Where (0) | 2022.02.23 |
[SQL] Getting Started with SQL and BigQuery (0) | 2022.02.23 |
[FE] Target Encoding (0) | 2022.02.21 |