1. SELECT ... FROM
SQL에서 가장 기초적인 구문은 SELECT를 이용한 단일 컬럼을 추출하는 것이다.
- SELECT 뒤에 추출하고자 하는 컬럼명을 작성한다
- FROM뒤에 테이블을 작성한다
query = """
SELECT Name
FROM `bigquery-public-data.pet_records.pets`
"""
2. WHERE
특정 조건을 만족하는 컬럼을 추출하고자 하면 WHERE을 이용한 조건문을 작성하면 된다.
query = """
SELECT Name
FROM `bigquery-public-data.pet_records.pets`
WHERE Animal = 'Cat'
"""
3. Submitting the query to the dataset
Client 객체에 query() 메서드를 사용하면 query문장을 만족하는 데이터 셋을 fetch 하게 된다. 이후 to_dataframe()을 통해 DataFrame 객체로 결과를 내보내 분석을 진행할 수 있다.
# Create a "Client" object
client = bigquery.Client()
# Set up the query
query_job = client.query(query)
# API request - run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()
4. Working with bigdatasets
빅데이터 데이터셋은 너무 커서 효율적인 정보 검색을 위해 데이터셋의 용량을 제한할 필요가 있다. QueryJobConfig() 메서드는 데이터의 사이즈를 조절하고 query() 메소드의 job_config parameter의 argument로 적용시킨다.
# Query to get the score column from every row where the type column has value "job"
query = """
SELECT score, title
FROM `bigquery-public-data.hacker_news.full`
WHERE type = "job"
"""
# Create a QueryJobConfig object to estimate size of query without running it
dry_run_config = bigquery.QueryJobConfig(dry_run = True)
# API request - dry run query to estimate costs
dry_run_query_job = client.query(query, job_config = dry_run_config)
print(f"This query will process {dry_run_query_job.total_bytes_processed} bytes.")
# You can also specify a parameter when running the query to limit how much data you are willing to scan
# Only run the query if it's less than 1MB
ONE_MB = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)
# Set up the query
sqfe_query_job = client.query(query, job_config = safe_config)
# API request - try to run the query, and return a pandas DataFrame
sqfe_query_job.to_dataframe()
5. Exercise : Select, From & Where
Step 1 : Units of measurement
Which countries have reported pollution levels in units of "ppm"? In the code cell below, set first_query to an SQL query that pulls the appropriate entries from the country column.
# Query to select countries with units of "ppm"
first_query = """
SELECT country
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE unit = "ppm"
"""
# Set up the query (cancel the query if it would use too much of
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
first_query_job = client.query(first_query, job_config=safe_config)
# API request - run the query, and return a pandas DataFrame
first_results = first_query_job.to_dataframe()
# View top few rows of results
print(first_results.head())
Step 2 : High air quality
Which pollution levels were reported to be exactly 0?
# Query to select all columns where pollution levels are exactly 0
zero_pollution_query = """
SELECT *
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE value = 0
""" # Your code goes here
# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(zero_pollution_query, job_config=safe_config)
# API request - run the query and return a pandas DataFrame
zero_pollution_results = query_job.to_dataframe() # Your code goes here
print(zero_pollution_results.head())
Source of the course : Kaggle Course _ Select, From & Where
'Course > [Kaggle] Data Science' 카테고리의 다른 글
[SQL] Order By (0) | 2022.02.24 |
---|---|
[SQL] Group By, Having & Count (0) | 2022.02.23 |
[SQL] Getting Started with SQL and BigQuery (0) | 2022.02.23 |
[FE] Target Encoding (0) | 2022.02.21 |
[FE] Principal Component Analysis (0) | 2022.02.21 |