1. SELECT ... FROM

SQL에서 가장 기초적인 구문은 SELECT를 이용한 단일 컬럼을 추출하는 것이다.

SELECT 뒤에 추출하고자 하는 컬럼명을 작성한다
FROM뒤에 테이블을 작성한다

query = """
        SELECT Name
        FROM `bigquery-public-data.pet_records.pets`
        """

2. WHERE

특정 조건을 만족하는 컬럼을 추출하고자 하면 WHERE을 이용한 조건문을 작성하면 된다.

query = """
        SELECT Name 
        FROM `bigquery-public-data.pet_records.pets`
        WHERE Animal = 'Cat'
        """

3. Submitting the query to the dataset

Client 객체에 query() 메서드를 사용하면 query문장을 만족하는 데이터 셋을 fetch 하게 된다. 이후 to_dataframe()을 통해 DataFrame 객체로 결과를 내보내 분석을 진행할 수 있다.

# Create a "Client" object

client = bigquery.Client()

# Set up the query

query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame

us_cities = query_job.to_dataframe()

4. Working with bigdatasets

빅데이터 데이터셋은 너무 커서 효율적인 정보 검색을 위해 데이터셋의 용량을 제한할 필요가 있다. QueryJobConfig() 메서드는 데이터의 사이즈를 조절하고 query() 메소드의 job_config parameter의 argument로 적용시킨다.

# Query to get the score column from every row where the type column has value "job"

query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job"
        """

# Create a QueryJobConfig object to estimate size of query without running it 

dry_run_config = bigquery.QueryJobConfig(dry_run = True)

# API request - dry run query to estimate costs

dry_run_query_job = client.query(query, job_config = dry_run_config)

print(f"This query will process {dry_run_query_job.total_bytes_processed} bytes.")


# You can also specify a parameter when running the query to limit how much data you are willing to scan

# Only run the query if it's less than 1MB

ONE_MB = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)

# Set up the query 

sqfe_query_job = client.query(query, job_config = safe_config)

# API request - try to run the query, and return a pandas DataFrame

sqfe_query_job.to_dataframe()

5. Exercise : Select, From & Where

Step 1 : Units of measurement

Which countries have reported pollution levels in units of "ppm"? In the code cell below, set first_query to an SQL query that pulls the appropriate entries from the country column.

# Query to select countries with units of "ppm"
first_query = """
              SELECT country
              FROM `bigquery-public-data.openaq.global_air_quality`
              WHERE unit = "ppm"
              """ 

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
first_query_job = client.query(first_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
first_results = first_query_job.to_dataframe()

# View top few rows of results
print(first_results.head())

Step 2 : High air quality

Which pollution levels were reported to be exactly 0?

# Query to select all columns where pollution levels are exactly 0
zero_pollution_query = """
                       SELECT *
                       FROM `bigquery-public-data.openaq.global_air_quality`
                       WHERE value = 0
                       """ # Your code goes here

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(zero_pollution_query, job_config=safe_config)

# API request - run the query and return a pandas DataFrame
zero_pollution_results = query_job.to_dataframe() # Your code goes here

print(zero_pollution_results.head())

Source of the course : Kaggle Course _ Select, From & Where

'Course > [Kaggle] Data Science' 카테고리의 다른 글

[SQL] Order By (0)	2022.02.24
[SQL] Group By, Having & Count (0)	2022.02.23
[SQL] Getting Started with SQL and BigQuery (0)	2022.02.23
[FE] Target Encoding (0)	2022.02.21
[FE] Principal Component Analysis (0)	2022.02.21

[SQL] Select, From & Where