1. What is Analytic Functions ?
분석 함수는 Aggregate Function 의 계산을 지정하는 행 그룹을 기반으로 계산해서 각 그룹에 대해 여러 행을 반환할 수 있는 Function을 말한다. 일반적으로 누적 계산, 집계 및 보고용 결과를 질의 할 때 유용하게 사용할 수 있으며 복잡한 질의를 보다 간편하고 빠르게 실행할 수 있게 도와준다.
Analytic Function (분석 함수)
Analytic Function (분석 함수) 분석 함수는 Aggregate Function 의 계산을 지정하는 행 그룹을 기반으로 계산하여 각 그룹에 대해 여러 행을 반환 할 수 있는 Function 을 말합니다. 일반적으로 누적 계산, 집
oukr.tistory.com
2. Analytic Functions
id에 따라 전날과 현재의 평균 소요 시간을 계산한 질의문은 다음과 같다.
query = """
SELECT *,
AVG(time) OVER(
PARTITION BY id
ORDER BY date
ROWS BETWEEEN 1 PRECEDING AND CURRENT ROW
) as avg_time
FROM `bigquery-public-date.runners.train_time`
"""
모든 분석 함수는 'OVER' 구문을 가지고 있고 'OVER'은 계산에 사용되는 행을 정해주는 역할을 한다.
- PARTITION BY : 테이블의 행을 서로 다른 그룹으로 나눈다
- ORDER BY : 각 분할에 대해서 순서를 정한다
- window frame clause : 각 분할의 계산 방식을 정의한다
3. More on window frame cluase
- ROWS BETWEEN 1 PRECEDING AND CURRENT ROW : the previous row and the current row
- ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING : the 3 previous row, the current row, the following row
- ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING : all rows in the partition
4. Three types of analytic functions
- Analytic aggregation functions
MIN(), MAX(), AVG(), SUM(), COUNT()
- Anayltic navigation functions
FIRST_VALUE(), LAST_VALUE(), LEAD(), LAG()
- Analytic numbering functions
ROW_NUMBER(), RANK()
5. Analytic functions with CTEs
# Calculate the cumulative number of trips for each date in 2015.
num_trips_query = """
WITH trip_by_day AS
(
SELECT DATE(start_date) AS trip_date,
COUNT(*) as num_trips
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE EXTRACT(YEAR FROM start_date) = 2015
GROUP BY trip_date
)
SELECT *,
SUM(num_trips)
OVER (
ORDER BY trip_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_trips
FROM trip_by_day
"""
# Run the query, and return a pandas DataFrame
num_trips_result = client.query(num_trips_query).result().to_dataframe()
num_trips_result.head()
- Since there is no PARTITION BY cluase, the entire table is treated as a single partition.
- The ORDER BY cluase orders the row by date, where earlier dates appear first
- By setting the window frame cluase to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT NOW, we ensure that allrows up to and including the current date are used to calculate the sum
6. Exercise : Analytic Functions
Step 1 : How can you predict the demand for taxis
- Create a plot that shows a rolling average of the daily number of taxi trips
avg_num_trips_query = """
WITH trips_by_day AS
(
SELECT DATE(trip_start_timestamp) AS trip_date, COUNT(1) as num_trips
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_start_timestamp >= '2016-01-01' AND trip_start_timestamp < '2018-01-01'
GROUP BY trip_date
ORDER BY trip_date
)
SELECT trip_date
AVG(num_trips) OVER (
ORDER BY trip_date
ROWS BETWEEN 15 PRECEDING AND 15 FOLLOWING
) AS avg_num_trips
FROM trips_by_day
"""
Step 2 : Can you separate the order trips by community area?
# Amend the query below
trip_number_query = """
SELECT pickup_community_area,
trip_start_timestamp,
trip_end_timestamp,
RANK()
OVER (
PARTITION BY pickup_community_area
ORDER BY trip_start_timestamp
) AS trip_number
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE DATE(trip_start_timestamp) = '2017-05-01'
"""
Step 3 : How much time elapses between trips?
break_time_query = """
SELECT texi_id, trip_start_timestamp, trip_end_timestamp,
TIMESTAMP_DIFF(trip_start_timestamp, LAG(trip_end_timestamp, 1) OVER(
PARTITION BY texi_id
ORDER BY trip_start_timestamp
), MINUTE) AS prev_break
FROM `bigquer-public-data.chicago_taxi_trips.taxi_trips`
WHERE DATE(trip_start_timestamp) = '2017-05-01'
"""
Source of the course : Kaggle Course _ Analytic Functions
Analytic Functions
Explore and run machine learning code with Kaggle Notebooks | Using data from San Francisco Open Data
www.kaggle.com
'Course > [Kaggle] Data Science' 카테고리의 다른 글
[SQL] Writing Efficient Queries (0) | 2022.02.24 |
---|---|
[SQL] Nested and Repeated Data (0) | 2022.02.24 |
[SQL] JOINs and UNIONs (0) | 2022.02.24 |
[SQL] Joining data (0) | 2022.02.24 |
[SQL] AS & With (0) | 2022.02.24 |