1. Nested data
https://cookiethecat2020.blogspot.com/2020/06/array-struct.html
빅쿼리(BigQuery)의 Array(repeated)와 Struct(nested) 데이터 타입 완전 정복하기
빅쿼리(BigQuery)의 반복되는 (repeated) 배열(array)와 중첩되는(nested) 스트럭트(struct) 데이터 구조 이해하기 #bigquery #빅쿼리 #array #struct
cookiethecat2020.blogspot.com
Consider a hypothetical dataset containing information about pets and their toys. We could organize this information in two different tables(a pets and a toys table). The toys table could contains a "Pet_ID" column that could be used to match each toy to the pet that owns it.
Another option in BigQuery is to organize all the information in a single table, similar to the 'pets_and_toys' table below.
In this case, all of the information from the 'toys' table is collapsed into a single column (the "Toy" column in the 'pets_and_toys' table). We refer to the "Toy" column in the 'pets_and_toys' table as a nested column, and say that the "Name" and "Type" fields are nested inside of it.
Nested columns have type STRUCT (or type RECORD). This is reflected in the table schema below.
To query a column with nested data, we need to identify each field in the context of the column that contains it :
- Toy.Name refers to the "Name" field in the "Toy" column
- Toy.Type refers to the "Type" field in the "Toy" column
Otherwise, our unusual rules remain the same - we need not change anything else about our queries.
2. Repeated data
Now consider the (more realistic) case where each pet can have mutliple toys. In this case, to collapse this information into a single table, we need to leverage a different datatype.
We say that the "Toys" column contains repeated data, because it permits more than one value for each row. This is reflected in the table schema below, where the mode of the "Toys" column appears as 'REPEATED'.
Each entry in a repeated field is an ARRAY, or an ordered list of (zero or more) values with the same datatype. For instance, the entry in the "Toys" column for Moon the Dog is [Frisbee, Bone, Rope], which is an ARRAY with three values.
When querying repeated data, we need to put the name of the column containing the repeated data inside an UNNEST() function.
query = """
SELECT Name AS Pet_Name, Toy_Type
FROM `bigquery-public-data.pet_records.pets_and_toys_type`,
UNNEST(Toys) AS Toy_Type
"""
This is essential flatten the repeated data (which is then appended to the right side of the table) so that we have one element on each row.
query = """
SELECT *
FROM `bigquery-public-data.pet_records.pets_and_toys_type`,
UNNEST(Toys) AS Toy_Type
"""
3. Nested and repeated Data
Now, when if pets can have multiple toys, and we'd like to keep track of both the name and type of each toy? In this case, we make the "Toys" column both nested and repeated.
In the 'more_pets_and_toys' table above, "Name" and "Type" are both field contained within the "Toys" STRUCT, nad each entry in both "Toys.Name" and "Toys.Type" is an ARRAY.
query = """
SELECT Name AS Pet_Name,
t.Name AS Toy_Name,
t.Type AS Toy_Type
FROM `bigquery-public-data.pet_records.more_pets_and_toys`,
UNNSET(Toys) AS t
"""
Since the "Toys" column is repeated, we faltten it with the UNNSET() function. And, since we give the flattened column an alias of t, we can refer to the "Name" and "Type" fields in the "Toys" column as t.Name and t.Type, respectively.
NESTED + REPEATED data
- UNNEST는 데이터를 먼저 풀어 각각의 컬럼으로 풀어낸다
- 그냥 Nested된 데이터는 바로 RECORD type에서 dot 연산자를 통해 추출하면 됨
- 단 Nested + Repeated된 데이터는 RECORD type과 REPEATED type이 모두 존재하는 경우, UNNEST() 를 적용한 후 dot 연산자를 통해 추출한다
https://cookiethecat2020.blogspot.com/2020/06/bigquery-unnest-3.html
빅쿼리(BigQuery) UNNEST 함수 사용법 3가지
빅쿼리(BigQuery)의 중첩/반복 데이터를 flatten(평탄화)하는 unnest 함수 사용법 3가지 #bigquery #빅쿼리 #unnest함수
cookiethecat2020.blogspot.com
4. Exercise : Nested and Repeated Data
**Step 1 : Who had the most commits in 2016?
# Write a query to find the answer
max_commits_query = """
SELECT committer.name AS committer_name, COUNT(1) AS num_commits
FROM `bigquery-public-data.github_repos.sample_commits`
WHERE committer.date >= '2016-01-01' AND committer.date < '2017-01-01'
GROUP BY committer_name
ORDER BY num_commits DESC
"""
Step 2 : Look at languages!
# Fill in the blank
num_rows = 6
Step 3 : what's the most popular programming language?
# Write a query to find the answer
pop_lang_query = """
SELECT language.name AS language_name, COUNT(1) AS num_repos
FROM `bigquery-public-data.github_repos.languages`,
UNNEST(language) AS language
GROUP BY language_name
ORDER BY language_name DESC
"""
Step 4 : Which languages are used in the repository with the most languages?
# Your code here
all_langs_query = """
SELECT language.name AS name, SUM(language.bytes) AS bytes
FROM `bigquery-public-data.github_repos.languages`,
UNNEST(language) AS language
WHERE repo_name = 'polyrabbit/polyglot'
GROUP BY name
ORDER BY bytes DESC
"""
Source of the course : Kaggle Course _ Nested and Repeated Data
Nested and Repeated Data
Explore and run machine learning code with Kaggle Notebooks | Using data from Google Analytics Sample
www.kaggle.com
'Course > [Kaggle] Data Science' 카테고리의 다른 글
[DL] A Single Neuron (0) | 2022.02.28 |
---|---|
[SQL] Writing Efficient Queries (0) | 2022.02.24 |
[SQL] Analytic Functions (0) | 2022.02.24 |
[SQL] JOINs and UNIONs (0) | 2022.02.24 |
[SQL] Joining data (0) | 2022.02.24 |