What is EDA?

Data science is a deep field... what am I going to do... T_T

This newbie is crying.

https://data-newbie.tistory.com/842

Paper) Review of a paper on data split strategies for recommendation algorithms

Paper title: A Critical Study on Data Leakage in Recommender System Offline Evaluation. Data split strategies in recommender systems seem somewhat tricky, so I want to review a specific paper. In conclusion,

data-newbie.tistory.com

https://www.data-science-factory.com/post/exploratory-data-analysis-guideline

Exploratory data analysis guideline

Exploratory data analysis is a core tool in solving any Data Science problem. It is the process that aims to detect insights inside the data and to conduct a complete investigation of the dataset. Not depending on your activity - professional software deve

www.data-science-factory.com

Summary of the article above:

EDA is the process of finding insights in the data and investigating the dataset thoroughly. How well your future model turns out depends heavily on how well this analysis is done.

- to get a deep understanding of the data that you are working on

- your EDA notebook should be clear for any other person (I mean a Data Science engineer, not a random person) who will take your notebook for the review.

General data understanding:

1. understand the physical meaning of each column

2. understand the type of each variable: categorical, binary, or continuous

3. remove redundant data: different ids, URLs, fragments of metadata (remember that not all metadata is useless!), and other such columns

4. prevent data leakage

This is a critical point, because some features can carry information that will not be available in production.

https://dacon.io/en/forum/403895

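My personal notes on data leakage

Parking Demand Prediction AI Competition

dacon.io

"A problem that appears when information that could not have been known is reflected in the prediction"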

5. define a strategy for handling missing data; pick whatever suits your case
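The checklist above can be sketched in pandas. This is only a rough sketch under my own assumptions: the toy columns, the helper names `summarize_columns` and `fill_missing_from_train`, and the "exactly two distinct values means binary" heuristic are all made up for illustration and are not from the article. The imputation helper computes its fill value on the training split only, which is one simple way to keep test information from leaking into preprocessing.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: rough kind, unique count, id-like flag, missing ratio."""
    rows = []
    for name in df.columns:
        col = df[name]
        n_unique = col.nunique(dropna=True)
        if n_unique == 2:
            kind = "binary"
        elif pd.api.types.is_numeric_dtype(col):
            kind = "continuous"
        else:
            kind = "categorical"
        rows.append(
            {
                "column": name,
                "kind": kind,
                "n_unique": n_unique,
                # A non-numeric column whose non-missing values are all distinct
                # often behaves like an id and is a candidate for removal.
                "id_like": bool(kind == "categorical" and n_unique == col.notna().sum()),
                "missing_ratio": float(col.isna().mean()),
            }
        )
    return pd.DataFrame(rows).set_index("column")


def fill_missing_from_train(train: pd.DataFrame, test: pd.DataFrame, column: str):
    """Fill missing values in both splits with the median computed on train only."""
    fill_value = train[column].median()
    return train[column].fillna(fill_value), test[column].fillna(fill_value)


# Hypothetical toy data, not from the article.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4", "u5"],
        "age": [23.0, None, 41.0, 35.0, None],
        "is_member": [1, 0, 1, 1, 0],
        "city": ["Seoul", "Busan", "Seoul", None, "Daegu"],
    }
)

summary = summarize_columns(df)
print(summary)

# Split first, then impute using statistics from the training rows only.
train, test = df.iloc[:3], df.iloc[3:]
train_age, test_age = fill_missing_from_train(train, test, "age")
print(test_age.tolist())
```

The median here is just one possible strategy for missing data. Dropping rows, using a constant, or adding a "was missing" indicator column are equally valid depending on the problem.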

Target Understanding:

1. Check the class distribution, or the value distribution in the case of regression

2. Find outliers in the data; dropping a few outlier samples can improve the score

3. Correlation analysis: estimate the impact of each variable on the target. This can reduce the number of features, but be careful not to lose good ones.
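The three target checks could look roughly like this for a regression target. Everything in the sketch is assumed for illustration: the synthetic data, the column names, and the choice of the 1.5 × IQR rule for outliers (the article does not name a specific method). For a classification target, `df["target"].value_counts(normalize=True)` would replace the `describe()` call.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical synthetic data: one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x_signal": rng.normal(size=n), "x_noise": rng.normal(size=n)})
df["target"] = 3.0 * df["x_signal"] + rng.normal(scale=0.5, size=n)
df.loc[0, "target"] = 100.0  # planted outlier

# 1. Value distribution of the target.
print(df["target"].describe())

# 2. Outliers in the target.
mask = iqr_outlier_mask(df["target"])
clean = df[~mask]
print("dropped rows:", int(mask.sum()))

# 3. Absolute correlation of each feature with the target, strongest first.
target_corr = (
    clean.drop(columns="target")
    .corrwith(clean["target"])
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here is a hint rather than proof that a feature is useless.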

Features Understanding:

1. Feature engineering

A set of actions such as:

removing features that are not useful
building new features from combinations of existing ones
scaling existing ones
encoding
and so on
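Each of the actions listed above maps to roughly one line of pandas. The table and column names below are hypothetical, and standardization and one-hot encoding are just the variants I picked for the sketch; the article does not prescribe specific ones.

```python
import pandas as pd

# Hypothetical toy data, not from the article.
df = pd.DataFrame(
    {
        "row_id": [101, 102, 103, 104],
        "width": [2.0, 4.0, 6.0, 8.0],
        "height": [1.0, 3.0, 5.0, 7.0],
        "color": ["red", "blue", "red", "green"],
    }
)

# Remove a feature that is not useful (an id carries no signal).
features = df.drop(columns=["row_id"])

# Build a new feature from a combination of existing ones.
features["area"] = features["width"] * features["height"]

# Scale numeric features to zero mean and unit standard deviation.
# In a real project the mean and std should come from the training split only.
numeric_cols = ["width", "height", "area"]
features[numeric_cols] = (
    features[numeric_cols] - features[numeric_cols].mean()
) / features[numeric_cols].std()

# Encode the categorical feature as one-hot columns.
features = pd.get_dummies(features, columns=["color"], dtype=int)

print(features)
```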

2. Cross-correlation analysis

To understand the dependencies between non-target features, build at least a correlation matrix for all features. Sometimes it reveals that two variables are strongly correlated with each other, so one of them can safely be removed.
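A minimal sketch of that idea: build the correlation matrix and list the feature pairs above a threshold. The helper name `strongly_correlated_pairs`, the 0.9 threshold, and the synthetic columns are my own assumptions.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, |corr|) for each pair above the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            value = float(corr.loc[a, b])
            if value > threshold:
                pairs.append((a, b, value))
    return sorted(pairs, key=lambda item: item[2], reverse=True)


# Hypothetical features: "b" is almost a rescaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
features = pd.DataFrame(
    {
        "a": a,
        "b": 2.0 * a + rng.normal(scale=0.05, size=300),
        "c": rng.normal(size=300),
    }
)

print(features.corr().round(2))
pairs = strongly_correlated_pairs(features)
print(pairs)
```

Which feature of a flagged pair to drop is still a judgment call; the one with the weaker relationship to the target is a common choice.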

3. Experimenting with preprocessing

Different tasks can require different processing for the same features, so try several options and select the one most suitable for the problem.
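One small way to experiment is to apply a few candidate transforms to the same feature and compare a summary statistic. The sketch below uses skewness as the criterion on a hypothetical skewed `income` feature; both the feature and the criterion are assumptions for illustration, and the lowest skew is not automatically the most suitable choice for a given model.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="income")

# Candidate preprocessing options for the same feature.
candidates = {
    "raw": income,
    "sqrt": np.sqrt(income),
    "log1p": np.log1p(income),
}

# Compare how asymmetric each version is (0 means symmetric).
skewness = {name: float(values.skew()) for name, values in candidates.items()}
for name, value in skewness.items():
    print(f"{name:>6}: skew = {value:.2f}")
```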

4. Visualization: a core part of EDA. You should never skip it.

Is it a noisy feature?
What is its distribution?
What is the range of its values?
How does it behave compared to other features?

Small idea: add them in a smart way!
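The questions above map fairly directly onto three basic plots: a histogram for the distribution, a box plot for the range and outliers, and a scatter plot against another feature. This is a sketch assuming matplotlib is available; the data and column names are hypothetical. The figure is rendered into an in-memory buffer so the snippet runs without a display; passing a file path to `savefig` instead would write an image file.

```python
import io

import matplotlib

matplotlib.use("Agg")  # off-screen backend, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: a skewed feature and a second feature that depends on it.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.exponential(scale=2.0, size=500)})
df["other"] = 0.5 * df["feature"] + rng.normal(scale=0.5, size=500)

fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))

# What is the distribution?
ax_hist.hist(df["feature"], bins=30)
ax_hist.set_title("Distribution")

# What is the range of its values, and are there outliers?
ax_box.boxplot(df["feature"])
ax_box.set_title("Range and outliers")

# How does it behave compared to another feature?
ax_scatter.scatter(df["feature"], df["other"], s=8)
ax_scatter.set_title("Against another feature")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```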

5. Insights search

6. Baseline modeling

It can help with several things:

shed light on the relative scores
provide an understanding of the requirements for model inputs and outputs
give a good basis for an assumption about the model's complexity
Don't try to build state-of-the-art models here; you don't need that in your EDA.
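To make "relative scores" concrete, a baseline can be as small as comparing a predict-the-mean model with a plain linear fit. The sketch below uses only NumPy on synthetic data; the data, the split, and the choice of RMSE are assumptions for illustration, not something the article specifies.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical regression data: two informative features and one useless one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.3, size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Baseline 0: always predict the training mean.
mean_pred = np.full_like(y_test, y_train.mean())

# Baseline 1: ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
A_test = np.column_stack([np.ones(len(X_test)), X_test])
linear_pred = A_test @ coef

scores = {"mean": rmse(y_test, mean_pred), "linear": rmse(y_test, linear_pred)}
print(scores)
```

Any later, more complex model then has two reference points: it should clearly beat the mean predictor, and the gap to the linear fit hints at how much extra complexity is actually paying off.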


FUNGOD12
