
What is EDA?

Data science is a deep field... what am I supposed to do... ㅠㅠ

This newbie is crying.

https://data-newbie.tistory.com/842

 

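Paper) A review of a paper on data split strategies for recommendation algorithms

Paper title: A Critical Study on Data Leakage in Recommender System Offline Evaluation. Data split strategies in recommender systems seem somewhat tricky, so I want to review a specific paper. In conclusion,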

data-newbie.tistory.com

https://www.data-science-factory.com/post/exploratory-data-analysis-guideline

 

Exploratory data analysis guideline

Exploratory data analysis is a core tool in solving any Data Science problem. It is the process that aims to detect insights inside the data and to conduct a complete investigation of the dataset. Not depending on your activity - professional software deve

www.data-science-factory.com

Summary of the article above:

EDA is the process that aims to detect insights inside the data and to conduct a complete investigation of the dataset. The success of your future model depends heavily on how well this analysis is performed.

 

- to get a deep understanding of the data that you are working on

- your EDA notebook should be clear for any other person (I mean a Data Science engineer, not a random person) who will take your notebook for the review.

 

General data understanding:

1. understand the physical meaning of each column

 

2. understand the type of each variable: categorical, binary, or continuous

 

3. remove redundant data: this can include various IDs, URLs, fragments of metadata (remember that not all metadata is useless!), and other columns.

 

4. prevent data leakage:

This is a critical point, because some features can carry information that will not be available in production.

https://dacon.io/en/forum/403895

 

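My personal notes on data leakage

Parking Demand Prediction AI Competition

dacon.io

"A problem that appears when information that could not have been known is reflected in the prediction"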

 

5. define a strategy for handling missing data: choose whatever fits your case
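
A rough sketch of how points 2 and 5 above could be checked with pandas. The helper `summarize_columns`, the `max_categories` cutoff, and the toy columns are all made up for illustration and are not from the original article; the typing rule is only a heuristic.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame, max_categories: int = 10) -> pd.DataFrame:
    """Return one row per column: a rough variable type and the share of missing values."""
    rows = []
    for name in df.columns:
        col = df[name]
        n_unique = col.nunique(dropna=True)
        if n_unique == 2:
            kind = "binary"
        elif pd.api.types.is_numeric_dtype(col) and n_unique > max_categories:
            kind = "continuous"
        else:
            # Heuristic assumption: everything else is treated as categorical.
            kind = "categorical"
        rows.append(
            {
                "column": name,
                "kind": kind,
                "n_unique": n_unique,
                "missing_ratio": col.isna().mean(),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "age": [23.0, 35.0, np.nan, 41.0, 29.0, 52.0],
        "is_member": [0, 1, 1, 0, 1, 0],
        "city": ["Seoul", "Busan", "Seoul", None, "Daegu", "Seoul"],
    }
)

summary = summarize_columns(df, max_categories=3)
print(summary)
```

The `missing_ratio` column is a starting point for deciding, per column, whether to drop rows, impute, or keep missingness as its own signal.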

 

Target understanding:

1. Check the class distribution, or the value distribution in the case of regression

2. Find outliers in the data: dropping a few outlier samples can improve the score

3. Correlation analysis: estimate the impact of each variable on the target. This helps reduce the number of features, but be careful not to lose good ones.
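
The three target checks above might look like this on a regression target. This is a minimal sketch on synthetic data; the 1.5 × IQR rule is just one common outlier convention that I picked here, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for this example: the target depends on x1 only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["target"] = 3.0 * df["x1"] + rng.normal(scale=0.5, size=n)
df.loc[0, "target"] = 100.0  # inject one obvious outlier

# 1. Value distribution (for classification, use value_counts(normalize=True) instead).
print(df["target"].describe())

# 2. Outliers by the 1.5 * IQR rule (an assumed convention).
q1, q3 = df["target"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["target"] < q1 - 1.5 * iqr) | (df["target"] > q3 + 1.5 * iqr)
clean = df.loc[~is_outlier]
print("outliers found:", int(is_outlier.sum()))

# 3. Correlation of each feature with the target, strongest first.
target_corr = (
    clean.drop(columns="target")
    .corrwith(clean["target"])
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here is not enough on its own to justify dropping a feature.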

 

Features understanding:

1. Feature engineering

A set of actions such as:

  • removing not useful features
  • building new features based on the combinations of existing ones
  • scaling existing ones
  • encoding
  • and so on
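
A small sketch of the actions listed above, using pandas only. The columns (`width`, `height`, `color`, `row_id`) and the derived `area` feature are invented for the example.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "width": [2.0, 4.0, 6.0, 8.0],
        "height": [1.0, 3.0, 5.0, 7.0],
        "color": ["red", "blue", "red", "green"],
        "row_id": [101, 102, 103, 104],
    }
)

# Removing a not-useful feature: an identifier carries no signal.
feats = df.drop(columns=["row_id"])

# Building a new feature from a combination of existing ones.
feats["area"] = feats["width"] * feats["height"]

# Scaling: z-score standardization.
# In a real project, compute mean/std on the training split only to avoid leakage.
num_cols = ["width", "height", "area"]
feats[num_cols] = (feats[num_cols] - feats[num_cols].mean()) / feats[num_cols].std(ddof=0)

# Encoding: one-hot encode the categorical column.
feats = pd.get_dummies(feats, columns=["color"], dtype=int)

print(feats)
```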

2. Cross-correlation analysis

To understand the dependencies between non-target features, build at least a correlation matrix for all features. Sometimes it reveals that two variables are strongly correlated with each other, so one of them can be safely removed.
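
A minimal sketch of that idea: build the correlation matrix, look only at the upper triangle so each pair is counted once, and list the second member of every highly correlated pair as a removal candidate. The 0.95 threshold and the toy columns are my own assumptions.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example: height_in is just height_cm in other units.
rng = np.random.default_rng(42)
n = 300
height_cm = rng.normal(170, 10, size=n)
features = pd.DataFrame(
    {
        "height_cm": height_cm,
        "height_in": height_cm / 2.54,
        "weight_kg": rng.normal(70, 12, size=n),
    }
)

# Absolute correlation matrix for all features.
corr = features.corr().abs()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the (assumed) threshold; NaN entries never pass the comparison.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.95]
print(high_pairs)

# Candidate columns to drop: the second feature of each highly correlated pair.
to_drop = sorted({second for _, second in high_pairs.index})
print(to_drop)
```

Which member of a pair to drop is a judgment call; here the choice is purely positional.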

 

3. Experimenting with preprocessing 

Different tasks can require different preprocessing for the same features, so select the approach most suitable for the problem.

 

4. Visualization: a core part of EDA. You should never skip it.

  • Is it a noisy feature?
  • What is the distribution?
  • What are the ranges of its values?
  • How does it behave compared to other features?

Small idea: add them in a smart way!

 

5. Insights search

 

6. Baseline modeling 

It can address several issues:

  • it sheds light on the relative scores
  • it provides an understanding of the requirements for model inputs and outputs
  • it is a perfect way to make an assumption about the model's complexity
  • Don't try to build state-of-the-art models here; you don't need that in your EDA.
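
As a rough illustration of the "relative scores" point, a baseline can be as simple as comparing a majority-class predictor against a plain linear model. This sketch uses scikit-learn on synthetic data; the choice of `DummyClassifier` and `LogisticRegression` is mine, not the article's.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable data, invented for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test split before fitting anything, so no test information leaks in.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Floor: always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Simple baseline: default logistic regression, no tuning.
simple = LogisticRegression().fit(X_train, y_train)

dummy_acc = accuracy_score(y_test, dummy.predict(X_test))
simple_acc = accuracy_score(y_test, simple.predict(X_test))
print(f"majority-class accuracy: {dummy_acc:.3f}")
print(f"logistic regression accuracy: {simple_acc:.3f}")
```

The gap between the two numbers gives a first feel for how much signal the features carry and what any later, more complex model has to beat.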

 
