Data science is a deep field... what am I going to do.. ㅠㅠ
This newbie is crying.
https://data-newbie.tistory.com/842
https://www.data-science-factory.com/post/exploratory-data-analysis-guideline
Summary of the article above:
Exploratory data analysis (EDA) is the process of finding insights in the data and investigating the dataset thoroughly. The success of the model you build later depends heavily on how well this analysis is done.
- the goal is to get a deep understanding of the data you are working with
- your EDA notebook should be clear to anyone else (meaning a data science engineer, not a random person) who picks it up for review.
General data understanding :
1. understand the physical meaning of each column
2. understand the type of each variable: categorical, binary, or continuous
3. remove redundant data: this can be various ids, URLs, fragments of metadata (remember that not all metadata is useless!), and other such columns.
4. prevent data leakage:
This is a critical point, because some features can carry information that will not be available in production.
https://dacon.io/en/forum/403895
"A problem that appears when information that cannot be known is reflected in the prediction"
5. define a strategy for handling missing data: choose one yourself, as you see fit
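To make points 2, 3, and 5 a bit more concrete, here is a minimal sketch using pandas. The toy dataset, its column names, and the `summarize_columns` helper are all made up for illustration, and the type and id heuristics are rough assumptions, not rules from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are invented for this example.
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105, 106],
    "age": [23.0, 35.0, np.nan, 41.0, 29.0, 52.0],
    "plan": ["free", "pro", "free", "team", "pro", None],
    "is_active": [1, 0, 1, 1, 0, 1],
})

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: rough variable type, cardinality, missing share."""
    rows = []
    for name in frame.columns:
        col = frame[name]
        n_unique = col.nunique(dropna=True)
        # Rough heuristic only: an integer column with a few distinct
        # values may really be categorical, so always check by hand.
        if n_unique == 2:
            kind = "binary"
        elif pd.api.types.is_numeric_dtype(col):
            kind = "continuous"
        else:
            kind = "categorical"
        # Every non-missing value distinct (and not a float measurement):
        # possibly an id or URL column that carries no signal.
        looks_like_id = (
            n_unique == col.notna().sum()
            and not pd.api.types.is_float_dtype(col)
        )
        rows.append({
            "column": name,
            "kind": kind,
            "n_unique": n_unique,
            "missing_ratio": float(col.isna().mean()),
            "id_like": bool(looks_like_id),
        })
    return pd.DataFrame(rows).set_index("column")

summary = summarize_columns(df)
print(summary)
```

The `missing_ratio` column is only a starting point for the missing-data strategy: whether to drop, impute, or flag the gaps still depends on the problem.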
Target Understanding :
1. Check the class distribution, or the distribution of target values in the case of regression
2. Find outliers in the data: dropping a few outlier samples can improve the score
3. Correlation analysis: estimate the impact of each variable on the target. This can help reduce the number of features, but be careful not to lose good ones.
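The three checks above could look roughly like this for a regression target. The data is synthetic, the `iqr_outlier_mask` helper is a hypothetical name, and the 1.5 × IQR rule is just one common convention that I am assuming here, not something the article prescribes. For a classification target, `value_counts(normalize=True)` would play the role of `describe()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic regression data: the target depends on x1 but not on x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
target = 3.0 * x1 + rng.normal(scale=0.5, size=n)
target[:3] = [40.0, -35.0, 50.0]  # inject three obvious outliers
df = pd.DataFrame({"x1": x1, "x2": x2, "target": target})

# 1. Distribution of target values (regression case).
print(df["target"].describe())

# 2. Outliers: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

mask = iqr_outlier_mask(df["target"])
clean = df[~mask]
print("flagged outliers:", int(mask.sum()))

# 3. Absolute Pearson correlation of each feature with the target.
corr_with_target = (
    clean.drop(columns="target")
    .corrwith(clean["target"])
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

A low linear correlation does not prove a feature is useless (the relationship may be non-linear), which is the reason for the "be careful" warning in point 3.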
Features Understanding :
1. Feature engineering
A set of actions such as:
- removing features that are not useful
- building new features based on the combinations of existing ones
- scaling existing ones
- encoding
- and so on
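A small sketch of the four actions listed above, in the same order, using pandas only. The dataset and column names are invented, and the specific choices (dropping constant columns, a product feature, standardization, one-hot encoding) are my own assumptions picked as typical examples.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "width": [2.0, 4.0, 6.0, 8.0],
    "height": [1.0, 3.0, 5.0, 7.0],
    "color": ["red", "blue", "red", "green"],
    "source": ["web", "web", "web", "web"],  # constant, carries no signal
})

# Removing: drop columns that hold a single value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
features = df.drop(columns=constant_cols)

# Building: combine existing features into a new one.
features["area"] = features["width"] * features["height"]

# Scaling: standardize numeric columns to zero mean and unit variance.
numeric_cols = ["width", "height", "area"]
means = features[numeric_cols].mean()
stds = features[numeric_cols].std(ddof=0)
features[numeric_cols] = (features[numeric_cols] - means) / stds

# Encoding: one-hot encode the categorical column.
features = pd.get_dummies(features, columns=["color"], dtype=int)

print(features)
```

In a real project the scaling statistics should be computed on the training split only, which ties back to the data leakage point earlier.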
2. Cross-correlation analysis
To understand the dependency between non-target features, build at least a correlation matrix for all features. Sometimes it reveals that two variables are strongly correlated with each other, so one of them can easily be removed.
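As a rough sketch of that idea, the snippet below builds the correlation matrix with `DataFrame.corr()` and lists the feature pairs above a threshold. The synthetic data, the `correlated_pairs` helper, and the 0.9 threshold are all assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# Synthetic features: "a_scaled" is almost a copy of "a"; "b" is independent.
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + rng.normal(scale=0.05, size=n),
    "b": rng.normal(size=n),
})

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return (feature, feature, |corr|) for pairs above the threshold."""
    corr = frame.corr().abs()  # Pearson correlation matrix
    return [
        (x, y, float(corr.loc[x, y]))
        for x, y in combinations(corr.columns, 2)
        if corr.loc[x, y] > threshold
    ]

print(df.corr().round(3))
pairs = correlated_pairs(df)
print(pairs)
```

Which feature of a flagged pair to drop is a judgment call; keeping the one that is easier to interpret or cheaper to compute is a reasonable default.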
3. Experimenting with preprocessing
Different tasks can require different preprocessing for the same features, so select the one most suitable for the problem.
4. Visualization: the core part of EDA. You should never skip it.
- Is it a noisy feature?
- what is the distribution?
- what are the ranges for its values?
- how does it behave compared to other features?
Small tip: add visualizations in a smart way!
5. Insights search
6. Baseline modeling
It can address several issues:
- it sheds light on the relative scores
- it provides an understanding of the requirements for model inputs and outputs.
- it is a perfect way to make an assumption about the model's complexity.
- Don't try to build state-of-the-art models here; you don't need that in your EDA.
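To illustrate what "relative scores" can mean, here is a minimal NumPy sketch that compares two deliberately simple baselines on synthetic data. The majority-class predictor and the nearest-centroid classifier are stand-ins I chose for illustration; the article does not name specific baseline models.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400

# Synthetic binary classification data: roughly 30% of samples are class 1,
# and class 1 is shifted by +2 in both features.
y = (rng.random(n) < 0.3).astype(int)
X = rng.normal(size=(n, 2)) + y[:, None] * 2.0

# Simple holdout split: 300 training samples, 100 test samples.
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]

# Baseline 1: always predict the most frequent training class.
majority = int(np.bincount(y[train]).argmax())
acc_majority = float(np.mean(y[test] == majority))

# Baseline 2: nearest centroid, i.e. predict the class whose training
# mean is closest to the sample in Euclidean distance.
centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
acc_centroid = float(np.mean(pred == y[test]))

print(f"majority-class accuracy:   {acc_majority:.2f}")
print(f"nearest-centroid accuracy: {acc_centroid:.2f}")
```

Any later model should be judged against numbers like these: a score close to the majority-class accuracy suggests the model learned little beyond the class imbalance.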