Titanic EDA (hist, scatter, .value_counts(), kdeplot(), stripplot(data=, x=, y=, hue=), violinplot())

하루조각 2023. 3. 19. 19:30

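import pandas as pd
import seaborn as sns  # needed for the kdeplot, stripplot, and violinplot calls
titanic = pd.read_csv('data/titanic.csv')
titanic.info()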

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.6+ KB

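Statement 1: "Most Titanic passengers were in their 30s and 40s."

Histogram

# Age distribution
titanic.plot(kind='hist', y='Age', bins=30)

The bars are tallest in the 20s, ahead of both the 30s and the 40s, so this statement is false.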
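
Reading bar heights off a 30-bin histogram is approximate, so it can help to count passengers per decade directly. The following is a minimal sketch of that idea. The `ages` values are a hypothetical toy sample standing in for `titanic['Age']`, not real passenger data.

```python
import pandas as pd

# Hypothetical toy sample standing in for titanic['Age']; not real passenger data.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, None, 54.0, 2.0, 27.0, 14.0, 4.0, 58.0, 20.0])

# Drop missing ages, then floor each age to its decade (22 -> 20, 38 -> 30).
decades = (ages.dropna() // 10 * 10).astype(int)

# Count passengers per decade, ordered by decade rather than by frequency.
decade_counts = decades.value_counts().sort_index()
print(decade_counts)

# The decade with the most passengers in this toy sample.
print(decade_counts.idxmax())
```

With the real data, replacing `ages` with `titanic['Age']` should give the per-decade counts behind the histogram.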

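Statement 2: "The person who paid the highest fare was in their 30s."

Scatter plot of age against fare

# Fare by age
titanic.plot(kind='scatter', x='Age', y='Fare')

The highest point on the fare axis sits in the 30s, so this statement is true.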
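
The same answer can be looked up without a plot by filtering for the maximum fare. This is a sketch under assumptions: `sample` is a hypothetical toy frame with invented values that only borrows the `Age` and `Fare` column names from the dataset.

```python
import pandas as pd

# Hypothetical toy rows with invented values; only the column names match the dataset.
sample = pd.DataFrame({
    "Age": [22.0, 41.0, 33.0, None, 54.0],
    "Fare": [10.0, 80.0, 300.0, 15.0, 60.0],
})

# Find the maximum fare, then keep every row that paid it (ties are all returned).
top_fare = sample["Fare"].max()
top_payers = sample.loc[sample["Fare"] == top_fare, ["Age", "Fare"]]
print(top_payers)
```

Filtering on equality with the maximum, rather than using `idxmax()`, keeps all passengers who share the top fare instead of only the first one.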

Statement 3: "There are more survivors than non-survivors."

Survival is recorded in the 'Survived' column.

titanic['Survived'].value_counts()

    0    549
    1    342
    Name: Survived, dtype: int64

0 means the passenger died and 1 means the passenger survived. Deaths (549) outnumber survivors (342), so this statement is false.
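
To see the split as proportions rather than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch follows, using a hypothetical toy `survived` series in place of `titanic['Survived']`.

```python
import pandas as pd

# Hypothetical toy series standing in for titanic['Survived'] (0 = died, 1 = survived).
survived = pd.Series([0, 1, 1, 0, 0, 0, 1, 0], name="Survived")

# normalize=True returns each value's share of the total instead of its count.
shares = survived.value_counts(normalize=True)
print(shares)
```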

Statement 4: "Of first, second, and third class, third class carried the most passengers."

titanic['Pclass'].value_counts()

    3    491
    1    216
    2    184
    Name: Pclass, dtype: int64

Third class has the most passengers, with 491, so this statement is true.

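Statement 5: "The cabin class with the highest survival rate is first class."

# Survival by cabin class
titanic.plot(kind='scatter', x='Pclass', y='Survived')

Both columns take only a few discrete values, so the scatter plot collapses into six overlapping points and hides how many passengers sit on each one. In a case like this, a KDE plot shows how heavily the points overlap.

# Survival by cabin class (recent seaborn versions require keyword arguments here)
sns.kdeplot(data=titanic, x='Pclass', y='Survived')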
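
A density plot shows where passengers concentrate, but the survival rate per class can also be computed directly. Because `Survived` is coded 0/1, its mean within a group equals that group's survival rate. The sketch below assumes a hypothetical toy `sample` frame rather than the real data.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the dataset.
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = fraction of that group that survived.
rate_by_class = sample.groupby("Pclass")["Survived"].mean()
print(rate_by_class)

# Class with the highest survival rate in this toy sample.
print(rate_by_class.idxmax())
```

Running the same `groupby` on `titanic` should let Statement 5 be checked from numbers instead of from contour shapes.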

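Statement 6: "The younger the passenger, the higher the survival rate."

Categorical plot of age grouped by survival

# Age distribution by survival status
sns.stripplot(data=titanic, x="Survived", y="Age")

If the strip plot is hard to read, use a violin plot instead.

# Age distribution by survival status
sns.violinplot(data=titanic, x="Survived", y="Age")

The age distributions of survivors and non-survivors do not differ much, so it is hard to say that younger passengers had a higher survival rate.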
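
Summary statistics per group are one way to back up what the violin plot suggests visually. This is a rough sketch using a hypothetical toy `sample` frame. The choice of count, median, and mean is an assumption about what is useful to compare, not something the original analysis used.

```python
import pandas as pd

# Hypothetical toy rows; Age contains a missing value, as the real column does.
sample = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1, 0],
    "Age": [22.0, 40.0, None, 28.0, 4.0, 35.0, 30.0],
})

# Per-group summaries; pandas skips missing ages in count, median, and mean.
age_summary = sample.groupby("Survived")["Age"].agg(["count", "median", "mean"])
print(age_summary)
```

If the medians and means of the two groups are close on the real data, that would agree with the reading of the violin plot above.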

Statement 7: "Sex has a greater effect on survival than age does."

# Age and sex distribution by survival status
sns.stripplot(data=titanic, x="Survived", y="Age", hue="Sex")
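
One rough way to compare the two factors numerically is to put survival rates by sex next to survival rates by age group. The sketch below is illustrative only: `sample` is a hypothetical toy frame, and the `IsChild` column with its cutoff of 18 is an assumption introduced here, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the Sex, Age, and Survived column names match the dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Age": [22.0, 38.0, 8.0, 35.0, 5.0, 27.0, 54.0, 14.0],
    "Survived": [0, 1, 1, 0, 1, 1, 0, 0],
})

# Survival rate per sex (mean of the 0/1 Survived column).
rate_by_sex = sample.groupby("Sex")["Survived"].mean()

# Survival rate per age group; the cutoff of 18 is an arbitrary assumption.
sample["IsChild"] = sample["Age"] < 18
rate_by_child = sample.groupby("IsChild")["Survived"].mean()

print(rate_by_sex)
print(rate_by_child)
```

A larger gap between the two sexes than between the two age groups would support Statement 7. The toy numbers here say nothing about the real passengers.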
