Determining Gender from IMDb Person Information

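For the movie data I am analyzing this time, I wanted to add gender information for directors, producers, and writers. Searching around turned up this dataset: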

U.S. movies with gender-disambiguated actors, directors, and producers

For producers and directors that do not also have acting credits, we use indirect methods to assign a gender. If present, we parse the individual's biographical text for gender-specific pronouns (he/his/him/himself, or she/her/hers/herself). If the number of (male-) female-specific pronouns exceeds that of (female-) male-specific ones, we assume the individual is a (male) female. If the previous attempt is inconclusive, we use the Python package gender-guesser (version 0.4.0) to "guess" the gender based on the first name of the individual. The output of gender-guesser is one of "female", "mostly female", "androgynous", "unknown", "mostly male", or "male". We only assign a gender if the guess is either "male" or "female". If we still have not been able to assign a gender, we try to find a photograph of the individual. If all attempts fail, we mark the individual's gender as "undetermined".

The description of that dataset explains how it assigns a gender to each person, so I used it as a reference.

The data I worked with is the IMDb movies extensive dataset.

The dataset has a bio column with biography text, which is the detailed profile from each person's IMDb page.

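Count the gendered words that appear in this text:

male: 'he', 'his', 'him', 'himself', 'male', 'actor'
female: 'she', 'her', 'hers', 'herself', 'female', 'actress'

If the total count of female words is higher than the total count of male words, the person is classified as female; in the opposite case, as male.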
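The counting rule above can be sketched as a single function. This is a minimal sketch, not the referenced dataset's actual code: the name classify_bio and the choice to return None for a tie, a zero count, or a missing biography are my own assumptions.

```python
import re

MALE_WORDS = {"he", "his", "him", "himself", "male", "actor"}
FEMALE_WORDS = {"she", "her", "hers", "herself", "female", "actress"}


def classify_bio(bio):
    """Return 'female', 'male', or None when the counts are tied or both zero."""
    if not isinstance(bio, str):
        return None  # missing biography (for example NaN in a DataFrame)
    # Lowercase first, then split into alphabetic words so only whole words match
    tokens = re.findall(r"[a-z]+", bio.lower())
    male = sum(tok in MALE_WORDS for tok in tokens)
    female = sum(tok in FEMALE_WORDS for tok in tokens)
    if female > male:
        return "female"
    if male > female:
        return "male"
    return None


print(classify_bio("She began her career as an actress."))  # female
```

A None result marks the rows that still need another method.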

Looking up an individual record shows that the information is filled in properly.

# First, check that the counting works on a single row
import re

# Method 1: whole-word regex match (str.count would also count the 'he' inside 'the' or 'she')
male_pattern = r'\b(?:he|his|him|himself|male|actor)\b'
len(re.findall(male_pattern, df.loc[0, 'bio'], flags=re.IGNORECASE))

# Method 2: tokenize with the Natural Language Toolkit
import nltk  # Natural Language Toolkit

male_words = {'he', 'his', 'him', 'himself', 'male', 'actor'}
female_words = {'she', 'her', 'hers', 'herself', 'female', 'actress'}
tokens = [tok.lower() for tok in nltk.wordpunct_tokenize(df.loc[0, 'bio'])]  # lowercase so 'He' and 'he' both match

sum(tok in male_words for tok in tokens)
sum(tok in female_words for tok in tokens)

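Run it on an individual cell like this first to check that the result comes out right, then apply it to the whole df or column.

Capitalized forms such as 'He' need to be counted too, which makes the filtering a bit more accurate.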
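Besides capitalization, the way words are matched also affects the count. The small comparison below uses a made-up sentence to show how plain substring counting and whole-word, case-insensitive counting can disagree.

```python
import re

text = "The mother said he would help. He left."

# Substring counting: 'he' is also found inside "The", "mother" and "help",
# while the capitalized "He" is missed.
substring_hits = text.count("he")

# Whole-word, case-insensitive counting: only "he" and "He" match.
word_hits = len(re.findall(r"\bhe\b", text, flags=re.IGNORECASE))

print(substring_hits, word_hits)  # 4 2
```

Because 'he' is a substring of 'she' and 'her', substring counting can also add female words to the male total, so whole-word matching is the safer choice here.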

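If there is no biography, or the counts are both 0 or tied so that no decision can be made, use the gender-guesser package.

Split the first name off the full name and pass it in: the package returns male if the name is commonly given to men and female if it is commonly given to women. The drawback is that it cannot classify names it does not recognize, such as many non-English names.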

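!pip install gender-guesser

import gender_guesser.detector as gender
d = gender.Detector()

# Take the first space-separated token as the first name.
# name_column is a placeholder: replace it with the column that holds the person's name.
df['first_name'] = df['name_column'].str.split(' ').str[0]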

This creates a new column called first_name that holds each person's first name.

# One gender_guesser lookup per row, stored in a new get_gender column
df['get_gender'] = df['first_name'].map(d.get_gender)

Then gender_guesser is applied to that column.
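The two steps can be combined so that the name guess is used only when the biography count was inconclusive. The sketch below is one way to do that under my own assumptions: assign_gender and fake_guess are hypothetical names, and fake_guess is a stand-in so the example runs without the package installed.

```python
def assign_gender(bio_gender, first_name, guess_name):
    """Prefer the biography result; otherwise fall back to a first-name guess."""
    if bio_gender in ("male", "female"):
        return bio_gender
    if not isinstance(first_name, str) or not first_name:
        return "unknown"  # no usable first name to guess from
    return guess_name(first_name)


# Stand-in guesser for illustration only (not the real gender-guesser lookup)
def fake_guess(name):
    return {"Sofia": "female", "James": "male"}.get(name, "unknown")


print(assign_gender(None, "Sofia", fake_guess))    # female (from the name)
print(assign_gender("male", "Sofia", fake_guess))  # male (biography wins)
```

With the real package, passing d.get_gender as guess_name should behave the same way, since it also takes a first name and returns a label string.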

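Each name gets one of these labels: male, female, mostly_male, mostly_female, andy, or unknown. andy means androgynous, and unknown means the name could not be identified, for example because it is a non-English name.

I treated mostly_male and andy as male, and mostly_female as female.

And the unknown rows were dropped......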
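That relabeling can be written as a lookup table. A minimal sketch in plain Python follows; FINAL_LABEL is a name I made up, and the sample guesses list is illustrative rather than real output.

```python
# Collapse gender-guesser labels into two classes, following the choices above.
# "unknown" is deliberately absent so those entries get dropped.
FINAL_LABEL = {
    "male": "male",
    "mostly_male": "male",
    "andy": "male",
    "female": "female",
    "mostly_female": "female",
}

guesses = ["male", "andy", "mostly_female", "unknown", "female"]
final = [FINAL_LABEL[g] for g in guesses if g in FINAL_LABEL]

print(final)  # ['male', 'male', 'female', 'female']
```

On a DataFrame, the same dictionary could be passed to Series.map, which leaves unmapped labels as NaN so they can be removed with dropna. Note that treating andy as male is a judgment call that tilts ambiguous names toward one class.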

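You are smart readers, so you have probably done this already, but

when preprocessing a dataset, it is essential to work out first whether the dataset is worth the grunt work at all.

For example, if the distribution of data across years or the overall dataset size matters,

check that the dataset meets those criteria before you start working on it.

Otherwise you pour precious hours into it, only to find that the year distribution or the dataset size falls short and the dataset you spent N hours on is useless... ^.ㅠ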
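A quick check like this can be run before any heavy preprocessing. The sketch below is only an illustration: check_dataset and its thresholds min_rows and min_per_year are hypothetical, and the years list is made-up sample data.

```python
from collections import Counter


def check_dataset(years, min_rows, min_per_year):
    """Report whether there are enough rows and which years are too sparse."""
    per_year = Counter(years)
    sparse = sorted(y for y, n in per_year.items() if n < min_per_year)
    return {"enough_rows": len(years) >= min_rows, "sparse_years": sparse}


years = [2018, 2018, 2018, 2019, 2019, 2020]
report = check_dataset(years, min_rows=5, min_per_year=2)

print(report)  # {'enough_rows': True, 'sparse_years': [2020]}
```

If the report already shows too few rows or too many sparse years, that is a sign to pick another dataset before spending hours on labeling.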

License: Attribution, NonCommercial, NoDerivatives
