Survey Response Analysis 3

Q6. Are there any blogs, podcasts, courses, or other resources you would recommend?

mcq['BlogsPodcastsNewslettersSelect'] = mcq[
 'BlogsPodcastsNewslettersSelect'
].astype('str').apply(lambda x: x.split(','))
mcq['BlogsPodcastsNewslettersSelect'].head()

0 [Becoming a Data Scientist Podcast, Data Machi...
1 [Becoming a Data Scientist Podcast, Siraj Rava...
2 [FastML Blog, No Free Hunch Blog, Talking Mach...
3 [KDnuggets Blog]
4 [Data Machina Newsletter, Jack's Import AI New...
Name: BlogsPodcastsNewslettersSelect, dtype: object

s = mcq.apply(lambda x: pd.Series(x['BlogsPodcastsNewslettersSelect']),
 axis=1).stack().reset_index(level=1, drop=True)
s.name = 'platforms'
s.head()

0 Becoming a Data Scientist Podcast
0 Data Machina Newsletter
0 O'Reilly Data Newsletter
0 Partially Derivative Podcast
0 R Bloggers Blog Aggregator
Name: platforms, dtype: object

s = s[s != 'nan'].value_counts().head(20)
plt.figure(figsize=(6,8))
plt.title("Most Popular Blogs and Podcasts")
sns.barplot(y=s.index, x=s)

png

KDnuggets Blog, R Bloggers Blog Aggregator, and the O'Reilly Data Newsletter received the most votes as useful resources.
The Becoming a Data Scientist podcast also seems to be well known.
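The apply/stack pattern used above builds a full `Series` per row. As a minimal sketch, assuming pandas 0.25 or newer and a made-up toy DataFrame in the same multi-select format, `Series.explode` does the same split-and-flatten in one chain:

```python
import pandas as pd

# Hypothetical toy data in the same comma-separated multi-select format
df = pd.DataFrame({
    'BlogsPodcastsNewslettersSelect': [
        'KDnuggets Blog,R Bloggers Blog Aggregator',
        'KDnuggets Blog',
        None,
    ]
})

# Split on commas, flatten to one entry per row, then count;
# dropna() removes missing answers instead of keeping 'nan' strings
platforms = (df['BlogsPodcastsNewslettersSelect']
             .dropna()
             .str.split(',')
             .explode()
             .value_counts())
print(platforms)
```

Dropping missing values before the split also makes the `s != 'nan'` filter unnecessary.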

mcq['CoursePlatformSelect'] = mcq[
 'CoursePlatformSelect'].astype(
 'str').apply(lambda x: x.split(','))
mcq['CoursePlatformSelect'].head()

0 [nan]
1 [nan]
2 [Coursera, edX]
3 [nan]
4 [nan]
Name: CoursePlatformSelect, dtype: object

t = mcq.apply(lambda x: pd.Series(x['CoursePlatformSelect']),
 axis=1).stack().reset_index(level=1, drop=True)
t.name = 'courses'
t.head(20)

0 nan
1 nan
2 Coursera
2 edX
3 nan
4 nan
5 nan
6 nan
7 Coursera
8 nan
9 nan
10 Coursera
11 nan
12 Coursera
12 DataCamp
12 edX
13 nan
14 nan
15 nan
16 nan
Name: courses, dtype: object

t = t[t != 'nan'].value_counts()
plt.title("Most Popular Course Platforms")
sns.barplot(y=t.index, x=t)

png

Coursera and Udacity are the most popular course platforms.

Q7. Which skills do you think are most important for a data science job?

job_features = [
 x for x in mcq.columns if x.find(
 'JobSkillImportance') != -1
 and x.find('JobSkillImportanceOther') == -1]

job_features

['JobSkillImportanceBigData',
'JobSkillImportanceDegree',
'JobSkillImportanceStats',
'JobSkillImportanceEnterpriseTools',
'JobSkillImportancePython',
'JobSkillImportanceR',
'JobSkillImportanceSQL',
'JobSkillImportanceKaggleRanking',
'JobSkillImportanceMOOC',
'JobSkillImportanceVisualizations']

jdf = {}
for feature in job_features:
 a = mcq[feature].value_counts()
 a = a/a.sum()
 jdf[feature[len('JobSkillImportance'):]] = a

jdf

{'BigData': Nice to have 0.574065
Necessary 0.379929
Unnecessary 0.046006
Name: JobSkillImportanceBigData, dtype: float64,
'Degree': Nice to have 0.598107
Necessary 0.279867
Unnecessary 0.122026
Name: JobSkillImportanceDegree, dtype: float64,
'EnterpriseTools': Nice to have 0.564970
Unnecessary 0.290200
Necessary 0.144829
Name: JobSkillImportanceEnterpriseTools, dtype: float64,
'KaggleRanking': Nice to have 0.677261
Unnecessary 0.203876
Necessary 0.118863
Name: JobSkillImportanceKaggleRanking, dtype: float64,
'MOOC': Nice to have 0.606994
Unnecessary 0.285752
Necessary 0.107255
Name: JobSkillImportanceMOOC, dtype: float64,
'Python': Necessary 0.645994
Nice to have 0.327214
Unnecessary 0.026792
Name: JobSkillImportancePython, dtype: float64,
'R': Nice to have 0.513945
Necessary 0.414807
Unnecessary 0.071247
Name: JobSkillImportanceR, dtype: float64,
'SQL': Nice to have 0.491778
Necessary 0.434224
Unnecessary 0.073998
Name: JobSkillImportanceSQL, dtype: float64,
'Stats': Necessary 0.513889
Nice to have 0.457576
Unnecessary 0.028535
Name: JobSkillImportanceStats, dtype: float64,
'Visualizations': Nice to have 0.490820
Necessary 0.455392
Unnecessary 0.053788
Name: JobSkillImportanceVisualizations, dtype: float64}

jdf = pd.DataFrame(jdf).transpose()
jdf
|  | Necessary | Nice to have | Unnecessary |
| --- | --- | --- | --- |
| BigData | 0.379929 | 0.574065 | 0.046006 |
| Degree | 0.279867 | 0.598107 | 0.122026 |
| EnterpriseTools | 0.144829 | 0.564970 | 0.290200 |
| KaggleRanking | 0.118863 | 0.677261 | 0.203876 |
| MOOC | 0.107255 | 0.606994 | 0.285752 |
| Python | 0.645994 | 0.327214 | 0.026792 |
| R | 0.414807 | 0.513945 | 0.071247 |
| SQL | 0.434224 | 0.491778 | 0.073998 |
| Stats | 0.513889 | 0.457576 | 0.028535 |
| Visualizations | 0.455392 | 0.490820 | 0.053788 |

plt.figure(figsize=(10,6))
sns.heatmap(jdf.sort_values("Necessary",
 ascending=False), annot=True)

png

jdf.plot(kind='bar', figsize=(12,6),
 title="Skill Importance in Data Science Jobs")

png

Python, R, SQL, statistics, and visualization are regarded as necessary.

Big data, a degree, enterprise tools, Kaggle ranking, and MOOCs are considered nice to have.
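The loop above counts and normalizes each column by hand; `value_counts(normalize=True)` folds the count and the division into one call. A sketch on made-up responses (only the column names are real):

```python
import pandas as pd

# Real column names, hypothetical responses
mcq = pd.DataFrame({
    'JobSkillImportancePython': ['Necessary', 'Necessary', 'Nice to have', None],
    'JobSkillImportanceDegree': ['Nice to have', 'Unnecessary', 'Nice to have', 'Necessary'],
})

# Applying value_counts column-wise and transposing reproduces the jdf
# layout; fillna(0) covers categories a column never received
jdf = (mcq.apply(lambda col: col.value_counts(normalize=True))
          .T
          .fillna(0))
print(jdf)
```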

Q8. What is the average salary of a data scientist?

mcq[mcq['CompensationAmount'].notnull()].shape

(5224, 228)

mcq['CompensationAmount'] = mcq[
 'CompensationAmount'].str.replace(',','')
mcq['CompensationAmount'] = mcq[
 'CompensationAmount'].str.replace('-','')

# Load exchange-rate data for currency conversion
rates = pd.read_csv('data/conversionRates.csv')
rates.drop('Unnamed: 0',axis=1,inplace=True)

salary = mcq[
 ['CompensationAmount','CompensationCurrency',
 'GenderSelect',
 'Country',
 'CurrentJobTitleSelect']].dropna()
salary = salary.merge(rates,left_on='CompensationCurrency',
 right_on='originCountry', how='left')
salary['Salary'] = pd.to_numeric(
 salary['CompensationAmount']) * salary['exchangeRate']
salary.head()
|  | CompensationAmount | CompensationCurrency | GenderSelect | Country | CurrentJobTitleSelect | originCountry | exchangeRate | Salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 250000 | USD | Male | United States | Operations Research Practitioner | USD | 1.000000 | 250000.0 |
| 1 | 80000 | AUD | Female | Australia | Business Analyst | AUD | 0.802310 | 64184.8 |
| 2 | 1200000 | RUB | Male | Russia | Software Developer/Software Engineer | RUB | 0.017402 | 20882.4 |
| 3 | 95000 | INR | Male | India | Data Scientist | INR | 0.015620 | 1483.9 |
| 4 | 1100000 | TWD | Male | Taiwan | Software Developer/Software Engineer | TWD | 0.033304 | 36634.4 |
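The conversion step can be verified in isolation. A self-contained sketch of the same left-join-then-multiply pattern, on two made-up rows (the rate values are taken from the table above):

```python
import pandas as pd

# Hypothetical compensation rows and their conversion rates
salary = pd.DataFrame({
    'CompensationAmount': ['80000', '1200000'],
    'CompensationCurrency': ['AUD', 'RUB'],
})
rates = pd.DataFrame({
    'originCountry': ['AUD', 'RUB'],
    'exchangeRate': [0.802310, 0.017402],
})

# A left join keeps every salary row even when a currency has no rate;
# the USD amount is the raw amount times the exchange rate
merged = salary.merge(rates, left_on='CompensationCurrency',
                      right_on='originCountry', how='left')
merged['Salary'] = pd.to_numeric(merged['CompensationAmount']) * merged['exchangeRate']
print(merged[['CompensationCurrency', 'Salary']])
```

`how='left'` is the safer choice here: with an inner join, rows whose currency is missing from the rate table would silently disappear.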
print('Maximum Salary is USD $',
 salary['Salary'].dropna().astype(int).max())
print('Minimum Salary is USD $',
 salary['Salary'].dropna().astype(int).min())
print('Median Salary is USD $',
 salary['Salary'].dropna().astype(int).median())

Maximum Salary is USD $ 28297400000
Minimum Salary is USD $ 0
Median Salary is USD $ 53812.0

The largest value is reportedly bigger than the GDP of several countries, so it is clearly a bogus response. The median salary is USD $53,812. To make the plot easier to read, only salaries under USD $500,000 were kept for the distplot.

plt.subplots(figsize=(15,8))
salary = salary[salary['Salary'] < 500000]
sns.distplot(salary['Salary'])
plt.axvline(salary['Salary'].median(), linestyle='dashed')
plt.title('Salary Distribution',size=15)

Text(0.5,1,'Salary Distribution')

png
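The `str.replace` cleaning earlier can leave strings that `pd.to_numeric` would reject (for example, an answer consisting only of `-` becomes an empty string). A hedged alternative sketch, on made-up raw values, coerces unparseable entries to NaN instead:

```python
import pandas as pd

# Hypothetical raw CompensationAmount strings (made up for illustration)
raw = pd.Series(['250,000', '80000', '-', None])

# errors='coerce' turns anything unparseable into NaN instead of raising,
# so stray '-' answers can be removed with a later dropna()
amount = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')
print(amount)
```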

plt.subplots(figsize=(8,12))

sal_coun = salary.groupby(
 'Country')['Salary'].median().sort_values(
 ascending=False)[:30].to_frame()

sns.barplot('Salary',
 sal_coun.index,
 data = sal_coun,
 palette='RdYlGn')

plt.axvline(salary['Salary'].median(), linestyle='dashed')
plt.title('Highest Salary Paying Countries')

Text(0.5,1,'Highest Salary Paying Countries')

png

plt.subplots(figsize=(8,4))
sns.boxplot(y='GenderSelect',x='Salary', data=salary)

png

salary_korea = salary.loc[(salary['Country']=='South Korea')]
plt.subplots(figsize=(8,4))
sns.boxplot(y='GenderSelect',x='Salary',data=salary_korea)

png

salary_korea.shape

(26, 8)

salary_korea[salary_korea['GenderSelect'] == 'Female']
|  | CompensationAmount | CompensationCurrency | GenderSelect | Country | CurrentJobTitleSelect | originCountry | exchangeRate | Salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 479 | 30000 | KRW | Female | South Korea | Data Analyst | KRW | 0.000886 | 26.58 |
| 2903 | 800000 | KRW | Female | South Korea | Researcher | KRW | 0.000886 | 708.80 |
| 4063 | 60000000 | KRW | Female | South Korea | Researcher | KRW | 0.000886 | 53160.00 |
salary_korea_male = salary_korea[
 salary_korea['GenderSelect']== 'Male']
salary_korea_male['Salary'].describe()

count 23.000000
mean 43540.617217
std 37800.608484
min 0.886000
25% 17500.000000
50% 37212.000000
75% 59238.000000
max 177200.000000
Name: Salary, dtype: float64

salary_korea_male
|  | CompensationAmount | CompensationCurrency | GenderSelect | Country | CurrentJobTitleSelect | originCountry | exchangeRate | Salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 85 | 40000000 | KRW | Male | South Korea | Business Analyst | KRW | 0.000886 | 35440.000 |
| 147 | 80000 | USD | Male | South Korea | Researcher | USD | 1.000000 | 80000.000 |
| 314 | 60000 | USD | Male | South Korea | Business Analyst | USD | 1.000000 | 60000.000 |
| 333 | 60000000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 53160.000 |
| 562 | 50000000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 44300.000 |
| 769 | 42000000 | KRW | Male | South Korea | Software Developer/Software Engineer | KRW | 0.000886 | 37212.000 |
| 799 | 1000 | KRW | Male | South Korea | Machine Learning Engineer | KRW | 0.000886 | 0.886 |
| 1060 | 75000000 | KRW | Male | South Korea | Scientist/Researcher | KRW | 0.000886 | 66450.000 |
| 1360 | 30000000 | KRW | Male | South Korea | Statistician | KRW | 0.000886 | 26580.000 |
| 1568 | 90000 | SGD | Male | South Korea | Computer Scientist | SGD | 0.742589 | 66833.010 |
| 1576 | 10800000 | KRW | Male | South Korea | Data Scientist | KRW | 0.000886 | 9568.800 |
| 1905 | 20000 | USD | Male | South Korea | Researcher | USD | 1.000000 | 20000.000 |
| 1945 | 50000 | KRW | Male | South Korea | Machine Learning Engineer | KRW | 0.000886 | 44.300 |
| 1949 | 80000000 | KRW | Male | South Korea | Software Developer/Software Engineer | KRW | 0.000886 | 70880.000 |
| 2322 | 200000000 | KRW | Male | South Korea | Other | KRW | 0.000886 | 177200.000 |
| 2334 | 60000000 | KRW | Male | South Korea | Machine Learning Engineer | KRW | 0.000886 | 53160.000 |
| 2557 | 7200000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 6379.200 |
| 2924 | 15000 | USD | Male | South Korea | Researcher | USD | 1.000000 | 15000.000 |
| 3394 | 66000000 | KRW | Male | South Korea | Programmer | KRW | 0.000886 | 58476.000 |
| 3832 | 30000000 | KRW | Male | South Korea | Data Scientist | KRW | 0.000886 | 26580.000 |
| 3979 | 35000000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 31010.000 |
| 4300 | 60000000 | KRW | Male | South Korea | Scientist/Researcher | KRW | 0.000886 | 53160.000 |
| 4366 | 10000 | USD | Male | South Korea | Data Scientist | USD | 1.000000 | 10000.000 |
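With only 26 Korean respondents, and entries as implausible as a 1,000 KRW annual salary, single outliers dominate these boxplots. A minimal sketch of a more robust per-gender summary, on a hypothetical subset mimicking `salary_korea`'s columns (values drawn from the rows above):

```python
import pandas as pd

# Hypothetical subset with salary_korea's relevant columns
salary_korea = pd.DataFrame({
    'GenderSelect': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'Salary': [37212.0, 0.886, 53160.0, 26.58, 53160.0],
})

# The median per gender is far less sensitive to implausible outliers
# than the mean
medians = salary_korea.groupby('GenderSelect')['Salary'].median()
print(medians)
```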

Q9. Where do you get datasets for personal projects or study?

mcq['PublicDatasetsSelect'] = mcq[
 'PublicDatasetsSelect'].astype('str').apply(
 lambda x: x.split(',')
 )
q = mcq.apply(
 lambda x: pd.Series(x['PublicDatasetsSelect']),
 axis=1).stack().reset_index(level=1, drop=True)

q.name = 'datasets'
q = q[q != 'nan'].value_counts()
pd.DataFrame(q)

|  | datasets |
| --- | --- |
| Dataset aggregator/platform (i.e. Socrata/Kaggle Datasets/data.world/etc.) | 6843 |
| Google Search | 3600 |
| University/Non-profit research group websites | 2873 |
| I collect my own data (e.g. web-scraping) | 2560 |
| GitHub | 2400 |
| Government website | 2079 |
| Other | 399 |
plt.title("Most Popular Dataset Platforms")
sns.barplot(y=q.index, x=q)

png

Kaggle and Socrata are popular platforms for getting data for personal projects or study. Google Search and university/non-profit research group websites come second and third, and collecting one's own data (e.g. via web scraping) ranks fourth.

# Load the free-form (open-ended) responses.
ff = pd.read_csv('data/freeformResponses.csv',
 encoding="ISO-8859-1", low_memory=False)
ff.shape

(16716, 62)

# Look up the question text and who it was asked of
qc = question.loc[question[
 'Column'].str.contains('PersonalProjectsChallengeFreeForm')]
print(qc.shape)
qc.Question.values[0]

(1, 3)

'What is your biggest challenge with the public datasets you find for personal projects?'

In other words: what is the hardest part of working with public datasets in personal projects?

ppcff = ff[
 'PersonalProjectsChallengeFreeForm'].value_counts().head(15)
ppcff.name = 'Response Count'
pd.DataFrame(ppcff)
Response Count
None 23
Cleaning the data 20
Cleaning 20
Dirty data 16
Data Cleaning 14
none 13
dirty data 10
Data cleaning 10
- 9
Size 9
cleaning 8
Missing data 8
Incomplete data 8
Lack of documentation 7
Quality 6

Most respondents answered that it is cleaning the data, followed by the size of the data.
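The table above splits the same answer across casing and spacing variants ('Cleaning', 'cleaning', 'Data cleaning', ...). Normalizing before counting merges them; a sketch on made-up free-form answers:

```python
import pandas as pd

# Hypothetical free-form answers with inconsistent casing and spacing
answers = pd.Series(['Cleaning the data', ' cleaning', 'Cleaning',
                     'Dirty data', 'dirty data', None])

# Lower-case and strip whitespace before counting so variants like
# 'Cleaning' and ' cleaning' fall into the same bucket
counts = (answers.dropna()
                 .str.strip()
                 .str.lower()
                 .value_counts())
print(counts)
```

Fully merging synonyms ('cleaning' vs. 'dirty data') would still need manual grouping or fuzzy matching, but this simple normalization already makes the ranking more honest.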