In this course provided by IBM, I assume the role of an Associate Data Analyst who has recently joined the organization and am presented with a business challenge that requires data analysis on real-world datasets. The capstone project culminates in a presentation of my data analysis report, with an executive summary for the various stakeholders in the organization. I believe this project is a great opportunity to showcase data analytics skills and demonstrate proficiency to potential employers. The following are the notes I took during this course.
Data Collection
Collecting Data Using APIs
The HTTP protocol allows you to send and receive information through the web, including webpages, images, and other web resources.
Uniform resource locator (URL): the most popular way to find resources on the web
- Scheme: http://
- Internet address or Base URL: www.ibm.com
- Route location on the web server: /images/IDSNlogo.png
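The three parts listed above can be pulled out of a URL with Python's standard library. This is a minimal sketch, assuming the example URL is simply the scheme, base URL, and route from the list joined together:

```python
from urllib.parse import urlsplit

# Example URL assembled from the three parts listed above.
url = "http://www.ibm.com/images/IDSNlogo.png"

parts = urlsplit(url)
scheme = parts.scheme      # the protocol, without the "://"
base_url = parts.netloc    # the internet address
route = parts.path         # the location on the web server

print(scheme, base_url, route)
```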
Request
- Request start line = GET method + location of the resource (/index.html) + HTTP version
- Request header passes additional information with an HTTP request
Response
- Response start line = version number (HTTP/1.0) + a status code (200, meaning success) + a descriptive phrase (OK)
- Response header contains useful information
- Response body contains the requested file, an HTML document
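To make the anatomy of the two start lines concrete, here is a small sketch that builds a request start line and splits a response start line into its parts. The helper names `build_request_line` and `parse_status_line` are hypothetical, written only for this illustration; they are not part of any library.

```python
def build_request_line(method: str, path: str, version: str = "HTTP/1.0") -> str:
    """Assemble a request start line: method + resource location + HTTP version."""
    return f"{method} {path} {version}"


def parse_status_line(line: str) -> tuple:
    """Split a response start line into (version, status code, descriptive phrase)."""
    version, code, phrase = line.strip().split(" ", 2)
    return version, int(code), phrase


request_line = build_request_line("GET", "/index.html")
version, status_code, phrase = parse_status_line("HTTP/1.0 200 OK")

print(request_line)
print(version, status_code, phrase)
```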
Requests in Python
import requests
Get Request with URL Parameters
- You can use the GET method to modify the results of your query
url_get='http://httpbin.org/get'
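A sketch of how URL parameters end up in a GET request. The `payload` values are made up for illustration. The request is only prepared, not sent, so the final URL can be inspected without network access; the commented lines show how it would be sent.

```python
import requests

url_get = "http://httpbin.org/get"

# Hypothetical query parameters, for illustration only.
payload = {"name": "Joseph", "ID": "123"}

# Prepare the request without sending it, to see how the
# parameters are appended to the URL as a query string.
prepared = requests.Request("GET", url_get, params=payload).prepare()
print(prepared.url)

# To actually send the request (needs network access):
# r = requests.get(url_get, params=payload)
# print(r.status_code)
# print(r.json()["args"])
```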
Post Requests
- The POST request sends the data in a request body
url_post='http://httpbin.org/post'
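By contrast with GET, a POST request leaves the URL untouched and carries the data in the body. A minimal sketch, again with a made-up `payload` and a prepared (unsent) request so it runs offline:

```python
import requests

url_post = "http://httpbin.org/post"

# Hypothetical form data, for illustration only.
payload = {"name": "Joseph", "ID": "123"}

# Passing the data via `data=` form-encodes it into the request body.
prepared = requests.Request("POST", url_post, data=payload).prepare()

print(prepared.url)                      # the URL has no query string
print(prepared.body)                     # the form-encoded payload
print(prepared.headers["Content-Type"])  # how the body is encoded

# To actually send the request (needs network access):
# r = requests.post(url_post, data=payload)
# print(r.json()["form"])
```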
Collecting Data Using Web Scraping
Review of Web Scraping
from bs4 import BeautifulSoup # this module helps in web scraping.
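As a quick refresher, the sketch below parses a small inline HTML string (made up for this example) and extracts the page title and all link targets. In the lab the HTML would come from a downloaded page instead.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a downloaded page.
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <a href="https://www.ibm.com">IBM</a>
    <a href="https://www.python.org">Python</a>
  </body>
</html>
"""

# Use Python's built-in parser so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string
links = [a.get("href") for a in soup.find_all("a")]

print(title)
print(links)
```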
Scrape data from an HTML table
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
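The layout of HTMLColorCodes.html is not reproduced in these notes, so the sketch below uses a made-up two-column table as a stand-in. It shows the general pattern: find the table, walk its rows, and collect the text of each cell.

```python
from bs4 import BeautifulSoup

# Made-up table; the real page's columns may differ.
html = """
<table>
  <tr><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>red</td><td>#FF0000</td></tr>
  <tr><td>blue</td><td>#0000FF</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    # Collect the stripped text of every cell in this row.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Treat the first row as the header and the rest as data.
header, data = rows[0], rows[1:]
print(header)
print(data)
```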
Exploring Data
Load the dataset
import pandas as pd
Explore the dataset
#Display the top & bottom 5 rows and columns from your dataset
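A minimal sketch of loading and exploring a dataset. The dataset location is not given in these notes, so a small inline CSV stands in for it. Apart from `ConvertedComp`, which appears later in the notes, the column names and all values are assumptions.

```python
import io

import pandas as pd

# Made-up sample standing in for the survey CSV file.
csv_text = """Respondent,Country,Age,ConvertedComp
1,United States,28,61000
2,India,24,12000
3,Germany,31,74000
4,Canada,35,68000
5,Brazil,29,21000
6,France,41,58000
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first 5 rows
print(df.tail())   # last 5 rows

n_rows, n_cols = df.shape
print(n_rows, n_cols)   # number of rows and columns
print(df.dtypes)        # data type of each column
```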
Data Wrangling
Load the dataset
import pandas as pd
Finding Duplicates
#Find how many duplicate rows exist in the dataframe.
Removing Duplicates
#Remove the duplicate rows from the dataframe.
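Finding and removing duplicates can be sketched as follows, using a small made-up dataframe in place of the survey data:

```python
import pandas as pd

# Made-up data with two repeated rows.
df = pd.DataFrame({
    "Respondent": [1, 2, 2, 3, 3],
    "Country": ["India", "Canada", "Canada", "Brazil", "Brazil"],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Keep the first occurrence of each row and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)
print(len(df_clean))
```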
Finding Missing Values
#Find the missing values for all columns.
Determine Missing Values
#Find the value counts for the column WorkLoc.
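A sketch of counting missing values and inspecting `WorkLoc`. The data is made up, and filling the gaps with the most frequent category is one common choice assumed here, not necessarily the only reasonable one.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both columns.
df = pd.DataFrame({
    "WorkLoc": ["Office", "Home", np.nan, "Office", np.nan, "Office"],
    "Age": [28, np.nan, 31, 35, 29, 41],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Frequency of each category (missing values are excluded by default).
counts = df["WorkLoc"].value_counts()
most_frequent = counts.idxmax()
print(counts)

# Assumed strategy: impute missing entries with the majority category.
df["WorkLoc"] = df["WorkLoc"].fillna(most_frequent)
```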
Normalizing Data
#List out the various categories in the column 'CompFreq'
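One way to normalize compensation reported at different frequencies is to convert everything to a yearly figure. In this sketch the column names `CompTotal` and `NormalizedAnnualCompensation`, the category labels, and the multipliers (12 months, 52 weeks) are assumptions for illustration.

```python
import pandas as pd

# Made-up data; "CompTotal" is an assumed column name.
df = pd.DataFrame({
    "CompFreq": ["Yearly", "Monthly", "Weekly", "Yearly"],
    "CompTotal": [60000, 5000, 1000, 45000],
})

# List the distinct categories in CompFreq.
categories = df["CompFreq"].unique().tolist()
print(categories)

# Assumed number of pay periods per year for each category.
periods_per_year = {"Yearly": 1, "Monthly": 12, "Weekly": 52}

# Scale each amount to an annual figure.
df["NormalizedAnnualCompensation"] = (
    df["CompTotal"] * df["CompFreq"].map(periods_per_year)
)
print(df)
```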
Exploratory Data Analysis
Import necessary modules
import numpy as np
Distribution: Determine how the data is distributed
#Plot the distribution curve for the column ConvertedComp
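Alongside the plot, a few numbers can describe the same distribution. This sketch uses made-up `ConvertedComp` values and computes the mean, median, and histogram bin counts (the figures a histogram would draw); the bin count of 5 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up compensation values with one very large entry.
comp = pd.Series(
    [20000, 35000, 40000, 45000, 50000, 60000, 250000],
    name="ConvertedComp",
)

mean = comp.mean()
median = comp.median()
print(mean, median)   # a mean well above the median suggests right skew

# Bin counts and edges, i.e. the data behind a histogram.
counts, bin_edges = np.histogram(comp, bins=5)
print(counts)
print(bin_edges)
```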
Outliers
#Find out if outliers exist in the column ConvertedComp using a box plot
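A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR), so the same rule can be applied numerically to list the outliers. A sketch with made-up values:

```python
import pandas as pd

# Made-up compensation values with one very large entry.
comp = pd.Series(
    [20000, 35000, 40000, 45000, 50000, 60000, 250000],
    name="ConvertedComp",
)

q1 = comp.quantile(0.25)
q3 = comp.quantile(0.75)
iqr = q3 - q1

# Anything outside these bounds is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = comp[(comp < lower) | (comp > upper)]
without_outliers = comp[(comp >= lower) & (comp <= upper)]

print(iqr, lower, upper)
print(outliers.tolist())
```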
Correlation: Find the correlation between all numerical columns
df.corr()
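Depending on the pandas version, calling `df.corr()` on a dataframe that also contains text columns may raise an error, so selecting the numeric columns first is a safe pattern. A sketch with made-up data:

```python
import pandas as pd

# Made-up data with two numeric columns and one text column.
df = pd.DataFrame({
    "Age": [22, 28, 35, 41, 50],
    "ConvertedComp": [30000, 45000, 60000, 70000, 90000],
    "Country": ["India", "Canada", "Brazil", "France", "Germany"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlation.
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()
print(corr)
```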
Data Visualization
Work with Database
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m4_survey_data.sqlite
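Once the database file is downloaded, it can be queried from pandas through `sqlite3`. The table layout of m4_survey_data.sqlite is not shown in these notes, so this sketch builds an in-memory database with a hypothetical `survey` table instead. With the real file, the connection would be `sqlite3.connect("m4_survey_data.sqlite")`.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the downloaded database file.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE survey (Respondent INTEGER, Age INTEGER)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [(1, 28), (2, 35), (3, 41)],
)
conn.commit()

# List the tables in the database.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table'", conn
)
print(tables)

# Load the result of a SQL query into a dataframe.
df = pd.read_sql_query("SELECT * FROM survey WHERE Age > 30", conn)
print(df)

conn.close()
```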
Visualizing Distribution of Data
Relationship
Composition
Comparison
Dashboard Creation
IBM Cognos Dashboard Embedded (CDE) is an AI-fueled business intelligence service that supports the entire data analytics cycle, from discovery to operationalization. It provides users with data discovery capabilities to visually explore and interact with their data to identify the key insights for improving data-driven decisions. Users can perform data discovery and then quickly assemble that information into interactive, visually appealing dashboards, all without the need for formal training.
- Add a Cognos Dashboard Embedded (CDE) service to your project and upload external data files (only CSV files are supported)
- Navigate the CDE user interface (UI): start a new dashboard from a template, populate it with a data visualization, and save the dashboard
Presentation of Findings
Data collected, cleaned and organized -> Report (paper style report or slideshow presentation)
Elements Of A Successful Data Findings Report
- Outline
- Cover Page
- Executive Summary: briefly explains the details of the project and should be considered a stand-alone document
- Table of Contents
- Introduction: explains the nature of the analysis, states the problem, and gives the questions that were to be answered by performing the analysis
- Methodology: explains the data sources that were used in the analysis and outlines the plan for the collected data
- Results: goes into the detail of the data collection, how it was organized, and how it was analyzed; also contains the charts and graphs that substantiate the results and call attention to the more complex or crucial findings
- Discussion: engages the audience with a discussion of the implications drawn from the research
- Conclusion: reiterates the problem given in the introduction and gives an overall summary of the findings; also states the outcome of the analysis and whether any further steps will be taken
- Appendix: contains information that didn't fit in the main body of the report but was still important enough to include
Factors to remember in accurately conveying your message
- Make sure charts and graphs are not too small and are clearly labeled
- Use the data only as supporting evidence
- Share only one point from each chart
- Eliminate data that does not support the key message
Final Presentation (PDF version):
Or download it from my GitHub repository
Assignments
Visit my GitHub repository