One of the key skills of a data scientist is the ability to tell a compelling story by visualizing data and findings in an approachable and stimulating way. Learning how to use a software tool to visualize data helps one understand the data better and make more effective decisions. The main goal of the Data Visualization with Python course provided by IBM is to present data visually using various techniques and several Python data visualization libraries, namely Matplotlib, Seaborn, and Folium. The following are the notes I took during this course.
Introduction to Data Visualization Tools
Why Build Visuals?
- For exploratory data analysis
- Communicate data clearly
- Share unbiased representation of data
- Use them to support recommendations to different stakeholders
Less is more (more effective, attractive and impactful)
Matplotlib (History and Architecture): Created by John Hunter
- Backend Layer has three built-in abstract interface classes: FigureCanvas, Renderer, Event
- Artist Layer is comprised of one main object - Artist. Title, lines, tick labels, images all correspond to individual Artist instances.
- Two types of Artist objects: Primitive (Line2D, Rectangle, Circle and Text), Composite (Axis, Tick, Axes and Figure)
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas  # import FigureCanvas
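As a minimal sketch of working with the Backend and Artist layers directly, a histogram can be built without pyplot. The random data, the bin count and the title are my own illustrative assumptions, not values from the course.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

fig = Figure()              # Figure: the top-level composite Artist
canvas = FigureCanvas(fig)  # attach the figure to the Agg backend canvas

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # illustrative data only

ax = fig.add_subplot(111)    # Axes: composite Artist that holds the plot
counts, edges, patches = ax.hist(x, bins=100)  # each bar is a Rectangle primitive
ax.set_title("Histogram of 10,000 normal samples")

canvas.draw()                # render the figure in memory
```

Every step is explicit here: the figure, the canvas and the axes are all created by hand.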
Scripting Layer is comprised mainly of pyplot, a scripting interface that is lighter than the Artist layer.
import matplotlib.pyplot as plt
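A minimal sketch of the scripting layer, assuming made-up random data: pyplot creates the figure and axes implicitly, so a histogram needs only a couple of calls. The Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # illustrative data only

# pyplot creates the current figure and axes behind the scenes
counts, edges, patches = plt.hist(x, bins=100)
plt.title("Histogram of 10,000 normal samples")

fig = plt.gcf()  # the implicitly created figure
ax = plt.gca()   # the implicitly created axes
```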
Basic Plotting with Matplotlib
- Plot Function
%matplotlib inline  # pass in inline as the backend so that plots are rendered within the browser
Matplotlib - Pandas
df.plot(kind="line")  # create a line plot
Line Plots
- A line plot is a type of plot which displays information as a series of data points called ‘markers’ connected by straight line segments.
- Creating line plots
import matplotlib as mpl
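A minimal sketch of a line plot through the pandas plotting interface. The yearly counts are made up for illustration, and the Agg backend is an assumption so the script runs without a display.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts, for illustration only
years = list(range(2000, 2006))
series = pd.Series([120, 135, 150, 110, 160, 175], index=years, name="count")

ax = series.plot(kind="line", marker="o")  # markers joined by straight segments
ax.set_title("Yearly counts (illustrative data)")
ax.set_xlabel("Year")
ax.set_ylabel("Count")

line = ax.get_lines()[0]  # the Line2D artist that was drawn
```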
Basic and Specialized Visualization Tools
Area Plots (based on line plot)
- Also known as an area chart or area graph, it is commonly used to represent cumulative totals using numbers or percentages over time
- Area plots are stacked by default.
import matplotlib as mpl
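A minimal sketch of stacked and unstacked area plots with pandas. The two columns hold made-up numbers; `stacked=False` and `alpha` are the standard pandas and Matplotlib options for overlaying the areas instead of stacking them.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for two categories over time
df = pd.DataFrame(
    {"A": [10, 12, 15, 18], "B": [5, 9, 7, 11]},
    index=[2010, 2011, 2012, 2013],
)

# Stacked by default: B is drawn on top of A, so the top edge is A + B
ax_stacked = df.plot(kind="area")
ax_stacked.set_title("Stacked area plot (default)")

# Unstacked: each area starts from zero; transparency keeps both visible
ax_unstacked = df.plot(kind="area", stacked=False, alpha=0.35)
ax_unstacked.set_title("Unstacked area plot")
```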
Histograms
- A histogram is a way of representing the frequency distribution of a numeric dataset.
import matplotlib as mpl
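A minimal sketch of a histogram, assuming made-up random values. `np.histogram` returns the bin counts and bin edges, and passing the edges as `xticks` lines the tick marks up with the bars.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.integers(0, 1000, size=200))  # illustrative data only

# Partition the value range into 10 equal-width bins and count each bin
counts, bin_edges = np.histogram(values, bins=10)

# Use the bin edges as tick positions so ticks align with bar boundaries
ax = values.plot(kind="hist", bins=10, xticks=bin_edges)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
```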
Bar Charts
- Unlike a histogram, a bar chart is a type of plot where the length of each bar is proportional to the value of the item that it represents.
- It is commonly used to compare the values of a variable at a given point in time.
import matplotlib as mpl
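A minimal sketch of a bar chart with pandas, using made-up yearly values. `kind="bar"` draws vertical bars and `kind="barh"` draws horizontal ones.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values per year, for illustration only
totals = pd.Series([30, 45, 25, 60], index=[2010, 2011, 2012, 2013])

ax = totals.plot(kind="bar")  # bar height is proportional to each value
ax.set_xlabel("Year")
ax.set_ylabel("Total")

heights = [bar.get_height() for bar in ax.patches]
```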
Pie Charts
- A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion.
- Arguments against pie charts
import matplotlib as mpl
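A minimal sketch of a pie chart with pandas. The category names and values are made up; `autopct` and `startangle` are standard Matplotlib pie options forwarded by pandas.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals per category, for illustration only
shares = pd.Series([50, 30, 20], index=["A", "B", "C"], name="share")

ax = shares.plot(
    kind="pie",
    autopct="%1.1f%%",  # print each slice's percentage
    startangle=90,      # start the first slice at the top
)
ax.set_ylabel("")       # drop the default y-label (the series name)
ax.set_title("Share by category (illustrative data)")
```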
Box Plots
- A box plot is a way of statistically representing the distribution of given data through 5 main dimensions.
- Minimum, First Quartile, Median, Third Quartile, Maximum; Outliers are drawn as individual points beyond the whiskers
- IQR (Inter Quartile Range): between First Quartile and Third Quartile
Function = plot, and Parameter = kind with value = "box"
import matplotlib as mpl
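A minimal sketch of a box plot together with the numbers it summarizes. The data are made up, and the 1.5 × IQR rule for flagging outliers is Matplotlib's default whisker setting.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values with one obvious outlier
df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]})

ax = df.plot(kind="box")  # Function = plot, Parameter kind = "box"

# The quantities the box plot encodes
q1 = df["value"].quantile(0.25)
median = df["value"].median()
q3 = df["value"].quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = df.loc[
    (df["value"] < lower_fence) | (df["value"] > upper_fence), "value"
].tolist()
```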
Scatter Plots
- A scatter plot is a type of plot that displays values pertaining to typically two variables against each other.
- Usually a dependent variable is plotted against an independent variable in order to determine whether any correlation exists between the two.
import matplotlib as mpl
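A minimal sketch of a scatter plot with pandas. The column names `year` and `total` and their values are made up for illustration.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: "year" is the independent variable, "total" the dependent one
df = pd.DataFrame({
    "year": [2008, 2009, 2010, 2011, 2012, 2013],
    "total": [210, 225, 260, 255, 290, 310],
})

ax = df.plot(kind="scatter", x="year", y="total")
ax.set_title("Total by year (illustrative data)")

points = ax.collections[0].get_offsets()  # the plotted (x, y) pairs
```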
Bubble Plots
- A bubble plot is a variation of the scatter plot that displays three dimensions of data (x, y, z).
- The data points are replaced with bubbles, and the size of the bubble is determined by the third variable ‘z’, also known as the weight.
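A minimal sketch of a bubble plot, assuming made-up data. The third variable is min-max normalized and then scaled to marker areas; the scale factor of 2000 and the offset of 10 are arbitrary choices of mine to keep every bubble visible.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: x, y and a third variable z used as the bubble weight
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 14, 9, 20, 17],
    "z": [100, 400, 250, 900, 600],
})

# Min-max normalize z to the range [0, 1]
norm_z = (df["z"] - df["z"].min()) / (df["z"].max() - df["z"].min())

# Convert the normalized weights to marker areas (points squared)
sizes = (norm_z * 2000 + 10).to_numpy()

ax = df.plot(kind="scatter", x="x", y="y", s=sizes, alpha=0.5)
ax.set_title("Bubble plot (illustrative data)")

drawn_sizes = ax.collections[0].get_sizes()
```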
Advanced Visualizations and Geospatial Data
Waffle Charts
- A waffle chart is an interesting visualization that is normally created to display progress toward goals
- Matplotlib does not have a built-in function to create waffle charts.
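Since there is no built-in function, a waffle chart can be assembled by hand. The sketch below is one possible approach, not the course's exact code: the helper `waffle_grid` is a hypothetical name, the category values are made up, and leftover tiles are assigned by largest remainder so the grid is always filled completely.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import numpy as np

def waffle_grid(values, height=10, width=10):
    """Return a (height, width) array of category labels 1..n."""
    total_tiles = height * width
    props = np.asarray(values, dtype=float) / np.sum(values)
    tiles = np.floor(props * total_tiles).astype(int)
    # Give leftover tiles to the categories with the largest remainders
    remainder = props * total_tiles - tiles
    for i in np.argsort(-remainder)[: total_tiles - tiles.sum()]:
        tiles[i] += 1
    labels = np.repeat(np.arange(1, len(values) + 1), tiles)
    return labels.reshape(width, height).T  # fill column by column

values = [3, 2, 1]  # made-up category totals
grid = waffle_grid(values)

plt.matshow(grid, cmap="coolwarm")  # one colored tile per grid cell
plt.xticks([])
plt.yticks([])
```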
Word Clouds
- A Word Cloud is a depiction of the frequency of different words in some textual data
- Mueller’s word cloud generator
Seaborn and Regression Plots
- Seaborn is a Python visualization library based on Matplotlib
import seaborn as sns
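A minimal sketch of a regression plot with Seaborn's `regplot`, which draws a scatter plot together with a fitted regression line. The data and column names are made up for illustration.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed for headless runs
import pandas as pd
import seaborn as sns

# Made-up data with a roughly linear trend
df = pd.DataFrame({
    "year": [2008, 2009, 2010, 2011, 2012, 2013],
    "total": [210, 225, 260, 255, 290, 310],
})

# Scatter points plus a fitted regression line with a confidence band
ax = sns.regplot(x="year", y="total", data=df, color="green", marker="+")
ax.set_title("Regression plot (illustrative data)")
```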
Introduction to Folium and Map Styles
- Folium is a powerful Python library that helps you create several types of Leaflet maps
- It enables both the binding of data to a map for choropleth visualizations as well as passing visualizations as markers on the map
world_map = folium.Map()  # define the world map
Define the world map centered around Canada with a low zoom level
world_map = folium.Map(
Maps with Markers
canada_map = folium.Map(
Choropleth Maps
- A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per capita income.
- Geojson File
world_map = folium.Map(
Creating Dashboards with Plotly and Dash
Dashboard
- Produce real-time visuals
- Understand business moving parts
- Visually track, analyze and display key performance indicators (KPI)
- Make informed decisions and improve performance
- Reduce hours of analysis
Best dashboards answer critical business questions.
Python dashboarding tools: Dash from Plotly, Panel, voila, Streamlit, Bokeh, ipywidgets, matplotlib, Flask
Plotly
- Interactive, open-source plotting library
- Supports over 40 unique chart types
- Includes chart types like statistical, financial, maps, scientific and 3-dimensional
- Visualizations can be displayed in Jupyter notebook, saved to HTML files, or can be used in developing Python-built web applications using Dash
Plotly Graph Objects: Low-level interface to figures, traces, and layout: plotly.graph_objects.Figure
Plotly express: High-level wrapper for Plotly
import plotly.graph_objects as go
Dash
- Open-Source User Interface Python library for creating reactive, web-based applications
- Easy to build GUI
- Declarative and Reactive
- Rendered in web browser and can be deployed to servers
- Inherently cross-platform and mobile ready
- Both enterprise-ready and a first-class member of Plotly’s open-source tools
Dash core components: import dash_core_components as dcc
- Describe higher-level components that are interactive and are generated with JavaScript, HTML, and CSS through the React.js library
Dash HTML components: import dash_html_components as html
- Component for every HTML tag
A callback function is a Python function that Dash calls automatically whenever an input component’s property changes.
The callback function is decorated with the @app.callback decorator.
Callback decorator function takes two parameters: Input and Output
- Input and Output to the callback function will have component id and component property
- Multiple inputs or outputs should be enclosed inside either a list or tuple.
import pandas as pd
Assignments
Visit my GitHub Repository