Much of the world’s data resides in databases, A working knowledge of databases and SQL is a must to become a data scientist. The emphasis in this course provided by IBM is on hands-on and practical learning. So, I’ll try to record how I work with real databases, real data science tools, real-world datasets and eventually, how I create a database instance in the cloud on the following notes I took during this course.
Introduction to Databases
Structured Query Language (or SQL) is a powerful language which is used for communicating with and extracting data from databases.
SQL is among the top 3 skills for a Data Scientist or Data Analysts
A database is a repository of data, it provides the functionality for adding, modifying and querying data
RDBMS = Relational database management system
5 Basic SQL commands: Create, Insert, Select, Update, Delete
Cloud databases: Ease of use and access, Scalability & Economics, Disaster recovery
- IBM Db2, PostgreSQL, Oracle Cloud, Microsoft Azure, Amazon RDS
DBaaS (Database-as-a-Service) provides users with access to database resources in Cloud without the need for setting up hardware and installing software.
Relational Model allows for data independence (key advantage)
Entities are independent objects which have Attributes
- Entity-Relationship Model (ER-Model): used as a tool to design RDBMS
- Mapping Entity Diagrams to Tables: Entities become tables, Attributes get translated into columns
Primary Keys and Foreign Keys: A primary key uniquely identifies a specific row in a table and prevents duplication of data.
Basic SQL
Data Definition Language statements (DDL) and Data Manipulation Language statements (DML)
Data Definition Language (DDL) statements are used to define, change, or drop database objects.
- Common DDL statement types include:
CREATE
,ALTER
,TRUNCATE
andDROP
Data Manipulation Language (DML) statements are used to read and modify data in tables.
- CRUD operations: Create, Read, Update and Delete
- Common DML statement types include:
INSERT
,SELECT
,UPDATE
andDELETE
CREATE
and DROP
tables in the database
- It is quite common to issue a
DROP
before doing aCREATE
in test and development scenarios, but if the table does not already exist and you try to drop it, you will see an error likeXXX.YYY
is an undefined name.
1 | drop table COUNTRY; ## If table already exists |
Use SELECT
queries to retrieve data from the database
1 | select COLUMN1, COLUMN2, ... from TABLE1 ; |
Use COUNT
, DISTINCT
, LIMIT
, INSERT
, UPDATE
, DELETE
to compose and run basic queries
1 | -- 0. Drop table INSTRUCTOR in case it already exists |
String Patterns, Ranges, Sorting and Grouping
Using String Patterns and Ranges
- The
WHERE
clause always requires a predicate, which is a condition that evaluates to true, false or unknown. - Use the
LIKE
predicate with string patterns for the search:WHERE <columnname> LIKE <string pattern>
1 | select F_NAME , L_NAME |
Sorting Result Sets
ORDER BY
: Descending order,Specifying Column Sequence Number1
2
3select F_NAME, L_NAME, DEP_ID
from EMPLOYEES
order by DEP_ID desc, L_NAME desc;
Grouping Result Sets
SELECT DISTINCT()
: Eliminating DuplicatesGROUP BY
HAVING
: Restricting the result set
1 | select DEP_ID, COUNT(*) AS "NUM_EMPLOYEES", AVG(SALARY) AS "AVG_SALARY" |
Functions, Sub-Queries, Multiple Tables
Built-in Database Functions: Using database functions can significantly reduce the amount of data that needs to be retrieved from the database.
Aggregate or Column Functions
- INPUT: Collection of values (e.g. entire column), OUTPUT: Single value
SUM()
,MIN()
,MAX()
,AVG()
1 | select AVG( COST / QUANTITY ) from PETRESCUE where ANIMAL = 'Dog'; |
SCALAR
and STRING
functions
- Perform operations on every input value
ROUND()
,LENGTH()
,UCASE
,LCASE
1 | select DISTINCT(UCASE(ANIMAL)) from PETRESCUE; |
Date and Time Built-in Functions
YEAR()
,MONTH()
,DAY()
,DAYOFWEEK()
,DAYOFYEAR()
,WEEK()
,HOUR()
,MINUTE()
,SECOND()
- Date or Time Arithmetic
1 | select SUM(QUANTITY) from PETRESCUE where DAY(RESCUEDATE)='14'; |
Sub-Queries and Nested Selects
- Sub-Queries cannot evaluate Aggregate functions like
AVG()
in theWHERE
clause,therefore,use a sub-Select expression - Sub-queries in
FROM
clause substitute theTABLE
name with a sub-query called Derived Tables or Table Expressions.
1 | select * from employees where salary > AVG(salary) |
Working with Multiple Tables
- Sub-queries
1 | select * from employees where DEP_ID IN ( select DEPT_ID_DEP from departments where LOC_ID = 'L0002' ); |
Implicit JOIN
1 | select * from employees E, departments D where E.DEP_ID = D.DEPT_ID_DEP; |
JOIN
operators(INNER JOIN
, OUTER JOIN
…)
Accessing databases using Python
Python ecosystem: NumPy, pandas, matplotlib, SciPy
DB-API (Python Database API): Python’s standard API for accessing relational databases.
- Connection Objects: Database connections, Manage transactions
- Cursor Objects: Database queries, Scroll through result set, Retrieve results
- Connection methods:
.cursor()
,.commit()
,.rollback()
,.close()
- Cursor methods:
.callproc()
,.execute()
,.executemany()
,.fetchone()
,.fetchmany()
,.fetchall()
,.nextset()
,.arraysize()
,.close()
1 | from dbmodule import connect |
Connect to Db2 database (ibm_db API)
1 | import ibm_db |
Close the connection
1 | ibm_db.close(conn) |
Creating tables, loading data and querying data
1 | #Lets first drop the table INSTRUCTOR in case it exists from a previous attempt |
Retrieve data into Pandas
1 | import pandas |
SQL Magic
- Cell magics: start with a double
%%
sign and apply to the entire cell - Line magics: start with a single
%
sign and apply to a particular line in a cell %magicname arguments
%sql select * from tablename
Analyzing data with Python
1 | #Connect to the database |
Using JOIN operations to work with multiple tables
Inner Join
Outer Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
HR Database example
1 | --- Select the names, job start dates, and job titles of all employees who work for the department number 5 --- |
Working with Real World Datasets
Understand 3 Chicago datasets
Load the 3 datasets into 3 tables in a Db2 database
Execute SQL queries to answer assignment questions
Assignments
Visit my Github Repository