CptS 475/575: Data Science


Course Information

Credit hours: 3
Semester Offered: Fall


Course Description

Data Science is the study of the generalizable extraction of knowledge from data. Being a data scientist requires an integrated skill set spanning computer science, mathematics, statistics, and domain expertise along with a good understanding of the art of problem formulation to engineer effective solutions. The purpose of this course is to introduce students to this rapidly growing field and equip them with some of its basic principles and tools as well as its general mindset. The course will use the programming languages R (primarily) and Python.

Topics to be covered include: the data science process, exploratory data analysis, data wrangling, data visualization, linear regression, classification, clustering, principal component analysis, time-series data mining, deep learning, and data and ethics.

The treatment of these topics favors breadth over depth, with emphasis on the integration and synthesis of concepts and their application to solving problems. Necessary theoretical abstractions (mathematical and algorithmic) are introduced as needed.


Audience

The course is suitable for upper-level undergraduate students (CptS 475) and graduate students (CptS 575) in computer science, engineering, applied mathematics, the sciences, business, and related analytic fields.


Prerequisites

Students are expected to: (i) have taken an introductory course in statistics and probability, (ii) have basic knowledge of algorithms and reasonable programming experience (equivalent to completing a data structures course such as CptS 223), and (iii) have some familiarity with basic linear algebra (e.g., eigenvalue/eigenvector computation).


Coursework

The course consists of several elements: lectures (three times a week, 50 min each); a set of assignments; a substantial semester project; one mid-term exam and no final exam. Graduate students are required to complete one additional assignment: a survey paper. Below is an example of how the coursework and assessment are broken down.

For students taking it as CptS 475:

  • Assignments (35%): There will be a total of about five assignments spread through the semester. Each assignment will have one major topic of emphasis. Assignments are to be completed and submitted individually. Each assignment will carry equal weight. Together, all assignments account for 35% of the final grade.
  • Semester Project (40%): Students, working individually or in a team of two, will complete a semester project. A project could take one of several forms: analyzing an interesting dataset using existing methods and software tools; developing a new method; building your own data product; or creating a visualization of a complex dataset. Students will be given an opportunity to choose from a list of projects the instructor provides or propose their own project. Guidelines for what constitutes a project will be provided by the instructor. A project will culminate in a written report and a short presentation in class. Guidelines for preparing the report and for preparing and delivering a good presentation will be provided by the instructor, and students are expected to follow them.
  • Exam (23%): There will be one mid-term exam designed to complement the assignments and the semester project. The exam is tentatively scheduled to take place in the week of Nov. 4. The final date will be decided after consulting with the class.
  • Class Participation (2%): Active class participation (in discussions during lectures, surveys, and other online discussions, including responding to the Participation Question of the Day) is required. Class participation will count towards 2% of the final grade.

For students taking it as CptS 575:

  • Assignments (25%). There will be a total of 5 assignments spread through the semester. Each assignment will have one major topic of emphasis. Assignments are to be completed and submitted individually. Each assignment will carry equal weight. Together, all assignments account for 25% of the final grade.
  • Semester Project (40%). Students, working in teams of two or three, will complete a semester project. A project could take one of several forms: analyzing an interesting dataset using existing methods and software tools; developing new data science methods; careful performance evaluation of known methods; building your own data product; or creating a visualization of a complex dataset. Students will be given an opportunity to choose from a list of projects the instructor provides or propose their own project. Guidelines for what constitutes a project will be provided by the instructor. A project will culminate in a written report and a short (5-min) presentation in class. General guidelines for how to prepare a report will be provided by the instructor. Similarly, guidelines for how to prepare and deliver good presentations will be provided by the instructor.
  • Survey Paper (10%). Each student, individually, will write a survey paper further exploring a specific topic related to the course content. The topic will be chosen in consultation with the instructor. The background material for the paper may be drawn from journal/conference literature reflecting recent research, books, or reports. The format of the paper will resemble typical journal/conference survey papers. The length of the paper will depend on the nature of the topic chosen but will typically be around ten pages.
  • Exam (23%). There will be one mid-term exam designed to complement the assignments and the semester project. The exam is tentatively scheduled for the week of November 8.
  • Participation (2%). Active class participation (in discussions during lectures, surveys, and other online discussions) is required. It will count towards 2% of the final grade.

Learning Outcomes and Assessment

STUDENT LEARNING OUTCOMES (By the end of the course, students should be able to:) | COURSE TOPICS/DATES (The following topics/dates will address this outcome:) | EVALUATION (This outcome will be evaluated primarily by:)
Describe what Data Science is and the skill sets needed | What is Data Science? (Week 1) | Assignments; Exam
Describe the Data Science Process | EDA and the Data Science Process (Week 3) | Assignments; Exam; Project
Use R (or Python) to carry out statistical modeling and analysis | Intro to R (week 2); Most subsequent topics throughout the semester | Assignments; Project
Carry out exploratory data analysis | EDA (week 3) | Assignments; Project
Use effective data wrangling approaches to manipulate data | Data Wrangling (week 4, week 5) | Assignments; Project
Create effective visualization of data (to communicate or persuade) | Data Visualization (week 5, week 6) | Assignments; Project
Apply machine learning algorithms for predictive modeling | Linear Regression (week 7); Classification (weeks 8 and 9); Deep learning (week 13, week 15) | Assignments; Project; Exam
Apply effective resampling methods to assess model performance | Cross-validation (week 10) | Project; Exam
Apply learning methods to discover patterns, trends, and anomalies in data | Unsupervised learning (week 11); Time Series Data Mining (week 12) | Assignments; Project; Exam
Reason around ethical and privacy issues in data science conduct, and apply ethical practices | Data and Ethics (week 15) | In-class exercise
Work effectively in teams on data science projects | | Project
Apply knowledge gained in the course to carry out a project and write a technical report | | Project

Detailed Topics and Course Outline

  1. Introduction: What is Data Science?
    – Big Data and Data Science; Landscape of perspectives; Skill sets needed
  2. Intro to R
    – R basics; R graphics; R Markdown
  3. Exploratory Data Analysis and the Data Science Process
    – Basic tools of EDA; Philosophy of EDA; The Data Science Process
  4. Data Wrangling
    – Data transformation and manipulation (dplyr); Relational data; Data “tidying” (tidyr)
  5. Data Visualization
    – Telling a story with data; Choosing tools to visualize data; Visualizing patterns over time; Visualizing proportions; Visualizing relationships; Visualizing text information; Anscombe’s quartet; Tufte’s visualization aesthetic.
  6. Overview of Machine Learning
    – Supervised Learning (canonical examples and real-world applications); Unsupervised Learning (canonical examples and real-world applications)
  7. Linear Regression
    – Simple linear regression; Multiple linear regression; Extensions of the linear model
  8. Classification
    – Overview of classification; Logistic regression; Linear Discriminant Analysis; Naive Bayes classifier; K-Nearest Neighbors (KNN); Decision Trees and Random Forest
  9. Resampling Methods
    – Cross-validation; The Bootstrap
  10. Unsupervised Learning
    – Principal Component Analysis (PCA); K-means clustering; Hierarchical clustering
  11. Time Series Data Mining Overview
    – Examples of areas where time series data arise; Distance measures; Algorithms (motif discovery, anomaly detection, segmentation, classification, clustering).
  12. Intro to Deep Learning
    – What is deep learning? The perceptron; Activation functions; Building neural networks; Training neural networks; Regularization; Software packages for DL; Convolutional neural networks
  13. Data Science and Ethical Issues
    – Discussions on privacy, security, ethics; A look back at Data Science

Books

There is no required “textbook” for this course. Select chapters from the following references will be used as starting points for discussions, but they will be supplemented with instructor-developed lecture notes and reading assignments from other sources.

  • Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. Second Edition, Springer, 2021. ISBN 978-1071614174.
    The book is freely available online at: www.statlearning.com/.
  • Hadley Wickham and Garrett Grolemund. R for Data Science. The book is freely available at: r4ds.had.co.nz/.
  • Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press; expected Feb 2022. Additional information, including draft PDF files of chapters, is available on the book's web page.
  • Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press; expected Feb 2022. Additional information, including draft PDF files of chapters, is available on the book's web page.
  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. Third Edition, Cambridge University Press, 2021. The book is freely available online at: www.mmds.org/.
  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note: this is a book currently being written by the three authors. The authors have made a draft of their notes for the book available online: Foundations of Data Science. The material is intended for a modern theoretical course in computer science).
  • Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. ISBN 9780262035613. The book is freely available online at: www.deeplearningbook.org.

Weekly Schedule

Tentative week-by-week schedule of topics and assignments. The date shown in parentheses is just the Monday of that week.

WEEKS | TOPICS | ASSIGNMENTS
01 | What is Data Science | Assignment 1 out
02 | Intro to R/Python | Assignment 1 due, Assignment 2 out
03 | Exploratory Data Analysis | Assignment 2 due
04 | Data Wrangling I | Assignment 3 out
05 | Data Wrangling II, Data Visualization I | Assignment 3 due
06 | Data Visualization II | Assignment 4 out
07 | Semester project set-up, Overview of ML | Assignment 4 due, Project proposal out
08 | Linear Regression | Project proposal due, Assignment 5 out
09 | Classification I |
10 | Classification II, Resampling methods | Assignment 5 due
11 | Unsupervised Learning | Project progress report due
12 | Time Series Data Mining | Mid-term Exam
13 | Deep Learning (DL) |
14 | DL II, Ethics, Course wrap-up | In-class exercise
15 | Thanksgiving break |
16 | Project presentations | Survey paper due (for 575)
17 | | Final project report due on Dec 10