CptS 475/575: Data Science
Course Information
Credit hours: 3
Semester Offered: Fall
Course Description
Data Science is the study of the generalizable extraction of knowledge from data. Being a data scientist requires an integrated skill set spanning computer science, mathematics, statistics, and domain expertise along with a good understanding of the art of problem formulation to engineer effective solutions. The purpose of this course is to introduce students to this rapidly growing field and equip them with some of its basic principles and tools as well as its general mindset. The course will use the programming languages R (primarily) and Python.
Topics to be covered include: the data science process, exploratory data analysis, data wrangling, data visualization, linear regression, classification, clustering, principal component analysis, time-series data mining, deep learning, and data and ethics.
These topics are treated with a focus on breadth rather than depth, and emphasis is placed on integration and synthesis of concepts and their application to solving problems. Necessary theoretical abstractions (mathematical and algorithmic) are introduced as needed.
Audience
The course is suitable for upper-level undergraduate students (CptS 475) and graduate students (CptS 575) in computer science, engineering, applied mathematics, the sciences, business, and related analytic fields.
Prerequisites
Students are expected to: (i) have taken an introductory course in statistics and probability, (ii) have basic knowledge of algorithms and reasonable programming experience (equivalent to completing a data structures course such as CptS 223), and (iii) have some familiarity with basic linear algebra (e.g., eigenvalue/eigenvector computation).
Coursework
The course consists of several elements: lectures (three times a week, 50 min each); a set of assignments; a substantial semester project; and one mid-term exam (there is no final exam). Graduate students are required to complete one additional assignment: a survey paper. Below is an example of how the coursework and assessment are broken down.
For students taking it as CptS 475:
- Assignments (35%): There will be a total of about five assignments spread through the semester. Each assignment will have one major topic of emphasis. Assignments are to be completed and submitted individually. Each assignment will carry equal weight. Together all assignments account for 35% of final grade.
- Semester Project (40%): Students, working individually or in a team of two, will complete a semester project. A project could take one of several forms: analyzing an interesting dataset using existing methods and software tools; developing a new method; building your own data product; or creating a visualization of a complex dataset. Students will be given an opportunity to choose from a list of projects the instructor provides or propose their own project. Guidelines for what constitutes a project will be provided by the instructor. A project will culminate in a written report and a short presentation in class. The instructor will also provide general guidelines for how to prepare a report and how to prepare and deliver good presentations; students are expected to follow these guidelines.
- Exam (23%): There will be one mid-term exam designed to complement the assignments and the semester project. The exam is tentatively scheduled to take place in the week of Nov. 4. The final date will be decided later after consulting with the class.
- Class Participation (2%): Active class participation (in discussions during lectures, surveys, and other online discussions, including responding to the Participation Question of the Day) is required. Class participation will count towards 2% of the final grade.
For students taking it as CptS 575:
- Assignments (25%). There will be a total of 5 assignments spread through the semester. Each assignment will have one major topic of emphasis. Assignments are to be completed and submitted individually. Each assignment will carry equal weight. Together all assignments account for 25% of final grade.
- Semester Project (40%). Students, working in teams of two or three, will complete a semester project. A project could take one of several forms: analyzing an interesting dataset using existing methods and software tools; developing new data science methods; careful performance evaluation of known methods; building your own data product; or creating a visualization of a complex dataset. Students will be given an opportunity to choose from a list of projects the instructor provides or propose their own project. Guidelines for what constitutes a project will be provided by the instructor. A project will culminate in a written report and a short (5-min) presentation in class. General guidelines for how to prepare a report will be provided by the instructor. Similarly, guidelines for how to prepare and deliver good presentations will be provided by the instructor.
- Survey Paper (10%). Each student, individually, will write a survey paper further exploring a specific topic related to the course content. The topic will be chosen in consultation with the instructor. The background material for the paper may be drawn from journal/conference literature reflecting recent research, books, or reports. The format of the paper will resemble typical journal/conference survey papers. The length of the paper will depend on the nature of the topic chosen but will typically be around ten pages.
- Exam (23%). There will be one mid-term exam designed to complement the assignments and the semester project. The exam is tentatively scheduled for the week of November 8.
- Participation (2%). Active class participation (in discussions during lectures, surveys, and other online discussions) is required. It will count towards 2% of the final grade.
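As a rough illustration of how the component weights listed above combine into a final grade, here is a minimal Python sketch. The weights are the percentages in the list headings. The function name `final_grade`, the dictionary names, and the example scores are hypothetical and are not part of the official grading procedure.

```python
# Component weights as fractions of the final grade, from the list headings above.
WEIGHTS_475 = {"assignments": 0.35, "project": 0.40, "exam": 0.23, "participation": 0.02}
WEIGHTS_575 = {"assignments": 0.25, "project": 0.40, "survey_paper": 0.10,
               "exam": 0.23, "participation": 0.02}


def final_grade(scores, weights):
    """Return the weighted final grade (0-100) from per-component scores (0-100)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical component scores for a CptS 475 student (illustration only).
example_scores_475 = {"assignments": 90, "project": 85, "exam": 80, "participation": 100}
example_grade_475 = final_grade(example_scores_475, WEIGHTS_475)
print(round(example_grade_475, 1))
```

With these made-up scores the weighted sum is 31.5 + 34 + 18.4 + 2, so the sketch prints 85.9. How percentages map to letter grades is not specified here and is left to the instructor.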
Learning Outcomes and Assessment
| STUDENT LEARNING OUTCOMES. | COURSE TOPICS/DATES. | EVALUATION. |
|---|---|---|
| By the end of the course, students should be able to: | The following topics/dates will address this outcome: | This outcome will be evaluated primarily by: |
| Describe what Data Science is and the skill sets needed | What is Data Science? (week 1) | Assignments; Exam |
| Describe the Data Science Process | EDA and the Data Science Process (week 3) | Assignments; Exam; Project |
| Use R (or Python) to carry out statistical modeling and analysis | Intro to R (week 2); Most subsequent topics throughout the semester | Assignments; Project |
| Carry out exploratory data analysis | EDA (week 3) | Assignments; Project |
| Use effective data wrangling approaches to manipulate data | Data Wrangling (week 4, week 5) | Assignments; Project |
| Create effective visualizations of data (to communicate or persuade) | Data Visualization (week 5, week 6) | Assignments; Project |
| Apply machine learning algorithms for predictive modeling | Linear Regression (week 7); Classification (weeks 8 and 9); Deep learning (week 13, week 15) | Assignments; Project; Exam |
| Apply effective resampling methods to assess model performance | Cross-validation (week 10) | Project; Exam |
| Apply learning methods to discover patterns, trends, and anomalies in data | Unsupervised learning (week 11); Time Series Data Mining (week 12) | Assignments; Project; Exam |
| Reason around ethical and privacy issues in data science conduct, and apply ethical practices | Data and Ethics (week 15) | In-class exercise |
| Work effectively in teams on data science projects | Project | |
| Apply knowledge gained in the course to carry out a project and write a technical report | Project | |
Detailed Topics and Course Outline
- Introduction: What is Data Science?
  – Big Data and Data Science; Landscape of perspectives; Skill sets needed
- Intro to R
  – R basics; R graphics; R Markdown
- Exploratory Data Analysis and the Data Science Process
  – Basic tools of EDA; Philosophy of EDA; The Data Science Process
- Data Wrangling
  – Data transformation and manipulation (dplyr); Relational data; Data “tidying” (tidyr)
- Data Visualization
  – Telling stories with data; Choosing tools to visualize data; Visualizing patterns over time; Visualizing proportions; Visualizing relationships; Visualizing text information; Anscombe’s quartet; Tufte’s visualization aesthetic
- Overview of Machine Learning
  – Supervised Learning (canonical examples and real-world applications); Unsupervised Learning (canonical examples and real-world applications)
- Linear Regression
  – Simple linear regression; Multiple linear regression; Extensions of the linear model
- Classification
  – Overview of classification; Logistic regression; Linear Discriminant Analysis; Naive Bayes classifier; K-Nearest Neighbors (KNN); Decision Trees and Random Forests
- Resampling Methods
  – Cross-validation; The Bootstrap
- Unsupervised Learning
  – Principal Component Analysis (PCA); K-means clustering; Hierarchical clustering
- Time Series Data Mining Overview
  – Examples of areas where time series data arise; Distance measures; Algorithms (motif discovery, anomaly detection, segmentation, classification, clustering)
- Intro to Deep Learning
  – What is deep learning?; The perceptron; Activation functions; Building neural networks; Training neural networks; Regularization; Software packages for DL; Convolutional neural networks
- Data Science and Ethical Issues
  – Discussions on privacy, security, ethics; A look back at Data Science
Books
There is no required “textbook” for this course. Select chapters from the following references will be used as starting points for discussions, but they will be supplemented with instructor-developed lecture notes and reading assignments from other sources.
- Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. Second Edition, Springer, 2021. ISBN 978-1071614174. The book is freely available online at: www.statlearning.com/.
- Hadley Wickham and Garrett Grolemund. R for Data Science. The book is freely available at: r4ds.had.co.nz/.
- Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press; expected Feb 2022. Additional information, including draft PDF files of chapters, is available at the Probabilistic Machine Learning book 1 link.
- Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press; expected Feb 2022. Additional information, including draft PDF files of chapters, is available at the Probabilistic Machine Learning book 2 link.
- Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. Third Edition, Cambridge University Press, 2021. The book is freely available online at: www.mmds.org/.
- Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note: this is a book currently being written by the three authors. The authors have made a draft of their notes for the book available online: Foundations of Data Science. The material is intended for a modern theoretical course in computer science).
- Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. ISBN 9780262035613. The book is freely available online at: www.deeplearningbook.org.
Weekly Schedule
The following is a tentative week-by-week schedule of topics and assignments. Any date shown in parentheses is the Monday of that week.
| WEEKS | TOPICS | ASSIGNMENTS |
|---|---|---|
| 01 | What is Data Science | Assignment 1 out |
| 02 | Intro to R/Python | Assignment 1 due, Assignment 2 out |
| 03 | Exploratory Data Analysis | Assignment 2 due |
| 04 | Data Wrangling I | Assignment 3 out |
| 05 | Data Wrangling II, Data Visualization I | Assignment 3 due |
| 06 | Data Visualization II | Assignment 4 out |
| 07 | Semester project set-up, Overview of ML | Assignment 4 due, Project proposal out |
| 08 | Linear Regression | Project proposal due, Assignment 5 out |
| 09 | Classification I | |
| 10 | Classification II, Resampling methods | Assignment 5 due |
| 11 | Unsupervised Learning | Project progress report due |
| 12 | Time Series Data Mining | Mid-term Exam |
| 13 | Deep Learning (DL) | |
| 14 | DL II, Ethics, Course wrap-up | In-class exercise |
| 15 | Thanksgiving break | |
| 16 | Project presentations | Survey paper due (for 575) |
| 17 | | Final project report due on Dec 10 |