CptS 475/575: Data Science — 2021 — Syllabus

Links:   CptS 475 Syllabus in PDF   CptS 575 Syllabus in PDF   Datasets

Course Information

Credit hours: 3
Semester: Fall 2021
Meeting times and location:  MWF 9:10-10:00am, CARP 102

Course management system: Canvas will be used to manage this course, including posting lecture material, assignments, announcements, and messages, and handling student submissions and instructor feedback.


Instructor Information

Instructor: Assefaw Gebremedhin
Office: EME B43
Email: assefaw DOT gebremedhin AT wsu DOT edu
Webpage: www.eecs.wsu.edu/~assefaw
Instructor Office hours: Wednesdays 10:30am–12pm, or by appointment.

Teaching Assistant: Shruti Patil
Email: shrutisunil DOT patil AT wsu DOT edu
Office: Dana 115
TA Office Hours: Tuesdays and Wednesdays 3–4:30pm


Course Description

Data Science is the study of the generalizable extraction of knowledge from data. Being a data scientist requires an integrated skill set spanning computer science, mathematics, statistics, and domain expertise along with a good understanding of the art of problem formulation to engineer effective solutions. The purpose of this course is to introduce students to this rapidly growing field and equip them with some of its basic principles and tools as well as its general mindset. The course will use the programming languages R (primarily) and Python.

Topics to be covered include: the data science process, exploratory data analysis, data wrangling, data visualization, linear regression, classification, clustering, principal component analysis, time-series data mining, deep learning, and data and ethics.

The treatment of these topics favors breadth over depth, with emphasis on integrating and synthesizing concepts and applying them to solve problems. Necessary theoretical abstractions (mathematical and algorithmic) are introduced as and when needed.


Audience:

The course is suitable for upper-level undergraduate students (CptS 475) and graduate students (CptS 575) in computer science, engineering, applied mathematics, the sciences, business, and related analytic fields.

Prerequisites:

Students are expected to: (i) have taken an introductory course in statistics and probability, (ii) have basic knowledge of algorithms and reasonable programming experience (equivalent to completing a data structures course such as CptS 223), and (iii) have some familiarity with basic linear algebra (e.g. eigenvalue/vector computation).


Coursework

For students taking it as CptS 475:

The course consists of several elements: lectures (three times a week, 50 min each); a set of assignments; a substantial semester project; one mid-term exam and no final exam. Below is how the coursework and assessment are broken down.

  • Assignments (35%). There will be a total of about five assignments spread through the semester. Each assignment will have one major topic of emphasis. Assignments are to be completed and submitted individually. Each assignment will carry equal weight. Together, all assignments account for 35% of the final grade.
  • Semester Project (40%). Students, working in teams of two or three, will complete a semester project. A project could take one of several forms: analyzing an interesting dataset using existing methods and software tools; building your own data product; or creating a visualization of a complex dataset. Students will be given an opportunity to choose from a list of projects the instructor provides or propose their own project. Guidelines for what constitutes a project will be provided by the instructor. A project will culminate in a written report and a short (5-min) presentation in class. General guidelines for how to prepare a report will be provided by the instructor. Similarly, guidelines for how to prepare and deliver good presentations will be provided by the instructor.
  • Exam (22%). There will be one mid-term exam designed to complement the assignments and the semester project. The exam is tentatively scheduled for the week of November 8.
  • Participation (3%). Active class participation (in discussions during lectures, surveys, and other online discussions) is required. It will count towards 3% of the final grade.

For students taking it as CptS 575:

The course consists of several elements: lectures (three times a week, 50 min each); a set of assignments; a survey paper; a substantial semester project; and one mid-term exam. Below is how the coursework and assessment are broken down.

  • Assignments (25%). There will be a total of five assignments spread through the semester. Each assignment will have one major topic of emphasis. Assignments are to be completed and submitted individually. Each assignment will carry equal weight. Together, all assignments account for 25% of the final grade.
  • Semester Project (40%). Students, working in teams of two or three, will complete a semester project. A project could take one of several forms: analyzing an interesting dataset using existing methods and software tools; developing new data science methods; careful performance evaluation of known methods; building your own data product; or creating a visualization of a complex dataset. Students will be given an opportunity to choose from a list of projects the instructor provides or propose their own project. Guidelines for what constitutes a project will be provided by the instructor. A project will culminate in a written report and a short (5-min) presentation in class. General guidelines for how to prepare a report will be provided by the instructor. Similarly, guidelines for how to prepare and deliver good presentations will be provided by the instructor.
  • Survey Paper (15%). Each student, individually, will write a survey paper further exploring a specific topic related to the course content. The topic will be chosen in consultation with the instructor. The background material for the paper may be drawn from journal/conference literature reflecting recent research, books, or reports. The format of the paper will resemble typical journal/conference survey papers. The length of the paper will depend on the nature of the topic chosen but will typically be around ten pages.
  • Exam (17%). There will be one mid-term exam designed to complement the assignments and the semester project. The exam is tentatively scheduled for the week of November 8.
  • Participation (3%). Active class participation (in discussions during lectures, surveys, and other online discussions) is required. It will count towards 3% of the final grade.
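
As an illustration of how the percentages above combine, the following is a minimal Python sketch that computes a weighted final percentage for either track. The weights are taken from the two breakdowns above; the function name, the dictionary keys, and the assumption that each component is scored on a 0 to 100 scale are hypothetical and are not part of the official grading procedure.

```python
# Component weights (percent of final grade), copied from the coursework
# breakdowns for CptS 475 and CptS 575. The key names are illustrative.
WEIGHTS = {
    "475": {"assignments": 35, "project": 40, "exam": 22, "participation": 3},
    "575": {"assignments": 25, "project": 40, "survey_paper": 15,
            "exam": 17, "participation": 3},
}


def final_percentage(track, scores):
    """Return the weighted final percentage for the given track.

    `scores` maps each component name to a percentage between 0 and 100
    (an assumption made for this sketch).
    """
    weights = WEIGHTS[track]
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly: " + ", ".join(sorted(weights)))
    # Each weight is a percentage, so divide the weighted sum by 100.
    return sum(weights[name] * scores[name] for name in weights) / 100


if __name__ == "__main__":
    example = {"assignments": 90, "project": 80, "exam": 70, "participation": 100}
    print(final_percentage("475", example))
```

Both sets of weights sum to 100, so a student with full marks on every component would receive 100% under either track.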

Learning Outcomes and Assessment

STUDENT LEARNING OUTCOMES | COURSE TOPICS/DATES | EVALUATION
By the end of the course, students should be able to: | The following topics/dates will address this outcome: | This outcome will be evaluated primarily by:
Describe what Data Science is and the skill sets needed | What is Data Science? (Week 1) | Assignments; Exam
Describe the Data Science Process | EDA and the Data Science Process (Week 3) | Assignments; Exam; Project
Use R (or Python) to carry out statistical modeling and analysis | Intro to R (week 2); Most subsequent topics throughout the semester | Assignments; Project
Carry out exploratory data analysis | EDA (week 3) | Assignments; Project
Use effective data wrangling approaches to manipulate data | Data Wrangling (week 4, week 5) | Assignments; Project
Create effective visualization of data (to communicate or persuade) | Data Visualization (week 5, week 6) | Assignments; Project
Apply machine learning algorithms for predictive modeling | Linear Regression (week 7); Classification (weeks 8 and 9); Deep learning (week 13, week 15) | Assignments; Project; Exam
Apply effective resampling methods to assess model performance | Cross-validation (week 10) | Project; Exam
Apply learning methods to discover patterns, trends, and anomalies in data | Unsupervised learning (week 11); Time Series Data Mining (week 12) | Assignments; Project; Exam
Reason around ethical and privacy issues in data science conduct, and apply ethical practices | Data and Ethics (week 15) | In-class exercise
Work effectively in teams on data science projects | | Project
Apply knowledge gained in the course to carry out a project and write a technical report | | Project

Expectations for Student Effort

For each hour of lecture equivalent, students should expect to have a minimum of two hours of work outside class.


Grading

Letter grades will be given according to the following ranges:
A (93-100%), A- (90-92.99%), B+ (87-89.99%), B (83-86.99%), B- (80-82.99%), C+ (77-79.99%), C (70-76.99%), C- (67-69.99%), D (60-66.99%), F (less than 60%).
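
The ranges above can be read as a lookup by lower bound. The following minimal Python sketch illustrates that reading; the function name is hypothetical, and the choice to treat a value that falls between two listed ranges (for example 92.995) as belonging to the lower grade, with no rounding, is an assumption rather than stated policy.

```python
# Lower bounds of each letter-grade range, taken from the list above,
# ordered from highest to lowest.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (70, "C"), (67, "C-"), (60, "D"),
]


def letter_grade(percentage):
    """Map a final percentage to a letter grade.

    Assumption: no rounding is applied, so a score must reach a range's
    lower bound to earn that grade.
    """
    for lower_bound, letter in CUTOFFS:
        if percentage >= lower_bound:
            return letter
    return "F"  # anything below 60%


if __name__ == "__main__":
    for score in (95, 92.99, 81.9, 69.99, 59.99):
        print(score, letter_grade(score))
```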

Detailed Topics and Course Outline

  1. Introduction: What is Data Science?
    – Big Data and Data Science; Landscape of perspectives; Skill sets needed
  2. Intro to R
    – R basics; R graphics; R Markdown
  3. Exploratory Data Analysis and the Data Science Process
    – Basic tools of EDA; Philosophy of EDA; The Data Science Process
  4. Data Wrangling
    – Data transformation and manipulation (dplyr); Relational data; Data “tidying” (tidyr)
  5. Data Visualization
    – Telling a story with data; Choosing tools to visualize data; Visualizing patterns over time; Visualizing proportions; Visualizing relationships; Visualizing text information; Anscombe’s quartet; Tufte’s visualization aesthetic
  6. Overview of Machine Learning
    – Supervised Learning (canonical examples and real-world applications); Unsupervised Learning (canonical examples and real-world applications)
  7. Linear Regression
    – Simple linear regression; Multiple linear regression; Extensions of the linear model
  8. Classification
    – Overview of classification; Logistic regression; Linear Discriminant Analysis; Naive Bayes classifier; K-Nearest Neighbors (KNN); Decision Trees and Random Forest
  9. Resampling Methods
    – Cross-validation; The Bootstrap
  10. Unsupervised Learning
    – Principal Component Analysis (PCA); K-means clustering; Hierarchical clustering
  11. Time Series Data Mining Overview
    – Examples of areas where time series data arise; Distance measures; Algorithms (motif discovery, anomaly detection, segmentation, classification, clustering)
  12. Intro to Deep Learning
    – What is deep learning?; The perceptron; Activation functions; Building neural networks; Training neural networks; Regularization; Software packages for DL; Convolutional neural networks
  13. Data Science and Ethical Issues
    – Discussions on privacy, security, ethics; A look back at Data Science

Books

There is no required “textbook” for this course. Select chapters from the following references will be used as starting points for discussions, but they will be supplemented with instructor-developed lecture notes and reading assignments from other sources.

  • Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. Second Edition, Springer, 2021. ISBN 978-1071614174.
    The book is freely available online at: www.statlearning.com/.
  • Hadley Wickham and Garrett Grolemund. R for Data Science. The book is freely available at: r4ds.had.co.nz/.
  • Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press; expected Feb 2022. Additional information, including draft PDF files of chapters, is available at the Probabilistic Machine Learning book 1 page.
  • Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press; expected Feb 2022. Additional information, including draft PDF files of chapters, is available at the Probabilistic Machine Learning book 2 page.
  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. Third Edition, Cambridge University Press, 2021. The book is freely available online at: www.mmds.org/.
  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note: this is a book currently being written by the three authors. The authors have made a draft of their notes for the book available online: Foundations of Data Science. The material is intended for a modern theoretical course in computer science).
  • Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. ISBN 9780262035613. The book is freely available online at: www.deeplearningbook.org.

Weekly Schedule

Tentative week-by-week schedule of topics and assignments. The date shown in parentheses is the Monday of that week.

WEEKS | TOPICS | ASSIGNMENTS
01 (Aug 23) | What is Data Science | Assignment 1 out
02 (Aug 30) | Intro to R/Python | Assignment 1 due, Assignment 2 out
03 (Sep 06) | Exploratory Data Analysis | No class 9/6; Assignment 2 due
04 (Sep 13) | Data Wrangling I | Assignment 3 out
05 (Sep 20) | Data Wrangling II, Data Visualization I | Assignment 3 due
06 (Sep 27) | Data Visualization II, Project Setup | Assignment 4 out, Project proposal out
07 (Oct 04) | Overview of ML; Linear Regression I | Assignment 4 due, Survey paper call out (for 575)
08 (Oct 11) | Linear Regression II, Classification I | Project proposal due, Assignment 5 out
09 (Oct 18) | Classification II | Assignment 5 due, Survey paper topic due (for 575)
10 (Oct 25) | Resampling Methods |
11 (Nov 01) | Unsupervised Learning | Project progress report due
12 (Nov 08) | Time Series Data Mining | Mid-term Exam
13 (Nov 15) | Deep Learning (DL) |
14 (Nov 22) | Thanksgiving break |
15 (Nov 29) | DL II; Ethics; Wrap-up | In-class Exercise
16 (Dec 06) | Project presentations | Survey paper due (for 575)
17 (Dec 13) | | Final project report due on Dec 13

Policies

Conduct:

Students are expected to maintain a professional and respectful virtual classroom environment. In particular, this includes:

  • silencing personal electronics (non-disruptive devices may be used during class).
  • arriving on time and remaining throughout the class.

Correspondence:

All class-related correspondence with the instructor will be made via Canvas.

Attendance:

Regular attendance is expected. While students may miss class for urgent reasons, excessive absences that are not cleared with the instructor will factor into the Class Participation portion of the semester grade.

Missing or late work:

Submissions will be handled via Canvas. Students are expected to submit assignments by the specified due date and time. Assignments turned in up to 48 hours late will be accepted with a 10% grade penalty for each 24 hours late. Except by prior arrangement, missing work and work submitted more than 48 hours late will receive a zero.
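
The late-work rule can be illustrated with a minimal Python sketch. Several details here are assumptions, since the policy text does not spell them out: the score is on a 0 to 100 scale, the penalty is 10 percentage points for each started 24-hour period, and the function name is hypothetical. Prior arrangements with the instructor are not modeled.

```python
import math


def late_adjusted_score(score, hours_late):
    """Apply the late-work rule to a raw score on a 0-100 scale.

    Assumptions made for this sketch: each started 24-hour period of
    lateness costs 10 percentage points, and the result never goes
    below zero.
    """
    if hours_late <= 0:
        return score  # submitted on time, no penalty
    if hours_late > 48:
        return 0.0  # more than 48 hours late counts as a zero
    periods_late = math.ceil(hours_late / 24)  # 1 or 2 within the 48-hour window
    return max(0.0, score - 10 * periods_late)


if __name__ == "__main__":
    for hours in (0, 5, 24, 30, 48, 49):
        print(hours, late_adjusted_score(90, hours))
```

An alternative reading would scale the earned score by 0.9 or 0.8 instead of subtracting points; the instructor's interpretation governs.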

Missed Exam:

There will be only one exam in the course, tentatively scheduled for the week of November 8. The exact date will be set at least two weeks in advance, in consultation with the class, to accommodate students’ other exam schedules and commitments as much as possible. No make-up exam will be offered if the exam is missed.

COVID-19 Policy:

Students are expected to abide by all current COVID-19 related university policies and public health directives, which could include wearing a cloth face covering, physically distancing, self-attestations, and sanitizing common use spaces. All current COVID-19 related university policies and public health directives are located at the WSU COVID-19 webpage.

Academic Integrity:

Academic integrity is the cornerstone of higher education. As such, all members of the university community share responsibility for maintaining and promoting the principles of integrity in all activities, including academic integrity and honest scholarship. Academic integrity will be strongly enforced in this course. Students who violate WSU’s Academic Integrity Policy (identified in Washington Administrative Code (WAC) 504-26-010(4)) will fail the assignment implicated, will not have the option to withdraw from the course pending an appeal, and will be reported to the Center for Community Standards.

Cheating includes, but is not limited to, plagiarism and unauthorized collaboration as defined in the Standards of Conduct for Students, WAC 504-26-010(3). Read and understand all of the definitions of cheating given in WAC 504-26-010. If you have any questions about what is and is not allowed, ask your course instructor.

If you wish to appeal an instructor’s decision relating to academic integrity, please use the form available at communitystandards.wsu.edu. Make sure you submit your appeal within 21 calendar days of the instructor’s decision.

Students with Disabilities:

Reasonable accommodations are available for students with documented disabilities or chronic medical conditions. If you have a disability and need accommodations to fully participate in this class, please visit your campus Access Center website (websites listed below) to follow published procedures to request accommodations. Students may also call or email the Access Center to schedule an appointment with an Access Advisor. All disability-related accommodations are to be approved through the Access Center. It is a university expectation that students with approved accommodations visit with instructors (in person or via Zoom) within two weeks of requesting their accommodations to discuss logistics.

For more information contact a Disability Specialist on your home campus:

Accommodation for Religious Observances or Activities:

Washington State University reasonably accommodates absences allowing for students to take holidays for reasons of faith or conscience or organized activities conducted under the auspices of a religious denomination, church, or religious organization. Reasonable accommodation requires the student to coordinate with the instructor on scheduling examinations or other activities necessary for course completion. Students requesting accommodation must provide written notification within the first two weeks of the beginning of the course and include specific dates for absences. Approved accommodations for absences will not adversely impact student grades. Absence from classes or examinations for religious reasons does not relieve students from responsibility for any part of the course work required during the period of absence. Students who feel they have been treated unfairly in terms of this accommodation may refer to Academic Regulation 104 – Academic Complaint Procedures. See also Academic Regulation 82, available at the academic regulations page.

Academic Dates and Deadlines:

Students are encouraged to refer to the academic calendar often to be aware of critical deadlines throughout the semester. The academic calendar can be found at the academic calendar page.

Changes:

This syllabus is subject to change. Updates will be posted on the course website.