CSC110 Fall 2024 Coding Project 2: Tabular Data and Sentiment Analysis


Note: Any FAQs or clarifications relevant to the assignment will be posted here. This post will be continually updated (with newer updates at the bottom of the page), so make sure to check on it regularly – click the “Watch” button in the top-right corner to receive notifications about this thread. If you have a question which is not already addressed on this page, create a new thread to post your question on our discussion board.

In parts 1 and 2 of this assignment, you’ll analyze real-world TTC subway delay data with functions—first using nested lists and comprehensions, and then using data classes and for loops.

In part 3, you will learn about sentiment analysis and write a small program that analyzes the sentiments found in text (we’ve included a bunch of song lyrics files for you to try the program out with).

Note: this assignment involves more programming than Assignment 1, and overall is a bit longer. Please start early!

Advice for Assignment 2

We strongly recommend taking a few minutes to perform the following steps, similar to the advice given for Assignment 1.

Overall, this assignment involves some deeper analysis and more complex programming tasks than Assignment 1. However, we’ve taken the same approach of breaking down the assignment into multiple parts (and sub-parts) to make it easier for you to process everything, and make a plan to complete the assignment in chunks.

Logistics

Getting Started

General instructions

Like in the previous assignment, we have provided code at the bottom of each file for running doctest examples (and pytest in Part 3) and PythonTA on each file. The PythonTA code in the comments for each file is different, and is specific to that file. For example, the PythonTA code in Parts 1 and 2 allows use of the function open, whereas the PythonTA code for Part 3 does not allow open to be used. We are not grading doctests on this assignment, but encourage you to add some as a way to understand each function we’ve asked you to complete. We are using PythonTA to grade your work, so please run that on every Python file you submit using the code we’ve provided.

Part 1: TTC subway delays

Delays in public transit are a significant source of frustration for the residents of Toronto. Admittedly, adding time to a commute can take a toll on just about anyone. Some articles and books claim that a short commute time will improve your happiness. One article goes so far as to link the misery of additional commute time to a corresponding pay cut. In this exploration, you will work with data on subway delays provided by the Toronto Transit Commission (TTC), the organization that runs Toronto public transit.

0. The data set

You should see the file ttc-subway-delays.csv in your a2 folder, as it was included with your starter code. This file contains a record of all TTC delays in the time period from January 1, 2014 to October 31, 2019, courtesy of the City of Toronto. After completing this assignment, you could use your code to analyze the newer data available on the City of Toronto site, for an interesting side project!

The data is stored using the comma-separated values (csv) file format, the same format we saw in class. For example, in our sample data the first four lines look like this:

Date	Time	Day	Station	Code	Min Delay	Min Gap	Bound	Line	Vehicle
01/01/2014	0:21	Wednesday	VICTORIA PARK STATION	MUPR1	55	60	W	BD	5111
01/01/2014	2:06	Wednesday	HIGH PARK STATION	SUDP	3	7	W	BD	5001
01/01/2014	2:40	Wednesday	SHEPPARD STATION	MUNCA	0	0		YU	0

Here is a description and expected Python data types of the columns in this data set.

Column name	Description	Python data type
Date	The date of the delay	`datetime.date`
Time	The time of the delay	`datetime.time`
Day	The day of the week on which the delay occurred.	`str`
Station	The name of the subway station where the delay occurred.	`str`
Code	The TTC delay code, which usually describes the cause of the delay. You can find a table showing the codes and descriptions in `ttc-subway-delay-codes.csv`, which was also included in the starter code.	`str`
Min Delay	The length of the subway delay (in minutes).	`int`
Min Gap	The length of time between subway trains (in minutes).	`int`
Bound	The direction in which the train was travelling. This is dependent on the line the train was on.	`str`
Line	The abbreviated name of the subway line where the delay occurred.	`str`
Vehicle	The id number of the train on which the delay occurred.	`int`

1. Reading the file

Your first task is to take the ttc-subway-delays.csv and load the data in Python in the same way we did this in lecture. Complete the read_csv_file function (along with its helper functions, which we describe below), which returns a tuple with two elements, the first representing the header, and the second representing the remaining rows of data.

Using tuples

While we’ve discussed tuples several times during lecture, we have not had much practice using them until now. On this assignment, many of the functions you write will use tuples. Tuples are similar to lists and can be indexed using [], but they are an immutable data type and support no mutating methods. Unlike a list type annotation, a tuple type annotation can specify the type of each element, even for a heterogeneous collection, as you can see in the function headers in the starter code.

To write a tuple literal, use parentheses along with commas, for example (1, 'hi') for a tuple of type tuple[int, str]. Sometimes, the parentheses can be omitted. An example of this can be found in the starter code for part 3, and PyCharm will notify you when this is the case. A tuple with one element is written with an extra comma since parentheses on their own are used for precedence. For example, to write a value of type tuple[int] we write (1,).
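The points above about tuple literals can be tried out directly. The following is a small illustrative sketch; the variable names are made up for this example and do not come from the starter code.

```python
# A heterogeneous tuple: the annotation records the type of each position.
pair: tuple[int, str] = (1, 'hi')

# Tuples are indexed like lists...
first = pair[0]
second = pair[1]

# ...but they cannot be mutated: item assignment raises a TypeError.
try:
    pair[0] = 2
    mutated = True
except TypeError:
    mutated = False

# A one-element tuple needs a trailing comma; (1) on its own is just the int 1.
single: tuple[int] = (1,)
not_a_tuple = (1)

# The parentheses can sometimes be omitted, e.g. on the right side of an assignment.
no_parens = 1, 'hi'
```

Here mutated ends up False, single is a tuple while not_a_tuple is an int, and no_parens is equal to pair.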

Use a csv.reader object that we saw in class to read rows of data from the file. Recall that this object turns every row into a list of strings. However, in order to do useful computations on this data, we’ll need to convert many of these entries into other Python data types, like int and datetime.date. Implement the helper function process_row—and its helper functions str_to_date and str_to_time—to process a single row of data to convert the entries into their appropriate data types (specified in the table above).
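As a rough illustration of the conversion described above, here is a minimal sketch, not the expected solution. It makes several assumptions that you should check against the starter code: the function signatures shown are guesses, the date strings are assumed to be in month/day/year order (the sample rows do not settle this), and an in-memory string stands in for the real file so the example runs on its own.

```python
import csv
import datetime
import io

# Stand-in for the real file: the header plus two of the sample rows.
SAMPLE_CSV = (
    'Date,Time,Day,Station,Code,Min Delay,Min Gap,Bound,Line,Vehicle\n'
    '01/01/2014,0:21,Wednesday,VICTORIA PARK STATION,MUPR1,55,60,W,BD,5111\n'
    '01/01/2014,2:40,Wednesday,SHEPPARD STATION,MUNCA,0,0,,YU,0\n'
)


def str_to_date(date_string: str) -> datetime.date:
    """Convert a string like '01/01/2014' to a date.

    ASSUMPTION: the string is in month/day/year order.
    """
    month, day, year = date_string.split('/')
    return datetime.date(int(year), int(month), int(day))


def str_to_time(time_string: str) -> datetime.time:
    """Convert an hour:minute string like '2:06' to a time."""
    hour, minute = time_string.split(':')
    return datetime.time(int(hour), int(minute))


def process_row(row: list[str]) -> list:
    """Convert each entry of one row to the data type given in the column table."""
    return [
        str_to_date(row[0]), str_to_time(row[1]), row[2], row[3], row[4],
        int(row[5]), int(row[6]), row[7], row[8], int(row[9]),
    ]


# csv.reader yields each row as a list of strings.
reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)                         # the first row is the header
rows = [process_row(row) for row in reader]   # the remaining rows, converted
```

Note that an empty entry, such as the missing Bound in the second row, stays an empty string.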

2. Operating on the data

Now that we have this csv data stored as a nested list in Python, we can do some analysis on it! Complete the functions below to answer some questions about this data.

Coding requirements:

For this question, we will start off with our “older” tools of comprehensions and lists of data. Other features (e.g., loops) are not allowed, and parts of your submissions that use them may receive a grade as low as zero for doing so. We will use different features in Part 2 when we revisit these functions.
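To show what "comprehensions and lists of data" can look like on this data set, here is a short sketch using the three sample rows from Part 1. The constant names and the particular questions asked are illustrative assumptions, not the functions you are required to write.

```python
import datetime

# Column positions in each inner list (names chosen for this sketch only).
MIN_DELAY = 5
LINE = 8

data = [
    [datetime.date(2014, 1, 1), datetime.time(0, 21), 'Wednesday',
     'VICTORIA PARK STATION', 'MUPR1', 55, 60, 'W', 'BD', 5111],
    [datetime.date(2014, 1, 1), datetime.time(2, 6), 'Wednesday',
     'HIGH PARK STATION', 'SUDP', 3, 7, 'W', 'BD', 5001],
    [datetime.date(2014, 1, 1), datetime.time(2, 40), 'Wednesday',
     'SHEPPARD STATION', 'MUNCA', 0, 0, '', 'YU', 0],
]

# Pull one column out with a comprehension, then aggregate with built-ins.
delay_minutes = [row[MIN_DELAY] for row in data]
longest = max(delay_minutes)
average = sum(delay_minutes) / len(delay_minutes)

# A filtering comprehension: count the rows recorded on the BD line.
num_on_bd = len([row for row in data if row[LINE] == 'BD'])
```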

Part 2: TTC subway delays revisited

Now we will revisit our code in Part 1 and use the tools we learned about later in the course: data classes and for loops. You may want to (and are allowed to) reuse your Part 1 code here.

1. Adding a data class

Your first task is to design and implement the new data class Delay, which represents a single row of the table. This is very similar to what we did for the marriage license data set in lecture.
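For orientation, a data class for one row might look roughly like the sketch below. The attribute names are assumptions made for this example (the starter code may specify different ones), while the types follow the column table from Part 1.

```python
from dataclasses import dataclass
import datetime


@dataclass
class Delay:
    """A single row of the TTC subway delay table.

    The attribute names here are illustrative, not prescribed by the handout.
    """
    date: datetime.date
    time: datetime.time
    day: str
    station: str
    code: str
    min_delay: int
    min_gap: int
    bound: str
    line: str
    vehicle: int


# The second sample row from Part 1, as a Delay instance.
example = Delay(datetime.date(2014, 1, 1), datetime.time(2, 6), 'Wednesday',
                'HIGH PARK STATION', 'SUDP', 3, 7, 'W', 'BD', 5001)
```

Compared with indexing into a list, an attribute access such as example.min_delay states what the value means.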

2. Reading the file

Next, complete the read_csv_file function and its helpers, like the analogous functions in Part 1, but using the new Delay data class.

3. Operating on the data

Coding requirements:

For this question, we will practice using for loops instead of comprehensions. In addition to this requirement, do not use any built-in aggregation functions (like sum or len or max). Like before, parts of your submissions that use these features may receive a grade as low as zero for doing so.

All of your loops should follow the loop accumulator pattern from lecture:

<x>_so_far = <default_value>

for element in <collection>:
    <x>_so_far = ... <x>_so_far ... element ...  # Somehow combine loop variable and accumulator

return <x>_so_far

Finally, complete the functions longest_delay, average_delay, and num_delays_by_month, which are equivalent to the functions you completed for Part 1, except they now take in a list[Delay] rather than a list[list] to represent the tabular data. Note that because we have the more specific type annotation list[Delay], we no longer need the preconditions in Part 1 saying that the inner lists have the right structure!
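Here is one way the accumulator pattern can be filled in without sum, len, or max. This is a sketch under assumptions: the cut-down Delay class, its min_delay attribute, and both function names are hypothetical and chosen for this example, so they may not match the required function specifications.

```python
from dataclasses import dataclass


@dataclass
class Delay:
    """Cut-down stand-in with only the attribute this sketch needs."""
    min_delay: int


def longest_delay_minutes(delays: list[Delay]) -> int:
    """Return the largest min_delay value in delays (0 if delays is empty)."""
    longest_so_far = 0

    for delay in delays:
        if delay.min_delay > longest_so_far:
            longest_so_far = delay.min_delay

    return longest_so_far


def average_delay_minutes(delays: list[Delay]) -> float:
    """Return the mean min_delay value in delays.

    Preconditions:
      - delays != []
    """
    # Two accumulators, updated together, replace sum and len.
    total_so_far = 0
    count_so_far = 0

    for delay in delays:
        total_so_far = total_so_far + delay.min_delay
        count_so_far = count_so_far + 1

    return total_so_far / count_so_far


sample = [Delay(55), Delay(3), Delay(0)]
```

With the sample list, longest_delay_minutes(sample) gives 55 and average_delay_minutes(sample) gives 58 / 3.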

Part 3: Sentiment Analysis

For this part of the assignment, we will be taking part in some sentiment analysis (also known as “opinion mining” or “emotion AI”).

Sentiment analysis involves analyzing the feelings behind a piece of text. For example, it can involve looking at reviews for a product and trying to figure out whether each review is positive or negative. There are, of course, many limitations to this (for example, an automated analysis can be bad at detecting things like sarcasm!), but it’s a very important application of computational linguistics.

This sort of analysis is mainly used in analyzing social media texts (e.g. how positive is someone’s Twitter profile in comparison to another’s?), customer reviews (e.g. how positive or negative is a brand’s online reputation, based on what customers say about it?), and other marketing or web-related tasks.

For this problem set, we will write some programs that can tell us whether a piece of text is happy or sad (or neutral). Our programs will be very naive (i.e. they may not always give an accurate analysis), but it’ll give you a basic idea of how such sentiment analysis works. :)

(If you are interested in this field of study, here is a good further (completely optional) reading on some of its usages and limitations.)

1. Let’s analyze some text

In a2_part3.py, you will be completing functions that make use of a sentiment score dictionary (based on simplified versions of SentiWordNet 3.0, a collection of many, many words and their associated scores).

We have included a small dictionary SAMPLE_SENTIMENTS at the top of the file which includes a tiny subset of sentiment keywords from the above resource.

The structure of our sentiment dictionary is as follows: {word1: (positive score for word1, negative score for word1), ...}

For example, from SAMPLE_SENTIMENTS we can see the word ‘good’ has a positive score of 0.625 and a negative score of 0.0, the word ‘wrong’ has a positive score of 0.1215 and a negative score of 0.75, and so on.

Any word that appears as a key in the sentiment dictionary is considered a sentiment keyword within our program.

In our program, we mainly deal with text as a list of words. To calculate the overall sentiment (which you will do in the function get_overall_score), consider the following steps:

Note that, while the overall sentiment score can be either > 0 (positive) or < 0 (negative), the individual positive and negative sentiment scores (in the sentiment score dictionary) will never be negative. For example, given the sentiment keyword entry ‘shine’: (0.375, 0.125), the word “shine” has a positivity score of 0.375 and a negativity score of 0.125 (NOT -0.125).
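The dictionary structure and the definition of a sentiment keyword can be sketched as follows. The three entries use the scores quoted in this handout; the variable names and the word list are made up for this example, and the sketch deliberately stops short of computing an overall score.

```python
# {word: (positive score, negative score)}; both scores are never negative.
sentiments: dict[str, tuple[float, float]] = {
    'good': (0.625, 0.0),
    'wrong': (0.1215, 0.75),
    'shine': (0.375, 0.125),
}

# Text is represented as a list of words.
words = ['nothing', 'wrong', 'with', 'a', 'good', 'song']

# A sentiment keyword is any word that appears as a key in the dictionary.
keywords = [word for word in words if word in sentiments]

# Looking up a keyword gives its two scores as a tuple, which can be unpacked.
positive, negative = sentiments['wrong']
```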

Complete the four functions provided in a2_part3.py based on the instructions provided within the starter code file as well as the explanations above.

2. Testing mutation (or, the absence of mutation)

Now that we’ve discussed mutation, it would be wise to make sure that we are not mutating objects when we shouldn’t be. Three of the four functions you completed in question 1 take arguments with mutable data types, but should not be mutating any of these objects. (We will skip get_sentiment_info, as it is difficult to satisfy its precondition that the given filename refers to a valid file.)

Based on the reading you just completed, complete the property-based test for each of the two remaining functions (get_keywords and get_overall_score) in test_a2_part3.py to test that they do not mutate the objects that their arguments refer to. Note that you do not need to have completed question 1 in order to write these tests.

In order to write these property-based tests, you will need hypothesis to generate objects like dicts, tuples, and strs. We can do this using new strategies, which are imported at the top of the starter file. For example, to generate random objects of type tuple[str, dict[int, float]] for a parameter named x, we would use the decorator @given(..., x=tuples(text(), dictionaries(integers(), floats())), ...).
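The check at the heart of a "does not mutate" test can be sketched with only the standard library: copy the argument, call the function, and compare. This is an illustration of the idea with hand-picked inputs, not a hypothesis test. In your actual tests, @given would supply the inputs, and the helper name below is hypothetical.

```python
import copy


def leaves_argument_unchanged(func, argument) -> bool:
    """Return whether calling func(argument) leaves argument unchanged.

    Hypothetical helper for this sketch; it is not part of the starter code.
    """
    before = copy.deepcopy(argument)   # snapshot taken before the call
    func(argument)
    return argument == before          # compare the original to the snapshot


# sorted returns a new list, so its argument is left alone.
non_mutating_result = leaves_argument_unchanged(sorted, ['b', 'a'])

# list.sort reorders the list in place, so the check detects the change.
mutating_result = leaves_argument_unchanged(list.sort, ['b', 'a'])
```

The deep copy matters for nested values such as a dictionary of tuples or lists: a shallow copy would share the inner objects with the original and could miss a mutation.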

Hint: for get_overall_score, we need to call it with arguments that satisfy its precondition. The following are two hints for doing this:

Submission instructions

Please proofread and test your work carefully before your final submission! As we explain in Running and Testing Your Code Before Submission, it is essential that your submitted code not contain syntax errors. Python files that contain syntax errors will receive a grade of 0 on all automated testing components (though they may receive partial or full credit on any TA grading for assignments). You have lots of time to work on this assignment and check your work (and right-click -> “Run in Python Console”), so please make sure to do this regularly and fix syntax errors right away.

Remember, you can submit your files multiple times before the due date. So you can aim to submit your work early, and if you find an error or a place to improve before the due date, you can still make your changes and resubmit your work.

After you’ve submitted your work, please give yourself a well-deserved pat on the back and go take a rest or do something fun or look at some art or look at pet instagram accounts!