
CSC110 Fall 2024 Coding Project 2: Tabular Data and Sentiment Analysis

Note: Any FAQs or clarifications relevant to the assignment will be posted here. This post will be continually updated (with newer updates at the bottom of the page), so make sure to check on it regularly – click the “Watch” button in the top-right corner to receive notifications about this thread. If you have a question which is not already addressed on this page, create a new thread to post your question on our discussion board.

In parts 1 and 2 of this assignment, you’ll analyze real-world TTC subway delay data with functions—first using nested lists and comprehensions, and then using data classes and for loops.

In part 3, you will learn about sentiment analysis and write a small program that analyzes the sentiments found in text (we’ve included a bunch of song lyrics files for you to try the program out with).

Note: this assignment involves more programming than Assignment 1, and overall is a bit longer. Please start early!

Advice for Assignment 2

We strongly recommend taking a few minutes to perform the following steps, similar to the advice given for Assignment 1.

  1. Skim the assignment handout.
  2. Download the starter files.
  3. Schedule time to work on the assignment.

Overall, this assignment involves some deeper analysis and more complex programming tasks than Assignment 1. However, we’ve taken the same approach of breaking down the assignment into multiple parts (and sub-parts) to make it easier for you to process everything, and make a plan to complete the assignment in chunks.

Logistics

Getting Started

To obtain the starter files for this assignment:

  1. Download a2.zip.
  2. Extract the contents of this zip file into your csc110/assignments/ folder.
  3. This will create a new a2 folder for you, with all the starter files inside. This should look similar to what you had for Assignment 1.

General instructions

Like in the previous assignment, we have provided code at the bottom of each file for running doctest examples (and pytest in Part 3) and PythonTA on each file. The PythonTA code in the comments of each file is different and specific to that file. For example, the PythonTA code in Parts 1 and 2 allows use of the function open, whereas the PythonTA code for Part 3 does not. We are not grading doctests on this assignment, but we encourage you to add some as a way to understand each function we’ve asked you to complete. We are using PythonTA to grade your work, so please run it on every Python file you submit, using the code we’ve provided.

Part 1: TTC subway delays

A significant source of frustration for residents of Toronto is delays in public transit: added commute time takes a toll on just about anyone. Some articles and books claim that a short commute will improve your happiness; one article goes so far as to link the misery of additional commute time to a corresponding pay cut. In this exploration, you will work with data on subway delays provided by the Toronto Transit Commission (TTC), the organization that runs Toronto public transit.

0. The data set

You should see the file ttc-subway-delays.csv in your a2 folder, as it was included with your starter code. This file contains a record of all TTC delays in the time period from January 1, 2014 to October 31, 2019, courtesy of the City of Toronto. After completing this assignment, you could use your code to analyze the newer data available on the City of Toronto site, for an interesting side project!

The data is stored using the comma-separated values (csv) file format, the same format we saw in class. For example, in our sample data the first four lines look like this:

Date,Time,Day,Station,Code,Min Delay,Min Gap,Bound,Line,Vehicle
01/01/2014,00:21,Wednesday,VICTORIA PARK STATION,MUPR1,55,60,W,BD,5111
01/01/2014,02:06,Wednesday,HIGH PARK STATION,SUDP,3,7,W,BD,5001
01/01/2014,02:40,Wednesday,SHEPPARD STATION,MUNCA,0,0,,YU,0

and they represent the following tabular data:

| Date | Time | Day | Station | Code | Min Delay | Min Gap | Bound | Line | Vehicle |
|------|------|-----|---------|------|-----------|---------|-------|------|---------|
| 01/01/2014 | 0:21 | Wednesday | VICTORIA PARK STATION | MUPR1 | 55 | 60 | W | BD | 5111 |
| 01/01/2014 | 2:06 | Wednesday | HIGH PARK STATION | SUDP | 3 | 7 | W | BD | 5001 |
| 01/01/2014 | 2:40 | Wednesday | SHEPPARD STATION | MUNCA | 0 | 0 | | YU | 0 |

Here is a description and expected Python data types of the columns in this data set.

| Column name | Description | Python data type |
|-------------|-------------|------------------|
| Date | The date of the delay. | datetime.date |
| Time | The time of the delay. | datetime.time |
| Day | The day of the week on which the delay occurred. | str |
| Station | The name of the subway station where the delay occurred. | str |
| Code | The TTC delay code, which usually describes the cause of the delay. You can find a table showing the codes and descriptions in ttc-subway-delay-codes.csv, which was also included in the starter code. | str |
| Min Delay | The length of the subway delay (in minutes). | int |
| Min Gap | The length of time between subway trains (in minutes). | int |
| Bound | The direction in which the train was travelling. This is dependent on the line the train was on. | str |
| Line | The abbreviated name of the subway line where the delay occurred. | str |
| Vehicle | The id number of the train on which the delay occurred. | int |

1. Reading the file

Your first task is to read ttc-subway-delays.csv and load its data into Python, in the same way we did in lecture. Complete the read_csv_file function (along with its helper functions, which we describe below), which returns a tuple with two elements: the first represents the header, and the second represents the remaining rows of data.

Using tuples

While we’ve discussed tuples several times during lecture, we have not had much practice using tuples until now. On this assignment, many of the functions you write will use tuples. Tuples are similar to lists and can be indexed using [], but they are an immutable data type with no mutating methods. Unlike lists, tuples have the benefit of letting you specify the type of each of their elements, even for a heterogeneous collection, as you can see in the function headers in the starter code.

To write a tuple literal, use parentheses along with commas, for example (1, 'hi') for a tuple of type tuple[int, str]. Sometimes, the parentheses can be omitted. An example of this can be found in the starter code for part 3, and PyCharm will notify you when this is the case. A tuple with one element is written with an extra comma since parentheses on their own are used for precedence. For example, to write a value of type tuple[int] we write (1,).
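For instance, the tuple literals described above behave like this:

```python
# A tuple literal of type tuple[int, str]:
point = (1, 'hi')
print(point[0])      # indexing works like with lists: prints 1

# A one-element tuple of type tuple[int] needs the trailing comma:
single = (1,)
print(type(single))  # <class 'tuple'>
print(type((1)))     # <class 'int'> -- without the comma, the
                     # parentheses are just for precedence
```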

Use a csv.reader object that we saw in class to read rows of data from the file. Recall that this object turns every row into a list of strings. However, in order to do useful computations on this data, we’ll need to convert many of these entries into other Python data types, like int and datetime.date. Implement the helper function process_row—and its helper functions str_to_date and str_to_time—to process a single row of data to convert the entries into their appropriate data types (specified in the table above).
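As a rough sketch of the kind of conversion these helpers perform (the helper names str_to_date and str_to_time come from the starter code, but the exact date-field order is an assumption here, since the sample rows are ambiguous between day-first and month-first; check the data set to confirm):

```python
import datetime


def str_to_date(date_str: str) -> datetime.date:
    """Convert a string like '01/01/2014' to a datetime.date.

    ASSUMPTION: this sketch treats the format as month/day/year;
    verify against the actual data (e.g. a row past the 12th of a month).
    """
    parts = date_str.split('/')
    return datetime.date(int(parts[2]), int(parts[0]), int(parts[1]))


def str_to_time(time_str: str) -> datetime.time:
    """Convert a string like '00:21' to a datetime.time."""
    parts = time_str.split(':')
    return datetime.time(int(parts[0]), int(parts[1]))
```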

2. Operating on the data

Now that we have this csv data stored as a nested list in Python, we can do some analysis on it! Complete the functions below to answer some questions about this data.

Coding requirements:

For this question, we will start off with our “older” tools of comprehensions and lists of data. Other features (e.g., loops) are not allowed, and parts of your submissions that use them may receive a grade as low as zero for doing so. We will use different features in Part 2 when we revisit these functions.

  1. What was the longest subway delay? (longest_delay)
  2. On average, how long do the subway delays last? (average_delay)
  3. How many subway delays were there in a specific month, like July 2018? (num_delays_by_month)

Part 2: TTC subway delays revisited

Now we will revisit our code in Part 1 and use the tools we learned about later in the course: data classes and for loops. You may want to (and are allowed to) reuse your Part 1 code here.

1. Adding a data class

Your first task is to design and implement the new data class Delay, which represents a single row of the table. This is very similar to what we did for the marriage license data set in lecture.

2. Reading the file

Next, complete the read_csv_file function and its helpers, like the analogous functions in Part 1, but using the new Delay data class.

3. Operating on the data

Coding requirements:

For this question, we will practice using for loops instead of comprehensions. In addition to this requirement, do not use any built-in aggregation functions (like sum or len or max). Like before, parts of your submissions that use these features may receive a grade as low as zero for doing so.

All of your loops should follow the loop accumulator pattern from lecture:

<x>_so_far = <default_value>

for element in <collection>:
    <x>_so_far = ... <x>_so_far ... element ...  # Somehow combine loop variable and accumulator

return <x>_so_far
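As a concrete (non-assignment) instance of this pattern, here is a small example function — not one of the functions you need to write — that counts matching elements without any built-in aggregation functions:

```python
def count_rainy_days(days: list[str]) -> int:
    """Count how many elements of days equal 'rainy', using the
    loop accumulator pattern (no sum, len, or max)."""
    count_so_far = 0                       # <default_value>
    for day in days:                       # loop over the collection
        if day == 'rainy':
            count_so_far = count_so_far + 1  # combine accumulator and element
    return count_so_far
```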

Finally, complete the functions longest_delay, average_delay, and num_delays_by_month, which are equivalent to the functions you completed for Part 1, except they now take in a list[Delay] rather than a list[list] to represent the tabular data. Note that because we have the more specific type annotation list[Delay], we no longer need the preconditions in Part 1 saying that the inner lists have the right structure!

Part 3: Sentiment Analysis

For this part of the assignment, we will be taking part in some sentiment analysis (also known as “opinion mining” or “emotion AI”).

Sentiment analysis involves analyzing the feelings behind a piece of text. For example, it can involve looking at reviews for a product and trying to figure out whether each review is positive or negative. There are, of course, many limitations to this (an automated analysis can be bad at detecting things like sarcasm, for example!), but it’s a very important application of computational linguistics.

This sort of analysis is mainly used in analyzing social media texts (e.g. how positive is someone’s Twitter profile in comparison to another’s?), customer reviews (e.g. how positive or negative is a brand’s online reputation, based on what customers say about it?), and other marketing or web-related tasks.

For this problem set, we will write some programs that can tell us whether a piece of text is happy or sad (or neutral). Our programs will be very naive (i.e., they may not always give an accurate analysis), but they’ll give you a basic idea of how sentiment analysis works. :)

(If you are interested in this field of study, here is some good (completely optional) further reading on its uses and limitations.)

1. Let’s analyze some text

In a2_part3.py, you will complete functions that make use of a sentiment score dictionary (based on a simplified version of SentiWordNet 3.0, a collection of many, many words and their associated scores).

We have included a small dictionary SAMPLE_SENTIMENTS at the top of the file which includes a tiny subset of sentiment keywords from the above resource.

The structure of our sentiment dictionary is as follows: {word1: (positive score for word1, negative score for word1), ...}

For example, from SAMPLE_SENTIMENTS we can see the word ‘good’ has a positive score of 0.625 and a negative score of 0.0, the word ‘wrong’ has a positive score of 0.1215 and a negative score of 0.75, and so on.

Any word that appears as a key in the sentiment dictionary is considered a sentiment keyword within our program.
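To illustrate this structure (using the scores for 'good' and 'shine' quoted elsewhere in this handout; this is not the full SAMPLE_SENTIMENTS dictionary):

```python
# {word: (positive score, negative score), ...}
sentiments = {'good': (0.625, 0.0), 'shine': (0.375, 0.125)}

pos, neg = sentiments['good']    # pos == 0.625, neg == 0.0
print('good' in sentiments)      # True: 'good' is a sentiment keyword
print('table' in sentiments)     # False: 'table' is not
```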

In our program, we mainly deal with text as a list of words. To calculate the overall sentiment (which you will do in the function get_overall_score), consider the following steps:

Note that while the overall sentiment score can be either > 0 (positive) or < 0 (negative), the individual positive and negative sentiment scores (in the sentiment score dictionary) are always non-negative. So, for example, for the sentiment keyword ‘shine’: (0.375, 0.125), the word “shine” has a positivity score of 0.375 and a negativity score of 0.125 (NOT -0.125).

Complete the four functions provided in a2_part3.py based on the instructions provided within the starter code file as well as the explanations above.

2. Testing mutation (or, the absence of mutation)

For this question, first read Chapter 6.7 of the course notes, which we didn’t cover in class.

Now that we’ve discussed mutation, it would be wise to make sure that we are not mutating objects when we shouldn’t be. Three of the four functions you completed in question 1 take arguments with mutable data types, but should not be mutating any of these objects. (We will skip get_sentiment_info, as it is difficult to satisfy its precondition that the given filename refers to a valid file.)

Based on the reading you just completed, complete the property-based test for each of the two remaining functions (get_keywords and get_overall_score) in test_a2_part3.py to test that they do not mutate the objects that their arguments refer to. Note that you do not need to have completed question 1 in order to write these tests.

In order to write these property-based tests, you will need hypothesis to generate objects like dicts, tuples, and strs. We can do this using new strategies, which are imported at the top of the starter file. For example, to generate random objects of type tuple[str, dict[int, float]] for a parameter named x, we would use the decorator @given(..., x=tuples(text(), dictionaries(integers(), floats())), ...).

Hint: for get_overall_score, we need to call it with arguments that satisfy its precondition. The following are two hints for doing this:

  1. Since we need at least one word in the word list to be in the sentiment scores dictionary, the min_size=1 option for the lists strategy may be helpful. This is similar to the min_value=1 option that was used in the worksheet for lecture 4A.

  2. While hypothesis generates random objects, there is nothing stopping us from calling the function we are trying to test with a different object. Consider mutating the random sentiment scores dictionary that you get from hypothesis to satisfy the precondition of get_overall_score.
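As a sketch of the general shape of such a no-mutation property test (the function double_all here is a hypothetical stand-in, not one of the assignment functions):

```python
from hypothesis import given
from hypothesis.strategies import integers, lists


def double_all(numbers: list[int]) -> list[int]:
    """A hypothetical function under test: it should return a
    new list and NOT mutate its argument."""
    return [n * 2 for n in numbers]


@given(numbers=lists(integers()))
def test_double_all_does_not_mutate(numbers: list[int]) -> None:
    snapshot = list(numbers)    # copy the argument before the call
    double_all(numbers)
    assert numbers == snapshot  # the argument object is unchanged
```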

Submission instructions

Please proofread and test your work carefully before your final submission! As we explain in Running and Testing Your Code Before Submission, it is essential that your submitted code not contain syntax errors. Python files that contain syntax errors will receive a grade of 0 on all automated testing components (though they may still receive partial or full credit on any TA-graded components). You have lots of time to work on this assignment and check your work (e.g., right-click -> “Run in Python Console”), so please do this regularly and fix syntax errors right away.

  1. Login to MarkUs.

  2. Go to Assignment 2, then the “Submissions” tab.

  3. Submit the following files: honour_code.txt, a2_part1.py, a2_part2.py, a2_part3.py, and test_a2_part3.py. Please note that MarkUs is picky with filenames, and so your filenames must match these exactly, including using lowercase letters.

  4. Refresh the page, and then download each file to make sure you submitted the right version.

Remember, you can submit your files multiple times before the due date. So you can aim to submit your work early, and if you find an error or a place to improve before the due date, you can still make your changes and resubmit your work.

After you’ve submitted your work, please give yourself a well-deserved pat on the back and go take a rest or do something fun or look at some art or look at pet instagram accounts!

(Photo: Strawberry)