Note: Any FAQs or clarifications relevant to the assignment will be posted here. This post will be continually updated (with newer updates at the bottom of the page), so make sure to check on it regularly – click the “Watch” button in the top-right corner to receive notifications about this thread. If you have a question which is not already addressed on this page, create a new thread to post your question on our discussion board.
In parts 1 and 2 of this assignment, you’ll analyze real-world TTC subway delay data with functions—first using nested lists and comprehensions, and then using data classes and for loops.
In part 3, you will learn about sentiment analysis and write a small program that analyzes the sentiments found in text (we’ve included a bunch of song lyrics files for you to try the program out with).
Note: this assignment involves more programming than Assignment 1, and overall is a bit longer. Please start early!
We strongly recommend taking a few minutes to perform the following steps, similar to the advice given for Assignment 1.
Overall, this assignment involves some deeper analysis and more complex programming tasks than Assignment 1. However, we’ve taken the same approach of breaking down the assignment into multiple parts (and sub-parts) to make it easier for you to process everything, and make a plan to complete the assignment in chunks.
Please read the honour_code.txt file included in the starter files (more on obtaining them below), which gives an overview of common mistakes students make with regards to academic integrity.

To obtain the starter files for this assignment, download a2.zip and extract it into your .csc110/assignments/ folder. This will create an a2 folder for you, with all the starter files inside. This should look similar to what you had for Assignment 1.

Like in the previous assignment, we have provided code at the bottom
of each file for running doctest examples (and pytest in part 4) and
PythonTA on each file. The PythonTA code in the comments for each file
is different, and is specific for that file. For example, the
PythonTA code in Parts 1 and 2 allows use of the function
open
, whereas the PythonTA code for Part 3 does not allow
open
to be used. We are not grading doctests on this
assignment, but encourage you to add some as a way to understand each
function we’ve asked you to complete. We are using PythonTA to
grade your work, so please run that on every Python file you submit
using the code we’ve provided.
Delays in public transit are a significant source of frustration for Toronto residents. Added commute time takes a toll on just about anyone who commutes: some articles and books claim that a shorter commute will improve your happiness, and one article goes so far as to link the misery of additional commute time to a corresponding pay cut. In this exploration, you will work with data on subway delays provided by the Toronto Transit Commission (TTC), the organization that runs Toronto public transit.
You should see the file ttc-subway-delays.csv
in your
a2
folder, as it was included with your starter code. This
file contains a record of all TTC delays in the time period from January
1, 2014 to October 31, 2019, courtesy of the City of
Toronto. After completing this assignment, you could use your code
to analyze the newer data available on the City of Toronto
site, for an interesting side project!
The data is stored using the comma-separated values (csv) file format, the same format we saw in class. For example, in our sample data the first four lines look like this:
Date,Time,Day,Station,Code,Min Delay,Min Gap,Bound,Line,Vehicle
01/01/2014,00:21,Wednesday,VICTORIA PARK STATION,MUPR1,55,60,W,BD,5111
01/01/2014,02:06,Wednesday,HIGH PARK STATION,SUDP,3,7,W,BD,5001
01/01/2014,02:40,Wednesday,SHEPPARD STATION,MUNCA,0,0,,YU,0
and they represent the following tabular data:
Date | Time | Day | Station | Code | Min Delay | Min Gap | Bound | Line | Vehicle |
---|---|---|---|---|---|---|---|---|---|
01/01/2014 | 00:21 | Wednesday | VICTORIA PARK STATION | MUPR1 | 55 | 60 | W | BD | 5111 |
01/01/2014 | 02:06 | Wednesday | HIGH PARK STATION | SUDP | 3 | 7 | W | BD | 5001 |
01/01/2014 | 02:40 | Wednesday | SHEPPARD STATION | MUNCA | 0 | 0 | | YU | 0 |
Here is a description and expected Python data types of the columns in this data set.
Column name | Description | Python data type |
---|---|---|
Date | The date of the delay | datetime.date |
Time | The time of the delay | datetime.time |
Day | The day of the week on which the delay occurred. | str |
Station | The name of the subway station where the delay occurred. | str |
Code | The TTC delay code, which usually describes the cause of the delay. You can find a table showing the codes and descriptions in ttc-subway-delay-codes.csv, which was also included in the starter code. | str |
Min Delay | The length of the subway delay (in minutes). | int |
Min Gap | The length of time between subway trains (in minutes). | int |
Bound | The direction in which the train was travelling. This is dependent on the line the train was on. | str |
Line | The abbreviated name of the subway line where the delay occurred. | str |
Vehicle | The id number of the train on which the delay occurred. | int |
Your first task is to read the ttc-subway-delays.csv file and
load the data into Python in the same way we did in lecture. Complete
the read_csv_file
function (along with its helper
functions, which we describe below), which returns a tuple with two
elements, the first representing the header, and the second representing
the remaining rows of data.
While we’ve discussed tuples several times during lecture, we have
not had much practice using tuples until now. On this assignment, many
of the functions you write will use tuples. Tuples are similar to lists,
and can be indexed using []
, but are an immutable data
type, supporting no mutating methods. Unlike lists, tuples have the
benefit of being able to specify the types of each of its elements even
for a heterogeneous collection, as you can see in the function headers
in the starter code.
To write a tuple literal, use parentheses along with commas, for
example (1, 'hi')
for a tuple of type
tuple[int, str]
. Sometimes, the parentheses can be omitted.
An example of this can be found in the starter code for part 3, and
PyCharm will notify you when this is the case. A tuple with one element
is written with an extra comma since parentheses on their own are used
for precedence. For example, to write a value of type
tuple[int]
we write (1,)
.
Use a csv.reader
object that we saw in class to read
rows of data from the file. Recall that this object turns every row into
a list of strings. However, in order to do useful computations on this
data, we’ll need to convert many of these entries into other Python data
types, like int
and datetime.date
. Implement
the helper function process_row
—and its helper functions
str_to_date
and str_to_time
—to process a
single row of data to convert the entries into their appropriate data
types (specified in the table above).
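As a sketch of what these helpers might look like: the parsing details below are assumptions (in particular, whether the date format is month/day/year or day/month/year), so follow the docstrings in the starter code rather than this sketch.

```python
import datetime


def str_to_date(date_str: str) -> datetime.date:
    """Convert a string like '01/01/2014' into a datetime.date.

    This sketch assumes a month/day/year layout; check the starter
    code's docstring for the actual format.
    """
    month, day, year = date_str.split('/')
    return datetime.date(int(year), int(month), int(day))


def str_to_time(time_str: str) -> datetime.time:
    """Convert a string like '00:21' into a datetime.time."""
    hour, minute = time_str.split(':')
    return datetime.time(int(hour), int(minute))


def process_row(row: list[str]) -> list:
    """Convert the entries of one csv row into their appropriate types,
    following the column table above."""
    return [str_to_date(row[0]), str_to_time(row[1]), row[2], row[3],
            row[4], int(row[5]), int(row[6]), row[7], row[8], int(row[9])]
```

For example, calling process_row on the first data row of the csv file would produce a list whose Min Delay entry is the int 55 rather than the string '55'.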
Now that we have this csv data stored as a nested list in Python, we can do some analysis on it! Complete the functions below to answer some questions about this data.
Coding requirements:
For this question, we will start off with our “older” tools of comprehensions and lists of data. Other features (e.g., loops) are not allowed, and parts of your submissions that use them may receive a grade as low as zero for doing so. We will use different features in Part 2 when we revisit these functions.
- longest_delay
- average_delay
- num_delays_by_month

Now we will revisit our code in Part 1 and use the tools we learned about later in the course: data classes and for loops. You may want to (and are allowed to) reuse your Part 1 code here.
Your first task is to design and implement the new data class
Delay
, which represents a single row of the table. This is
very similar to what we did for the marriage license data set in
lecture.
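As a sketch of the data class (the field names and their order here are illustrative; match yours to the column table above and the type annotations in the starter code):

```python
import datetime
from dataclasses import dataclass


@dataclass
class Delay:
    """A single row of the TTC subway delay table."""
    date: datetime.date
    time: datetime.time
    day: str
    station: str
    code: str
    min_delay: int
    min_gap: int
    bound: str
    line: str
    vehicle: int


example = Delay(datetime.date(2014, 1, 1), datetime.time(0, 21),
                'Wednesday', 'VICTORIA PARK STATION', 'MUPR1',
                55, 60, 'W', 'BD', 5111)
print(example.station)  # VICTORIA PARK STATION
```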
Next, complete the read_csv_file
function and its
helpers, like the analogous functions in Part 1, but using the new
Delay
data class.
Coding requirements:
For this question, we will practice using for loops instead of
comprehensions. In addition to this requirement, do not use any built-in
aggregation functions (like sum
or len
or
max
). Like before, parts of your submissions that use these
features may receive a grade as low as zero for doing
so.
All of your loops should follow the loop accumulator pattern from lecture:
<x>_so_far = <default_value>
for element in <collection>:
<x>_so_far = ... <x>_so_far ... element ... # Somehow combine loop variable and accumulator
return <x>_so_far
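As one concrete instance of this pattern (a hypothetical helper, not one of the required functions), here is how you might total a list of delay times without calling the built-in sum:

```python
def total_delay_minutes(delay_times: list[int]) -> int:
    """Return the sum of delay_times, using the loop accumulator
    pattern instead of the built-in sum function."""
    total_so_far = 0
    for minutes in delay_times:
        # Combine the accumulator with the current loop variable
        total_so_far = total_so_far + minutes
    return total_so_far


print(total_delay_minutes([55, 3, 0]))  # 58
```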
Finally, complete the functions longest_delay
,
average_delay
, and num_delays_by_month
, which
are equivalent to the functions you completed for Part 1, except they
now take in a list[Delay]
rather than a
list[list]
to represent the tabular data. Note that because
we have the more specific type annotation list[Delay]
, we
no longer need the preconditions in Part 1 saying that the inner lists
have the right structure!
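For instance, a for-loop version of finding the longest delay might look like the sketch below. The minimal Delay stand-in and the non-empty-list precondition are assumptions for illustration; your real data class has all ten fields, and the starter code's docstrings are authoritative.

```python
from dataclasses import dataclass


@dataclass
class Delay:
    """Minimal stand-in for the Part 2 Delay data class, with only
    the fields this sketch needs."""
    station: str
    min_delay: int


def longest_delay(delays: list[Delay]) -> int:
    """Return the largest min_delay in delays, without using max().

    This sketch assumes delays is non-empty.
    """
    longest_so_far = delays[0].min_delay
    for delay in delays:
        if delay.min_delay > longest_so_far:
            longest_so_far = delay.min_delay
    return longest_so_far


print(longest_delay([Delay('VICTORIA PARK STATION', 55),
                     Delay('HIGH PARK STATION', 3)]))  # 55
```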
For this part of the assignment, we will perform some sentiment analysis (also known as “opinion mining” or “emotion AI”).
Sentiment analysis involves analyzing the feelings behind a piece of text. For example, it can involve looking at reviews for a product and trying to figure out whether each review is positive or negative. There are, of course, many limitations to this (an automated analysis can be bad at detecting things like sarcasm, for example!), but it’s a very important application of computational linguistics.
This sort of analysis is mainly used in analyzing social media texts (e.g. how positive is someone’s Twitter profile in comparison to another’s?), customer reviews (e.g. how positive or negative is a brand’s online reputation, based on what customers say about it?), and other marketing or web-related tasks.
For this problem set, we will write some programs that can tell us whether a piece of text is happy or sad (or neutral). Our programs will be very naive (i.e. they may not always give an accurate analysis), but they’ll give you a basic idea of how such sentiment analysis works. :)
(If you are interested in this field of study, here is a good further (completely optional) reading on some of its usages and limitations.)
In a2_part3.py, you will be completing functions that make use of a sentiment score dictionary (based on a simplified version of SentiWordNet 3.0, a collection of many, many words and their associated scores).
We have included a small dictionary SAMPLE_SENTIMENTS
at
the top of the file which includes a tiny subset of sentiment keywords
from the above resource.
The structure of our sentiment dictionary is as follows:
{word1: (positive score for word1, negative score for word1), ...}
For example, from SAMPLE_SENTIMENTS
we can see the word
‘good’ has a positive score of 0.625 and a negative score of 0.0, the
word ‘wrong’ has a positive score of 0.1215 and a negative score of
0.75, and so on.
Any word that appears as a key in the sentiment dictionary is considered a sentiment keyword within our program.
In our program, we mainly deal with text as a list of words. To
calculate the overall sentiment (which you will do in the function
get_overall_score
), consider the following steps:
round()
function to make it more readable.Note that, while the overall sentiment score can either be > 0 (positive) or < 0 (negative), individual positive and negative sentiment scores (in the sentiment score dictionary) will always be positive. So, for example, the sentiment keyword: ‘shine’: (0.375, 0.125), the word “shine” has positivity score of 0.375, and a negativity score of 0.125 (NOT -0.125).
Complete the four functions provided in a2_part3.py
based on the instructions provided within the starter code file as well
as the explanations above.
For this question, first read Chapter 6.7 of the course notes, which we didn’t cover in class.
Now that we’ve discussed mutation, it would be wise to make sure that
we are not mutating objects when we shouldn’t be. Three of the four
functions you completed in question 1 take arguments with mutable data
types, but should not be mutating any of these objects. (We will skip
get_sentiment_info
, as it is difficult to satisfy its
precondition that the given filename refers to a valid file.)
Based on the reading you just completed, complete the property-based
test for each of the two remaining functions (get_keywords
and get_overall_score
) in test_a2_part3.py
to
test that they do not mutate the objects that their arguments refer to.
Note that you do not need to have completed question 1 in order to write
these tests.
In order to write these property-based tests, you will need
hypothesis
to generate objects like dict
s,
tuple
s, and str
s. We can do this using new
strategies, which are imported at the top of the starter file.
For example, to generate random objects of type
tuple[str, dict[int, float]]
for a parameter named
x
, we would use the decorator
@given(..., x=tuples(text(), dictionaries(integers(), floats())), ...)
.
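The overall shape of such a non-mutation test might look like the sketch below. The function under test here is a made-up stub standing in for get_keywords (whose real signature is in the starter code); the pattern of snapshotting the argument before the call and comparing afterwards is the part that carries over.

```python
from hypothesis import given
from hypothesis.strategies import lists, text


def get_keywords_stub(words: list) -> list:
    """Hypothetical stand-in for get_keywords: returns a new list
    and leaves its argument untouched."""
    return [w for w in words if len(w) > 3]


@given(words=lists(text()))
def test_does_not_mutate(words: list) -> None:
    """Check that the function does not mutate its list argument."""
    words_copy = list(words)        # snapshot before the call
    get_keywords_stub(words)
    assert words == words_copy      # argument unchanged afterwards


test_does_not_mutate()  # hypothesis runs this on many random inputs
```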
Hint: for get_overall_score
, we need to call it
with arguments that satisfy its precondition. The following are two
hints for doing this:
Since we need at least one word in the word list to be in the
sentiment scores dictionary, the min_size=1
option for the
lists
strategy may be helpful. This is similar to the
min_value=1
option that was used in the worksheet for
lecture 4A.
While hypothesis
generates random objects, there is
nothing stopping us from calling the function we are trying to test with
a different object. Consider mutating the random sentiment scores
dictionary that you get from hypothesis
to satisfy the
precondition of get_overall_score
.
Please proofread and test your work carefully before your final submission! As we explain in Running and Testing Your Code Before Submission, it is essential that your submitted code not contain syntax errors. Python files that contain syntax errors will receive a grade of 0 on all automated testing components (though they may receive partial or full credit on any TA-graded components). You have lots of time to work on this assignment and check your work (e.g., right-click -> “Run in Python Console”), so please do this regularly and fix syntax errors right away.
1. Log in to MarkUs.
2. Go to Assignment 2, then the “Submissions” tab.
3. Submit the following files: honour_code.txt, a2_part1.py, a2_part2.py, a2_part3.py, and test_a2_part3.py. Please note that MarkUs is picky with filenames, so your filenames must match these exactly, including using lowercase letters.
4. Refresh the page, and then download each file to make sure you submitted the right version.
Remember, you can submit your files multiple times before the due date. So you can aim to submit your work early, and if you find an error or a place to improve before the due date, you can still make your changes and resubmit your work.
After you’ve submitted your work, please give yourself a well-deserved pat on the back and go take a rest or do something fun or look at some art or look at pet instagram accounts!