CSC411/2515 Project 1 bonus: k-Nearest Neighbours
About the bonus
Bonus projects give you an opportunity to explore a topic in machine learning in more depth. They are optional. The quality of your write-up will be considered when grading: you are exploring a complex topic, so if the writing is unclear, we won’t be able to figure out what exactly you are talking about.
For a really good write-up of a problem that’s not unlike the problem in P1b, see Swirszcz et al., Local Minima in Training of Neural Networks. Obviously, we don’t expect anything nearly as detailed as that paper. Two pages is likely plenty for a good bonus project submission, and less may suffice. We do expect an explanation of what you did and why.
Project 1 bonus
In class, we saw that the performance of k-NN on the training set is 100% for \(k = 1\), with the performance stabilizing to the proportion of the plurality label in the training set as \(k\) becomes very large. The behaviour of the performance on the validation set is different: there is an optimal \(k\) between \(1\) and the size of the training set.
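To see the two extremes concretely, here is a minimal sketch, assuming scikit-learn is available (allowed for the bonus; see "Setting up experiments" below). The synthetic dataset and variable names are purely illustrative:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    np.random.seed(0)
    X = np.random.randn(100, 2)                   # 100 points in 2D
    y = (np.random.rand(100) < 0.7).astype(int)   # roughly 70% of labels are 1

    # k = 1: every training point is its own nearest neighbour (no duplicate
    # points here), so training accuracy is 100%.
    knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(knn1.score(X, y))    # 1.0

    # k = n: every prediction is the plurality label, so training accuracy
    # equals the proportion of the plurality label in the training set.
    knn_n = KNeighborsClassifier(n_neighbors=100).fit(X, y)
    print(knn_n.score(X, y))   # approximately 0.7 for this dataset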
For this part of the project, you will explore the question of whether, and under what circumstances, the performance on the training set can improve when \(k\) becomes larger. Here are some possible ways to answer the question:
Generate random datasets and set up experiments that convincingly demonstrate a scenario where the performance on the training set goes up as \(k\) goes up. Explain how you came up with the scenario and why it makes sense. (A sketch of an experiment harness appears after this list.)
Attempt to describe all the circumstances in which the performance on the training set will always go up as \(k\) increases for some range of \(k\). Make a mathematical argument that your description of the circumstances is correct.
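For the experimental route, a possible harness is sketched below, assuming numpy and scikit-learn. The dataset construction is a placeholder, not a solution: designing a dataset where training accuracy rises with \(k\) is the actual exercise.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    np.random.seed(411)   # fixed seed for reproducibility (see below)

    n = 50
    X = np.random.randn(n, 2)             # placeholder dataset
    y = np.random.randint(0, 2, size=n)   # placeholder labels

    # Sweep k over the full range and record training-set accuracy.
    train_acc = []
    for k in range(1, n + 1):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        train_acc.append(knn.score(X, y))

    # Report every k at which training accuracy increased relative to k - 1.
    for k in range(2, n + 1):
        if train_acc[k - 1] > train_acc[k - 2]:
            print(k, train_acc[k - 2], train_acc[k - 1])

Once the harness reliably flags increases, the interesting work is replacing the placeholder dataset with one you construct deliberately, and explaining why it produces the effect.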
Setting up experiments
Unlike with the main part of Project 1, you are allowed to use scikit-learn or other libraries. Of course, any algorithm you use should be completely described in your report.
What to submit
The project should be implemented using Python 2 and should be runnable on the CS Teaching Labs computers. If you choose to deviate from that, state your system configuration clearly in your report so that we can reproduce your work if needed.
Your report should be in PDF format. You should use LaTeX to generate the report, and submit the .tex file as well. A sample template is on the course website. You will submit at least three files: knn.py, knn.tex, and knn.pdf. You may submit additional Python files if necessary.
Reproducibility counts! We should be able to obtain all the graphs and figures in your report by running your code. The only exception is that you may pre-download the images; what you downloaded and how, including the code you used to download the images, should be included in your submission. Submissions that are not reproducible will not receive full marks. If your graphs or reported numbers cannot be reproduced by running the code, you may be docked up to 20%. (Of course, if the code is simply incomplete, you may lose even more.) Suggestion: if you are using randomness anywhere, use numpy.random.seed().
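For example, one minimal pattern, assuming all of your randomness goes through numpy:

    import numpy as np
    np.random.seed(0)   # any fixed integer works; report the seed you used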
You must use LaTeX to generate the report. LaTeX is the tool used to generate virtually all technical reports and research papers in machine learning, and students report that after they get used to writing reports in LaTeX, they start using LaTeX for all their course reports. In addition, using LaTeX facilitates the production of reproducible results.
Using our code
You are free to use any of the code available from the CSC411/CSC2515 course website.
Readability
Readability counts! If your code isn’t readable or your report doesn’t make sense, they are not very useful, and the TA can’t mark what they can’t follow. You will lose marks for unreadable code and unclear reports.
Academic integrity
It is perfectly fine to discuss general ideas with other people, provided you acknowledge in your report any ideas that are not your own. However, you must not look at other people’s code or show your code to other people, and you must not look at other people’s reports and derivations or show your report and derivations to other people. All of those things are academic offences.