The goal of this assignment is to demonstrate your mastery of data structures by: 1) creating a new data
structure; and 2) using that data structure as part of a larger software system. This assignment has one
implementation requirement and one evaluation requirement, both detailed below.
Background
Implement a combination spell checker and word prediction utility in American English. A spell checker is
a software feature often built into word processors like Google Docs, in browser input boxes, etc. The
spell checker detects misspelled words and can suggest alternatives. Most spell-checking
software is built on extensive knowledge of how misspellings occur — for example, the word “bizarre” is
frequently misspelled with one “r” (“bizare”) — and is often guided by word frequency counts. Similarly,
word prediction anticipates the word being typed based on the first letters in the word.
There are word prediction utilities, such as predictive text, that anticipate the next word based
on the previous words. Predictive text is not in scope for this assignment.
The utilities will be combined and implemented using a trie, a tree-based data structure not discussed in
detail during class time. Once built, you will evaluate its performance on a “frequently misspelled”
database. Because the spell checker / word prediction required for this implementation is not built on a
state-of-the-art model, your implementation will make mistakes and produce unusual suggestions. You will
suggest improvements to the system in order to improve the spell checker’s performance.
Part 1 – Trie
Figure 1: Graphical representation of a trie, from https://en.wikipedia.org/wiki/Trie
Use a trie (pronounced either like “tree” or “try”) to implement a spell checker. Also known as a prefix tree,
a trie is a tree-based data structure. Figure 1 shows one example trie, appropriate for this implementation.
Each node in a trie may represent a type (word) or the partial spelling of a type. A leaf node in the trie may
represent the full spelling of a type, along with its frequency. In Figure 1 for example, leaf nodes (“a”, “to”,
“tea”, “ted”, “ten” and “inn”) are valid types in American English. The parent of a leaf node may represent
one of two items, either:
● The partial spelling of one or more types (e.g. “te” in Figure 1)
● The full spelling of a type (e.g. “i” or “in” in Figure 1)
Figure 1 represents one possible trie for spelling in which the entire prefix is stored in each node. The
path from root to leaf for a type like “inn” contains nodes () —> (i) —> (in) —> (inn). There are
implementations of a trie in which each node contains only the most recent letter of the prefix. In this
implementation, the path from root to leaf for a type like “inn” would contain nodes () —> (i) —> (n) —>
(n). Either implementation is acceptable for this assignment.
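For concreteness, the letter-per-node variant can be sketched as follows. The class and method names (Trie, insert, contains, hasPrefix) are illustrative choices, not names required by the assignment.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of a letter-per-node trie; names are illustrative only.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        long frequency = 0;     // meaningful only when this node ends a complete type
        boolean isWord = false;
    }

    private final Node root = new Node();

    // Walk (or create) one node per character, marking the final node as a word.
    void insert(String word, long frequency) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.isWord = true;
        current.frequency = frequency;
    }

    // True only if the exact word was inserted (not just a prefix of one).
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    // True if any inserted word starts with the given prefix.
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follow the character path for a prefix; null if the path does not exist.
    private Node find(String prefix) {
        Node current = root;
        for (char c : prefix.toCharArray()) {
            current = current.children.get(c);
            if (current == null) return null;
        }
        return current;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        t.insert("in", 5);
        t.insert("inn", 2);
        System.out.println(t.contains("inn"));  // true
        System.out.println(t.contains("i"));    // false: "i" is only a partial spelling here
        System.out.println(t.hasPrefix("i"));   // true
    }
}
```

Note how “i” illustrates the distinction drawn above: its node exists as a partial spelling, but it is not marked as a full type unless it is inserted itself.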
Part 2 – Data
Fill the trie with data from the file unigram_freq.csv. This file will be supplied to your implementation as the
first argument. This file is from Rachael Tatman’s English Word Frequency. It contains the 333,333 most
frequently used types from Google’s Trillion Word Corpus, along with the frequencies of those types, in
CSV format. Each type, including proper names like “Michelle”, is converted to lowercase, and there are
no repeated entries in the file.
The first 5 lines of this file are as follows:
word,count
the,23135851162
of,13151942776
and,12997637966
to,12136980858
The first line of this file describes each column. (This is common in CSV data files.) This line may be
ignored. All subsequent lines contain a type, a comma and an integer representing the frequency of the
type in the corpus (data set). The second line shows the most frequent type in the corpus is “the”, with
more than 23 billion tokens (i.e. used more than 23 billion times in the corpus).
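Parsing these lines can be sketched as below, assuming the lines have already been read from the file. The UnigramParser class, the Entry record and the parse method are all hypothetical names; the real program would read the lines from the path given in the first argument.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of parsing unigram_freq.csv content; names are illustrative.
class UnigramParser {
    // One (type, frequency) pair from the CSV.
    record Entry(String word, long count) {}

    // Skip the "word,count" header line, then split each data line on the comma.
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {    // i = 1 skips the header
            String[] parts = lines.get(i).split(",");
            entries.add(new Entry(parts[0], Long.parseLong(parts[1])));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "word,count",
            "the,23135851162",
            "of,13151942776");
        for (Entry e : parse(sample)) {
            System.out.println(e.word() + " -> " + e.count());
        }
    }
}
```

Frequencies exceed the range of int (23 billion tokens for “the”), so a long is used for the counts.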
Part 3 – Mechanism for Check Spelling & Word Prediction
The overall class design is up to you, but your implementation must contain at least one Java class
named Spelling, which must expose at least one public function, suggest(…), which itself must
take two parameters, token and count, and must return a List of Lists of String instances. In other
words, the function in Spelling.java must have the following header:
public List<List<String>> suggest(String token, int count)
Assuming the parameter token contains n characters, then for each prefix length i (1 .. n), the
suggest(…) function produces a list of count suggested types. The suggested types should be the most
frequent types that share the input’s prefix, up to and including the ith character. Where no type with that
prefix can be found, the implementation must assume the parameter token is incorrectly spelled, and
the most frequent types sharing the longest matching prefix up to the point of misspelling should be used.
For example, if the parameter token has the value “onomatopoeia” and parameter count has the value
5, the return List<List<String>> should be the following:
{ {“of”, “on”, “or”, “our”, “one”},
{“on”, “one”, “only”, “online”, “once”},
{“ona”, “onan”, “onalaska”, “onassis”, “onanie”},
{“onomatopoeia”, “onoma”, “onoml”, “onomichi”, “onomastics”},
{“onomatopoeia”, “onoma”, “onoml”, “onomichi”, “onomastics”},
…
{“onomatopoeia”, “onoma”, “onoml”, “onomichi”, “onomastics”}
}
The first List<String>, which contains “of”, “on”, etc., represents the most frequent types starting with the
letter “o”. The second List<String>, which contains “on”, “one”, etc., represents the most frequent types
starting with the prefix “on”. In this example, the fifth and subsequent List<String> instances contain types like
“onomichi” which do not have the prefix “onoma.” This occurs because there are fewer than count (5)
types in the file unigram_freq.csv with the “onoma” prefix. In this case, the most frequent types with the
longest prefix are included.
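The back-off rule just described (“the most frequent types with the longest prefix are included”) can be sketched as below. Here a flat frequency map stands in for the trie, and topByPrefix is a hypothetical helper; a real implementation would walk trie nodes rather than scan every entry.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A sketch of one per-prefix lookup with back-off; names are illustrative.
class PrefixLookup {
    // Return up to `count` types sharing `prefix`, most frequent first.
    // When nothing matches, shorten the prefix one character at a time.
    static List<String> topByPrefix(Map<String, Long> freqs, String prefix, int count) {
        String p = prefix;
        while (!p.isEmpty()) {
            final String q = p;
            List<String> hits = freqs.entrySet().stream()
                .filter(e -> e.getKey().startsWith(q))
                .sorted(Comparator.comparingLong(Map.Entry<String, Long>::getValue).reversed())
                .limit(count)
                .map(Map.Entry::getKey)
                .toList();
            if (!hits.isEmpty()) return hits;
            p = p.substring(0, p.length() - 1);   // back off one character
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        Map<String, Long> freqs = new LinkedHashMap<>();
        freqs.put("of", 900L);
        freqs.put("on", 500L);
        freqs.put("one", 400L);
        freqs.put("only", 300L);
        // "onz" matches nothing, so the lookup backs off to "on".
        System.out.println(topByPrefix(freqs, "onz", 2));  // [on, one]
    }
}
```

Calling this helper once per prefix length of token would produce the outer List in the example above.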
In addition, your implementation must contain a “main” function which accepts two command-line
arguments, which may be used when calling the suggest(…) function:
● The name and location of the unigram_freq.csv file
● The count parameter
Assuming the class containing your “main” function is named A2, your implementation must be called as
follows on a MacOS / Unix system:
java A2 ../data/unigram_freq.csv 5
(The path would be specified differently on a Windows system.)
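A minimal sketch of this entry point might look like the following. The parseArgs helper and the commented-out Spelling calls are illustrative assumptions, since the internal design is up to you; only the class name A2 and the two-argument invocation come from the assignment.

```java
// Sketch of the required command line: args[0] is the CSV path, args[1] is count.
class A2 {
    // Validate the two required arguments; returns count, or -1 when invalid.
    static int parseArgs(String[] args) {
        if (args.length != 2) return -1;
        try {
            return Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int count = parseArgs(args);
        if (count < 0) {
            System.err.println("usage: java A2 <unigram_freq.csv> <count>");
            return;
        }
        String csvPath = args[0];
        System.out.println("loading " + csvPath + " with count " + count);
        // Spelling spelling = new Spelling(csvPath);               // hypothetical constructor
        // System.out.println(spelling.suggest("onomatopoeia", count));
    }
}
```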
Part 4 – Improvements (Thought Exercise)
Test your implementation with all the data from the file misspelling.csv, which contains some of the most
frequently-misspelled types in English in mixed case, according to the Oxford English Corpus. Observe
the number of times the correct spelling appears in the List<List<String>> returned from the
suggest(…) function, varying the value of the count variable between 3 and 7.[1]
What, if anything, could be changed in the implementation to get the correct spelling of the input tokens?
[1] Why numbers between 3 and 7? The lower limit, 3, is what current phone and tablet interfaces have
settled on. The upper limit is 7 items because psychologists have determined that we may have a
maximum capacity for processing information, which is the same reason phone numbers started with at
most 7 digits. Of course, no one really memorizes phone numbers any more.
Submission
No starter code is provided. Check your Java code — including any main function, interfaces and class
implementations — into a GitHub repository. Do not check in the data files (unigram_freq.csv or
misspelling.csv), but you may refer to them in comments or in the auxiliary files of your repository. Submit
the link to this GitHub repository on Canvas.
Grading
Your grade for this assignment will be determined as follows:
● 60% = Implementation: your class implementations must run successfully with valid input arguments.
The implementation must produce the expected results. Any deviation from the expected results
results in 0 credit for implementation. Each portion of the implementation (Trie, Data, Mechanism)
is equally weighted.
● 15% = Improvements (Thought Exercise): you must demonstrate that you have executed one
or more tests of your implementation, examined the results and drawn larger conclusions
about the behaviour of your implementation, including how it may be improved.
● 10% = Decomposition: in the eyes of the grader, your implementation must demonstrate a
reasonable object-oriented decomposition — i.e. encapsulation, polymorphism and inheritance.
● 5% = Efficiency: in the eyes of the grader, your implementation must be maximally efficient with
respect to running time and required space.
● 5% = Style: in the eyes of the grader, your implementation must be well-commented and use
intelligently-named variables and functions.
● 5% = Documentation: all required documents, “stopwatch” charts, running times and descriptions
must be clear and unambiguous, and these must match the true running time and true space of
your implementation.