The goal of this assignment is to demonstrate your mastery of data structures by: 1) creating a new data

structure; and 2) using that data structure as part of a larger software system. This assignment has one

implementation requirement and one evaluation requirement, both detailed below.


Implement a combination spell checker and word prediction utility in American English. A spell checker is

a software feature often built into word processors like Google Docs, browser input boxes, etc. The

spell checker detects misspelled words and has the ability to suggest alternatives. Most spell checking

software is built on a lot of knowledge of how misspellings occur — for example, the word “bizarre” is

frequently misspelled with one “r” (“bizare”) — and is often guided by word frequency counts. Similarly,

word prediction anticipates the word being typed based on the first letters in the word.

There are versions of word prediction utilities, like predictive text, which anticipate the next word based

on the previous words. Predictive text is not in scope for this assignment.

The utilities will be combined and implemented using a trie, a tree-based data structure not discussed in

detail during class time. Once built, you will evaluate its performance on a “frequently misspelled”

database. Because the spell checker / word prediction required for this implementation is not built on a

state-of-the-art model, our implementation will make mistakes and produce unusual suggestions. You will

propose changes to the system in order to improve the spell checker’s performance.

Part 1 – Trie

Figure 1: Graphical representation of a trie, from

Use a trie (pronounced either like “tree” or “try”) to implement a spell checker. Also known as a prefix tree,

a trie is a tree-based data structure. Figure 1 shows one example trie, appropriate for this implementation.

Each node in a trie may represent a type (word) or the partial spelling of a type. A leaf node in the trie may

represent the full spelling of a type, along with its frequency. In Figure 1 for example, leaf nodes (“a”, “to”,

“tea”, “ted”, “ten” and “inn”) are valid types in American English. The parent of a leaf node may represent

one of two items, either:

● The partial spelling of one or more types (e.g., “te” in Figure 1)

● The full spelling of a type (e.g., “i” or “in” in Figure 1)

Figure 1 represents one possible trie for spelling in which the entire prefix is stored in each node. The

path from root to leaf for a type like “inn” contains nodes () —> (i) —> (in) —> (inn). There are

implementations of a trie in which each node contains only the most recent letter of the prefix. In this

implementation, the path from root to leaf for a type like “inn” would contain nodes () —> (i) —> (n) —>

(n). Either implementation is acceptable for this assignment.
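A minimal node for the letter-per-node variant might look like the following sketch. The names `TrieNode` and `insert`, and the convention of using a frequency of -1 to mark a partial spelling, are illustrative assumptions, not requirements of the assignment.

```java
import java.util.HashMap;
import java.util.Map;

// One possible node layout for the "one letter per node" trie variant.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    long frequency = -1;                  // -1 marks a partial spelling, not a type

    boolean isWord() { return frequency >= 0; }

    // Extend the path for one type, one letter per node, and record its
    // corpus frequency at the final node.
    void insert(String type, long frequency) {
        TrieNode node = this;
        for (char c : type.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.frequency = frequency;
    }
}
```

With this layout, inserting “in” and “inn” shares the nodes (i) and (n), and the intermediate (n) node simply carries a frequency once “in” is inserted.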

Part 2 – Data

Fill the trie with data from the file unigram_freq.csv. This file will be supplied to your implementation as the

first argument. This file is from Rachael Tatman’s English Word Frequency. It contains the 333,333 most

frequently used types from Google’s Trillion Word Corpus, along with the frequencies of those types, in

CSV format. Each type, including proper names like “Michelle”, is converted to lowercase, and there are

no repeated entries in the file.

The first 5 lines of this file are as follows:

word,count

the,23135851162

of,13151942776

and,12997637966

to,12136980098
The first line of this file describes each column. (This is common in CSV data files.) This line may be

ignored. All subsequent lines contain a type, a comma and an integer representing the frequency of the

type in the corpus (data set). The second line shows the most frequent type in the corpus is “the”, with

more than 23 billion tokens (i.e. used more than 23 billion times in the corpus).
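Parsing this file might be sketched as below. The class name UnigramLoader is hypothetical; the sketch only assumes the type,count layout described above, skips the header row, and hands each (type, frequency) pair to a caller-supplied consumer (for example, a trie's insert method).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.BiConsumer;

// Hypothetical loader for unigram_freq.csv: one "type,count" pair per line,
// with a header row that is skipped.
class UnigramLoader {
    static void load(Reader source, BiConsumer<String, Long> sink) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line = in.readLine();              // header line; ignored
            while ((line = in.readLine()) != null) {
                int comma = line.lastIndexOf(',');
                String type = line.substring(0, comma);
                long count = Long.parseLong(line.substring(comma + 1).trim());
                sink.accept(type, count);
            }
        }
    }
}
```

Taking a Reader rather than a file name keeps the parsing logic testable; main would wrap the CSV path from its first argument in a FileReader.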

Part 3 – Mechanism for Check Spelling & Word Prediction

The overall class design is up to you, but your implementation must contain at least one Java class

named Spelling, which must expose at least one public function, suggest(…), which itself must

take two parameters: token and count, and must return a List of List<String> instances. In other words, the

function must have the following header:

public List<List<String>> suggest(String token, int count)

Assuming the parameter token contains n characters, the suggest(…) function returns one List of

count suggested types for each prefix length i (1 .. n). The suggested types should be the most frequent

types which share a prefix with the input, up to and including the ith character. Where no such prefix can

be found, the implementation must assume the parameter token is incorrectly spelled, and the most

frequent types matching the longest prefix up to the point of the misspelling should be used.

For example, if the parameter token has the value “onomatopoeia” and parameter count has the value

5, the returned List<List<String>> should be the following:

{ {“of”, “on”, “or”, “our”, “one”},

{“on”, “one”, “only”, “online”, “once”},

{“ona”, “onan”, “onalaska”, “onassis”, “onanie”},

{“onomatopoeia”, “onoma”, “onoml”, “onomichi”, “onomastics”},

{“onomatopoeia”, “onoma”, “onoml”, “onomichi”, “onomastics”},

{“onomatopoeia”, “onoma”, “onoml”, “onomichi”, “onomastics”} }


The first List<String>, which contains “of”, “on”, etc., represents the most frequent types starting with the

letter “o”. The second List<String>, which contains “on”, “one”, etc., represents the most frequent types

starting with the prefix “on”. In this example, the fifth and subsequent List<String> instances contain types like

“onomichi” which do not have the prefix “onoma.” This occurs because there are fewer than count (5)

types in the file unigram_freq.csv with the “onoma” prefix. In this case, the most frequent types with the

longest prefix are included.
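One sketch of this mechanism over a letter-per-node trie is below. Only the Spelling class and the suggest(…) header are mandated by the assignment; insert, Node, and topTypes are illustrative names. For brevity this sketch collects and sorts an entire subtree per prefix (a real submission should be more efficient), and it does not backfill with longer-prefix alternatives when a matched prefix has fewer than count types, which the behaviour described above requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Spelling {
    private static class Node {
        final Map<Character, Node> children = new TreeMap<>();
        long freq = -1;                       // -1 means "not a complete type"
    }

    private final Node root = new Node();

    // Insert one type with its corpus frequency (letter-per-node layout).
    void insert(String type, long freq) {
        Node n = root;
        for (char c : type.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.freq = freq;
    }

    public List<List<String>> suggest(String token, int count) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 1; i <= token.length(); i++) {
            // Follow the first i characters, stopping early at a misspelling
            // so that the longest matching prefix is used instead.
            Node n = root;
            StringBuilder prefix = new StringBuilder();
            for (int j = 0; j < i; j++) {
                Node next = n.children.get(token.charAt(j));
                if (next == null) break;
                n = next;
                prefix.append(token.charAt(j));
            }
            result.add(topTypes(n, prefix.toString(), count));
        }
        return result;
    }

    // The count most frequent complete types in the subtree rooted at n.
    private List<String> topTypes(Node n, String prefix, int count) {
        List<Map.Entry<String, Long>> types = new ArrayList<>();
        collect(n, new StringBuilder(prefix), types);
        types.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> out = new ArrayList<>();
        for (int k = 0; k < Math.min(count, types.size()); k++) {
            out.add(types.get(k).getKey());
        }
        return out;
    }

    // Depth-first walk accumulating every complete type under the node.
    private void collect(Node n, StringBuilder sb, List<Map.Entry<String, Long>> acc) {
        if (n.freq >= 0) acc.add(Map.entry(sb.toString(), n.freq));
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            sb.append(e.getKey());
            collect(e.getValue(), sb, acc);
            sb.deleteCharAt(sb.length() - 1);
        }
    }
}
```

Using the small Figure 1 vocabulary, suggest("te", 2) would yield one list per prefix length: the two most frequent types under “t”, then under “te”.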

In addition, your implementation must contain a “main” function which accepts two command-line

arguments, the second of which may be passed to the suggest(…) function:

● The name and location of the unigram_freq.csv file

● The count parameter

Assuming the class containing your “main” function is named A2, your implementation must be called as

follows on a MacOS / Unix system:

java A2 ../data/unigram_freq.csv 5

(The path would be specified differently on a Windows system.)

Part 4 – Improvements (Thought Exercise)

Test your implementation with all the data from the file misspelling.csv, which contains some of the most

frequently-misspelled types in English in mixed case, according to the Oxford English Corpus. Observe

the number of times the correct spelling appears in the List<List<String>> returned from the

suggest(…) function, varying the value of the count parameter between 3 and 7.¹

What, if anything, could be changed in the implementation to get the correct spelling of the input tokens?

¹ Why numbers between 3 and 7? The lower limit, 3, is what current phone and tablet interfaces have

settled on. The upper limit is 7 items because psychologists have determined that we may have a

maximum capacity for processing information, the same reason phone numbers started with at most 7

digits. Of course, no one really uses phone numbers any more.
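A small helper for this evaluation might count how many of the returned per-prefix lists contain the correct spelling. The class and method names here (Evaluation, countHits) are illustrative only; the lowercasing reflects that the trie stores lowercase types (Part 2) while misspelling.csv is mixed case.

```java
import java.util.List;

// Hypothetical evaluation helper: how many of the suggestion lists returned
// by suggest(...) contain the correct spelling of a misspelled token?
class Evaluation {
    static int countHits(List<List<String>> suggestions, String correct) {
        // The trie's types are lowercase; misspelling.csv is mixed case.
        String expected = correct.toLowerCase();
        int hits = 0;
        for (List<String> list : suggestions) {
            if (list.contains(expected)) hits++;
        }
        return hits;
    }
}
```

Summing these counts across all of misspelling.csv, for each value of count from 3 to 7, gives a simple score to compare against after any proposed improvement.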


No starter code is provided. Check your Java code — including any main function, interfaces and class

implementations — into a GitHub repository. Do not check in the data files (unigram_freq.csv or

misspelling.csv), but you may refer to them in comments or in the auxiliary files of your repository. Submit

the link to this GitHub repository on Canvas.


Your grade for this assignment will be determined as follows:

● 60% = Implementation: your class implementations must run successfully with valid inputs.

The implementation must produce the expected results. Any deviation from the expected results

results in 0 credit for implementation. Each portion of the implementation (Trie, Data, Mechanism)

is equally weighted.

● 15% = Improvements (Thought Experiment): you must demonstrate that you have executed one

or more tests of your implementation, examined the results and have drawn larger conclusions

about the behaviour of your implementation, including: how it may be improved.

● 10% = Decomposition: in the eyes of the grader, your implementation must demonstrate a

reasonable object-oriented decomposition — i.e. encapsulation, polymorphism and inheritance.

● 5% = Efficiency: in the eyes of the grader, your implementation must be maximally efficient with

respect to running time and required space. In particular, building the trie from the data file and

answering suggest(…) queries must not degrade to quadratic running time.

● 5% = Style: in the eyes of the grader, your implementation must be well-commented, use

intelligently-named variables and functions.

● 5% = Documentation: all required documents, “stopwatch” charts, running times and descriptions

must be clear and unambiguous, and these must match the true running time and true space of

your implementation.
