## Word Frequency Analysis Program

Recently I was browsing for interesting programming problems and stumbled across a virtual “ghost town” of programming projects called “Programming Fun Challenge.” While activity there appears to have died down long ago, most of the problems are still relevant and fun to tackle. I decided to take on the Word Count Challenge, where you determine the word frequency for 35 of Shakespeare’s plays.

The original post is here: Christmas Programming Fun Challenge 9

The link to the book in the original post turned out to be out of date. Here is the corrected link: Shakespeare’s First Folio by William Shakespeare

Basic Strategy

• Read the entire file contents into memory
• Use a regular expression to break the text into words (ignoring all punctuation except apostrophes)
• Use a Hash Table (Dictionary<TKey, TValue> type) to keep track of each word and its frequency
• If the word hasn’t been seen before, add it to the Hash with a count of 1
• If the word has already been found, increment the count
• Sort the Hash in descending order based on the frequency count

Most of this behavior was captured in two methods. Here is the code:

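```
// Requires: System.Collections.Generic, System.IO, System.Linq, System.Text.RegularExpressions
public static Dictionary<string, int> AnalyzeWordFrequency(string textToAnalyze)
{
    Dictionary<string, int> wordDistribution = new Dictionary<string, int>();
    // A word is a run of word characters and apostrophes; other punctuation is skipped.
    Regex allWordsPattern = new Regex(@"[\w']+");
    // Lowercase first so "The" and "the" count as the same word.
    MatchCollection allWords = allWordsPattern.Matches(textToAnalyze.ToLower());
    foreach (Match word in allWords)
    {
        if (wordDistribution.ContainsKey(word.Value))
        {
            wordDistribution[word.Value]++;
        }
        else
        {
            // First time this word is seen: start its count at 1.
            wordDistribution.Add(word.Value, 1);
        }
    }

    return wordDistribution;
}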

public static void PrintDistributionToFile(Dictionary<string, int> distribution, string fileName)
{
    // The using block closes the file even if a write throws.
    using (StreamWriter dataWriter = new StreamWriter(new FileStream(fileName, FileMode.Create)))
    {
        // Most frequent words first.
        foreach (KeyValuePair<string, int> dataPoint in distribution.OrderByDescending(x => x.Value))
        {
            dataWriter.WriteLine(dataPoint.Key + " " + dataPoint.Value);
        }
    }
}
```
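
The two methods above still need a caller that handles the first step of the strategy, reading the file into memory. Here is a minimal sketch of one way to wire the whole thing together. The class name `WordFrequencyDemo`, the `CountWords` and `Run` methods, and any file names are placeholders of mine, not part of the original challenge. The counting step is condensed with `TryGetValue` so the block compiles on its own; it is meant to behave the same as the `ContainsKey` check, just with one dictionary lookup fewer.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFrequencyDemo
{
    // Condensed version of the counting step so this sketch is self-contained.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match word in Regex.Matches(text.ToLower(), @"[\w']+"))
        {
            // TryGetValue leaves 'seen' at 0 when the word is not in the dictionary yet.
            counts.TryGetValue(word.Value, out int seen);
            counts[word.Value] = seen + 1;
        }
        return counts;
    }

    // Hypothetical driver: read the whole file, count the words,
    // then write "word count" lines sorted by descending count.
    public static void Run(string inputPath, string outputPath)
    {
        string text = File.ReadAllText(inputPath); // entire file held in memory
        Dictionary<string, int> counts = CountWords(text);
        IEnumerable<string> lines = counts
            .OrderByDescending(pair => pair.Value)
            .Select(pair => pair.Key + " " + pair.Value);
        File.WriteAllLines(outputPath, lines);
    }
}
```

Calling `WordFrequencyDemo.Run("shakespeare.txt", "frequencies.txt")` (both file names are made up) should write one line per distinct word, most frequent first. Reading the whole file at once keeps the code short and is fine for a single book; for much larger inputs, reading line by line would be the safer choice.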

Word Frequency Data Collected With Program