Word Frequency Analysis Program

Recently I was browsing for interesting programming problems and stumbled across a virtual “ghost town” of programming projects called “Programming Fun Challenge”. While the activity appears to have died down long ago, most of the problems are still relevant and fun to tackle. I decided to take on the Word Count Challenge, where you determine the word frequency for 35 of Shakespeare’s plays.

The original post is here: Christmas Programming Fun Challenge 9

The link to the book to analyze turned out to be out of date. Here is the corrected link: Shakespeare’s First Folio by William Shakespeare

Basic Strategy

  • Read the entire file contents into memory
  • Use a regular expression to break the text into words (ignoring all punctuation except apostrophes)
  • Use a Hash Table (Dictionary<TKey, TValue> type) to keep track of each word and its frequency
    • If the word hasn’t been seen before, add it to the Hash with a count of 1
    • If the word has already been found, increment the count
  • Sort the Hash in descending order based on the frequency count

Most of this behavior was captured in two methods. Here is the code:

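        using System.Collections.Generic;
        using System.IO;
        using System.Linq;
        using System.Text.RegularExpressions;

        // Counts how often each word occurs in the text, ignoring case.
        public static Dictionary<string, int> AnalyzeWordFrequency(string textToAnalyze)
        {
            Dictionary<string, int> wordDistribution = new Dictionary<string, int>();
            // A word is a run of word characters (\w) and apostrophes.
            Regex allWordsPattern = new Regex(@"[\w']+");
            MatchCollection allWords = allWordsPattern.Matches(textToAnalyze.ToLower());
            foreach (Match word in allWords)
            {
                if (wordDistribution.ContainsKey(word.Value))
                {
                    wordDistribution[word.Value]++;
                }
                else
                {
                    wordDistribution.Add(word.Value, 1);
                }
            }

            return wordDistribution;
        }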

        // Writes one "word count" line per entry, most frequent word first.
        public static void PrintDistributionToFile(Dictionary<string, int> distribution, string fileName)
        {
            // The using block closes the file even if a write throws.
            using (StreamWriter dataWriter = new StreamWriter(new FileStream(fileName, FileMode.Create)))
            {
                foreach (KeyValuePair<string, int> dataPoint in distribution.OrderByDescending(x => x.Value))
                {
                    dataWriter.WriteLine(dataPoint.Key + " " + dataPoint.Value);
                }
            }
        }

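
The two methods above don't show the first step of the strategy (reading the file into memory) or how to cut the output down to the top 50 words. Here is a rough sketch of how those pieces might fit together. The class name WordFrequencyDemo, the helpers CountWords and TopWords, and the default file name shakespeare.txt are made up for this illustration and are not part of the original program. CountWords is a compact stand-in for AnalyzeWordFrequency so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFrequencyDemo
{
    // Hypothetical stand-in for AnalyzeWordFrequency: lower-cases the text
    // and counts every run of word characters and apostrophes.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match word in Regex.Matches(text.ToLower(), @"[\w']+"))
        {
            int seen;
            counts.TryGetValue(word.Value, out seen); // seen is 0 for a new word
            counts[word.Value] = seen + 1;
        }
        return counts;
    }

    // Hypothetical helper: the n most frequent entries, highest count first.
    public static List<KeyValuePair<string, int>> TopWords(Dictionary<string, int> counts, int n)
    {
        return counts.OrderByDescending(x => x.Value).Take(n).ToList();
    }

    public static void Main(string[] args)
    {
        // Assumed default file name; pass the real path as the first argument.
        string path = args.Length > 0 ? args[0] : "shakespeare.txt";
        string text = File.ReadAllText(path); // reads the whole file into memory
        foreach (var entry in TopWords(CountWords(text), 50))
        {
            Console.WriteLine(entry.Key + " " + entry.Value);
        }
    }
}
```

Reading the whole file with File.ReadAllText keeps the code short and should be fine for a single book-sized text. For much larger inputs, reading line by line would likely be the safer choice.
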
50 Most Common Words Used by Shakespeare

Word Frequency Data Collected With Program
