
Homework Assignment 2

Homework 2: The binomial and the normal distribution

Due: Tuesday, February 28

In this homework, you will work with the binomial and the normal distribution.

Please start by downloading the file lastname_firstname_hw2.r and renaming it so that it contains your last and first names (e.g., ''puddleduck_jemima_hw2.R''). You will edit this file, adding the R code you use to complete the assignment and writing comments where the problems request them. When you are done, submit this file on Blackboard.

A perfect solution to this homework will be worth 100 points.

Part 1 (50 points)

1.1 (15 pts)

In a hypothetical corpus of 1,000,000 words, the word humongous occurs 147 times. Compute the relative frequency of humongous in this corpus. 

Assuming that humongous is binomially distributed and that its actual population probability is the same as the relative frequency you just estimated, plot the probability of different frequencies of the word humongous across an infinite sample of corpora of size 1,000,000. Choose the frequencies for which you plot probabilities such that most of the probability mass is visible in the plot. 

Now only consider a finite sample of 1000 simulated corpora: Draw the frequency of humongous in each of those corpora according to the same binomial distribution you assumed above, and plot the results. 
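The steps above might be sketched as follows. This is only one possible approach, not the required solution; in particular, the plotting range ''100:200'' is an assumption chosen so that most of the probability mass around the mean count of 147 is visible.

```r
## Relative frequency of humongous: 147 occurrences in 1,000,000 words
p.hum <- 147 / 1000000

## Probability of each frequency k under Binomial(n = 1e6, p = p.hum);
## the range 100:200 is an assumption that covers most of the mass
k <- 100:200
plot(k, dbinom(k, size = 1000000, prob = p.hum), type = "h",
     xlab = "frequency of humongous", ylab = "probability")

## Frequencies of humongous in 1000 simulated corpora of the same size
sims <- rbinom(1000, size = 1000000, prob = p.hum)
plot(table(sims), xlab = "frequency of humongous",
     ylab = "number of corpora")
```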

1.2 (15 pts.)

The data frame ''english'' in the ''languageR'' library contains log-transformed lexical decision reaction times in the column ''english$RTlexdec''. Start out by adding a column ''english$eRTlexdec'' with the non-log-transformed reaction times to the data frame ''english''. 

In reaction time experiments, it is common to discard reaction times that deviate more than 3 standard deviations from the mean. List all reaction times in ''english$eRTlexdec'' that are that extreme. Under the assumption that ''english$eRTlexdec'' is normally distributed, what is the likelihood of getting a reaction time that is at least this extreme?

Plot the density of ''english$eRTlexdec'' to check the assumption that this column contains normally distributed data. What do you see? 
What do you see if you plot the density separately for young and old subjects (coded in the column ''english$AgeSubject'')?
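A sketch of the steps in this problem, under two assumptions you should verify yourself: that ''RTlexdec'' is a natural-log transform (so ''exp'' undoes it), and that the age-group column in ''languageR'' is named ''AgeSubject'' with levels ''"young"'' and ''"old"''.

```r
library(languageR)   # provides the english data frame

## Back-transform the log RTs (assumes RTlexdec is a natural log)
english$eRTlexdec <- exp(english$RTlexdec)

## Reaction times more than 3 standard deviations from the mean
m <- mean(english$eRTlexdec)
s <- sd(english$eRTlexdec)
extreme <- english$eRTlexdec[abs(english$eRTlexdec - m) > 3 * s]

## Two-tailed probability of a value at least 3 SDs from the mean,
## under a normality assumption
2 * pnorm(-3)

## Density of the full column, then separately by age group
plot(density(english$eRTlexdec))
plot(density(english$eRTlexdec[english$AgeSubject == "young"]))
plot(density(english$eRTlexdec[english$AgeSubject == "old"]))
```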

1.3 (20 pts.) 

(Adapted from a problem by Roger Levy.) A linguist searches a corpus for binomials of the form pepper and salt (P) or salt and pepper (S). He collects twenty examples, obtaining the sequence 


We want to test whether this sample indicates that the corpus shows a preference for S over P: How likely would it be to obtain this sequence under the null hypothesis that S and P are equally preferred?

Typical probability thresholds used to test hypotheses are p = 0.05 and p = 0.01. A null hypothesis can be rejected at the (confidence) level p = 0.05 if the likelihood of the data under the null hypothesis is p ≤ 0.05. Is this the case for this sample? At least how many counts of S among the 20 examples are needed for rejection at level p = 0.05? (The R function you need for answering this question is one of ''dbinom'', ''rbinom'', ''pbinom'' and ''qbinom''.)
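As a reminder of what the four functions compute, here they are applied to 20 trials with success probability 0.5. The specific arguments are illustrative only, not the answer to the problem:

```r
## Binomial(size = 20, prob = 0.5): the four R functions
dbinom(12, size = 20, prob = 0.5)   # P(exactly 12 successes)
pbinom(12, size = 20, prob = 0.5)   # P(at most 12 successes)
pbinom(11, size = 20, prob = 0.5,
       lower.tail = FALSE)          # P(12 or more successes)
qbinom(0.95, size = 20, prob = 0.5) # smallest k with P(X <= k) >= 0.95
rbinom(5, size = 20, prob = 0.5)    # five random draws of a count
```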

Part 2 (50 points)

Please download the story The Wonderful Wizard of Oz by L. Frank Baum from Project Gutenberg, ''http://www.gutenberg.org/wiki/Main_Page''. Be sure to get the plain text version, not the HTML version. Save the file as "OZ.txt" and load it into a vector ''oz'' of strings using the function ''scan''. You will need to set the parameter ''what'' to some arbitrary character, for example ''what="q"'', to tell R that you are reading strings, and tell it to ignore quotes by setting the parameter ''quote=NULL''. Finally, transform the vector ''oz'' so that all words in it are in lowercase. You can use the R function ''tolower'' for this. So your command will look something like the following:

oz = tolower(scan("OZ.txt", what="q", quote=NULL))

(Note: The file contains a preamble and coda from Project Gutenberg. This is okay. Do not remove them.) 

2.1 (20 pts.)

Follow the procedure outlined in the Workbook section of the Baayen book, chapter 3, to achieve the following:

  * We want to cut the Wizard of Oz text into equal-sized text chunks of 700 words each. Make a new data frame ''oz.df'' with a column ''word'' that contains the vector ''oz'', except that you leave off as many words as needed from the end of the vector so that the length of ''word'' is divisible by 700 without remainder.

  * Use ''cut'' to cut ''oz.df$word'' into equal-sized text chunks of 700 words each, and save the result as ''oz.df$chunk''.

  * In the same way that the tables ''countOfAlice'' and ''countOfAlice.tab'' are constructed in the book, make tables for the following words from the Wizard of Oz:

    dorothy, woodman, to
Call them ''countOfDorothy'', ''countOfDorothy.tab'', ''countOfWoodman'', ... (''woodman'' is the second part of the name Tin Woodman.)
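Assuming ''oz'' has been loaded as above, the steps of 2.1 might look like the sketch below. Note that ''cut'' on a numeric index produces approximately equal-sized chunks; Baayen's exact recipe in the Workbook may differ in detail, so treat this as one possibility rather than the required solution.

```r
## Trim oz so its length is a multiple of 700
n <- 700 * (length(oz) %/% 700)
oz.df <- data.frame(word = oz[1:n])

## Cut the word index into n/700 equal-sized chunks
oz.df$chunk <- cut(1:n, breaks = n / 700, labels = FALSE)

## Per-chunk counts for one word, and the table of those counts
## (names follow the assignment; repeat for woodman and to)
countOfDorothy <- tapply(oz.df$word == "dorothy", oz.df$chunk, sum)
countOfDorothy.tab <- table(countOfDorothy)
```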

2.2 (5 pts.)

For each of the words dorothy, woodman, and to, make a plot that displays through high-density lines how often that word occurs in each successive chunk. What do you see?
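High-density lines are what ''type = "h"'' produces in R's ''plot''. Assuming the per-chunk count vector ''countOfDorothy'' from 2.1, one such plot might be:

```r
## Occurrences of "dorothy" in each successive chunk,
## drawn as vertical (high-density) lines
plot(as.vector(countOfDorothy), type = "h",
     xlab = "chunk", ylab = "frequency of dorothy")
```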

2.3 (25 pts.)

Set R's plotting function to display two panels, one next to the other, in the plotting window. 

In the left panel, make a plot that shows the number of times each frequency of dorothy occurs across the chunks. On the x-axis, show the frequencies of dorothy, and on the y-axis show the number of chunks that had this frequency of dorothy.

In the right panel, plot the densities for dorothy under the assumption that this word follows a binomial distribution with an estimated probability p equal to the mean of the counts across all chunks divided by the size of the chunks. Compare the binomial densities with the sample densities. What do you see?

Do the same for woodman and to. Again, what do you see?
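One possible skeleton for the dorothy panels, assuming the objects ''countOfDorothy'' and ''countOfDorothy.tab'' from 2.1 and a chunk size of 700; the same pattern would repeat for woodman and to:

```r
## Two panels side by side
par(mfrow = c(1, 2))

## Left: number of chunks showing each frequency of "dorothy"
plot(countOfDorothy.tab, xlab = "frequency of dorothy",
     ylab = "number of chunks")

## Right: binomial densities with p estimated as
## mean count per chunk divided by chunk size
p.hat <- mean(countOfDorothy) / 700
k <- 0:max(countOfDorothy)
plot(k, dbinom(k, size = 700, prob = p.hat), type = "h",
     xlab = "frequency of dorothy", ylab = "binomial probability")
```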

Colin Bannard,
Feb 17, 2012, 2:12 PM