Homework 5: Principal Component Analysis

Due: Tuesday, April 24

In this homework, you'll work with the methods for principal component analysis covered in chapter 5 of Baayen.

Please start by downloading the file lastname_firstname_hw5.R, and rename it so that it has your last and first names as appropriate (e.g., ''tzara_tristan_hw5.R''). You will edit this file, adding the R code you use to complete the assignment and writing comments as requested in the problems. When you are done, submit this file on Blackboard.

This homework uses the dataset wyatt.txt. See the Wyatt dataset description to understand the columns and what the levels mean. The data set and its annotations come from Jason Powell; a big thanks goes to him for letting us use the data for this homework.

A perfect solution to this homework will be worth 75 points.

Problem 1 (20 pts) Principal component analysis

(a) [4 pts] Read the contents of ''wyatt.txt'' into a data frame called ''wy'', taking into account that ''wyatt.txt'' contains a header line. Next, make a second dataframe ''wy.relfreq'' that contains the relative frequencies of words and word bigrams in each text.
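As a rough illustration of the two steps in (a), here is a minimal sketch on a made-up toy file, not on ''wyatt.txt'' itself. The file contents, the column names, and the choice to divide each count by its row total are all assumptions for illustration; check the Wyatt dataset description to see which columns actually hold counts and what the right denominator is.

```r
# Write a tiny whitespace-delimited file with a header line (toy data, hypothetical columns).
toy.file <- tempfile(fileext = ".txt")
writeLines(c("Text w.the w.of b.of.the",
             "t1 10 5 5",
             "t2 2 6 2"), toy.file)

# header = TRUE tells read.table() that the first line holds the column names.
toy <- read.table(toy.file, header = TRUE)

# Keep only the count columns (here: everything except the first column).
toy.counts <- toy[, -1]

# Divide each row by its row total, so every row of the result sums to 1.
toy.relfreq <- toy.counts / rowSums(toy.counts)
print(toy.relfreq)
```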

(b) [8 pts] Perform principal component analysis on ''wy.relfreq'' and store the result in a variable ''wy.pca''. Do not forget to scale your data. Compute the proportion of variance that each principal component explains without using ''summary()''. Plot the proportions of variance in a barplot.
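The general pattern for (b) can be sketched on R's built-in ''USArrests'' data, which stands in here for your own data frame. The sketch assumes you use ''prcomp()'' (one common choice in R) and relies on the fact that the variance of each component is the square of its standard deviation, stored in ''$sdev''.

```r
# PCA on a built-in toy data set; scale. = TRUE standardizes each column first.
toy.pca <- prcomp(USArrests, scale. = TRUE)

# Variance of each component = squared standard deviation.
toy.var <- toy.pca$sdev^2

# Proportion of total variance explained by each component.
toy.prop <- toy.var / sum(toy.var)

barplot(toy.prop,
        names.arg = paste0("PC", seq_along(toy.prop)),
        ylab = "Proportion of variance")
```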

(c) [8 pts] Use the R function ''pairs()'' to plot the texts in the space spanned by the first three principal components. (Note that the transformed coordinates of the texts are stored in ''wy.pca$x''.) Do the same for components 4-7, and 8-11. Record your observations.
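To show what selecting components from ''$x'' looks like, here is a small sketch on the built-in ''iris'' measurements (a stand-in data set with only four components, so only the first three are plotted).

```r
# PCA on the four numeric iris columns (toy stand-in data).
toy.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Columns of $x are the components; pick a range of them by column index.
toy.first3 <- toy.pca$x[, 1:3]

# One scatterplot for every pair of the selected components.
pairs(toy.first3)
```

With more components available, another range is selected the same way, for example columns ''4:7''.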

Problem 2 (25 pts) Authors and principal components

For this problem, you will explore to what extent different principal components encode a distinction between different assumed authors.

(a) [8 pts] First, use the R function ''pairs()'' again to visualize the location of the texts in the first three principal components, but use a different color for each supposed author. You can use the R function ''rainbow()'' to create a vector of different colors. You can either add each supposed author to the plot separately using ''points()'', or you can use ''unclass()'' to map authors to colors. We have discussed the use of ''unclass()'' to map different levels of a factor to different colors or plotting symbols; demo code that illustrates this is in Demo: plotting. To get authors, recall that supposed authors are recorded in ''wy$Author''. What do you observe?
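The ''unclass()'' idea can be sketched on ''iris'', with ''Species'' playing the role of the grouping factor. One caveat: ''unclass()'' only yields integer codes for a factor. In R 4.0.0 and later, ''read.table()'' reads text columns as character vectors by default, so you may need to wrap such a column in ''factor()'' first.

```r
toy.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# factor() is a no-op for a factor, and converts a character vector if needed.
toy.groups <- factor(iris$Species)

# unclass() exposes the underlying integer codes 1, 2, ..., one per level.
toy.codes <- unclass(toy.groups)

# One color per level, then look up each point's color by its code.
toy.palette <- rainbow(nlevels(toy.groups))
toy.cols <- toy.palette[toy.codes]

pairs(toy.pca$x[, 1:3], col = toy.cols)
```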

(b) [9 pts] The label "UN" stands for "unknown author". Make a new column ''wy$UN'' that is TRUE when ''wy$Author'' is "UN". Then re-plot the texts in the first three principal components using ''pairs()'' with only two colors: red for texts whose author is "UN", and blue for all others. Note that you cannot use ''unclass()'' here: it maps to integers only when its argument is a factor, and ''wy$UN'' is a logical vector. What you want is a vector of integers that has an entry of 1 whenever ''wy$UN'' is FALSE and 2 whenever ''wy$UN'' is TRUE. What do you observe? Is there any dimension that seems to encode a distinction of "UN" from other authors?

Do the same for the author label "W", which stands for "Wyatt".
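One possible way to turn a logical vector into the integers 1 and 2 is sketched below on ''iris'', where the hypothetical flag ''toy.flag'' marks one species. It uses the fact that ''as.integer()'' turns FALSE into 0 and TRUE into 1; other approaches, such as ''ifelse()'', work as well.

```r
toy.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Logical flag: TRUE for the group of interest, FALSE otherwise.
toy.flag <- iris$Species == "setosa"

# FALSE -> 0 -> 1, TRUE -> 1 -> 2.
toy.index <- as.integer(toy.flag) + 1L

# Position 1 is the color for FALSE, position 2 the color for TRUE.
toy.cols <- c("blue", "red")[toy.index]

pairs(toy.pca$x[, 1:3], col = toy.cols)
```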

(c) [8 pts] Pick the dimension (i.e., principal component) that seemed to you to best distinguish "UN" from other authors. Explore further to what extent this dimension separates authors: Use the R function ''boxplot()'' to visualize where this dimension tends to place texts with author "UN" versus other texts. Here is how you can do this: If the dimension that you picked were stored in a variable ''D'', you would plot ''D ~ wy$UN''. (Baayen discusses these kinds of boxplots on pp. 136-137.) What do you observe? Does this dimension distinguish "UN" from non-"UN" texts?

For the same dimension, do the same for author "W".
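The formula form ''D ~ group'' can be sketched on ''iris'' as follows. Here ''D'' is simply the first component of the toy PCA and ''toy.flag'' is a hypothetical logical grouping; which component to use for the Wyatt data is for you to decide.

```r
toy.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Scores of all observations on one chosen component (here the first).
D <- toy.pca$x[, 1]

# Logical grouping variable: TRUE for the group of interest.
toy.flag <- iris$Species == "setosa"

# One box per value of the grouping variable (FALSE and TRUE).
toy.bp <- boxplot(D ~ toy.flag,
                  xlab = "Is setosa?",
                  ylab = "Score on PC1")
```

If the two boxes barely overlap, the component separates the two groups well; heavily overlapping boxes suggest it does not.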

Problem 3 (15 pts) Genre and principal components

In this problem you will explore whether any principal component encodes genre distinctions, which are recorded in the column ''Genre''.

Use ''xtabs()'' to determine the frequency of different genre labels in ''wy''. For each of the four most frequent genres, do a ''pairs()'' plot: Visualize again the first three principal components, showing in red the texts for the genre you are testing, and in blue all other texts.

Say what you are observing.
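Counting labels with ''xtabs()'' and picking out the most frequent ones might look like the sketch below. It uses the built-in ''mtcars'' data with the ''cyl'' column as a stand-in for a label column, and takes the top two labels only because this toy column has just three.

```r
# One-sided formula: count how often each value of cyl occurs.
toy.tab <- xtabs(~ cyl, data = mtcars)

# Sort the counts from most to least frequent.
toy.sorted <- sort(toy.tab, decreasing = TRUE)
print(toy.sorted)

# Labels of the two most frequent values.
toy.top <- names(toy.sorted)[1:2]

# Red for rows carrying the most frequent label, blue for all others.
toy.cols <- ifelse(as.character(mtcars$cyl) == toy.top[1], "red", "blue")
```

A vector like ''toy.cols'' can then be passed as the ''col'' argument of ''pairs()'', once per label of interest.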

Problem 4 (15 pts) Time and principal components

In this problem, you will test if any of the principal components encodes the time at which a text was written. This information is encoded in the columns ''DateStart'' and ''DateEnd'' of ''wy''.

For each of the first three principal components, plot ''DateStart'' against the value of the text on that component. Use ''lowess()'' to draw a smoother line through the scatterplot. (Baayen discusses how to do this on pp. 136-137, directly above the use of ''boxplot()''.) What do you see?

Do the same for ''DateEnd''. What do you see? Does any of the first three principal components seem to encode time?
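Adding a ''lowess()'' line to a scatterplot can be sketched on the built-in ''cars'' data. Which variable goes on which axis is a choice made only for this toy example.

```r
# Basic scatterplot of two numeric variables (toy stand-in data).
plot(cars$speed, cars$dist,
     xlab = "speed", ylab = "dist")

# lowess() returns a list with x (sorted) and the smoothed y values.
toy.smooth <- lowess(cars$speed, cars$dist)

# Draw the smoothed curve on top of the existing scatterplot.
lines(toy.smooth, col = "red")
```

A roughly flat line suggests little relation between the two variables, while a clear upward or downward trend suggests that one varies systematically with the other.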

Katrin Erk,
Apr 16, 2012, 11:52 AM