
Homework assignment 1

Homework 1: R exploration

Due: Tuesday, February 14

In this homework, you'll work with a small dataset that describes the number of occurrences of several words (I, the, Obama, McCain, etc.) in 80 editorial articles from the New York Times from August to December 2008. There are 20 articles from each of four authors (Gail Collins, Maureen Dowd, William Kristol, and Paul Krugman).

Please download nyt_opinion.txt and lastname_firstname_hw1.R. The file nyt_opinion.txt has information about the editorials. The file lastname_firstname_hw1.R is where we ask you to put your answers. Please rename lastname_firstname_hw1.R so that it has your last and first names as appropriate (e.g., doe_john_hw1.R). You will edit this file, adding the R code you use to complete the assignment and writing comments as requested in the problems. When you are done, submit this file on Blackboard.

Please record in lastname_firstname_hw1.R all the code that you use to solve the homework problems. You do not need to record erroneous commands, but we do need to see all the commands that you used to actually obtain the solutions, not just the solutions themselves.

Part 1 (50 points)

1 (5 pts).

Load the data in nyt_opinion.txt into an R data frame named nyt.

Use the summary function on nyt and note one interesting pattern you can see in the result.
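As a tip, here is a minimal sketch of reading a whitespace-separated text file with a header row and summarizing it. The tiny file written below is a made-up stand-in, and the assumption that nyt_opinion.txt has the same layout (header line, whitespace separators) may not hold, so check the file before copying the arguments.

```r
# Write a tiny made-up file so the example is self-contained.
# The layout (header row, whitespace-separated) is an assumption.
tmp <- tempfile(fileext = ".txt")
writeLines(c("FileName Author NumWords the obama",
             "a.txt collins 800 40 3",
             "b.txt dowd 750 35 5"), tmp)

# header = TRUE uses the first line as column names.
toy <- read.table(tmp, header = TRUE)

# summary reports per-column statistics (min, quartiles, mean, max for numbers).
summary(toy)
```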

2 (10 pts).

Calculate the number of texts that mention "economy" more than "campaign".

Sort the rows in nyt by the number of combined mentions (from most to least) of 2008 presidential candidates (obama + mccain) and then by the number of combined mentions of 2008 vice-presidential candidates (palin + biden).
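Two general techniques are useful here: summing a logical vector to count rows that meet a condition, and using order with several keys to sort a data frame. The sketch below uses a made-up data frame; the columns x and y are hypothetical sort keys, not columns of nyt.

```r
# Made-up data; x and y are hypothetical sort keys.
toy <- data.frame(economy  = c(3, 0, 2, 5),
                  campaign = c(1, 4, 2, 0),
                  x        = c(2, 5, 5, 1),
                  y        = c(1, 0, 3, 9))

# A comparison yields TRUE/FALSE per row; sum counts the TRUEs.
n_more <- sum(toy$economy > toy$campaign)

# order sorts ascending, so negate the keys for most-to-least.
# Ties in the first key are broken by the second key.
sorted <- toy[order(-toy$x, -toy$y), ]
sorted
```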

3 (10 pts).

Extract the vector of publication dates for nyt and use it to create a new column called Date.

You'll need to use the information in the FileName column and the functions as.Date, substring, and paste. Here are some examples to act as tips for each of these steps:

> substring("tempting",1,5)
[1] "tempt"

> paste("at","tempt",sep="")
[1] "attempt"

> paste("at",substring("tempting",1,5),sep="")
[1] "attempt"

> as.Date("2009-09-22")
[1] "2009-09-22"

Extract from nyt all rows from articles published after November 1, 2008.
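The tips above can be combined on a whole column at once, since substring and paste are vectorized. The following is only a sketch: the file name pattern used here (eight digits of year, month, and day at the start) is hypothetical, so look at the actual FileName values in nyt to find the right character positions.

```r
# Hypothetical file names; the real FileName layout may differ.
toy <- data.frame(FileName = c("20081028_a.txt",
                               "20081105_b.txt",
                               "20081201_c.txt"))

# Cut out year, month, and day, then glue them into "YYYY-MM-DD".
ymd <- paste(substring(toy$FileName, 1, 4),
             substring(toy$FileName, 5, 6),
             substring(toy$FileName, 7, 8),
             sep = "-")
toy$Date <- as.Date(ymd)

# Date objects can be compared directly, which allows row selection.
late <- toy[toy$Date > as.Date("2008-11-01"), ]
late
```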

4 (5 pts).

Add a new column to nyt called Month that gives the name of the month the article was published in. Use the Date column and the months function. An example:

> months(as.Date("2009-09-22"))
[1] "September"

5 (5 pts).

Create a new dataframe encoding the gender of the authors (Collins and Dowd are female, and Kristol and Krugman are male). Add this information to nyt using the function merge. Use Gender for the name of the column giving the gender of the author.
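Here is a small illustration of how merge attaches information from a lookup table to a larger data frame through a shared column. All names and values (articles, lookup, Team) are made up for the example.

```r
# Made-up data: three rows, two distinct authors.
articles <- data.frame(Author = c("a", "b", "a"),
                       n      = c(1, 2, 3))

# One row per author, carrying the extra attribute to attach.
lookup <- data.frame(Author = c("a", "b"),
                     Team   = c("red", "blue"))

# Rows are matched on the shared Author column; every article row
# receives the Team value of its author.
merged <- merge(articles, lookup, by = "Author")
merged
```

Note that merge sorts the result by the key column by default, so the row order may differ from the original.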

6 (10 pts).

Calculate the average number of mentions of each word for each author. Obtain the indices of the start column (the) and the end column (government) by applying the function which to the vector of column names returned by colnames; then use a list slice to access the appropriate columns and aggregate to perform the calculation.
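The steps described above can be sketched on a made-up data frame as follows. The three word columns and their values are invented for illustration; only the column names the and government come from the assignment.

```r
# Made-up data with three word columns.
toy <- data.frame(Author     = c("a", "a", "b"),
                  NumWords   = c(10, 20, 30),
                  the        = c(2, 4, 6),
                  of         = c(1, 3, 5),
                  government = c(0, 2, 4))

# which returns the position where the comparison is TRUE.
first <- which(colnames(toy) == "the")
last  <- which(colnames(toy) == "government")

# toy[first:last] is a list slice: it keeps those columns as a data frame.
# aggregate applies FUN to each column within each group.
means <- aggregate(toy[first:last], by = list(Author = toy$Author), FUN = mean)
means
```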

7 (5 pts).

Create a copy of nyt called nyt.mod. Set values for all lexical items (from column the to column government) in nyt.mod to be their relative frequency by dividing all values for an article by the number of words in the article (found in the NumWords column).
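As a hint on the mechanics, dividing a data frame by a vector that has one element per row divides every column element-wise by that vector, so each row is scaled by its own value. A minimal sketch on made-up data:

```r
# Made-up data; NumWords plays the role of the per-article word total.
toy <- data.frame(NumWords   = c(10, 20, 40),
                  the        = c(2, 4, 8),
                  government = c(1, 1, 2))

first <- which(colnames(toy) == "the")
last  <- which(colnames(toy) == "government")

# Work on a copy so the raw counts stay available.
toy.mod <- toy
toy.mod[first:last] <- toy[first:last] / toy$NumWords
toy.mod
```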

Part 2 (50 points)

1 (10 pts).

Pick two of the three columns Author, Gender, and Month and look for interesting relationships between these and one or more of the word columns via mosaic plots of xtabs contingency tables.

Note that you can look at sums of counts from the different word columns when you are doing this. Here's a simple example:

> xtabs(obama+biden ~ Author, data=nyt)
collins    dowd kristol krugman
     84     120     139      30

Comment on the relationships you found.
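To cross two grouping columns, put both on the right-hand side of the xtabs formula and pass the resulting table to mosaicplot. The sketch below uses made-up counts and author labels, so the picture itself carries no meaning.

```r
# Made-up counts for two authors and two months.
toy <- data.frame(Author = c("a", "a", "b", "b"),
                  Month  = c("Oct", "Nov", "Oct", "Nov"),
                  obama  = c(3, 1, 0, 6))

# Sum the obama counts within each Author x Month cell.
tab <- xtabs(obama ~ Author + Month, data = toy)
tab

# Tile areas are proportional to the cell totals.
mosaicplot(tab, main = "Toy example", color = TRUE)
```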

2 (15 pts).

Use the function pairs on the word count columns in nyt to look for interesting relationships (you may want to look at different subsets of columns, e.g. with c(9:17,25:29)). Look for word pairs that appear to be correlated, pick two of them, and hypothesize why they might be correlated.
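For orientation, here is a small sketch of pairs on a column subset, using made-up columns w1, w2, and w3. Calling cor on the same columns is an optional extra (not required by the problem) that can help confirm what the scatterplots suggest.

```r
# Made-up counts; w1 and w2 rise together, w3 mostly falls.
toy <- data.frame(w1 = c(1, 2, 3, 4, 5),
                  w2 = c(2, 4, 5, 8, 10),
                  w3 = c(5, 3, 4, 1, 2))

# One scatterplot per pair of the selected columns.
pairs(toy[, c(1:2, 3)])

# Pairwise correlation coefficients for the same columns.
cor(toy)
```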

3 (15 pts).

Make a plot with dates as the x-axis and counts for obama and mccain on the y-axis. Make the points for obama blue circles and for mccain red triangles. The x-axis should be labeled "Date of Publication" and the y-axis should be "Number of mentions".

Note at least one interesting thing about this graph.
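One common way to show two series in a single base-graphics plot is to draw the first with plot and add the second with points. The sketch below uses made-up dates and counts in hypothetical columns s1 and s2; pch = 1 draws open circles and pch = 2 draws triangles.

```r
# Made-up data; s1 and s2 are hypothetical count columns.
toy <- data.frame(Date = as.Date(c("2008-09-01", "2008-10-01", "2008-11-01")),
                  s1   = c(2, 5, 1),
                  s2   = c(4, 3, 0))

# Set ylim from both series so that neither is cut off.
yl <- range(toy$s1, toy$s2)
plot(toy$Date, toy$s1, pch = 1, col = "blue", ylim = yl,
     xlab = "Date of Publication", ylab = "Number of mentions")

# points adds to the existing plot instead of starting a new one.
points(toy$Date, toy$s2, pch = 2, col = "red")
legend("topright", legend = c("s1", "s2"),
       pch = c(1, 2), col = c("blue", "red"))
```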

4 (10 pts).

Plot histograms and densities of the summed counts of a through and, obama through bush, and financial through government. Use rowSums (you will need to look this up in the R help; note that rowSums works for data frames as well) to add up values across multiple columns. Provide the code to produce two of the graphs you found most interesting. (If you would like to use truehist, please note that you need to load the library MASS first.)
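Here is a sketch of the rowSums step followed by a histogram with a density curve drawn on top, on made-up data. The three columns stand in for a longer run of adjacent word columns; setting freq = FALSE puts the histogram on the density scale so the curve is comparable.

```r
# Made-up counts for three adjacent word columns.
toy <- data.frame(a   = c(3, 1, 4, 1, 5),
                  an  = c(0, 2, 1, 0, 1),
                  and = c(2, 2, 3, 1, 4))

first <- which(colnames(toy) == "a")
last  <- which(colnames(toy) == "and")

# One total per row, summed across the selected columns.
total <- rowSums(toy[first:last])

# Histogram on the density scale, with a kernel density estimate overlaid.
hist(total, freq = FALSE, main = "Toy example", xlab = "Summed counts")
lines(density(total))
```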

Katrin Erk,
Feb 6, 2012, 3:39 PM