Discussion Work 1 (15 mins)

The file Pollution.txt contains data on pollution. A description of the data is in the file Pollution_description.txt. Read through the description and explore the dataset. Consider the following questions:

  • Could the data collection and processing methods have influenced the data?
  • Which predictors are associated with mortality?
  • Should any predictors be transformed, or should new features be created?
  • Are the predictors and response linearly or non-linearly related?
  • Are there any outliers? If so, how influential are they likely to be? What should be done about these outliers?
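
As a starting point for the questions above, here is a minimal sketch in R. The helper names rank_predictors and flag_influential are hypothetical, and so are the file layout (whitespace-delimited with a header row) and the response column name "Mortality" in the usage comments; check Pollution_description.txt for the actual variable names. Comparing Pearson with Spearman correlation gives a rough hint of non-linearity, and Cook's distance is one common way to judge how influential an outlier is likely to be.

```r
# Hypothetical helper: rank numeric predictors by association with a response.
rank_predictors <- function(df, response) {
  num <- df[sapply(df, is.numeric)]
  predictors <- setdiff(names(num), response)
  # Pearson measures linear association; Spearman also captures monotone
  # non-linear association. A large gap between the two suggests non-linearity.
  pearson <- sapply(predictors, function(p)
    cor(num[[p]], num[[response]], use = "complete.obs"))
  spearman <- sapply(predictors, function(p)
    cor(num[[p]], num[[response]], use = "complete.obs", method = "spearman"))
  out <- data.frame(predictor = predictors, pearson = pearson,
                    spearman = spearman, row.names = NULL)
  out[order(-abs(out$spearman)), ]
}

# Hypothetical helper: indices of observations with large Cook's distance in
# a linear model of the response on all other columns.
flag_influential <- function(df, response) {
  fit <- lm(as.formula(paste(response, "~ .")), data = df)
  d <- cooks.distance(fit)
  which(d > 4 / nrow(df))  # 4/n is a common rule of thumb, not a hard threshold
}

# Usage sketch (file format and column name are assumptions):
# pollution <- read.table("Pollution.txt", header = TRUE)
# summary(pollution)
# pairs(pollution)
# rank_predictors(pollution, "Mortality")
# flag_influential(pollution, "Mortality")
```

Flagged observations are candidates for a closer look, not automatic deletions; refitting with and without them shows how much they actually change the conclusions.
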
Discussion Work 2 (15 mins)

Examine the Blood Brain dataset included in the caret package. The dataset must be loaded explicitly with the commands:

    library(caret)
    data("BloodBrain")
This dataset comes in two parts: bbbDescr is a data frame containing a number of properties of chemicals. logBBB is a vector giving the log ratio between the concentrations of those same chemicals in the blood and in the brain of mice.

This dataset has 134 predictors, so looking at all pairwise plots is not feasible. Dimension reduction techniques such as PCA are one approach you should try.
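
One way to try PCA here is sketched below in R. The helper run_pca is hypothetical, and the 90% variance cut-off is an arbitrary illustrative choice. The sketch drops constant columns (they cannot be scaled to unit variance), standardises the rest, and reports the cumulative proportion of variance explained. The usage part only runs if caret is installed.

```r
# Hypothetical helper: PCA on centred, scaled numeric predictors.
run_pca <- function(X) {
  X <- as.data.frame(X)
  X <- X[, sapply(X, is.numeric), drop = FALSE]
  # Constant columns have zero variance and would break scaling, so drop them.
  keep <- sapply(X, function(col) sd(col, na.rm = TRUE) > 0)
  pca <- prcomp(X[, keep, drop = FALSE], center = TRUE, scale. = TRUE)
  var_explained <- pca$sdev^2 / sum(pca$sdev^2)
  list(pca = pca, cumulative = cumsum(var_explained))
}

# Usage sketch on the Blood Brain data.
if (requireNamespace("caret", quietly = TRUE)) {
  data("BloodBrain", package = "caret")  # creates bbbDescr and logBBB
  res <- run_pca(bbbDescr)
  # Number of components needed to explain at least 90% of the variance.
  print(which(res$cumulative >= 0.9)[1])
  # To look at the first two components against the response, try:
  # plot(res$pca$x[, 1:2],
  #      col = ifelse(logBBB > median(logBBB), "red", "blue"))
}
```

Components that explain a lot of predictor variance are not guaranteed to be the ones most related to logBBB, so it is worth plotting the leading component scores against the response as well.
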