Contents
1 A brief introduction to R 1
1.1 An overview of R 1
1.1.1 A short R session 1
1.1.2 The uses of R 6
1.1.3 Online help 7
1.1.4 Input of data from a file 8
1.1.5 R packages 9
1.1.6 Further steps in learning R 9
1.2 Vectors, factors, and univariate time series 10
1.2.1 Vectors 10
1.2.2 Concatenation – joining vector objects 10
1.2.3 The use of relational operators to compare vector elements 11
1.2.4 The use of square brackets to extract subsets of vectors 11
1.2.5 Patterned data 11
1.2.6 Missing values 12
1.2.7 Factors 13
1.2.8 Time series 14
1.3 Data frames and matrices 14
1.3.1 Accessing the columns of data frames – with() and attach() 17
1.3.2 Aggregation, stacking, and unstacking 17
1.3.3∗ Data frames and matrices 18
1.4 Functions, operators, and loops 19
1.4.1 Common useful built-in functions 19
1.4.2 Generic functions, and the class of an object 21
1.4.3 User-written functions 22
1.4.4 if Statements 23
1.4.5 Selection and matching 23
1.4.6 Functions for working with missing values 24
1.4.7∗ Looping 24
1.5 Graphics in R 25
1.5.1 The function plot() and allied functions 25
1.5.2 The use of color 27
1.5.3 The importance of aspect ratio 28
1.5.4 Dimensions and other settings for graphics devices 28
1.5.5 The plotting of expressions and mathematical symbols 29
1.5.6 Identification and location on the figure region 29
1.5.7 Plot methods for objects other than vectors 30
1.5.8 Lattice (trellis) graphics 30
1.5.9 Good and bad graphs 32
1.5.10 Further information on graphics 33
1.6 Additional points on the use of R 33
1.7 Recap 35
1.8 Further reading 36
1.9 Exercises 37
2 Styles of data analysis 43
2.1 Revealing views of the data 43
2.1.1 Views of a single sample 44
2.1.2 Patterns in univariate time series 47
2.1.3 Patterns in bivariate data 49
2.1.4 Patterns in grouped data – lengths of cuckoo eggs 52
2.1.5∗ Multiple variables and times 53
2.1.6 Scatterplots, broken down by multiple factors 56
2.1.7 What to look for in plots 58
2.2 Data summary 59
2.2.1 Counts 59
2.2.2 Summaries of information from data frames 63
2.2.3 Standard deviation and inter-quartile range 65
2.2.4 Correlation 67
2.3 Statistical analysis questions, aims, and strategies 69
2.3.1 How relevant and how reliable are the data? 70
2.3.2 How will results be used? 70
2.3.3 Formal and informal assessments 71
2.3.4 Statistical analysis strategies 72
2.3.5 Planning the formal analysis 72
2.3.6 Changes to the intended plan of analysis 73
2.4 Recap 73
2.5 Further reading 74
2.6 Exercises 74
3 Statistical models 77
3.1 Statistical models 77
3.1.1 Incorporation of an error or noise component 78
3.1.2 Fitting models – the model formula 80
3.2 Distributions: models for the random component 81
3.2.1 Discrete distributions – models for counts 82
3.2.2 Continuous distributions 84
3.3 Simulation of random numbers and random samples 86
3.3.1 Sampling from the normal and other continuous distributions 87
3.3.2 Simulation of regression data 88
3.3.3 Simulation of the sampling distribution of the mean 88
3.3.4 Sampling from finite populations 90
3.4 Model assumptions 91
3.4.1 Random sampling assumptions – independence 91
3.4.2 Checks for normality 92
3.4.3 Checking other model assumptions 95
3.4.4 Are non-parametric methods the answer? 95
3.4.5 Why models matter – adding across contingency tables 96
3.5 Recap 97
3.6 Further reading 98
3.7 Exercises 98
4 A review of inference concepts 102
4.1 Basic concepts of estimation 102
4.1.1 Population parameters and sample statistics 102
4.1.2 Sampling distributions 102
4.1.3 Assessing accuracy – the standard error 103
4.1.4 The standard error for the difference of means 103
4.1.5∗ The standard error of the median 104
4.1.6 The sampling distribution of the t-statistic 105
4.2 Confidence intervals and tests of hypotheses 106
4.2.1 A summary of one- and two-sample calculations 109
4.2.2 Confidence intervals and tests for proportions 112
4.2.3 Confidence intervals for the correlation 113
4.2.4 Confidence intervals versus hypothesis tests 113
4.3 Contingency tables 114
4.3.1 Rare and endangered plant species 116
4.3.2 Additional notes 119
4.4 One-way unstructured comparisons 119
4.4.1 Multiple comparisons 122
4.4.2 Data with a two-way structure, i.e., two factors 123
4.4.3 Presentation issues 124
4.5 Response curves 125
4.6 Data with a nested variation structure 126
4.6.1 Degrees of freedom considerations 127
4.6.2 General multi-way analysis of variance designs 127
4.7 Resampling methods for standard errors, tests, and confidence intervals 128
4.7.1 The one-sample permutation test 128
4.7.2 The two-sample permutation test 129
4.7.3∗ Estimating the standard error of the median: bootstrapping 130
4.7.4 Bootstrap estimates of confidence intervals 131
4.8∗ Theories of inference 132
4.8.1 Maximum likelihood estimation 133
4.8.2 Bayesian estimation 133
4.8.3 If there is strong prior information, use it! 135
4.9 Recap 135
4.10 Further reading 136
4.11 Exercises 137
5 Regression with a single predictor 142
5.1 Fitting a line to data 142
5.1.1 Summary information – lawn roller example 143
5.1.2 Residual plots 143
5.1.3 Iron slag example: is there a pattern in the residuals? 145
5.1.4 The analysis of variance table 147
5.2 Outliers, influence, and robust regression 147
5.3 Standard errors and confidence intervals 149
5.3.1 Confidence intervals and tests for the slope 150
5.3.2 SEs and confidence intervals for predicted values 150
5.3.3∗ Implications for design 151
5.4 Assessing predictive accuracy 152
5.4.1 Training/test sets and cross-validation 153
5.4.2 Cross-validation – an example 153
5.4.3∗ Bootstrapping 155
5.5 Regression versus qualitative anova comparisons – issues of power 158
5.6 Logarithmic and other transformations 160
5.6.1∗ A note on power transformations 160
5.6.2 Size and shape data – allometric growth 161
5.7 There are two regression lines! 162
5.8 The model matrix in regression 163
5.9∗ Bayesian regression estimation using the MCMCpack package 165
5.10 Recap 166
5.11 Methodological references 167
5.12 Exercises 167
6 Multiple linear regression 170
6.1 Basic ideas: a book weight example 170
6.1.1 Omission of the intercept term 172
6.1.2 Diagnostic plots 173
6.2 The interpretation of model coefficients 174
6.2.1 Times for Northern Irish hill races 174
6.2.2 Plots that show the contribution of individual terms 177
6.2.3 Mouse brain weight example 179
6.2.4 Book dimensions, density, and book weight 181
6.3 Multiple regression assumptions, diagnostics, and efficacy measures 183
6.3.1 Outliers, leverage, influence, and Cook’s distance 183
6.3.2 Assessment and comparison of regression models 186
6.3.3 How accurately does the equation predict? 187
6.4 A strategy for fitting multiple regression models 189
6.4.1 Suggested steps 190
6.4.2 Diagnostic checks 191
6.4.3 An example – Scottish hill race data 191
6.5 Problems with many explanatory variables 196
6.5.1 Variable selection issues 197
6.6 Multicollinearity 199
6.6.1 The variance inflation factor 201
6.6.2 Remedies for multicollinearity 203
6.7 Errors in x 203
6.8 Multiple regression models – additional points 208
6.8.1 Confusion between explanatory and response variables 208
6.8.2 Missing explanatory variables 208
6.8.3∗ The use of transformations 210
6.8.4∗ Non-linear methods – an alternative to transformation? 210
6.9 Recap 212
6.10 Further reading 212
6.11 Exercises 214
7 Exploiting the linear model framework 217
7.1 Levels of a factor – using indicator variables 217
7.1.1 Example – sugar weight 217
7.1.2 Different choices for the model matrix when there are factors 220
7.2 Block designs and balanced incomplete block designs 222
7.2.1 Analysis of the rice data, allowing for block effects 222
7.2.2 A balanced incomplete block design 223
7.3 Fitting multiple lines 224
7.4 Polynomial regression 228
7.4.1 Issues in the choice of model 229
7.5∗ Methods for passing smooth curves through data 231
7.5.1 Scatterplot smoothing – regression splines 232
7.5.2∗ Roughness penalty methods and generalized additive models 235
7.5.3 Distributional assumptions for automatic choice of roughness penalty 236
7.5.4 Other smoothing methods 236
7.6 Smoothing with multiple explanatory variables 238
7.6.1 An additive model with two smooth terms 238
7.6.2∗ A smooth surface 240
7.7 Further reading 240
7.8 Exercises 240
8 Generalized linear models and survival analysis 244
8.1 Generalized linear models 244
8.1.1 Transformation of the expected value on the left 244
8.1.2 Noise terms need not be normal 245
8.1.3 Log odds in contingency tables 245
8.1.4 Logistic regression with a continuous explanatory variable 246
8.2 Logistic multiple regression 249
8.2.1 Selection of model terms, and fitting the model 252
8.2.2 Fitted values 254
8.2.3 A plot of contributions of explanatory variables 255
8.2.4 Cross-validation estimates of predictive accuracy 255
8.3 Logistic models for categorical data – an example 256
8.4 Poisson and quasi-Poisson regression 258
8.4.1 Data on aberrant crypt foci 258
8.4.2 Moth habitat example 261
8.5 Additional notes on generalized linear models 266
8.5.1∗ Residuals, and estimating the dispersion 266
8.5.2 Standard errors and z- or t-statistics for binomial models 267
8.5.3 Leverage for binomial models 268
8.6 Models with an ordered categorical or categorical response 268
8.6.1 Ordinal regression models 269
8.6.2∗ Loglinear models 272
8.7 Survival analysis 272
8.7.1 Analysis of the Aids2 data 273
8.7.2 Right-censoring prior to the termination of the study 275
8.7.3 The survival curve for male homosexuals 276
8.7.4 Hazard rates 276
8.7.5 The Cox proportional hazards model 277
8.8 Transformations for count data 279
8.9 Further reading 280
8.10 Exercises 281
9 Time series models 283
9.1 Time series – some basic ideas 283
9.1.1 Preliminary graphical explorations 283
9.1.2 The autocorrelation and partial autocorrelation function 284
9.1.3 Autoregressive models 285
9.1.4∗ Autoregressive moving average models – theory 287
9.1.5 Automatic model selection? 288
9.1.6 A time series forecast 289
9.2∗ Regression modeling with ARIMA errors 291
9.3∗ Non-linear time series 298
9.4 Further reading 300
9.5 Exercises 301
10 Multi-level models and repeated measures 303
10.1 A one-way random effects model 304
10.1.1 Analysis with aov() 305
10.1.2 A more formal approach 308
10.1.3 Analysis using lmer() 310
10.2 Survey data, with clustering 313
10.2.1 Alternative models 313
10.2.2 Instructive, though faulty, analyses 318
10.2.3 Predictive accuracy 319
10.3 A multi-level experimental design 319
10.3.1 The anova table 321
10.3.2 Expected values of mean squares 322
10.3.3∗ The analysis of variance sums of squares breakdown 323
10.3.4 The variance components 325
10.3.5 The mixed model analysis 326
10.3.6 Predictive accuracy 328
10.4 Within- and between-subject effects 329
10.4.1 Model selection 329
10.4.2 Estimates of model parameters 331
10.5 A generalized linear mixed model 332
10.6 Repeated measures in time 334
10.6.1 Example – random variation between profiles 336
10.6.2 Orthodontic measurements on children 340
10.7 Further notes on multi-level and other models with correlated errors 344
10.7.1 Different sources of variance – complication or focus of interest? 344
10.7.2 Predictions from models with a complex error structure 345
10.7.3 An historical perspective on multi-level models 345
10.7.4 Meta-analysis 347
10.7.5 Functional data analysis 347
10.7.6 Error structure in explanatory variables 347
10.8 Recap 347
10.9 Further reading 348
10.10 Exercises 349
11 Tree-based classification and regression 351
11.1 The uses of tree-based methods 352
11.1.1 Problems for which tree-based regression may be used 352
11.2 Detecting email spam – an example 353
11.2.1 Choosing the number of splits 356
11.3 Terminology and methodology 356
11.3.1 Choosing the split – regression trees 357
11.3.2 Within and between sums of squares 357
11.3.3 Choosing the split – classification trees 358
11.3.4 Tree-based regression versus loess regression smoothing 359
11.4 Predictive accuracy and the cost–complexity trade-off 361
11.4.1 Cross-validation 361
11.4.2 The cost–complexity parameter 362
11.4.3 Prediction error versus tree size 363
11.5 Data for female heart attack patients 363
11.5.1 The one-standard-deviation rule 365
11.5.2 Printed information on each split 366
11.6 Detecting email spam – the optimal tree 366
11.7 The randomForest package 369
11.8 Additional notes on tree-based methods 372
11.9 Further reading and extensions 373
11.10 Exercises 374
12 Multivariate data exploration and discrimination 377
12.1 Multivariate exploratory data analysis 378
12.1.1 Scatterplot matrices 378
12.1.2 Principal components analysis 379
12.1.3 Multi-dimensional scaling 383
12.2 Discriminant analysis 385
12.2.1 Example – plant architecture 386
12.2.2 Logistic discriminant analysis 387
12.2.3 Linear discriminant analysis 388
12.2.4 An example with more than two groups 390
12.3∗ High-dimensional data, classification, and plots 392
12.3.1 Classifications and associated graphs 394
12.3.2 Flawed graphs 394
12.3.3 Accuracies and scores for test data 398
12.3.4 Graphs derived from the cross-validation process 404
12.4 Further reading 406
12.5 Exercises 407
13 Regression on principal component or discriminant scores 410
13.1 Principal component scores in regression 410
13.2∗ Propensity scores in regression comparisons – labor training data 414
13.2.1 Regression comparisons 417
13.2.2 A strategy that uses propensity scores 419
13.3 Further reading 426
13.4 Exercises 426
14 The R system – additional topics 427
14.1 Graphical user interfaces to R 427
14.1.1 The R Commander’s interface – a guide to getting started 428
14.1.2 The rattle GUI 429
14.1.3 The creation of simple GUIs – the fgui package 429
14.2 Working directories, workspaces, and the search list 430
14.2.1∗ The search path 430
14.2.2 Workspace management 430
14.2.3 Utility functions 431
14.3 R system configuration 432
14.3.1 The R Windows installation directory tree 432
14.3.2 The library directories 433
14.3.3 The startup mechanism 433
14.4 Data input and output 433
14.4.1 Input of data 434
14.4.2 Data output 437
14.4.3 Database connections 438
14.5 Functions and operators – some further details 438
14.5.1 Function arguments 439
14.5.2 Character string and vector functions 440
14.5.3 Anonymous functions 441
14.5.4 Functions for working with dates (and times) 441
14.5.5 Creating groups 443
14.5.6 Logical operators 443
14.6 Factors 444
14.7 Missing values 446
14.8∗ Matrices and arrays 448
14.8.1 Matrix arithmetic 450
14.8.2 Outer products 451
14.8.3 Arrays 451
14.9 Manipulations with lists, data frames, matrices, and time series 452
14.9.1 Lists – an extension of the notion of “vector” 452
14.9.2 Changing the shape of data frames (or matrices) 454
14.9.3∗ Merging data frames – merge() 455
14.9.4 Joining data frames, matrices, and vectors – cbind() 455
14.9.5 The apply family of functions 456
14.9.6 Splitting vectors and data frames into lists – split() 457
14.9.7 Multivariate time series 458
14.10 Classes and methods 458
14.10.1 Printing and summarizing model objects 459
14.10.2 Extracting information from model objects 460
14.10.3 S4 classes and methods 460
14.11 Manipulation of language constructs 461
14.11.1 Model and graphics formulae 461
14.11.2 The use of a list to pass arguments 462
14.11.3 Expressions 463
14.11.4 Environments 463
14.11.5 Function environments and lazy evaluation 464
14.12∗ Creation of R packages 465
14.13 Document preparation – Sweave() and xtable() 467
14.14 Further reading 468
14.15 Exercises 469
15 Graphs in R 472
15.1 Hardcopy graphics devices 472
15.2 Plotting characters, symbols, line types, and colors 472
15.3 Formatting and plotting of text and equations 474
15.3.1 Symbolic substitution of symbols in an expression 475
15.3.2 Plotting expressions in parallel 475
15.4 Multiple graphs on a single graphics page 476
15.5 Lattice graphics and the grid package 477
15.5.1 Groups within data, and/or columns in parallel 478
15.5.2 Lattice parameter settings 480
15.5.3 Panel functions, strip functions, strip labels, and other annotation 483
15.5.4 Interaction with lattice (and other) plots – the playwith package 485
15.5.5 Interaction with lattice plots – focus, interact, unfocus 485
15.5.6 Overlaid plots with different scales 486
15.6 An implementation of Wilkinson’s Grammar of Graphics 487
15.7 Dynamic graphics – the rgl and rggobi packages 491
15.8 Further reading 492
Epilogue 493
References 495
Index of R symbols and functions 507
Index of terms 514
Index of authors 523
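Download:
Data Analysis and Graphics Using R, Third Edition.rar
(4.55 MB, downloads: 1, price: 5)
Note:
Many people have a habit of collecting piles of material without ever reading it. To encourage effective use of resources and build the habit of reading each book you download, downloads here require points; thank you for your understanding.
Take part in forum activities and help others, and you will easily earn enough points!
Enjoy!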