After you have installed R and RStudio, you are ready to import your first data set and run some basic statistics. This page briefly summarizes how to open RStudio for the first time, load a data set, and run basic descriptive statistics. If you need help downloading and installing base R and RStudio, take a look at our Installing Statistical Software Programs page.
If you are looking to calculate basic statistics in other statistics packages, such as Python, Stata or SPSS, click here for a step-by-step guide for these programs.
First, you need to import some data into R from a Comma Separated Values (.csv) file, an Excel file or an SPSS file. Here is some easy-to-understand code to accomplish these tasks before moving on to actual data analysis.
# Import data into RStudio's memory from a
# Comma Separated Values (.csv) file
df <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")

# Code to import an Excel data set with the xlsx package
# Import the first worksheet from the workbook
# excelfile.xlsx with a first row that has
# variable names
install.packages("xlsx")
library(xlsx)
df <- read.xlsx("c:/excelfile.xlsx", 1)

# Import dataset from "mysheet" in "excelfile.xlsx"
df <- read.xlsx("c:/excelfile.xlsx", sheetName = "mysheet")

# Bonus: Code to import an SPSS dataset with the Hmisc package
# The "use.value.labels=TRUE" option converts value
# labels to R factors
library(Hmisc)
df <- spss.get("c:/mydata.por", use.value.labels=TRUE)
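Before pointing these commands at your own files, it can help to try the import step on a file whose contents you already know. The sketch below is one way to do that using only base R. It writes a tiny made-up data set to a temporary .csv file and reads it back with read.csv(), which is read.table() with header = TRUE and sep = "," already set. The object names (example_data, csv_path, df_check) and the columns id, height and weight are invented for illustration.

```r
# Hypothetical example data; the column names are made up for illustration
example_data <- data.frame(
  id     = c("a", "b", "c"),
  height = c(150.5, 162.0, 171.2),
  weight = c(52, 61, 70)
)

# Write it to a temporary .csv file so the example does not touch your own files
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read it back, using the "id" column as row names
df_check <- read.csv(csv_path, row.names = "id")

# Inspect the structure and the first rows of the imported data frame
str(df_check)
head(df_check)
```

If str() shows the columns and types you expect, the same call with your own file path should behave the same way.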
Next, you can run some descriptive statistics, examine relationships between variables and create beautiful data visualizations with the ggplot2 package.
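# Calculate descriptive statistics (minimum, quartiles, median,
# mean and maximum) for all variables in the data frame
summary(df)

# Standard deviation of each numeric variable
sapply(df[sapply(df, is.numeric)], sd, na.rm = TRUE)

# Examine the Pearson correlation between two variables from the data frame
# (replace X1 and Y1 with your own column names)
cor(df$X1, df$Y1, method = "pearson", use = "complete.obs")

# Create a correlogram of all numeric variables with the corrplot package
install.packages("corrplot")
library(corrplot)
corrplot(cor(df[sapply(df, is.numeric)], use = "complete.obs"))

# Load ggplot2 and create a scatterplot between two variables
library(ggplot2)
ggplot(df, aes(x = X1, y = Y1)) + geom_point()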
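If you do not have your own data loaded yet, the same steps can be tried on mtcars, a small example data set that ships with base R. This is only a sketch of the workflow under that assumption. The object names sds, r_mpg_wt and cor_matrix are chosen here for illustration, and no extra packages are needed.

```r
# mtcars is a small example data set that ships with base R
data(mtcars)

# Minimum, quartiles, median, mean and maximum for every column
summary(mtcars)

# summary() does not report standard deviations, so compute them per column
sds <- sapply(mtcars, sd)
round(sds, 2)

# Pearson correlation between fuel economy (mpg) and weight (wt)
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt, method = "pearson")
r_mpg_wt

# Correlation matrix for all variables in the data frame
cor_matrix <- cor(mtcars)
round(cor_matrix, 2)
```

Note that cor() expects numeric input, which is why the earlier code filters for numeric columns before computing correlations on an arbitrary data frame. Every column in mtcars is already numeric, so no filtering is needed here.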
There are many additional advanced topics in data science and statistics, and we at Data Science for Anyone are hard at work creating new expert guides for R, Python, Stata and SPSS. For an excellent resource that helps you find help resources for any package in R (including the statistics packages!), check out this super helpful search platform for R packages.
Make sure to check back on our site often for updates. We also highly recommend the educational resources on data science and statistics hosted by the Institute for Digital Research & Education Statistical Consulting (IDRE)!