This is the third article in our series from Data Science for Anyone that covers ways in which ordinary people can develop skills in data science to apply in their day job, advance in their career or succeed in data science courses. Don’t miss the first featured article in our series which reviews the data science learning resources on our website, as well as the second featured article in our series, which covers five tips to help you succeed in a career in data science.
One common question that often comes up in the data science field is: which statistical software should I learn first? Although each software program has its own unique functionalities and pros and cons, many users find that learning R is slightly easier than Python or Stata. We tend to agree and think that running R within the RStudio GUI has an excellent balance of usability, advanced functionalities and user-written packages.
Within R, there are many packages (16,037 at the time of writing this article!) housed in CRAN (the Comprehensive R Archive Network), an online repository. These packages expand the functionality of the Base R commands that come built in with R. You might be asking: how do I install and run packages in R? Each package can be downloaded into your version of R by typing: install.packages("package_name"). To load a package in R, you can simply type the following into the R console or run it as part of an R script: library(package_name).
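As a quick illustration of the install-then-load pattern, here is a minimal sketch. The choice of stringr is only an example, and the check with requireNamespace() is our own addition so that the package is not downloaded again if it is already installed.

```r
# Install a package from CRAN only if it is not already available.
# "stringr" is just an example; substitute any package name.
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# Load the package for the current R session.
library(stringr)

# Confirm that the package is attached and check which version is installed.
is_attached <- "package:stringr" %in% search()
print(is_attached)
print(packageVersion("stringr"))
```

You only need to install a package once, but you need to call library() in every new R session in which you want to use it.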
Here are our top 10 picks for R packages
1. tidyverse
The tidyverse is not quite a package itself, but instead a collection of packages that allow for the manipulation, transformation, analysis and visualization of data. Two of its core packages are dplyr and ggplot2, which each have their own unique application. dplyr is designed to make basic data transformation tasks easy, with functions to summarize, aggregate and combine datasets for analysis. For data visualization, the ggplot2 package within the tidyverse provides a full range of tools to create beautiful graphs and charts, using a straightforward, easy-to-use grammar of graphics. The tidyverse has a truly amazing user base and numerous helpful web resources to help you get started. Check out this page to get started, which includes extensive learning resources and tips to understand the basics of the tidyverse. The book, R for Data Science, is another exceptional resource for learning both the basics and more advanced functionalities within the tidyverse.
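To give a flavor of data transformation with dplyr, the tidyverse's data manipulation package, here is a short sketch. It assumes dplyr is installed and uses the mtcars dataset that ships with R; the 100 horsepower cutoff and the column names n_cars and mean_mpg are made up for illustration.

```r
library(dplyr)

# Keep cars with more than 100 horsepower, then summarize by cylinder count.
cyl_summary <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

print(cyl_summary)
```

Each step takes a data frame and returns a data frame, which is why the steps chain together so naturally with the pipe operator.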
2. Hmisc
The Hmisc package contains a full range of useful functions that come in handy at all stages of working with data. It is especially useful because it builds on Base R functions to provide better or more efficient ways of recoding and clustering variables, computing power and sample sizes, imputing missing values, annotating data, importing datasets and manipulating text strings. For users who are more familiar working with LaTeX or HTML, it also allows for the conversion of R objects to LaTeX and HTML code, which is particularly useful for creating reports or publishing R projects on the internet.
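Here is a small sketch of three Hmisc functions that cover the describing, imputing and recoding tasks mentioned above. It assumes Hmisc is installed; the vector x is made-up illustration data.

```r
library(Hmisc)

# A small numeric vector with missing values (made-up illustration data).
x <- c(12, 15, NA, 22, 18, NA, 30)

# describe() prints a compact summary, including the number of missing values.
print(describe(x))

# impute() fills in missing values, here with the median of the observed values.
x_filled <- impute(x, fun = median)
print(x_filled)

# cut2() recodes a continuous variable into groups of roughly equal size.
mpg_groups <- cut2(mtcars$mpg, g = 4)
print(table(mpg_groups))
```

Median imputation is only one simple option; Hmisc also offers more sophisticated approaches for serious missing data problems.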
3. ggplot2 (part of the tidyverse)
As described above, for those who have found the Base R graphics creation and manipulation capabilities somewhat limited, ggplot2 is a phenomenal collection of tools to create incredible visualizations. ggplot2 is particularly great due to the ability to customize graphics pretty much any way you want and it has a spectacular community of fellow ggplot2 enthusiasts on StackOverflow to help answer any questions.
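The layered approach of ggplot2 is easiest to see in a small example. This sketch assumes ggplot2 is installed and uses the built-in mtcars dataset; the titles and labels are our own.

```r
library(ggplot2)

# Build the plot layer by layer: data and aesthetics, points, trend lines, labels.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Fuel economy by weight",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# Draw the plot in the active graphics device (the Plots pane in RStudio).
print(p)
```

Because the plot is stored as an object, you can keep adding or swapping layers and themes until it looks the way you want.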
4. data.table
For anyone working with huge datasets, data.table is soon to become one of your favorite packages in R. It mimics functions in Base R such as merging and importing datasets, yet does these tasks considerably faster, especially with big data. The coding syntax is intuitive and quick to learn, making it a relatively simple addition to any data scientist’s toolbox. Whether you are reading in tab or comma separated values files, data.table offers extremely fast write and read speeds, as well as quick addition or removal of columns and aggregation of datasets that take up a large amount of RAM while running in R.
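The sketch below walks through the tasks described above: fast writing and reading, adding and removing columns, aggregating and merging. It assumes data.table is installed. The mtcars dataset is tiny, so it only illustrates the syntax, not the speed; the column names kpl and engine are made up.

```r
library(data.table)

# Convert a data frame to a data.table, keeping the row names as a column.
dt <- as.data.table(mtcars, keep.rownames = "model")

# Write to a CSV file and read it back with the fast fwrite() and fread().
csv_path <- tempfile(fileext = ".csv")
fwrite(dt, csv_path)
cars <- fread(csv_path)

# Add a column by reference (no copy of the data is made).
cars[, kpl := mpg * 0.4251]

# Aggregate: number of cars and mean mpg for each cylinder count.
by_cyl <- cars[, .(n_cars = .N, mean_mpg = mean(mpg)), by = cyl][order(cyl)]
print(by_cyl)

# Merge with a small lookup table.
labels <- data.table(cyl = c(4L, 6L, 8L), engine = c("small", "medium", "large"))
merged <- merge(cars, labels, by = "cyl")

# Remove a column by reference.
cars[, kpl := NULL]
```

The general form is dt[i, j, by]: pick rows with i, compute with j, and group with by.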
5. summarytools
The summarytools package makes it easy to run basic statistics, including means, standard deviations, medians, ranges, frequency tables and cross-tabulations with Chi-square tests. It allows for quick and efficient exporting to HTML or text, which can be copied into word processing programs or uploaded directly to a website or blog to share with the world. Even cooler, the summarytools package allows the use of the pipe operator ( %>% ), which is also used in the tidyverse and can string multiple operations together in R to make your work more efficient and free-flowing.
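Here is a brief sketch of the three workhorse functions in summarytools, assuming the package is installed and using the built-in mtcars dataset. The object names are our own.

```r
library(summarytools)

# Frequency table for the number of cylinders.
cyl_freq <- freq(mtcars$cyl)
print(cyl_freq)

# Descriptive statistics for a few numeric variables.
car_stats <- descr(
  mtcars[, c("mpg", "hp", "wt")],
  stats = c("mean", "sd", "med", "min", "max")
)
print(car_stats)

# Cross-tabulation of cylinders by transmission type, with a Chi-square test.
# With a dataset this small, R may warn that the approximation is rough.
cyl_by_am <- ctable(mtcars$cyl, mtcars$am, chisq = TRUE)
print(cyl_by_am)
```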
6. psych
Not just limited to data scientists who are conducting psychological research, the psych package contains a full-featured suite of tools to conduct descriptive and inferential statistical analysis. For anyone conducting cluster analysis, factor analysis or even item response theory-based analysis, this package has functionalities for these techniques, along with an excellent user base online to help with any issues that come up while using the psych package. It also has great functions for working with survey data to create scales and test for internal consistency reliability of indicators.
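As a small example of the descriptive and reliability functions, the sketch below assumes psych is installed. It uses two datasets that ship with R: mtcars, and attitude, a small employee survey whose seven items we treat as a single scale purely for illustration.

```r
library(psych)

# Descriptive statistics (n, mean, sd, median, skew, kurtosis and more).
car_stats <- psych::describe(mtcars[, c("mpg", "hp", "wt")])
print(car_stats)

# Cronbach's alpha as a measure of internal consistency reliability.
reliability <- psych::alpha(attitude)
raw_alpha <- reliability$total$raw_alpha
print(raw_alpha)
```

We call the functions as psych::describe() and psych::alpha() because other packages, such as Hmisc and ggplot2, have functions with the same names.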
7. stargazer
If you make descriptive statistics or multivariate regression tables, then stargazer might be your new favorite package in R. The old way of making tables might have meant creating tables in SPSS or Stata and copying the whole table or the numbers in each cell (!) to a word processing program like Microsoft Word or a spreadsheet program like Microsoft Excel. This is, of course, very time-consuming and can be error-prone, because it relies on manually copying and pasting, instead of using programmatic operations to automatically output the table. With stargazer, you can export summary statistics and multivariate regression tables to text, HTML or LaTeX with added features such as asterisks (* p < 0.05, anyone?) for statistical significance, model fit statistics, titles and other table aesthetics. Even better, the package has excellent documentation, which is available here using the NYC Flights dataset, along with many helpful examples of the package in use.
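To show what this looks like in practice, here is a minimal sketch that assumes stargazer is installed. The two regression models on the built-in mtcars dataset, the table titles and the output file name are made up for illustration.

```r
library(stargazer)

# Summary statistics table, printed as plain text in the console.
stargazer(mtcars[, c("mpg", "wt", "hp")], type = "text",
          title = "Summary statistics")

# Two nested regression models to display side by side.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Regression table with significance stars at the 0.05, 0.01 and 0.001 levels.
stargazer(m1, m2, type = "text",
          title = "Regression results",
          star.cutoffs = c(0.05, 0.01, 0.001))

# The same table written to an HTML file that Word or a browser can open.
html_path <- file.path(tempdir(), "regression_table.html")
stargazer(m1, m2, type = "html", out = html_path)
```

Switching type to "latex" produces LaTeX code instead, with no other changes needed.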
8. haven
Along your data science journey, you will inevitably encounter datasets in many different formats, such as Comma Separated Values files (.csv), Microsoft Excel files (.xls or .xlsx), SPSS data files (.sav), Stata data files (.dta) or SAS data files (.sas7bdat). Each of these files stores data in specific formats, which must be opened in particular software programs. The one exception is .csv files, which can be opened in Notepad or easily read into R using the read.csv() or read.table() functions. For the other types of data files, the haven package really comes in handy. It uses simple syntax to import SPSS, Stata and SAS files into R as a data.frame. Even better, you can use haven to export datasets in R to each of these three formats, making it indispensable if you work with colleagues or collaborators who are SPSS, Stata or SAS users. For more information about these software programs and to get started, check out our guides to SPSS and Stata.
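The round trip below is a small sketch of exporting and importing with haven, assuming the package is installed. The survey data frame and the temporary file paths are made up for illustration; in real work you would point to your own files.

```r
library(haven)

# A tiny made-up dataset to export.
survey <- data.frame(
  id = 1:3,
  age = c(34, 51, 27),
  region = c("north", "south", "east")
)

# Export to Stata (.dta) and SPSS (.sav) formats.
dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")
write_dta(survey, dta_path)
write_sav(survey, sav_path)

# Import the files back into R.
from_stata <- read_dta(dta_path)
from_spss <- read_sav(sav_path)
print(from_stata)

# SAS files are read the same way, for example read_sas("my_file.sas7bdat"),
# where "my_file.sas7bdat" stands in for a file of your own.
```

The imported objects are tibbles, the tidyverse flavor of a data.frame, so they work with both Base R and tidyverse functions.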
9. stringr (also part of the tidyverse)
Although people often think of data as strictly composed of numbers, a large portion of data consists of words or sentences, which is known as string data or text data. Working with this type of data involves special tasks, such as replacing letters based on their position or substituting some characters for others. This is where the stringr package comes in. It is made up of a set of simple and intuitive commands to manipulate and transform strings, which is especially useful if you are conducting data cleaning, text mining analyses, natural language processing or content analysis.
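Here is a short sketch of a few common stringr commands, assuming the package is installed. The comments vector is made-up illustration data.

```r
library(stringr)

# Made-up free-text responses with messy spacing and capitalization.
comments <- c("  Great product!! ", "terrible SERVICE", "great value, Great price")

# Clean up: trim and collapse extra whitespace, then convert to lower case.
clean <- str_to_lower(str_squish(comments))
print(clean)

# Detect and count a word in each response.
has_great <- str_detect(clean, "great")
n_great <- str_count(clean, "great")

# Substitute characters: remove every exclamation mark.
no_bang <- str_replace_all(clean, "!", "")

# Extract characters based on their position: the first five of each string.
first_five <- str_sub(clean, 1, 5)

print(has_great)
print(n_great)
print(no_bang)
print(first_five)
```

All stringr functions start with str_ and take the string as their first argument, which makes them easy to discover and to chain with the pipe operator.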
10. knitr
An exciting development in recent years is the rise of R Markdown files, which allow users to embed and run R code as part of static or dynamic reports that can be exported to PDF or HTML for publishing or sharing with others. Within RStudio, the knitr package provides full-featured functionalities to create, modify and share beautiful, fully-reproducible reports created using the R Markdown (.Rmd) format. Check out this page for an exceptional introduction to R Markdown files and to get an idea of how to incorporate Markdown into your own data science skill set.
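To make the idea concrete, the sketch below writes a tiny R Markdown file from within R and knits it to plain Markdown. It assumes knitr is installed; the report text, chunk name and file names are made up for illustration. In RStudio you would normally create the .Rmd file in the editor and click the Knit button instead.

```r
library(knitr)

# Three backticks mark the start and end of a code chunk in R Markdown.
fence <- strrep("`", 3)

# The contents of a minimal R Markdown file: a header, text and one R chunk.
rmd_lines <- c(
  "---",
  "title: \"Fuel economy report\"",
  "output: html_document",
  "---",
  "",
  "The average fuel economy in the mtcars dataset is computed below.",
  "",
  paste0(fence, "{r mean-mpg}"),
  "mean(mtcars$mpg)",
  fence
)

rmd_path <- file.path(tempdir(), "report.Rmd")
md_path <- file.path(tempdir(), "report.md")
writeLines(rmd_lines, rmd_path)

# knit() runs the R code and weaves the results into a Markdown file.
knit(rmd_path, output = md_path, quiet = TRUE)

report <- readLines(md_path)
cat(report, sep = "\n")
```

To go all the way to HTML or PDF, the rmarkdown package's render() function runs knitr and then converts the result with pandoc, which is what the Knit button in RStudio does behind the scenes.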
The bottom line
These top 10 packages in R, combined with the extensive Base R functions, provide many or all of the data science tools that you will need to complete most tasks and projects. For more specialized analyses or data types, you might need to venture beyond these 10 packages. For example, more advanced analyses such as machine learning algorithms, time series forecasting, structural equation modeling and multiple imputation require additional user-written packages from CRAN. We will be covering these more sophisticated packages in a future article, so make sure to subscribe to our website updates using the button below and don’t forget to follow us on Twitter, Pinterest and Instagram!
Extra Tip: It is important to keep packages updated to their latest versions. RStudio makes this easy with its “Check for Package Updates…” button (see below), which allows you to update one or more packages that have new versions available to download from CRAN.
Don’t forget to also keep Base R and RStudio updated to their most recent versions by following our step-by-step guide.