Basic Statistics: Easy-To-Understand Code Examples in R, Python, Stata and SPSS

As you apply data science techniques to gain insights from data, there are a few important steps to follow. This page outlines these key steps and shows how to conduct them in R, Python, SPSS and Stata. Before following any of these steps, make sure you have downloaded and installed the statistical software package you will be using.

To get started off right, check out our guide to Installing Statistical Software Programs on your computer. To conduct any analysis, you will need some real data to work with. If you want some great ideas for actual data sets to analyze, check out our expert guides to Online Survey Data and Datasets to Download.

First, you need to know how to load your data into a statistical software program. Each program has a slightly different way to accomplish this task, and we describe the processes below.


R: Many users struggle at first to get data loaded into R. There are two ways to import data in R.

Point and Click: The first method requires you to simply click on “File”, then go to “Import” and select the type of data set you would like to import. In most cases, you will be using Excel or .csv files, which are the first and second options in the menu. After you select your file, it will be automatically loaded into R to allow analyses.

Using Code in R: You can also import data using coding syntax in R. There are several ways to accomplish this task, including functions from add-on packages (read_excel from readxl, read_csv from readr, read.xlsx from xlsx and spss.get from Hmisc), as well as the base R functions for reading delimited text files (read.table, read.csv). Because it is the easiest and does not require loading packages, we will first cover importing data in base R here. Next, we go into two packages that you can use to import Microsoft Excel files and SPSS files directly into R. Check out the syntax below:

# Code to import a Comma Separated Values (.csv) data set
# in base R

# "df" can be named whatever you choose 
# row.names="id" uses the column named "id" as row names;
# omit it (or set row.names=NULL) if your file has no such column

df <- read.table("c:/mydata.csv", header=TRUE,
   sep=",", row.names="id")


# Code to import an Excel data set with the xlsx package

# Import the first worksheet from the workbook 
# excelfile.xlsx with a first row that has 
# variable names

install.packages("xlsx")

library(xlsx)
df <- read.xlsx("c:/excelfile.xlsx", 1)

# Import dataset from "mysheet" in "excelfile.xlsx"
df <- read.xlsx("c:/excelfile.xlsx", sheetName = "mysheet") 


# Bonus: Code to import an SPSS dataset with the Hmisc package
# The "use.value.labels=TRUE" option converts value 
# labels to R factors 

library(Hmisc)
df <- spss.get("c:/mydata.por", use.value.labels=TRUE)

After you have loaded your data into R’s memory and have assigned the data set object a name, you are ready to run some basic statistical analyses. Here is some sample code in R to run descriptive statistics (mean, median, quartiles, standard deviation, and range), check the correlation between two variables and then visualize this relationship in a scatter plot.

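# Descriptive statistics (min, quartiles, median, mean, max) for every column
# Replace x and y below with the names of your own variables
summary(df)

# Standard deviation is not part of summary(), so compute it separately
sd(df$x, na.rm = TRUE)

# Pearson correlation between x and y; cor.test() adds a significance test
cor(df$x, df$y, method = "pearson")
cor.test(df$x, df$y, method = "pearson")

# Scatter plot of y against x
library(ggplot2)
ggplot(df, aes(x = x, y = y)) + geom_point()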


Python: Similar to R, Python has a learning curve when it comes to importing data, partly because it began as a general-purpose programming language used mainly by software developers. In recent years, Python has taken off as a data science platform.

To import data in Python, and in the Spyder Integrated Development Environment (IDE) in particular, there are three steps to follow. First, follow the installation instructions for pandas for your version of Python. Second, make sure that you are able to install three other major packages for data science (numpy, scipy and matplotlib) using the code below. Third, load each package with an import statement so that its commands are available in your session.

# Code to install and load the pandas, numpy, scipy
# and matplotlib packages to allow for
# data frame use and calculation of statistics

# First, install the four major data science packages.
# These are shell commands, not Python code: run them in a
# terminal (or the Anaconda Prompt on Windows), not in a script.
# If you installed Spyder through Anaconda, these packages
# are usually already installed.

pip install numpy

pip install scipy

pip install pandas

pip install matplotlib

# Next, you need to load each of these packages
# to allow you to run their commands on your data set(s)

import numpy as np

import scipy as sp

import pandas as pd

import matplotlib.pyplot as plt

Next, you are ready to create some test data of your own to work with or to import some real data from your computer or an online source. The next blocks of code describe how to create and import data from a wide variety of sources in Python.

# Code to create dataset from scratch

# Pandas importation and examples of dataframes

import pandas as pd

# Create a simple dataset of people, their locations and ages
data = {
    'name': ["Mike", "Sarah", "Steve", "Anthony"],
    'Location': ["Boston", "New York", "Miami", "Chicago"],
    'Age': [30, 48, 22, 35],
}

data_pandas = pd.DataFrame(data)

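# Code to import a Comma Separated Values (.csv) data set
# header = None tells pandas the file has no row of column names;
# leave it out if the first row of your file contains the names

mydata1 = pd.read_csv("C:\\file_path\\file1.csv", header = None)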


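# Code to import an Excel (.xls or .xlsx) data set into pandas in Spyder
# Note: the option is spelled sheet_name (with an underscore)
# Reading .xls files requires the xlrd package; .xlsx requires openpyxl

mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)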

# If you would like to add column (i.e., variable) names, use this code

mydata2 = pd.read_csv("C:\\file_path\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])

# Another really cool feature of pandas is the ease of
# importing .csv files from internet sources

mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")
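
To see what the header and names options actually do, here is a minimal sketch that reads the same small data set twice. The CSV text, column names and variable names are made up for illustration, and io.StringIO simply stands in for a file on your computer.

```python
import io
import pandas as pd

# Hypothetical CSV content, once without and once with a header row
csv_no_header = "1,Mike,52000\n2,Sarah,61000\n"
csv_with_header = "ID,first_name,salary\n" + csv_no_header

# File without a header row: tell pandas so and supply the names yourself
no_header_df = pd.read_csv(
    io.StringIO(csv_no_header),
    header=None,
    names=["ID", "first_name", "salary"],
)

# File with a header row: the default settings pick up the names
with_header_df = pd.read_csv(io.StringIO(csv_with_header))

# A quick check that the import worked as expected
print(no_header_df.shape)    # (rows, columns)
print(no_header_df.dtypes)   # data type pandas inferred for each column
print(no_header_df.head())   # first rows of the data
```

Both calls should give the same data frame. If you use header=None on a file that does have a header row, the names end up as the first row of data, so it is worth checking the first few rows after every import.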

After you have successfully imported or created data in your Python session, you are ready to glean some insights from your data by running some essential basic statistics. Check out the code below and comments explaining the step-by-step process using pandas.

# Basic descriptive statistics in Python using pandas

# Ensure that you have loaded each of these packages
# with their typical abbreviation in most data science
# applications

import numpy as np

import scipy as sp

import pandas as pd

import matplotlib.pyplot as plt

# Summary statistics including count, mean, standard
# deviation, minimum, quartiles (with median) and maximum

# Based on an example data frame named "df"

df.describe()

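# Correlations between all numeric variables in data frame
# (select_dtypes drops text columns, which cannot be correlated)

df.select_dtypes(include="number").corr()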


# Correlations for two specific variable columns
# in data frame

df['Variable 1 name'].corr(df['Variable 2 name'])


# Scatterplot

plot1 = df.plot.scatter(x='var1',
                        y='var2',
                        c='DarkBlue')
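
If you do not have a data set at hand yet, the following sketch runs the same descriptive statistics on a tiny made-up data frame. The column names (hours, score) and the values are hypothetical and only there so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: hours studied and quiz scores for five students
df_demo = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [2, 4, 5, 4, 5],
})

# count, mean, std, min, quartiles and max for each column
summary = df_demo.describe()

# Median and range are easy to compute directly
medians = df_demo.median()
ranges = df_demo.max() - df_demo.min()

# Pearson correlation between the two columns
r = df_demo["hours"].corr(df_demo["score"])

print(summary)
print(medians)
print(ranges)
print(round(r, 3))
```

With real data, you would replace df_demo with your own data frame and the column names with your own variables.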


Stata: There are two ways to import data in Stata. The first method requires you to simply click on “File”, then go to “Import” and select the type of data set you would like to import. In most cases, you will be using Excel or .csv files, which are the first and second options in the menu. After you select your file, it will be automatically loaded into Stata to allow analyses.

The second method of importing data in Stata requires typing a line of code with the file name on your computer, which is shown below. Stata has many built-in commands, so you do not need to download any additional packages to import Excel files.

* Code to import Excel files in Stata *
* Make sure to substitute your folder, file name 
* and sheet name to make sure your data is imported
* The "clear" command clears Stata's memory for the
* new dataset. Make sure to save often!

clear

import excel "c:\folder\filename.xls", sheet("sheetname")

* Code to import Comma Separated Values (.csv) files in Stata *

clear

import delimited "c:\mydata1.csv"

Next, there are several basic statistics and simple visualizations that should be run to understand the distribution and averages of the variables in any particular data set.

* Code to run descriptive statistics in Stata

summarize x y z

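* For categorical variables (tab1 produces a one-way
* frequency table for each variable; tabulate accepts
* at most two variables)

tab1 x y z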

* Correlations

correlate x y z

* You can also use the pwcorr command

pwcorr x y z

* This command allows for other options, such as
* adding significance stars

pwcorr x y z, sig star(0.05)

* Spearman correlations

spearman x y z

* Scatterplot between two variables

scatter y x

* Add a linear regression line

scatter y x || lfit y x 

Only after you have run these basic descriptive statistics can you move on to more advanced statistics and data analyses.


SPSS: There are two ways to import data in SPSS. The first method requires you to simply click on “File”, then go to “Import” and select the type of data set you would like to import. In most cases, you will be using Excel or .csv files, which are the first and second options in the menu. After you select your file, it will be automatically loaded into SPSS to allow analyses.

You can also use syntax in the SPSS Syntax editor. Create a new Syntax file by clicking on “File”, then “New” and then “Syntax”, and use the code below to import a Comma Separated Values (.csv) file or Excel file.

* Code to import a .csv data set in SPSS

preserve.
get data  /type=txt
  /FILE='c:\filename.csv'
  /ENCODING='UTF8'
  /DELCASE=LINE
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /DATATYPEMIN PERCENTAGE=95.0
  /VARIABLES=
  var1 AUTO
  var2 AUTO
  /MAP.
restore.

cache.
execute.


* Activate dataset to front window

dataset name DataSet1 window=front.

* Code to import an Excel (.xlsx) data set in SPSS

GET DATA /TYPE=XLSX 
  /FILE='C:\path\to\file.xlsx' 
  /SHEET=name 'Name-of-Sheet' 
  /CELLRANGE=full
  /READNAMES=on
  /ASSUMEDSTRWIDTH=32767.
EXECUTE. 
DATASET NAME DataSetExcel WINDOW=FRONT.


get data 
/type=xlsx
/file='c:\dataset1.xlsx'
/sheet=name 'Sheetname'
/cellrange=full
/readnames=on
/assumedstrwidth=32767.

* Activate dataset to front window

dataset name DataSet1 window=front.

After you have successfully loaded your data, we recommend you open the Data View and then the Variable View by clicking on the tabs on the bottom of the SPSS main window.

In the Variable View, you should be able to see your variable information, including the variable names, variable labels (if there are none, you can always assign them yourself), variable types, widths, missing data indicators and measurement level.

In the Data View, you can see the variable names listed on the top (header) row and the actual data values in the cells in the rest of the rows and columns.

To gain a basic understanding of your data, the next step is to run descriptive statistics in SPSS using the code below.

* Code to calculate descriptive statistics and frequencies
* in SPSS

* IMPORTANT: Unlike R, Python and Stata, SPSS requires
* each command to end with a period (".")

* Quartiles are requested with the ntiles subcommand

frequencies variables=x y z
/ntiles=4
/statistics=median mean stddev minimum maximum.

* You can also run descriptives with this command
* However, it does not allow for computing the median
* or quartiles

descriptives variables=x y z
/statistics=mean stddev min max kurtosis skewness semean.

* Next, correlations will show you if any variables
* are significantly associated with each other and how
* strong that association is

correlations
/variables=x y z
/print=twotail nosig
/missing=pairwise.

* If you do see any significant (* starred)
* correlations, you can visualize the relationship
* with a scatterplot in SPSS

graph
/scatterplot(bivar)=x with y
/subtitle "Scatterplot of X and Y (r = - 0.2; n = 128)".

After you have run these basic statistics, there are many additional advanced topics including multivariate regression models, machine learning and classification, factor analysis, structural equation modeling and complex visualization. We are hard at work creating easy-to-understand educational resources for these more advanced topics, so make sure to check back often for updates!

Make sure to also check out our reviews of data science books and courses, our informative expert guides to data science technology and our rundown of top online data science courses. As always, our team at Data Science for Anyone is here to help you succeed at every step of your data science journey.