After you have installed Python, you are ready to import your first data set and run some basic statistics. This page briefly summarizes how to load pandas in the Spyder IDE (included in the Anaconda distribution of Python), import a data set, and run basic descriptive statistics. If you need help downloading and installing the Anaconda distribution of Python and major packages, take a look at our Installing Statistical Software Programs page.
If you are looking to calculate basic statistics in other statistics packages, such as R, Stata or SPSS, click here for a step-by-step guide for these programs.
# Load pandas with its usual abbreviation
import pandas as pd

# Code to import a Comma Separated Values (.csv) data set
mydata1 = pd.read_csv("C:\\file_path\\file1.csv", header=None)

# Code to import an Excel (.xls or .xlsx) data set into pandas in Spyder
# (sheet_name replaces the older sheetname argument)
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)

# If you would like to add column (i.e., variable) names, use this code
mydata2 = pd.read_csv("C:\\file_path\\file1.csv", header=None, names=['ID', 'first_name', 'salary'])

# Another really cool feature of pandas is the ease of
# importing .csv files from internet sources
mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")
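To try the `header` and `names` options without a file on disk, here is a minimal sketch that reads CSV text from an in-memory buffer. The sample rows and the `csv_text`, `raw` and `named` variables are invented for illustration; only `pd.read_csv`, `header=None` and `names` come from the code above.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for file1.csv (no header row)
csv_text = "1,Ana,52000\n2,Ben,48500\n3,Cy,61000\n"

# header=None tells pandas the first row is data, not column names
raw = pd.read_csv(io.StringIO(csv_text), header=None)

# names= supplies column (variable) names at import time
named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["ID", "first_name", "salary"],
)

print(raw.columns.tolist())    # default integer column labels
print(named.columns.tolist())  # the names supplied above
print(named.head())
```

Without `names`, pandas labels the columns with integers, which is why supplying names at import time usually makes later code easier to read.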
Next, you are ready to run some basic statistics: descriptive statistics to show averages and other measures of central tendency, correlations to examine relationships between variables, and visualizations to illustrate these relationships.
# Basic descriptive statistics in Python using pandas
# Ensure that you have loaded each of these packages
# with their typical abbreviation in most data science
# applications
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt

# Summary statistics: count, mean, standard deviation,
# minimum, quartiles (the 50% row is the median) and maximum
# Use df.mode() if you also need the mode
# Based on an example data frame named "df"
df.describe()

# Correlations between all variables in data frame
# (in pandas 2.0+, pass numeric_only=True if df has non-numeric columns)
df.corr()

# Correlations for two specific variable columns
# in data frame
df['Variable 1 name'].corr(df['Variable 2 name'])

# Scatterplot
plot1 = df.plot.scatter(x='var1', y='var2', c='DarkBlue')
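Because the commands above assume a data frame named "df" already exists, here is a small self-contained sketch you can paste into Spyder to see them work. The column names (`hours`, `score`, `group`) and all values are made up for illustration and are not from any real data set.

```python
import pandas as pd

# Hypothetical data frame; the column names and values are invented
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 70, 72],
    "group": ["a", "b", "a", "b", "a"],
})

# describe() summarizes numeric columns only by default:
# count, mean, std, min, quartiles and max
summary = df.describe()

# Measures that describe() does not report directly
medians = df[["hours", "score"]].median()
group_mode = df["group"].mode()  # returns a Series because ties are possible
value_range = df["score"].max() - df["score"].min()

# Correlation matrix for the numeric columns, and one specific pair
corr_matrix = df[["hours", "score"]].corr()
r = df["hours"].corr(df["score"])  # Pearson correlation by default

print(summary)
print(corr_matrix)
```

Selecting the numeric columns before calling `corr()` is one way to avoid errors when a data frame also contains text columns such as `group`.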
There are many additional advanced topics in data science and statistics, and we at Data Science for Anyone are hard at work on creating new expert guides for R, Python, Stata and SPSS. For a really helpful and easy-to-use listing of Python commands for analyzing data and running statistics with the pandas package, check out this amazing guide and search engine for commands in pandas.
Make sure to check back on our site often for updates. We also highly recommend the educational resources on data science and statistics hosted by the Institute for Digital Research & Education Statistical Consulting (IDRE) at UCLA!