Data Wrangling Using Python - Part 1

In this post, we walk through some commonly used data wrangling steps in Python. We use the pandas data frame as our data object throughout.

Data Wrangling with Python

Importing Python Packages

In this part of the blog, we use the pandas and numpy packages available in Python. We need to import these packages before we use them.

import pandas as pd
import numpy as np

Creating a Data Frame in Python

Now we want to create a data frame in Python. There are multiple ways to do that:

  • Create a data frame from individual columns
  • Read data into a data frame
  • Convert a different object to a data frame

Creating a data frame: we generate a 10 x 3 array of random numbers and store it in a data frame, df1.

# Creating a Data Frame in Python
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['c1', 'c2', 'c3'])

The resulting data frame has 3 columns: c1, c2 and c3.

We can also create a data frame by combining different columns into a dictionary and converting the dictionary to a data frame.

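# The original listing is cut off after 'C1'; 'C2' and 'C3' are illustrative placeholders
a = {'C1': range(1, 10), 'C2': range(11, 20), 'C3': range(21, 30)}
df = pd.DataFrame(a)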

"a" is a dictionary and "df" is a data frame.
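The other two routes listed earlier, reading data into a data frame and converting another object, can be sketched as follows. This is only an illustration: the CSV text, column names and values are made up, and an in-memory string stands in for a file on disk.

```python
import io

import numpy as np
import pandas as pd

# Read data into a data frame: StringIO stands in for a real CSV file path
csv_text = "C1,C2\n1,a\n2,b\n3,c\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Convert a NumPy array to a data frame, naming the columns explicitly
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, columns=["x", "y"])

# Convert a list of dictionaries (one per row) to a data frame
records = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.8}]
df_rec = pd.DataFrame(records)

print(df_csv.shape, df_arr.shape, df_rec.shape)
```

With a real file, the path would be passed to pd.read_csv in place of the StringIO object.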

Renaming Columns of a Data Frame

In a number of scenarios, we may want to rename the columns of an existing data frame in Python. Some of the ways to rename column(s) are:

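# ---------------------- Rename Column Names ------------
# Replace all column names at once (the list must match the number of columns)
df.columns = ["N1", "N2", "N3"]
# Rename using the rename method; old and new names are paired one to one
old_names = ['N1', 'N2', 'N3']
new_names = ['a', 'b', 'c']
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)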
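The rename method can also be called without inplace=True, in which case it returns a new data frame and leaves the original untouched. A minimal sketch, using a small made-up frame (df_demo is an illustrative name, not part of the example above):

```python
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({"N1": [1, 2], "N2": [3, 4], "N3": [5, 6]})

# Pass the mapping directly; only the listed columns are renamed
renamed = df_demo.rename(columns={"N1": "a", "N3": "c"})

print(list(df_demo.columns))   # original names are unchanged
print(list(renamed.columns))   # renamed copy
```

Columns that do not appear in the mapping keep their existing names.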

Drop Column(s) of a Data Frame

We can drop a column by name or by position.

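# ---------------------- Drop Column(s) ------------
# Stored under a new name so that df1 (columns c1, c2, c3) is not overwritten
df_dropped = df.drop('a', axis=1)

# --------------------- Drop Columns by Position ------
# Positions are zero-based, so [1, 2] drops the second and third columns
df2 = df.drop(df.columns[[1, 2]], axis=1)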
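As a variation, drop also accepts a columns keyword, which avoids having to pass axis=1, and an errors argument for labels that may not exist. A small sketch on a made-up frame (df_demo and its values are illustrative):

```python
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Drop several columns by name using the columns keyword
only_a = df_demo.drop(columns=["b", "c"])

# errors="ignore" skips labels that are not present instead of raising KeyError
still_three = df_demo.drop(columns=["missing"], errors="ignore")

print(list(only_a.columns))
print(list(still_three.columns))
```

In both calls drop returns a new data frame, so df_demo keeps all three columns.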

Add New Column(s) to a Data Frame

We can create a new column and add it to an existing data frame. A new column can be derived from existing columns or built from other data.

# Add new columns to an existing data frame
df1['c4'] = df1['c1'] * 10
# Repeat the pair '3', '5' to fill the column (assumes an even number of rows)
df1['c5'] = ['3', '5'] * (len(df1) // 2)

The first line adds a new column "c4" by multiplying the existing column "c1" by 10. The second line repeats the strings "3" and "5" to fill a new column "c5".
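The same two columns can also be added in one step with the assign method, which returns a new data frame instead of modifying the existing one. A minimal sketch on a made-up frame (df_demo is an illustrative name; np.tile is used here in place of list repetition):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an even number of rows
df_demo = pd.DataFrame({"c1": [0.5, 1.5, 2.5, 3.5]})

# assign returns a copy of the frame with the new columns added
df_new = df_demo.assign(
    c4=df_demo["c1"] * 10,
    c5=np.tile(["3", "5"], len(df_demo) // 2),
)

print(df_new)
```

Because df_demo is left unchanged, this style is convenient when chaining several transformations.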

Change Existing Column(s) of a Data Frame

There are a number of reasons to change existing columns. We may want to convert a floating-point column to integer, convert string values to numeric values, or round off values. Here are ways to handle these scenarios.

# Change an existing column - convert floating point to integer
df1['c2'] = df1['c2'] * 10
df1['c2'] = df1['c2'].astype(np.int64)  # truncates the fractional part
# Change a string column to numeric or vice versa
df1['c5'] = ['3', '5'] * (len(df1) // 2)
df1['c6'] = pd.to_numeric(df1['c5'])  # string to numeric
df1['c7'] = df1['c1'].astype(str)     # numeric to string
# Rounding off values
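No rounding code follows the comment above. One common way to round values, sketched here on a made-up frame as an assumption about what was intended (df_demo and its values are illustrative), is the round method on a column or on the whole data frame:

```python
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({"c1": [1.2345, 2.3456], "c2": [10.5678, 20.1234]})

# Round a single column to two decimal places
df_demo["c1_rounded"] = df_demo["c1"].round(2)

# Round several columns at once, with a different precision per column
rounded = df_demo.round({"c1": 1, "c2": 0})

print(df_demo)
print(rounded)
```

Columns not named in the dictionary passed to round are left as they are.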

Finally, for this part of the blog, we may want to find the data types of the columns of a data frame. Here is how to do that.

#------------ Find Data Type of Columns --------------
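No code follows the heading comment above. A minimal sketch of one common way to inspect column types, assuming the dtypes attribute is what was intended (df_demo is an illustrative frame, not the df1 built earlier):

```python
import pandas as pd

# Hypothetical frame with float, integer and string columns
df_demo = pd.DataFrame({"c1": [1.5, 2.5], "c2": [1, 2], "c5": ["3", "5"]})

# dtypes returns a Series mapping each column name to its data type
print(df_demo.dtypes)

# The data type of a single column
print(df_demo["c2"].dtype)

# Names of the numeric columns only
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Applied to the frame from this post, the equivalent call would be df1.dtypes.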

