Data Manipulation with Pandas

养花风水

When it comes to working with data in code, the Pandas library for Python is one of the most popular and sought-after libraries. It provides flexible, fast, and reliable data structures for working with structured data. Whether you are a data analyst, data scientist, or engineer, cleaning, transforming, and analyzing data with Pandas quickly becomes essential.

Pandas is built on top of another popular library, NumPy, but it offers a higher-level approach to manipulating data, whether that data comes as tables, time series, or CSV and Excel files. The two most significant data structures in Pandas are Series and DataFrame. With these structures you can carry out a wide range of data manipulation tasks with little effort.

The Pandas Library

To start using Pandas, you first need to import the library. This is usually done with the following import statement:

  import pandas as pd

By convention, Pandas is imported under the alias 'pd', which shortens calls to Pandas functions in your program.

Pandas Data Structures

In Pandas, the two primary data structures are Series and DataFrame. It is important to understand these structures, as they are the basis for most data manipulation in Pandas.

Series

A Series is a one-dimensional labeled array. It can hold any type of data, including integers, strings, and other Python objects. A Series resembles a list in Python, but in addition to the data it also carries labels (called the index). These labels allow fast and easy access to the data and make it easy to modify.

  import pandas as pd

  data = [1, 2, 3, 4]
  series = pd.Series(data)
  print(series)
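
To make the role of the labels more concrete, here is a minimal sketch of a Series with a custom index. The weekday labels and temperature values are made up for illustration; only `pd.Series` and its `index` argument come from Pandas itself.

```python
import pandas as pd

# Hypothetical data: temperatures labeled by weekday instead of 0, 1, 2.
temps = pd.Series([21.5, 23.0, 19.8], index=["Mon", "Tue", "Wed"])

# Label-based access goes through the index.
tuesday = temps["Tue"]

# Position-based access is still available through iloc.
first = temps.iloc[0]

# A value can be modified through its label.
temps["Wed"] = 20.0

print(temps)
```

If no index is passed, as in the earlier example, Pandas falls back to the default integer labels starting at 0.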

DataFrame

In broad terms, a DataFrame is a two-dimensional structure containing tabular data. It has rows and columns, just like a table or an Excel spreadsheet, and each column can have a different data type. This makes it especially useful when dealing with data that has more than one variable.

  data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
  df = pd.DataFrame(data)
  print(df)
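
Since each column can have its own data type, it is often useful to inspect a DataFrame right after creating it. The sketch below reuses the same sample data and looks at its shape, column names, and column types; the variable names are only illustrative.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# columns holds the column labels.
column_names = list(df.columns)

# Each column has its own dtype; 'Age' is numeric, 'Name' is not.
age_is_numeric = pd.api.types.is_numeric_dtype(df['Age'])
name_is_numeric = pd.api.types.is_numeric_dtype(df['Name'])

print(df.dtypes)
```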

Methods to Import Data

In practice, data comes in a plethora of file formats, including CSV, Excel, JSON, and SQL databases. Pandas makes these formats easy to import, and this ability to read data from many different sources is a prominent reason why it is so popular in data analysis. As an illustration, the example below reads some data from a CSV file into a DataFrame.

To do this, use `read_csv()`. Remember to replace 'data.csv' with the name of your own file. The file type is indicated by the extension that follows the file name inside the quotation marks.

  df = pd.read_csv('data.csv')

Once the required data is loaded into a DataFrame, you can start modifying and transforming it according to your requirements.
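
If you do not have a CSV file at hand, the loading step can be tried out with in-memory text. The following is a minimal sketch that assumes a small made-up dataset; `io.StringIO` stands in for a real file, and `read_csv()` treats it the same way as a path.

```python
import io

import pandas as pd

# Hypothetical CSV content; in a real project this would live in a file.
csv_text = "Name,Age\nAlice,24\nBob,27\nCharlie,22\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first rows, which is a quick sanity check after loading.
print(df.head())
```

The other formats mentioned above have their own readers, such as `pd.read_excel()`, `pd.read_json()`, and `pd.read_sql()`.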

Practical Data Manipulation Techniques

Once data is loaded into a DataFrame, you can manipulate it with a range of operations for cleaning, transforming, and analyzing it. A common objective is to pull together and shape the data into the form your analysis needs.

Selecting Data

One of the first steps in reshaping a DataFrame is selecting the data you need, whether that is a specific column, a specific row, or a single value.

- To select a column, use the column name, for example:

  age_column = df['Age']

- To select a row by its index, use the `loc[]` (label-based) or `iloc[]` (position-based) indexer, for instance:

  row_by_index = df.loc[0]  # Row labeled 0 (the first row with the default index)

- To get a single value, pass both the row label and the column name to `loc[]`:

  value = df.loc[0, 'Name']  # Value in the first row of the 'Name' column
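
The difference between `loc[]` and `iloc[]` is easier to see when the index is not the default 0, 1, 2. The sketch below assumes made-up string labels 'a', 'b', and 'c' for the rows; everything else reuses the sample data from above.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]},
    index=['a', 'b', 'c'],  # hypothetical row labels for illustration
)

# loc selects by label: the row whose index label is 'b'.
by_label = df.loc['b']

# iloc selects by position: the second row, whatever its label is.
by_position = df.iloc[1]

# A single value: row label first, column name second.
value = df.loc['c', 'Name']

# Lists of labels select several rows and columns at once.
subset = df.loc[['a', 'c'], ['Name']]

print(subset)
```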

Filtering Data

You can also select data based on a condition. For instance, the following selects all rows where the 'Age' column is greater than 25:

  filtered_data = df[df['Age'] > 25]
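
Conditions can also be combined. As a rough sketch on the same sample data, the snippet below uses `&` for "and" and `|` for "or"; each condition needs its own parentheses because of Python's operator precedence.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]})

# Rows where Age is strictly between 22 and 27.
mid_twenties = df[(df['Age'] > 22) & (df['Age'] < 27)]

# Rows where Age is below 23 or the name is 'Bob'.
young_or_bob = df[(df['Age'] < 23) | (df['Name'] == 'Bob')]

print(mid_twenties)
print(young_or_bob)
```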

Sorting Data

DataFrames can also be sorted by one or more columns, in either ascending or descending order.

  sorted_df = df.sort_values(by='Age', ascending=False)
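
Sorting by several columns works the same way, with a list passed to `by`. The sketch below adds a hypothetical fourth person, 'Dana', so that two rows share the same age and the second sort key has something to decide.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # 'Dana' is made up for the tie
    'Age': [24, 27, 22, 24],
})

# Sort by Age descending, then by Name ascending to break ties.
sorted_df = df.sort_values(by=['Age', 'Name'], ascending=[False, True])

# sort_values keeps the original index labels; reset them if a clean 0..n-1 index is wanted.
sorted_df = sorted_df.reset_index(drop=True)

print(sorted_df)
```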

Dealing with NaN values

Missing (null) values are commonly represented as NaN. NaN values appear very often in real-world datasets, and Pandas provides several methods for dealing with them.

- To check for missing data, apply the `isnull()` method:

  df.isnull()

- If null values exist in some rows and you want to get rid of those rows, use the `dropna()` method:

  df_cleaned = df.dropna()

- Alternatively, you can replace the null values with 0 using `fillna()`:

  df_filled = df.fillna(0)
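
The three methods above can be tried together on a small DataFrame with one missing value. This is only a sketch: the missing age for 'Bob' is introduced on purpose, and filling with the column mean instead of 0 is one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Sample data with one deliberately missing age.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, np.nan, 22]})

# isnull() gives a True/False table; summing it counts missing values per column.
missing_per_column = df.isnull().sum()

# Drop only the rows where 'Age' is missing.
df_cleaned = df.dropna(subset=['Age'])

# Fill the missing 'Age' with the mean of the known ages.
df_filled = df.fillna({'Age': df['Age'].mean()})

print(missing_per_column)
print(df_filled)
```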

Transforming Data in a DataFrame

In addition to filtering and sorting, you can update the data held in the DataFrame. For instance, you can create new columns, adjust existing ones, or apply a function to change the values in one or more columns.

- Creating and populating a new column:

  df['Salary'] = [50000, 60000, 55000]

- Modifying an existing column:

  df['Age'] = df['Age'].apply(lambda x: x + 1)
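
For simple arithmetic, the same transformations can be written without `apply()`, because operations on a column act on every value at once. The sketch below assumes the sample data with the 'Salary' column already added; the column names 'Salary_k' and 'Senior' and the age threshold of 26 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Salary': [50000, 60000, 55000],
})

# Vectorized equivalent of df['Age'].apply(lambda x: x + 1).
df['Age'] = df['Age'] + 1

# A new column derived from an existing one (salary in thousands).
df['Salary_k'] = df['Salary'] / 1000

# A boolean column built from a condition.
df['Senior'] = df['Age'] >= 26

print(df)
```

`apply()` remains useful when the transformation is a custom function that cannot be expressed as plain column arithmetic.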

Aggregating Information

If you want to summarize a dataset, for instance by computing totals or averages for each category, you can use grouping. A Pandas DataFrame can be grouped by one or more columns, after which aggregation functions such as sum, mean, and count can be applied to each group.

  grouped_data = df.groupby('Age').sum()
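
Grouping is most informative when the grouping column has repeated values. The sketch below assumes a hypothetical 'Department' column, which is not part of the earlier sample data, and shows both a single aggregation and several aggregations at once.

```python
import pandas as pd

# Hypothetical data with a categorical column to group by.
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT'],
    'Age': [24, 27, 22, 30],
    'Salary': [50000, 60000, 55000, 65000],
})

# One aggregation on one column: mean salary per department.
mean_salary = df.groupby('Department')['Salary'].mean()

# Several aggregations at once, each given an output column name.
summary = df.groupby('Department').agg(
    total_salary=('Salary', 'sum'),
    headcount=('Salary', 'count'),
    oldest=('Age', 'max'),
)

print(mean_salary)
print(summary)
```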

Combining Data

Data is often spread across multiple tables or files, each of which can be loaded as its own dataset. In Pandas, these datasets are combined by merging. Merging DataFrames is done with `merge()`, which joins them on one or more common columns.

  df_merged = pd.merge(df1, df2, on='ID')
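
The merge above assumes that `df1` and `df2` already exist and share an 'ID' column. Here is a minimal sketch with two made-up tables, showing the default inner join and a left join through the `how` parameter.

```python
import pandas as pd

# Two hypothetical tables sharing an 'ID' column.
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 55000, 70000]})

# Default is an inner join: only IDs present in both tables are kept.
inner = pd.merge(df1, df2, on='ID')

# A left join keeps every row of df1; missing salaries become NaN.
left = pd.merge(df1, df2, on='ID', how='left')

print(inner)
print(left)
```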

Pivot Tables

Pandas also lets you create pivot tables, which are useful for summarizing and analyzing data. You can build them with the `pivot_table()` method.

  pivot_df = df.pivot_table(values='Salary', index='Age', aggfunc='mean')
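
A pivot table becomes more useful once a `columns` argument spreads a second category across the table. The sketch below assumes hypothetical 'Department' and 'Level' columns that are not part of the earlier sample data.

```python
import pandas as pd

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT'],
    'Level': ['Junior', 'Senior', 'Junior', 'Senior'],
    'Salary': [50000, 60000, 55000, 65000],
})

# Rows are departments, columns are levels, cells hold the mean salary.
pivot_df = df.pivot_table(
    values='Salary', index='Department', columns='Level', aggfunc='mean'
)

print(pivot_df)
```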

Summary of Fundamental Steps

With the help of Pandas, you can perform a range of data manipulation activities, such as:

- Loading data from a range of file types

- Selecting, filtering, and sorting data

- Handling null values in datasets

- Modifying existing values, creating new columns, and applying functions

- Grouping and aggregating data

- Merging DataFrames and building pivot tables

All of these activities are critical for turning raw data into something that is ready for analysis. Because Pandas can apply such operations to large datasets quickly, it is an invaluable resource for data scientists and analysts. Whether you are wrangling, reshaping, or transforming data in order to analyze or visualize it, Pandas very likely offers a way to do it.
