Python Data Science

NumPy, Pandas & University Q&A

5 Marks Questions

1. Program to Sort an Array

Sorting is a fundamental operation in numerical computing used to arrange elements in either ascending or descending order. In Python, the NumPy library provides built-in functions to handle this efficiently without altering the original array by default.

You can sort an array with np.sort(), which returns a sorted copy and leaves the original untouched (the method form a.sort() sorts in place instead), or use np.argsort() to obtain the indices that would sort the array.

Example Program:

import numpy as np

# Creating an unsorted array
a = np.array([40, 10, 30, 20])

# Sorting the array directly
print(np.sort(a)) 
# Output: [10 20 30 40]

# Sorting a 2D array along axis=1 (each row sorted independently)
b = np.array([[3, 2, 1], 
              [6, 5, 4]])
print(np.sort(b, axis=1))
# Output: [[1 2 3]
#          [4 5 6]]

# Getting the sorted indices using argsort
print(np.argsort(a))
# Output: [1 3 2 0]
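Two related idioms are worth knowing alongside np.sort() and np.argsort(): NumPy has no descending flag, so a common pattern is to reverse the sorted copy with slicing, and the indices from argsort can be used to reorder the array itself. A minimal sketch:

```python
import numpy as np

a = np.array([40, 10, 30, 20])

# Descending order: sort ascending, then reverse with a [::-1] slice
desc = np.sort(a)[::-1]
print(desc)
# Output: [40 30 20 10]

# Indexing with the argsort result reproduces the sorted array
idx = np.argsort(a)
print(a[idx])
# Output: [10 20 30 40]
```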

2. Comparisons and Boolean Arrays

In NumPy, performing an element-wise comparison operation (like <, >, or ==) on an array evaluates the condition for every single item and generates a new Boolean array. This resulting array consists entirely of True and False values.

These Boolean arrays are highly valuable because they can act as a "mask." When you pass the Boolean mask back into the original array, it filters the data, allowing you to extract, modify, or count only the elements that satisfy your specific condition.

Example Program:

import numpy as np

a = np.array([10, 20, 30, 40, 50])

# Creating a Boolean mask where elements are greater than 25
mask = a > 25
print(mask)
# Output: [False False  True  True  True]

# Applying the mask to filter the original array
print(a[mask])
# Output: [30 40 50]

# Combining multiple logical conditions using bitwise operators
print(a[(a > 20) & (a < 50)])
# Output: [30 40]
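Beyond filtering, the same Boolean mask can count matching elements or modify only those elements in place. A short sketch continuing the example above:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
mask = a > 25

# Counting elements that satisfy the condition
print(np.count_nonzero(mask))
# Output: 3

# Assigning through the mask changes only the selected elements
a[mask] = 0
print(a)
# Output: [10 20  0  0  0]
```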

3. Combining Data Sets in Pandas

When working with data spread across multiple sources, Pandas provides several methods to integrate them into a single, unified structure. The three primary techniques are:

  • concat(): This function glues or stacks DataFrames together along a specific axis (either vertically across rows or horizontally across columns).
  • merge(): This operates similarly to SQL database joins (Inner, Outer, Left, Right). It combines datasets based on the values within shared columns.
  • join(): This is a convenience method used primarily to combine DataFrames based on their row indices rather than column values.

Example using merge():

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Marks': [80, 90]})

# Merging based on the shared 'ID' column
print(pd.merge(df1, df2, on='ID'))
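The other two techniques from the list above can be sketched briefly as well: concat() stacking rows, and join() aligning on the row index (the DataFrames here are illustrative, not from the original example):

```python
import pandas as pd

# concat(): stacking DataFrames vertically (axis=0 is the default)
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['C', 'D']})
print(pd.concat([df1, df2], ignore_index=True))

# join(): combining DataFrames based on their row indices
left = pd.DataFrame({'Name': ['A', 'B']}, index=[1, 2])
right = pd.DataFrame({'Marks': [80, 90]}, index=[1, 2])
print(left.join(right))
```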

4. Creating a Pivot Table

A Pivot Table in Pandas is an advanced summarization tool used to aggregate and reorganize multi-dimensional data. It takes column-wise data and reshapes it into a two-dimensional rectangular grid, allowing you to specify which data fields act as the index (rows), which act as the columns, and what aggregation function to apply to the values. This is incredibly useful for business reporting, spotting trends, and comparing categories.

Syntax: df.pivot_table(values, index, columns, aggfunc)

Example Program:

import pandas as pd

df = pd.DataFrame({
    'Dept': ['CSE', 'CSE', 'ECE'],
    'Year': [1, 2, 1],
    'Marks': [80, 90, 85]
})

# Creating a pivot table summarizing marks by department and year
print(pd.pivot_table(df, values='Marks', index='Dept', columns='Year'))
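When several records fall into the same cell, the pivot table aggregates them; the default aggfunc is the mean, and others (such as 'sum' or 'count') can be passed explicitly. A small sketch with illustrative data in which two CSE records share one cell:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['CSE', 'CSE', 'ECE'],
    'Year': [1, 1, 1],
    'Marks': [80, 90, 85]
})

# Both CSE rows land in the (CSE, Year 1) cell; 'mean' averages them to 85.0
print(pd.pivot_table(df, values='Marks', index='Dept',
                     columns='Year', aggfunc='mean'))
```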

10 Marks Questions

1. Structured Arrays in NumPy

Introduction and Core Concept

Standard NumPy arrays (ndarrays) are homogeneous, meaning they are designed to hold only one type of data at a time. However, real-world data is often heterogeneous. Structured arrays solve this by allowing you to compose an array from simpler, varied data types (such as integers, floats, and strings) organized into a sequence of named fields. You can think of a structured array as being similar to a "struct" in the C programming language, or a row in a relational database or spreadsheet.

Creation and Syntax

To create a structured array, you must explicitly define the data type (dtype) for the array. This is done by providing a list of tuples, where each tuple specifies the name of the field and its corresponding data format.

For example, if you wanted to store student records containing an ID, a name, and a score, you would define it as follows:

import numpy as np

# Defining a structured array with heterogeneous data
data = np.array([
    (1, 'Alice', 85.5), 
    (2, 'Bob', 78.0)
], dtype=[('id', 'i4'), ('name', 'U10'), ('marks', 'f4')]) 

print(data)
# Output: [(1, 'Alice', 85.5) (2, 'Bob', 78.)]

Accessing Data

Unlike standard arrays, where you rely purely on numerical indices, structured arrays let you access an entire field (column) by its dictionary-style name.

# Accessing all values under the 'name' field
print(data['name'])
# Output: ['Alice' 'Bob']
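Field access also composes with the Boolean masking covered earlier: a condition on one field selects whole records. A minimal sketch using the same student data:

```python
import numpy as np

data = np.array([
    (1, 'Alice', 85.5),
    (2, 'Bob', 78.0)
], dtype=[('id', 'i4'), ('name', 'U10'), ('marks', 'f4')])

# Boolean mask on the 'marks' field filters complete records
high = data[data['marks'] > 80]
print(high['name'])
# Output: ['Alice']
```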

Advantages and Applications

  • Organization: They provide a highly organized way to store tabular datasets directly within NumPy.
  • Performance: They are significantly faster and more memory-efficient than using standard Python dictionaries to store grouped data.
  • Best Use Case: They are ideal for scenarios where you need the structural feel of a spreadsheet or SQL table but absolutely require the high-speed, C-level computational performance of NumPy.

2. Handling Missing Values in Pandas

Introduction

In practical data analysis, datasets are rarely perfect. They often contain gaps due to system failures, data collection errors, or incomplete records. In Pandas, missing or undefined numerical data is standardly represented by the special floating-point marker NaN (Not a Number). Handling these missing values is a mandatory preprocessing step to guarantee data quality and ensure that downstream statistical analyses or machine learning models perform reliably.

Pandas provides a robust, built-in suite of tools to detect, eliminate, or replace these gaps.

Key Techniques for Handling Missing Data

  • 1. Detection: Before fixing missing data, you must locate it. Functions like isnull() and notnull() evaluate the dataset and return Boolean masks (arrays of True and False) indicating exactly where the NaN or None values reside.
  • 2. Elimination (Deletion): If the missing data cannot be salvaged, you can remove it entirely using the dropna() method. Depending on the severity, this can involve listwise deletion (dropping entire rows that contain even a single missing value) or dropping entire columns.
  • 3. Imputation (Filling): Instead of losing records, you can substitute NaN values using the fillna() method. You can fill gaps with a specific constant value (like 0), or use statistical measures derived from the dataset, such as the mean, median, or mode.
  • 4. Propagation: Particularly useful in time-series data, Pandas allows you to propagate existing values to fill gaps. You can use forward fill (ffill()) to carry the last known valid observation forward, or backward fill (bfill()) to carry the next valid observation backward.

Practical Example

import pandas as pd
import numpy as np

# Creating a DataFrame with intentional missing values (NaN)
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6]
})

# 1. Detecting missing values (Returns True where data is missing)
print(df.isnull())

# 2. Dropping missing values (Removes any row with a NaN)
print(df.dropna())

# 3. Filling missing values (Replaces all NaN markers with 0)
print(df.fillna(0))
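The fourth technique from the list, propagation, can be sketched on the same DataFrame: ffill() carries the last valid value forward down each column, while bfill() carries the next valid value backward.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6]
})

# 4a. Forward fill: NaN in 'A' becomes 2, NaN in 'B' becomes 4
print(df.ffill())

# 4b. Backward fill: NaN in 'B' becomes 6 ('A' keeps its trailing NaN,
#     since there is no later value to propagate backward)
print(df.bfill())
```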