`datexplore` Example Usage

Here we will show how the datexplore package can be used for the early stages of a data analysis project. We will show example usages for each function in the package (clean_names, visualise, and detect_outliers).

The early stages of data analysis projects often begin with similar steps. For many projects, data cleaning and exploratory data analysis (EDA) are essential before beginning more complex analysis. Using clean data for your analysis can make your code less suceptible to bugs or errors. Additionally, performing EDA can help to direct the analysis of your project and gives a stronger understanding of the data you are working with.

This package aims to help with the early stages of a project. Specifically, it contains a function to clean the column names of tabular data, a function to detect outliers in numerical data, and a function to create useful visulaization for EDA.

Imports

from datexplore.clean_names import clean_names
from datexplore.detect_outliers import detect_outliers
from datexplore.visualise import visualise
import pandas as pd
import numpy as np

Clean names

Often times raw data contains non syntactic column names. It can be particulary troublesome when the column names contain spaces and you are working with other packages which are designed only for column names without spaces.

For column name with a space:

An example of one such tool which does not work for column names with spaces the .query() method from the pandas library. This is shown below:

raw_df = pd.DataFrame({'Even Numbers': [2, 4, 6, 8],'odd numbers': [1, 3, 5, 7]})
filtered_df = raw_df.query("Even Numbers > 2")

Traceback (most recent call last):

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3550 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Cell In[2], line 2
    filtered_df = raw_df.query("Even Numbers > 2")

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/frame.py:4811 in query
    res = self.eval(expr, **kwargs)

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/frame.py:4937 in eval
    return _eval(expr, inplace=inplace, **kwargs)

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/computation/eval.py:336 in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/computation/expr.py:809 in __init__
    self.terms = self.parse()

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/computation/expr.py:828 in parse
    return self._visitor.visit(self.expr)

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/computation/expr.py:408 in visit
    raise e

  File ~/checkouts/readthedocs.org/user_builds/datexplore/envs/latest/lib/python3.9/site-packages/pandas/core/computation/expr.py:404 in visit
    node = ast.fix_missing_locations(ast.parse(clean))

  File ~/.asdf/installs/python/3.9.18/lib/python3.9/ast.py:50 in parse
    return compile(source, filename, mode, flags,

  File <unknown>:1
    Even Numbers >2
         ^
SyntaxError: invalid syntax

As you can see, using the column name containing a space results in an error. Now, we can use the clean_names function to “clean” the column names of the data frame. By “cleaning” the column names, we mean that we make all column names in a dataframe such that the names only use letters, numbers, and underscores.

The clean_names function takes a pandas dataframe containing data with column names as an input. There is also an optional parameter, case, which specifies the capitalization structure of the output dataframe (more on this later).

For column names without spaces:

Below we use the clean_names function and show that the resulting dataframe can now be used with the .query() method.

# Clean the column names and view the resulting dataframe
df = pd.DataFrame({'Even Numbers': [2, 4, 6, 8],'odd numbers': [1, 3, 5, 7]})
clean_names(df)
df

	even_numbers	odd_numbers
0	2	1
1	4	3
2	6	5
3	8	7

# Use the .query method on the new dataframe
filtered_df = df.query("even_numbers > 2")
filtered_df

	even_numbers	odd_numbers
1	4	3
2	6	5
3	8	7

This may not seem that useful for a dataframe with only two columns, but for a data frame with many columns, or if you are working with many dataframes, using the clean_names function could save a lot of time.

Exploring the case parameter:

The clean_names function also has an optional parameter, case, which specifics the capitalization structure of the output column names. The default value for this parameter is “snake_case” and the other options are “CamelCase” and “lowerCamelCase”. snake_case uses only lowercase letters and spaces are replaced with underscores. “CamelCase” capitalizes the first letter of a name and every letter following a space. “lowerCamelCase” results in the first letter of the name being lowercase and the first letter following a space being capitalized. Below are some examples using this optional parameter:

df1 =  pd.DataFrame({'make this SnaKe CaSe##': ["sample 1", "sample 2", "sample 3", "sample 4"]})
display(df1)
clean_names(df1) #this has the default value for the case parameter which is snake_case

	make this SnaKe CaSe##
0	sample 1
1	sample 2
2	sample 3
3	sample 4

	make_this_snake_case
0	sample 1
1	sample 2
2	sample 3
3	sample 4

df2 =  pd.DataFrame({'make THIS CAMEL Case!!!': ["sample 1", "sample 2", "sample 3", "sample 4"]})
display (df2)
clean_names(df2, case = "CamelCase") 

	make THIS CAMEL Case!!!
0	sample 1
1	sample 2
2	sample 3
3	sample 4

	MakeThisCamelCase!!!
0	sample 1
1	sample 2
2	sample 3
3	sample 4

df3 =  pd.DataFrame({'Make this ***LOWER*** CAMEL caSe': ["sample 1", "sample 2", "sample 3", "sample 4"]})
display(df3)
clean_names(df3, case = "lowerCamelCase")

	Make this *LOWER* CAMEL caSe
0	sample 1
1	sample 2
2	sample 3
3	sample 4

	makeThisLowerCamelCase
0	sample 1
1	sample 2
2	sample 3
3	sample 4

Visualise

The datexplore package, within the Python ecosystem, complements the functionalities of pandas, seaborn, and matplotlib by offering a unified approach to visualizing data. The visualise function simplifies the process of creating comprehensive visual analyses, especially during the early stages of a data project.

The visualise function aims to provide a thorough visual exploration of a DataFrame. It focuses on three key areas:

Identifying Missing Values: Generates heatmap to visualize missing values in the DataFrame. This is crucial for assessing data quality and deciding on data cleaning strategies.

Correlation Analysis: Computes and displays a correlation heatmap for numeric variables. Understanding these relationships is vital for feature selection and predictive modeling.

Pairwise Variable Inspection: Create a grid of scatter plots for numeric variables. This helps in visually inspecting variable distributions and interactions.

Usage Example

data = {
    'Numeric_1': np.random.randint(1, 100, 10),
    'Numeric_2': np.random.normal(50, 15, 10),
    'Numeric_3': np.random.uniform(20, 60, 10),
    'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
}
df = pd.DataFrame(data)

df.loc[1:4, 'Numeric_1'] = np.nan
df.loc[4:6, 'Numeric_2'] = np.nan
df.loc[7:8, 'Numeric_3'] = np.nan

df

	Numeric_1	Numeric_2	Numeric_3	Category
0	60.0	55.772101	41.886092	A
1	NaN	66.599678	41.797397	B
2	NaN	42.284315	48.683661	C
3	NaN	62.653329	46.797470	D
4	NaN	NaN	49.999846	E
5	80.0	NaN	24.569233	F
6	86.0	NaN	57.739063	G
7	99.0	64.628283	NaN	H
8	11.0	56.293119	NaN	I
9	18.0	35.333900	57.475740	J

The created toy DataFrame has four columns, where three are numeric and one is categorical. It also includes deliberately introduced missing values to demonstrate the capabilities of the visualise function.

Now, let’s run the visualise function.

visualise(df)

_images/c35d63a8771c44057fe9576ad68fffa71870d784c8e0146db7d35a75a1ba8ed6.png

_images/ce2946571ccb54dec2a0ca0f88842907d2723c8c92b8181bb20d5b4881a0e349.png

_images/5d410e24eadf65b99beed5434e7f4f2b9b6452a2b6c7137511f0e0c1a0d87e6a.png

<Figure size 1000x400 with 0 Axes>

Detect outliers

In data analysis, outliers can either reveal critical insights or introduce annoying biases. The detect_outliers() function is an easy to use tool for detecting and categorizing outliers in Pandas Data Frames. This function uses the Interquartile Range (IQR) and standard deviation to identify and categorize its outliers. It then outputs the outliers to a Data Frame in a format that is simple to use and explore. By quickly identifying the most extreme outliers with our function, you can immediately get a sense of the scale of the problem the outliers might present. There are many real-world examples where disproportionate outliers make otherwise useful summary statistics unreliable. For example, detect_outliers() should be useful for real estate pricing. Data analysis on a real estate dataset can be compromised, when that dataset includes a few luxury homes priced significantly higher than the average. These extreme home values introduce a substantial skew, distorting the overall analysis. Our function enables you to swiftly identify and categorize these outliers. It also provides their index locations in the output Data Frame. Once you have the index of the outliers, all that’s required is a few extra lines of code to remove these anomalous entries from your original dataset, ensuring a more balanced analysis!

Usage example

To give a simple demonstration on how the function is used, let’s create a sample toy Data Frame (we can imagine the columns are features in a housing data set referenced above).

#generate a dataframe with two numerical columns, both having 2 outliers and one categorical column
df = pd.DataFrame({
    'House Price': [250, 300, 275, 320, 310, 290, 280, 265, 150, 225, 300, 250, 210, 2380, 2450],  # Prices in thousands, including outliers
    'Square Feet': [1500, 2000, 1800, 2100, 1900, 1600, 1700, 1750, 1500, 1700, 1800, 1450, 1300, 5400, 6800],  # Size in square feet
    'Location': ['Urban', 'Suburban', 'Rural', 'Urban', 'Suburban', 'Rural', 'Urban', 'Suburban', 'Urban', 'Urban', 'Rural', 'Urban', 'Suburban', 'Urban', 'Urban']  # Locations
})

Now that we have our dataframe with two extreme outliers, let’s use the function to see what useful information we find on the outliers of our dataset.

result = detect_outliers(df)
print(result)

        column  index  outlier_value  deviation category
House Price      8            150     19.375     Mild
House Price     13           2380   2000.000  Extreme
House Price     14           2450   2070.000  Extreme
Square Feet     13           5400   2925.000   Severe
Square Feet     14           6800   4325.000  Extreme

The detect_outliers output returns the Data Frame shown above. A simple print function call will output the data in a clean, easy to read format. The column name of the outlier, it’s index, the outlier_value, it’s deviation and a category description of how large the outlier is. The outlier information can be used with pandas as needed to transform the original data frame.

For example, to remove the outliers from the original Data Frame using the output from our detect_outliers function you would:

# Find the 'extreme' outlier index locations
extreme_indices = result[result['category'] == 'Extreme']['index']

# Drop the rows from the main Data Frame that correspond to these indices
df_outlier_free = df.drop(extreme_indices)
print(df_outlier_free)

    House Price  Square Feet  Location
         250         1500     Urban
         300         2000  Suburban
         275         1800     Rural
         320         2100     Urban
         310         1900  Suburban
         290         1600     Rural
         280         1700     Urban
         265         1750  Suburban
         150         1500     Urban
         225         1700     Urban
        300         1800     Rural
        250         1450     Urban
        210         1300  Suburban

datexplore Example Usage