Memory efficient query execution in batches with PostgreSQL and Python

Maciej Wysocki

1 August 2024, 10 min read


What's inside

  1. Introduction
  2. Challenges in Data Extraction
  3. Reproducible Setup for Experiments
  4. Simple Query Execution
  5. Batch Results with Pandas and Polars
  6. Setup of an Effective Server-Side Execution
  7. Comparison of Memory Usage and Runtimes
  8. Conclusions

Introduction

In this article, we will explore strategies for effective batched query execution in Python when working with PostgreSQL databases. We will delve into an approach that can be used for handling large tables, highlighting techniques to ensure your queries remain efficient and your database interactions smooth, regardless of the dataset size.

Challenges in Data Extraction

When working with large databases, efficient query execution is crucial for maintaining performance and scalability. In Python, interacting with PostgreSQL databases is most commonly done using libraries such as psycopg2 or SQLAlchemy, which provide robust tools for executing queries. However, the right approach to executing queries can vary significantly depending on the size of the tables involved.

For small tables, straightforward query execution techniques often suffice, even when relying on basic pandas or polars functionality. Such tables typically fit comfortably in memory, allowing for quick processing and minimal performance concerns.

On the other hand, large tables with millions of rows and hundreds of columns pose a greater challenge. Queries that involve significant amounts of data can be memory-intensive, leading to slow performance and potential errors if not managed properly. These scenarios require a more sophisticated approach, such as retrieving the query results in batches. Batched query execution breaks a large result set down into smaller, more manageable chunks, optimizing resource usage and improving overall efficiency.

Reproducible Setup for Experiments

Before we begin, let's set up the environment used for the examples. To ensure reproducibility and ease of setup, we will use Docker to create a PostgreSQL database and a Python script to generate and populate tables with artificial data. To start, we need to create a Docker Compose configuration in a docker-compose.yaml file.

services:
  db:
    image: postgres:latest
    container_name: postgres_db
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: database
    ports:
      - '5432:5432'
    volumes:
      - db_data:/var/lib/postgresql/data

volumes:
  db_data:

This Docker configuration creates a very simple PostgreSQL environment that allows us to store and interact with the tables. Remember that this is just an example; in a real project you should always protect secrets such as usernames and passwords.
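One way to keep credentials out of the source code, for instance, is to read them from environment variables. Below is a minimal sketch; the variable names are an assumption chosen for this example, so adjust them to your own setup or secret manager.

import os

# Hypothetical variable names, used only for illustration
DB_CONFIG = {
    "dbname": os.environ.get("POSTGRES_DB", "database"),
    "user": os.environ.get("POSTGRES_USER", "user"),
    "password": os.environ["POSTGRES_PASSWORD"],  # no default: fail loudly if the secret is missing
    "host": os.environ.get("POSTGRES_HOST", "localhost"),
    "port": int(os.environ.get("POSTGRES_PORT", "5432")),
}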

Now we need a simple Python script create_tables.py to generate a table with artificial data and store it in the PostgreSQL database created in the Docker container. The script below uses pandas and SQLAlchemy to handle data creation and database interaction.

import pandas as pd
from sqlalchemy import create_engine, Engine
import datetime

# Database connection parameters
DB_CONFIG = {
    "dbname": "database",
    "user": "user",
    "password": "password",
    "host": "localhost",
    "port": 5432,
}


def get_engine(config: dict) -> Engine:
    """Create a SQLAlchemy engine."""
    db_url = f"postgresql://{config['user']}:{config['password']}@{config['host']}:{config['port']}/{config['dbname']}"
    engine = create_engine(db_url)
    return engine


def get_sample_table(rows: int = 100, columns: int = 3) -> pd.DataFrame:
    """Create a pandas DataFrame with artificial values."""
    start_date = datetime.date(2023, 1, 1)
    # Generate data
    data = {
        "ID": range(rows),
        "date": [
            str(start_date + datetime.timedelta(days=i % 10)) for i in range(rows)
        ],
    }

    for i in range(1, columns + 1):
        data[f"v{i}"] = range(i * rows, rows + rows * i)

    return pd.DataFrame(data)


def main():
    # Create the engine
    engine = get_engine(config=DB_CONFIG)

    # Create a table with artificial values
    table_name = "small_table"
    data = get_sample_table(rows=100, columns=3)

    # Save DataFrame to PostgreSQL
    data.to_sql(table_name, engine, if_exists="replace", index=False)
    print(data)

    print(f"DataFrame successfully saved to PostgreSQL table: {table_name}")


if __name__ == "__main__":
    main()
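The same helpers can be reused to generate a bigger table, such as the one with 1,000,000 rows used for the comparison later in this article. Below is a minimal sketch of how that could look; the table name large_table and the chunksize value are assumptions for illustration.

def create_large_table() -> None:
    """Populate a benchmark table with artificial data (1,000,000 rows)."""
    engine = get_engine(config=DB_CONFIG)
    # Assumption: 20 value columns on top of the ID and date columns;
    # adjust to match the exact shape you want to benchmark against.
    data = get_sample_table(rows=1_000_000, columns=20)
    # chunksize writes the rows to PostgreSQL in batches of 50,000
    # rather than all at once
    data.to_sql(
        "large_table", engine, if_exists="replace", index=False, chunksize=50_000
    )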

Simple Query Execution

To test the setup that we created we can use the following Python script (query.py) to connect to the database and execute a query.

import psycopg2

# Database connection parameters
DB_CONFIG = {
    "dbname": "database",
    "user": "user",
    "password": "password",
    "host": "localhost",
    "port": 5432,
}


def execute_query(query: str) -> list[tuple]:
    """Execute a SQL query using psycopg2 connection."""
    try:
        # Establish a connection to the database
        with psycopg2.connect(**DB_CONFIG) as conn:
            with conn.cursor() as cursor:

                # Execute the query and fetch all rows at once;
                # the connection context manager finalizes the transaction on exit
                cursor.execute(query)
                results = cursor.fetchall()

        print("Query executed successfully")

        return results

    except (Exception, psycopg2.DatabaseError) as error:
        print(f"Error: {error}")
        return []


def main() -> None:
    table_name = "small_table"
    query_text = f"SELECT * FROM {table_name}"
    data = execute_query(query=query_text)
    print(data)


if __name__ == "__main__":
    main()

Although simplistic, this approach is very close to what is commonly used to extract data from a PostgreSQL database with Python, and it is particularly useful if you want to keep external dependencies to a minimum beyond psycopg2 itself. However, once the data is extracted, further transformations are usually needed, and pandas or polars are popular choices for that. Let's see how to modify the script to use them.

To begin with pandas, we first set up an engine, leveraging SQLAlchemy as shown in the create_tables.py example. We can reuse the get_engine function we've already implemented for this purpose. Once the engine is established, fetching data into a pandas data frame becomes straightforward with the read_sql_query function. Here’s how we can redefine our execute_query function:

def execute_query_pandas(query: str, **kwargs) -> pd.DataFrame:
    """Execute a SQL query and read the results using pandas."""
    try:
        # Create a SQLAlchemy engine
        engine = get_engine(DB_CONFIG)

        # Execute the query and read data into a pandas DataFrame
        df = pd.read_sql_query(query, con=engine, **kwargs)

        print("Query executed successfully")

        return df

    except (Exception, pd.errors.DatabaseError) as error:
        print(f"Error: {error}")
        return pd.DataFrame()

The polars implementation is very similar to the pandas one, but the output will of course be a polars data frame. To start, we again need a connection to the PostgreSQL database through an engine, and we'll once more use SQLAlchemy for this, just as in the pandas implementation. Then it's a simple matter of calling polars' read_database function. The modified execute_query function can therefore look like this:

def execute_query_polars(query: str, **kwargs) -> pl.DataFrame:
    """Execute a SQL query and read the results using polars."""
    try:
        # Create a SQLAlchemy engine
        engine = get_engine(DB_CONFIG)

        # Execute the query
        df = pl.read_database(query, connection=engine, **kwargs)

        print("Query executed successfully")

        return df

    except Exception as error:
        print(f"Error: {error}")
        return pl.DataFrame()

Batch Results with Pandas and Polars

Both pandas and Polars offer features for handling large datasets in batches when querying PostgreSQL databases. In pandas, the chunksize parameter allows you to iterate over query results in manageable portions. Similarly, Polars provides the optional iter_batches and batch_size parameters. In our example, using these parameters is straightforward: because execute_query_pandas and execute_query_polars accept **kwargs, we can simply pass the additional parameters when calling the functions. The main function in the query.py file can therefore become:

def main() -> None:
    table_name = "small_table"
    query_text = f"SELECT * FROM {table_name}"
    data = execute_query(query=query_text)
    print(data)
    data = execute_query_pandas(query=query_text, chunksize=10)
    print(data)
    data = execute_query_polars(query=query_text, iter_batches=True, batch_size=10)
    print(data)

As expected, the functions now return generators, allowing iterative access to the query results in chunks of the specified size. However, with a PostgreSQL setup, both pandas and polars still retrieve the entire result set when executing the query and only then partition it into chunks. This can lead to memory issues if the resulting data frame is large, potentially causing slow performance or out-of-memory errors.

To address this challenge, we recommend implementing server-side batched execution. With this approach, the query is executed through a cursor that lives on the PostgreSQL server itself, so the data is segmented at the source, mitigating memory constraints on the client and improving overall query efficiency.

Setup of an Effective Server-Side Execution

This approach involves creating a server-side named cursor and wrapping it in a Python generator that yields the query results in batches. Because the named cursor lives on the PostgreSQL server, only the rows requested by each fetch are transferred to the client, rather than the whole result set at once. To facilitate this, we can create a new function:

# Additions to the imports at the top of query.py:
from typing import Generator

import pandas as pd


def execute_query_in_batches(
    query: str,
    chunk_size: int = 10_000,
) -> Generator[pd.DataFrame, None, None]:
    """Execute a query in batches using a server-side cursor to optimize local memory usage."""

    def _record_batches():
        with psycopg2.connect(**DB_CONFIG) as conn:
            # Passing a name to cursor() creates a server-side (named) cursor,
            # so PostgreSQL keeps the result set on the server and streams rows
            # to the client as they are fetched.
            with conn.cursor(name="batch-operator") as cur:
                cur.execute(query)
                while True:
                    batch_rows = cur.fetchmany(chunk_size)
                    if not batch_rows:
                        break
                    # For a named cursor, description is only guaranteed to be
                    # populated once rows have been fetched, so we read the
                    # column names inside the loop.
                    column_names = [col[0] for col in cur.description]
                    yield pd.DataFrame(batch_rows, columns=column_names)

    return _record_batches()

This implementation returns the data as pandas data frames, which is purely a matter of convenience. Building on our main function, we can do the following to obtain three generators that provide query results in chunks:

def main() -> None:
    table_name = "small_table"
    query_text = f"SELECT * FROM {table_name}"
    data = execute_query(query=query_text)
    print(data)
    data = execute_query_pandas(query=query_text, chunksize=10)
    print(data)
    data = execute_query_polars(query=query_text, iter_batches=True, batch_size=10)
    print(data)
    data = execute_query_in_batches(query=query_text, chunk_size=10)
    print(data)
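Because execute_query_in_batches yields one pandas data frame per batch, the results can be processed incrementally without ever materializing the full table on the client. Below is a minimal usage sketch, assuming we aggregate the v1 column generated by create_tables.py; the helper name is hypothetical.

def sum_column_in_batches(table_name: str, column: str = "v1") -> float:
    """Aggregate a single column batch by batch, holding only one chunk in memory at a time."""
    total = 0.0
    query_text = f"SELECT * FROM {table_name}"
    for batch in execute_query_in_batches(query=query_text, chunk_size=10_000):
        total += batch[column].sum()
    return total

Each iteration pulls only chunk_size rows from the server, so peak memory stays roughly proportional to the batch size rather than to the table size.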

Comparison of Memory Usage and Runtimes

To test our implementation, we can compare the performance of the three batched approaches in terms of runtime and memory usage. After all, we are mainly concerned with how quickly and how efficiently the data can be retrieved. The comparisons were done on a table with 1,000,000 rows and 20 columns, using a chunk size of 10,000. In each case, the query function was called 10 times.
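The benchmarking harness itself is not shown here, but a comparison along these lines can be reproduced with a sketch like the one below. It measures wall-clock time with time.perf_counter and the process's resident memory with psutil; the use of psutil and the helper names are assumptions for illustration, and any profiler would do.

import time

import psutil  # assumed to be installed; used only for the memory reading


def measure(make_batches) -> tuple[float, float]:
    """Consume a generator of chunks, returning (elapsed seconds, resident memory in MB after the run)."""
    process = psutil.Process()
    start = time.perf_counter()
    for batch in make_batches():
        _ = batch  # consume every chunk so the whole result is actually read
    elapsed = time.perf_counter() - start
    mem_mb = process.memory_info().rss / 1024**2
    return elapsed, mem_mb


# Example call for the server-side cursor approach, assuming the hypothetical large_table
elapsed, mem_mb = measure(
    lambda: execute_query_in_batches(query="SELECT * FROM large_table", chunk_size=10_000)
)

The measured runtimes and memory usage for the three approaches are summarized below.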

Run                            1      2      3      4      5      6      7      8      9      10     Avg
Pandas              Time (s)   3.859  3.801  4.069  4.041  4.098  3.99   4.188  3.895  4.044  4.011  4.0
                    Mem (MB)   755.7  721.9  758.7  755    760.4  759.1  767.1  761.5  758.5  758    755.6
Polars              Time (s)   3.157  3.186  3.004  3.822  3.132  3.134  3.157  3.184  3.148  3.292  3.22
                    Mem (MB)   729.3  727.9  729.9  728.3  725.4  729.6  728    728.5  729.1  729.3  728.5
Server-Side Cursor  Time (s)   3.671  3.640  3.572  3.594  3.613  3.689  3.683  3.575  3.604  3.572  3.62
                    Mem (MB)   205.8  207.2  201.7  207.2  205.5  205.3  210.2  206.2  203.8  205.2  205.8

The results in the table above clearly show that the server-side cursor is by far the most memory-efficient solution, averaging 205.8 MB. This is significantly lower than both pandas (755.6 MB) and polars (728.5 MB): approximately 72.8% less memory than pandas and 71.7% less than polars. While pandas and polars have similar memory footprints, polars delivers the best time performance with an average of 3.22 seconds, roughly 19.5% faster than pandas (4.0 seconds). The server-side cursor is also faster than pandas, with an average execution time of 3.62 seconds, about 9.5% faster.

Conclusions

This article presented a solution for memory-efficient query execution using Python and PostgreSQL. This approach utilizes a server-side cursor for batched query execution, and its performance was compared with two popular libraries commonly used for data extraction. The results indicated that server-side cursor batched querying offers substantial memory savings and good time performance, making it ideal for memory-constrained environments or large datasets where memory efficiency is critical. For applications where processing speed is the priority, polars is the recommended choice. However, for applications where memory usage is the primary concern, the server-side cursor remains the most efficient option.

Tags

Python
PostgreSQL
Pandas
Polars
SQLAlchemy
