Elasticsearch with Python: 7 Tips and Best Practices

Patryk Młynarek - Backend Engineer

7 December 2023, 11 min read

What's inside

  1. Getting started
  2. Useful Links
  3. Conclusion

Elasticsearch is an open-source distributed search server that comes in handy for building applications with full-text search capabilities. While its core implementation is in Java, it provides a REST interface that allows developers to interact with Elasticsearch using any programming language – including Python.

In this article, I'll show some essential best practices for using Elasticsearch with Python in any project.

Do you work with Django? Here’s an article on how to use Elasticsearch with Django.

Getting started

Before you can run the examples provided in this article, you'll need to set up an Elasticsearch instance. The quickest and simplest way to do this is by creating a Docker container with Elasticsearch.

Here are the steps to get started:

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.10.3
  • The -p flags specify port mapping, allowing you to access Elasticsearch on ports 9200 and 9300 on your local machine.
  • The -e "discovery.type=single-node" flag configures Elasticsearch to run as a single-node cluster for local development.
  • The -e "xpack.security.enabled=false" flag disables security features. This setting is recommended for local development only, since it allows plain HTTP communication; in Elasticsearch 8, security features (including SSL/TLS for HTTP clients) are enabled by default.

Once the Docker container is up and running, you can check if everything works correctly by visiting the following local URL in your web browser: http://localhost:9200/_cluster/health.

  • When you access this URL, you should receive a JSON response. Look for the "status" field in the JSON response. A "green" status indicates that your Elasticsearch cluster is healthy and operational.
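
The same check can be scripted. Below is a minimal sketch using only the standard library; `cluster_is_healthy` is a hypothetical helper, and the JSON body is a sample of what the health endpoint returns (a real check would fetch it from http://localhost:9200/_cluster/health first):

```python
import json

def cluster_is_healthy(health_json: str) -> bool:
    """Return True if the cluster health response reports a usable status."""
    health = json.loads(health_json)
    # "green" means all shards are allocated; "yellow" is common on a
    # single-node cluster, where replica shards cannot be assigned.
    return health.get("status") in ("green", "yellow")

# In a real check you would fetch the body first, e.g.:
#   import urllib.request
#   body = urllib.request.urlopen("http://localhost:9200/_cluster/health").read()
sample = '{"cluster_name": "docker-cluster", "status": "green", "number_of_nodes": 1}'
print(cluster_is_healthy(sample))  # True
```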

TIP: Managing Disk Space for Elasticsearch.

In Elasticsearch, maintaining free disk space is critical. When a node exceeds the "high disk watermark" (90% disk usage by default), Elasticsearch starts relocating shards away from it, and once the flood-stage watermark (95% by default) is reached, affected indices are marked read-only and writes fail. Regularly monitor disk usage and keep adequate free space to prevent data loss and maintain cluster functionality.
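
One way to keep an eye on this is the cat allocation API. Here is a sketch of parsing its JSON output; `nodes_over_watermark` is a hypothetical helper, and the sample row stands in for real data, which would come from `es.cat.allocation(format="json")` against a running cluster:

```python
import json

def nodes_over_watermark(allocation_json: str, threshold: float = 90.0) -> list:
    """Return names of nodes whose disk usage exceeds `threshold` percent."""
    rows = json.loads(allocation_json)
    return [row["node"] for row in rows if float(row["disk.percent"]) > threshold]

# Real data would come from es.cat.allocation(format="json"); this is a sample:
sample = '[{"node": "es01", "disk.percent": "93"}, {"node": "es02", "disk.percent": "41"}]'
print(nodes_over_watermark(sample))  # ['es01']
```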

Installing the Python Elasticsearch Client

When working with Elasticsearch in Python, you have several options to meet your specific needs. Here are a few popular choices:

  • elasticsearch-py: This is the official low-level client for Elasticsearch. It provides a direct and flexible way to interact with Elasticsearch using Python.
  • elasticsearch-dsl: If you prefer a more convenient and high-level wrapper for Elasticsearch-py, elasticsearch-dsl is an excellent choice. It simplifies working with Elasticsearch and is suitable for most use cases.
  • django-elasticsearch-dsl: For those using Django, django-elasticsearch-dsl is a thin wrapper built on top of elasticsearch-dsl. It offers seamless integration with Django and allows you to use Elasticsearch in your Django applications effortlessly.

You can combine these packages simultaneously, depending on your project's specific requirements.

Explicit Mapping in Elasticsearch

In Elasticsearch, the mapping defines how data is stored and indexed. While Elasticsearch can automatically generate mappings, relying on these defaults may not always suit your needs. See the official mapping documentation and consider this example:

from datetime import date

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://localhost:9200"])

doc = {
    "internal_id": 123,
    "external_id": 99122,
    "name": "Jazz concert",
    "category": "concert",
    "where": "Warsaw",
    "when": date.today(),
    "duration_hours": 2,
    "is_free": False,
}
es.index(index="events", body=doc)
es.indices.get_mapping(index="events")

In this example, Elasticsearch generates a mapping based on the first document. Subsequent attempts to index data with different types or formats may result in mapper exceptions. For instance:

doc = {
   'internal_id': "5ab7dc38-64f0-46c6-9ebd-e039ee4d4c4a",
   'external_id': 99123,
   'name': 'Some performance',
   'category': 'performance',
   'where': 'Cracow',
   'when': date.today(),
   'duration_hours': None,
   'is_free': True,
}

es.index(index='events', body=doc)

In this case, Elasticsearch expects the internal_id field to be a numeric long type based on the previous document. Attempting to index data with a different data type results in a mapper exception.

Expected exception:

BadRequestError: BadRequestError(400, 'document_parsing_exception', "[1:16] failed to parse field [internal_id] of type [long] in document with id 'T_tkpIsBw50I2HjKLiHT'. Preview of field's value: '5ab7dc38-64f0-46c6-9ebd-e039ee4d4c4a'")

When choosing mappings, consider the nature of your data. For instance, if external_id is only ever matched exactly, a keyword mapping is better suited for term queries, while numeric fields are optimized for range queries. Selecting the appropriate mapping is essential to ensure your Elasticsearch index functions as expected.
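
To make the distinction concrete, here is a sketch of the two query shapes (request bodies only, reusing the field names from the events example): an exact lookup via a term query on a keyword subfield, versus a range query on a numeric field.

```python
# Exact-match lookup: a term query against a keyword (sub)field compares
# the stored value byte-for-byte, with no analysis applied.
term_query = {
    "query": {"term": {"internal_id.keyword": "5ab7dc38-64f0-46c6-9ebd-e039ee4d4c4a"}}
}

# Range lookup: numeric mappings (here, a long) are optimized for this.
range_query = {"query": {"range": {"external_id": {"gte": 99000, "lte": 99500}}}}

# Either body could then be passed to es.search(index="events", body=...).
```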

Explicit mapping example:

from datetime import date

from elasticsearch import Elasticsearch


es = Elasticsearch(hosts=["http://localhost:9200"])
mapping = {
    "mappings": {
        "properties": {
            "category": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "external_id": {"type": "long"},
            "internal_id": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "is_free": {"type": "boolean"},
            "name": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "when": {"type": "date"},
            "where": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
        }
    }
}
es.indices.create(index="events_explicit_mapping", body=mapping)
print(es.indices.get_mapping(index="events_explicit_mapping"))


doc = {
    "internal_id": 123,
    "external_id": 99122,
    "name": "Jazz concert",
    "category": "concert",
    "where": "Warsaw",
    "when": date.today(),
    "duration_hours": 2,
    "is_free": False,
}
es.index(index="events_explicit_mapping", body=doc)

We can also use the elasticsearch-dsl package and its Document class to explicitly define the structure of our documents. See examples in the following sections.

Use the Right Library

Python developers can take advantage of the official low-level Elasticsearch client, elasticsearch-py. When dealing with simple cases, it's a suitable choice. However, for more complex searches, consider using elasticsearch-dsl, which builds upon elasticsearch-py. It offers several advantages, such as a more Pythonic query-writing approach, reduced risk of syntax errors, and enhanced query modification capabilities.

Using elasticsearch-py:

es.search(
    index="events",
    body={
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {"is_free": False},
                    },
                    {
                        "match": {"category": "concert"},
                    },
                ],
                "filter": [
                    {
                        "match": {"where": "Warsaw"},
                    },
                ],
            }
        }
    },
)

Using elasticsearch-dsl:

from elasticsearch_dsl import Search
from elasticsearch_dsl import Q


result = Search(using=es, index="events").query(
    Q(
        "bool",
        must=[
            Q("match", is_free=False),
            Q("match", category="concert"),
        ],
        filter=[
            Q("match", where="Warsaw"),
        ],
    )
).execute()

for hit in result.hits:
    print(f"Name: {hit.name}")
    print(f"Is free: {hit.is_free}")
    print(f"Where: {hit.where}")

If you're working with Django, django-elasticsearch-dsl is a valuable option. It's built on elasticsearch-dsl and allows you to create Elasticsearch indexes and document types based on your Django models, automating much of the process.

Example using django-elasticsearch-dsl:

from django_elasticsearch_dsl import Document
from django_elasticsearch_dsl.registries import registry


class Event(models.Model):
    name = models.CharField()
    category = models.CharField()
    is_free = models.BooleanField()
    where = models.CharField()


@registry.register_document
class EventDocument(Document):
    class Index:
        name = "events_dj"
        settings = {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }

    class Django:
        model = Event
        fields = ["name", "category", "is_free", "where"]

s = EventDocument.search().query(
    "match", category="concert"
)

With django-elasticsearch-dsl, your Elasticsearch index stays updated automatically when objects are created or deleted, thanks to Django signals like post_save and post_delete. This streamlines the integration of Elasticsearch with your Django project. Please refer to the official documentation for a comprehensive guide on fully integrating Django with Elasticsearch using django-elasticsearch-dsl.

Bulk Helpers

Performing operations on a massive document set one by one is inefficient: every single document would cost a separate request. That's why it's smart to use the bulk helpers instead, which group many operations into a single request. Here's how they work:

from datetime import date
from typing import Any
from typing import Dict
from typing import List

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch_dsl import Date
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text


es = Elasticsearch(hosts=["http://localhost:9200"])

class EventDocument(Document):
    name = Text()
    when = Date()

    class Index:
        name = "events"


docs: List[Dict[str, Any]] = []
for i in range(100):
    document = EventDocument(
        name=f"Sample event {i}",
        when=date.today(),
    )
    docs.append(
        document.to_dict(include_meta=True)
    )


bulk(es, docs)
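
The helper also accepts plain dictionaries, so elasticsearch-dsl is not strictly required: each action names its target index under "_index" and carries the document under "_source". A minimal sketch, reusing the index and field names from above:

```python
from datetime import date

# Each action is a plain dict; "_index" routes it, "_source" is the document.
actions = [
    {
        "_index": "events",
        "_source": {"name": f"Sample event {i}", "when": date.today().isoformat()},
    }
    for i in range(100)
]

# All of these would then be sent in one request with:
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
print(len(actions))  # 100
```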

Take Advantage of Aliases

Index aliases in Elasticsearch offer a versatile approach to handling indices. These aliases provide secondary names for one or more indexes, making them invaluable for various scenarios.

One significant advantage of using index aliases is migrating or changing an index with zero downtime. This means you can seamlessly transition between indices without disrupting your system's availability.

Furthermore, index aliases simplify query operations by allowing you to group multiple indices under a common alias. This grouping enhances overall index management efficiency and provides an effective way to work with related indices. It offers great flexibility in your Elasticsearch setup while ensuring a streamlined and efficient workflow.

Consider the following example:

from datetime import date

from elasticsearch_dsl import Date
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text
from elasticsearch import Elasticsearch


es = Elasticsearch(hosts=["http://localhost:9200"])


class EventDocument(Document):
    name = Text()
    when = Date()

    class Index:
        name = "events_v1"
        using = es


EventDocument.init()
event = EventDocument(name="Some event", when=date.today())
event.save()

EventDocument._index.put_alias(name="events_today")


class NewEventDocument(Document):
    name = Text()
    when = Text()

    class Index:
        name = "events_v2"
        using = es


NewEventDocument.init()
new_event = NewEventDocument(name="New type event", when="May")
new_event.save()


NewEventDocument._index.put_alias(name="events_today")
EventDocument._index.delete_alias(name="events_today")
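
Note that adding the new alias and deleting the old one in two calls leaves a brief window in which events_today points at both indices. The _aliases API can perform both steps atomically; here is a sketch of building that request body (`build_alias_swap` is a hypothetical helper, and the exact client call varies by version, e.g. `es.indices.update_aliases(body=...)` in older clients):

```python
def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that moves `alias` in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# The payload would then be sent in a single request, e.g.:
#   es.indices.update_aliases(body=build_alias_swap("events_today", "events_v1", "events_v2"))
```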

Enhancing Search Flexibility with ASCIIFolding

When dealing with searchable data that includes characters with diacritics, such as "ą," "č," or "ė," it's advisable to incorporate the ASCII Folding Token Filter. Users often type queries without diacritical marks, and you want your search results to accommodate such queries seamlessly.

To include ASCIIFolding in your analyzer's filters, follow these steps:

from datetime import date

from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer
from elasticsearch_dsl import Date
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text
from elasticsearch_dsl import Search

es = Elasticsearch(hosts=["http://localhost:9200"])

folding_analyzer = analyzer(
    "folding_analyzer",
    tokenizer="standard",
    filter=["lowercase", "asciifolding"],
)


class EventDocument(Document):
    name = Text()
    when = Date()
    where = Text(analyzer=folding_analyzer)

    class Index:
        name = "events_v3"
        using = es


EventDocument.init()

event = EventDocument(name="Sample event", when=date.today(), where="Kraków")
event.save()

event._index.refresh()
search = Search(using=es).query("match", where="Krakow")

search.execute()

By utilizing ASCIIFolding, you enhance the flexibility of your search capabilities, making it easier for users to find relevant results, even when they omit diacritical marks in their queries.
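
The filter itself runs inside Elasticsearch, but its effect on tokens can be approximated in plain Python with unicodedata: decompose each character, then drop the combining marks. This is only an illustration of what asciifolding does, not the filter itself:

```python
import unicodedata

def fold_ascii(text: str) -> str:
    """Approximate the asciifolding filter: strip combining diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_ascii("Kraków"))   # Krakow
print(fold_ascii("ą, č, ė"))  # a, c, e
```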

Benefit of Auto-Generated IDs

In Elasticsearch, when you explicitly set an ID for an indexed document, Elasticsearch must check if that ID already exists within the same shard. This operation can be resource-intensive, especially in large indices.

To optimize your project's efficiency, consider using auto-generated IDs. This approach bypasses the need to check for existing IDs and can save valuable time, particularly in scenarios where indexing a significant amount of data is involved.

Boosting Search Speed with Copy-to Parameter

Efficient searching is a key aspect of Elasticsearch's performance. The more fields included in a multi-match query, the slower the search becomes. It's advisable to use as few fields as possible to optimize search speed. You can achieve this by employing the copy_to parameter in your field mappings to copy their values to a designated search field.

Here's a practical example:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text
from elasticsearch_dsl import Search


es = Elasticsearch(hosts=["http://localhost:9200"])


class EventDocument(Document):
    search = Text()
    name = Text(copy_to="search")
    category = Text(copy_to="search")
    where = Text(copy_to="search")

    class Index:
        name = "events_v4"
        using = es


EventDocument.init()

event = EventDocument(name="Rock band", category="concert", where="Warsaw")
event.save()

event._index.refresh()

search = Search(using=es).query("match", search="Rock band concert Warsaw")
search.execute()

Conclusion

In conclusion, integrating Elasticsearch with Python leverages the combined strength of a powerful search engine and a versatile programming language, enabling developers to create applications with advanced search and data indexing capabilities.

By following best practices and selecting the right tools, you can achieve peak performance and efficient data management in your Elasticsearch projects.

Key takeaways include:

  • Selecting the appropriate Python library tailored to your project's needs is vital. For simple direct interactions, elasticsearch-py is apt, whereas elasticsearch-dsl is suited for a more high-level interface. For Django applications, django-elasticsearch-dsl offers seamless integration, leveraging Django's features to enhance Elasticsearch's capabilities.

  • Developers can significantly improve search efficiency and data integrity through explicit mappings and judicious use of index aliases and copy-to parameters. These practices also ensure that your Elasticsearch indices are precisely tailored to the nature of your data, thus facilitating smoother migrations and updates.

  • Incorporating features like ASCIIFolding and auto-generated IDs elevates the user's search experience by accommodating diverse input patterns and optimizes performance by reducing the overhead associated with ID management.

  • Utilizing bulk helpers and index aliases improves efficiency, especially when dealing with large datasets or making index alterations without downtime.

  • Integrating ASCIIFolding and the strategic use of the copy-to parameter empowers your search functionality, allowing for quick and accurate results even with complex queries.

At Sunscrapers, we are dedicated to ensuring your data is as searchable and manageable as possible, paving the way for enhanced decision-making and business intelligence. Contact us today to elevate your Elasticsearch strategy to the next level.

Patryk Młynarek - Backend Engineer

Patryk is an experienced Senior Python Developer who puts business value first. A web application enthusiast, he works on everything from initial development to server maintenance, ensuring the entire process runs smoothly. In his free time, Patryk enjoys playing board games and motorcycling.

Tags

python
django
