Elasticsearch is an open-source distributed search server that comes in handy for building applications with full-text search capabilities. While its core implementation is in Java, it provides a REST interface that allows developers to interact with Elasticsearch using any programming language – including Python.
In this article, I'll show some essential best practices for using Elasticsearch with Python in any project.
Do you work with Django? Here’s an article on how to use Elasticsearch with Django.
Getting started
Before you can run the examples provided in this article, you'll need to set up an Elasticsearch instance. The quickest and simplest way to do this is by creating a Docker container with Elasticsearch.
Here are the steps to get started:
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.10.3
- The -p flags map ports 9200 (the REST API) and 9300 (node-to-node transport) to your local machine.
- The -e "discovery.type=single-node" flag configures Elasticsearch to run as a single-node cluster for local development.
- We use -e "xpack.security.enabled=false" to disable security features. This setting is suitable for local development only, because it allows plain HTTP communication without authentication. In Elasticsearch 8, security, including TLS on the HTTP layer, is enabled by default.
Once the Docker container is up and running, you can check if everything works correctly by visiting the following local URL in your web browser: http://localhost:9200/_cluster/health.
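When you access this URL, you should receive a JSON response. Look for the "status" field: "green" means the cluster is healthy and every shard is allocated. On a single-node cluster, "yellow" is also common once an index with replica shards exists, because a replica cannot be allocated on the same node as its primary.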
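The same check can be scripted. The sketch below uses only the standard library and assumes the unauthenticated local instance started with the Docker command above. The helper names (fetch_cluster_health, is_usable) are illustrative and not part of any Elasticsearch client.

```python
import json
from urllib.request import urlopen

# Assumption: the local, security-disabled container from the Docker command.
HEALTH_URL = "http://localhost:9200/_cluster/health"


def fetch_cluster_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Call the cluster health endpoint and return the parsed JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def is_usable(health: dict, allow_yellow: bool = True) -> bool:
    """Return True when the reported status is acceptable.

    Accepting "yellow" is a choice made here for local development.
    """
    accepted = {"green", "yellow"} if allow_yellow else {"green"}
    return health.get("status") in accepted
```

Once the container is running, is_usable(fetch_cluster_health()) gives a quick yes/no answer that a startup script could poll.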
TIP: Managing Disk Space for Elasticsearch.
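In Elasticsearch, keeping enough free disk space is critical. Once a node's disk usage exceeds the "high disk watermark" (90% by default), Elasticsearch starts moving shards away from that node. At the "flood stage" watermark (95% by default), it marks the affected indices as read-only, so indexing requests fail. Monitor disk usage regularly and keep adequate free space to avoid rejected writes and to keep the cluster functional.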
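As a rough illustration of how these thresholds relate, the sketch below classifies a disk usage figure against Elasticsearch's default watermarks (low 85%, high 90%, flood stage 95%). The function name and the returned labels are made up for this example, and a real cluster may be configured with different values.

```python
# Default disk watermarks; clusters can override them in their settings.
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE_WATERMARK = 95.0


def disk_state(used_bytes: int, total_bytes: int) -> str:
    """Classify disk usage against the default watermarks."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_percent = 100.0 * used_bytes / total_bytes
    if used_percent > FLOOD_STAGE_WATERMARK:
        return "flood_stage"  # indices on the node become read-only
    if used_percent > HIGH_WATERMARK:
        return "high"  # shards are relocated away from the node
    if used_percent > LOW_WATERMARK:
        return "low"  # no new shards are allocated to the node
    return "ok"
```

The inputs could come from shutil.disk_usage on the data path or from the cluster's own allocation statistics.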
Installing the Python Elasticsearch Client
When working with Elasticsearch in Python, you have several options to meet your specific needs. Here are a few popular choices:
- elasticsearch-py: This is the official low-level client for Elasticsearch. It provides a direct and flexible way to interact with Elasticsearch using Python.
- elasticsearch-dsl: If you prefer a more convenient, high-level wrapper around elasticsearch-py, elasticsearch-dsl is an excellent choice. It simplifies working with Elasticsearch and is suitable for most use cases.
- django-elasticsearch-dsl: For those using Django, django-elasticsearch-dsl is a thin wrapper built on top of elasticsearch-dsl. It integrates with Django models so you can use Elasticsearch in your Django applications with little extra code.
You can use these packages side by side, depending on your project's specific requirements.
Explicit Mapping in Elasticsearch
In Elasticsearch, the mapping defines how data is stored and indexed. While Elasticsearch can generate mappings automatically, relying on these defaults may not always suit your needs. See the official documentation on mappings, and consider this example:
from datetime import date

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://localhost:9200"])

doc = {
    "internal_id": 123,
    "external_id": 99122,
    "name": "Jazz concert",
    "category": "concert",
    "where": "Warsaw",
    "when": date.today(),
    "duration_hours": 2,
    "is_free": False,
}
# The index does not exist yet, so Elasticsearch creates it
# and infers the mapping from this first document.
es.index(index="events", document=doc)
print(es.indices.get_mapping(index="events"))
In this example, Elasticsearch generates a mapping based on the first document. Subsequent attempts to index data with different types or formats may result in mapper exceptions. For instance:
# Reuses `es` and `date` from the previous snippet.
doc = {
    "internal_id": "5ab7dc38-64f0-46c6-9ebd-e039ee4d4c4a",  # a string, not a number
    "external_id": 99123,
    "name": "Some performance",
    "category": "performance",
    "where": "Cracow",
    "when": date.today(),
    "duration_hours": None,
    "is_free": True,
}
es.index(index="events", document=doc)
In this case, Elasticsearch expects the internal_id field to be a numeric long type based on the previous document. Attempting to index data with a different data type results in a mapper exception.
Expected exception:
BadRequestError: BadRequestError(400, 'document_parsing_exception', "[1:16] failed to parse field [internal_id] of type [long] in document with id 'T_tkpIsBw50I2HjKLiHT'. Preview of field's value: '5ab7dc38-64f0-46c6-9ebd-e039ee4d4c4a'")
When choosing mappings, consider the nature of your data and how you will query it. For instance, if external_id is only ever looked up by exact value, mapping it as keyword suits term queries better, whereas numeric types are optimized for range queries. Selecting the appropriate mapping is essential to ensure your Elasticsearch index behaves as expected.
Explicit mapping example:
from datetime import date

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://localhost:9200"])

# Field definitions for the index; fields not listed here are still mapped dynamically.
mappings = {
    "properties": {
        "category": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
        "external_id": {"type": "long"},
        "internal_id": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
        "is_free": {"type": "boolean"},
        "name": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
        "when": {"type": "date"},
        "where": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
    }
}
es.indices.create(index="events_explicit_mapping", mappings=mappings)
print(es.indices.get_mapping(index="events_explicit_mapping"))

doc = {
    "internal_id": 123,
    "external_id": 99122,
    "name": "Jazz concert",
    "category": "concert",
    "where": "Warsaw",
    "when": date.today(),
    "duration_hours": 2,
    "is_free": False,
}
es.index(index="events_explicit_mapping", document=doc)
We can also use the Document class from the elasticsearch-dsl package, which explicitly defines the structure of the documents. See examples in the following sections.
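To make the keyword-versus-numeric point from earlier in this section concrete, here is a small sketch that builds a mapping and the two kinds of query as plain dictionaries. The helper names are hypothetical, and treating external_id as a keyword is an assumption about how the field would be queried, not a requirement.

```python
from typing import Any, Dict, Optional


def build_event_mappings() -> Dict[str, Any]:
    """Mappings where the identifier is a keyword and the duration is numeric."""
    return {
        "properties": {
            "external_id": {"type": "keyword"},  # exact-match lookups only
            "duration_hours": {"type": "integer"},  # compared with range queries
        }
    }


def term_query(field: str, value: Any) -> Dict[str, Any]:
    """Exact-value lookup, a good fit for keyword fields."""
    return {"term": {field: value}}


def range_query(
    field: str, gte: Optional[float] = None, lte: Optional[float] = None
) -> Dict[str, Any]:
    """Range lookup, a good fit for numeric and date fields."""
    bounds: Dict[str, float] = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}
```

With the 8.x Python client, the mapping dictionary can be passed as the mappings argument of es.indices.create, and either query dictionary as the query argument of es.search.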
Use the Right Library
Python developers can take advantage of the official low-level Elasticsearch client, elasticsearch-py. When dealing with simple cases, it's a suitable choice. However, for more complex searches, consider using elasticsearch-dsl, which builds upon elasticsearch-py. It offers several advantages, such as a more Pythonic way of writing queries, a reduced risk of syntax errors, and easier modification of existing queries.
Using elasticsearch-py:
# Reuses the `es` client created in the earlier examples.
es.search(
    index="events",
    query={
        "bool": {
            "must": [
                {
                    "match": {"is_free": False},
                },
                {
                    "match": {"category": "concert"},
                },
            ],
            "filter": [
                {
                    "match": {"where": "Warsaw"},
                },
            ],
        }
    },
)
Using elasticsearch-dsl:
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q

# Reuses the `es` client created in the earlier examples.
result = Search(using=es, index="events").query(
    Q(
        "bool",
        must=[
            Q("match", is_free=False),
            Q("match", category="concert"),
        ],
        filter=[
            Q("match", where="Warsaw"),
        ],
    )
).execute()

for hit in result.hits:
    print(f"Name: {hit.name}")
    print(f"Is free: {hit.is_free}")
    print(f"Where: {hit.where}")
If you're working with Django, django-elasticsearch-dsl is a valuable option. It's built on elasticsearch-dsl and allows you to create Elasticsearch indices and document types based on your Django models, automating much of the process.
Example using django-elasticsearch-dsl:
from django.db import models
from django_elasticsearch_dsl import Document
from django_elasticsearch_dsl.registries import registry


class Event(models.Model):
    # CharField requires max_length; 255 is an arbitrary choice for this example.
    name = models.CharField(max_length=255)
    category = models.CharField(max_length=255)
    is_free = models.BooleanField()
    where = models.CharField(max_length=255)


@registry.register_document
class EventDocument(Document):
    class Index:
        name = "events_dj"
        settings = {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }

    class Django:
        model = Event
        fields = ["name", "category", "is_free", "where"]


s = EventDocument.search().query("match", category="concert")
With django-elasticsearch-dsl, your Elasticsearch index stays updated automatically when objects are saved or deleted, thanks to Django signals like post_save and post_delete. This streamlines the integration of Elasticsearch with your Django project.
Please refer to the official documentation for a comprehensive guide on fully integrating Django with Elasticsearch using django-elasticsearch-dsl.
Bulk Helpers
Indexing a large set of documents one by one is inefficient, because every document costs a separate HTTP request. Bulk helpers batch many operations into a small number of requests instead. Here's how bulk helpers work:
from datetime import date
from typing import Any
from typing import Dict
from typing import List

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch_dsl import Date
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text

es = Elasticsearch(hosts=["http://localhost:9200"])


class EventDocument(Document):
    # Fields must be instances (Text(), not Text) to be registered in the mapping.
    name = Text()
    when = Date()

    class Index:
        name = "events"


docs: List[Dict[str, Any]] = []
for i in range(100):
    document = EventDocument(
        name=f"Sample event {i}",
        when=date.today(),
    )
    # include_meta=True adds "_index" so the bulk helper knows the target index.
    docs.append(document.to_dict(include_meta=True))

# Sends all 100 documents in batched requests instead of 100 separate ones.
bulk(es, docs)
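For larger data sets, building the whole list in memory may not be desirable. The bulk helper accepts any iterable of actions, so one option is a generator that yields them lazily. The sketch below uses plain dictionaries; generate_actions is a hypothetical helper name, and the sample events are invented for illustration.

```python
from datetime import date
from typing import Any, Dict, Iterable, Iterator


def generate_actions(
    events: Iterable[Dict[str, Any]], index: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per event without materializing a list."""
    for event in events:
        # No "_id" is set, so Elasticsearch generates one for each document.
        yield {"_index": index, "_source": event}


# Invented sample data; a real source could be a database cursor or a file.
sample_events = (
    {"name": f"Sample event {i}", "when": date.today().isoformat()}
    for i in range(3)
)
actions = generate_actions(sample_events, index="events")
```

Passing the generator to the helper, as in bulk(es, actions), would then stream the documents in chunks.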
Take Advantage of Aliases
Index aliases offer a flexible way to manage indices. An alias is a secondary name for one or more indices, and clients can query it exactly as they would query an index.
One significant advantage of aliases is that you can migrate to a new index, or change an existing one, with zero downtime: clients keep using the alias while you repoint it to another index.
Aliases also simplify queries by grouping several related indices under a common name, so a single search can cover all of them.
Consider the following example:
from datetime import date

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Date
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text

es = Elasticsearch(hosts=["http://localhost:9200"])


class EventDocument(Document):
    name = Text()
    when = Date()

    class Index:
        name = "events_v1"
        using = es


EventDocument.init()
event = EventDocument(name="Some event", when=date.today())
event.save(es)
# Point the alias at the first version of the index.
EventDocument._index.put_alias(name="events_today")


class NewEventDocument(Document):
    name = Text()
    when = Text()  # the new version stores "when" as free text

    class Index:
        name = "events_v2"
        using = es


NewEventDocument.init()
new_event = NewEventDocument(name="New type event", when="May")
new_event.save(es)
# Add the alias to the new index first, then remove it from the old one.
NewEventDocument._index.put_alias(name="events_today")
EventDocument._index.delete_alias(name="events_today")
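The example above changes the alias in two separate calls, so for a short moment it points at both indices. Elasticsearch's aliases API can apply several actions atomically in a single request. The sketch below only builds that list of actions; build_alias_swap is a hypothetical helper, not a library function.

```python
from typing import Dict, List


def build_alias_swap(
    alias: str, old_index: str, new_index: str
) -> List[Dict[str, Dict[str, str]]]:
    """Return the actions that move an alias from one index to another."""
    return [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]


swap_actions = build_alias_swap("events_today", "events_v1", "events_v2")
```

With the 8.x Python client, the list can be sent as es.indices.update_aliases(actions=swap_actions), so readers of the alias never see an intermediate state.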
Enhancing Search Flexibility with ASCIIFolding
When dealing with searchable data that includes characters with diacritical marks, such as "ą," "č," or "ė," it's advisable to incorporate the ASCII Folding Token Filter. Users often type queries without diacritics, and you want those queries to match the accented text as well.
To include ASCIIFolding in your analyzer's filters, follow these steps:
from datetime import date

from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer
from elasticsearch_dsl import Date
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text
from elasticsearch_dsl import Search

es = Elasticsearch(hosts=["http://localhost:9200"])

# Lowercase the tokens, then fold accented characters to their ASCII equivalents.
folding_analyzer = analyzer(
    "folding_analyzer",
    tokenizer="standard",
    filter=["lowercase", "asciifolding"],
)


class EventDocument(Document):
    name = Text()
    when = Date()
    where = Text(analyzer=folding_analyzer)

    class Index:
        name = "events_v3"
        using = es


EventDocument.init()
event = EventDocument(name="Sample event", when=date.today(), where="Kraków")
event.save()
event._index.refresh()

# The query omits the diacritic but still matches "Kraków".
search = Search(using=es, index="events_v3").query("match", where="Krakow")
response = search.execute()
for hit in response.hits:
    print(hit.name, hit.where)
By utilizing ASCIIFolding, you enhance the flexibility of your search capabilities, making it easier for users to find relevant results, even when they omit diacritical marks in their queries.
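To get an intuition for what folding does to a token, the following standard-library sketch approximates it with Unicode decomposition. It is not the filter Elasticsearch uses: it only strips combining marks, and fold_diacritics is a name made up for this illustration.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Lowercase the text and drop combining marks such as accents and ogoneks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


indexed_token = fold_diacritics("Kraków")
query_token = fold_diacritics("Krakow")
```

Both values end up identical, which is why the accented document and the unaccented query meet in the index. Letters that have no Unicode decomposition, such as "ł", pass through this sketch unchanged, while the real filter relies on a much larger mapping table.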
Benefit of Auto-Generated IDs
In Elasticsearch, when you explicitly set an ID for an indexed document, Elasticsearch must check if that ID already exists within the same shard. This operation can be resource-intensive, especially in large indices.
To optimize your project's efficiency, consider using auto-generated IDs. This approach bypasses the need to check for existing IDs and can save valuable time, particularly in scenarios where indexing a significant amount of data is involved.
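At the REST level, the difference comes down to which endpoint is called. The sketch below shows the two request shapes of the index API; index_request is a hypothetical helper written for illustration only.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def index_request(index: str, doc_id: Optional[str] = None) -> Tuple[str, str]:
    """Return the HTTP method and path used to index a single document."""
    if doc_id is None:
        # POST without an ID: Elasticsearch generates a unique ID itself.
        return "POST", f"/{index}/_doc"
    # PUT with an explicit ID: the ID must be looked up before writing.
    return "PUT", f"/{index}/_doc/{quote(doc_id, safe='')}"
```

With the Python client, simply leaving out the id argument of es.index has the same effect as the POST form.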
Boosting Search Speed with Copy-to Parameter
Efficient searching is a key aspect of Elasticsearch's performance. The more fields included in a multi-match query, the slower the search becomes. It's advisable to use as few fields as possible to optimize search speed. You can achieve this by employing the copy_to parameter in your field mappings to copy their values to a designated search field.
Here's a practical example:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Document
from elasticsearch_dsl import Text
from elasticsearch_dsl import Search

es = Elasticsearch(hosts=["http://localhost:9200"])


class EventDocument(Document):
    # Combined field that receives the values of the three fields below.
    search = Text()
    name = Text(copy_to="search")
    category = Text(copy_to="search")
    where = Text(copy_to="search")

    class Index:
        name = "events_v4"
        using = es


EventDocument.init()
event = EventDocument(name="Rock band", category="concert", where="Warsaw")
event.save()
event._index.refresh()

# A single-field match replaces a multi_match over name, category and where.
search = Search(using=es, index="events_v4").query("match", search="Rock band concert Warsaw")
response = search.execute()
for hit in response.hits:
    print(hit.name, hit.category, hit.where)
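For comparison, here are the two query bodies side by side as plain dictionaries. This is only a sketch: the helper names are invented, and the field list mirrors the document class above.

```python
from typing import Any, Dict, Iterable


def multi_match_query(text: str, fields: Iterable[str]) -> Dict[str, Any]:
    """Query every listed field separately; cost grows with the field count."""
    return {"multi_match": {"query": text, "fields": list(fields)}}


def single_field_query(text: str, field: str = "search") -> Dict[str, Any]:
    """Query only the combined field populated through copy_to."""
    return {"match": {field: text}}


before = multi_match_query("Rock band concert Warsaw", ["name", "category", "where"])
after = single_field_query("Rock band concert Warsaw")
```

Keep in mind that values copied with copy_to are indexed but not added to the document's _source, so the combined field is searchable without being returned in hits.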
Useful Links
- Elastic official documentation - How to
- Elastic official documentation - Profile API
- Python Elasticsearch Client docs: elasticsearch-py package
- Elasticsearch DSL docs: elasticsearch-dsl package
- Django Elasticsearch DSL docs: django-elasticsearch-dsl package
- Docker Hub Elasticsearch
Conclusion
In conclusion, integrating Elasticsearch with Python leverages the combined strength of a powerful search engine and a versatile programming language, enabling developers to create applications with advanced search and data indexing capabilities.
By following best practices and selecting the right tools, you can achieve peak performance and efficient data management in your Elasticsearch projects.
Key takeaways include:
- Selecting the appropriate Python library tailored to your project's needs is vital. For simple direct interactions, elasticsearch-py is apt, whereas elasticsearch-dsl is suited for a more high-level interface. For Django applications, django-elasticsearch-dsl offers seamless integration, leveraging Django's features to enhance Elasticsearch's capabilities.
- Developers can significantly improve search efficiency and data integrity through explicit mappings and judicious use of index aliases and copy-to parameters. These practices also ensure that your Elasticsearch indices are precisely tailored to the nature of your data, thus facilitating smoother migrations and updates.
- Incorporating features like ASCIIFolding and auto-generated IDs elevates the user's search experience by accommodating diverse input patterns and optimizes performance by reducing the overhead associated with ID management.
- Utilizing bulk helpers and index aliases improves efficiency, especially when dealing with large datasets or making index alterations without downtime.
- Integrating ASCIIFolding and the strategic use of the copy-to parameter empowers your search functionality, allowing for quick and accurate results even with complex queries.