How to Index Data in Elasticsearch


Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It enables near real-time indexing and searching of large volumes of data, making it a cornerstone of modern search applications, log analysis, and observability platforms. At the heart of Elasticsearch's functionality is the process of indexing: the act of storing and structuring data so it can be efficiently queried and retrieved. Whether you're ingesting logs from servers, product catalogs from an e-commerce platform, or user activity streams from a mobile app, mastering how to index data in Elasticsearch is essential for building scalable, high-performance applications.

Indexing is more than simply dumping data into a database. It involves defining mappings, managing document structure, handling data types, configuring replication and sharding, and optimizing for search speed and storage efficiency. Poorly indexed data leads to slow queries, inconsistent results, and operational overhead. Conversely, well-structured indexing ensures fast retrieval, accurate aggregations, and seamless scalability.

This comprehensive guide walks you through every critical aspect of indexing data in Elasticsearch, from basic operations to advanced configurations. You'll learn how to prepare your data, define mappings, use bulk indexing for efficiency, monitor performance, and avoid common pitfalls. By the end, you'll have the knowledge to confidently index any type of data in Elasticsearch, whether you're working with JSON documents, log files, or structured relational data.

Step-by-Step Guide

Prerequisites

Before you begin indexing data, ensure you have the following in place:

  • Elasticsearch installed and running: Run Elasticsearch locally using Docker, download the binary from elastic.co, or use a managed service like Elastic Cloud.
  • A working HTTP client: Use curl, Postman, or any programming language with an HTTP library (e.g., Python's requests, JavaScript's axios).
  • Basic understanding of JSON: Elasticsearch stores data as JSON documents.
  • Access to Kibana (optional but recommended): Kibana provides a visual interface to manage indices, view documents, and monitor cluster health.

Verify your Elasticsearch instance is running by sending a GET request to the root endpoint:

curl -X GET "localhost:9200"

You should receive a JSON response containing version, cluster name, and node information.

Step 1: Understand the Index Concept

In Elasticsearch, an index is a collection of documents that share similar characteristics. Think of it like a database table in a relational system, but with key differences. Unlike SQL tables, Elasticsearch indices are schema-flexible by default, meaning you can index documents with varying fields without predefining a strict structure.

However, this flexibility comes with trade-offs. Without explicit mapping, Elasticsearch auto-detects field types, which can lead to suboptimal performance or incorrect data interpretation (e.g., treating a numeric ID as a string).

To create an index manually, use the PUT method:

curl -X PUT "localhost:9200/my_first_index"

Upon success, you'll receive:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my_first_index"
}

By default, Elasticsearch creates 1 primary shard and 1 replica shard. You can customize this during index creation (covered later).

Step 2: Define a Mapping (Schema)

While Elasticsearch can auto-detect field types, defining a mapping explicitly gives you control over how data is stored and indexed. A mapping defines the fields in your documents, their data types, and how they should be analyzed (for text fields).

For example, suppose you're indexing product data. You want the price field to be a float, name to be analyzed for full-text search, and category to be a keyword for exact matches and aggregations.

Create the index with a mapping:

curl -X PUT "localhost:9200/products" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}'

Key mapping types:

  • text: Used for full-text search. Analyzed (broken into tokens).
  • keyword: Used for exact matches, sorting, and aggregations. Not analyzed.
  • integer, long, float, double: Numeric types for precise calculations.
  • boolean: True/false values.
  • date: Date and time values with configurable formats.
  • nested: For arrays of objects where each object should be indexed independently.
  • object: For embedded JSON objects (flattened by default).

Always define mappings for production data. Auto-detection is useful for prototyping, but not for reliable systems.

Step 3: Index a Single Document

Once the index is created with a mapping, you can add documents using the PUT or POST method.

Use PUT to specify the document ID:

curl -X PUT "localhost:9200/products/_doc/1" -H 'Content-Type: application/json' -d'
{
  "name": "Wireless Headphones",
  "price": 129.99,
  "category": "Electronics",
  "in_stock": true,
  "created_at": "2024-06-15 10:30:00"
}'

Use POST to let Elasticsearch generate a unique ID:

curl -X POST "localhost:9200/products/_doc" -H 'Content-Type: application/json' -d'
{
  "name": "Smart Watch",
  "price": 249.99,
  "category": "Electronics",
  "in_stock": false,
  "created_at": "2024-06-14 14:22:15"
}'

Response for both:

{
  "_index": "products",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Notice the _type field is now always _doc in Elasticsearch 7.x and later. The concept of multiple types per index has been deprecated.

Step 4: Index Multiple Documents with Bulk API

Indexing documents one by one is inefficient for large datasets. The Bulk API allows you to index, update, or delete multiple documents in a single request, drastically improving performance.

The bulk request format requires each action (index, create, update, delete) to be followed by its corresponding document on the next line. The structure is:

{ "index" : { "_index" : "index_name", "_id" : "document_id" } }
{ "field1" : "value1", "field2" : "value2" }
{ "create" : { "_index" : "index_name", "_id" : "document_id" } }
{ "field1" : "value1", "field2" : "value2" }

Example: Index three products in one bulk request:

curl -X POST "localhost:9200/products/_bulk" -H 'Content-Type: application/json' -d'
{ "index" : { "_id" : "2" } }
{ "name": "Bluetooth Speaker", "price": 89.99, "category": "Electronics", "in_stock": true, "created_at": "2024-06-15 09:15:00" }
{ "index" : { "_id" : "3" } }
{ "name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": true, "created_at": "2024-06-14 16:45:00" }
{ "index" : { "_id" : "4" } }
{ "name": "Coffee Mug", "price": 12.5, "category": "Home", "in_stock": false, "created_at": "2024-06-13 11:20:00" }
'

Response:

{
  "took": 123,
  "errors": false,
  "items": [
    {
      "index": {
        "_index": "products",
        "_type": "_doc",
        "_id": "2",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 1, "failed": 0 },
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "products",
        "_type": "_doc",
        "_id": "3",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 1, "failed": 0 },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "products",
        "_type": "_doc",
        "_id": "4",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 1, "failed": 0 },
        "_seq_no": 3,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
}

Always use the Bulk API for batch operations. It reduces network overhead and leverages Elasticsearch's internal optimizations for concurrent writes.
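The bulk body is newline-delimited JSON (NDJSON): no blank lines between entries, and a trailing newline at the end. As a minimal sketch of how a client might assemble such a payload (the build_bulk_body helper and the sample documents are illustrative, not part of any Elasticsearch client API):

```python
import json

def build_bulk_body(docs, index="products"):
    """Build an NDJSON bulk body: an action line per document,
    each followed by the document source on the next line."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("2", {"name": "Bluetooth Speaker", "price": 89.99}),
    ("3", {"name": "Laptop", "price": 999.99}),
])
# POST `body` to /products/_bulk with Content-Type: application/x-ndjson
```

Official clients wrap this formatting for you (e.g., the helpers.bulk function shown later in this guide), but understanding the wire format helps when debugging bulk errors.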

Step 5: Handle Data Types and Dynamic Mapping

Elasticsearch dynamically adds fields to your index when it encounters new ones in documents. While convenient, this can cause issues:

  • Field type conflicts (e.g., one document has price as string, another as number)
  • Unintended text analysis on numeric fields
  • Index bloating with unused fields

To prevent this, set dynamic mapping policies:

curl -X PUT "localhost:9200/secure_products" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "text" },
      "price": { "type": "float" }
    }
  }
}'

With "dynamic": "strict", Elasticsearch will reject any document containing fields not defined in the mapping. Use "dynamic": "false" to ignore unknown fields silently, or leave it as "dynamic": "true" (default) for flexibility.

For more control, use dynamic templates to define how unknown fields should be mapped based on naming patterns or data types:

curl -X PUT "localhost:9200/logs" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      },
      {
        "numbers_as_long": {
          "match_mapping_type": "long",
          "mapping": { "type": "long" }
        }
      }
    ],
    "properties": {
      "message": { "type": "text" }
    }
  }
}'

This ensures all string fields not explicitly defined become keyword, preventing unwanted text analysis.

Step 6: Use Index Templates for Automation

When managing multiple indices (e.g., daily log indices like logs-2024-06-15), manually creating each one is impractical. Use index templates to automatically apply mappings, settings, and aliases to matching indices.

Create a template for daily log indices:

curl -X PUT "localhost:9200/_index_template/logs_template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "message": { "type": "text" },
        "host": { "type": "keyword" }
      }
    }
  },
  "priority": 500,
  "composed_of": []
}'

Now, when you create logs-2024-06-15, the template automatically applies:

curl -X PUT "localhost:9200/logs-2024-06-15"

Index templates are essential for time-series data, monitoring systems, and any environment where indices are created dynamically.
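Since the template applies to any index matching logs-*, ingestion code only has to derive the target index name from the event date. A small sketch (the daily_index_name helper is illustrative, not a library function):

```python
from datetime import date

def daily_index_name(day: date, prefix: str = "logs") -> str:
    """Build the daily index name that the logs-* template will match."""
    return f"{prefix}-{day:%Y-%m-%d}"

print(daily_index_name(date(2024, 6, 15)))  # logs-2024-06-15
```

Writing a document to that name is enough: Elasticsearch creates the index on first write and the template's settings and mappings apply automatically.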

Step 7: Monitor Indexing Performance

Indexing performance can be monitored using Elasticsearch's built-in APIs:

Cluster Health

curl -X GET "localhost:9200/_cluster/health?pretty"

Look for status (green = healthy, yellow = some replicas unassigned, red = primary shard unavailable).

Index Stats

curl -X GET "localhost:9200/products/_stats"

Check the indexing section for index_total, index_time_in_millis, and index_current to track throughput and latency.

Task Management

curl -X GET "localhost:9200/_tasks?detailed=true&actions=*bulk"

Monitor ongoing bulk indexing tasks to detect bottlenecks or stuck operations.

Step 8: Refresh and Flush Operations

By default, Elasticsearch refreshes indices every second, making new documents searchable. For bulk ingestion, you may want to disable automatic refresh to improve write speed:

curl -X PUT "localhost:9200/products/_settings" -H 'Content-Type: application/json' -d'
{
  "index.refresh_interval": "-1"
}'

After bulk indexing, manually trigger a refresh:

curl -X POST "localhost:9200/products/_refresh"

Use flush to force writing data to disk and clear the transaction log:

curl -X POST "localhost:9200/products/_flush"

Flushing is expensive and rarely needed unless you're managing disk space or recovering from crashes.

Best Practices

1. Always Define Mappings Explicitly

Never rely on dynamic mapping in production. Define field types, analyzers, and formats upfront. This prevents data type conflicts, ensures consistent search behavior, and improves query performance.

2. Use Keyword for Exact Matches, Text for Full-Text Search

Use keyword for IDs, status codes, categories, tags, and fields used in aggregations or sorting. Use text only for fields requiring full-text search (e.g., product descriptions, comments).

Consider using multi-fields to index the same field in multiple ways:

"name": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}

This allows you to search name for relevance and sort or aggregate on name.keyword.

3. Optimize Bulk Request Size

While bulk requests improve performance, overly large requests can overwhelm nodes. Aim for 5–15 MB per request. Test with your data: start with 1,000–5,000 documents per request and adjust based on response time and memory usage.
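A simple way to respect a document-count budget is to chunk the document stream before building each bulk request. A sketch (the chunked helper is illustrative, and the batch size is a starting point to tune, not a recommendation):

```python
def chunked(iterable, size=1000):
    """Yield lists of at most `size` items, so each list can become
    one bulk request instead of one request per document."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

batches = list(chunked(range(2500), size=1000))
# 3 batches: 1000, 1000, and 500 documents
```

Measuring response time per batch and backing off on 429 (too many requests) responses is the usual next refinement.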

4. Use Index Aliases for Zero-Downtime Operations

Aliases let you point to one or more indices under a single name. Use them for reindexing, rolling updates, or A/B testing:

curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "products_v1", "alias": "products" } }
  ]
}'

When you reindex to products_v2, update the alias without changing application code:

curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add": { "index": "products_v2", "alias": "products" } }
  ]
}'

5. Avoid Large Documents and Deep Nested Objects

Large documents (>10 MB) strain memory and slow down indexing. Split data across multiple documents if possible. Similarly, deeply nested objects can degrade performance. Use nested type only when you need to query inner objects independently. Otherwise, flatten the structure.

6. Use Proper Sharding Strategy

Shards are the building blocks of scalability. Too few shards limit parallelism; too many increase overhead. As a rule of thumb:

  • Keep shard size between 10 and 50 GB.
  • Don't exceed 1,000–2,000 shards per node.
  • Set primary shards at index creation; they cannot be changed later.

For time-series data, use daily or weekly indices with 1–3 primary shards each.
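The rule of thumb above can be turned into rough arithmetic: divide the expected index size by a target shard size and round up. A sketch, assuming a 30 GB target in the middle of the 10–50 GB range (the helper name and target are assumptions for illustration):

```python
import math

def suggested_primary_shards(expected_size_gb: float, target_shard_gb: float = 30) -> int:
    """Rough primary-shard count: expected index size divided by a
    target shard size, rounded up, with at least one shard."""
    return max(1, math.ceil(expected_size_gb / target_shard_gb))

print(suggested_primary_shards(100))  # 4 shards of roughly 25 GB each
```

Treat the result as a first guess to validate against real ingest and query load, since replica count, node count, and growth rate all factor into the final choice.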

7. Enable Compression and Optimize Storage

Enable index compression to reduce disk usage:

"index.codec": "best_compression"

Use doc_values (enabled by default for most types) for sorting and aggregations. Avoid fielddata on text fields; it loads data into heap memory and can cause out-of-memory errors.

8. Monitor and Tune Refresh Interval

For ingestion-heavy workloads, increase refresh_interval to 30s or 60s. For search-heavy workloads, keep it at 1s. You can toggle it dynamically without reindexing.

9. Use Index Lifecycle Management (ILM)

For time-series data (logs, metrics), use ILM to automate index rollover, deletion, and tiered storage:

  • Hot phase: High-performance storage for active writes.
  • Warm phase: Lower-cost storage for infrequent queries.
  • Cold phase: Archived data with minimal access.
  • Delete phase: Remove old indices.

ILM reduces operational overhead and ensures cost-efficient data retention.

10. Secure Your Data

Enable Elasticsearch's built-in security features (X-Pack) to restrict indexing access:

  • Use role-based access control (RBAC).
  • Apply index-level permissions.
  • Encrypt data in transit with TLS.

Never expose Elasticsearch directly to the public internet.

Tools and Resources

Official Elasticsearch Clients

Elasticsearch provides official clients for major programming languages, including Java, JavaScript/Node.js, Python, Go, .NET, PHP, and Ruby.

These clients handle serialization, connection pooling, retries, and error handling automatically.

Logstash and Filebeat for Data Ingestion

For complex data pipelines, use Elastic's ingestion tools:

  • Filebeat: Lightweight shipper for log files. Reads logs and sends them to Elasticsearch or Logstash.
  • Logstash: Data processing pipeline. Filters, enriches, and transforms data before indexing (e.g., parsing JSON, geolocating IPs).

Example Filebeat config to send Nginx logs:

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log

output.elasticsearch:
  hosts: ["localhost:9200"]

Kibana for Visualization and Debugging

Kibana provides:

  • Index pattern creation and management
  • Document browser to inspect indexed data
  • Dev Tools console for direct API calls
  • Monitoring dashboards for cluster health and indexing metrics

Use the Dev Tools console to test mappings, bulk requests, and queries without writing external scripts.

Third-Party Tools

  • Postman: For manual API testing and debugging.
  • curl: Essential for quick commands and automation scripts.
  • Elasticsearch Head (deprecated, use Kibana instead).
  • OpenSearch Dashboards: Open-source alternative to Kibana if using OpenSearch.

Real Examples

Example 1: Indexing E-Commerce Product Catalog

Scenario: You have a CSV of 10,000 products and need to index them into Elasticsearch for search and filtering.

Step 1: Define mapping with multi-fields:

curl -X PUT "localhost:9200/products" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "description": { "type": "text", "analyzer": "english" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
    }
  }
}'

Step 2: Convert CSV to JSON array and use Bulk API.

Sample JSON line:

{
  "product_id": "P1001",
  "name": "Wireless Bluetooth Headphones",
  "description": "Premium noise-canceling headphones with 30-hour battery life.",
  "price": 149.99,
  "category": "Electronics",
  "tags": ["audio", "wireless", "noise-canceling"],
  "in_stock": true,
  "created_at": "2024-06-15 10:00:00"
}

Step 3: Use Python script to read CSV and send bulk requests:

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def read_products(filename):
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield {
                "_index": "products",
                "_id": row["product_id"],
                "_source": {
                    "product_id": row["product_id"],
                    "name": row["name"],
                    "description": row["description"],
                    "price": float(row["price"]),
                    "category": row["category"],
                    "tags": row["tags"].split(","),
                    "in_stock": row["in_stock"] == "true",
                    "created_at": row["created_at"]
                }
            }

helpers.bulk(es, read_products("products.csv"))

Result: 10,000 products indexed in under 30 seconds with full-text search on name/description and fast filtering by category, price, and tags.

Example 2: Indexing Server Logs with Time-Based Indices

Scenario: Ingest application logs from 50 servers into Elasticsearch for real-time monitoring.

Step 1: Create index template for daily logs:

curl -X PUT "localhost:9200/_index_template/app_logs_template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "host": { "type": "keyword" },
        "message": { "type": "text" },
        "duration_ms": { "type": "integer" },
        "error_code": { "type": "keyword" }
      }
    }
  },
  "priority": 500
}'

Step 2: Use Filebeat to ship logs from each server:

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["es-cluster:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

Step 3: In Kibana, create an index pattern app-logs-* and build a dashboard showing error rates, response times, and top services.

Result: Real-time visibility into application health with automatic daily index rollover and 30-day retention via ILM.

Example 3: Reindexing with Mapping Changes

Scenario: You've indexed 5 million user profiles with email as text, but now need to sort and aggregate by email, which requires keyword.

Step 1: Create new index with correct mapping:

curl -X PUT "localhost:9200/users_v2" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "email": { "type": "keyword" },
      "name": { "type": "text" },
      "created_at": { "type": "date" }
    }
  }
}'

Step 2: Use Reindex API to copy data:

curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "users" },
  "dest": { "index": "users_v2" }
}'

Step 3: Update the alias to point to the new index. (This assumes the application already reads through a users alias; an alias cannot share a name with an existing concrete index, so if users is a real index it must be deleted before the alias can take its name.)

curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "users_v2", "alias": "users" } },
    { "remove": { "index": "users", "alias": "users" } }
  ]
}'

Step 4: Delete old index:

curl -X DELETE "localhost:9200/users"

Result: Zero-downtime migration with improved query performance for email-based operations.

FAQs

Can I change the mapping of an existing index?

No, you cannot modify field mappings after an index is created. The only way to change a mapping is to reindex the data into a new index with the updated schema.

What happens if I index a document with a field that doesn't match the mapping?

If dynamic mapping is enabled (dynamic: true), Elasticsearch creates the field automatically using its best guess (e.g., a JSON string becomes a text field). If dynamic: strict, the request is rejected. If dynamic: false, the field is not indexed, though it is kept in _source.

How do I know if my index is performing well?

Monitor the _stats API for indexing rate, latency, and shard size. Use Kibana's Monitoring section to track JVM memory, thread pools, and refresh times. If indexing is slow, check for too many shards, large documents, or insufficient hardware.

Is it better to have one large index or many small ones?

For search performance, smaller indices (10–50 GB) are better. For scalability and maintenance, time-based indices (daily, weekly) are preferred. Avoid indices larger than 100 GB; they become hard to manage and recover.

How do I delete documents from an index?

Use the _delete_by_query API:

curl -X POST "localhost:9200/products/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "in_stock": false }
  }
}'

Deletion is soft by default: documents are marked for removal and cleaned up during segment merges.

Can I index data from a SQL database?

Yes. Use Logstash with the JDBC input plugin, or write a script using your language's SQL driver and Elasticsearch client to extract, transform, and load (ETL) data.

What's the difference between indexing and searching in Elasticsearch?

Indexing is the process of storing and structuring data so it can be retrieved. Searching is the process of querying that data to find matching documents. Indexing happens once (or periodically); searching happens repeatedly and must be fast.

How does Elasticsearch handle duplicates?

Elasticsearch uses document IDs to determine uniqueness. If you index two documents with the same ID, the newer one overwrites the older one (the version increments). To prevent duplicates, use create instead of index in bulk requests; it fails if the ID already exists.
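The difference between the two is just the action key on the bulk action line. A quick sketch (the action_line helper is illustrative, not a client API):

```python
import json

def action_line(op, index, doc_id):
    """Build a bulk action line: 'index' overwrites an existing ID,
    while 'create' fails with a 409 if the ID already exists."""
    return json.dumps({op: {"_index": index, "_id": doc_id}})

print(action_line("index", "products", "1"))
print(action_line("create", "products", "1"))
```

With create, a duplicate ID produces a per-item error in the bulk response rather than a silent overwrite, which is what you want for append-only data.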

Can I index binary data like images or PDFs?

Yes, using the ingest attachment processor plugin in Elasticsearch (built on Apache Tika). The binary data is sent base64-encoded, and its text content is extracted so it can be searched.

Conclusion

Indexing data in Elasticsearch is a foundational skill for anyone building search-driven applications, analytics platforms, or monitoring systems. From defining precise mappings to leveraging bulk APIs and index templates, the techniques covered in this guide empower you to index data efficiently, reliably, and at scale.

Remember: indexing is not a one-time task; it's an ongoing process that requires careful planning, monitoring, and optimization. Start with clear requirements, define your mappings explicitly, and use aliases for flexibility as your data and schema evolve.