Introduction
Elasticsearch analyzers are the processing pipelines that transform text into searchable tokens. When an analyzer doesn't tokenize correctly, search functionality breaks - queries don't match expected documents, relevance scoring suffers, or searches return zero results. Analyzer issues commonly stem from incorrect custom analyzer definitions, char filter ordering problems, tokenizer configuration mismatches, or token filter chain errors. Understanding how to debug analyzers using the _analyze API, verify token output, and properly configure analyzer chains is essential for implementing effective search functionality in Elasticsearch.
Symptoms
When Elasticsearch analyzer tokenization is incorrect, you will observe:
- Search queries don't match documents that should be found
- Unexpected tokens appearing in search results
- Text not being split at expected boundaries
- Special characters or HTML tags not being removed
- Case normalization not applied as expected
- Stemming or synonym expansion not working
- Zero results for queries that should return documents
Testing analyzer with _analyze API shows unexpected output:
``bash
curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"analyzer": "my_custom_analyzer",
"text": "Hello World"
}
'
Expected output:
``json
{"tokens": [{"token": "hello", "position": 0}, {"token": "world", "position": 1}]}
Actual output may show:
``json
{"tokens": [{"token": "Hello World", "position": 0}]}
Or no tokens at all, indicating analyzer configuration problem.
Common Causes
- 1.Analyzer not applied to field - Mapping uses different analyzer or default
- 2.Custom analyzer not defined - Index settings missing analyzer definition
- 3.Tokenizer type mismatch - Using pattern tokenizer with wrong regex
- 4.Char filter order wrong - HTML strip not applied before other filters
- 5.Token filter chain incorrect - Filters applied in wrong sequence
- 6.Analyzer name typo - Mapping references non-existent analyzer
- 7.Index closed - Cannot analyze with closed index's analyzers
- 8.Analyzer definition in wrong location - Settings vs. mapping confusion
- 9.Keyword field analyzed - Keyword fields don't use analyzers
- 10.Nested object analyzer scope - Analyzer not inherited by nested fields
Step-by-Step Fix
Step 1: Test Analyzer Output
Use the _analyze API to debug tokenization:
```bash # Test with built-in analyzer curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "standard", "text": "The Quick Brown Fox" } '
# Test custom analyzer curl -X POST "localhost:9200/my-index/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "my_custom_analyzer", "text": "Hello World" } '
# Test with inline analyzer definition curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase", "stop"], "text": "The Quick Brown Fox" } ' ```
Compare expected vs. actual tokens: ```json // Expected tokens for "The Quick Brown Fox" with standard + lowercase + stop: { "tokens": [ {"token": "quick", "position": 1}, {"token": "brown", "position": 2}, {"token": "fox", "position": 3} ] }
// If "the" appears, stop filter not working // If "Quick" (capitalized), lowercase filter not working ```
Step 2: Check Analyzer Definition
Verify the analyzer is properly defined in index settings:
```bash # Get index settings curl -X GET "localhost:9200/my-index/_settings?pretty&include_defaults=true"
# Check specifically for analysis section curl -X GET "localhost:9200/my-index/_settings?pretty" | jq '.[]["settings"]["index"]["analysis"]' ```
Expected settings structure:
``json
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": ["html_strip"],
"filter": ["lowercase", "stop", "snowball"]
}
},
"tokenizer": {
"my_pattern_tokenizer": {
"type": "pattern",
"pattern": "[\\W_]+"
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": ["quick,fast", "brown,brunette"]
}
},
"char_filter": {
"my_html_strip": {
"type": "html_strip",
"escaped_tags": ["b", "i"]
}
}
}
}
}
Step 3: Check Field Mapping
Verify the analyzer is applied to the correct field:
```bash # Get mapping for specific field curl -X GET "localhost:9200/my-index/_mapping?pretty" | jq '.[]["mappings"]["properties"]["content"]'
# Check analyzer on text field curl -X GET "localhost:9200/my-index/_mapping/field/content?pretty" ```
Correct mapping should specify analyzer:
``json
{
"properties": {
"content": {
"type": "text",
"analyzer": "my_custom_analyzer",
"search_analyzer": "my_custom_analyzer"
}
}
}
Common mapping errors: ```json // WRONG: Analyzer on keyword field (keyword fields don't use analyzers) { "content": { "type": "keyword", "analyzer": "my_custom_analyzer" // This has no effect! } }
// WRONG: Missing analyzer specification { "content": { "type": "text" // Uses default "standard" analyzer } } ```
Step 4: Debug Tokenizer Issues
Test tokenizer separately:
```bash # Test tokenizer alone curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "text": "Hello, World! How are you?" } '
# Test pattern tokenizer curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": { "type": "pattern", "pattern": "[,\\s]+" }, "text": "Hello, World, How are you" } ' ```
Common tokenizer problems:
```bash # Pattern tokenizer with greedy regex (wrong) curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": {"type": "pattern", "pattern": ".*"}, "text": "Hello World" } ' // Output: single token "Hello World" (.* matches everything)
# Fixed pattern tokenizer curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": {"type": "pattern", "pattern": "\\s+"}, "text": "Hello World" } ' // Output: two tokens "Hello", "World" ```
Step 5: Debug Char Filter Issues
Test char filters for preprocessing issues:
```bash # Test HTML strip char filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "char_filter": ["html_strip"], "text": "<p>Hello <b>World</b></p>" } '
# Test mapping char filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "char_filter": [{ "type": "mapping", "mappings": ["& => and", "| => or"] }], "text": "Tom & Jerry | Bugs Bunny" } ' ```
Common char filter problems:
```json // WRONG: HTML strip after other filters (order matters!) { "char_filter": ["my_mapping", "html_strip"], // Wrong order "text": "<p>Tom & Jerry</p>" } // Result: "&" may be inside HTML tags, stripped before mapping applied
// CORRECT: HTML strip first, then mapping { "char_filter": ["html_strip", "my_mapping"], "text": "<p>Tom & Jerry</p>" } ```
Step 6: Debug Token Filter Issues
Test token filters individually and in sequence:
```bash # Test lowercase filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase"], "text": "HELLO WORLD" } '
# Test stop filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase", "stop"], "text": "The quick brown fox jumps over the lazy dog" } '
# Test synonym filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": [{ "type": "synonym", "synonyms": ["quick,fast,speedy", "jump,leap,hop"] }], "text": "The quick fox jumped" } ' ```
Token filter order matters: ```bash # WRONG: Stop filter before lowercase curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["stop", "lowercase"], "text": "THE QUICK" } ' // Stop filter may not remove "THE" (uppercase)
# CORRECT: Lowercase first, then stop curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase", "stop"], "text": "THE QUICK" } ' // "the" is removed, "quick" remains ```
Step 7: Fix Analyzer Configuration
Update the analyzer configuration correctly:
```bash # Close index to update settings curl -X POST "localhost:9200/my-index/_close"
# Update analyzer settings curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d' { "analysis": { "analyzer": { "my_custom_analyzer": { "type": "custom", "tokenizer": "standard", "char_filter": ["html_strip"], "filter": ["lowercase", "stop", "my_synonym_filter"] } }, "filter": { "my_synonym_filter": { "type": "synonym", "synonyms_path": "analysis/synonyms.txt" } } } } '
# Open index curl -X POST "localhost:9200/my-index/_open" ```
For new index, include settings in creation:
``bash
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "stop", "stemmer"]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
'
Step 8: Update Mapping with Correct Analyzer
Apply the analyzer to the correct field:
```bash # Update mapping (only works for new fields or explicit analyzer change) curl -X PUT "localhost:9200/my-index/_mapping" -H 'Content-Type: application/json' -d' { "properties": { "content": { "type": "text", "analyzer": "my_custom_analyzer", "search_analyzer": "my_custom_analyzer" } } } '
# For existing fields with wrong analyzer, reindex to new index: curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": {"index": "my-index"}, "dest": {"index": "my-index-fixed"} } ' ```
Step 9: Verify Analyzer Works on Documents
Test that documents are properly tokenized:
```bash # Test how document was analyzed curl -X POST "localhost:9200/my-index/_explain/doc-id?pretty" -H 'Content-Type: application/json' -d' { "query": { "match": { "content": "hello world" } } } '
# View field terms for a document curl -X GET "localhost:9200/my-index/_termvectors/doc-id?fields=content&pretty" ```
Expected term vector output:
``json
{
"term_vectors": {
"content": {
"terms": {
"hello": {"term_freq": 1, "tokens": [{"position": 0}]},
"world": {"term_freq": 1, "tokens": [{"position": 1}]}
}
}
}
}
Verification
After fixing the analyzer, verify it works correctly:
```bash # Test analyzer output curl -X POST "localhost:9200/my-index/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "my_custom_analyzer", "text": "Hello World" } '
# Expected: ["hello", "world"] ```
Test search functionality: ```bash # Index test document curl -X POST "localhost:9200/my-index/_doc" -H 'Content-Type: application/json' -d' { "content": "Hello World Test" } '
# Search for test term curl -X POST "localhost:9200/my-index/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match": { "content": "hello" } } } '
# Expected: Document found ```
Run comprehensive analyzer test:
``bash
# Test all expected behaviors
for text in "HELLO" "Hello World" "<p>Hello</p>" "quick fox"; do
echo "Testing: $text"
curl -s -X POST "localhost:9200/my-index/_analyze" \
-H 'Content-Type: application/json' \
-d "{\"analyzer\": \"my_custom_analyzer\", \"text\": \"$text\"}" | jq '.tokens[].token'
done
Prevention
To prevent analyzer issues:
- 1.Test analyzers during development:
- 2.```bash
- 3.# Create test script
- 4.cat > test_analyzer.sh << 'EOF'
- 5.curl -s -X POST "$ES/_analyze" -H 'Content-Type: application/json' \
- 6.-d '{"analyzer": "my_analyzer", "text": "'$1'"}' | jq '.tokens'
- 7.EOF
chmod +x test_analyzer.sh ./test_analyzer.sh "Test Input" ```
- 1.Document analyzer purpose:
- 2.```json
- 3.{
- 4."analyzer": {
- 5."product_search_analyzer": {
- 6."_comment": "For product names: lowercase, remove HTML, handle synonyms",
- 7."type": "custom",
- 8....
- 9.}
- 10.}
- 11.}
- 12.
` - 13.Use analyzer validation in CI/CD:
- 14.```python
- 15.import requests
def validate_analyzer(es_url, analyzer_name, test_cases): for text, expected_tokens in test_cases: response = requests.post( f"{es_url}/_analyze", json={"analyzer": analyzer_name, "text": text} ) actual_tokens = [t['token'] for t in response.json()['tokens']] assert actual_tokens == expected_tokens, \ f"Analyzer failed for '{text}': expected {expected_tokens}, got {actual_tokens}" ```
- 1.Separate index and search analyzers:
- 2.```json
- 3.{
- 4."properties": {
- 5."content": {
- 6."type": "text",
- 7."analyzer": "my_analyzer",
- 8."search_analyzer": "my_search_analyzer" // Different for search
- 9.}
- 10.}
- 11.}
- 12.
` - 13.Use synonym files for maintainability:
- 14.```bash
- 15.# Store synonyms in external file
- 16.echo "quick,fast,speedy
- 17.jump,leap,hop
- 18.brown,brunette" > config/synonyms.txt
# Reference in analyzer { "filter": { "synonyms": { "type": "synonym", "synonyms_path": "synonyms.txt" } } } ```
- 1.Monitor index refresh after changes:
- 2.```bash
- 3.# Force refresh after analyzer changes
- 4.curl -X POST "localhost:9200/my-index/_refresh"
- 5.
` - 6.Reindex after major analyzer changes:
- 7.```bash
- 8.# When analyzer changes affect existing documents
- 9.curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
- 10.{
- 11."source": {"index": "old-index"},
- 12."dest": {"index": "new-index"}
- 13.}
- 14.'
- 15.
`
Related Articles
- [Technical troubleshooting: Fix bulk request rejected 429 too many requests Is](bulk-request-rejected-429-too-many-requests)
- [Technical troubleshooting: Fix circuit breaker tripped heap memory Issue in E](circuit-breaker-tripped-heap-memory)
- [Technical troubleshooting: Fix cluster red missing replica shards Issue in El](cluster-red-missing-replica-shards)
- [Fix cross cluster search remote unreachable Issue in Elasticsearch-Errors](cross-cluster-search-remote-unreachable)
- [Fix Elasticsearch Bulk Request Rejected 429 Too Many Requests Issue in Elasticsearch](elasticsearch-bulk-request-rejected-429-too-many-requests)
<script type="application/ld+json"> { "@context": "https://schema.org", "@type": "TechArticle", "headline": "Elasticsearch Analyzer Not Tokenizing", "description": "Fix Elasticsearch analyzer tokenization issues. Debug custom analyzers, char filters, and token filters with _analyze API.", "url": "https://www.fixwikihub.com/fix-elasticsearch-analyzer-not-tokenizing", "publisher": { "@type": "Organization", "name": "FixWikiHub", "url": "https://www.fixwikihub.com" }, "author": { "@type": "Person", "name": "FixWikiHub Editorial Team" }, "datePublished": "2025-12-12T20:46:47.626Z", "dateModified": "2025-12-12T20:46:47.626Z" } </script>