Introduction

Elasticsearch analyzers are the processing pipelines that transform text into searchable tokens. When an analyzer doesn't tokenize correctly, search functionality breaks - queries don't match expected documents, relevance scoring suffers, or searches return zero results. Analyzer issues commonly stem from incorrect custom analyzer definitions, char filter ordering problems, tokenizer configuration mismatches, or token filter chain errors. Understanding how to debug analyzers using the _analyze API, verify token output, and properly configure analyzer chains is essential for implementing effective search functionality in Elasticsearch.

Symptoms

When Elasticsearch analyzer tokenization is incorrect, you will observe:

  • Search queries don't match documents that should be found
  • Unexpected tokens appearing in search results
  • Text not being split at expected boundaries
  • Special characters or HTML tags not being removed
  • Case normalization not applied as expected
  • Stemming or synonym expansion not working
  • Zero results for queries that should return documents

Testing analyzer with _analyze API shows unexpected output: ``bash curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "my_custom_analyzer", "text": "Hello World" } '

Expected output: ``json {"tokens": [{"token": "hello", "position": 0}, {"token": "world", "position": 1}]}

Actual output may show: ``json {"tokens": [{"token": "Hello World", "position": 0}]}

Or no tokens at all, indicating analyzer configuration problem.

Common Causes

  1. 1.Analyzer not applied to field - Mapping uses different analyzer or default
  2. 2.Custom analyzer not defined - Index settings missing analyzer definition
  3. 3.Tokenizer type mismatch - Using pattern tokenizer with wrong regex
  4. 4.Char filter order wrong - HTML strip not applied before other filters
  5. 5.Token filter chain incorrect - Filters applied in wrong sequence
  6. 6.Analyzer name typo - Mapping references non-existent analyzer
  7. 7.Index closed - Cannot analyze with closed index's analyzers
  8. 8.Analyzer definition in wrong location - Settings vs. mapping confusion
  9. 9.Keyword field analyzed - Keyword fields don't use analyzers
  10. 10.Nested object analyzer scope - Analyzer not inherited by nested fields

Step-by-Step Fix

Step 1: Test Analyzer Output

Use the _analyze API to debug tokenization:

```bash # Test with built-in analyzer curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "standard", "text": "The Quick Brown Fox" } '

# Test custom analyzer curl -X POST "localhost:9200/my-index/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "my_custom_analyzer", "text": "Hello World" } '

# Test with inline analyzer definition curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase", "stop"], "text": "The Quick Brown Fox" } ' ```

Compare expected vs. actual tokens: ```json // Expected tokens for "The Quick Brown Fox" with standard + lowercase + stop: { "tokens": [ {"token": "quick", "position": 1}, {"token": "brown", "position": 2}, {"token": "fox", "position": 3} ] }

// If "the" appears, stop filter not working // If "Quick" (capitalized), lowercase filter not working ```

Step 2: Check Analyzer Definition

Verify the analyzer is properly defined in index settings:

```bash # Get index settings curl -X GET "localhost:9200/my-index/_settings?pretty&include_defaults=true"

# Check specifically for analysis section curl -X GET "localhost:9200/my-index/_settings?pretty" | jq '.[]["settings"]["index"]["analysis"]' ```

Expected settings structure: ``json { "settings": { "analysis": { "analyzer": { "my_custom_analyzer": { "type": "custom", "tokenizer": "standard", "char_filter": ["html_strip"], "filter": ["lowercase", "stop", "snowball"] } }, "tokenizer": { "my_pattern_tokenizer": { "type": "pattern", "pattern": "[\\W_]+" } }, "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": ["quick,fast", "brown,brunette"] } }, "char_filter": { "my_html_strip": { "type": "html_strip", "escaped_tags": ["b", "i"] } } } } }

Step 3: Check Field Mapping

Verify the analyzer is applied to the correct field:

```bash # Get mapping for specific field curl -X GET "localhost:9200/my-index/_mapping?pretty" | jq '.[]["mappings"]["properties"]["content"]'

# Check analyzer on text field curl -X GET "localhost:9200/my-index/_mapping/field/content?pretty" ```

Correct mapping should specify analyzer: ``json { "properties": { "content": { "type": "text", "analyzer": "my_custom_analyzer", "search_analyzer": "my_custom_analyzer" } } }

Common mapping errors: ```json // WRONG: Analyzer on keyword field (keyword fields don't use analyzers) { "content": { "type": "keyword", "analyzer": "my_custom_analyzer" // This has no effect! } }

// WRONG: Missing analyzer specification { "content": { "type": "text" // Uses default "standard" analyzer } } ```

Step 4: Debug Tokenizer Issues

Test tokenizer separately:

```bash # Test tokenizer alone curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "text": "Hello, World! How are you?" } '

# Test pattern tokenizer curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": { "type": "pattern", "pattern": "[,\\s]+" }, "text": "Hello, World, How are you" } ' ```

Common tokenizer problems:

```bash # Pattern tokenizer with greedy regex (wrong) curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": {"type": "pattern", "pattern": ".*"}, "text": "Hello World" } ' // Output: single token "Hello World" (.* matches everything)

# Fixed pattern tokenizer curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": {"type": "pattern", "pattern": "\\s+"}, "text": "Hello World" } ' // Output: two tokens "Hello", "World" ```

Step 5: Debug Char Filter Issues

Test char filters for preprocessing issues:

```bash # Test HTML strip char filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "char_filter": ["html_strip"], "text": "<p>Hello <b>World</b></p>" } '

# Test mapping char filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "char_filter": [{ "type": "mapping", "mappings": ["& => and", "| => or"] }], "text": "Tom & Jerry | Bugs Bunny" } ' ```

Common char filter problems:

```json // WRONG: HTML strip after other filters (order matters!) { "char_filter": ["my_mapping", "html_strip"], // Wrong order "text": "<p>Tom & Jerry</p>" } // Result: "&" may be inside HTML tags, stripped before mapping applied

// CORRECT: HTML strip first, then mapping { "char_filter": ["html_strip", "my_mapping"], "text": "<p>Tom & Jerry</p>" } ```

Step 6: Debug Token Filter Issues

Test token filters individually and in sequence:

```bash # Test lowercase filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase"], "text": "HELLO WORLD" } '

# Test stop filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase", "stop"], "text": "The quick brown fox jumps over the lazy dog" } '

# Test synonym filter curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": [{ "type": "synonym", "synonyms": ["quick,fast,speedy", "jump,leap,hop"] }], "text": "The quick fox jumped" } ' ```

Token filter order matters: ```bash # WRONG: Stop filter before lowercase curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["stop", "lowercase"], "text": "THE QUICK" } ' // Stop filter may not remove "THE" (uppercase)

# CORRECT: Lowercase first, then stop curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer": "standard", "filter": ["lowercase", "stop"], "text": "THE QUICK" } ' // "the" is removed, "quick" remains ```

Step 7: Fix Analyzer Configuration

Update the analyzer configuration correctly:

```bash # Close index to update settings curl -X POST "localhost:9200/my-index/_close"

# Update analyzer settings curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d' { "analysis": { "analyzer": { "my_custom_analyzer": { "type": "custom", "tokenizer": "standard", "char_filter": ["html_strip"], "filter": ["lowercase", "stop", "my_synonym_filter"] } }, "filter": { "my_synonym_filter": { "type": "synonym", "synonyms_path": "analysis/synonyms.txt" } } } } '

# Open index curl -X POST "localhost:9200/my-index/_open" ```

For new index, include settings in creation: ``bash curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d' { "settings": { "analysis": { "analyzer": { "my_analyzer": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase", "stop", "stemmer"] } } } }, "mappings": { "properties": { "title": { "type": "text", "analyzer": "my_analyzer" } } } } '

Step 8: Update Mapping with Correct Analyzer

Apply the analyzer to the correct field:

```bash # Update mapping (only works for new fields or explicit analyzer change) curl -X PUT "localhost:9200/my-index/_mapping" -H 'Content-Type: application/json' -d' { "properties": { "content": { "type": "text", "analyzer": "my_custom_analyzer", "search_analyzer": "my_custom_analyzer" } } } '

# For existing fields with wrong analyzer, reindex to new index: curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": {"index": "my-index"}, "dest": {"index": "my-index-fixed"} } ' ```

Step 9: Verify Analyzer Works on Documents

Test that documents are properly tokenized:

```bash # Test how document was analyzed curl -X POST "localhost:9200/my-index/_explain/doc-id?pretty" -H 'Content-Type: application/json' -d' { "query": { "match": { "content": "hello world" } } } '

# View field terms for a document curl -X GET "localhost:9200/my-index/_termvectors/doc-id?fields=content&pretty" ```

Expected term vector output: ``json { "term_vectors": { "content": { "terms": { "hello": {"term_freq": 1, "tokens": [{"position": 0}]}, "world": {"term_freq": 1, "tokens": [{"position": 1}]} } } } }

Verification

After fixing the analyzer, verify it works correctly:

```bash # Test analyzer output curl -X POST "localhost:9200/my-index/_analyze?pretty" -H 'Content-Type: application/json' -d' { "analyzer": "my_custom_analyzer", "text": "Hello World" } '

# Expected: ["hello", "world"] ```

Test search functionality: ```bash # Index test document curl -X POST "localhost:9200/my-index/_doc" -H 'Content-Type: application/json' -d' { "content": "Hello World Test" } '

# Search for test term curl -X POST "localhost:9200/my-index/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match": { "content": "hello" } } } '

# Expected: Document found ```

Run comprehensive analyzer test: ``bash # Test all expected behaviors for text in "HELLO" "Hello World" "<p>Hello</p>" "quick fox"; do echo "Testing: $text" curl -s -X POST "localhost:9200/my-index/_analyze" \ -H 'Content-Type: application/json' \ -d "{\"analyzer\": \"my_custom_analyzer\", \"text\": \"$text\"}" | jq '.tokens[].token' done

Prevention

To prevent analyzer issues:

  1. 1.Test analyzers during development:
  2. 2.```bash
  3. 3.# Create test script
  4. 4.cat > test_analyzer.sh << 'EOF'
  5. 5.curl -s -X POST "$ES/_analyze" -H 'Content-Type: application/json' \
  6. 6.-d '{"analyzer": "my_analyzer", "text": "'$1'"}' | jq '.tokens'
  7. 7.EOF

chmod +x test_analyzer.sh ./test_analyzer.sh "Test Input" ```

  1. 1.Document analyzer purpose:
  2. 2.```json
  3. 3.{
  4. 4."analyzer": {
  5. 5."product_search_analyzer": {
  6. 6."_comment": "For product names: lowercase, remove HTML, handle synonyms",
  7. 7."type": "custom",
  8. 8....
  9. 9.}
  10. 10.}
  11. 11.}
  12. 12.`
  13. 13.Use analyzer validation in CI/CD:
  14. 14.```python
  15. 15.import requests

def validate_analyzer(es_url, analyzer_name, test_cases): for text, expected_tokens in test_cases: response = requests.post( f"{es_url}/_analyze", json={"analyzer": analyzer_name, "text": text} ) actual_tokens = [t['token'] for t in response.json()['tokens']] assert actual_tokens == expected_tokens, \ f"Analyzer failed for '{text}': expected {expected_tokens}, got {actual_tokens}" ```

  1. 1.Separate index and search analyzers:
  2. 2.```json
  3. 3.{
  4. 4."properties": {
  5. 5."content": {
  6. 6."type": "text",
  7. 7."analyzer": "my_analyzer",
  8. 8."search_analyzer": "my_search_analyzer" // Different for search
  9. 9.}
  10. 10.}
  11. 11.}
  12. 12.`
  13. 13.Use synonym files for maintainability:
  14. 14.```bash
  15. 15.# Store synonyms in external file
  16. 16.echo "quick,fast,speedy
  17. 17.jump,leap,hop
  18. 18.brown,brunette" > config/synonyms.txt

# Reference in analyzer { "filter": { "synonyms": { "type": "synonym", "synonyms_path": "synonyms.txt" } } } ```

  1. 1.Monitor index refresh after changes:
  2. 2.```bash
  3. 3.# Force refresh after analyzer changes
  4. 4.curl -X POST "localhost:9200/my-index/_refresh"
  5. 5.`
  6. 6.Reindex after major analyzer changes:
  7. 7.```bash
  8. 8.# When analyzer changes affect existing documents
  9. 9.curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
  10. 10.{
  11. 11."source": {"index": "old-index"},
  12. 12."dest": {"index": "new-index"}
  13. 13.}
  14. 14.'
  15. 15.`
  • [Technical troubleshooting: Fix bulk request rejected 429 too many requests Is](bulk-request-rejected-429-too-many-requests)
  • [Technical troubleshooting: Fix circuit breaker tripped heap memory Issue in E](circuit-breaker-tripped-heap-memory)
  • [Technical troubleshooting: Fix cluster red missing replica shards Issue in El](cluster-red-missing-replica-shards)
  • [Fix cross cluster search remote unreachable Issue in Elasticsearch-Errors](cross-cluster-search-remote-unreachable)
  • [Fix Elasticsearch Bulk Request Rejected 429 Too Many Requests Issue in Elasticsearch](elasticsearch-bulk-request-rejected-429-too-many-requests)

<script type="application/ld+json"> { "@context": "https://schema.org", "@type": "TechArticle", "headline": "Elasticsearch Analyzer Not Tokenizing", "description": "Fix Elasticsearch analyzer tokenization issues. Debug custom analyzers, char filters, and token filters with _analyze API.", "url": "https://www.fixwikihub.com/fix-elasticsearch-analyzer-not-tokenizing", "publisher": { "@type": "Organization", "name": "FixWikiHub", "url": "https://www.fixwikihub.com" }, "author": { "@type": "Person", "name": "FixWikiHub Editorial Team" }, "datePublished": "2025-12-12T20:46:47.626Z", "dateModified": "2025-12-12T20:46:47.626Z" } </script>