Character Frequency Analyzer

Free online character frequency analyzer. Paste any text to see letter frequency, word frequency, bigrams, and character statistics. Useful for cryptanalysis, writing analysis, and linguistics. No signup.

How to Use the Character Frequency Analyzer

Paste any text into the editor on the left. Four analysis tabs update instantly: Characters shows every character and its count, Letters compares your text against a language frequency baseline, Words lists the most common vocabulary with stopword filtering, and N-grams shows character sequence frequencies for cryptanalysis. Use the Export button to download results as JSON, CSV, or a plain text report.

About This Tool

A text frequency analysis tool for writers, linguists, and cryptanalysis students. It analyzes character frequency; letter frequency compared against English, German, French, Spanish, and Icelandic baselines; word frequency with stopword filtering and vocabulary richness metrics (Type-Token Ratio, hapax legomena); and character-level n-grams (bigrams, trigrams, quadgrams) with a visual heatmap. It also computes the Index of Coincidence for cipher identification and the chi-squared distance for language detection. All analysis runs in pure JavaScript with zero external libraries. Pair it with the Word Counter for basic counting or the Reading Time Estimator for readability analysis.

Quick Reference Table

Metric                  Description
Index of Coincidence    Measures letter distribution evenness: ~0.067 for English, ~0.038 for random
Chi-Squared (χ²)        Distance from expected language baseline: lower is a closer match
Type-Token Ratio        Unique words / total words: higher means richer vocabulary
Hapax Legomena          Words appearing exactly once: indicates vocabulary diversity
Bigram                  Two consecutive characters: th, he, in are top English bigrams
Trigram                 Three consecutive characters: the, and, ing are top English trigrams
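
The Index of Coincidence in the table can be read as the probability that two letters drawn at random from the text (without replacement) are the same. The following is a minimal Python sketch of that standard formula, not the tool's own JavaScript. The function name and the choice to keep only the letters a-z after lowercasing are assumptions made for illustration.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn without replacement from text match."""
    # Assumption: only ASCII letters a-z count, compared case-insensitively.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Sum over letters of c * (c - 1), divided by n * (n - 1).
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

For uniformly random letters the value tends toward 1/26, which is the ~0.038 figure quoted above. Short texts give noisy values, so the reference figures are only meaningful for reasonably long samples.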

Frequently Asked Questions

What is a bigram?

A bigram is any sequence of two consecutive characters. In the text ‘hello’, the bigrams are ‘he’, ‘el’, ‘ll’, ‘lo’. Character bigrams are used in cryptanalysis, language detection, spell checking, and language models. The most common English bigrams are ‘th’, ‘he’, ‘in’, ‘er’, and ‘an’.
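
As a rough illustration of the definition above, character n-grams can be extracted with a sliding window. This is a Python sketch under stated assumptions: the names `char_ngrams` and `top_ngrams` are hypothetical, and the text is used as given, so spaces and punctuation become part of the n-grams unless they are stripped first.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text of length L has L - n + 1 n-grams (none if it is shorter than n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_ngrams(text: str, n: int, k: int = 5) -> list[tuple[str, int]]:
    """Return the k most frequent n-grams with their counts."""
    return Counter(char_ngrams(text, n)).most_common(k)
```

Calling `char_ngrams("hello", 2)` yields the four bigrams listed above; passing `n=3` or `n=4` gives trigrams or quadgrams with the same code.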

What does Type-Token Ratio measure?

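Type-Token Ratio (TTR) is the ratio of unique words (types) to total words (tokens). A text with a TTR of 1.0 has no repeated words. TTR falls as a text gets longer, because common words recur, so it is best compared across texts of similar length. At comparable lengths, academic and literary texts typically have a higher TTR than news writing or casual speech.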
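
A small Python sketch of how TTR and hapax legomena might be computed together. The tokenization rule (lowercase runs of letters, with internal apostrophes allowed) and the function name `vocabulary_stats` are assumptions for illustration; the tool's actual word-splitting rules are not described here.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def vocabulary_stats(text: str) -> dict:
    """Compute token count, type count, TTR, and the list of hapax legomena."""
    words = WORD_RE.findall(text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    return {
        "tokens": len(words),                # total words
        "types": len(counts),                # unique words
        "ttr": len(counts) / len(words),     # types / tokens
        "hapax": sorted(w for w, c in counts.items() if c == 1),
    }
```

For the sentence "The cat and the dog" this gives 5 tokens, 4 types, and a TTR of 0.8, with "and", "cat", and "dog" as the words that appear exactly once.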

Why does case-insensitive mode combine uppercase and lowercase letters?

In frequency analysis, ‘E’ and ‘e’ are the same letter and carry the same linguistic information. Case-insensitive mode merges them into a single count, so each letter's frequency reflects every occurrence instead of being split across two entries.
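
The effect of merging case can be sketched in a few lines of Python. The function name and the `case_sensitive` flag are hypothetical, and `str.lower()` is an assumed normalization; the tool may fold case differently for non-English alphabets.

```python
from collections import Counter


def char_counts(text: str, case_sensitive: bool = False) -> Counter:
    """Count every character, merging upper and lower case unless told not to."""
    if not case_sensitive:
        # Assumption: plain lowercasing is enough to treat 'E' and 'e' alike.
        text = text.lower()
    return Counter(text)
```

With the default setting, `char_counts("Eel")` reports two occurrences of "e"; with `case_sensitive=True`, "E" and "e" are counted once each.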

How does the chi-squared language detection work?

Chi-squared distance measures how far the observed letter frequencies deviate from the expected frequencies for a given language. Lower scores indicate a closer match. Running this against multiple language baselines and ranking the results provides quick language identification.
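
The scoring and ranking described above can be sketched as follows, using the usual chi-squared statistic: the sum over letters of (observed - expected)² / expected. This is a Python illustration under assumptions. The function names are hypothetical, each baseline is assumed to be a mapping from letter to relative frequency summing to 1, and the two-letter baselines in the usage note are toy values, not real language data.

```python
from collections import Counter


def chi_squared(text: str, expected_freq: dict[str, float]) -> float:
    """Chi-squared statistic of the text's letter counts against a baseline."""
    # Only characters present in the baseline are counted.
    counts = Counter(ch for ch in text.lower() if ch in expected_freq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text has no characters covered by the baseline")
    score = 0.0
    for letter, p in expected_freq.items():
        if p <= 0:
            raise ValueError("baseline frequencies must be positive")
        expected = total * p  # expected count for this letter
        score += (counts.get(letter, 0) - expected) ** 2 / expected
    return score


def rank_languages(text: str, baselines: dict[str, dict[str, float]]):
    """Return (name, score) pairs, best match (lowest score) first."""
    scores = [(name, chi_squared(text, freq)) for name, freq in baselines.items()]
    return sorted(scores, key=lambda pair: pair[1])
```

With toy baselines `{"a": 0.5, "b": 0.5}` and `{"a": 0.9, "b": 0.1}`, the text "aabb" scores 0 against the first and about 7.11 against the second, so the first ranks as the closer match. Real baselines would cover a full alphabet, and very short texts tend to give unreliable rankings.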
