Reference | March 15, 2026 | 3 min read

Character Frequency Analysis: From Cryptography to Text Analytics

Counting how often each letter appears in text has applications from code-breaking to natural language processing. Understand frequency analysis and its modern uses.

The Most Famous Letter in English

E. It accounts for roughly 13% of the letters in typical English text. The next most common letters are T (9.1%), A (8.2%), O (7.5%), I (7.0%), N (6.7%), and S (6.3%). Exact figures vary slightly from one corpus to another, but this distribution, the relative frequency of each letter, is remarkably consistent across sufficiently long English texts, from novels to newspapers to emails.

This consistency is what makes character frequency analysis useful. By counting how often each character appears in a piece of text, you can learn things about the text that are not obvious from reading it.
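As a rough illustration of the counting step, here is a minimal Python sketch. The function name letter_frequencies and the choice to count only the letters A to Z (ignoring case, digits, spaces, and punctuation) are assumptions made for this example, not a description of any particular tool.

```python
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return each letter's share of all A-Z letters in `text`, as a percentage.

    Letters are counted case-insensitively; everything else is ignored.
    The result is ordered from most to least frequent.
    """
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: 100 * n / total for letter, n in counts.most_common()}


sample = "The quick brown fox jumps over the lazy dog"
for letter, pct in letter_frequencies(sample).items():
    print(f"{letter}: {pct:.1f}%")
```

On a short sample like this one the percentages will be far from the English averages; the distribution only settles down once the text runs to several paragraphs.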

The Cryptography Origin

The earliest known description of frequency analysis comes from the Arab mathematician Al-Kindi in the 9th century. For centuries it remained the primary method for breaking substitution ciphers.

A substitution cipher replaces each letter with another letter according to a fixed mapping. "HELLO" becomes "KHOOR" if each letter is shifted by 3 positions (a Caesar cipher, the simplest kind of substitution). The ciphertext looks random, but the underlying letter frequencies are preserved, because each plaintext letter always maps to the same ciphertext letter. If you count the letters in a long encrypted message, the most common ciphertext letter probably stands for E, and the second most common probably stands for T. By matching frequencies, and checking each guess against the partially decoded text, you can recover the substitution and read the message.

This is why simple substitution ciphers are considered trivially weak by modern standards: a few paragraphs of ciphertext are usually enough for frequency analysis to break them.
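The attack can be sketched for the simplest case, a Caesar cipher, where a single guess (the most common ciphertext letter stands for E) reveals the whole key. The helper names caesar_shift and guess_shift are hypothetical, and the sample plaintext was chosen so that E really is its most common letter; on short or unusual messages this single-letter guess can fail, and a general substitution cipher needs more letters matched than just E.

```python
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def caesar_shift(text: str, shift: int) -> str:
    """Shift every A-Z letter by `shift` positions; other characters pass through."""
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            result.append(ch)
    return "".join(result)


def guess_shift(ciphertext: str) -> int:
    """Guess the shift by assuming the most common ciphertext letter stands for E."""
    letters = [ch for ch in ciphertext.upper() if ch in ALPHABET]
    if not letters:
        raise ValueError("ciphertext contains no letters to analyze")
    most_common = Counter(letters).most_common(1)[0][0]
    return (ALPHABET.index(most_common) - ALPHABET.index("E")) % 26


plaintext = "MEET ME NEAR THE GREEN TREE BEFORE SEVEN"
ciphertext = caesar_shift(plaintext, 3)
shift = guess_shift(ciphertext)
recovered = caesar_shift(ciphertext, -shift)
print(ciphertext)
print(shift, recovered)
```

A more robust variant would try all 26 shifts and keep the one whose letter distribution is closest to typical English, rather than relying on one letter.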

Modern Uses of Frequency Analysis

Language detection. Different languages have different frequency distributions. Compared with English, German uses more Z, W, and umlauted characters. French has more accented characters. Italian uses more vowels relative to consonants. By analyzing character frequencies, software can identify the likely language of a text sample.

Authorship analysis. Every writer has unconscious habits — preferred words, punctuation patterns, and characteristic letter frequency distributions. Forensic linguists use these patterns to attribute anonymous texts to authors.

Text compression. Huffman coding assigns shorter binary codes to more frequent characters and longer codes to rare characters. Given the frequency distribution, it produces the shortest possible encoding among schemes that give each character its own fixed bit pattern. This is why character frequency directly affects the compressed size of text.

Data quality checks. If you are processing a dataset that should contain English names and you see frequency patterns that do not match English — perhaps too many Z's or Q's — it may indicate data corruption or encoding issues.

Keyboard layout optimization. Alternative keyboard layouts like Dvorak and Colemak place the most frequently used letters on the home row. Character frequency analysis of typical text determines which letters should be most accessible.
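To make the text compression use case concrete, here is a small sketch of Huffman coding built on Python's heapq module. The function name huffman_codes and the dictionary-merging approach are choices made for this illustration; production compressors build the same kind of code more efficiently and also have to store the code table alongside the data.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code table (character -> bit string) from `text`."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {char: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)
print(f"{encoded_bits} bits with Huffman codes vs {8 * len(text)} bits at 8 bits per character")
```

For this 36-character sample the space, its most frequent character, gets one of the shortest codes, and the total comes to 135 bits instead of 288.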

Beyond Individual Letters

Frequency analysis extends to pairs of letters (bigrams) and triples (trigrams):

Common English bigrams: TH, HE, IN, EN, AN, ER, RE, ON, NT, ED

Common English trigrams: THE, AND, ING, HER, ERE, ENT, THA, NTH

These patterns are even more powerful for language identification and text analysis because they capture the structural patterns of a language, not just its alphabet usage.
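Counting bigrams and trigrams is a small extension of counting single letters. The sketch below assumes that n-grams should not span word boundaries and that only the letters A to Z matter; both are choices made for this example, and the name ngram_counts is hypothetical.

```python
import re
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count letter n-grams inside words, ignoring case and non-letters."""
    counts = Counter()
    # Splitting into runs of A-Z keeps n-grams from spanning spaces or punctuation.
    for word in re.findall(r"[A-Z]+", text.upper()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


sample = "The weather in the north is neither hot nor cold"
print(ngram_counts(sample, 2).most_common(5))
print(ngram_counts(sample, 3).most_common(5))
```

Run on a long English text, the top of these lists should look much like the bigrams and trigrams listed above.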

How to Use the Toobits Character Frequency Analyzer

Paste any text and see a detailed breakdown of character frequencies — counts, percentages, and a visual bar chart. The tool analyzes letters, digits, spaces, and special characters separately. Use it for text analysis, language study, compression research, or satisfying your curiosity about the patterns in any piece of writing. Everything runs in your browser — your text never leaves your device.
