Fix “Value Too Long for Column” Error: UTF-8 String Truncation by Bytes in Java

If you've ever encountered the "value too long for column" database error, especially in a Java backend, the root cause may not be obvious.

In this post, we’ll walk through a real production issue caused by UTF-8 encoding, explain why string length is not equal to byte length, and show how to correctly truncate strings by bytes without breaking multibyte characters.


The Production Issue

We hit a production error during a database insert:

value too long for column

  • Database column limit: 50 bytes
  • Input string length: 53 characters
  • Code already truncated to 50 characters

At first glance, everything looked correct, yet the insert still failed.

Root Cause: UTF-8 Encoding

The issue comes down to this:

50 characters does NOT equal 50 bytes in UTF-8

In UTF-8 encoding:

  • ASCII characters → 1 byte
  • Chinese characters → 3 bytes
  • Emoji → 4 bytes

This means a string with 50 characters can easily exceed a 50-byte database limit.
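A quick way to see the mismatch on the JVM is to compare `String.length()` (which counts UTF-16 code units) against the encoded byte count. This is a minimal sketch; the class and method names are just for illustration:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    // Number of bytes the string occupies once encoded as UTF-8.
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("a"));   // ASCII: 1 byte
        System.out.println(utf8Bytes("汉"));  // CJK: 3 bytes
        System.out.println(utf8Bytes("😊"));  // emoji: 4 bytes

        String fifty = "汉".repeat(50);
        // 50 "characters" but 150 bytes: three times a 50-byte column limit
        System.out.println(fifty.length() + " chars, " + utf8Bytes(fifty) + " bytes");
    }
}
```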

Why Naive Truncation Fails

A typical implementation looks like this:


if (value.length() > 50) {
    value = value.substring(0, 50);
}

This guarantees at most 50 characters, but not at most 50 bytes.

Even worse, naive byte truncation can:

  • Split multibyte characters
  • Produce invalid UTF-8 strings
  • Cause corrupted data or replacement characters (�)
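The corruption is easy to reproduce by chopping the encoded byte array at the limit with no regard for character boundaries. A sketch of that broken approach (`naiveTruncate` is a hypothetical name used only for this demo):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NaiveTruncation {
    // Chops the UTF-8 byte array at maxBytes, ignoring character boundaries.
    public static String naiveTruncate(String value, int maxBytes) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) return value;
        return new String(Arrays.copyOf(bytes, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ab" is 2 bytes and 汉 is 3 bytes (positions 2..4),
        // so cutting at 4 bytes slices 汉 in half.
        String result = naiveTruncate("ab汉", 4);
        // The dangling lead byte decodes to the replacement character U+FFFD
        System.out.println(result.contains("\uFFFD")); // true
    }
}
```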

Correct Solution: Truncate by Bytes Safely

The correct approach must:

  1. Respect the byte limit
  2. Preserve valid UTF-8 encoding
  3. Ensure the last character is not partially cut

UTF-8 Safe Truncate Method (Java)


// Requires: import java.nio.charset.StandardCharsets;
public static String truncateUtf8(String value, int maxBytes) {
    if (value == null) return null;

    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= maxBytes) return value;

    // bytes.length > maxBytes, so bytes[maxBytes] is a valid index.
    // UTF-8 continuation bytes match the bit pattern 10xxxxxx, which is
    // what (b & 0xC0) == 0x80 tests. If the byte at the cut point is a
    // continuation byte, the cut would land inside a multibyte character:
    // back up until we reach that character's lead byte, which is also
    // the end of the last character that fits completely.
    int len = maxBytes;
    while (len > 0 && (bytes[len] & 0xC0) == 0x80) {
        len--;
    }

    return new String(bytes, 0, len, StandardCharsets.UTF_8);
}

Why This Works

  • Detects UTF-8 character boundaries
  • Prevents partial multibyte characters
  • Guarantees byte-safe truncation
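To sanity-check the boundary handling, here is a compact standalone version of the byte-safe truncation with a tiny demo (the class name is illustrative, and the method is included inline so the snippet compiles on its own):

```java
import java.nio.charset.StandardCharsets;

public class Utf8TruncateDemo {
    // Standalone byte-safe truncation: back up while the byte at the
    // cut point is a UTF-8 continuation byte (10xxxxxx).
    public static String truncateUtf8(String value, int maxBytes) {
        if (value == null) return null;
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) return value;
        int len = maxBytes;
        while (len > 0 && (bytes[len] & 0xC0) == 0x80) {
            len--;
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 汉 occupies bytes 2..4, so a 4-byte limit must back up to byte 2
        System.out.println(truncateUtf8("ab汉", 4)); // ab
        // with a 5-byte limit the whole character fits
        System.out.println(truncateUtf8("ab汉", 5)); // ab汉
    }
}
```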

JUnit Test Cases

These tests validate correctness and expose failures in naive implementations.


import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

import java.nio.charset.StandardCharsets;

// Assumes truncateUtf8 is statically imported from the class that defines it.
public class Utf8TruncateTest {

    @Test
    void asciiShouldPass() {
        String input = "abcdefghijklmnopqrstuvwxyz1234567890";
        String result = truncateUtf8(input, 50);
        assertEquals(input, result); // under the limit: returned unchanged
    }

    @Test
    void asciiShouldTruncate() {
        String input = "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        String result = truncateUtf8(input, 50);
        // ASCII is 1 byte per character, so the cut lands exactly on the limit
        assertEquals(50, result.getBytes(StandardCharsets.UTF_8).length);
        assertEquals(input.substring(0, 50), result);
    }

    @Test
    void chineseCharactersShouldNotBreak() {
        String input = "汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉"; // 20 chars × 3 bytes = 60 bytes
        String result = truncateUtf8(input, 50);

        // 16 whole characters fit (48 bytes); the 17th would straddle the limit
        assertEquals(48, result.getBytes(StandardCharsets.UTF_8).length);
        assertFalse(result.contains("\uFFFD"));
    }

    @Test
    void mixedCharactersEdgeCase() {
        String input = "abc汉字def汉字ghi汉字jkl";
        String result = truncateUtf8(input, 20);

        // the 20-byte cut lands between two ASCII characters, so it is exact
        assertEquals(20, result.getBytes(StandardCharsets.UTF_8).length);
        assertFalse(result.contains("\uFFFD"));
    }

    @Test
    void emojiShouldNotBreak() {
        String input = "hello😊world😊test😊"; // each emoji is 4 bytes
        String result = truncateUtf8(input, 15);

        // byte 15 falls inside the second emoji, so truncation backs up to byte 14
        assertEquals("hello😊world", result);
    }

    @Test
    void boundaryEdgeCaseFailsNaive() {
        String input = "aaaaaaa汉"; // 7 ASCII bytes + 3 bytes = 10 bytes
        String result = truncateUtf8(input, 9);

        // a 9-byte cut would split 汉, so only the 7 ASCII bytes survive
        assertEquals("aaaaaaa", result);
        assertFalse(result.contains("\uFFFD"));
    }
}

Key Takeaways

  • Database limits are often defined in bytes, not characters
  • UTF-8 encoding makes string handling more complex
  • Always validate against byte length when persisting data

Final Thoughts

If your database enforces byte limits and your code enforces character limits, you have a production bug waiting to happen.

Handling UTF-8 correctly is essential for any modern backend system that supports international users.

❤️ Support This Blog


If this post helped you, you can support my writing with a small donation. Thank you for reading.

