Fix “Value Too Long for Column” Error: UTF-8 String Truncation by Bytes in Java
If you've ever encountered the "value too long for column" database error, especially in a Java backend, the root cause may not be obvious.
In this post, we’ll walk through a real production issue caused by UTF-8 encoding, explain why string length is not equal to byte length, and show how to correctly truncate strings by bytes without breaking multibyte characters.
The Production Issue
We hit a production error during a database insert:
value too long for column
- Database column limit: 50 bytes
- Input string length: 53 characters
- Code already truncated to 50 characters
At first glance, everything looked correct—but the insert still failed.
Root Cause: UTF-8 Encoding
The issue comes down to this:
50 characters does NOT equal 50 bytes in UTF-8
In UTF-8 encoding:
- ASCII characters → 1 byte
- Chinese characters → 3 bytes
- Emoji → 4 bytes
This means a string with 50 characters can easily exceed a 50-byte database limit.
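The difference is easy to see by measuring byte lengths directly (the class name here is just for the demo):

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    public static void main(String[] args) {
        // One visible character each, but very different UTF-8 byte counts.
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);  // 1
        System.out.println("汉".getBytes(StandardCharsets.UTF_8).length); // 3
        System.out.println("😊".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

A 50-character Chinese string, for example, needs 150 bytes, three times the column limit.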
Why Naive Truncation Fails
A typical implementation looks like this:
```java
if (value.length() > 50) {
    value = value.substring(0, 50);
}
```
This ensures 50 characters—but not 50 bytes.
Even worse, naive byte truncation can:
- Split multibyte characters
- Produce invalid UTF-8 strings
- Cause corrupted data or replacement characters (�)
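A minimal sketch of that failure mode: truncating the raw byte array with Arrays.copyOf slices a multibyte character in half (the class name is ours):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NaiveTruncateDemo {
    public static void main(String[] args) {
        String input = "abc汉";                        // 3 + 3 = 6 bytes in UTF-8
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        byte[] cut = Arrays.copyOf(bytes, 5);          // cuts 汉 after its second byte
        String result = new String(cut, StandardCharsets.UTF_8);
        // The partial 汉 decodes to a replacement character: "abc�"
        System.out.println(result);
    }
}
```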
Correct Solution: Truncate by Bytes Safely
The correct approach must:
- Respect the byte limit
- Preserve valid UTF-8 encoding
- Ensure the last character is not partially cut
UTF-8 Safe Truncate Method (Java)
```java
import java.nio.charset.StandardCharsets;

public static String truncateUtf8(String value, int maxBytes) {
    if (value == null) return null;
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= maxBytes) return value;
    int len = maxBytes;
    // Step 1: walk back over continuation bytes (10xxxxxx) to find the
    // lead byte of the character at, or crossing, the byte limit
    int start = len;
    while (start > 0 && (bytes[start] & 0xC0) == 0x80) {
        start--;
    }
    // Step 2: read the character's byte length from its lead byte
    int firstByte = bytes[start] & 0xFF;
    int charLength;
    if ((firstByte & 0x80) == 0x00) {        // 0xxxxxxx: ASCII
        charLength = 1;
    } else if ((firstByte & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
        charLength = 2;
    } else if ((firstByte & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
        charLength = 3;
    } else if ((firstByte & 0xF8) == 0xF0) { // 11110xxx: 4-byte sequence
        charLength = 4;
    } else {
        // Invalid lead byte: cut just before it
        return new String(bytes, 0, start, StandardCharsets.UTF_8);
    }
    // Step 3: if the character does not fit entirely, drop it
    if (start + charLength > maxBytes) {
        len = start;
    }
    return new String(bytes, 0, len, StandardCharsets.UTF_8);
}
```
Why This Works
- Detects UTF-8 character boundaries
- Prevents partial multibyte characters
- Guarantees byte-safe truncation
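For illustration, the same boundary handling can be collapsed into a shorter equivalent. Since String.getBytes with UTF_8 always produces valid UTF-8, it is enough to back up over continuation bytes at the cut point; the class name and sample values below are ours:

```java
import java.nio.charset.StandardCharsets;

public class TruncateUsageDemo {
    public static String truncateUtf8(String value, int maxBytes) {
        if (value == null) return null;
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) return value;
        int len = maxBytes;
        // Back up over continuation bytes so no character is split.
        while (len > 0 && (bytes[len] & 0xC0) == 0x80) {
            len--;
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "aaaaaaa汉";              // 7 + 3 = 10 bytes
        String result = truncateUtf8(input, 9);
        // 汉 would straddle the 9-byte limit, so it is dropped entirely.
        System.out.println(result);              // aaaaaaa
    }
}
```

Both versions return the same results; the longer version in the post makes the character-boundary reasoning explicit, which is useful when the input bytes may not be trusted.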
JUnit Test Cases
These tests validate correctness and expose failures in naive implementations.
```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;
import java.nio.charset.StandardCharsets;

// Assumes truncateUtf8 from the section above is on this class or statically imported.
public class Utf8TruncateTest {

    @Test
    void asciiShouldPass() {
        String input = "abcdefghijklmnopqrstuvwxyz1234567890";
        String result = truncateUtf8(input, 50);
        assertEquals(input, result);
    }

    @Test
    void asciiShouldTruncate() {
        String input = "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        String result = truncateUtf8(input, 50);
        assertTrue(result.getBytes(StandardCharsets.UTF_8).length <= 50);
    }

    @Test
    void chineseCharactersShouldNotBreak() {
        String input = "汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉汉";
        String result = truncateUtf8(input, 50);
        assertTrue(result.getBytes(StandardCharsets.UTF_8).length <= 50);
        assertFalse(result.contains("�"));
    }

    @Test
    void mixedCharactersEdgeCase() {
        String input = "abc汉字def汉字ghi汉字jkl";
        String result = truncateUtf8(input, 20);
        assertTrue(result.getBytes(StandardCharsets.UTF_8).length <= 20);
    }

    @Test
    void emojiShouldNotBreak() {
        String input = "hello😊world😊test😊";
        String result = truncateUtf8(input, 15);
        assertTrue(result.getBytes(StandardCharsets.UTF_8).length <= 15);
    }

    @Test
    void boundaryEdgeCaseFailsNaive() {
        String input = "aaaaaaa汉"; // 10 bytes; a naive cut at 9 splits 汉
        String result = truncateUtf8(input, 9);
        assertTrue(result.getBytes(StandardCharsets.UTF_8).length <= 9);
        assertFalse(result.contains("�"));
    }
}
```
Key Takeaways
- Database limits are often defined in bytes, not characters
- UTF-8 encoding makes string handling more complex
- Always validate against byte length when persisting data
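One way to apply that last takeaway is to assert the byte limit at the persistence boundary instead of silently truncating. The class name, method, and limit below are illustrative; match the constant to your actual DDL:

```java
import java.nio.charset.StandardCharsets;

public class ByteLimitCheck {
    // Illustrative column limit; keep this in sync with the database schema.
    private static final int NAME_COLUMN_BYTES = 50;

    public static void validateName(String name) {
        int byteLen = name.getBytes(StandardCharsets.UTF_8).length;
        if (byteLen > NAME_COLUMN_BYTES) {
            throw new IllegalArgumentException(
                "name is " + byteLen + " bytes; column allows " + NAME_COLUMN_BYTES);
        }
    }

    public static void main(String[] args) {
        validateName("short name");            // passes: 10 bytes
        // validateName("汉".repeat(17));      // 51 bytes, would throw
    }
}
```

Whether to reject or truncate is a product decision; the point is that the check must count bytes, not characters.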
Final Thoughts
If your database enforces byte limits and your code enforces character limits, you have a production bug waiting to happen.
Handling UTF-8 correctly is essential for any modern backend system that supports international users.