How to Safely Truncate a String by Byte Length in Java (UTF-8 Solution)
In Java backend development, it’s common to enforce size limits on strings, such as:
- Database column byte limits
- API payload size constraints
- Logging or message size restrictions
A typical approach is:
value.substring(0, n);
However, substring() counts UTF-16 chars, not encoded bytes, so it cannot enforce a UTF-8 byte limit.
The Core Problem: Characters ≠ Bytes
In UTF-8 encoding:
- ASCII characters → 1 byte
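- Most Chinese characters → 3 bytes (rare ones outside the Basic Multilingual Plane take 4)
- Most emoji → 4 bytes (emoji built from several code points take more)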
For example:
"hello" → 5 bytes
"你好" → 6 bytes
"😊" → 4 bytes
This means:
Truncating by character count does NOT guarantee byte size limits.
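A quick way to check these numbers is to compare String.length() with the encoded byte count. This is only an illustrative sketch: the class name ByteLengthDemo and the helper utf8Length are made up for this example, and the non-ASCII literals are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

final class ByteLengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",          // five ASCII characters
            "\u4F60\u597D",   // two Chinese characters (U+4F60, U+597D)
            "\uD83D\uDE0A"    // one emoji, U+1F60A, stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, " + utf8Length(s) + " bytes");
        }
        // prints:
        // 5 chars, 5 bytes
        // 2 chars, 6 bytes
        // 2 chars, 4 bytes
    }
}
```

Note that the emoji reports 2 chars: String.length() counts UTF-16 code units, which is a third unit of measure besides characters and bytes.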
Does Java Provide a Built-in Solution?
No. Java does not provide a single method that truncates a string to a byte length on a character boundary.
Java does offer low-level APIs such as:
- CharsetEncoder
- ByteBuffer
- String.getBytes()
None of them truncates at a character boundary in one call, so you have to combine them and handle the boundaries yourself.
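For comparison, here is a rough sketch of how those low-level APIs could be combined. It relies on CharsetEncoder.encode() stopping with an overflow result before it writes a character that no longer fits into the output ByteBuffer. The class name EncoderTruncate and the method truncate are hypothetical, and this is one possible arrangement rather than a recommended standard.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    // Hypothetical helper: longest prefix of value whose UTF-8 encoding fits in maxBytes.
    static String truncate(String value, int maxBytes) {
        if (value == null) return null;
        if (maxBytes <= 0) return "";
        // Replace unpaired surrogates instead of stopping at them (assumption: this
        // mirrors what String.getBytes does, which also substitutes such input).
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // A UTF-16 char never needs more than 3 UTF-8 bytes, so cap the buffer size.
        int capacity = (int) Math.min((long) maxBytes, 3L * value.length());
        ByteBuffer out = ByteBuffer.allocate(capacity);
        CharBuffer in = CharBuffer.wrap(value);
        // Encoding stops when the next character does not fit into the buffer.
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were consumed completely.
        return value.substring(0, in.position());
    }
}
```

Because the result is a substring of the original, nothing is decoded back from bytes, at the cost of the extra encoder setup.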
Common Incorrect Approaches
❌ Approach 1: substring()
value.substring(0, 50);
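Problem: 50 chars can encode to as many as 150 UTF-8 bytes, so the result may exceed the byte limit. substring() can also cut a surrogate pair (for example an emoji) in half.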
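Both effects are easy to reproduce. The snippet below is a small illustration only; SubstringPitfall and utf8Length are names invented for it.

```java
import java.nio.charset.StandardCharsets;

final class SubstringPitfall {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Four Chinese characters: 4 chars, 12 UTF-8 bytes.
        String cjk = "\u4F60\u597D\u4F60\u597D";
        String cut = cjk.substring(0, 3);
        // 3 chars were kept, but they occupy 9 bytes.
        System.out.println(cut.length() + " chars, " + utf8Length(cut) + " bytes"); // 3 chars, 9 bytes

        // 'a' followed by U+1F60A, which Java stores as two chars (a surrogate pair).
        String emoji = "a\uD83D\uDE0A";
        String broken = emoji.substring(0, 2);
        // The last char is now a lone high surrogate, which is not a valid character.
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true
    }
}
```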
❌ Approach 2: Truncate byte[] directly
new String(bytes, 0, maxBytes, StandardCharsets.UTF_8);
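Problems:
- May split a multi-byte character at the cut point
- The decoder replaces the incomplete sequence with the replacement character U+FFFD (�)
- U+FFFD itself encodes to 3 bytes, so the re-encoded result can exceed the limit again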
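The corruption can be shown with a two-character input and a limit that falls inside the second character. NaiveByteCut and naiveCut are hypothetical names for this demonstration of the incorrect approach.

```java
import java.nio.charset.StandardCharsets;

final class NaiveByteCut {
    // Cuts the encoded bytes at maxBytes without looking at character boundaries.
    static String naiveCut(String value, int maxBytes) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(maxBytes, bytes.length);
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, 3 bytes each; the cut at 4 lands inside the second one.
        String cut = naiveCut("\u4F60\u597D", 4);
        System.out.println(cut.contains("\uFFFD")); // true
        // Re-encoding the damaged string needs more than the 4 bytes that were allowed.
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length > 4); // true
    }
}
```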
Correct Solution: Safe UTF-8 Byte Truncation
The correct approach must:
- Respect the byte limit
- Preserve valid UTF-8 encoding
- Avoid cutting characters in half
Java Implementation
import java.nio.charset.StandardCharsets;

// Place this method inside any utility class.
public static String truncateUtf8(String value, int maxBytes) {
    if (value == null) return null;
    if (maxBytes <= 0) return "";
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= maxBytes) return value;
    int len = maxBytes;
    // Step 1: find the lead byte of the last character that starts within the limit
    // (continuation bytes have the bit pattern 10xxxxxx)
    int start = len - 1;
    while (start > 0 && (bytes[start] & 0xC0) == 0x80) {
        start--;
    }
    // Step 2: determine the character's byte length from its lead byte
    int firstByte = bytes[start] & 0xFF;
    int charLength;
    if ((firstByte & 0x80) == 0x00) {
        charLength = 1;
    } else if ((firstByte & 0xE0) == 0xC0) {
        charLength = 2;
    } else if ((firstByte & 0xF0) == 0xE0) {
        charLength = 3;
    } else if ((firstByte & 0xF8) == 0xF0) {
        charLength = 4;
    } else {
        // Not a valid lead byte: drop everything from this position on
        return new String(bytes, 0, start, StandardCharsets.UTF_8);
    }
    // Step 3: keep the last character only if all of its bytes fit
    if (start + charLength > maxBytes) {
        len = start;
    }
    return new String(bytes, 0, len, StandardCharsets.UTF_8);
}
Why This Works
- Detects UTF-8 character boundaries: continuation bytes always have the form 10xxxxxx, so (b & 0xC0) == 0x80 identifies them
- Avoids partial multi-byte characters
- Ensures valid output within byte limits
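The boundary rule can also be expressed more compactly. The variant below is a sketch, not the method from this post: it assumes the bytes come from String.getBytes (and are therefore well-formed UTF-8) and simply moves the cut point backwards until it no longer sits on a continuation byte. Utf8Boundary, isContinuation and truncateAtBoundary are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Boundary {
    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    static String truncateAtBoundary(String value, int maxBytes) {
        if (value == null) return null;
        if (maxBytes <= 0) return "";
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) return value;
        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte, the cut falls inside a character, so back up.
        int end = maxBytes;
        while (end > 0 && isContinuation(bytes[end])) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "\u4F60\u597D"; // two Chinese characters, 6 bytes
        for (int limit = 0; limit <= 6; limit++) {
            System.out.println(limit + " -> " + truncateAtBoundary(value, limit).length() + " chars");
        }
        // limits 0-2 keep 0 chars, limits 3-5 keep 1 char, limit 6 keeps 2 chars
    }
}
```

The trade-off is that this version trusts its input: it has no fallback for byte arrays that are not valid UTF-8.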
Example Use Cases
- Preventing database insert errors (value too long for column)
- Limiting API request/response payload size
- Safe logging in distributed systems
Conclusion
Java has no single built-in method for truncating a string by byte length, so you need to handle UTF-8 character boundaries yourself.
- Character length ≠ byte length
- UTF-8 requires boundary-aware handling
- This utility is essential in real-world backend systems
If your application handles international text or strict byte limits, this method is a must-have.