How to Safely Truncate a String by Byte Length in Java (UTF-8 Solution)

In Java backend development, it’s common to enforce size limits on strings, such as:

  • Database column byte limits
  • API payload size constraints
  • Logging or message size restrictions

A typical approach is:


value.substring(0, n);

However, substring() counts UTF-16 char units, not encoded bytes, so the result can still exceed a byte limit once the string is encoded as UTF-8.


The Core Problem: Characters ≠ Bytes

In UTF-8 encoding:

  • ASCII characters → 1 byte
  • Accented Latin, Greek, Cyrillic → 2 bytes
  • Most Chinese characters → 3 bytes
  • Most emoji → 4 bytes (more for combined emoji sequences)

For example:


"hello" → 5 bytes
"你好" → 6 bytes
"😊" → 4 bytes

This means:

Truncating by character count does NOT guarantee byte size limits
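
To make the gap concrete, here is a minimal sketch that compares String.length() with the encoded size. The class and method names (LengthVsBytes, utf8Length) are made up for this illustration, and the non-ASCII literals are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;

final class LengthVsBytes {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",
            "\u4F60\u597D",   // the two Chinese characters from the example above
            "\uD83D\uDE0A"    // one emoji (U+1F60A), stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.println("length()=" + s.length()
                    + ", code points=" + s.codePointCount(0, s.length())
                    + ", UTF-8 bytes=" + utf8Length(s));
        }
    }
}
```

Note that the emoji has length() 2 but is a single code point taking 4 bytes, so neither char count nor code point count tracks the byte size.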

---

Does Java Provide a Built-in Solution?

No. The JDK has no single method that truncates a string to a given byte length.

While Java offers low-level APIs such as:

  • CharsetEncoder
  • ByteBuffer
  • String.getBytes()

These are building blocks: they encode text or report byte counts, but the truncation logic around them is left to you.
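
For reference, this is roughly what the manual handling looks like with CharsetEncoder and ByteBuffer. Treat it as a sketch rather than a recommended implementation: the class and method names (EncoderTruncate, truncateWithEncoder) are hypothetical, and it relies on the encoder stopping before a character that does not fit into the output buffer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    static String truncateWithEncoder(String value, int maxBytes) {
        if (value == null) return null;
        if (maxBytes <= 0) return "";

        // REPLACE keeps the encoder going past unpaired surrogates,
        // mirroring what String.getBytes(UTF_8) does.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        CharBuffer in = CharBuffer.wrap(value);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // capacity is the byte limit

        // Encoding stops when the next character no longer fits in 'out'.
        encoder.encode(in, out, true);

        // in.position() is the number of chars that were fully encoded.
        return value.substring(0, in.position());
    }
}
```

It works, but it takes three NIO types and some knowledge of encoder behavior to express a one-line intent.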

---

Common Incorrect Approaches

❌ Approach 1: substring()


value.substring(0, 50);

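Problem: The limit applies to chars, not bytes, so 50 chars can encode to as many as 150 UTF-8 bytes. The call also throws StringIndexOutOfBoundsException when the value is shorter than 50 chars, and it can split a surrogate pair in the middle of an emoji.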

❌ Approach 2: Truncate byte[] directly


new String(bytes, 0, maxBytes, StandardCharsets.UTF_8);

Problems:

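  • May split a multi-byte character at the cut point
  • The decoder replaces the incomplete trailing bytes with U+FFFD (�), so the text is corrupted
  • U+FFFD itself encodes to 3 bytes, so the re-encoded result can exceed maxBytes again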
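
Both failure modes are easy to reproduce. The sketch below uses hypothetical helper names (NaiveTruncation, bytesAfterSubstring, cutBytes) and escaped literals for the two Chinese characters used earlier.

```java
import java.nio.charset.StandardCharsets;

final class NaiveTruncation {
    // Approach 1: cut by char count, then measure the UTF-8 size of the result.
    static int bytesAfterSubstring(String value, int chars) {
        return value.substring(0, chars).getBytes(StandardCharsets.UTF_8).length;
    }

    // Approach 2: cut the encoded bytes at an arbitrary offset and decode them.
    static String cutBytes(String value, int maxBytes) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(maxBytes, bytes.length);
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u4F60\u597D"; // two characters, 6 bytes in UTF-8

        // "Limit" of 2 chars still yields 6 bytes.
        System.out.println(bytesAfterSubstring(text, 2));

        // Cutting at byte 4 leaves a dangling lead byte, decoded as U+FFFD.
        String cut = cutBytes(text, 4);
        System.out.println(cut.endsWith("\uFFFD"));
    }
}
```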
---

Correct Solution: Safe UTF-8 Byte Truncation

The correct approach must:

  1. Respect the byte limit
  2. Preserve valid UTF-8 encoding
  3. Avoid cutting characters in half

Java Implementation


import java.nio.charset.StandardCharsets;

public static String truncateUtf8(String value, int maxBytes) {
    if (value == null) return null;
    if (maxBytes <= 0) return "";

    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= maxBytes) return value;

    // bytes[maxBytes] is the first byte that would be dropped (it exists,
    // because bytes.length > maxBytes). If it is a continuation byte
    // (bit pattern 10xxxxxx), the cut falls inside a multi-byte character.
    int end = maxBytes;

    // Step back to the lead byte of that character so the whole
    // character is dropped instead of being cut in half.
    while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
        end--;
    }

    // bytes[0..end) now ends on a character boundary and end <= maxBytes.
    return new String(bytes, 0, end, StandardCharsets.UTF_8);
}
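
If allocating the full byte array is a concern for very large strings, the same limit can also be enforced by walking code points and summing their UTF-8 sizes. This is an alternative sketch, not the implementation above; the names (Utf8CodePointTruncate, utf8Size, truncate) are hypothetical.

```java
final class Utf8CodePointTruncate {
    // UTF-8 size of one code point: 1 to 4 bytes depending on its range.
    // An unpaired surrogate is counted as 3 bytes here, which over-estimates
    // (String.getBytes encodes it as a single '?'), so the limit still holds.
    static int utf8Size(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    static String truncate(String value, int maxBytes) {
        if (value == null) return null;

        int used = 0; // bytes consumed so far
        int i = 0;    // char index of the next code point
        while (i < value.length()) {
            int cp = value.codePointAt(i);
            int size = utf8Size(cp);
            if (used + size > maxBytes) break; // next character would not fit
            used += size;
            i += Character.charCount(cp);      // 2 chars for a surrogate pair
        }
        return value.substring(0, i);
    }
}
```

Because it advances by whole code points, this version never splits a surrogate pair either.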
---

Why This Works

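  • Every UTF-8 continuation byte has the bit pattern 10xxxxxx, so a character boundary can be detected by inspecting a single byte
  • A character that does not fit completely is dropped, never cut in half
  • The output is valid UTF-8 and never longer than maxBytes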
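
To see the boundary rule on real data, the following sketch prints each byte of a string in hex and marks continuation bytes with a plus sign. The helper names (Utf8ByteKinds, isContinuation, describe) are illustrative only.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteKinds {
    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Renders each UTF-8 byte as hex, appending '+' to continuation bytes.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
            if (isContinuation(b)) sb.append('+');
        }
        return sb.toString();
    }
}
```

For the letter a followed by the first Chinese character from the earlier example, describe returns "61 E4 BD+ A0+". A cut is safe only where the next byte carries no plus sign.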
---

Example Use Cases

  • Preventing database insert errors (value too long for column)
  • Limiting API request/response payload size
  • Safe logging in distributed systems
---

Conclusion

Java does not provide a built-in way to safely truncate strings by byte length, so UTF-8 boundaries have to be handled in your own code.

  • Character length ≠ byte length
  • UTF-8 requires boundary-aware handling
  • This utility is essential in real-world backend systems

If your application handles international text or strict byte limits, this method is a must-have.
