
Non-ASCII tokens are corrupted sometimes when using the streaming API

Mar 14, 2026
Confidence Score: 54%

Problem

Confirm this is a Node library issue and not an underlying OpenAI API issue

- [x] This is an issue with the Node library

Describe the bug

When using the streaming API, tokens sometimes get corrupted: characters are replaced by two or more `\uFFFD` replacement characters. For example: [code block] when the token received is actually supposed to be `' известни'`. The issue occurs because `LineDecoder` does not deal with multi-byte characters on chunk boundaries. Instead of using a separate `TextDecoder` instance per buffer, perhaps it should use a single `TextDecoderStream` for the entire stream.

To Reproduce

1. Send a streaming completion request that will get non-ASCII tokens as a response.
2. Observe the output. With some probability, some of the tokens will be corrupted.

Code snippets

_No response_

OS: Linux
Node version: v18.19.1
Library version: openai v4.14.2
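The corruption described above can be reproduced directly with `TextDecoder`: decoding each chunk with a fresh decoder mangles a multi-byte character that is split across the boundary, while a single stateful decoder handles it correctly. A minimal sketch:

```javascript
// A multi-byte UTF-8 character split across two chunks is corrupted
// when each chunk gets its own TextDecoder (as the per-buffer decoding
// in LineDecoder does), but decodes cleanly with one stateful decoder.
const bytes = new TextEncoder().encode('известни'); // Cyrillic, 2 bytes/char
const a = bytes.slice(0, 3); // cuts the second character in half
const b = bytes.slice(3);

// Per-chunk decoding: the partial sequence becomes U+FFFD
const broken = new TextDecoder().decode(a) + new TextDecoder().decode(b);

// Stateful decoding: { stream: true } buffers the partial sequence
const decoder = new TextDecoder();
const ok = decoder.decode(a, { stream: true }) + decoder.decode(b);

console.log(broken.includes('\uFFFD')); // true
console.log(ok === 'известни');         // true
```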


1 Fix

Canonical Fix

Implement TextDecoderStream for Streaming API

Medium Risk

The issue arises because the `LineDecoder` in the Node library does not handle multi-byte characters correctly when they are split across chunk boundaries: each chunk is decoded with a fresh `TextDecoder`, so a partial UTF-8 sequence at a boundary cannot be reconstructed and is emitted as the replacement character `\uFFFD`. Using a single `TextDecoderStream` (or a single `TextDecoder` in streaming mode) for the entire stream ensures that multi-byte characters are decoded correctly, regardless of how they are split across buffers.
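One way to picture the fix is to keep a single `TextDecoder` in streaming mode for the lifetime of the connection. A minimal sketch of such a decoder (`StatefulLineDecoder` is an illustrative name, not the library's actual class):

```javascript
// Sketch of a line decoder that keeps one stateful TextDecoder for the
// whole stream, so partial multi-byte sequences are held back by
// { stream: true } instead of becoming U+FFFD.
class StatefulLineDecoder {
  constructor() {
    this.decoder = new TextDecoder('utf-8');
    this.buffer = '';
  }

  // Decode one binary chunk and return the complete lines it yields.
  decode(chunk) {
    this.buffer += this.decoder.decode(chunk, { stream: true });
    const lines = this.buffer.split('\n');
    this.buffer = lines.pop(); // last element may be an incomplete line
    return lines;
  }

  // Flush any trailing bytes and the final partial line at end of stream.
  flush() {
    this.buffer += this.decoder.decode();
    const rest = this.buffer;
    this.buffer = '';
    return rest ? [rest] : [];
  }
}
```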

Awaiting Verification


  1. Replace LineDecoder with TextDecoderStream

    Modify the streaming API implementation to use a single instance of TextDecoderStream instead of multiple LineDecoder instances. This will allow for proper handling of multi-byte characters across chunk boundaries.

    javascript
    // TextDecoderStream is also a global in Node >= 18
    const { TextDecoderStream } = require('stream/web');

    // One decoder instance for the whole stream: partial multi-byte
    // sequences are buffered across chunks instead of becoming \uFFFD.
    const decoded = stream.pipeThrough(new TextDecoderStream('utf-8'));
    const reader = decoded.getReader();
  2. Update Buffer Handling Logic

    Ensure that the buffer handling logic is updated to accommodate the new TextDecoderStream. This may involve adjusting how data is read from the stream and processed.

    javascript
    // Read decoded text chunks from the stream produced in step 1.
    async function processStream(stream) {
      const reader = stream.getReader();
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          // `value` is a decoded string chunk; process it here
        }
      } finally {
        reader.releaseLock();
      }
    }
  3. Test with Non-ASCII Tokens

    Create test cases that specifically send streaming completion requests with non-ASCII tokens. Validate that the output does not contain corrupted characters.

    javascript
    // openai v4 exposes streaming as an async iterable, not an event emitter.
    const stream = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [{ role: 'user', content: 'тестирование' }],
      stream: true,
    });
    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
    }
  4. Conduct Regression Testing

    Perform regression testing on the streaming API to ensure that the changes do not introduce new issues and that existing functionality remains intact.

    bash
    npm run test -- --updateSnapshot
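Steps 1 and 2 can be combined into a single helper that decodes a binary `ReadableStream` into text through one stateful decoder. A sketch (the `readDecoded` name and `byteStream` parameter are illustrative):

```javascript
// Combines steps 1 and 2: pipe a binary ReadableStream through one
// TextDecoderStream and accumulate the decoded text.
// `byteStream` is assumed to be a web ReadableStream of Uint8Array chunks.
async function readDecoded(byteStream) {
  const reader = byteStream
    .pipeThrough(new TextDecoderStream('utf-8'))
    .getReader();
  let out = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    out += value; // each `value` is a string with no split characters
  }
  return out;
}
```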

Validation

To confirm the fix worked, send multiple streaming completion requests that include non-ASCII tokens and verify that the output does not contain any '\uFFFD' characters. Additionally, run the regression tests to ensure no other functionalities are broken.

Sign in to verify this fix

Environment

Submitted by


Alex Chen

2450 rep

Tags

openai, gpt, llm, api, bug