Non-ASCII tokens are sometimes corrupted when using the streaming API
Problem
Confirm this is a Node library issue and not an underlying OpenAI API issue

- [x] This is an issue with the Node library

Describe the bug

When using the streaming API, tokens sometimes get corrupted: characters are replaced by two or more `\uFFFD` replacement characters. For example:

[code block]

when the token received is actually supposed to be `' известни'`. The issue occurs because `LineDecoder` does not handle multi-byte characters that fall on chunk boundaries. Instead of using a separate `TextDecoder` instance per buffer, perhaps it should use a single `TextDecoderStream` for the entire stream.

To Reproduce

1. Send a streaming completion request that will return non-ASCII tokens.
2. Observe the output. With some probability, some of the tokens will be corrupted.

Code snippets

_No response_

OS

Linux

Node version

Node v18.19.1

Library version

openai v4.14.2
Implement TextDecoderStream for Streaming API
The issue arises because the `LineDecoder` in the Node library does not handle multi-byte characters correctly when they are split across chunk boundaries: the decoder fails to reconstruct the characters and emits the replacement character `\uFFFD` instead. Using a single `TextDecoderStream` for the entire stream ensures that multi-byte characters are decoded correctly, regardless of how they are split across buffers.
1. Replace LineDecoder with TextDecoderStream
Modify the streaming API implementation to use a single TextDecoderStream instance instead of creating a new TextDecoder per buffer inside LineDecoder. This allows proper handling of multi-byte characters across chunk boundaries.
```javascript
// Decode the raw byte stream with a single stateful decoder so that
// multi-byte characters split across chunks are reassembled correctly.
const { TextDecoderStream } = require('stream/web');

const decoder = new TextDecoderStream('utf-8');
stream.pipeTo(decoder.writable);
const reader = decoder.readable.getReader();
```

2. Update Buffer Handling Logic
Ensure that the buffer handling logic is updated to accommodate the new TextDecoderStream. This may involve adjusting how data is read from the stream and processed.
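Since line splitting previously lived in LineDecoder, that responsibility would need to be layered on top of the decoded text stream. A hedged sketch of what that could look like (the helper name `linesOf` is our own, not a library API):

```javascript
// Sketch: turn a ReadableStream of decoded text chunks into complete lines,
// buffering any partial trailing line until more data arrives.
async function* linesOf(textStream) {
  let buffer = '';
  const reader = textStream.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += value;
    const parts = buffer.split('\n');
    buffer = parts.pop(); // keep the incomplete trailing line
    for (const line of parts) yield line;
  }
  if (buffer) yield buffer; // flush any final unterminated line
}
```

Because the text has already been decoded by a single stateful decoder, the line splitter never sees a broken multi-byte sequence.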
```javascript
// Read decoded text chunks from the stream until it is exhausted.
async function processStream(stream) {
  const reader = stream.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Process the decoded value
  }
}
```

3. Test with Non-ASCII Tokens
Create test cases that specifically send streaming completion requests with non-ASCII tokens. Validate that the output does not contain corrupted characters.
```javascript
// openai v4 exposes streamed chat completions as an async iterable.
const stream = await openai.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: 'тестирование' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```

4. Conduct Regression Testing
Perform regression testing on the streaming API to ensure that the changes do not introduce new issues and that existing functionality remains intact.
```bash
npm run test -- --updateSnapshot
```
Validation
To confirm the fix worked, send multiple streaming completion requests that include non-ASCII tokens and verify that the output contains no `\uFFFD` characters. Additionally, run the regression tests to ensure no other functionality is broken.
Submitted by
Alex Chen