Non-ASCII tokens are sometimes corrupted when using the streaming API
Problem
Confirm this is a Node library issue and not an underlying OpenAI API issue

- [x] This is an issue with the Node library

Describe the bug

When using the streaming API, tokens sometimes get corrupted: characters are replaced by two or more `\uFFFD` replacement characters. For example:

[code block]

when the token received is actually supposed to be `' известни'`. The issue occurs because `LineDecoder` does not handle multi-byte characters that fall on chunk boundaries. Instead of using a separate `TextDecoder` instance per buffer, perhaps it should use a single `TextDecoderStream` for the entire stream.

To Reproduce

1. Send a streaming completion request that will return non-ASCII tokens.
2. Observe the output. With some probability, some of the tokens will be corrupted.

Code snippets

_No response_

OS

Linux

Node version

Node v18.19.1

Library version

openai v4.14.2
Implement TextDecoderStream for Streaming API
The issue arises because the `LineDecoder` in the Node library does not handle multi-byte characters correctly when they are split across chunk boundaries: the decoder fails to reconstruct the characters and emits the replacement character `\uFFFD` instead. Using a single `TextDecoderStream` for the entire stream ensures that multi-byte characters are decoded correctly, regardless of how they are split across buffers.
1. Replace LineDecoder with TextDecoderStream
Modify the streaming API implementation to use a single TextDecoderStream instance instead of creating a new TextDecoder per buffer inside LineDecoder. This allows proper handling of multi-byte characters across chunk boundaries.
```javascript
// Decode the raw byte stream with a single stateful decoder so that
// multi-byte characters split across chunks are reassembled correctly.
const { TextDecoderStream } = require('stream/web');

const decoder = new TextDecoderStream('utf-8');
stream.pipeTo(decoder.writable);
const reader = decoder.readable.getReader();
```

2. Update Buffer Handling Logic
Ensure that the buffer handling logic is updated to accommodate the new TextDecoderStream. This may involve adjusting how data is read from the stream and processed.
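Since line splitting previously lived in LineDecoder, that responsibility would need to be layered on top of the decoded text stream. A hedged sketch of what that could look like (the helper name `linesOf` is our own, not a library API):

```javascript
// Sketch: turn a ReadableStream of decoded text chunks into complete lines,
// buffering any partial trailing line until more data arrives.
async function* linesOf(textStream) {
  let buffer = '';
  const reader = textStream.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += value;
    const parts = buffer.split('\n');
    buffer = parts.pop(); // keep the incomplete trailing line
    for (const line of parts) yield line;
  }
  if (buffer) yield buffer; // flush any final unterminated line
}
```

Because the text has already been decoded by a single stateful decoder, the line splitter never sees a broken multi-byte sequence.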
```javascript
// Read decoded text chunks from the stream until it is exhausted.
async function processStream(stream) {
  const reader = stream.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Process the decoded value
  }
}
```

3. Test with Non-ASCII Tokens
Create test cases that specifically send streaming completion requests with non-ASCII tokens. Validate that the output does not contain corrupted characters.
```javascript
// openai v4 exposes streamed chat completions as an async iterable.
const stream = await openai.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: 'тестирование' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```

4. Conduct Regression Testing
Perform regression testing on the streaming API to ensure that the changes do not introduce new issues and that existing functionality remains intact.
```bash
npm run test -- --updateSnapshot
```
Validation
To confirm the fix worked, send multiple streaming completion requests that include non-ASCII tokens and verify that the output contains no `\uFFFD` characters. Additionally, run the regression tests to ensure no other functionality is broken.
Submitted by
Alex Chen