
GPT-4o bursts onto the scene! It listens, speaks, reads and writes as smoothly as a real person

May 14, 2024
In the early morning of Tuesday, May 14, Beijing time, the American artificial intelligence research company OpenAI held a "Spring Update" event online.
At the event, OpenAI released a new flagship model, GPT-4o, which "can reason across audio, vision and text in real time." According to reports, the new model enables ChatGPT to handle 50 different languages while improving both speed and quality.
Image: OpenAI announces the new flagship model GPT-4o (screenshot from OpenAI's account on X)
The press release stated that GPT-4o is a step towards more natural human-computer interaction: it can accept any combination of text, audio and images as input and generate any combination of text, audio and image outputs. "GPT-4o is especially better at vision and audio understanding compared to existing models."

Before GPT-4o, when users talked to ChatGPT in voice mode, the average latency was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. Because audio input was processed through a pipeline of separate models, a large amount of information was lost along the way: GPT-4 could not directly observe pitch, distinguish between speakers, or hear background noise, nor could it output laughter, singing, or expressions of emotion.
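To see why that information was lost, consider a minimal sketch of the kind of cascaded voice pipeline described above: speech-to-text, then a text-only model, then text-to-speech. The file names and model choices below are assumptions for illustration, using the openai Python SDK; the point is that the middle model only ever sees plain text, so pitch, speaker identity and background sound never reach it, and each stage adds its own share of the multi-second latency.

    # Illustrative sketch of a cascaded voice pipeline (pre-GPT-4o style).
    # Model names and file names are assumptions for illustration.
    from openai import OpenAI

    client = OpenAI()

    # 1. Speech-to-text: the audio is reduced to a plain transcript here,
    #    so tone, speaker identity and background noise are discarded.
    with open("user_question.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # 2. Text-only reasoning: the language model sees only the transcript.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3. Text-to-speech: the spoken answer is synthesized from text alone,
    #    so it cannot laugh, sing, or respond to the user's vocal emotion.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    with open("assistant_reply.mp3", "wb") as f:
        f.write(speech.read())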

By comparison, GPT-4o can respond to audio input in as little as 232 milliseconds, comparable to human response time in conversation. In a recorded video, two executives demonstrated that the model could infer nervousness from rapid breathing, coach the speaker to take deep breaths, and change its tone of voice at the user's request.
In terms of image input, the demonstration video showed an OpenAI executive turning on the camera and asking the model to solve a one-variable equation in real time, a task ChatGPT completed with ease. The executives also showed the ChatGPT desktop app interpreting code and on-screen content (a temperature chart) in real time.
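A minimal sketch of what that kind of image input looks like through OpenAI's Chat Completions API appears below. The file name and prompt are assumptions; encoding a local image as a base64 data URL is the documented pattern for vision requests.

    # Illustrative sketch: asking GPT-4o about a photographed equation.
    # The file name and prompt are assumptions for illustration.
    import base64
    from openai import OpenAI

    client = OpenAI()

    # Read a photo of the handwritten equation and encode it as base64.
    with open("equation.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Walk me through solving the equation in this photo, step by step."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)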

According to the OpenAI press release: "We trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still only scratching the surface of exploring what the model can do and its limitations."
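For contrast with the cascaded pipeline sketched earlier, the snippet below shows what a single end-to-end call looks like: one request carries the audio in, and the same model returns spoken audio back, with no intermediate transcription step. Note that the gpt-4o-audio-preview endpoint used here was shipped by OpenAI some months after this article; it appears purely as an assumption to illustrate the architecture the press release describes.

    # Illustrative sketch: one end-to-end call, audio in and audio out.
    # Assumes the gpt-4o-audio-preview endpoint, released after this article.
    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("user_question.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    )

    # The same network produced the spoken reply; decode and save it.
    with open("assistant_reply.wav", "wb") as f:
        f.write(base64.b64decode(completion.choices[0].message.audio.data))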