Op-ed: Benchmarking ChatGPT’s capabilities against alternatives including Anthropic’s Claude 2, Google’s Bard, and Meta’s Llama2


As previously reported, recent research reveals inconsistencies in ChatGPT models over time. A Stanford and UC Berkeley study analyzed the March and June versions of GPT-3.5 and GPT-4 on diverse tasks. The results show significant drifts in performance, even over just a few months.

Source: Stanford University & UC Berkeley

For example, GPT-4’s prime number accuracy plunged from 97.6% to 2.4% between March and June due to issues following step-by-step reasoning. GPT-4 also grew more reluctant to answer sensitive questions directly, with response rates dropping from 21% to 5%. However, it provided less rationale for refusals.

Both GPT-3.5 and GPT-4 generated buggier code in June compared to March. The percentage of directly executable Python snippets dropped substantially because of extra non-code text.

While visual reasoning improved slightly overall, generations for the same puzzles changed unpredictably between dates. The sizeable inconsistencies over short periods raise concerns about relying on these models for sensitive or mission-critical uses without ongoing testing.

The researchers concluded that the findings highlight the need for continuous monitoring of ChatGPT models as their behavior evolves across metrics like accuracy, safety, and robustness.

The opaque update process makes rigorous testing crucial for understanding shifts in performance over time.

Is ChatGPT worse than competitors now?

CryptoSlate conducted a small internal experiment using ChatGPT Plus (GPT-4), the OpenAI API (GPT-4), Anthropic’s Claude 2, and Google’s Bard, applying the base prompt used in part of the research:

‘Is 17077 a prime number?’

The prompt was used on each model, with further reflection prompts as described below.
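
For reference, the ground truth is easy to verify with a few lines of trial division. The snippet below is a minimal sketch for checking the answer, not code from the study or from any of the models tested:

```python
import math

def is_prime(n: int) -> bool:
    """Trial division: test every divisor up to the integer square root of n."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True -- no divisor exists up to isqrt(17077) = 130
```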

ChatGPT & OpenAI API

When given the prompt, ChatGPT and the OpenAI API responded ‘no’ and hallucinated on the math. The image below details the conversation, with the model unable to identify 17077 as a prime number even after several reflections.

OpenAI API

To be clear, 13 x 1313 is 17,069.

The OpenAI GPT-4 API was unable to reach this conclusion until specifically asked to calculate 13 x 1313 and discover that the answer is not 17077, as it had stated.
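
A two-line check confirms both the correct product and that 13 is not a factor of 17077 (a quick verification, not part of the original exchange):

```python
print(13 * 1313)   # 17069, not 17077
print(17077 % 13)  # 8 -- 13 does not divide 17077
```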

Anthropic’s Claude 2

However, Anthropic’s Claude 2 demonstrated its problem-solving process by performing calculations before providing the correct response.

Anthropic Claude 2

CryptoSlate then asked Claude 2 to perform the same task without showing its workings in a new chat window. Claude 2 gave a solid answer, refusing to commit while offering further insight into the solution.

“Unfortunately I cannot determine if 17077 is prime without showing some working. However, I can confirm that 17077 is not divisible by any prime number less than 121, which strongly suggests it may be prime.”

Google Bard

Google Bard tackled the question with a similar strategy to Claude 2. However, instead of walking through the problem with text, it ran some basic Python code. Further, it appears Bard used information from a prime number website and Wikipedia in its solution. Interestingly, the page cited from the prime number site, primenumbers.info, included only information about other prime numbers, not 17077.

Google Bard
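
Bard’s exact snippet is not reproduced here; a plausible reconstruction of the kind of check it ran, assuming only the standard sympy library, would be:

```python
# Hypothetical reconstruction -- Bard's actual code was not published.
from sympy import isprime

print(isprime(17077))  # True
```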

Meta’s Llama 2

Interestingly, Meta’s recently released 70 billion parameter open-source model Llama 2 performed similarly to GPT-4 in CryptoSlate’s limited testing.

Meta Llama 2

Yet, when asked to reflect and show its working, Llama 2 could decipher that 17077 is a prime number, unlike the GPT-4 versions currently available.

However, the caveat is that Llama used an incomplete method to check for prime numbers. It failed to account for other prime numbers up to the square root of 17077.

Therefore, technically, Llama failed successfully.
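
To see why stopping early is unsound: a complete trial-division check must rule out every prime up to ⌊√17077⌋ = 130, and a cutoff of 121 (the bound Claude 2 cited) leaves 127 untested. The sketch below illustrates the gap; it is an illustration, not Llama 2’s actual output:

```python
import math

n = 17077
limit = math.isqrt(n)  # 130 -- every prime <= 130 must be ruled out
print(limit)

# Primes that a cutoff of 121 misses:
missed = [p for p in range(122, limit + 1)
          if all(p % d for d in range(2, math.isqrt(p) + 1))]
print(missed)          # [127]; 17077 % 127 == 59, so 17077 is still prime
```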

GPT4-0613 version, June 13, 2023

CryptoSlate also tested the math puzzle against the GPT4-0613 model (the June version) and received the same result. The model suggested 17077 is not a prime number in its first response. Further, when asked to show its working, it eventually gave up. It concluded that the next reasonable number must be divisible by 17077 and stated that it was, therefore, not a prime number.

Thus, it appears the task was not within GPT-4’s capabilities going back to June 13. Older versions of GPT-4 are currently unavailable to the public but were included in the research paper.

Code Interpreter

Interestingly, ChatGPT, with the ‘Code Interpreter’ feature, answered correctly on its first attempt in CryptoSlate’s testing.

OpenAI GPT-4 Code Interpreter
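
The interpreter’s generated code was not included in CryptoSlate’s write-up; a hypothetical stand-in for what the tool likely executed is:

```python
# Hypothetical stand-in; the snippet ChatGPT's interpreter actually
# generated was not published.
import math

n = 17077
factors = [d for d in range(2, math.isqrt(n) + 1) if n % d == 0]
print("prime" if not factors else f"composite: divisible by {factors[0]}")
# -> prime
```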

OpenAI response & model impact

In response to claims that OpenAI’s models are degrading, The Economic Times reported that OpenAI’s VP of Product, Peter Welinder, denied them, asserting that each new version is smarter than the previous one. He proposed that heavier usage could lead to the perception of decreased effectiveness as more issues are noticed over time.

Interestingly, another study from Stanford researchers published in JAMA Internal Medicine found that the latest version of ChatGPT significantly outperformed medical students on challenging clinical reasoning exam questions.

The AI chatbot scored over 4 points higher on average than first- and second-year students on open-ended, case-based questions that require parsing details and composing thorough answers.

Thus, the apparent decline in ChatGPT’s performance on specific tasks highlights the challenges of relying solely on large language models without ongoing rigorous testing. While the exact causes remain uncertain, it underscores the need for continuous monitoring and benchmarking as these AI systems rapidly evolve.

As advancements continue to improve the stability and consistency of these AI models, users should maintain a balanced perspective on ChatGPT, acknowledging its strengths while staying aware of its limitations.
