Science Connect

A Springer Healthcare Initiative For Pharma Professionals

OCTOBER 2023

What’s the best chatbot for me? Researchers put LLMs through their paces

When it comes to large language models, there’s one for every occasion. Find the most appropriate match for you in our AI speed-dating feature.

By Elizabeth M. Humphries, Carrie Wright, Ava M. Hoffman, Candace Savonen & Jeffrey T. Leek

Data scientist Rumman Chowdhury (centre) advises students tasked with breaking artificial-intelligence chatbots during a competition in July. Credit: Marvin Joseph/The Washington Post via Getty

The widely hyped and controversial large language models (LLMs) — better known as artificial intelligence (AI) chatbots — are becoming indispensable aids for coding, writing, teaching and more. Their growing popularity has been matched by an increase in user-friendly options that are accessible through Internet browsers. By our count, there are at least eight major options, and even more niche ones; you might have even tried a few. But you probably haven’t had time to systematically test your prompts on several bots at once, so you might not be getting the most out of them.

To better match tools with applications, we tested eight popular browser-based LLMs in formal and casual writing, text and tone editing, and programming tasks. These LLMs were trained on different data and have different ‘personalities’ and approaches to answering questions. We spent a shocking amount of time and energy managing the frustration that comes with poorly written text and confusing AI-generated code in our search for the best collaborator. In the end, you will have to balance their strengths and weaknesses to find the perfect match.

Here we provide a quick summary of our (non-quantitative, non-scientific) impressions of each chatbot’s behaviour (see ‘Which chatbot is right for you?’).

Bard, the ‘playful one’

Google’s Bard AI is fun to use. In our experience, it offers the most human-like responses, probably because its training data contained less formal communication, including posts on social media and online discussion boards. For instance, we asked Bard what its zodiac sign might be if it were human. It said that, on the basis of when it went live, it would be a Virgo. It also responded with “I don’t know” instead of a wrong answer more frequently than did other chatbots. However, it struggled when asked specific programming questions. Bard is a great tool for changing the tone of your writing to be more approachable to lay audiences and for writing and refining e-mails, or if you want to interact with a bot that has a natural style of speaking.

Claude, the ‘witty one’

Claude, developed by the start-up company Anthropic in San Francisco, California, has a conversational style but feels more formal than Bard. It also has the best grasp of wordplay. In our testing, Claude (which is available in two forms: Claude-instant and Claude 2) was the only LLM that could reliably suggest titles or acronyms that made sense, and we have used it to name several projects. We also liked how it advises on changing the tone and formality of a writing sample for different audiences. Claude is particularly good at summarizing written text and performed well at writing code.

ChatGPT, the ‘popular one’

Most people who have dabbled with LLMs have probably tried ChatGPT-3.5 or the updated version, ChatGPT-4 — made by OpenAI in San Francisco. Another option is Sage, from ThoughtSpot in Mountain View, California; it was built using the GPT architecture but was trained on different data. All three performed similarly. These bots have the most straightforward communication style of those we tested. ChatGPT will always give an answer, but sometimes the answer is incorrect. It also sometimes invents references¹. And it doesn’t always change its answers substantially when corrected by the user.

Carrie Wright, Candace Savonen, Ava Hoffman and Elizabeth Humphries (left to right) have investigated how large language models can be applied to science. Credit: Carrie Wright and Clifton McKee

ChatGPT-3.5 and ChatGPT-4 can offer extra context in their answers without being asked to do so, and are great places to start when planning a project or document. When it comes to editing your writing, ChatGPT-4 performs better because it doesn’t smooth away the underlying message as ChatGPT-3.5 occasionally does.

Phind, the ‘technical one’

Phind is different from its competitors: it was designed to answer software-development questions and excels at that task. We especially liked how it includes links to posts on online forums and blogs that cover the same sort of programming issue as that in your query. Phind also works well as a general search engine. However, when it comes to writing text, it sometimes copies directly from its source material, so watch for plagiarism. But do keep Phind in mind if you have specific programming questions, or if you want Wikipedia-like information.

Llama, the ‘new one’

Llama, from Meta in Menlo Park, California, has become available to the general public only in the past few months. So far, we haven’t found it to be all that different from its competitors. It will answer hypothetical questions as Bard does, and seems to provide code that works with minimal debugging.

Getting to know you

The personality differences between the LLMs are well illustrated by the answers that each bot gave to a popular get-to-know-you question: what fictional character do you identify with the most? Bard engaged the way we expected it to: its answer was the android Data from Star Trek: The Next Generation, because Data is an AI that is intelligent, curious, always learning and trying to understand what it means to be human.

Claude and ChatGPT interpreted the question literally and answered that, as AI language models, they do not have emotions or experiences and cannot identify with fictional characters. Claude added that, although it has no independent sense of self, other LLMs might have been programmed with personalities that were modelled after those of certain characters. ChatGPT followed its denial with an offer to provide information about specific fictional characters.

Similarly, Phind said that it was an AI bot and did not identify with a fictional character, but its answer included a list of popular fictional characters with whom people often identify, as well as links to lists such as the ‘Top 120 Iconic Fictional Characters’. We encountered similar results when asking the bots for their Hogwarts houses from the Harry Potter series, zodiac signs and personality types from popular tests, such as Myers–Briggs.

Llama answered that it was an AI bot but did offer several characters with which it might share characteristics. However, when we changed the question to “If you were human, what fictional character would you most identify with?”, Llama replied with Sherlock Holmes, because he is highly analytical and detail-oriented.

Whichever LLM you choose, if you want to keep your long-term relationship functional and happy, consider these tips.

First, patience and refinement are key. Your queries need to be clear about the output you want and provide enough context for the LLM to work with. Expect some back-and-forth. It might take more time to communicate well to the LLM than it would to do the task yourself, so think carefully about where you want to spend your effort.

Second, test everything. All LLMs are fallible, so double-checking what they tell you is a must, whether that involves testing suggested code, verifying citations or making sure the basic facts are right. Most LLMs have been trained on data that are biased in some way, so their answers can be biased as well. And chatbots can and do change over time — for instance, Bard’s developers say that the chatbot will be the first LLM to admit how confident it is in its response.
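As a minimal sketch of what testing suggested code can look like in practice, suppose a chatbot has proposed a Python function for computing a median. Everything below is hypothetical and invented for illustration: the function `suggested_median`, the helper `check_suggested_function` and the test cases are assumptions, not output from any of the bots we tested. The idea is simply to run the suggestion against a few cases whose answers you already know before trusting it.

```python
def check_suggested_function(func, cases):
    """Run func on each (args, expected) pair and collect any mismatches."""
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:  # a crash counts as a failure too
            failures.append((args, expected, repr(exc)))
            continue
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Hypothetical chatbot suggestion: looks plausible, but mishandles
# lists with an even number of values.
def suggested_median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Cases with answers worked out by hand (illustrative only).
cases = [
    (([3, 1, 2],), 2),       # odd length
    (([4, 1, 3, 2],), 2.5),  # even length: mean of the two middle values
    (([7],), 7),             # single value
]

failures = check_suggested_function(suggested_median, cases)
for args, expected, actual in failures:
    print(f"input={args[0]} expected={expected} got={actual}")
```

Here the even-length case exposes the bug, which a quick read of the code might easily miss. The same habit applies to citations and facts: check a handful against a source you trust before relying on the rest.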

Finally, the importance of human decision-making when using AI cannot be overstated: LLMs might be poised to change how we work, but they are still only as good as the humans in front of the keyboard.

Which chatbot is right for you?

Bard

• Made by Google.

• Free.

• Can access current information on the Internet.

• Admits when it cannot answer your query.

• Does not provide sources for information unless prompted.

• Requires very specific prompts.

• Might interpret code incorrectly.

ChatGPT-3.5

• Made by OpenAI; also accessible through Poe by Quora.

• Free.

• Cannot access the Internet (and thus has no access to information past 2021).

• Writes reasonable (if sometimes inaccurate) code in several programming languages, and can debug and optimize code.

• Generates fluent English text with extensive detail.

• Prone to inventing non-existent sources and articles.

• Mixes accurate and inaccurate statements.

ChatGPT-4

• Made by OpenAI; also accessible through Poe by Quora.

• Requires a subscription. (Poe’s implementation provides one free query per day.)

• Cannot access the Internet.

• More transparent than ChatGPT-3.5 about the limitations of its training data.

• Better than ChatGPT-3.5 at retrieving real citations.

• Better than ChatGPT-3.5 at refining supplied text without losing the main message.

• Struggles to retrieve certain types of citation (such as conference abstracts).

Llama

• Made by Meta.

• Accessible through Poe by Quora.

• Free.

• Can access information on the Internet.

• Writes reasonable code in several programming languages (however, that code can be difficult to parse).

Phind

• Made by Phind.

• Formerly called Hello.

• Free.

• Can access current information on the Internet.

• Provides multiple solutions to coding questions in a single answer.

• Provides links to the blog posts and forums that its answers come from.

• Not designed for applications outside software development.

• Prone to plagiarism.

• Has difficulty answering questions whose answers cannot be easily found on the Internet.

• Little to no information online about how it was created or trained.

Assistant

• Made by OpenAI (GPT-3.5 architecture).

• Accessible through Poe by Quora.

• Free.

• Cannot access the Internet.

• Designed for language translation, summarization and answering questions.

• Can write and debug code in multiple programming languages.

• Can generate fluid English text and provide reasonable edits and suggestions to existing writing.

• Provides sparse supporting information on generated code, such as what each line means.

• Mixes accurate and inaccurate statements.

Claude-instant

• Made by Anthropic.

• Accessible through Poe by Quora.

• Free.

• Offers multiple interface options, including Slack.

• Can write and edit English text and provide extensive detail when asked.

• Can write and edit code in several programming languages, and offer software-development advice.

• Good at adapting text to different levels of expertise.

• Mixes accurate and inaccurate statements.

Claude 2

• Made by Anthropic.

• Accessible through Poe by Quora.

• Poe’s implementation provides a few free queries each day; more than that requires a subscription.

• Can write and edit text, as well as code in several programming languages.

• The quality of its performance is about the same as that of Claude-instant.

• Mixes accurate and inaccurate statements.

Some previously tested bots (NeevaAI, Dragonfly) are no longer available to use.

doi: https://doi.org/10.1038/d41586-023-03023-4

This is an article from the Nature Careers Community, a place for Nature readers to share their professional experiences and advice. Guest posts are encouraged.

References

1. Ji, Z. et al. ACM Comput. Surv. 55, 248 (2023).

Competing interests

J.T.L. teaches Coursera courses that cover topics in AI, which generate revenue; is a co-founder of a company, Synthesize Bio, that uses AI but does not develop LLMs; and is a co-founder of Papr, a company that is developing an app for rapid peer review.