
BenchLLM by V7 — Test-driven development for LLMs

Simplify the testing process for LLMs, chatbots, and other apps powered by AI. BenchLLM is a free open-source tool that allows you to test hundreds of prompts and responses on the fly. Automate evaluations and benchmark models to build better and safer AI.


Replies
Alberto Rizzoli
Hello Product Hunt! We built BenchLLM to offer a more versatile open-source benchmarking tool for AI applications. It lets you measure the accuracy of your models, agents, or chains by validating responses against any number of tests via LLMs. BenchLLM is actively used at V7 to improve our LLM applications and is now open-sourced under the MIT License to share with the wider community.

You can use it to:
- Test the responses of your LLM across any number of prompts.
- Implement continuous integration for chains like LangChain, agents like AutoGPT, or LLM models like Llama or GPT-4.
- Eliminate flaky chains and build confidence in your code.
- Spot inaccurate responses and hallucinations in your application at every version.

Key features:
- Automated tests and evaluations on any number of prompts and predictions via LLMs.
- Multiple evaluation methods: semantic similarity checks, string matching, and manual review.
- Caching of LLM responses to accelerate testing and evaluation.
- A comprehensive API and CLI for executing test suites and enabling faster development iterations.

Here's a preview of a common use case in LLM testing and how popular models compare: https://www.loom.com/share/173c1...

Visit our GitHub repo for examples, templates, and docs, or join our Discord to share feedback or contribute to the project. A quick sketch of what a test looks like is below.
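Roughly, you write a small Python file that wraps your model and point BenchLLM at a suite of YAML tests. This is a minimal sketch, not the definitive API: `run_my_model` is a placeholder for your own chain, agent, or model call, and the repo docs have the exact, current syntax.

```python
# test_my_model.py -- minimal BenchLLM test sketch
import benchllm

def run_my_model(question: str) -> str:
    # Placeholder: call your own model, chain, or agent here,
    # e.g. an OpenAI completion or a LangChain chain.
    return "2"

@benchllm.test(suite=".")  # discovers the YAML test files in this directory
def run(input: str) -> str:
    return run_my_model(input)
```

Running `bench run` then executes the suite; by default an LLM-based semantic evaluator judges whether each prediction matches one of the expected answers, and you can switch to string matching or interactive (manual) review instead.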
Jacek Fleszar
very useful 👈
Cyril Gupta
Great job with the launch. Congrats!
Veronika
super cool!
Heather Stritch
Super impressive. Kudos! 👏
Vincent Lonij
This looks really interesting. ?makers How would you recommend dealing with false positives? For example, even with semantic similarity, I imagine some correct answers from an LLM still get flagged as incorrect?
Simon Edwardsson
@vincentropy Good question. We normally add a few more entries to the `expected` field covering different correct answers to the same problem. That makes the tests more robust when semantic similarity alone isn't enough; see the example below.
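For instance, a test file with several accepted answers looks roughly like this (hypothetical file name and values, following the YAML suite format from our docs):

```yaml
# capital.yml -- hypothetical test with multiple accepted answers
input: "What is the capital of France?"
expected:
  - "Paris"
  - "The capital of France is Paris."
```

The test passes if the prediction matches any entry in `expected`, so adding paraphrases cuts down on spurious failures without weakening the check.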
Martha
Congrats Alberto & team on your launch!
André J
Love that the LLM tooling eco-system is growing! Good luck today 🚀
Aden Will
Looks interesting
Anthony Adams
This is very interesting. Best wishes for your launch
Greg Z
Just curious, what are the benefits over LangSmith? And good luck with the launch!
Carter Wang
Impressive! Check out my site wikigpt3.com and email me your app details, and I can help get your app listed on my directory and 100+ other AI directories. Feel free to reply if you want to know more.
Ahmad Ali
Would love to give it a try! Congratulations on the launch
Vlad Golub
Great tool for streamlining testing of LLMs and AI-powered apps!