Evidently AI : p/evidently-ai | Product Hunt

Sign in

p/evidently-ai Collaborative AI observability platform

Start new thread

Evidently AI - Open-source evaluations and observability for LLM apps

by

Cargo

•

7mo ago

Evidently is an open-source framework to evaluate, test and monitor AI-powered apps.

📚 100+ built-in checks, from classification to RAG.
🚦 Both offline evals and live monitoring.
🛠 Easily add custom metrics and LLM judges.

Replies

Best

Elena Samuylova

Evidently AI

Maker

📌

Hi Makers! I'm Elena, a co-founder of Evidently AI. I'm excited to share that our open-source Evidently library is stepping into the world of LLMs! 🚀 Three years ago, we started with testing and monitoring for what's now called "traditional" ML. Think classification, regression, ranking, and recommendation systems. With over 20 million downloads, we're now bringing our toolset to help evaluate and test LLM-powered products. As you build an LLM-powered app or feature, figuring out if it's "good enough" can be tricky. Evaluating generative AI is different from traditional software and predictive ML. It lacks clear criteria and labeled answers, making quality more subjective and harder to measure. But there is no way around it: to deploy an AI app to production, you need a way to evaluate it. For instance, you might ask: - How does the quality compare if I switch from GPT to Claude? - What will change if I tweak a prompt? Do my previous good answers hold? - Where is it failing? - What real-world quality are users experiencing? It's not just about metrics—it's about the whole quality workflow. You need to define what "good" means for your app, set up offline tests, and monitor live quality. With Evidently, we provide the complete open-source infrastructure to build and manage these evaluation workflows. Here's what you can do: 📚 Pick from a library of metrics or configure custom LLM judges 📊 Get interactive summary reports or export raw evaluation scores 🚦 Run test suites for regression testing 📈 Deploy a self-hosted monitoring dashboard ⚙️ Integrate it with any adjacent tools and frameworks It's open-source under an Apache 2.0 license. We build it together with the community: I would love to learn how you address this problem and any feedback and feature requests. Check it out on GitHub: https://github.com/evidentlyai/e..., get started in the docs: http://docs.evidentlyai.com or join our Discord to chat: https://discord.gg/xZjKRaNp8b.

7mo ago

@elenasamuylova Congrats on bringing your idea to life! Wishing you a smooth and prosperous journey. How can we best support you on this journey?

7mo ago

Elena Samuylova

Evidently AI

Maker

@kjosephabraham Thanks for the support! We always appreciate any feedback and help in spreading the word. As an open-source tool, it is built together with the community! 🚀

7mo ago

Evidently AI

Maker

Hi everyone! I am Emeli, one of the co-founders of Evidently AI. I'm thrilled to share what we've been working on lately with our open-source Python library. I want to highlight a specific new feature of this launch: LLM judge templates. LLM as a judge is a popular evaluation method where you use an external LLM to review and score the outputs of LLMs. However, one thing we learned is that no LLM app is alike. Your quality criteria are unique to your use case. Even something seemingly generic like "sentiment" will mean something different each time. While we do have templates (it's always great to have a place to start), our primary goal is to make it easy to create custom LLM-powered evaluations. Here is how it works: 🏆 Define your grading criteria in plain English. Specify what matters to you, whether it's conciseness, clarity, relevance, or creativity. 💬 Pick a template. Pass your criteria to an Evidently template, and we'll generate a complete evaluation prompt for you, including formatting it as JSON and asking the LLM to explain its scores. ▶️ Run evals. Apply these evaluations to your datasets or recent traces from your app. 📊 Get results. Once you set a metric, you can use it across the Evidently framework. You can generate visual reports, run conditional test suites, and track metrics in time on a dashboard. You can track any metric you like - from hallucinations to how well your chatbot follows the brand guidelines. We plan to expand on this feature, making it easier to add examples to your prompt and adding more templates, such as pairwise comparisons. Let us know what you think! To check it out, visit our GitHub: https://github.com/evidentlyai/e..., docs http://docs.evidentlyai.com or Discord to chat: https://discord.gg/xZjKRaNp8b.

7mo ago

Evidently AI

Maker

@hamza_afzal_butt Thank you so much!

7mo ago

Congratulations on the launch, Evidently team! I've always admired Evidently for its comprehensiveness and all-encompassing approach framework. I often work with teams who are unsure about what metrics to focus on or how to begin their evaluation process. For those new or unsure where to start: * What best practices would you recommend? * Is there a feature that helps beginners 'set things on autopilot' while they're learning the ropes? * Do you offer any guided workflows or templates for common use cases that could help newcomers get started quickly? Thanks for your continued innovations in this space!

7mo ago

Elena Samuylova

Evidently AI

Maker

@rorcde @rorcde Thanks for the support! :🙏🏻 Quickstart: We have a simple example here: https://docs.evidentlyai.com/get.... It will literally take a couple of minutes! We packaged some popular evaluations as presets and general metrics (like detecting Denials). However, we generally encourage using your custom criteria—no LLM app is exactly alike, and the beauty of using LLM as a judge is that you can use your own definitions. We made it super easy to define your custom prompt just by writing your criteria in plain English. Best practices: That's a huuuge question. Let me try to summarize a few of them: - Don't skip the evals! Implementing evals can sound complex, so it's tempting to "ship on vibes". But it’s much easier to start with a simple evaluation pipeline that you iterate on than to try adding evals to your process later on. So, start simple. - Make curating an evaluation dataset a part of your process. When it comes to offline evals, the metrics are as important as the data you run them on. Preparing a set of representative, realistic inputs (and, ideally, approved outputs) is a high-value activity that should be part of the process. - Log everything. On that note, don’t miss out on capturing real traces of user conversations. You can then use them for testing, to replay new prompts against them, etc. - Start with regression testing. This is low-hanging fruit in evals: every time you change a prompt, re-generate new outputs for a set of representative inputs and see what changed (or have peace of mind that nothing did). This is hugely important for the speed of iteration. - If you use LLM as a judge, start with binary criteria and measure the quality of your judge. It’s also easier to test alignment this way.

7mo ago

Flag Match

Hey Elena and Emeli, How does Evidently AI handle potential biases in the LLM judges themselves? Do you have any plans to incorporate human feedback loops into the evaluation process? Congrats on the launch!

7mo ago

Evidently AI

Maker

Hey @kyrylosilin , Thank you for bringing this up! We design judge templates based on a binary classification model, where we thoroughly define the classification criteria and strictly structure the response format. Users also have the option to choose how to handle cases with uncertainty, whether it’s by refusing to respond, detecting an issue, or deciding not to detect an issue. This approach is already implemented and helps achieve more consistent outcomes. In the next update, we plan to add the ability to further iterate on the judges using examples of classifications they have made in previous iterations. This will help address potential biases. Users will be able to select (and even fix wrong labels if needed) complex cases and explicitly pass these examples to the judge, which will, over several iterations, improve accuracy and consistency for specific cases.

7mo ago

Dima Demirkylych

I have to give major respect to the team at Evidently AI for their outstanding open-source product. The introduction of evaluation for LLM apps is a game-changer. It's incredibly easy to integrate into my product, and the monitoring capabilities are top-notch. What I love most is that they provide ready-to-use tests, which easily customizable. Kudos to Evidently AI for making such a valuable tool available to the community!

7mo ago

Elena Samuylova

Evidently AI

Maker

@dima_dem Thanks for the support @dima_dem! 🙏🏻

7mo ago

Incredible launch! Can't wait to try this out! And congrats on the launch @elenasamuylova 🚀

7mo ago

Elena Samuylova

Evidently AI

Maker

@andinii Thank you! Let us know how it works out for you 🎉

7mo ago

suco

Comment Deleted

7mo ago

Elena Samuylova

Evidently AI

Maker

@zulkarnaim Thanks for the support! 🚀

7mo ago

Giuseppe Della Corte

panda{·}etl (YC W24)

@elenasamuylova and @emeli_dral great that you are building an open-source tool in the LLM evaluation space. Congrats!

7mo ago

Elena Samuylova

Evidently AI

Maker

@gdc Thank you! Let us know if you have the chance to try it. We appreciate all feedback!

7mo ago

Anna Veronika Dorogush

Recraft

Congrats with the launch! That's a huge milestone!

7mo ago

Elena Samuylova

Evidently AI

Maker

@annaveronika Thanks for your support! ❤️

7mo ago

Fantastic launch! We've been searching for an effective solution like this for quite some time. How do you tailor your solution to meet the varying needs of your clients?

7mo ago

Elena Samuylova

Evidently AI

Maker

@datadriven Thanks for the support, Dina! On the infrastructure side, our open-source tooling has a Lego-like design: it's easy to use specific Evidently modules and fit them into existing architecture without having to bring all the parts—use what you need. On the "contents" side, we created an extensive library of evaluation presets and metrics to choose from, as well as templates to easily add custom metrics so that users can tailor the quality criteria to their needs. So, some may use Evidently to evaluate, for example, the "conciseness" and "helpfulness" of their chatbots, while others can evaluate the quality and diversity of product recommendations—all in one tool! I hope we managed to put all the right ingredients together to allow all users to start using Evidently regardless of the specific LLM/AI use case. I'm looking forward to more community feedback to improve it further!

7mo ago

@elenasamuylova That sounds great! I hope you’ll consider hosting workshops or a hackathon to demonstrate how it works. By the way, have you come across any interesting examples of LLM judges?

7mo ago

Elena Samuylova

Evidently AI

Maker

@datadriven Maybe even a course! :) The most interesting examples of LLM judges I've seen are something very custom to the use case. Typically, users who are working on an application want to catch or avoid specific failure modes. Once you identify them, you can create an evaluator to catch them. For example: - Comments from an AI assistant that are positive but not actionable. ("Actionable" eval). - Conversations on a sensitive topic where a chatbot does not show empathy. ("Empathy" eval). etc. It does take some work to figure it out right, but that is hugely impactful.

7mo ago

Startup Death Clock

This sounds promising, Elena! How does Evidently handle the subjective nature of evaluating LLMs? Are there specific criteria you recommend using?

7mo ago

Elena Samuylova

Evidently AI

Maker

@elke_qin Great question! We tried making it easy for users to add custom criteria that will automatically convert into "LLM judges." The user only needs to add the criteria in plain English, and we automatically create a complete evaluation prompt (that will be formatted as JSON, ask LLM to provide the reasoning before outputting the score, specify how to handle uncertainty, etc.). I agree that LLM output quality is highly custom, so instead of simply providing hard-coded "universal" judge prompts, we believe it's better to help users create and iterate on their judges. We generally recommend using binary criteria, as they make it easier to test alignment and interpret the results (compared to sparse scales). We also have a workflow for evaluating the quality of judge classification against your own labels to measure alignment. If you have a reference output (for example, when you do regression testing to compare outputs with a new prompt against the old ones), there are also different approaches to compare old answers against new ones. From semantic similarity to another comparison judge, you can tune to detect specific changes that are important to you. We do not have a labeling interface inside the tool itself, but we are thinking of adding one. We also have tracing functionality that allows us to collect user feedback if it's available in the app (think upvotes/downvotes).

7mo ago

Elena, this is an impressive step for Evidently! Expanding into LLM evaluation is so needed in today's landscape. With the variety of built-in checks and the flexibility to add custom metrics, it really simplifies a complex area. The challenges you outlined are so relatable—having a structured approach will be a game-changer for many developers. Excited to see how the community will contribute to this project! Keep pushing those boundaries! 🎉

7mo ago

Elena Samuylova

Evidently AI

Maker

@devindsgbyq Thank you! Let us know if you have the chance to check it out!

7mo ago

Arseny Kravchenko

Wanna Kicks

Wish you luck! AI reliability is still a massive problem

7mo ago

Elena Samuylova

Evidently AI

Maker

@arseny_info Thanks for the support! Hopefully we can contribute to solving in. Looking forward to the feedback from the community 🚀

7mo ago

Mikhail Rozhkov

DVC

Congrats with the launch! Great milestone @elenasamuylova and @emeli_dral! Evidenlty is part of my MLOps stack and I recommend it to my friends and clients! I'm happy to contribute to Evidently and look forward to collaboration!

7mo ago

Elena Samuylova

Evidently AI

Maker

@emeli_dral @mikhail_rozhkov Thanks for your support! I hope Evidently will fit in the updated LLMOps stack as well 🚀🚀

7mo ago

+1 amazing team +1 amazing product in addition: friendly open-source support (easy to add suggestions and see it in the next release)

7mo ago

Elena Samuylova

Evidently AI

Maker

@aadral Thanks for your support! Looking forward to LLM-related feature requests 🚀

7mo ago

Congrats on the launch, @elenasamuylova! 🎉 It's amazing to see Evidently evolving into the realm of LLMs with such robust features. The focus on a quality workflow is crucial for us as we develop AI-powered applications. I love the idea of easily integrating custom metrics and having that interactive summary for evaluations. Looking forward to exploring the new capabilities and contributing to the community! Keep up the great work!

7mo ago

Elena Samuylova

Evidently AI

Maker

@zanereed596 Thank you! 🚀

7mo ago

Vasili Shynkarenka

Heyday Health

congrats with the launch! love the video :)

7mo ago

Elena Samuylova

Evidently AI

Maker

@flreln Open-source production! :)

7mo ago

ou explain how LLM judge templates empower users to define custom evaluation criteria and create tailored prompts.

7mo ago

Elena Samuylova

Evidently AI

Maker

@bunga_trisnulia We make it easy for the user to focus only on the evaluation's contents (for example, write that "I want to label responses as concise or verbose") without thinking about how to write the rest of the evaluation prompt. We automatically add all the other parts, such as formatting prompts as JSON to get structured output, asking LLM to provide the reasoning before outputting the label, etc. Basically, we help the users to define only what's strictly necessary but do all the boilerplate on the background.

7mo ago

Eugene Ter-Avakyan

Congrats on the launch, Elena and Emeli! It's nice to see it released in open source!

7mo ago

Elena Samuylova

Evidently AI

Maker

@eugene_ter_avakyan Thank you! We are looking forward to the community input 🚀

7mo ago

Congrats on this launch, Elena! The transition from traditional ML to LLMs is a game changer. The ability to customize metrics and have a monitoring dashboard will definitely help many makers in evaluating their AI apps. Can't wait to see how the community uses Evidently! 🚀

7mo ago

Elena Samuylova

Evidently AI

Maker

@dance17219 Thank you! 🚀

7mo ago