Megaparse is a file parser optimized for LLM ingestion. It can parse PDF, DOCX, and PPTX files into a format that is ideal for LLMs. All of that is accessible from a Python package, an API, or a queue.
Hi everyone,
Today I’d like to introduce you to the new Quivr project. It is a simple Python package and API that helps you take in documents such as PDF, DOCX, PPTX, ... and turn them into Markdown.
It has several new abilities:
* OCR
* Vision Models
* Table Optimization in the extraction
* Open-source
You can use it in any of your products where you need to parse files and then send them to an LLM or simply store them.
Here is how to get started:
* Go to https://github.com/QuivrHQ/MegaP...
* pip install megaparse
* Have fun
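To give a feel for where this fits in a product, here is a minimal sketch of a batch conversion step that turns supported files into Markdown and stores the result. It deliberately does not call the megaparse API: `convert` is a hypothetical stand-in for whatever parser call you use, and `convert_folder` and the `SUPPORTED` extension set are names made up for this illustration.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Formats named in this post; the exact extension set is an assumption.
SUPPORTED = {".pdf", ".docx", ".pptx"}


def convert_folder(
    paths: Iterable[Path],
    convert: Callable[[Path], str],  # hypothetical: returns Markdown for one file
    out_dir: Path,
) -> List[Path]:
    """Convert each supported file to Markdown and write it into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in paths:
        # Skip anything that is not one of the supported document types.
        if path.suffix.lower() not in SUPPORTED:
            continue
        markdown = convert(path)
        target = out_dir / (path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

The stored `.md` files can then be sent to an LLM or indexed later. Passing the parser in as a function keeps the sketch independent of any particular parser version.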
Give it a try! We’d love to hear your feedback and ideas in the comments.
This is part of Supabase mega Launch Week -> https://launchweek.dev/HOME
Congrats on the launch @stan_girard @amine_dirhoussi @chloe_daems
Super helpful. We are working on a product that needs something similar though we have already solved the PDF parsing problem.
Quick question - do you plan to add Excel / Spreadsheet as well? This would be super helpful.
Excited to give it a try!
Love it. Markdown is becoming the de facto standard for AI input processing, and proper conversion to it (without having to install a million packages) will be paramount.
@michaelohana This is a hard piece to tackle, and we are currently working hard on improving tables. We are exploring a few techniques: for example, combining LLM vision models with the current OCR, or passing the table to a dataframe. Would love to tell you more or help you with your use case. Ping me if needed on Twitter @_StanGirard
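As a rough illustration of the dataframe idea (a sketch only, not how Megaparse does it), here is one way to turn a simple Markdown table into row dictionaries using just the standard library. The function name `parse_markdown_table` is made up for this example, and it assumes a well-formed pipe table with a header row, a separator row, and no escaped pipes inside cells.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]

    def cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    rows = []
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    for line in lines[2:]:
        rows.append(dict(zip(header, cells(line))))
    return rows


table = """
| Item | Qty |
| --- | --- |
| Apples | 3 |
| Pears | 5 |
"""
rows = parse_markdown_table(table)
```

A list of dictionaries like `rows` is a shape that `pandas.DataFrame(rows)` accepts directly, if pandas is available in your environment.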
Megaparse is a really interesting tool for LLM data ingestion! 🔥 How does it handle parsing complex document structures, like multi-column layouts or mixed content (text, images, tables)? Does the OCR integration maintain accuracy across different fonts and handwriting? Also, how does the API handle large-scale batch processing—are there any optimizations for speed and efficiency with extensive datasets?
Megaparse sounds super useful for prepping docs for LLMs! Love the flexibility with Python, API, or queue. Does it handle complex layouts or metadata well?
Awesome tool with Megaparse! 📄✨ The ability to seamlessly parse PDFs, DOCX, and PPTX for LLM ingestion is a game-changer for data extraction. I'm curious—how does Megaparse handle complex document layouts or non-standard formats? For example, if a document has lots of embedded images or custom fonts, does it still maintain accuracy in parsing? Also, what kind of customization options do you offer for different document types or use cases?
Wow, this looks super handy for integrating document parsing into LLM workflows! 🚀 Love that it's open-source and includes OCR + table optimization—makes it a no-brainer for anyone working with complex document data. Can't wait to test it out! 🔥