The Dataset: I grabbed 6 months of real e-commerce data from our warehouse.

The Test: Five simple questions that every analytics dashboard asks:

How much money did we make each day?
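To make the daily revenue question concrete, here is a minimal sketch of the aggregation it implies, written in Python against an in-memory SQLite database. The `orders` table, its `ordered_at` and `amount_cents` columns, and the three sample rows are all hypothetical stand-ins, not the warehouse schema or data described above.

```python
import sqlite3

# Hypothetical schema and rows for illustration only; the real warehouse
# tables and column names are not given in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, ordered_at TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (ordered_at, amount_cents) VALUES (?, ?)",
    [
        ("2024-01-01 09:15:00", 1999),
        ("2024-01-01 17:40:00", 500),
        ("2024-01-02 11:05:00", 1250),
    ],
)

# Group orders by calendar day and sum the amounts.
# Integer cents avoid floating-point rounding in the totals.
DAILY_REVENUE_SQL = """
SELECT date(ordered_at) AS day, SUM(amount_cents) AS revenue_cents
FROM orders
GROUP BY date(ordered_at)
ORDER BY day
"""


def daily_revenue(connection):
    """Return a list of (day, revenue_cents) tuples, one per calendar day."""
    return connection.execute(DAILY_REVENUE_SQL).fetchall()


rows = daily_revenue(conn)
for day, cents in rows:
    print(day, f"{cents / 100:.2f}")
```

A hand-written query like this could serve as the reference result when checking an answer produced some other way, since the expected totals are known in advance.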
Use the vitals package with ellmer to evaluate and compare the accuracy of LLMs, including writing evals to test local models ...