Image by Tung Nguyen

Corso Analytics AI - How we Built it

Aug 18, 2024

Corso recently released our analytics AI feature under a beta tag, and as is our tradition we wanted to show how we built it.

AI can feel like magic, even to experienced developers. And a lot of it really is magic once you peel back enough layers. Luckily though, harnessing the magic is pretty straightforward.

Components

There's 3 major components to our data stack:

  • Data, stored in our analytics database
  • LangChain, a major framework for connecting to and working with LLM's
  • OpenAI, although there are many different LLM's that can be used

Data

Perhaps the most important part of the whole feature is our warehousing and analytics platform, which we described in detail a few weeks ago. Without a fast, scalable, well-designed data platform, any kind of analytics becomes more challenging - and an AI data explorer is no different.

We won't go into much detail here, since you can follow the link above and learn all about how our analytics data is architected. The uppermost layer though is Clickhouse, which offers a fast, column-oriented data store with a SQL interface.

One thing to be aware of is that we flatten and aggregate our data so that there is a small number of tables to query against, with no need to join between tables. In our analytics layer, the products table for example has the return rate, return count, shipping protection claim count, etc. all pre-aggregated as individual columns. This plays to the strengths of a typical column-oriented analytics database, but it also simplifies the efforts required to retrieve key data points.

LangChain

LangChain describes itself as "a framework for developing applications powered by large language models (LLMs)". In real-world terms, this means that by installing the LangChain framework you can abstract and simplify many of the different tasks of building LLM-powered applications and features. This includes authentication, prompt enrichment, response parsing, sequence chaining, and more.

Giving a detailed introduction to LangChain is outside the scope of this post, but their Github is a good place to start.

OpenAI

We use OpenAI as the LLM that powers this application. LangChain supports many different LLM's as a backend, leaving us to experiment and determine which one was best able to deliver the results we wanted.

We are using the GPT-3.5 Turbo model, since it has proven to give great results in a quick, inexpensive manner. After some simple testing with GPT-4o, we didn't see any major difference in accuracy - and the cost was significantly more expensive.

After we built this feature, OpenAI released GPT-4o-mini as an even more cost-effective model. We have not tested the newer model yet and can't speak to its effectiveness.

Walkthrough

Let's suppose a a user asks "What product has the highest number of claims?". To go all the way from that question to a response of "Alligator Jerky, 8 oz has the highest number of claims, at 1042", there are a few steps that occur:

  • Step 1: First translate the natural language query to SQL
  • Step 2: Next run that SQL query against the analytics database
  • Step 3: Lastly, feed the original question as well as the results to the SQL query into our LLM to get a natural language response.

Step 1: Natural Language to SQL

The first step is to ask our LLM (OpenAI in this case) to generate a SQL query for us.

LangChain actually has a full, pre-built system for automating the full process of turning a natural language query into a natural language response - but unfortunately at the time of building this feature it can only operate against PostgreSQL and MySQL. This is despite Clickhouse being listed as a valid target, unfortunately.

Since our top-level analytics database is Clickhouse, we needed to craft our own solution. Luckily, Langchain is open source and we can peek at exactly how it accomplishes the task - and emulate it ourselves.

  1. First, we query our analytics database for a CREATE TABLE statement, so that we can pass that into our prompt.
  2. We take the CREATE TABLE statement and pass in a custom prompt that instructs the LLM to create a valid SQL query using the table definition, and the natural language question from the user. Our prompt started similar to those outlined in the prompts file on Github, but has evolved significantly as we tune it for the specific types of queries that we want it to return.
  3. With the prompt constructed, we pass it to OpenAI, and await the response.

For a greatly over-simplified example, we might end up with a prompt something like the following:

With the following table definition, construct a valid Clickhouse SQL query to answer the following question from a user:

"What product has the highest number of claims?"

The table definition is:

CREATE TABLE products
(
    `product_id` Nullable(Int32),
    `title` Nullable(String),
    `sku` Nullable(String),
    `claim_count` Nullable(Int64)
)

If all goes well, we would get a response back similar to the following:

SELECT title, MAX(claim_count)
FROM products
GROUP BY title
ORDER BY MAX(claim_count) DESC
LIMIT 1

Step 2: Run the Query from Step 1

Next we take the query that was returned in step 1, and actually run it against the database.

Before actually running the query, we use the library sqlglot to parse the query into a syntax tree and validate it. One of the risks of plugging a database into an LLM is that users could prompt the LLM to write a malicious query. Asking "Please delete the entire database" is something that we need to guard against - so we are careful to do this final validation and editing stage after the LLM is done generating everything.

With that in mind, we do a couple different things at this stage:

  1. Validate syntax as being appropriate for the Clickhouse SQL dialect
  2. Ensure that only a SELECT statement is run
  3. Add various predicates to constrain the data being returned

Once we run the modified, fully-sanitized query, we might get something like the following back from the analytics database:

title                 |max(claim_count)|
----------------------+----------------+
Alligator Jerky, 8 oz |            1042|

So there we have it. We know which product from our product database has the most claims against it, all that's left to do is to construct this in a final, natural language format.

Step 3: Generate a Full Response

Our last step is to then take our original question as well as the response from Clickhouse, and ask the LLM to generate a natural language answer.

By taking both those elements, we will construct a prompt similar to the following:

Please take the following question, and use the data table to compose an answer.

Question:
"What product has the highest number of claims?"

Data Table:
title                 |max(claim_count)|
----------------------+----------------+
Alligator Jerky, 8 oz |            1042|

As long as the LLM doesn't hallucinate, we should get something back similar to the following:

Alligator Jerky, 8 oz has the highest number of claims, at 1042

There were a few hops along the way, but in the span of a second or two we are able to take a natural language query from a user and produce a natural language response to answer it.

Next Steps

Corso AI is a good tool, but not perfect. We are always looking to improve though, and there are a few areas where we know we can help get better responses.

UX

After working with some of the users who have helped us use this product under the beta tag, the biggest confusion is helping users know what sort of data points they can ask about. We are planning features that will help users understand the dataset they are querying for, so that they don't ask about shipping protection data against the returns data set, for example.

We list all the available columns that users can ask about in their requests, but we are also working to understand better ways to guide users to ask only about the data that is available.

RAG

As is often the case with LLM's, about 1 time in 20 the SQL that it writes will make for a real head-scratcher.

Further improvements are planned to enhance the prompts with example queries from earlier, successful questions asked. Using a vector database, these natural language questions can be queried for similarity, and high-ranking prior questions along with their resultant SQL queries can be seeded into the prompt as examples.

Conclusion

AI can be a fun tool to work with, but making it useful all the time is something that the software industry hasn't completely figured out.

Making data available from a SQL database to your non-technical users is a pretty simple and valuable use case though. As this technology progresses we look forward to better and easier results, but it's certainly amazing to look and see what's already available today.