
AI-Driven Documentation Search with GPT, Weaviate, and Laravel

As a developer, I often find myself digging through documentation to solve problems or learn about new tools. Sometimes the right information is hard to find, or the built-in search is too limited. That’s when I had an idea: What if we could create a natural language search for our agency’s documentation using cutting-edge AI technology?

In this article, I’ll walk you through my journey of using OpenAI’s GPT models, their embeddings, and a vector database called Weaviate to enhance our documentation search capabilities.

#Integrate OpenAI’s API with Laravel

We’ll assume you already have a Laravel application set up, or you’re familiar with setting up a new Laravel project. In this section, we’ll focus on integrating the OpenAI API using the openai-php/laravel Composer package.

#Step 1: Install the openai-php/laravel Package

To install the openai-php/laravel package, use the Composer command below:

```shell
composer require openai-php/laravel
php artisan vendor:publish --provider="OpenAI\Laravel\ServiceProvider"
```

#Step 2: Configure Environment Variables

Navigate to the root of your Laravel project and locate the .env file. This file contains environment-specific settings. We’ll need to add our OpenAI API key to this file. You can obtain an API key by signing up for an OpenAI account.

Add the following line to your .env file:

```ini
OPENAI_API_KEY=sk-...
```

You are now ready to use the OpenAI facade in your Laravel application.

```php
use OpenAI\Laravel\Facades\OpenAI;

$result = OpenAI::completions()->create([
    'model' => 'text-davinci-003',
    'prompt' => 'I want to ',
]);

echo $result['choices'][0]['text'];
```

#Retrieving Documentation Data from Confluence API

In this section, we will go through the process of fetching data from the Confluence API, which stores your documentation. We will be using this data for our natural language search.

#Step 1: Set Up Confluence API Credentials

To interact with the Confluence API, you will need an API token and your Confluence URL. You can create an API token by following the instructions in the official documentation.

Once you have your API token, add the following lines to your .env file:

```ini
CONFLUENCE_API_USER=your_api_user_here
CONFLUENCE_API_KEY=your_api_key_here
CONFLUENCE_API_URL=https://your_domain.atlassian.net/wiki/rest/api
```

Replace the values with your own, and make sure the user you are using has access to the documentation you want to search.

Now add these values to your config/services.php file:

```php
'confluence' => [
    'api_user' => env('CONFLUENCE_API_USER'),
    'api_key' => env('CONFLUENCE_API_KEY'),
    'api_url' => env('CONFLUENCE_API_URL'),
],
```

#Step 2: Create a ConfluenceService Class

Create a new service class called ConfluenceService in the app/Services directory to handle interactions with the Confluence API.

Below is a simple service class which handles the fetching of all pages from a specific parent page in Confluence.

```php
<?php

namespace App\Services;

use Illuminate\Http\Client\Response;
use Illuminate\Support\Facades\Http;

class ConfluenceService
{
    private string $apiUser;
    private string $apiKey;
    private string $apiUrl;

    public function __construct()
    {
        $this->apiUser = config('services.confluence.api_user');
        $this->apiKey = config('services.confluence.api_key');
        $this->apiUrl = config('services.confluence.api_url');
    }

    public function getPageDescendants(int $pageId): Response
    {
        $entityUrl = sprintf('content/%s/descendant/page', $pageId);
        $data = [
            'expand' => 'body.view',
            'limit' => 1000,
        ];

        $url = sprintf('%s/%s', $this->apiUrl, trim($entityUrl, '/'));

        return Http::withBasicAuth($this->apiUser, $this->apiKey)->get($url, $data);
    }
}
```

You may now use this service class to fetch data from the Confluence API.

```php
$confluenceService = new ConfluenceService();
$response = $confluenceService->getPageDescendants(123456);
```

This response contains all subpages including their contents.
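To get a feel for what comes back, here is a sketch of iterating over the decoded response. The array below is a hypothetical, trimmed-down excerpt — real Confluence responses carry many more metadata fields — but the `results[].body.view.value` path matches the `body.view` expansion we requested:

```php
<?php

// Hypothetical excerpt of a decoded Confluence response (assumption:
// real responses contain more metadata fields than shown here).
$json = [
    'results' => [
        [
            'id' => '123457',
            'title' => 'Onboarding',
            'body' => ['view' => ['value' => '<p>Welcome to the team!</p>']],
        ],
    ],
];

foreach ($json['results'] as $page) {
    // The rendered HTML of each page lives under body.view.value
    echo $page['title'] . ': ' . $page['body']['view']['value'] . PHP_EOL;
}
```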

#Store embeddings of documentation in Weaviate

Weaviate is a vector database that allows you to store and query data in a vector space. If you have no previous experience or knowledge of Weaviate, I recommend first reading my article on how to set up Weaviate with Docker and Traefik.

Weaviate uses a schema to define its data structure. In this example, we will use the following schema:

```json
{
  "class": "Chunk",
  "description": "Some chunk of knowledge",
  "vectorizer": "text2vec-openai",
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  },
  "properties": [
    {
      "name": "identifier",
      "description": "The identifier of the particular chunk of knowledge",
      "dataType": ["string"],
      "moduleConfig": {
        "text2vec-openai": {
          "skip": true
        }
      }
    },
    {
      "name": "content",
      "description": "The contents",
      "dataType": ["text"]
    },
    {
      "name": "source",
      "description": "The source type",
      "dataType": ["string"],
      "moduleConfig": {
        "text2vec-openai": {
          "skip": true
        }
      }
    },
    {
      "name": "sourceLink",
      "description": "URL to the article",
      "dataType": ["string"],
      "moduleConfig": {
        "text2vec-openai": {
          "skip": true
        }
      }
    }
  ]
}
```

We aren’t storing whole page contents, because that would make our prompts too big. Instead, we chunk the content and store each chunk in Weaviate. We also store the source and sourceLink properties so we can link back to the original source.
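To get a feel for the chunk size we are aiming for, here is a rough back-of-the-envelope calculation. The 0.4 tokens-per-character ratio is a crude heuristic for English text, not an exact tokenizer:

```php
<?php

// Rough heuristic: English text averages ~0.4 tokens per character.
// With a budget of 150 tokens per chunk, that allows roughly 375 characters.
$tokensPerCharacter = 0.4;
$tokenLimit = 150;
$chunkCharacterLimit = $tokenLimit / $tokensPerCharacter;

echo $chunkCharacterLimit . PHP_EOL; // 375

// A 6,000-character page would therefore yield around 16 chunks.
$pageLength = 6000;
echo (int) ceil($pageLength / $chunkCharacterLimit) . PHP_EOL; // 16
```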

#Step 1: Accessing Weaviate from PHP

To access Weaviate from PHP, we will use the weaviate-php package. Install the package using the Composer command below:

```shell
composer require timkley/weaviate-php
```

You can now use the Weaviate client in your PHP code.

```php
<?php

use Weaviate\Weaviate;

$weaviate = new Weaviate('http://localhost:8080', 'your-token');
```

#Step 2: Chunk and store content in Weaviate

Looping over all our pages we’ll do the following things:

  1. Remove all HTML tags from the content
  2. Split the content into manageable chunks
  3. Create a new Weaviate object for each chunk
  4. Store the object in Weaviate

```php
use App\Services\ConfluenceService;
use Illuminate\Support\Collection;
use Illuminate\Support\Str;

$confluenceService = new ConfluenceService();

$response = $confluenceService->getPageDescendants(12345);

if ($response->successful()) {
    $pages = $response->json()['results'];

    foreach ($pages as $page) {
        $content = cleanUpContent($page['body']['view']['value']);
        $chunks = chunkContent($content);

        $identifier = 'confluence-' . $page['id'];
        // _links.webui is relative to your Confluence base URL
        $sourceLink = $page['_links']['webui'] ?? '';

        // Delete all old chunks before creating new ones
        $weaviate->batch()->delete('Chunk', [
            'path' => ['identifier'],
            'operator' => 'Equal',
            'valueString' => $identifier,
        ]);

        $batchSize = 100;
        $count = 0;
        $objects = [];
        // Loop over the chunks and create objects matching our Weaviate schema
        foreach ($chunks as $chunk) {
            $objects[] = [
                'class' => 'Chunk',
                'properties' => [
                    'identifier' => $identifier,
                    'content' => $chunk['value'],
                    'source' => 'confluence',
                    'sourceLink' => $sourceLink,
                ],
            ];

            // Flush to Weaviate in batches of $batchSize objects
            if (++$count % $batchSize === 0) {
                $weaviate->batch()->create($objects);
                $objects = [];
            }
        }

        // Flush the remaining objects
        $weaviate->batch()->create($objects);
    }
} else {
    // Handle the error
    echo "Failed to fetch descendant pages: " . $response->status();
}

function cleanUpContent(string $content): string
{
    return Str::of($content)
        ->replace('<', ' <')
        ->stripTags()
        ->replace(["\r", "\n"], ' ')
        ->replaceMatches('/\s+/', ' ')
        ->trim()
        ->toString();
}

function chunkContent(string $content): array
{
    $tokensPerCharacter = 0.4;
    $tokenLimit = 150;
    $chunkCharacterLimit = $tokenLimit / $tokensPerCharacter;

    // Split the input string into an array of sentences
    $sentences = collect(preg_split('/(?<=[.?!])\s?(?=[a-z])/i', $content));

    $chunks = $sentences->chunkWhile(
        function (string $sentence, int $key, Collection $chunk) use ($chunkCharacterLimit) {
            return $chunk->sum(fn (string $sentence) => strlen($sentence)) < $chunkCharacterLimit;
        }
    )->map(function (Collection $chunk) {
        $value = $chunk->implode(' ');
        $checksum = md5($value);

        return [
            'checksum' => $checksum,
            'value' => $value,
        ];
    });

    return $chunks->all();
}
```

#Implementing the Natural Language Search

In this section, we’ll implement the natural language search feature using OpenAI’s GPT models and the Weaviate vector database. Our goal is to allow users to search the documentation using natural language queries, and return the most relevant results. Here’s how we’ll do it:

Assume we have an endpoint that accepts a question parameter. You could implement this in your own application using the code below:

```php
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use OpenAI\Laravel\Facades\OpenAI;
use Weaviate\Weaviate;

class DocSearchController extends Controller
{
    public function search(Request $request)
    {
        $question = $request->input('question');

        if ($question) {
            $chunks = $this->getChunks($question);
            $messages = $this->getMessages($question, $chunks);

            $response = OpenAI::chat()->create([
                'model' => 'gpt-3.5-turbo',
                'messages' => $messages,
            ]);

            $answer = $response['choices'][0]['message']['content'];
        }

        return view('docsearch', [
            'answer' => $answer ?? '',
        ]);
    }

    protected function getChunks(string $text): array
    {
        $weaviate = app(Weaviate::class);

        // Note: in production, escape $text before interpolating it into GraphQL
        $query = <<<GQL
        {
          Get {
            Chunk(
              nearText: {
                concepts: ["$text"]
                certainty: 0.9
              }
              limit: 3
            ) {
              content
            }
          }
        }
        GQL;

        $response = $weaviate->graphql()->get($query);

        if (isset($response['errors'])) {
            return [];
        }

        return $response ? $response['data']['Get']['Chunk'] : [];
    }

    protected function getMessages(string $question, array $chunks): array
    {
        $information = implode("\n", array_column($chunks, 'content'));

        return [
            ['role' => 'system', 'content' => 'You are a helpful assistant.'],
            ['role' => 'user', 'content' => 'Here is some information: ' . $information],
            ['role' => 'user', 'content' => 'Please use this information to answer my question: ' . $question],
        ];
    }
}
```

#Conclusion

In this tutorial, I’ve demonstrated how to integrate OpenAI’s GPT models, Weaviate vector search, and Laravel to create a natural language search for your documentation. While this implementation works well, there are further optimizations we can apply to enhance the system and reduce API costs.

One area of improvement is only updating the embeddings of your documentation content when needed. Embeddings can be expensive to compute, both in terms of time and API costs. By only updating them when the content changes, you can save on API bills and improve response times.

To achieve this, you could use MD5 hashes to check whether the content has changed or not. When you receive a new content update, calculate its MD5 hash and compare it to the hash of the previous content. If the hashes are different, update the embeddings in Weaviate and store the new hash for future comparisons. This way, you’ll only update the embeddings when there’s an actual change in the content.
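A minimal sketch of that check could look like this. The `$previousChecksums` store and the `needsReindexing` helper are hypothetical; in a real application the hashes would live in your database or cache:

```php
<?php

// Hypothetical previous state: chunk identifier => stored MD5 hash.
// In a real application this would live in your database or cache.
$previousChecksums = [
    'confluence-123457' => md5('Welcome to the team!'),
];

function needsReindexing(string $identifier, string $content, array $previousChecksums): bool
{
    $checksum = md5($content);

    // Re-embed only when the content hash differs from the stored one
    return ($previousChecksums[$identifier] ?? null) !== $checksum;
}

var_dump(needsReindexing('confluence-123457', 'Welcome to the team!', $previousChecksums)); // bool(false)
var_dump(needsReindexing('confluence-123457', 'Welcome aboard!', $previousChecksums));      // bool(true)
```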

By applying these improvements, you’ll create a more efficient and cost-effective natural language search system for your documentation, while maintaining a high level of accuracy and relevance for your users.



I hope you found this article useful! 😊