Building a Personalized Link Search Engine with Supabase pgvector, Transformers.js, and LangChain

Navigating the vast digital world requires efficient tools, especially for link discovery. In our project, we've combined the strengths of Supabase pgvector, Transformers.js, and LangChain. Together, these tools power our Fastify-based backend. We also designed a user-friendly frontend using React and Vite. In this article, we'll walk you through our project's inner workings, both backend and frontend. Let's dive in!

Backend Installation and Setup

Before diving into the core functionality of our project, we'll set up the necessary tools and libraries. This ensures a smooth development process and a seamless integration of Supabase pgvector, Transformers.js, and LangChain within our Fastify framework.

1. Setting up Fastify-CLI

Fastify-CLI simplifies the process of bootstrapping and running Fastify applications. Install it globally using the following command:
bash
npm install fastify-cli --global

2. Building the Boilerplate

Once Fastify-CLI is installed, we'll generate the project boilerplate tailored to our needs, using ESM (ECMAScript Modules) and TypeScript for a more robust development experience.
Run the following command to generate the boilerplate:
bash
fastify generate --esm --lang=ts server

Installing Dependencies

With the basic structure in place, let's install the necessary dependencies to power our backend:
bash
npm i @xenova/transformers @supabase/supabase-js langchain html-to-text
These packages provide the essential tools for our link search functionality, connecting our backend to the capabilities of Transformers.js, Supabase, and LangChain, while html-to-text lets us convert HTML content to plain text.

Database Setup for Hybrid Search in Supabase

To implement a powerful and efficient search functionality in our backend, we will utilize the hybrid search capabilities provided by the LangChain library. This method integrates vector similarity search with keyword-based search, ensuring both precision and flexibility. Below is a step-by-step breakdown of our database setup:

1. Enabling pgvector Extension

Firstly, we need to enable the pgvector extension, which is pivotal for working with embedding vectors.
sql
create extension vector;

2. Creating the Documents Table

The documents table will store the information and content of the links we wish to index.
  • id: A unique identifier for each document.
  • content: The document's main content, i.e. the plain text extracted from the page's HTML.
  • url: The link we saved.
  • metadata: Supplementary data about the url, corresponding to Document.metadata.
  • embedding: A vector representation of the document's content. The dimension 384 matches the output size of the Supabase/gte-small embedding model we'll use.
sql
create table documents (
  id bigserial primary key,
  content text,
  url text,
  metadata jsonb,
  embedding vector(384)
);

3. Creating a Similarity Search Function

The match_documents function is designed to perform a similarity search for documents based on embeddings. This function returns a list of documents ranked by their similarity to a provided query embedding.
sql
create function match_documents (
  query_embedding vector(384),
  match_count int DEFAULT null,
  filter jsonb DEFAULT '{}'
) returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
#variable_conflict use_column
begin
  return query
  select
    id,
    content,
    metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where metadata @> filter
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;
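The `<=>` operator is pgvector's cosine distance, so `1 - distance` yields cosine similarity. For intuition, here is the same score computed in plain TypeScript; this is a reference sketch, not something the backend itself needs, since Postgres does this work for us:

```typescript
// Cosine similarity: the score that `1 - (embedding <=> query_embedding)`
// produces in the match_documents function. 1 means identical direction,
// 0 means orthogonal (unrelated), -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because cosine similarity only compares vector directions, documents embedded close to the query embedding rank highest regardless of document length.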

4. Creating a Keyword Search Function

The kw_match_documents function executes a keyword-based search on the documents table. This is especially useful when you want to search documents based on specific terms or phrases. The results are ranked by their relevance to the provided query.
sql
create function kw_match_documents(query_text text, match_count int)
returns table (id bigint, content text, metadata jsonb, similarity real)
as $$
begin
  return query execute
    format('select id, content, metadata, ts_rank(to_tsvector(content), plainto_tsquery($1)) as similarity
            from documents
            where to_tsvector(content) @@ plainto_tsquery($1)
            order by similarity desc
            limit $2')
    using query_text, match_count;
end;
$$ language plpgsql;
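Postgres's `plainto_tsquery` and `ts_rank` handle stemming, stop words, and term weighting, but the core idea is ranking documents by how well they match the query's terms. A deliberately simplified TypeScript illustration of that concept (a toy, not what Postgres actually does):

```typescript
// Toy keyword ranking: score each document by how many query terms it
// contains, drop non-matches, and sort best-first. Postgres's ts_rank is
// far more sophisticated (stemming, frequency, proximity), but the
// ranking concept is the same.
function keywordRank(
  docs: string[],
  query: string,
): { doc: string; score: number }[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return docs
    .map((doc) => {
      const words = new Set(doc.toLowerCase().split(/\W+/));
      const score = terms.filter((t) => words.has(t)).length;
      return { doc, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score);
}
```

This is why keyword search complements vector search: it guarantees exact-term hits (product names, error codes) that an embedding model might place far from the query.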

Integrating Supabase with Fastify

To make our backend efficient and well-integrated with Supabase, we need to set up a plugin within Fastify. This approach not only encapsulates the Supabase connection logic but also ensures that our Supabase client can be conveniently accessed throughout our application. Let's break down the steps and the code:

1. Supabase Credentials

Before diving into the code, ensure you've copied the Supabase URL and the anonymous key from your Supabase dashboard. These will be required to establish a connection to your Supabase project.

2. Creating the Fastify Plugin for Supabase

Navigate to the plugins directory and create a new file named supabase.ts. This file will contain the code for our Fastify plugin.
Now, let's dissect the code:
typescript
import fp from "fastify-plugin";
import { FastifyPluginAsync } from "fastify";
import { createClient, SupabaseClient } from "@supabase/supabase-js";
Here, we are importing necessary dependencies:
  • fastify-plugin: Wraps our plugin function so that decorators it adds (like the Supabase client) are exposed to the whole application instead of being encapsulated in a child scope.
  • FastifyPluginAsync: A type from Fastify for creating asynchronous plugins.
  • createClient & SupabaseClient: Functions and types from the @supabase/supabase-js library.
typescript
declare module "fastify" {
  interface FastifyInstance {
    supabase: SupabaseClient;
  }
}
We extend Fastify's main instance interface to declare that it will also have a supabase property. This augmentation lets us attach the Supabase client to any Fastify instance, making it available throughout the application as request.server.supabase.
typescript
const supabasePlugin: FastifyPluginAsync = fp(async (server, options) => {
  console.log("Connecting to Supabase");
 
  const supabaseUrl = process.env.SUPABASE_PUBLIC_URL!;
  const supabaseKey = process.env.SUPABASE_ANON_KEY!;
 
  const supabase = createClient(supabaseUrl, supabaseKey);
 
  server.decorate("supabase", supabase);
  server.addHook("onClose", async (server) => {
    server.log.info("Supabase connection closed.");
  });
});
 
export default supabasePlugin;
Here:
  • We define the Fastify plugin supabasePlugin.
  • We fetch the Supabase URL and anonymous key from environment variables.
  • We initialize the Supabase client using createClient.
  • We attach the Supabase client to the Fastify server instance using the decorate method.
  • An onClose hook logs a message when the Fastify server shuts down. (The Supabase JS client communicates over HTTP and holds no persistent connection, so nothing needs to be explicitly torn down here.)
  • Finally, we export the plugin, making it available for inclusion in our Fastify server setup.
With this plugin, our backend is now effectively integrated with Supabase, enabling smooth data operations and ensuring that Supabase's functionality is readily accessible throughout the application. Note that the fastify-cli boilerplate auto-loads every file in the plugins directory via @fastify/autoload, so no manual registration is needed.
As the backbone of our application, managing and querying saved links is crucial. This functionality lives in the link.ts file, located inside the src/routes/ directory. Acting as the gateway at http://localhost:3000/link, this file dictates how our app handles link-related operations. Let's walk through link.ts to understand its core responsibilities.

1. Required Modules & Types

First, we import the necessary modules. These span from Fastify core modules to specialized libraries from LangChain that will be instrumental in our link processing and retrieval.
We also define two request types:
  • SaveLinkRequest: Accepts a body with url and content for saving a link.
  • SearchLinkRequest: Accepts a body with a query (search term) and an optional count to limit the number of search results.
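These TypeScript types only help at compile time. If you also want runtime validation, Fastify accepts a JSON schema per route and rejects malformed bodies automatically. A possible schema for the save endpoint (an optional addition sketched here, not part of the original code) could look like:

```typescript
// Hypothetical JSON schema mirroring SaveLinkRequest, so Fastify can
// reject malformed /save bodies at runtime (the original code relies on
// TypeScript types alone).
const saveLinkSchema = {
  body: {
    type: "object",
    required: ["url", "content"],
    properties: {
      url: { type: "string" },
      content: { type: "string" },
    },
    additionalProperties: false,
  },
};

// Usage sketch: fastify.post("/save", { schema: saveLinkSchema }, handler)
```

With the schema attached, a request missing `url` or `content` gets a 400 response before the handler ever runs.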
When a POST request is made to /link/save, the application saves the provided link to the Supabase database.
typescript
fastify.post(
    "/save",
    async function (request: FastifyRequest<SaveLinkRequest>, reply) {
      const supabase = request.server.supabase;
      const { url, content } = request.body;
 
      console.log("Saving", url);
 
      const docs = [
        new Document({
          pageContent: content,
          metadata: {
            url: url,
          },
        }),
      ];
 
      const splitter = RecursiveCharacterTextSplitter.fromLanguage("html");
      const transformer = new HtmlToTextTransformer();
 
      const sequence = splitter.pipe(transformer);
      const newDocuments = await sequence.invoke(docs);
 
      const model = new HuggingFaceTransformersEmbeddings({
        modelName: "Supabase/gte-small",
      });
 
      for (const doc of newDocuments) {
        if (doc.pageContent) {
          const embeddings = await model.embedDocuments([doc.pageContent]);
          const { error } = await supabase
            .from("documents")
            .insert([{
              url,
              content: doc.pageContent,
              embedding: JSON.stringify(embeddings[0]),
              metadata: JSON.stringify(doc.metadata),
            }]);
 
          if (error) {
            return reply.status(500).send(error);
          }
        }
      }
 
      return {
        "message": "success",
      };
    },
  );
Here's the process:
  • Extract the Supabase client and request body.
  • Create a Document instance containing the link's content and URL.
  • Use LangChain's RecursiveCharacterTextSplitter to split the HTML content and HtmlToTextTransformer to transform HTML into plain text.
  • Generate embeddings for the transformed content with the HuggingFaceTransformersEmbeddings class, which runs the Supabase/gte-small model locally via Transformers.js.
  • Save the link's details (including its embedding) to the Supabase documents table.
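To demystify the splitting step: long pages must be broken into chunks small enough to embed meaningfully, with some overlap so context isn't lost at chunk boundaries. LangChain's RecursiveCharacterTextSplitter is smarter than this (it prefers splitting on HTML tags, then paragraphs, then sentences), but the basic idea can be sketched as:

```typescript
// Simplified fixed-size chunking with overlap. This is only an
// illustration of the concept behind RecursiveCharacterTextSplitter,
// not its actual algorithm.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once this chunk reaches the end of the text.
    if (start + chunkSize >= text.length) break;
    // Step forward by less than chunkSize so consecutive chunks overlap.
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Each chunk is then embedded and stored as its own row, which is why the save handler loops over `newDocuments` rather than inserting a single record per link.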
When a POST request is made to /link/search, the application retrieves relevant links from the Supabase database based on the provided search term.
typescript
fastify.post(
    "/search",
    async function (request: FastifyRequest<SearchLinkRequest>, reply) {
      const limit = request.body.count || 3;
      const supabase = request.server.supabase;
 
      const embeddings = new HuggingFaceTransformersEmbeddings({
        modelName: "Supabase/gte-small",
      });
      const retriever = new SupabaseHybridSearch(embeddings, {
        similarityK: limit,
        keywordK: limit,
        tableName: "documents",
        similarityQueryName: "match_documents",
        keywordQueryName: "kw_match_documents",
        client: supabase,
      });
 
      const results = await retriever.getRelevantDocuments(request.body.query);
 
      return {
        results,
      };
    },
  );
Here's the search workflow:
  • Define a limit on the number of results based on the count in the request body or default to 3.
  • Initialize the HuggingFace's transformer model for embeddings.
  • Use LangChain's SupabaseHybridSearch to enable a combination of vector similarity search and keyword-based search.
  • Retrieve relevant documents (links) based on the provided query.
The SupabaseHybridSearch is particularly powerful, offering a flexible search experience by harnessing both embeddings and keyword search.
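To build intuition for what "hybrid" means here: the retriever runs both queries and combines two ranked lists into one, without returning the same document twice. The sketch below is a toy merge, not LangChain's actual algorithm (which also reconciles the two scoring scales internally):

```typescript
// Toy illustration of hybrid-search merging (NOT LangChain's real
// implementation): interleave a similarity-ranked list and a
// keyword-ranked list, keeping each document id only once.
type Ranked = { id: number; content: string };

function mergeHybrid(similarity: Ranked[], keyword: Ranked[]): Ranked[] {
  const seen = new Set<number>();
  const merged: Ranked[] = [];
  const maxLen = Math.max(similarity.length, keyword.length);
  for (let i = 0; i < maxLen; i++) {
    // Alternate between the two lists so both rankings contribute.
    for (const list of [similarity, keyword]) {
      const doc = list[i];
      if (doc && !seen.has(doc.id)) {
        seen.add(doc.id);
        merged.push(doc);
      }
    }
  }
  return merged;
}
```

A document that ranks highly in both lists naturally surfaces early, which is exactly the behavior we want from hybrid search.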
With these routes in place, our backend now possesses the capability to save links efficiently, embedding their content for quick retrievals. Furthermore, our hybrid search ensures that when users search for content, they get the most relevant links, be it by content similarity or keyword match.
Full Code
typescript
import { FastifyPluginAsync, FastifyRequest } from "fastify";
import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { HtmlToTextTransformer } from "langchain/document_transformers/html_to_text";
import { HuggingFaceTransformersEmbeddings } from "langchain/embeddings/hf_transformers";
import { SupabaseHybridSearch } from "langchain/retrievers/supabase";
 
export type SaveLinkRequest = {
  Body: {
    url: string;
    content: string;
  };
};
 
export type SearchLinkRequest = {
  Body: {
    query: string;
    count?: number;
  };
};
 
const root: FastifyPluginAsync = async (fastify, opts): Promise<void> => {
  fastify.post(
    "/save",
    async function (request: FastifyRequest<SaveLinkRequest>, reply) {
      const supabase = request.server.supabase;
      const { url, content } = request.body;
 
      console.log("Saving", url);
 
      const docs = [
        new Document({
          pageContent: content,
          metadata: {
            url: url,
          },
        }),
      ];
 
      const splitter = RecursiveCharacterTextSplitter.fromLanguage("html");
      const transformer = new HtmlToTextTransformer();
 
      const sequence = splitter.pipe(transformer);
      const newDocuments = await sequence.invoke(docs);
 
      const model = new HuggingFaceTransformersEmbeddings({
        modelName: "Supabase/gte-small",
      });
 
      for (const doc of newDocuments) {
        if (doc.pageContent) {
          const embeddings = await model.embedDocuments([doc.pageContent]);
          const { error } = await supabase
            .from("documents")
            .insert([{
              url,
              content: doc.pageContent,
              embedding: JSON.stringify(embeddings[0]),
              metadata: JSON.stringify(doc.metadata),
            }]);
 
          if (error) {
            return reply.status(500).send(error);
          }
        }
      }
 
      return {
        "message": "success",
      };
    },
  );
 
  fastify.post(
    "/search",
    async function (request: FastifyRequest<SearchLinkRequest>, reply) {
      const limit = request.body.count || 3;
      const supabase = request.server.supabase;
 
      const embeddings = new HuggingFaceTransformersEmbeddings({
        modelName: "Supabase/gte-small",
      });
      const retriever = new SupabaseHybridSearch(embeddings, {
        similarityK: limit,
        keywordK: limit,
        tableName: "documents",
        similarityQueryName: "match_documents",
        keywordQueryName: "kw_match_documents",
        client: supabase,
      });
 
      const results = await retriever.getRelevantDocuments(request.body.query);
 
      return {
        results,
      };
    },
  );
};
 
export default root;
This setup, leveraging Fastify's performance, LangChain's tooling, and Supabase's database management, gives us a seamless and effective link management system.

Setting Up the Frontend: Integrating Chrome Extension & Search Application

Building a search engine backend without a frontend interface is like constructing a library with no entrance. In this section, we will bring our backend to life by setting up a frontend in the form of a Chrome extension. The extension will serve as an easy mechanism for users to save links and will complement our search application seamlessly. Here's a step-by-step guide to building the "VexaSearch" Chrome extension:

1. Crafting the Chrome Extension

To begin, let's set up the necessary files for our extension:
Folder Structure:
/VexaSearch
|-- manifest.json
|-- background.js

manifest.json:

The manifest.json is the metadata file for Chrome extensions. It provides essential details to the Chrome browser about how the extension should function and what permissions it requires.
json
{
    "manifest_version": 3,
    "name": "VexaSearch",
    "version": "1.0",
    "description": "Search links made easy",
    "permissions": [
        "activeTab",
        "tabs",
        "scripting",
        "contextMenus"
    ],
    "host_permissions": [
        "http://*/*",
        "https://*/*"
    ],
    "background": {
        "service_worker": "background.js"
    }
}
  • manifest_version: Specifies which version of the Chrome extension manifest format the extension uses (version 3 here).
  • name: The name of our extension, "VexaSearch".
  • version: The version of our extension.
  • description: A brief description.
  • permissions: The browser permissions the extension needs (context menus, script injection, and tab access).
  • host_permissions: Defines which websites our extension can access.
  • background: Specifies that background.js will serve as a service worker, acting as the backbone of our extension.

background.js:

This is the brain of our extension, housing the logic that will run in the background.
javascript
chrome.contextMenus.create({
  title: "Save url to Vexxa Search",
  id: "vexxa",
  contexts: ["page"],
});
 
 
const fetchCurrentSite = () => {
  const url = new URL(window?.location?.href);
  const entirePage = document.documentElement.outerHTML;
  return {
    url: url.href,
    content: entirePage,
  };
};
 
chrome.contextMenus.onClicked.addListener(async (info, tab) => {
  if (info.menuItemId == "vexxa") {
    const tabId = tab.id;
    const response = await chrome.scripting.executeScript({
      target: {
        tabId: tabId,
      },
      func: fetchCurrentSite,
    });
    console.log(response);
    if (response.length !== 0) {
      const data = response[0]["result"];
      if (data) {
        console.log(data);
        const response = await fetch("http://localhost:3000/link/save", {
          method: "POST",
          body: JSON.stringify(data),
          headers: {
            "Content-Type": "application/json",
          },
        });
 
        if (response.status === 200) {
          const data = await response.json();
          console.log(data);
        }
      }
    }
  }
});
  1. Creating the Context Menu: We add a context menu item named "Save url to Vexxa Search". When users right-click on a page, they will see this option, allowing them to save the URL directly.
  2. Fetching Current Site: The function fetchCurrentSite captures the current URL and the entire HTML content of the page.
  3. Event Listener for Menu Click: When the user selects "Save url to Vexxa Search" from the context menu, the event listener fires. It first fetches the current page's URL and content, and then sends a POST request to our backend at http://localhost:3000/link/save, saving the link's data.

Loading the Extension:

Before you can use the extension, you need to load it into Chrome:
  1. Open Chrome Browser: Navigate to the Chrome extensions page by entering chrome://extensions/ in the address bar or selecting "Extensions" from the Chrome menu.
  2. Enable Developer Mode: On the top-right corner of the extensions page, toggle on the "Developer mode".
  3. Load Unpacked: You will see three options appear: Load unpacked, Pack extension, and Update. Click on the Load unpacked button.
  4. Select Your Extension Folder: Navigate to the directory where you saved the VexaSearch folder and select it. Your extension should now be loaded into Chrome and appear in the list of installed extensions.

Testing the Extension:

Once loaded, you can test the extension to ensure it's working as expected:
  1. Open a Website: Navigate to any website in your Chrome browser.
  2. Right-click on the Page: In the context menu that appears, you should see the option "Save url to Vexxa Search".
  3. Select the Option: Clicking on this will trigger the background.js script, fetching the current site's data and sending it to our backend for saving.

2. Setting Up the VexxaSearch Frontend Application

For this frontend setup, we'll be leveraging React as our primary framework, while employing Vite for faster and leaner builds. We'll be using Mantine for UI components, making our application look neat, and TanStack's React-Query to efficiently handle our application's asynchronous data.
Note: This tutorial will not delve into setting up Mantine or React-Query as their respective documentation is quite comprehensive. To set them up:
  1. Mantine setup with Vite: Follow this guide.
  2. React-Query setup: Follow this guide.

Frontend Code Explanation

Let's break down the provided App.tsx code:
tsx
import { useDisclosure } from "@mantine/hooks";
import {
  AppShell,
  Burger,
  Text,
  Container,
  TextInput,
  Card,
} from "@mantine/core";
import React from "react";
import { useQuery } from "@tanstack/react-query";
  • Imports: We're importing essential hooks, UI components, and the main React package. useQuery from React-Query helps manage and fetch asynchronous data.
tsx
export default function App() {
  const [opened, { toggle }] = useDisclosure();
  const [search, setSearch] = React.useState<string | undefined>();
  • State Initialization: We have two state variables - one for toggling the navbar and another to manage the search query.
tsx
const { data, status } = useQuery(
    ["searchLinks", search],
    async () => {
      if (!search) return { results: [] };
      const response = await fetch("http://localhost:3000/link/search", {
        body: JSON.stringify({ query: search }),
        method: "POST",
        headers: {
          "Content-Type": "application/json",
        },
      });
      const data = await response.json();
 
      return data as {
        results: {
          pageContent: string;
          metadata: string;
        }[];
      };
    },
    {
      enabled: !!search,
    }
  );
  • Data Fetching: With React-Query's useQuery, we fetch data based on the search term. Notice the enabled option; this ensures that the fetch only occurs when there's a search term.
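One caveat: the `as` cast in the query function trusts the backend to return exactly the expected shape. A small runtime guard would fail fast on unexpected responses instead. This helper is a sketch we're adding for illustration; the original App.tsx casts directly:

```typescript
type SearchResult = { pageContent: string; metadata: string };

// Hypothetical helper (not in the original App.tsx): narrow an unknown
// response body to the shape the UI expects, dropping malformed entries
// instead of trusting a blind `as` cast.
function parseSearchResponse(data: unknown): { results: SearchResult[] } {
  if (
    typeof data !== "object" ||
    data === null ||
    !Array.isArray((data as { results?: unknown }).results)
  ) {
    throw new Error("Unexpected /link/search response shape");
  }
  const results = (data as { results: unknown[] }).results.filter(
    (r): r is SearchResult =>
      typeof r === "object" &&
      r !== null &&
      typeof (r as SearchResult).pageContent === "string" &&
      typeof (r as SearchResult).metadata === "string",
  );
  return { results };
}
```

Swapping `return data as {...}` for `return parseSearchResponse(data)` would keep the rest of the component unchanged while making failures explicit.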
tsx
return (
  <AppShell header={{ height: 60 }} navbar={{ ... }}>
    ...
    <TextInput
      value={search}
      onChange={(e) => setSearch(e.currentTarget.value)}
      required
      placeholder="Search your links"
    />
    ...
    <Container>
      {status === "success" && data.results.map((result, index) => (
        <Card
          key={index}
          shadow="xs"
          padding="sm"
          radius="sm"
          style={{ marginBottom: 10 }}
        >
          <Text>{result.pageContent}</Text>
          <Text size="xs" color="gray">
            {JSON.parse(result.metadata).url}
          </Text>
        </Card>
      ))}
    </Container>
  </AppShell>
);
  • UI Rendering: This is where Mantine shines, providing us with a clean and modern user interface. Users can search for their saved links, and the results are displayed in neat cards, with the page content and its URL.
Full code:
tsx
import { useDisclosure } from "@mantine/hooks";
import {
  AppShell,
  Burger,
  Text,
  Container,
  TextInput,
  Card,
} from "@mantine/core";
import React from "react";
import { useQuery } from "@tanstack/react-query";
 
export default function App() {
  const [opened, { toggle }] = useDisclosure();
 
  const [search, setSearch] = React.useState<string | undefined>();
 
  const { data, status } = useQuery(
    ["searchLinks", search],
    async () => {
      if (!search) return { results: [] };
      const response = await fetch("http://localhost:3000/link/search", {
        body: JSON.stringify({ query: search }),
        method: "POST",
        headers: {
          "Content-Type": "application/json",
        },
      });
      const data = await response.json();
 
      return data as {
        results: {
          pageContent: string;
          metadata: string;
        }[];
      };
    },
    {
      enabled: !!search,
    }
  );
 
  return (
    <AppShell
      header={{ height: 60 }}
      navbar={{ width: 300, breakpoint: "sm", collapsed: { mobile: !opened } }}
      padding="md"
    >
      <AppShell.Header>
        <Burger opened={opened} onClick={toggle} hiddenFrom="sm" size="sm" />
        <Text m="md" size="xl" fw={700}>
          VexxaSearch
        </Text>
      </AppShell.Header>
      <Container style={{ paddingTop: 100 }}>
        <TextInput
          value={search}
          onChange={(e) => setSearch(e.currentTarget.value)}
          required
          placeholder="Search your links"
        />
      </Container>
 
      <Container>
        {status === "success" &&
          data.results.map((result, index) => (
            <Card
              key={index}
              shadow="xs"
              padding="sm"
              radius="sm"
              style={{ marginBottom: 10 }}
            >
              <Text>{result.pageContent}</Text>
              <Text size="xs" color="gray">
                {JSON.parse(result.metadata).url}
              </Text>
            </Card>
          ))}
      </Container>
    </AppShell>
  );
}

Testing the Application

To test the application, we'll need to start the backend server and the frontend application:
  1. Start the Backend: Navigate to the server directory and run npm run dev. This will start the backend server at http://localhost:3000.
  2. Start the Frontend: Navigate to the client directory and run npm run dev. This will start the frontend application at http://localhost:5173.
  3. Open the Extension: Navigate to any website and right-click on the page. Select "Save url to Vexxa Search" from the context menu. This will save the link to the backend.
  4. Open the Application: Navigate to http://localhost:5173 and search for the link you saved. You should see the link's content and URL displayed in a neat card.

Demo

This snapshot showcases the Chrome extension in action. With a simple right-click, users can easily save their current webpage.
Interface of the VexxaSearch application
You can find the full code for the frontend application here.

Conclusion

Building VexxaSearch has been quite a journey! We started with the backend, utilizing Fastify as our server framework and Supabase as our data storage solution. We then set up a convenient Chrome extension that allows users to save URLs with just a simple right-click. And lastly, we created an efficient frontend with React, enhanced with Mantine for UI components and React-Query for state management. This holistic solution ensures a seamless process from saving the links to retrieving and searching them. It's proof of the power that comes from combining different modern technologies to create an efficient, user-friendly application.