Using OpenAI GPT to search your org files

As I wrote previously, I store all my files under ~/data/org-attach by using the org mode file attachment feature. To retrieve a file, I usually use agenda search or other org packages which can go over your org files and retrieve a headline.

But what for those rare moments when I only have a vague idea of what I'm looking for and can't hit the query exactly? Well… we can use a semantic search.

OpenAI recently opened a Large Language Model Embedding API. Roughly speaking, embedding takes a chunk of text and returns a fixed sized vector in the model's "knowledge space" (OpenAI's ada model returns 1536 dimensional vector). Two vectors which are close to each other in this knowledge space should correspond to similar things.

The usual metric of judging closeness of vectors is the "angle" between them. This is hard to imagine in 1536 dimensions, but to keep it short we can use the dot product operation to compute the "similarity".

So the idea here is simple:

Iterate over all headlines in my buffers.
Submit them to the Embedding API and cache the embedding vectors for each headline.
When querying for a headline, submit the query to the API, then calculate similarity with all the cached embeddings and present the closest ones as candidates.¹

I wrote my own "org brain" clone which I called org-graph because I wasn't happy with any available solution. My biggest problem was that most systems either prescribed one file per note, or only worked with top level headlines, or had other artificial limitations. So it was logical this feature would live in that package as well.

But just after I got ready with all the preprocessing and data preparation and API implementation, there came a shock. Emacs math is SLOOOOOOW. So slow this "dot product" operation was taking ages and the interface sucked.

But then I remembered about dynamic modules. Not wanting to give up, I decided to write a C module for some good old linear algebra.

You start with a header and some initialization:

#include <emacs-module.h>

int plugin_is_GPL_compatible;

int
emacs_module_init (struct emacs_runtime *runtime)
{
  emacs_env *env = runtime->get_environment (runtime);

  return 0;
}

Then you can start implementing the functions. I'm not going to repeat everything here, you can find the full C source at GitHub. The following function computes the dot product, which is really just a lot of multiplication and addition.

static emacs_value
dot_product (emacs_env *env, ptrdiff_t nargs, emacs_value *args,
            void *data)
{
  assert (nargs == 2);
  emacs_value a = args[0];
  emacs_value b = args[1];

  ptrdiff_t size = env->vec_size (env, a);

  double dp = 0;
  for (int i = 0; i < size; i++) {
    double first = get_at(env, a, i);
    double second = get_at(env, b, i);
    dp += first * second;
  }

  emacs_value result = env->make_float (env, dp);
  return result;
}

With a simple Makefile

.PHONY: all

all: dotproduct.so

dotproduct.o: dotproduct.c
    gcc -Wall -c dotproduct.c

dotproduct.so: dotproduct.o
    gcc -shared -o dotproduct.so dotproduct.o

we can build the module

> make
gcc -Wall -c dotproduct.c
gcc -shared -o dotproduct.so dotproduct.o

Finally, we load the module into Emacs with:

(module-load (expand-file-name "dotproduct.so"))

Armed with the now much faster math routines, I embedded about 2000 headers and now I can search my files by just giving very vague queries—it works surprisingly well, even across multiple different natural languages (i.e. if I ask about Franz Kafka's Castle it will return "Das Schloß" entry from my foreign language reading file).

(org-graph-openai-query "Book about journey to the center of the Earth")

The top 10 results. You can see it mixes the languages but all the things are somewhat related. However, it got the first hit exactly right.

[8de3805f-8971-404a-98d2-84305db1a444] 87.43 Voyage au centre de la Terre
[f128f559-74a6-4435-a357-aa7d35976097] 85.43 Nicolai Klimii Iter Subterraneum
[2e0f3f7a-73bd-4b0f-a588-c757432607ca] 85.26 Endurance: Shackleton's Incredible Voyage
[d2828ffb-343c-431f-8657-521ef32de079] 85.11 Vesmír v orechovej škrupinke
[724b7e0d-f804-4d1d-8767-c4d166492472] 84.83 Oheň nad hlubinou - Pád Straumli
[5e2b17a4-902a-4e55-87f3-25a18f239346] 84.78 The Earthsea
[5e5906eb-7ab6-4b86-b41f-2bd007ff8eba] 84.65 Tolkien: Sur les rivages de la Terre du Milieu
[c33a670e-8ec2-4886-9e12-eee71891d2ea] 84.21 Vladimir Ulrich - Bis ans Ende der Welt - Ein Pilgerbuch
[d3c57fb4-9458-487d-bcf7-9852b9fec3cf] 84.09 The Lost World
[5d0d82de-df40-43f7-9868-3470d1bd376d] 83.95 Oheň nad hlubinou - Planeta spárů

Here's another example which is purposefully silly description of the expected book's title:

(org-graph-openai-query "Philosophical book where someone spoke in a particular way")

The top 10 results from my org files are:

[5d59a53e-fb81-4927-bce2-5096d9fb8417] 88.69 Thus Spoke Zarathustra
[15e61234-4d6b-413c-986d-391996f09a19] 87.82 Quotes
[7922112d-e871-419f-a4f2-68e892d5dad1] 87.04 Epistemology
[82e38b44-5ce1-416f-bd19-075aa70c7bf6] 86.96 Western Philosophy
[babcdf60-5840-4772-ba2b-314faa756997] 86.94 Nietzsche
[b192f472-d4f6-4596-8520-627b2c77e783] 86.90 Treatise on Human Nature
[b6f1d479-53fb-4e88-a474-024fbafab99e] 86.75 The Short History of Modern Philosophy
[1a366e71-eb03-472b-8c7c-d619494e9693] 86.62 The way of the bow
[2d738ef8-4a23-4b32-bcc1-0a1cc941c4be] 86.55 The Discourses
[2f00f564-af67-4d04-bf41-8cabe7764cdb] 86.36 The Life of Reason - Santayana

Not bad, eh :)

To prepare the embeddings, you can run the function org-graph-compute-embeddings-for-buffer in an org buffer. Make sure to set the environment variable OPENAI_TOKEN. Also be aware that this will add the ID property to every headline as this is a way to track the cached embedding to the particular headline. Make sure to backup your org files before running this (you have them checked in to git right… right?!). This will fire several requests to the API (about 20 headings per request), so be patient, it should take about a minute for 500-600 headings.

You can discuss and ask questions on the discussions board on GitHub.

This blog post was inspired by GPT for second brains.

Footnotes:

Now before we go further, you need to register on OpenAI platform and the API costs money. The good news is that it is extremely cheap. It will cost you $1 to embed 2.4 MILLION tokens. With a query being roughly 10 words which corresponds to 10-15 tokens, one query will cost you about $0.0000006. So it's pretty much free and you only need a credit card to formally register. You can also set monthly spending limit to $0.01 and you would probably never run over the limit. The step 2 will cost based on how much data you have. I have about 200000 lines of org files and so far I spent less than $0.50 including all the experimenting.