Home · Juissie.jl

Juissie.Generation.SemanticSearch.Embedding.embed — Method

function embed(embedder::Embedder, text::String)::AbstractVector

Embeds a textual sequence using a provided model

Parameters

embedder : Embedder
    an initialized Embedder struct
text : String
    the text sequence you want to embed

Notes

This is sort of like a class method for the Embedder.

Julia has something called multiple dispatch that can be used to make this cleaner, but I'm going to handle that at a later time.
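
As a rough illustration of the multiple dispatch mentioned in the note above, the sketch below defines one generic function with one method per embedder type. The types and the function name toy_embed are hypothetical and are not part of Juissie.

```julia
# Hypothetical embedder types, for illustration only (not part of Juissie).
abstract type ToyEmbedder end

struct LengthEmbedder <: ToyEmbedder end   # embeds a text as its character count
struct VowelEmbedder <: ToyEmbedder end    # embeds a text as its vowel count

# One generic function, one method per embedder type.
toy_embed(::LengthEmbedder, text::String) = Float32[length(text)]
toy_embed(::VowelEmbedder, text::String) =
    Float32[count(c -> c in "aeiou", lowercase(text))]

# Julia picks the method from the runtime type of the first argument.
for e in (LengthEmbedder(), VowelEmbedder())
    println(typeof(e), " => ", toy_embed(e, "Juissie"))
end
```

With this pattern, a single embed name could cover several model families without an explicit branch on the model type.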

source

Juissie.Generation.SemanticSearch.Embedding.embed_from_bert — Method

function embed_from_bert(embedder::Embedder, text::String)

Embeds a textual sequence using a provided Bert model

Parameters

embedder : Embedder
    an initialized Embedder struct
    the associated model and tokenizer should be Bert-specific
text : String
    the text sequence you want to embed

return : cls_embedding
    The results from passing the text through the encoder, through the model, and after stripping

source

Juissie.Generation.SemanticSearch.Embedding.Embedder — Type

struct Embedder

A struct for holding a model and a tokenizer

Attributes

tokenizer : a tokenizer object, e.g. BertTextEncoder
    maps your string to tokens the model can understand
model : a model object, e.g. HGFBertModel
    the actual model architecture and weights to perform inference with

Notes

You can get class-like behavior in Julia by defining a struct and functions that operate on that struct.
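
A minimal sketch of that struct-plus-functions pattern, assuming a toy stand-in for Embedder. ToyEncoder, its fields, and encode are hypothetical names chosen for this example.

```julia
# Hypothetical stand-in for Embedder: two fields plus functions that take the struct.
struct ToyEncoder
    tokenizer::Function   # maps a string to tokens
    model::Function       # maps tokens to a vector
end

# An outer constructor, similar in spirit to Embedder(model_name).
ToyEncoder() = ToyEncoder(split, tokens -> Float64[length(t) for t in tokens])

# A "method" is just a function whose first argument is the struct.
encode(enc::ToyEncoder, text::String) = enc.model(enc.tokenizer(text))

enc = ToyEncoder()
println(encode(enc, "hello vector world"))
```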

source

Juissie.Generation.SemanticSearch.Embedding.Embedder — Method

function Embedder(model_name::String)

Function to initialize an Embedder struct from a HuggingFace model path.

Parameters

model_name : String
    a path to a HuggingFace-hosted model
    e.g. "BAAI/bge-small-en-v1.5"

source

Juissie.SemanticSearch.Backend.index — Method

function index(corpus::Corpus)

Constructs the HNSW vector index from the data available. If the corpus has a corpus_name, then we also save the new index to disk. Must be run before searching.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use

source

Juissie.SemanticSearch.Backend.load_corpus — Method

function load_corpus(corpus_name)

Loads an already-initialized corpus from its associated "artifacts" (relational database, vector index, and informational json).

Parameters

corpus_name : str
    the name of your EXISTING vector database

source

Juissie.SemanticSearch.Backend.search — Function

function search(corpus::Corpus, query::String, k::Int=5)

Performs approximate nearest neighbor search to find the items in the vector index closest to the query.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
query : str
    The text you want to search, e.g. your question
    We embed this and perform semantic retrieval against the vector db
k : int
    The number of nearest-neighbor vectors to fetch
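
To make the retrieval step concrete, here is a sketch of exact (brute-force) cosine-similarity search over a handful of vectors. This is not what search does internally: the documented function is approximate and relies on an HNSW index. The function name and the sample vectors are assumptions for illustration.

```julia
using LinearAlgebra

# Exact cosine search over all vectors; an HNSW index approximates this ranking.
function brute_force_search(vectors::Vector{Vector{Float64}},
                            query::Vector{Float64}, k::Int=5)
    sims = [dot(v, query) / (norm(v) * norm(query)) for v in vectors]
    order = sortperm(sims; rev=true)            # most similar first
    top = order[1:min(k, length(order))]
    return top, sims[top]
end

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
idxs, sims = brute_force_search(vectors, [1.0, 0.1], 2)
println(idxs)   # indices of the two closest vectors
```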

source

Juissie.SemanticSearch.Backend.upsert_chunk — Method

function upsert_chunk(corpus::Corpus, chunk::String, doc_name::String)

Given a new chunk of text, get embedding and insert into our vector DB. Not actually a full upsert, because we have to reindex later. Process:

Generate an embedding for the text
Insert metadata into database
Increment idx counter

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
chunk : str
    This is the text content of the chunk you want to upsert
doc_name : str
    The name of the document that chunk is from. For instance, if you were upserting all the chunks in an academic paper, doc_name might be the name of that paper

Notes

If the vectors have been indexed, this de-indexes them (i.e., they need to be indexed again). Currently, we handle this by setting hnsw to nothing so that it gets caught later in search.
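
The invalidate-on-upsert behavior described in this note can be sketched with a simplified mutable struct. ToyCorpus and the toy_* functions are hypothetical; only the field names hnsw, data, and next_idx mirror the documentation, and the metadata insert is omitted.

```julia
# Hypothetical, simplified stand-in for Corpus.
mutable struct ToyCorpus
    hnsw::Union{Vector{Vector{Float64}}, Nothing}   # placeholder for a real index
    data::Vector{Vector{Float64}}
    next_idx::Int
end

ToyCorpus() = ToyCorpus(nothing, Vector{Float64}[], 1)

function toy_upsert!(corpus::ToyCorpus, embedding::Vector{Float64})
    push!(corpus.data, embedding)   # store the embedding
    corpus.next_idx += 1            # increment the idx counter
    corpus.hnsw = nothing           # any existing index is now stale
    return corpus
end

function toy_index!(corpus::ToyCorpus)
    corpus.hnsw = copy(corpus.data) # stands in for building the vector index
    return corpus
end

# A search routine would check this before querying.
toy_search_ready(corpus::ToyCorpus) = corpus.hnsw !== nothing

c = ToyCorpus()
toy_upsert!(c, [1.0, 2.0])
toy_index!(c)
toy_upsert!(c, [3.0, 4.0])          # de-indexes again
println(toy_search_ready(c))
```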

source

Juissie.SemanticSearch.Backend.upsert_document — Method

function upsert_document(corpus::Corpus, doc_text::String, doc_name::String)

Upsert a whole document (i.e., long string). Does so by splitting the document into appropriately-sized chunks so no chunk exceeds the embedder's tokenization max sequence length, while prioritizing sentence endings.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
doc_text : str
    A long string you want to upsert. We will break this into chunks and upsert each chunk.
doc_name : str
    The name of the document the content is from

source

Juissie.SemanticSearch.Backend.upsert_document — Method

function upsert_document(corpus::Corpus, documents::Vector{String}, doc_name::String)

Upsert a collection of documents (i.e., a vector of long strings). Does so by upserting each entry of the provided documents vector (which in turn will chunkify each document further into appropriately sized chunks).

See the upsert_document(...) above for more details.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
documents : Vector{String}
    a collection of long strings to upsert.
doc_name : str
    The name of the document the content is from

source

Juissie.SemanticSearch.Backend.upsert_document_from_pdf — Method

function upsert_document_from_pdf(corpus::Corpus, filePath::String, doc_name::String)

Upsert all the data in a PDF file into the provided corpus. See the upsert_document(...) above for more details.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
filePath : String
    The path to the PDF file to read
doc_name : str
    The name of the document the content is from

source

Juissie.SemanticSearch.Backend.upsert_document_from_txt — Method

function upsert_document_from_txt(corpus::Corpus, filePath::String, doc_name::String)

Upsert all the data from the text file into the provided corpus.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
filePath : String
    The path to the txt file to read
doc_name : str
    The name of the document the content is from

source

Juissie.SemanticSearch.Backend.upsert_document_from_url — Function

function upsert_document_from_url(corpus::Corpus, url::String, doc_name::String, elements::Array{String}=["h1", "h2", "p"])

Extracts element-tagged text from HTML and upserts as a document.

Parameters

corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
url : String
    The url you want to scrape for text
doc_name : str
    The name of the document the content is from
elements : Array{String}
    A list of HTML elements you want to pull the text from

source

Juissie.SemanticSearch.Backend.Corpus — Type

struct Corpus

Basically a vector database. It will have these attributes:

a relational database (SQLite)
a vector index (HNSW)
an embedder (via Embedding.jl)

Attributes

corpus_name : String or Nothing
    this is the name of your corpus and will be used to access saved corpora
    if Nothing, we can't save/load and everything will be in-memory
db : a SQLite.DB connection object
    this is a real relational database to store metadata (e.g. chunk text, doc name)
hnsw : Hierarchical Navigable Small World object
    this is our searchable vector index
embedder : Embedder
    an initialized Embedder struct
max_seq_len : int
    The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer
data : Vector{Any}
    The embeddings get stored here before we create the vector index
next_idx : int
    stores the index we'll use for the next-upserted chunk

Notes

The struct is mutable because we want to be able to change things like incrementing next_idx.

source

Juissie.SemanticSearch.Backend.Corpus — Type

function Corpus(corpus_name::String, embedder_model_path::String="BAAI/bge-small-en-v1.5")

Initializes a Corpus struct.

In particular, does the following:

Initializes an embedder object
Creates a SQLite database with the corpus name. It should have:

row-wise primary key uuid
doc_name representing the parent document
chunk text

We can add more metadata later, if desired

Parameters

corpus_name : str or nothing
    the name that you want to give the database
    optional. if left as nothing, we use an in-memory database
embedder_model_path : str
    a path to a HuggingFace-hosted model
    e.g. "BAAI/bge-small-en-v1.5"
max_seq_len : int
    The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer

source

Juissie.Generation.SemanticSearch.Backend.TextUtils.chunkify — Function

function chunkify(text::String, tokenizer, sequence_length::Int=512)

Splits a provided text (e.g. paragraph) into chunks that are each as many sentences as possible while keeping the chunk's token length below the sequence_length. This ensures that each chunk can be fully encoded by the embedder.

Parameters

text : String
    The text you want to split into chunks.
tokenizer : a tokenizer object, e.g. BertTextEncoder
    The tokenizer you will be using
sequence_length : Int
    The maximum number of tokens per chunk. Ideally, should correspond to the max sequence length of the tokenizer

Example Usage

julia> chunkify(
    """Hold me closer, tiny dancer. Count the headlights on the highway. Lay me down in sheets of linen. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked.
    """,
    corpus.embedder.tokenizer,
    20
)

4-element Vector{Any}:
"Hold me closer, tiny dancer. Count the headlights on the highway."
"Lay me down in sheets of linen."
"Peter Piper picked a peck of pickled peppers."
"A peck of pickled peppers Peter Piper picked."

source
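
The greedy sentence packing that chunkify describes can be sketched as follows. This is an assumption-laden illustration rather than the library's code: a whitespace word count stands in for the real tokenizer, the limit is treated as inclusive, and greedy_chunks and count_tokens are hypothetical names.

```julia
# Whitespace word count stands in for a real tokenizer (an assumption of this sketch).
count_tokens(s::AbstractString) = length(split(s))

function greedy_chunks(sentences::Vector{String}, sequence_length::Int)
    chunks = String[]
    current = ""
    for sentence in sentences
        candidate = isempty(current) ? sentence : current * " " * sentence
        if isempty(current) || count_tokens(candidate) <= sequence_length
            current = candidate        # sentence still fits (or starts a chunk)
        else
            push!(chunks, current)     # close the chunk and start a new one
            current = sentence
        end
    end
    isempty(current) || push!(chunks, current)
    return chunks
end

sentences = ["Hold me closer, tiny dancer.",
             "Count the headlights on the highway.",
             "Lay me down in sheets of linen.",
             "Peter Piper picked a peck of pickled peppers."]
chunks = greedy_chunks(sentences, 12)
foreach(println, chunks)
```

Note that a single sentence longer than the limit is kept whole in this sketch; a real implementation would need to decide how to handle that case.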

Juissie.Generation.SemanticSearch.Backend.TextUtils.get_files_path — Method

function get_files_path()

Simple function to return the path to the files subdirectory.

Example Usage

test_bin_path = get_files_path()*"test.bin"

source

Juissie.Generation.SemanticSearch.Backend.TextUtils.read_html_url — Function

read_html_url(url::String, elements::Array{String})

Returns a string of text from the provided HTML elements on a webpage.

Parameters

url : String
    the url you want to read
elements : Array{String}
    html elements to look for in the web page, e.g. ["h1", "p"].

Notes

Defaults to extracting headers and paragraphs

source

Juissie.Generation.SemanticSearch.Backend.TextUtils.sentence_splitter — Method

function sentence_splitter(text::String)

Uses basic regex to divide a provided text (e.g. paragraph) into sentences.

Parameters

text : String The text you want to split into sentences.

Notes

Regex is hard to read. The first part looks for spaces following end-of-sentence punctuation. The second part matches at the end of the string.

Regex in Julia uses an r identifier prefix.
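
A small sketch of regex-based sentence splitting with the r prefix. The pattern below is an assumption (whitespace preceded by sentence-ending punctuation) and may differ from the one sentence_splitter actually uses; split_sentences is a hypothetical name.

```julia
# Assumed pattern: whitespace preceded by '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to its sentence.
const SENTENCE_BOUNDARY = r"(?<=[.!?])\s+"

split_sentences(text::String) =
    String.(split(strip(text), SENTENCE_BOUNDARY; keepempty=false))

result = split_sentences("Regex is hard to read. Is it worth it? Yes!")
foreach(println, result)
```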

References

https://www.geeksforgeeks.org/regular-expressions-in-julia/

source

Juissie.SemanticSearch.Backend.TxtReader.appendToFile — Method

Append the given contents into a file specified at filename. A new file will be created if filename doesn't already exist.

NOTE: No '\n' newline character will be appended. It is the caller's responsibility to decide if the contents should have a '\n' newline character or not.

Parameters

filename: String
    The name of the file to open. Relative file paths are evaluated from the directory where the julia command was run. Typically the root level of the project
contents: String
    The exact text to append into the file.
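
The append semantics described above match Base's open in "a" mode. The sketch below uses a hypothetical helper named append_text; it is not the appendToFile implementation.

```julia
# Append text verbatim; the file is created if it does not exist.
function append_text(filename::String, contents::String)
    open(filename, "a") do io
        write(io, contents)    # no newline is added for the caller
    end
    return nothing
end

path = joinpath(mktempdir(), "notes.txt")
append_text(path, "first line\n")   # caller supplies the newline
append_text(path, "second")
println(repr(read(path, String)))
```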

source

Juissie.SemanticSearch.Backend.TxtReader.getAllTextInFile — Method

Open the provided filename, load all the data into memory, and return it. This function also manages the file handle (open(...) and close(...)) properly. If there was an error opening or reading the file, the empty string is returned.

Parameters

filename: String
    The name of the file to open. Relative file paths are evaluated from the directory where the julia command was run. Typically the root level of the project

Returns: String
    The entire contents of the file, or an empty string if there was an issue
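
The read-or-empty-string contract can be sketched with Base's read, which opens and closes the file itself. read_or_empty is a hypothetical name, not the library function.

```julia
# Return the whole file as a String, or "" if opening or reading fails.
function read_or_empty(filename::String)
    try
        return read(filename, String)
    catch
        return ""
    end
end

missing_path = joinpath(mktempdir(), "does_not_exist.txt")
println(repr(read_or_empty(missing_path)))
```

One consequence of this design is that an empty file and a failed read are indistinguishable to the caller.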

source

Juissie.SemanticSearch.Backend.TxtReader.splitFileIntoParts — Method

A simple script that allows a user to split a large file into multiple smaller files. This will create as many child files as given by splits, each with a file size of ~1/splits of the original target file.

Parameters

fileToSplit : String
    The name of the file to read and split into multiple parts. If an absolute file path is given then that will be used. Otherwise, relative file paths are evaluated from the location that the julia command was run from (typically the root level of this project)
outputFileNameBase : String
    The template for the name of the child split-out files. Each split-out file will have the format <outputFileNameBase>_<#>, where # starts at 1 and increments by 1 for each subsequent file. There will be splits number of child files
splits : Int
    How many child files should be created?
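
One way to realize such a split is by byte ranges, sketched below. This is an assumption about the approach, not the library's implementation: split_into_parts is a hypothetical name, and splitting on bytes can cut a multi-byte character in half.

```julia
# Split a file into `splits` parts of roughly equal byte size.
function split_into_parts(file_to_split::String, output_base::String, splits::Int)
    bytes = read(file_to_split)                 # whole file as Vector{UInt8}
    part_size = cld(length(bytes), splits)      # ceiling division
    outputs = String[]
    for i in 1:splits
        lo = (i - 1) * part_size + 1
        hi = min(i * part_size, length(bytes))
        name = "$(output_base)_$(i)"            # <base>_1, <base>_2, ...
        write(name, lo <= hi ? bytes[lo:hi] : UInt8[])
        push!(outputs, name)
    end
    return outputs
end

dir = mktempdir()
src = joinpath(dir, "big.txt")
write(src, "abcdefghij")
parts = split_into_parts(src, joinpath(dir, "part"), 3)
println([read(p, String) for p in parts])
```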

source

Juissie.Generation.build_full_query — Function

function build_full_query(query::String, context::OptionalContext=nothing)

Given a query and a list of contextual chunks, construct a full query incorporating both.

Parameters

query : String
    the main instruction or query string
context : OptionalContext, which is Union{Vector{String}, Nothing}
    optional list of chunks providing additional context for the query

Notes

We use the Alpaca prompt, found here: https://github.com/tatsu-lab/stanford_alpaca with minor modifications that reflect our response preferences.
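
For orientation, here is a sketch that fills the unmodified Stanford Alpaca templates. The "minor modifications" mentioned above are not reproduced, and alpaca_prompt as well as the choice to join chunks with newlines are assumptions of this example.

```julia
# Fill the original Alpaca templates (without Juissie's modifications).
function alpaca_prompt(query::String,
                       context::Union{Vector{String}, Nothing}=nothing)
    if context === nothing || isempty(context)
        return string(
            "Below is an instruction that describes a task. ",
            "Write a response that appropriately completes the request.\n\n",
            "### Instruction:\n", query,
            "\n\n### Response:\n")
    end
    return string(
        "Below is an instruction that describes a task, paired with an input ",
        "that provides further context. ",
        "Write a response that appropriately completes the request.\n\n",
        "### Instruction:\n", query,
        "\n\n### Input:\n", join(context, "\n"),   # chunks joined by newlines
        "\n\n### Response:\n")
end

prompt = alpaca_prompt("Who taught Plato?", ["chunk a", "chunk b"])
println(prompt)
```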

source

Juissie.Generation.check_oai_key_format — Method

function check_oai_key_format(key::String)

Uses regex to check if a provided string is in the expected format of an OpenAI API Key

Parameters

key : String the key you want to check

Notes

See here for more on the regex:

https://en.wikibooks.org/wiki/Introducing_Julia/Strings_and_characters#Finding_and_replacing_things_inside_strings

Uses format rule provided here:

https://github.com/secretlint/secretlint/issues/676
https://community.openai.com/t/what-are-the-valid-characters-for-the-apikey/288643

Note that this only checks the key format, not whether the key is valid or has not been revoked.
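
A format check of this kind might look like the sketch below. The pattern is an assumption (the legacy key shape of "sk-" followed by 48 alphanumeric characters) and may not match the rule check_oai_key_format applies or newer key formats; looks_like_oai_key is a hypothetical name.

```julia
# Assumed legacy key shape: "sk-" plus exactly 48 alphanumeric characters.
const OAI_KEY_PATTERN = r"^sk-[A-Za-z0-9]{48}$"

looks_like_oai_key(key::String) = occursin(OAI_KEY_PATTERN, key)

println(looks_like_oai_key("sk-" * "a"^48))   # well-formed under the assumed rule
println(looks_like_oai_key("sk-short"))       # too short
```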

source

Juissie.Generation.generate — Function

generate(generator::Union{OAIGenerator, Nothing}, query::String, context::OptionalContext=nothing, temperature::Float64=0.7)

Generate a response based on a given query and optional context using the specified OAIGenerator. This function constructs a full query, sends it to the OpenAI API, and returns the generated response.

Parameters

generator : Union{OAIGenerator, Nothing}
    an initialized generator (e.g. OAIGenerator)
    leaving this as a union with nothing to note that we may want to support other generator types in the future (e.g. HFGenerator, etc.)
query : String
    the main query string. This is basically your question
context : OptionalContext, which is Union{Vector{String}, Nothing}
    optional list of contextual chunk strings to provide the generator additional context for the query. Ultimately, these will be coming from our vector DB
temperature : Float64
    controls the stochasticity of the output generated by the model

source

Juissie.Generation.generate_with_corpus — Function

function generate_with_corpus(generator::Union{OAIGenerator, Nothing}, corpus::Corpus, query::String, k::Int=5, temperature::Float64=0.7)

Parameters

generator : Union{OAIGenerator, Nothing}
    an initialized generator (e.g. OAIGenerator)
    leaving this as a union with nothing to note that we may want to support other generator types in the future (e.g. HFGenerator, etc.)
corpus : an initialized Corpus object
    the corpus / "vector database" you want to use
query : String
    the main instruction or query string. This is basically your question
k : int
    The number of nearest-neighbor vectors to fetch from the corpus to build your context
temperature : Float64
    controls the stochasticity of the output generated by the model

source

Juissie.Generation.load_OAIGeneratorWithCorpus — Function

function load_OAIGeneratorWithCorpus(corpus_name::String, auth_token::Union{String, Nothing}=nothing)

Loads an existing corpus and uses it to initialize an OAIGeneratorWithCorpus.

Parameters

corpus_name : str
    the name that you want to give the database
auth_token : Union{String, Nothing}
    this is your OPENAI API key. You can either pass it explicitly as a string or leave this argument as nothing. In the latter case, we will look in your environmental variables for "OAI_KEY"

Notes

corpus_name is ordered first because Julia uses positional arguments and auth_token is optional.

When instantiating a new OAIGenerator in an externally-viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses the displayed return value, which ensures that your OAI API key is not inadvertently shared.
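
The documented fallback from an explicit token to the "OAI_KEY" environment variable can be sketched as below. resolve_auth_token is a hypothetical helper, not part of the library, and the token string is a placeholder.

```julia
# Prefer an explicit token; otherwise look up the OAI_KEY environment variable.
function resolve_auth_token(auth_token::Union{String, Nothing}=nothing)
    auth_token !== nothing && return auth_token
    return get(ENV, "OAI_KEY", nothing)   # nothing if the variable is unset
end

# The trailing semicolon keeps a REPL or notebook from echoing the value.
token = resolve_auth_token("sk-placeholder");
```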

source

Juissie.Generation.upsert_chunk_to_generator — Method

function upsert_chunk_to_generator(generator::GeneratorWithCorpus, chunk::String, doc_name::String)

Equivalent to Backend.upsert_chunk, but takes a GeneratorWithCorpus instead of a Corpus.

Parameters

generator : any struct that subtypes GeneratorWithCorpus
    the generator (with corpus) you want to use
chunk : str
    This is the text content of the chunk you want to upsert
doc_name : str
    The name of the document that chunk is from. For instance, if you were upserting all the chunks in an academic paper, doc_name might be the name of that paper

Notes

One would expect Julia's multiple dispatch to allow us to call this upsert_chunk, but not so. The conflict arises in Juissie, where we would have both SemanticSearch and Generation exporting upsert_chunk. This means any uses of it in Juissie must be qualified, and without doing so, neither actually gets defined.
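
The export conflict described in this note can be reproduced in miniature. The module and function names below are hypothetical; the point is that two modules exporting the same name define two unrelated functions, so dispatch cannot merge them.

```julia
module SearchSide
export upsert_item
upsert_item(x::Int) = "SearchSide got an Int"
end

module GenerationSide
export upsert_item
upsert_item(x::String) = "GenerationSide got a String"
end

using .SearchSide, .GenerationSide

# The unqualified name is ambiguous in Main, so this call fails
# (typically with an UndefVarError) instead of dispatching on the argument.
outcome = try
    upsert_item(1)
catch err
    err
end
println(typeof(outcome))

# Qualified calls always work.
println(SearchSide.upsert_item(1))
println(GenerationSide.upsert_item("chunk"))
```

Giving the second function a distinct name, as done here with the _to_generator suffix, avoids the need to qualify every call.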

source

Juissie.Generation.upsert_document_from_url_to_generator — Function

function upsert_document_from_url_to_generator(generator::GeneratorWithCorpus, url::String, doc_name::String, elements::Array{String}=["h1", "h2", "p"])

Equivalent to Backend.upsert_document_from_url, but takes a GeneratorWithCorpus instead of a Corpus.

Parameters

generator : any struct that subtypes GeneratorWithCorpus
    the generator (with corpus) you want to use
url : String
    The url you want to scrape for text
doc_name : str
    The name of the document the content is from
elements : Array{String}
    A list of HTML elements you want to pull the text from

Notes

See note for upsert_chunk_to_generator - same idea.

source

Juissie.Generation.upsert_document_to_generator — Method

function upsert_document_to_generator(generator::GeneratorWithCorpus, doc_text::String, doc_name::String)

Equivalent to Backend.upsert_document, but takes a GeneratorWithCorpus instead of a Corpus.

Parameters

generator : any struct that subtypes GeneratorWithCorpus
    the generator (with corpus) you want to use
doc_text : str
    A long string you want to upsert. We will break this into chunks and upsert each chunk.
doc_name : str
    The name of the document the content is from

Notes

See note for upsert_chunk_to_generator - same idea.

source

Juissie.Generation.OAIGenerator — Type

function OAIGenerator(auth_token::Union{String, Nothing})

Initializes an OAIGenerator struct.

Parameters

auth_token : Union{String, Nothing}
    this is your OPENAI API key. You can either pass it explicitly as a string or leave this argument as nothing. In the latter case, we will look in your environmental variables for "OAI_KEY"

Notes

When instantiating a new OAIGenerator in an externally-viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses the displayed return value, which ensures that your OAI API key is not inadvertently shared.

source

Juissie.Generation.OAIGenerator — Type

struct OAIGenerator

A struct for handling natural language generation via OpenAI's gpt-3.5-turbo completion endpoint.

Attributes

url : String
    the URL of the OpenAI API endpoint
header : Vector{Pair{String, String}}
    key-value pairs representing the HTTP headers for the request
body : Dict{String, Any}
    this is the JSON payload to be sent in the body of the request
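
To give a feel for what the three fields might hold, here is a sketch that assembles them for a chat completion request. Only the field names and types come from the documentation; the concrete header and body contents, and the helper name toy_request_parts, are assumptions. No request is sent.

```julia
# Build (url, header, body) with the documented types; contents are assumed.
function toy_request_parts(auth_token::String, prompt::String,
                           temperature::Float64=0.7)
    url = "https://api.openai.com/v1/chat/completions"
    header = ["Authorization" => "Bearer $auth_token",
              "Content-Type" => "application/json"]
    body = Dict{String, Any}(
        "model" => "gpt-3.5-turbo",
        "messages" => [Dict("role" => "user", "content" => prompt)],
        "temperature" => temperature,
    )
    return url, header, body
end

url, header, body = toy_request_parts("sk-placeholder", "Hello");
println(typeof(header))
println(sort(collect(keys(body))))
```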

Notes

All natural language generation should be done via a "Generator" object of some kind for consistency. In the future, if we decide to host a model locally or something, we might do that via a HFGenerator struct.

When instantiating a new OAIGenerator in an externally-viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses the displayed return value, which ensures that your OAI API key is not inadvertently shared.

source

Juissie.Generation.OAIGeneratorWithCorpus — Type

function OAIGeneratorWithCorpus(auth_token::Union{String, Nothing}=nothing, corpus::Corpus)

Initializes an OAIGeneratorWithCorpus.

Parameters

corpus_name : str or nothing
    the name that you want to give the database
    optional. if left as nothing, we use an in-memory database
auth_token : Union{String, Nothing}
    this is your OPENAI API key. You can either pass it explicitly as a string or leave this argument as nothing. In the latter case, we will look in your environmental variables for "OAI_KEY"
embedder_model_path : str
    a path to a HuggingFace-hosted model
    e.g. "BAAI/bge-small-en-v1.5"
max_seq_len : int
    The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer

Notes

When instantiating a new OAIGenerator in an externally-viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses the displayed return value, which ensures that your OAI API key is not inadvertently shared.

source

Juissie.Generation.OAIGeneratorWithCorpus — Type

struct OAIGeneratorWithCorpus

Like OAIGenerator, but has a corpus attached.

Attributes

url : String
    the URL of the OpenAI API endpoint
header : Vector{Pair{String, String}}
    key-value pairs representing the HTTP headers for the request
body : Dict{String, Any}
    this is the JSON payload to be sent in the body of the request
corpus : an initialized Corpus object
    the corpus / "vector database" you want to use

Notes

When instantiating a new OAIGenerator in an externally-viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses the displayed return value, which ensures that your OAI API key is not inadvertently shared.

source

Juissie.SemanticSearch.Backend.PdfReader.bufferToString! — Method

Extract the contents of the buffer and convert it into a string object

WARNING: This function will clear out the contents of the buffer

Parameters

buff : The buffer to clear; its contents will be returned as a string
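
In Base Julia, the same extract-and-clear behavior is available through take! on an IOBuffer. The sketch below assumes buff is an IOBuffer, which the documentation does not state explicitly.

```julia
buff = IOBuffer()
write(buff, "page one ")
write(buff, "page two")

text = String(take!(buff))     # take! returns the bytes and empties the buffer
println(text)

leftover = String(take!(buff)) # the buffer is now empty
println(isempty(leftover))
```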

source

Juissie.SemanticSearch.Backend.PdfReader.getAllTextInPDF — Function

Extract all the text data from the provided pdf file.

Open the pdf at the provided file location, extract all the text data from it (as far as possible), and return that text data as a vector of strings. Each entry in the result vector is the concatenation of some number of pages in the PDF; 100 pages per entry is the default. For example, getAllTextInPDF(...)[1] will be a long string containing 100 pages' worth of data. The next entry represents the next 100 pages, etc.

NOTE: This function is a "best effort" function, meaning that it will try to extract as many pages as it can. But if there are pages that are invalid, or otherwise can not be properly parsed then they will simply be ignored and not included in the returned strings.

Parameters

fileLocation : The full path to the PDF file to open. This should be relative from where the julia command has been run (not relative to this source file)

pagesPerEntry : How many pages should be collected into the buffer before turning it into an entry in the result vector.
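
The grouping of pages into entries can be sketched on its own, assuming the page texts have already been extracted (one string per page). group_pages is a hypothetical name and no PDF parsing is involved.

```julia
# Concatenate consecutive pages into entries of at most `pages_per_entry` pages.
function group_pages(pages::Vector{String}, pages_per_entry::Int=100)
    return [join(group) for group in Iterators.partition(pages, pages_per_entry)]
end

pages = ["p$(i) " for i in 1:5]
entries = group_pages(pages, 2)
println(entries)
```

The last entry simply holds whatever pages remain, so it may be shorter than the others.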

source

Juissie.SemanticSearch.Backend.PdfReader.getPagesFromPdf — Method

Collect and return all the text data found in the provided page range of the PDF file.

Using the provided PDF handle, loop over all the pages in the range and attempt to extract the text data. All the collected data will be returned.

The specific pages to read are defined by the inclusive range [firstPageInclusive, lastPageInclusive], so both the first and last page are included in the returned string. These values SHOULD be valid (i.e., in the range [1, MaxPageCount]), but error checking will coerce them to a proper range.

Parameters

pdfHandel : The PDF file to extract data from
firstPageInclusive : The first page in the range to read
lastPageInclusive : The last page in the range to read
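
One plausible way to coerce a requested page range into [1, MaxPageCount] is clamping, sketched below. How getPagesFromPdf actually performs its error checking is not documented here, so this is an assumption, and coerce_page_range is a hypothetical name.

```julia
# Clamp both ends into 1:max_page_count and keep the range non-empty.
# Assumes max_page_count >= 1.
function coerce_page_range(first_page::Int, last_page::Int, max_page_count::Int)
    lo = clamp(first_page, 1, max_page_count)
    hi = clamp(last_page, lo, max_page_count)
    return lo:hi
end

println(coerce_page_range(0, 12, 10))   # both ends out of bounds
println(coerce_page_range(5, 3, 10))    # reversed request
```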

source

Juissie.SemanticSearch.Backend.PdfReader.getPagesFromPdf — Method

Collect and return all the text data found in the provided page range of the PDF file.

Using the provided file path, open the PDF, loop over all the pages in the range, and attempt to extract the text data. All the collected data will be returned.

The specific pages to read are defined by the inclusive range [firstPageInclusive, lastPageInclusive], so both the first and last page are included in the returned string. These values SHOULD be valid (i.e., in the range [1, MaxPageCount]), but error checking will coerce them to a proper range.

Parameters

fileLocation : The location of the PDF to read

firstPageInclusive : The first page in the range to read

lastPageInclusive : The last page in the range to read

source

Juissie.SemanticSearch.Backend.PdfReader.getPagesInPDF_All — Method

Extract all the text data from the provided pdf file.

Open the pdf at the provided file location, extract all the text data from it (as far as possible), and return that text data as a vector of strings. Each entry in the result vector is the data from a single page of the PDF file.

NOTE: This function is a "best effort" function, meaning that it will try to extract as many pages as it can. But if there are pages that are invalid, or otherwise can not be properly parsed then they will simply be ignored and not included in the returned strings.

Parameters

fileLocation : The full path to the PDF file to open. This should be relative from where the julia command has been run (not relative to this source file)

source

Juissie.Generation.SemanticSearch.Backend.TxtReader.appendToFile — Method

Append the given contents into a file specified at filename. A new file will be created if filename doesn't already exist.

NOTE: No '\n' newline character will be appended. It is the caller's responsibility to decide whether the contents should end with a '\n' newline character.

Parameters

filename : String. The name of the file to open. Relative file paths are evaluated from the directory where the julia command was run, typically the root level of the project.

contents : String. The exact text to append into the file.
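A minimal sketch of the append semantics described above, using only Julia's standard `open` in append mode. The name `appendText` is hypothetical and this is not the library's implementation.

```julia
# Sketch of append semantics; `appendText` is a hypothetical name.
function appendText(filename::String, contents::String)
    # Mode "a" creates the file if it is missing and writes at the end otherwise.
    open(filename, "a") do io
        write(io, contents)   # written verbatim: no newline is added
    end
    return nothing
end

path = joinpath(mktempdir(), "notes.txt")
appendText(path, "first line\n")   # the caller supplies the newline
appendText(path, "second line")
read(path, String)                 # "first line\nsecond line"
```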

source

Juissie.Generation.SemanticSearch.Backend.TxtReader.getAllTextInFile — Method

Open the provided filename, load all the data into memory, and return it. This function also manages opening and closing the file (open(...), close(...)) properly. If there is an error opening or reading the file, the empty string is returned.

Parameters

filename : String. The name of the file to open. Relative file paths are evaluated from the directory where the julia command was run, typically the root level of the project.

Returns : String. The entire contents of the file, or an empty string if there was an issue.

source

Juissie.Generation.SemanticSearch.Backend.TxtReader.splitFileIntoParts — Method

A simple script that allows a user to split a large file into multiple smaller files. This will create splits child files, each with a file size of roughly 1/splits of the original target file.

Parameters

fileToSplit : String. The name of the file to read and split into multiple parts. If an absolute file path is given, it is used as-is. Otherwise, relative file paths are evaluated from the location that the julia command was run from (typically the root level of this project).

outputFileNameBase : String. The template for the names of the child split-out files. Each split-out file will have the format <outputFileNameBase>_<#>, where # starts at 1 and increments by 1 for each subsequent file. There will be splits child files.

splits : Int. How many child files should be created.

source

Juissie.SemanticSearch.Embedding.embed — Method

function embed(embedder::Embedder, text::String)::AbstractVector

Embeds a textual sequence using a provided model

Parameters

embedder : Embedder an initialized Embedder struct text : String the text sequence you want to embed

Notes

This is sort of like a class method for the Embedder

Julia has something called multiple dispatch that can be used to make this cleaner, but I'm going to handle that at a later time.
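As an illustration of the multiple-dispatch idea mentioned in the note, here is a minimal sketch in which the method is chosen from the model type. Every name in it (`AbstractModel`, `BertLikeModel`, `OtherModel`, `SketchEmbedder`, `embedSketch`) is hypothetical, and the functions return labels rather than real embeddings.

```julia
# Hypothetical model types standing in for real model objects.
abstract type AbstractModel end
struct BertLikeModel <: AbstractModel end
struct OtherModel <: AbstractModel end

# Hypothetical, simplified stand-in for an Embedder, parameterized by model type.
struct SketchEmbedder{M<:AbstractModel}
    model::M
end

# One method per model type; Julia selects the method from the argument types.
embedSketch(e::SketchEmbedder{BertLikeModel}, text::String) = "bert path: " * text
embedSketch(e::SketchEmbedder{OtherModel}, text::String) = "other path: " * text

embedSketch(SketchEmbedder(BertLikeModel()), "hello")   # "bert path: hello"
```

With this layout, supporting a new model family means adding a method rather than branching inside one function.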

source

Juissie.SemanticSearch.Embedding.embed_from_bert — Method

function embed_from_bert(embedder::Embedder, text::String)

Embeds a textual sequence using a provided Bert model

Parameters

embedder : Embedder. An initialized Embedder struct. The associated model and tokenizer should be Bert-specific.

text : String. The text sequence you want to embed.

return : cls_embedding. The result of passing the text through the encoder, through the model, and after stripping.

source

Juissie.SemanticSearch.Embedding.Embedder — Type

struct Embedder

A struct for holding a model and a tokenizer

Attributes

tokenizer : a tokenizer object, e.g. BertTextEncoder maps your string to tokens the model can understand model : a model object, e.g. HGFBertModel the actual model architecture and weights to perform inference with

Notes

You can get class-like behavior in Julia by defining a struct and functions that operate on that struct.

source

Juissie.SemanticSearch.Embedding.Embedder — Method

function Embedder(model_name::String)

Function to initialize an Embedder struct from a HuggingFace model path.

Parameters

model_name : String a path to a HuggingFace-hosted model e.g. "BAAI/bge-small-en-v1.5"

source

Juissie.SemanticSearch.TextUtils.chunkify — Function

function chunkify(text::String, tokenizer, sequence_length::Int=512)

Splits a provided text (e.g. paragraph) into chunks, each containing as many sentences as possible while keeping the chunk's token length below the sequence_length. This ensures that each chunk can be fully encoded by the embedder.

Parameters

text : String The text you want to split into chunks. tokenizer : a tokenizer object, e.g. BertTextEncoder The tokenizer you will be using sequence_length : Int The maximum number of tokens per chunk. Ideally, should correspond to the max sequence length of the tokenizer

Example Usage

julia> chunkify(
    """Hold me closer, tiny dancer. Count the headlights on the highway. Lay me down in sheets of linen. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked.
    """,
    corpus.embedder.tokenizer,
    20
)

4-element Vector{Any}:
"Hold me closer, tiny dancer. Count the headlights on the highway."
"Lay me down in sheets of linen."
"Peter Piper picked a peck of pickled peppers."
"A peck of pickled peppers Peter Piper picked."
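The greedy packing behind this kind of output can be sketched as follows. This is not the library's implementation: `chunkifySketch` is a hypothetical name, the input is assumed to be already split into sentences, and `countTokens` counts whitespace-separated words as a stand-in for a real tokenizer (which would usually report more tokens).

```julia
# Stand-in token counter: counts whitespace-separated words (an assumption).
countTokens(s::AbstractString) = length(split(s))

# Greedily pack whole sentences into chunks of fewer than `sequence_length` tokens.
function chunkifySketch(sentences::Vector{String}, sequence_length::Int)
    chunks = String[]
    current = ""
    for sentence in sentences
        candidate = isempty(current) ? sentence : current * " " * sentence
        if isempty(current) || countTokens(candidate) < sequence_length
            current = candidate          # still fits (or a lone oversized sentence)
        else
            push!(chunks, current)       # close the chunk and start a new one
            current = sentence
        end
    end
    isempty(current) || push!(chunks, current)
    return chunks
end

sentences = ["Hold me closer, tiny dancer.",
             "Count the headlights on the highway.",
             "Lay me down in sheets of linen."]
result = chunkifySketch(sentences, 12)   # 2 chunks: sentences 1-2 together, then 3
```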

source

Juissie.SemanticSearch.TextUtils.get_files_path — Method

function get_files_path()

Simple function to return the path to the files subdirectory.

Example Usage

testbinpath = get_files_path()*"test.bin"

source

Juissie.SemanticSearch.TextUtils.read_html_url — Function

read_html_url(url::String, elements::Array{String})

Returns a string of text from the provided HTML elements on a webpage.

Parameters

url : String the url you want to read elements : Array{String} html elements to look for in the web page, e.g. ["h1", "p"].

Notes

Defaults to extracting headers and paragraphs

source

Juissie.SemanticSearch.TextUtils.sentence_splitter — Method

function sentence_splitter(text::String)

Uses basic regex to divide a provided text (e.g. paragraph) into sentences.

Parameters

text : String The text you want to split into sentences.

Notes

Regex is hard to read. The first part looks for spaces following end-of-sentence punctuation. The second part matches at the end of the string.

Regex in Julia uses an r identifier prefix.
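To make the first part of that description concrete, here is one possible pattern that splits on whitespace following end-of-sentence punctuation. It is an illustrative assumption, not the regex the library uses, and `splitSentencesSketch` is a hypothetical name.

```julia
function splitSentencesSketch(text::String)
    # (?<=[.!?]) is a lookbehind: the split point must directly follow ., ! or ?
    # \s+ consumes the whitespace between sentences.
    parts = split(strip(text), r"(?<=[.!?])\s+"; keepempty=false)
    return String.(parts)
end

splitSentencesSketch("Hold me closer. Count the headlights! Ready?")
# 3 sentences: "Hold me closer.", "Count the headlights!", "Ready?"
```

A pattern this simple also splits after abbreviations such as "e.g.", which is one reason sentence splitting by regex is only approximate.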

References

https://www.geeksforgeeks.org/regular-expressions-in-julia/

source

Juissie.Generation.SemanticSearch.Backend.PdfReader.bufferToString! — Method

Extract the contents of the buffer and convert it into a string object

WARNING: This function will clear out the contents of the buffer

Parameters

buff : The buffer to clear; its contents will be returned as a string
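Assuming the buffer is a standard `IOBuffer` (the entry does not say), the drain-and-clear behavior can be sketched with `take!`. The name `drainBuffer!` is hypothetical.

```julia
# Hypothetical sketch: return the buffer's contents and leave the buffer empty.
function drainBuffer!(buff::IOBuffer)::String
    return String(take!(buff))   # take! hands over the bytes and resets the buffer
end

buff = IOBuffer()
write(buff, "page one text")
text = drainBuffer!(buff)        # "page one text"
leftover = drainBuffer!(buff)    # "" because the first call cleared the buffer
```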

source

Juissie.Generation.SemanticSearch.Backend.PdfReader.getAllTextInPDF — Function

Extract all the text data from the provided pdf file.

Open the pdf at the provided file location, extract all the text data from it (as far as possible), and return that text data as a vector of strings. Each entry in the result vector is the concatenation of some number of pages in the PDF; 100 pages per entry is the default. For example, getAllTextInPDF(...)[1] will be a long string containing 100 pages' worth of data. The next entry represents the next 100 pages, etc.

NOTE: This function is a "best effort" function, meaning that it will try to extract as many pages as it can. But if there are pages that are invalid, or otherwise can not be properly parsed then they will simply be ignored and not included in the returned strings.

Parameters

fileLocation : The full path to the PDF file to open. This should be relative from where the julia command has been run (not relative to this source file)

pagesPerEntry : How many pages should be collected into the buffer before turning it into an entry in the result vector.
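The grouping of pages into entries can be sketched independently of any PDF library. Here `groupPagesSketch` is a hypothetical name and the per-page strings are assumed to have been extracted already.

```julia
# Hypothetical sketch: join every `pagesPerEntry` page strings into one entry.
function groupPagesSketch(pages::Vector{String}, pagesPerEntry::Int=100)
    return [join(group) for group in Iterators.partition(pages, pagesPerEntry)]
end

pages = ["p1 ", "p2 ", "p3 ", "p4 ", "p5 "]
entries = groupPagesSketch(pages, 2)   # 3 entries: pages 1-2, pages 3-4, page 5
entries[1]                             # "p1 p2 " (Julia vectors are 1-indexed)
```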

source

Juissie.Generation.SemanticSearch.Backend.PdfReader.getPagesFromPdf — Method

Collect and return all the text data found in the pdf file found in the provided page range.

Using the provided PDF handle, loop over all the pages in the range and attempt to extract the text data. All the collected data will be returned.

The pages to read are defined by [firstPageInclusive, lastPageInclusive], an inclusive range: both the first and the last page are included in the returned string. The bounds SHOULD be valid (i.e., in the range [1, MaxPageCount]), but error checking will coerce the values to a proper range.

Parameters

pdfHandel : The PDF file to extract data from

firstPageInclusive : The first page in the range to read

lastPageInclusive : The last page in the range to read

source

Juissie.Generation.SemanticSearch.Backend.PdfReader.getPagesFromPdf — Method

Collect and return all the text data found in the pdf file found in the provided page range.

Using the provided file path, open the PDF and loop over all the pages in the range and attempt to extract the text data. All the collected data will be returned.

The pages to read are defined by [firstPageInclusive, lastPageInclusive], an inclusive range: both the first and the last page are included in the returned string. The bounds SHOULD be valid (i.e., in the range [1, MaxPageCount]), but error checking will coerce the values to a proper range.

Parameters

fileLocation : The location of the PDF to read

firstPageInclusive : The first page in the range to read

lastPageInclusive : The last page in the range to read

source

Juissie.Generation.SemanticSearch.Backend.PdfReader.getPagesInPDF_All — Method

Extract all the text data from the provided pdf file.

Open the pdf at the provided file location, extract all the text data from it (as far as possible), and return that text data as a vector of strings. Each entry in the result vector is the data from a single page of the PDF file.

NOTE: This function is a "best effort" function, meaning that it will try to extract as many pages as it can. But if there are pages that are invalid, or otherwise can not be properly parsed then they will simply be ignored and not included in the returned strings.

Parameters

fileLocation : The full path to the PDF file to open. This should be relative from where the julia command has been run (not relative to this source file)

source

Juissie.SemanticSearch.Backend.Embedding.embed — Method

function embed(embedder::Embedder, text::String)::AbstractVector

Embeds a textual sequence using a provided model

Parameters

embedder : Embedder an initialized Embedder struct text : String the text sequence you want to embed

Notes

This is sort of like a class method for the Embedder

Julia has something called multiple dispatch that can be used to make this cleaner, but I'm going to handle that at a later time.

source

Juissie.SemanticSearch.Backend.Embedding.embed_from_bert — Method

function embed_from_bert(embedder::Embedder, text::String)

Embeds a textual sequence using a provided Bert model

Parameters

embedder : Embedder. An initialized Embedder struct. The associated model and tokenizer should be Bert-specific.

text : String. The text sequence you want to embed.

return : cls_embedding. The result of passing the text through the encoder, through the model, and after stripping.

source

Juissie.SemanticSearch.Backend.Embedding.Embedder — Type

struct Embedder

A struct for holding a model and a tokenizer

Attributes

tokenizer : a tokenizer object, e.g. BertTextEncoder maps your string to tokens the model can understand model : a model object, e.g. HGFBertModel the actual model architecture and weights to perform inference with

Notes

You can get class-like behavior in Julia by defining a struct and functions that operate on that struct.

source

Juissie.SemanticSearch.Backend.Embedding.Embedder — Method

function Embedder(model_name::String)

Function to initialize an Embedder struct from a HuggingFace model path.

Parameters

model_name : String a path to a HuggingFace-hosted model e.g. "BAAI/bge-small-en-v1.5"

source

Juissie.SemanticSearch.Backend.TextUtils.chunkify — Function

function chunkify(text::String, tokenizer, sequence_length::Int=512)

Splits a provided text (e.g. paragraph) into chunks, each containing as many sentences as possible while keeping the chunk's token length below the sequence_length. This ensures that each chunk can be fully encoded by the embedder.

Parameters

text : String The text you want to split into chunks. tokenizer : a tokenizer object, e.g. BertTextEncoder The tokenizer you will be using sequence_length : Int The maximum number of tokens per chunk. Ideally, should correspond to the max sequence length of the tokenizer

Example Usage

julia> chunkify(
    """Hold me closer, tiny dancer. Count the headlights on the highway. Lay me down in sheets of linen. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked.
    """,
    corpus.embedder.tokenizer,
    20
)

4-element Vector{Any}:
"Hold me closer, tiny dancer. Count the headlights on the highway."
"Lay me down in sheets of linen."
"Peter Piper picked a peck of pickled peppers."
"A peck of pickled peppers Peter Piper picked."

source

Juissie.SemanticSearch.Backend.TextUtils.get_files_path — Method

function get_files_path()

Simple function to return the path to the files subdirectory.

Example Usage

testbinpath = get_files_path()*"test.bin"

source

Juissie.SemanticSearch.Backend.TextUtils.read_html_url — Function

read_html_url(url::String, elements::Array{String})

Returns a string of text from the provided HTML elements on a webpage.

Parameters

url : String the url you want to read elements : Array{String} html elements to look for in the web page, e.g. ["h1", "p"].

Notes

Defaults to extracting headers and paragraphs

source

Juissie.SemanticSearch.Backend.TextUtils.sentence_splitter — Method

function sentence_splitter(text::String)

Uses basic regex to divide a provided text (e.g. paragraph) into sentences.

Parameters

text : String The text you want to split into sentences.

Notes

Regex is hard to read. The first part looks for spaces following end-of-sentence punctuation. The second part matches at the end of the string.

Regex in Julia uses an r identifier prefix.

References

https://www.geeksforgeeks.org/regular-expressions-in-julia/

source

Juissie.Generation.SemanticSearch.Backend.Embedding.embed — Method

function embed(embedder::Embedder, text::String)::AbstractVector

Embeds a textual sequence using a provided model

Parameters

embedder : Embedder an initialized Embedder struct text : String the text sequence you want to embed

Notes

This is sort of like a class method for the Embedder

Julia has something called multiple dispatch that can be used to make this cleaner, but I'm going to handle that at a later time.

source

Juissie.Generation.SemanticSearch.Backend.Embedding.embed_from_bert — Method

function embed_from_bert(embedder::Embedder, text::String)

Embeds a textual sequence using a provided Bert model

Parameters

embedder : Embedder. An initialized Embedder struct. The associated model and tokenizer should be Bert-specific.

text : String. The text sequence you want to embed.

return : cls_embedding. The result of passing the text through the encoder, through the model, and after stripping.

source

Juissie.Generation.SemanticSearch.Backend.Embedding.Embedder — Type

struct Embedder

A struct for holding a model and a tokenizer

Attributes

tokenizer : a tokenizer object, e.g. BertTextEncoder maps your string to tokens the model can understand model : a model object, e.g. HGFBertModel the actual model architecture and weights to perform inference with

Notes

You can get class-like behavior in Julia by defining a struct and functions that operate on that struct.

source

Juissie.Generation.SemanticSearch.Backend.Embedding.Embedder — Method

function Embedder(model_name::String)

Function to initialize an Embedder struct from a HuggingFace model path.

Parameters

model_name : String a path to a HuggingFace-hosted model e.g. "BAAI/bge-small-en-v1.5"

source

Juissie.Generation.SemanticSearch.TextUtils.chunkify — Function

function chunkify(text::String, tokenizer, sequence_length::Int=512)

Splits a provided text (e.g. paragraph) into chunks, each containing as many sentences as possible while keeping the chunk's token length below the sequence_length. This ensures that each chunk can be fully encoded by the embedder.

Parameters

text : String The text you want to split into chunks. tokenizer : a tokenizer object, e.g. BertTextEncoder The tokenizer you will be using sequence_length : Int The maximum number of tokens per chunk. Ideally, should correspond to the max sequence length of the tokenizer

Example Usage

julia> chunkify(
    """Hold me closer, tiny dancer. Count the headlights on the highway. Lay me down in sheets of linen. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked.
    """,
    corpus.embedder.tokenizer,
    20
)

4-element Vector{Any}:
"Hold me closer, tiny dancer. Count the headlights on the highway."
"Lay me down in sheets of linen."
"Peter Piper picked a peck of pickled peppers."
"A peck of pickled peppers Peter Piper picked."

source

Juissie.Generation.SemanticSearch.TextUtils.get_files_path — Method

function get_files_path()

Simple function to return the path to the files subdirectory.

Example Usage

testbinpath = get_files_path()*"test.bin"

source

Juissie.Generation.SemanticSearch.TextUtils.read_html_url — Function

read_html_url(url::String, elements::Array{String})

Returns a string of text from the provided HTML elements on a webpage.

Parameters

url : String the url you want to read elements : Array{String} html elements to look for in the web page, e.g. ["h1", "p"].

Notes

Defaults to extracting headers and paragraphs

source

Juissie.Generation.SemanticSearch.TextUtils.sentence_splitter — Method

function sentence_splitter(text::String)

Uses basic regex to divide a provided text (e.g. paragraph) into sentences.

Parameters

text : String The text you want to split into sentences.

Notes

Regex is hard to read. The first part looks for spaces following end-of-sentence punctuation. The second part matches at the end of the string.

Regex in Julia uses an r identifier prefix.

References

https://www.geeksforgeeks.org/regular-expressions-in-julia/

source

Generation.SemanticSearch.Backend.index — Method

function index(corpus::Corpus)

Constructs the HNSW vector index from the data available. If the corpus has a corpus_name, then we also save the new index to disk. Must be run before searching.

Parameters

corpus : an initialized Corpus object the corpus / "vector database" you want to use

source

Generation.SemanticSearch.Backend.load_corpus — Method

function load_corpus(corpus_name)

Loads an already-initialized corpus from its associated "artifacts" (relational database, vector index, and informational json).

Parameters

corpus_name : str the name of your EXISTING vector database

source

Generation.SemanticSearch.Backend.search — Function

function search(corpus::Corpus, query::String, k::Int=5)

Performs approximate nearest neighbor search to find the items in the vector index closest to the query.

Parameters

corpus : an initialized Corpus object. The corpus / "vector database" you want to use.

query : str. The text you want to search, e.g. your question. We embed this and perform semantic retrieval against the vector db.

k : int. The number of nearest-neighbor vectors to fetch.
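For intuition about what the approximate search is approximating, here is an exact brute-force nearest-neighbor sketch. It does not use HNSW and is not the library's code: `exactNearest` is a hypothetical name, and Euclidean distance is an assumption (the metric is not stated here).

```julia
using LinearAlgebra: norm

# Exact k-nearest-neighbor search by brute force (Euclidean distance assumed).
function exactNearest(vectors::Vector{Vector{Float64}}, query::Vector{Float64}, k::Int=5)
    distances = [norm(v - query) for v in vectors]
    order = sortperm(distances)                  # indices from closest to farthest
    idxs = order[1:min(k, length(order))]
    return idxs, distances[idxs]
end

vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
idxs, dists = exactNearest(vectors, [0.9, 0.1], 2)   # idxs == [2, 1]
```

An approximate index trades some of this exactness for speed on large collections, which is why it has to be built before searching.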

source

Generation.SemanticSearch.Backend.upsert_chunk — Method

function upsert_chunk(corpus::Corpus, chunk::String, doc_name::String)

Given a new chunk of text, get embedding and insert into our vector DB. Not actually a full upsert, because we have to reindex later. Process:

Generate an embedding for the text
Insert metadata into database
Increment idx counter

Parameters

corpus : an initialized Corpus object. The corpus / "vector database" you want to use.

chunk : str. The text content of the chunk you want to upsert.

doc_name : str. The name of the document that chunk is from. For instance, if you were upserting all the chunks in an academic paper, doc_name might be the name of that paper.

Notes

If the vectors have been indexed, this de-indexes them (i.e., they need to be indexed again). Currently, we handle this by setting hnsw to nothing so that it gets caught later in search.
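The process and the de-indexing note can be sketched with a toy corpus. Everything here is hypothetical: `SketchCorpus` is not the library's Corpus struct, `fakeEmbed` is a placeholder rather than a real model, and a plain vector stands in for the relational database.

```julia
# Hypothetical, simplified stand-in for a corpus.
mutable struct SketchCorpus
    data::Vector{Vector{Float64}}             # embeddings waiting to be indexed
    rows::Vector{Tuple{Int,String,String}}    # (idx, doc_name, chunk) metadata
    hnsw::Union{Nothing,Symbol}               # placeholder for a built index
    next_idx::Int
end

# Placeholder embedding: NOT a real model, just enough to make the sketch runnable.
fakeEmbed(chunk::String) = Float64[length(chunk), count(isspace, chunk)]

function upsertChunkSketch!(corpus::SketchCorpus, chunk::String, doc_name::String)
    push!(corpus.data, fakeEmbed(chunk))                      # 1. generate an embedding
    push!(corpus.rows, (corpus.next_idx, doc_name, chunk))    # 2. insert metadata
    corpus.next_idx += 1                                      # 3. increment idx counter
    corpus.hnsw = nothing                                     # the index is now stale
    return corpus
end

# A later search can detect the stale index and refuse to run.
function searchSketch(corpus::SketchCorpus)
    corpus.hnsw === nothing && error("corpus must be indexed before searching")
    return corpus.hnsw
end

corpus = SketchCorpus(Vector{Float64}[], Tuple{Int,String,String}[], :built, 1)
upsertChunkSketch!(corpus, "Socrates taught Plato.", "greek_philosophers")
corpus.next_idx   # 2
corpus.hnsw       # nothing
```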

source

Generation.SemanticSearch.Backend.upsert_document — Method

function upsert_document(corpus::Corpus, doc_text::String, doc_name::String)

Upsert a whole document (i.e., long string). Does so by splitting the document into appropriately-sized chunks so no chunk exceeds the embedder's tokenization max sequence length, while prioritizing sentence endings.

Parameters

corpus : an initialized Corpus object. The corpus / "vector database" you want to use.

doc_text : str. A long string you want to upsert. We will break this into chunks and upsert each chunk.

doc_name : str. The name of the document the content is from.

source

Generation.SemanticSearch.Backend.upsert_document — Method

function upsert_document(corpus::Corpus, documents::Vector{String}, doc_name::String)

Upsert a collection of documents (i.e., a vector of long strings). Does so by upserting each entry of the provided documents vector, which in turn chunkifies each document further into appropriately sized chunks.

See the upsert_document(...) above for more details

Parameters

corpus : an initialized Corpus object the corpus / "vector database" you want to use documents : Vector{String} a collection of long strings to upsert. doc_name : str The name of the document the content is from

source

Generation.SemanticSearch.Backend.upsert_document_from_pdf — Method

function upsert_document_from_pdf(corpus::Corpus, filePath::String, doc_name::String)

Upsert all the data in a PDF file into the provided corpus. See the upsert_document(...) above for more details.

Parameters

corpus : an initialized Corpus object the corpus / "vector database" you want to use filePath : String The path to the PDF file to read doc_name : str The name of the document the content is from

source

Generation.SemanticSearch.Backend.upsert_document_from_txt — Method

function upsert_document_from_txt(corpus::Corpus, filePath::String, doc_name::String)

Upsert all the data from the text file into the provided corpus.

Parameters

corpus : an initialized Corpus object the corpus / "vector database" you want to use filePath : String The path to the txt file to read doc_name : str The name of the document the content is from

source

Generation.SemanticSearch.Backend.upsert_document_from_url — Function

function upsert_document_from_url(corpus::Corpus, url::String, doc_name::String, elements::Array{String}=["h1", "h2", "p"])

Extracts element-tagged text from HTML and upserts as a document.

Parameters

corpus : an initialized Corpus object the corpus / "vector database" you want to use url : String The url you want to scrape for text doc_name : str The name of the document the content is from elements : Array{String} A list of HTML elements you want to pull the text from

source

Generation.SemanticSearch.Backend.Corpus — Type

struct Corpus

Basically a vector database. It will have these attributes:

a relational database (SQLite)
a vector index (HNSW)
an embedder (via Embedding.jl)

Attributes

corpus_name : String or Nothing. The name of your corpus, used to access saved corpuses. If Nothing, we can't save/load and everything will be in-memory.

db : a SQLite.DB connection object. A real relational database that stores metadata (e.g. chunk text, doc name).

hnsw : Hierarchical Navigable Small World object. Our searchable vector index.

embedder : Embedder. An initialized Embedder struct.

max_seq_len : int. The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer.

data : Vector{Any}. The embeddings get stored here before we create the vector index.

next_idx : int. Stores the index we'll use for the next-upserted chunk.

Notes

The struct is mutable because we want to be able to change things like incrementing next_idx.

source

Generation.SemanticSearch.Backend.Corpus — Type

function Corpus(corpus_name::String, embedder_model_path::String="BAAI/bge-small-en-v1.5")

Initializes a Corpus struct.

In particular, does the following:

Initializes an embedder object
Creates a SQLite database with the corpus name. It should have:

row-wise primary key uuid
doc_name representing the parent document
chunk text

We can add more metadata later, if desired

Parameters

corpus_name : str or nothing. The name that you want to give the database. Optional: if left as nothing, we use an in-memory database.

embedder_model_path : str. A path to a HuggingFace-hosted model, e.g. "BAAI/bge-small-en-v1.5".

max_seq_len : int. The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer.

source

Generation.build_full_query — Function

function build_full_query(query::String, context::OptionalContext=nothing)

Given a query and a list of contextual chunks, construct a full query incorporating both.

Parameters

query : String the main instruction or query string context : OptionalContext, which is Union{Vector{String}, Nothing} optional list of chunks providing additional context for the query

Notes

We base our prompt off the Alpaca prompt, found here: https://github.com/tatsu-lab/stanford_alpaca with minor modifications that reflect our response preferences.
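A rough sketch of how a query and optional context might be combined, using the section headers of the Alpaca template. The exact wording and modifications used by the library are not shown here, so treat the strings and the name `buildFullQuerySketch` as assumptions.

```julia
# Hypothetical sketch: the context chunks, when present, go into an "Input" section.
function buildFullQuerySketch(query::String, context::Union{Vector{String},Nothing}=nothing)
    if context === nothing || isempty(context)
        return "### Instruction:\n" * query * "\n\n### Response:\n"
    end
    return "### Instruction:\n" * query *
           "\n\n### Input:\n" * join(context, "\n") *
           "\n\n### Response:\n"
end

prompt = buildFullQuerySketch("Who taught Plato?", ["Socrates taught Plato."])
print(prompt)
```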

source

Generation.check_oai_key_format — Method

function check_oai_key_format(key::String)

Uses regex to check if a provided string is in the expected format of an OpenAI API Key

Parameters

key : String the key you want to check

Notes

See here for more on the regex:

https://en.wikibooks.org/wiki/Introducing_Julia/Strings_and_characters#Finding_and_replacing_things_inside_strings

Uses format rule provided here:

https://github.com/secretlint/secretlint/issues/676
https://community.openai.com/t/what-are-the-valid-characters-for-the-apikey/288643

Note that this only checks the key format, not whether the key is valid or has not been revoked.
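A format check of this kind can be sketched with `occursin` and a regex. The pattern below ("sk-" followed by 48 letters or digits) is an assumption for illustration only; consult the links above for the rule the function actually applies. `looksLikeKey` is a hypothetical name.

```julia
# Assumed pattern: "sk-" followed by exactly 48 ASCII letters or digits.
const KEY_PATTERN_SKETCH = r"^sk-[A-Za-z0-9]{48}$"

# Checks the shape of the string only; it cannot tell whether the key is live.
looksLikeKey(key::String) = occursin(KEY_PATTERN_SKETCH, key)

looksLikeKey("sk-" * repeat("a", 48))   # true
looksLikeKey("sk-too-short")            # false
```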

source

Generation.generate — Function

generate(generator::Union{OAIGenerator, Nothing}, query::String, context::OptionalContext=nothing, temperature::Float64=0.7)

Generate a response based on a given query and optional context using the specified OAIGenerator. This function constructs a full query, sends it to the OpenAI API, and returns the generated response.

Parameters

generator : Union{OAIGenerator, Nothing}. An initialized generator (e.g. OAIGenerator). This is left as a union with Nothing to note that we may want to support other generator types in the future (e.g. HFGenerator).

query : String. The main query string. This is basically your question.

context : OptionalContext, which is Union{Vector{String}, Nothing}. Optional list of contextual chunk strings to provide the generator additional context for the query. Ultimately, these will come from our vector DB.

temperature : Float64. Controls the stochasticity of the output generated by the model.

source

Generation.generate_with_corpus — Function

function generate_with_corpus(generator::Union{OAIGenerator, Nothing}, corpus::Corpus, query::String, k::Int=5, temperature::Float64=0.7)

Parameters

generator : Union{OAIGenerator, Nothing}. An initialized generator (e.g. OAIGenerator). This is left as a union with Nothing to note that we may want to support other generator types in the future (e.g. HFGenerator).

corpus : an initialized Corpus object. The corpus / "vector database" you want to use.

query : String. The main instruction or query string. This is basically your question.

k : int. The number of nearest-neighbor vectors to fetch from the corpus to build your context.

temperature : Float64. Controls the stochasticity of the output generated by the model.

source

Generation.load_OAIGeneratorWithCorpus — Function

function load_OAIGeneratorWithCorpus(corpus_name::String, auth_token::Union{String, Nothing}=nothing)

Loads an existing corpus and uses it to initialize an OAIGeneratorWithCorpus

Parameters

corpus_name : str. The name that you want to give the database.

auth_token : Union{String, Nothing}. This is your OPENAI API key. You can either pass it explicitly as a string or leave this argument as nothing. In the latter case, we will look in your environment variables for "OAI_KEY".

Notes

corpus_name is ordered first because Julia uses positional arguments and auth_token is optional.

When instantiating a new OAIGenerator in an externally-viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greekphilosophers"); so that the result is not printed and your OAI API key is not inadvertently shared.

source

Generation.load_OllamaGeneratorWithCorpus — Function

function load_OllamaGeneratorWithCorpus(corpus_name::String, model_name::String = "mistral:7b-instruct")

Loads an existing corpus and uses it to initialize an OllamaGeneratorWithCorpus

Parameters

corpus_name : str. The name that you want to give the database.

model_name : String. An Ollama model tag (see https://ollama.com/library). Defaults to mistral 7b instruct.

Notes

corpus_name is ordered first because Julia uses positional arguments and model_name is optional.

source

Generation.upsert_chunk_to_generator — Method

function upsert_chunk_to_generator(generator::GeneratorWithCorpus, chunk::String, doc_name::String)

Equivalent to Backend.upsert_chunk, but takes a GeneratorWithCorpus instead of a Corpus.

Parameters

generator : any struct that subtypes GeneratorWithCorpus. The generator (with corpus) you want to use.

chunk : str. The text content of the chunk you want to upsert.

doc_name : str. The name of the document that chunk is from. For instance, if you were upserting all the chunks in an academic paper, doc_name might be the name of that paper.

Notes

One would expect Julia's multiple dispatch to allow us to call this upsert_chunk, but not so. The conflict arises in Juissie, where we would have both SemanticSearch and Generation exporting upsert_chunk. This means any use of it in Juissie must be qualified; without qualification, neither actually gets defined.

source

Generation.upsert_document_from_url_to_generator — Function

function upsert_document_from_url_to_generator(generator::GeneratorWithCorpus, url::String, doc_name::String, elements::Array{String}=["h1", "h2", "p"])

Equivalent to Backend.upsert_document_from_url, but takes a GeneratorWithCorpus instead of a Corpus.

Parameters

generator : any struct that subtypes GeneratorWithCorpus the generator (with corpus) you want to use url : String The url you want to scrape for text doc_name : str The name of the document the content is from elements : Array{String} A list of HTML elements you want to pull the text from

Notes

See note for upsert_chunk_to_generator; same idea.

source

Generation.upsert_document_to_generator — Method

function upsert_document_to_generator(generator::GeneratorWithCorpus, doc_text::String, doc_name::String)

Equivalent to Backend.upsert_document, but takes a GeneratorWithCorpus instead of a Corpus.

Parameters

generator : any struct that subtypes GeneratorWithCorpus the generator (with corpus) you want to use doc_text : str A long string you want to upsert. We will break this into chunks and upsert each chunk. doc_name : str The name of the document the content is from

Notes

See the note for upsert_chunk_to_generator; the same reasoning applies.
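The wrapper pattern used by this function (accept any GeneratorWithCorpus subtype, then chunk the document and upsert each chunk into the attached corpus) might look roughly like the sketch below. Every name prefixed with Sketch or sketch_ is hypothetical, the corpus is a plain in-memory list, and the chunker is a naive sentence split rather than the package's chunkify.

```julia
# Hypothetical stand-ins; none of these names come from Juissie.
abstract type SketchGeneratorWithCorpus end

struct SketchCorpus
    rows::Vector{Tuple{String,String}}  # (doc_name, chunk) pairs
end

struct SketchGenerator <: SketchGeneratorWithCorpus
    corpus::SketchCorpus
end

# Corpus-level operation: split the document into chunks and store each one.
function sketch_upsert_document(corpus::SketchCorpus, doc_text::String, doc_name::String)
    for chunk in split(strip(doc_text), r"(?<=[.!?])\s+")
        push!(corpus.rows, (doc_name, String(chunk)))
    end
    return corpus
end

# Generator-level wrapper: accepts any subtype and forwards to its corpus.
function sketch_upsert_document_to_generator(
    generator::SketchGeneratorWithCorpus, doc_text::String, doc_name::String
)
    return sketch_upsert_document(generator.corpus, doc_text, doc_name)
end

generator = SketchGenerator(SketchCorpus(Tuple{String,String}[]))
sketch_upsert_document_to_generator(
    generator, "Plato wrote dialogues. Aristotle wrote treatises.", "philosophers"
)
```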

source

Generation.OAIGenerator — Type

function OAIGenerator(auth_token::Union{String, Nothing})

Initializes an OAIGenerator struct.

Parameters

auth_token :: Union{String, Nothing} this is your OpenAI API key. You can either pass it explicitly as a string or leave this argument as nothing. In the latter case, we will look in your environment variables for "OAI_KEY"

Notes

When instantiating a new OAIGenerator in an externally viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses display of the returned struct, so your OAI API key is not inadvertently shared.
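The fallback described for auth_token could be implemented along these lines. This is a sketch under assumptions: sketch_resolve_auth_token is a hypothetical helper, not a Juissie function, and the environment variable is assumed to be named OAI_KEY.

```julia
# Hypothetical helper, not part of Juissie's API.
# Assumes the key lives in the OAI_KEY environment variable when not passed.
function sketch_resolve_auth_token(auth_token::Union{String,Nothing}=nothing)
    # An explicitly passed key always wins.
    isnothing(auth_token) || return auth_token
    # Otherwise fall back to the environment, failing loudly if it is missing.
    haskey(ENV, "OAI_KEY") || error("No API key passed and OAI_KEY is not set")
    return ENV["OAI_KEY"]
end

# Trailing semicolon: keeps the value from being echoed in a REPL or notebook.
explicit = sketch_resolve_auth_token("sk-explicit-example");
```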

source

Generation.OAIGenerator — Type

struct OAIGenerator

A struct for handling natural language generation via OpenAI's gpt-3.5-turbo completion endpoint.

Attributes

url : String the URL of the OpenAI API endpoint header : Vector{Pair{String, String}} key-value pairs representing the HTTP headers for the request body : Dict{String, Any} this is the JSON payload to be sent in the body of the request

Notes

All natural language generation should be done via a "Generator" object of some kind for consistency.

When instantiating a new OAIGenerator in an externally viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses display of the returned struct, so your OAI API key is not inadvertently shared.
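To make the three attributes concrete, here is a sketch of a struct with the same field names and types. SketchOAIGenerator and sketch_oai_generator are hypothetical, and the endpoint URL, header names, and body keys are assumptions based on OpenAI's public chat completions API, not values read from Juissie.

```julia
# Hypothetical mirror of the documented fields; not Juissie's definition.
struct SketchOAIGenerator
    url::String
    header::Vector{Pair{String,String}}
    body::Dict{String,Any}
end

function sketch_oai_generator(auth_token::String)
    # Assumed endpoint and payload shape for a gpt-3.5-turbo chat request.
    url = "https://api.openai.com/v1/chat/completions"
    header = [
        "Authorization" => "Bearer $auth_token",
        "Content-Type" => "application/json",
    ]
    body = Dict{String,Any}("model" => "gpt-3.5-turbo", "messages" => Any[])
    return SketchOAIGenerator(url, header, body)
end

# The header holds the key, hence the trailing semicolon.
generator = sketch_oai_generator("sk-placeholder");
```

Because the key ends up inside the header field, displaying the struct would print it, which is the reason for the semicolon advice in the note.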

source

Generation.OAIGeneratorWithCorpus — Type

function OAIGeneratorWithCorpus(auth_token::Union{String, Nothing}=nothing, corpus::Corpus)

Initializes an OAIGeneratorWithCorpus.

Parameters

corpus_name : str or nothing the name that you want to give the database optional. if left as nothing, we use an in-memory database auth_token :: Union{String, Nothing} this is your OpenAI API key. You can either pass it explicitly as a string or leave this argument as nothing. In the latter case, we will look in your environment variables for "OAI_KEY" embedder_model_path : str a path to a HuggingFace-hosted model e.g. "BAAI/bge-small-en-v1.5" max_seq_len : int The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer

Notes

When instantiating a new OAIGenerator in an externally viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses display of the returned struct, so your OAI API key is not inadvertently shared.

source

Generation.OAIGeneratorWithCorpus — Type

struct OAIGeneratorWithCorpus

Like OAIGenerator, but has a corpus attached.

Attributes

url : String the URL of the OpenAI API endpoint header : Vector{Pair{String, String}} key-value pairs representing the HTTP headers for the request body : Dict{String, Any} this is the JSON payload to be sent in the body of the request corpus : an initialized Corpus object the corpus / "vector database" you want to use

Notes

When instantiating a new OAIGenerator in an externally viewable setting (e.g. notebooks committed to GitHub or a public demo), it is important to place a semicolon after the command, e.g. generator = load_OAIGeneratorWithCorpus("greek_philosophers"); The semicolon suppresses display of the returned struct, so your OAI API key is not inadvertently shared.

source

Generation.OllamaGenerator — Type

function OllamaGenerator(model_name::String = "mistral:7b-instruct")

Initializes an OllamaGenerator struct for local text generation.

Parameters

model_name :: String this is an Ollama model tag. see https://ollama.com/library defaults to mistral 7b instruct

source

Generation.OllamaGenerator — Type

struct OllamaGenerator

A struct for handling natural language generation locally.

Attributes

url : String the URL of the local Ollama API endpoint header : Dict{String,Any} HTTP header for the request body : Dict{String, Any} this is the JSON payload to be sent in the body of the request

source

Generation.OllamaGeneratorWithCorpus — Type

function OllamaGeneratorWithCorpus(corpus_name::Union{String,Nothing} = nothing, model_name::String = "mistral:7b-instruct", embedder_model_path::String = "BAAI/bge-small-en-v1.5", max_seq_len::Int = 512)

Initializes an OllamaGeneratorWithCorpus.

Parameters

corpus_name : str or nothing the name that you want to give the database optional. if left as nothing, we use an in-memory database model_name :: String this is an Ollama model tag. see https://ollama.com/library defaults to mistral 7b instruct embedder_model_path : str a path to a HuggingFace-hosted model e.g. "BAAI/bge-small-en-v1.5" max_seq_len : int The maximum number of tokens per chunk. This should be the max sequence length of the tokenizer

source

Generation.OllamaGeneratorWithCorpus — Type

struct OllamaGeneratorWithCorpus

Like OllamaGenerator, but has a corpus attached.

Attributes

url : String the URL of the local Ollama API endpoint header : Dict{String,Any} HTTP header for the request body : Dict{String, Any} this is the JSON payload to be sent in the body of the request corpus : an initialized Corpus object the corpus / "vector database" you want to use

source

Generation.SemanticSearch.Backend.TextUtils.chunkify — Function

function chunkify(text::String, tokenizer, sequence_length::Int=512)

Splits a provided text (e.g. paragraph) into chunks, each holding as many sentences as possible while keeping the chunk's token length below sequence_length. This ensures that each chunk can be fully encoded by the embedder.

Parameters

text : String The text you want to split into chunks. tokenizer : a tokenizer object, e.g. BertTextEncoder The tokenizer you will be using sequence_length : Int The maximum number of tokens per chunk. Ideally, should correspond to the max sequence length of the tokenizer

Example Usage

julia> chunkify(
    """Hold me closer, tiny dancer. Count the headlights on the highway. Lay me down in sheets of linen. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked.
    """,
    corpus.embedder.tokenizer,
    20
)

4-element Vector{Any}:
"Hold me closer, tiny dancer. Count the headlights on the highway."
"Lay me down in sheets of linen."
"Peter Piper picked a peck of pickled peppers."
"A peck of pickled peppers Peter Piper picked."
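The greedy packing that chunkify describes can be sketched without a real tokenizer. The version below is an illustration only: sketch_chunkify and count_tokens are hypothetical names, and tokens are approximated by whitespace-separated words, so the limit of 12 used here is not comparable to the 20 BERT tokens in the example above.

```julia
# Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
count_tokens(s::AbstractString) = length(split(s))

function sketch_chunkify(text::AbstractString, sequence_length::Int=512)
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = split(strip(text), r"(?<=[.!?])\s+")
    chunks = String[]
    current = ""
    for sentence in sentences
        candidate = isempty(current) ? String(sentence) : current * " " * sentence
        if isempty(current) || count_tokens(candidate) < sequence_length
            # Still below the limit (or a lone sentence): keep growing the chunk.
            current = candidate
        else
            # Adding this sentence would reach the limit: close the chunk.
            push!(chunks, current)
            current = String(sentence)
        end
    end
    isempty(current) || push!(chunks, current)
    return chunks
end

lyrics = "Hold me closer, tiny dancer. Count the headlights on the highway. " *
         "Lay me down in sheets of linen. Peter Piper picked a peck of pickled peppers. " *
         "A peck of pickled peppers Peter Piper picked."
chunks = sketch_chunkify(lyrics, 12)
```

A sentence that alone exceeds the limit is kept as its own chunk in this sketch; how the package treats that case is not stated in the docstring.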

source

Generation.SemanticSearch.Backend.TextUtils.get_files_path — Method

function get_files_path()

Simple function to return the path to the files subdirectory.

Example Usage

test_bin_path = get_files_path()*"test.bin"

source

Generation.SemanticSearch.Backend.TextUtils.read_html_url — Function

read_html_url(url::String, elements::Array{String})

Returns a string of text from the provided HTML elements on a webpage.

Parameters

url : String the url you want to read elements : Array{String} html elements to look for in the web page, e.g. ["h1", "p"].

Notes

Defaults to extracting headers and paragraphs

source

Generation.SemanticSearch.Backend.TextUtils.sentence_splitter — Method

function sentence_splitter(text::String)

Uses basic regex to divide a provided text (e.g. paragraph) into sentences.

Parameters

text : String The text you want to split into sentences.

Notes

Regex is hard to read. The first part looks for spaces following end-of-sentence punctuation. The second part matches at the end of the string.

Regex literals in Julia are written with an r prefix on the string.
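As an illustration of regex-based sentence splitting, here is a minimal sketch. The pattern is an assumption chosen for readability, not the package's actual regex, and sketch_sentence_splitter is a hypothetical name.

```julia
function sketch_sentence_splitter(text::AbstractString)
    # (?<=[.!?]) is a lookbehind: split on whitespace only when it directly
    # follows end-of-sentence punctuation, so the punctuation stays attached.
    parts = split(strip(text), r"(?<=[.!?])\s+")
    return [String(p) for p in parts if !isempty(p)]
end

sentences = sketch_sentence_splitter("Regex is terse. Is it readable? Rarely!")
```

A pattern this simple will also split after abbreviations such as "e.g.", which is a known limitation of punctuation-based splitting.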

References

https://www.geeksforgeeks.org/regular-expressions-in-julia/

source
