Design¶
Needlestack allows users to index their vector spaces and search them in real time. Below are some of the terms and design decisions.
Vector Spaces¶
Vector spaces are coordinate systems in which a set of vectors exists. A thing is represented in a vector space as a vector. Absolute and/or relative positions of these vectors contain information about the underlying things.
Collection¶
A Collection represents a particular vector space and its set of vectors.
A Collection partitions its vectors into one or more Shards. Performing
a kNN search on a Collection will perform a kNN search on each Shard,
then merge the results together. Users can specify which Shards to
perform a kNN search on, ignoring those not specified.
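The search-then-merge behaviour described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not Needlestack's implementation: the function names `knn_shard` and `knn_collection`, the dict-based shard layout, and the use of Euclidean distance are all assumptions made for the example.

```python
import heapq
import math

def knn_shard(shard, query, k):
    """Brute-force kNN over one shard (a dict of vector id -> vector)."""
    scored = ((math.dist(query, vec), vec_id) for vec_id, vec in shard.items())
    # nsmallest returns the k closest (distance, id) pairs, sorted by distance.
    return heapq.nsmallest(k, scored)

def knn_collection(shards, query, k, shard_names=None):
    """Search the selected shards (all by default) and merge their results."""
    names = list(shards) if shard_names is None else shard_names
    partial = [knn_shard(shards[name], query, k) for name in names]
    # Each partial list is already sorted, so a k-way merge yields global order.
    return list(heapq.merge(*partial))[:k]

# Hypothetical two-shard collection of 2-D vectors.
shards = {
    "shard_a": {"v1": (0.0, 0.0), "v2": (5.0, 5.0)},
    "shard_b": {"v3": (1.0, 0.0), "v4": (0.0, 3.0)},
}
results = knn_collection(shards, query=(0.0, 0.0), k=2)
only_a = knn_collection(shards, query=(0.0, 0.0), k=2, shard_names=["shard_a"])
```

Asking each shard for its own top k is enough to guarantee the global top k, because no vector outside a shard's top k can be among the k nearest overall.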
Spatial Index¶
A BaseIndex implements how vectors are stored and searched.
This is achieved using kNN indices from third-party packages
like faiss and scikit-learn. A particular BaseIndex
will use an algorithm like brute-force, kd-tree, Voronoi tessellations,
etc. Metadata about each vector is stored in the BaseIndex.
Metadata for a vector includes a string id and an optional list of
primitive values (string, double, float, long, int, bool).
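As a rough picture of what an index with per-vector metadata might look like, here is a small brute-force sketch. The class names `VectorMetadata` and `BruteForceIndex` and the `add`/`query` methods are hypothetical, chosen for illustration; they are not the BaseIndex interface, and a real index would delegate the search to a library such as faiss or scikit-learn.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

Primitive = Union[str, float, int, bool]

@dataclass
class VectorMetadata:
    id: str                                        # required string id
    fields: List[Primitive] = field(default_factory=list)  # optional values

class BruteForceIndex:
    """Hypothetical index: stores vectors with metadata and scans them all."""

    def __init__(self) -> None:
        self._vectors: List[Sequence[float]] = []
        self._metadata: List[VectorMetadata] = []

    def add(self, vector: Sequence[float], metadata: VectorMetadata) -> None:
        self._vectors.append(vector)
        self._metadata.append(metadata)

    def query(self, vector: Sequence[float], k: int) -> List[Tuple[float, VectorMetadata]]:
        pairs = zip(self._vectors, self._metadata)
        scored = [(math.dist(vector, v), m) for v, m in pairs]
        # Sort on distance only, so metadata objects are never compared.
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]

index = BruteForceIndex()
index.add((0.0, 1.0), VectorMetadata("a", ["red", 3]))
index.add((4.0, 4.0), VectorMetadata("b"))
hits = index.query((0.0, 0.0), k=1)
```

Keeping the metadata in a list parallel to the vectors means a search result can be returned with its id and values without a second lookup.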
Services¶
There are three main components to run a live Needlestack cluster:
- ClusterManager maintains the state of all worker nodes in a cluster
- Merger splits and distributes tasks over multiple worker nodes and merges results
- Searcher runs on worker nodes to perform tasks from Merger nodes
If you’re from the map-reduce world, think of these as resource manager, reducer, and mapper respectively.
Cluster Manager¶
The ClusterManager maintains the cluster state. It is a key-value store
client that exists in each Merger and Searcher. The key-value store
should be accessible by every node in the Needlestack cluster.
The default key-value store is Zookeeper.
The information available from the ClusterManager is:
- All live Searcher nodes in a Needlestack cluster
- All Collections and their Shards
- Which Shards are hosted on which Searcher nodes
- The status of any given Shard on a Searcher (ACTIVE, DOWN, BOOTING)
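One way to picture the information listed above is as a small in-memory structure. The sketch below is an assumption about shape only: the names `ShardStatus`, `ClusterState`, and `active_hosts`, and the example addresses, are hypothetical and say nothing about how Needlestack lays out keys in Zookeeper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set, Tuple

class ShardStatus(Enum):
    ACTIVE = "ACTIVE"
    DOWN = "DOWN"
    BOOTING = "BOOTING"

@dataclass
class ClusterState:
    # Addresses of all live Searcher nodes.
    searchers: Set[str] = field(default_factory=set)
    # Collection name -> names of its Shards.
    collections: Dict[str, List[str]] = field(default_factory=dict)
    # (searcher, collection, shard) -> status of that Shard on that Searcher.
    shard_status: Dict[Tuple[str, str, str], ShardStatus] = field(default_factory=dict)

    def active_hosts(self, collection: str) -> Dict[str, List[str]]:
        """Map each live Searcher to the ACTIVE Shards it hosts for a Collection."""
        hosts: Dict[str, List[str]] = {}
        for (searcher, coll, shard), status in self.shard_status.items():
            if coll == collection and searcher in self.searchers and status is ShardStatus.ACTIVE:
                hosts.setdefault(searcher, []).append(shard)
        return hosts

state = ClusterState(
    searchers={"s1:50051", "s2:50051"},
    collections={"products": ["shard_a", "shard_b"]},
    shard_status={
        ("s1:50051", "products", "shard_a"): ShardStatus.ACTIVE,
        ("s2:50051", "products", "shard_b"): ShardStatus.ACTIVE,
        ("s2:50051", "products", "shard_a"): ShardStatus.BOOTING,
    },
)
hosts = state.active_hosts("products")
```

A lookup like `active_hosts` is the kind of question a node would ask of the cluster state before routing work: which Searcher nodes can currently serve which Shards.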
Merger gRPC Service¶
The Merger is a microservice that Needlestack clients interface with.
Clients can make search, retrieval, and configuration requests through
gRPC. On each request, it retrieves the cluster state from the ClusterManager
and determines which Searcher nodes will fulfill a task. It sends tasks to
relevant Searcher nodes, waits for responses, then merges the results together.
For example, on a search request for a Collection, a Merger node will
get the cluster state from the ClusterManager and determine which Searcher
nodes host Shards in that Collection. The Merger sends concurrent
requests to relevant Searcher nodes to perform kNN searches on those Shards.
It waits for all concurrent requests to complete, then merges the results together.
The client gets a list of search results from the Merger, as if all vectors in
the collection were hosted on it.
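The fan-out and merge flow in this example can be sketched with a thread pool. This is an illustration under stated assumptions, not the Merger's code: `merge_search` and `fake_searcher` are hypothetical names, the Searcher call is an in-process function standing in for a gRPC request, and the vectors are one-dimensional to keep the example short.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: searcher address -> shard name -> vector id -> 1-D vector.
FAKE_DATA = {
    "s1:50051": {"shard_a": {"v1": 1.0, "v2": 9.0}},
    "s2:50051": {"shard_b": {"v3": 2.0, "v4": 4.0}},
}

def fake_searcher(searcher, shard_names, query, k):
    """Stand-in for a gRPC call: kNN over the named shards on one Searcher."""
    scored = [
        (abs(query - value), vec_id)
        for shard in shard_names
        for vec_id, value in FAKE_DATA[searcher][shard].items()
    ]
    return heapq.nsmallest(k, scored)

def merge_search(hosts, search_fn, query, k):
    """Send one request per Searcher concurrently, then merge the sorted results."""
    with ThreadPoolExecutor(max_workers=max(len(hosts), 1)) as pool:
        futures = [
            pool.submit(search_fn, searcher, shard_names, query, k)
            for searcher, shard_names in hosts.items()
        ]
        # result() blocks, so this line waits for every request to complete.
        partial = [future.result() for future in futures]
    return list(heapq.merge(*partial))[:k]

# Which Searcher hosts which Shards; in practice this comes from the cluster state.
hosts = {"s1:50051": ["shard_a"], "s2:50051": ["shard_b"]}
results = merge_search(hosts, fake_searcher, query=3.0, k=3)
```

The caller receives a single ranked list and never needs to know that the vectors were spread across two nodes.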
Searcher gRPC Service¶
The Searcher is a microservice that hosts the Shards. It is responsible
for the compute-heavy task of running kNN search algorithms. In a multi-node Needlestack
cluster, each Searcher node might only host a subset of all Shards.
A Searcher node will register itself with the ClusterManager on startup.
It is then visible to the rest of the Needlestack cluster and is available to
host Shards. An instance of a Shard on a Searcher node is called a
Replica.
A Merger node can send tasks to a Searcher as gRPC requests. The task could
be something like performing a kNN search over a specific set of Shards
hosted on that node.
Service Nodes¶
The ClusterManager will use an external key-value store, separate from
the Needlestack cluster.
The Merger and Searcher microservices may exist on either the same
or different nodes. Each node in the Needlestack cluster runs a gRPC server
where a Merger and Searcher servicer can be added. Whether to run them
on the same server or keep them isolated depends on
the constraints of your particular system.