Serve Haybarn catalogs, tables, and functions from a Java process over Apache Arrow IPC — the Java implementation of the VGI (Vector Gateway Interface) protocol.
Built by 🚜 Query.Farm
VGI lets Haybarn — Query Farm's independent derived distribution of DuckDB — ATTACH a catalog whose schemas, tables, and functions live in an external worker process. The vgi extension speaks an Arrow-IPC RPC protocol to that worker; this library is everything you need to write the worker side in Java. Your code registers functions and tables against a Worker builder — the library handles the wire protocol, schema negotiation, batch streaming, pushdown, and transports.
Wire-compatible with the Python reference implementation and the Go port: all three serve the same integration suite against the same C++ extension.
- Catalog tables — named tables with inline schemas, comments, tags, constraints, foreign keys, and per-column statistics that feed the engine's optimizer.
- Scalar functions — annotation-driven (ScalarFn): declare a compute() method and the parameter annotations generate the spec, bind-time validation, and dispatch.
- Table functions — streaming producers with projection pushdown, filter pushdown, row-id, sampling, and time-travel (AT) support.
- Table-in/out functions — exchange-style streaming transforms over input batches.
- Table buffering functions — sink/source functions that buffer all input before emitting (distributed-aggregation style lifecycles: process → combine → finalize).
- Aggregate functions — partial aggregation with cross-process state combine.
- Catalog versioning — semver data/implementation version negotiation, release manifests, multi-branch tables, transactions, and attach options.
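The process → combine → finalize lifecycle named in the list above can be pictured with a plain-Java model of a partial AVG. This is only a sketch under stated assumptions: `AvgState` and its method names are hypothetical and do not use the library's aggregate API, which this section does not show.

```java
import java.util.List;

/** Hypothetical partial-aggregation state for AVG; NOT the library's aggregate API. */
final class AvgState {
    long sum;
    long count;

    /** process: fold one batch of input values into this worker-local state. */
    void process(List<Long> batch) {
        for (Long v : batch) {
            if (v != null) { // NULL inputs are skipped, as SQL AVG does
                sum += v;
                count++;
            }
        }
    }

    /** combine: merge a partial state produced elsewhere (for example, another process). */
    void combine(AvgState other) {
        sum += other.sum;
        count += other.count;
    }

    /** finalize: turn the merged state into the result; null when no rows were seen. */
    Double finish() {
        return count == 0 ? null : (double) sum / count;
    }

    public static void main(String[] args) {
        AvgState a = new AvgState();
        AvgState b = new AvgState();
        a.process(List.of(1L, 2L, 3L));
        b.process(List.of(4L, 5L));
        a.combine(b);
        System.out.println(a.finish()); // prints 3.0
    }
}
```

The point of the split is that only the small `sum`/`count` state has to cross process boundaries, never the input rows themselves.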
- Java 21+ at runtime. The shared-memory side-channel (zero-copy batch transfer with a co-located engine) additionally requires JDK 22+; on 21 it transparently falls back to pipe transport.
- Haybarn with the vgi extension installed on the client side (it's in Haybarn's signed community channel: INSTALL vgi FROM community).
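The version rule in the first requirement can be mirrored in a worker's own startup logging. A minimal sketch, assuming a hypothetical `TransportCheck` helper; the library's actual detection logic is not shown here and may differ.

```java
/** Hypothetical helper; mirrors the stated rule, NOT the library's real detection code. */
final class TransportCheck {
    /** Shared memory needs JDK 22+ per the requirement above; JDK 21 uses pipe transport. */
    static String sideChannelFor(int featureVersion) {
        if (featureVersion < 21) {
            throw new IllegalStateException("Java 21+ is required, found " + featureVersion);
        }
        return featureVersion >= 22 ? "shared-memory" : "pipe";
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature(); // major version of the running JDK
        System.out.println("JDK " + feature + " -> " + sideChannelFor(feature));
    }
}
```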
Artifacts are published to Maven Central under the farm.query group.
Gradle (Kotlin DSL):
```kotlin
dependencies {
    implementation("farm.query:vgi:0.1.0")
}
```

Maven:
```xml
<dependency>
    <groupId>farm.query</groupId>
    <artifactId>vgi</artifactId>
    <version>0.1.0</version>
</dependency>
```

The RPC layer (farm.query:vgirpc) comes in transitively.
A worker with one scalar function:
```java
import farm.query.vgi.Worker;
import farm.query.vgi.scalar.Const;
import farm.query.vgi.scalar.ScalarFn;
import farm.query.vgi.scalar.Vector;
import org.apache.arrow.vector.BigIntVector;

public final class DemoWorker {

    /** {@code multiply(value INT64, factor INT64 [const]) -> INT64} */
    static final class Multiply extends ScalarFn {
        @Override public String name() { return "multiply"; }
        @Override public String description() { return "Multiplies a value by a constant factor"; }

        public void compute(@Vector BigIntVector value, @Const long factor, BigIntVector result) {
            int rows = value.getValueCount();
            for (int i = 0; i < rows; i++) {
                if (value.isNull(i)) {
                    result.setNull(i);
                } else {
                    result.set(i, value.get(i) * factor);
                }
            }
        }
    }

    public static void main(String[] args) {
        Worker worker = Worker.builder()
            .catalogName("demo")
            .registerScalar(new Multiply());
        worker.runFromArgs(args); // stdio by default; --unix / --http via flags
    }
}
```

The compute() signature drives everything: @Vector parameters are per-row input columns, @Const parameters are bind-time constants, @Setting parameters read session settings, and the last unannotated Arrow vector is the framework-allocated output.
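To make the idea of a signature-driven spec concrete, here is a dependency-free sketch of how parameter annotations on a compute() method can be read by reflection and mapped to roles. Everything in it is an assumption for illustration: the `Vector` and `Const` annotations are local stand-ins rather than the library's, plain arrays stand in for Arrow vectors, and the library's real spec generation is not shown in this section.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

final class SignatureSketch {
    // Hypothetical stand-ins for illustration only; NOT the library's annotations.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER) @interface Vector {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER) @interface Const {}

    // Plain arrays stand in for Arrow vectors so the sketch has no dependencies.
    static final class Multiply {
        public void compute(@Vector long[] value, @Const long factor, long[] result) {
            for (int i = 0; i < value.length; i++) {
                result[i] = value[i] * factor;
            }
        }
    }

    /** Maps each compute() parameter to "role:type", in declaration order. */
    static List<String> roles(Class<?> fn) {
        for (Method m : fn.getDeclaredMethods()) {
            if (!m.getName().equals("compute")) continue;
            List<String> out = new ArrayList<>();
            for (Parameter p : m.getParameters()) {
                // Simplification: any unannotated parameter is treated as the output.
                String role = p.isAnnotationPresent(Vector.class) ? "vector"
                        : p.isAnnotationPresent(Const.class) ? "const"
                        : "output";
                out.add(role + ":" + p.getType().getSimpleName());
            }
            return out;
        }
        throw new IllegalArgumentException("no compute() method on " + fn.getName());
    }

    public static void main(String[] args) {
        System.out.println(roles(Multiply.class)); // prints [vector:long[], const:long, output:long[]]
    }
}
```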
The worker JVM needs two flags — Apache Arrow requires access to java.nio internals, and the shared-memory transport uses the FFM API:
```
--add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED
--enable-native-access=ALL-UNNAMED
```
With the Gradle application plugin, bake them into the start script so the worker binary is self-contained:
```kotlin
application {
    mainClass.set("DemoWorker")
    applicationDefaultJvmArgs = listOf(
        "--add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED",
        "--enable-native-access=ALL-UNNAMED",
    )
}
```

Without the --add-opens flag the worker fails at first query with Failed to initialize MemoryUtil.
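Because the failure otherwise only shows up at the first query, a worker could check its own JVM arguments at startup. A minimal sketch, assuming a hypothetical `JvmFlagCheck` class that is not part of the library. It compares exact strings, so flags supplied in another form (for example, split across two arguments) would be reported as missing.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical preflight check; NOT part of the library. */
final class JvmFlagCheck {
    // The two flags listed above, copied verbatim.
    static final List<String> REQUIRED = List.of(
        "--add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED",
        "--enable-native-access=ALL-UNNAMED");

    /** Returns the required flags that do not appear in the given JVM arguments. */
    static List<String> missing(List<String> jvmArgs) {
        List<String> out = new ArrayList<>();
        for (String flag : REQUIRED) {
            if (!jvmArgs.contains(flag)) {
                out.add(flag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, excluding the arguments to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String flag : missing(jvmArgs)) {
            System.err.println("warning: JVM started without " + flag);
        }
    }
}
```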
Attach and query it from Haybarn:
```sql
INSTALL vgi FROM community;
LOAD vgi;
ATTACH 'demo' AS demo (TYPE vgi, LOCATION 'launch:/path/to/demo-worker');
SELECT demo.multiply(21, 2); -- 42
```

The launch: location scheme starts the worker once behind a flock-coordinated Unix socket and reuses it across queries and engine processes, which is essential for JVM workers because they are expensive to cold-start. Plain subprocess (/path/to/worker) and http(s):// locations also work.
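The three location forms can be told apart by prefix alone. The sketch below is a hypothetical classifier written for deployment tooling around a worker; the real parsing happens inside the vgi extension and may accept forms not covered here.

```java
/** Hypothetical classifier for the location forms named above; NOT the extension's parser. */
final class LocationKind {
    enum Kind { LAUNCH, HTTP, SUBPROCESS }

    static Kind of(String location) {
        if (location.startsWith("launch:")) {
            return Kind.LAUNCH; // shared worker, started once and reused
        }
        if (location.startsWith("http://") || location.startsWith("https://")) {
            return Kind.HTTP;
        }
        return Kind.SUBPROCESS; // plain path: worker run as a child process
    }

    public static void main(String[] args) {
        System.out.println(of("launch:/path/to/demo-worker")); // prints LAUNCH
    }
}
```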
The vgi-example-worker module (not published) is a complete worker with 90+ functions — scalar, table, aggregate, table-in/out, buffering, partitioned, multi-branch, transactional — that serves the canonical VGI integration suite. It is the best place to look for working patterns of any feature.
| Repository | What it is |
|---|---|
| Query-farm-haybarn/haybarn | Haybarn — the independent derived distribution of DuckDB by Query Farm |
| Query-farm/vgi | The vgi engine extension (C++) — the client side of the protocol |
| Query-farm/vgi-python | Python reference implementation of the worker side |
| Query-farm/vgi-go | Go implementation of the worker side |
| Query-farm/vgi-rpc-java | The transport-agnostic Arrow RPC framework this library builds on |
