
Quick Start with Java

Integrate OpenDataLoader PDF as a JVM dependency or CLI

Use the core Java library when you need full JVM control or want to embed PDF parsing inside existing Java services.

Requirements

  • Java 11+ available on the system PATH

Verify that Java is installed and on the PATH before adding the dependency:

java -version
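If the library is embedded in a service, the same requirement can also be checked at startup. The following is a minimal sketch, not part of OpenDataLoader: the class and method names are hypothetical, and it relies only on the standard `java.specification.version` system property.

```java
// Hypothetical startup guard; not part of the OpenDataLoader API.
final class JavaVersionCheck {
    static final int MINIMUM_FEATURE_VERSION = 11;

    /** Converts a specification version string into a feature number. */
    static int featureVersion(String specVersion) {
        // Java 8 and earlier report "1.x"; Java 9+ report the feature number directly.
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    /** Returns true when the given specification version meets the Java 11 minimum. */
    static boolean isSupported(String specVersion) {
        return featureVersion(specVersion) >= MINIMUM_FEATURE_VERSION;
    }

    /** Throws if the running JVM is older than the documented minimum. */
    static void requireSupportedJava() {
        String spec = System.getProperty("java.specification.version");
        if (!isSupported(spec)) {
            throw new IllegalStateException("Java 11+ required, found " + spec);
        }
    }
}
```

Calling `JavaVersionCheck.requireSupportedJava()` at the top of `main` fails fast with a readable message instead of a class-loading error later on.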

Dependency (Maven)

<dependency>
  <groupId>org.opendataloader</groupId>
  <artifactId>opendataloader-pdf-core</artifactId>
  <version>1.11.0</version>
</dependency>

<repositories>
  <repository>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
    <id>vera-dev</id>
    <name>Vera development</name>
    <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
  </repository>
</repositories>

Check Maven Central for the latest version.

Sample Gradle and Maven projects live in opendataloader-pdf-examples.

Process PDFs

import org.opendataloader.pdf.api.Config;
import org.opendataloader.pdf.api.OpenDataLoaderPDF;

public class Sample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.setOutputFolder("path/to/output");
        config.setGeneratePDF(true);
        config.setGenerateMarkdown(true);
        config.setGenerateHtml(true);

        try {
            // Process multiple files in one JVM invocation
            for (String pdf : new String[]{"report.pdf", "contract.pdf"}) {
                OpenDataLoaderPDF.processFile(pdf, config);
            }
        } finally {
            // Releases internal thread pools; call once at application exit, not between batches
            OpenDataLoaderPDF.shutdown();
        }
    }
}

Performance tip: Process all files within a single JVM session. Each processFile() call reuses the initialized runtime, so batching hundreds of files is significantly faster than launching separate processes.
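One way to apply this tip is to collect every PDF under a folder first and then loop over the list inside one JVM. The sketch below is an illustration under assumptions: `BatchRunner` and `PdfProcessor` are hypothetical names, and the callback stands in for `OpenDataLoaderPDF.processFile(pdf, config)` from the sample above so that the helper has no dependency on the library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of the OpenDataLoader API.
final class BatchRunner {
    /** Stand-in for a call such as OpenDataLoaderPDF.processFile(pdfPath, config). */
    interface PdfProcessor {
        void process(String pdfPath) throws Exception;
    }

    /** Recursively collects *.pdf files under root, sorted for a stable processing order. */
    static List<Path> findPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Processes every file in the same JVM and returns the files that failed. */
    static List<Path> processAll(List<Path> pdfs, PdfProcessor processor) {
        List<Path> failed = new ArrayList<>();
        for (Path pdf : pdfs) {
            try {
                processor.process(pdf.toString());
            } catch (Exception e) {
                // Keep going so that one bad file does not abort the whole batch.
                failed.add(pdf);
            }
        }
        return failed;
    }
}
```

A caller would pass something like `pdf -> OpenDataLoaderPDF.processFile(pdf, config)` as the processor, keep the surrounding `try`/`finally`, and call `OpenDataLoaderPDF.shutdown()` once after the loop. Collecting failures instead of rethrowing is a design choice for long batches; rethrow if a single failure should stop the run.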

For all Config options, see the Config Javadoc.

CLI usage

Download the CLI JAR from the releases page.

Pass multiple files or directories in a single command:

# Batch all files in one call. Each invocation starts a new JVM process, so repeated calls are slow.
java -jar opendataloader-pdf-cli-<VERSION>.jar \
  file1.pdf file2.pdf folder/ \
  -o output/ \
  -f json,html,pdf,markdown
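When the CLI has to be launched from another Java program (for example, to isolate it in its own process), the same batching rule applies: build one command that lists every input. The sketch below is illustrative only; `CliCommand` is a hypothetical helper, the JAR path is whatever file you downloaded, and only the `-o` and `-f` flags shown above are used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the OpenDataLoader distribution.
final class CliCommand {
    /** Builds a single command line covering all inputs, so only one JVM is spawned. */
    static List<String> build(String jarPath, List<String> inputs,
                              String outputDir, List<String> formats) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.addAll(inputs);                  // files and directories, in order
        cmd.add("-o");
        cmd.add(outputDir);
        cmd.add("-f");
        cmd.add(String.join(",", formats));  // e.g. "json,markdown"
        return cmd;
    }

    /** Runs the command, forwarding its output to this process, and returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

For example, `CliCommand.run(CliCommand.build(jarPath, List.of("file1.pdf", "folder/"), "output/", List.of("json", "markdown")))` starts one process for the whole batch. Inside an existing Java service, calling the library directly avoids the extra process altogether.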

For all CLI options, see the CLI Options Reference.

API docs

Full Javadoc is published at javadoc.io.

Next steps

  • Need schema details for downstream parsing? See the JSON schema.
