Quick Start with Java
Integrate OpenDataLoader PDF as a JVM dependency or CLI
Use the core Java library when you need full JVM control or want to embed PDF parsing inside existing Java services.
Requirements
- Java 11+ available on the system PATH
Verify Java once before installing:
java -version
Dependency (Maven)
<dependency>
  <groupId>org.opendataloader</groupId>
  <artifactId>opendataloader-pdf-core</artifactId>
  <version>1.11.0</version>
</dependency>
<repositories>
  <repository>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
    <id>vera-dev</id>
    <name>Vera development</name>
    <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
  </repository>
</repositories>
Check Maven Central for the latest version.
Sample Gradle and Maven projects live in opendataloader-pdf-examples.
Process PDFs
import org.opendataloader.pdf.api.Config;
import org.opendataloader.pdf.api.OpenDataLoaderPDF;

public class Sample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.setOutputFolder("path/to/output");
        config.setGeneratePDF(true);
        config.setGenerateMarkdown(true);
        config.setGenerateHtml(true);
        try {
            // Process multiple files in one JVM invocation
            for (String pdf : new String[]{"report.pdf", "contract.pdf"}) {
                OpenDataLoaderPDF.processFile(pdf, config);
            }
        } finally {
            // Releases internal thread pools; call once at application exit, not between batches
            OpenDataLoaderPDF.shutdown();
        }
    }
}
Performance tip: Process all files within a single JVM session. Each processFile() call reuses the initialized runtime, so batching hundreds of files is significantly faster than launching separate processes.
For all Config options, see the Config Javadoc.
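To make the single-JVM batching advice concrete, here is a minimal sketch of a helper that gathers every PDF under a directory and feeds it to one processing callback. PdfBatch, PdfProcessor, collectPdfs, and processAll are hypothetical names for illustration and are not part of OpenDataLoader PDF. The callback stands in for the processFile(pdf, config) call shown in the sample, so the block compiles without the library on the classpath.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of OpenDataLoader PDF) for single-JVM batch runs. */
final class PdfBatch {

    /** Callback standing in for the library call, e.g. OpenDataLoaderPDF.processFile. */
    @FunctionalInterface
    interface PdfProcessor {
        void process(String pdfPath) throws Exception;
    }

    /** Recursively collects *.pdf files under root, sorted for a stable order. */
    static List<Path> collectPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Runs the processor on every file and returns the files that failed. */
    static List<Path> processAll(List<Path> pdfs, PdfProcessor processor) {
        List<Path> failed = new ArrayList<>();
        for (Path pdf : pdfs) {
            try {
                processor.process(pdf.toString());
            } catch (Exception e) {
                // Keep going so one bad PDF does not abort the whole batch
                failed.add(pdf);
            }
        }
        return failed;
    }
}
```

Inside the try block of the sample, the loop could then become PdfBatch.processAll(PdfBatch.collectPdfs(Path.of("input")), pdf -> OpenDataLoaderPDF.processFile(pdf, config)), where "input" is a placeholder directory. OpenDataLoaderPDF.shutdown() still runs once in the finally block.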
CLI usage
Download the CLI JAR from the releases page.
Pass multiple files or directories in a single command:
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
java -jar opendataloader-pdf-cli-<VERSION>.jar \
  file1.pdf file2.pdf folder/ \
  -o output/ \
  -f json,html,pdf,markdown
For all CLI options, see the CLI Options Reference.
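If a JVM service has to shell out to the CLI rather than embed the library, the same "one call, many inputs" rule applies. The sketch below assembles a single batched command using only the -o and -f flags shown in the command above. CliCommand, build, and run are hypothetical names, and the JAR path is supplied by the caller because the version is not fixed here.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenDataLoader PDF) that assembles one batched CLI call. */
final class CliCommand {

    /** Builds: java -jar <jar> <inputs...> -o <outputDir> -f <formats joined by commas>. */
    static List<String> build(Path cliJar, List<String> inputs,
                              String outputDir, List<String> formats) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("at least one input file or folder is required");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(cliJar.toString());
        cmd.addAll(inputs);               // all files and folders go into the same invocation
        cmd.add("-o");
        cmd.add(outputDir);
        cmd.add("-f");
        cmd.add(String.join(",", formats));
        return cmd;
    }

    /** Runs the command, forwarding CLI output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. The meaning of the exit code is not documented here, so treat any non-zero value as a failure only as a working assumption.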
API docs
Full Javadoc is published at javadoc.io.
Next steps
- Need schema details for downstream parsing? See the JSON schema.
Quick Start with Python
Install opendataloader-pdf and extract text, tables, and headings from PDF files using Python. Requires Java 11+ and Python 3.10+.
Quick Start with Node.js
Install @opendataloader/pdf and convert PDF files to Markdown or JSON using TypeScript or JavaScript. Requires Java 11+ and Node.js 20+.