A curated list of resources dedicated to Natural Language Processing
Please read the contribution guidelines before contributing, and add your favourite NLP resource by raising a pull request.
This list covers natural language processing — linguistic analysis, multilingual tooling, classical and neural methods, datasets, and evaluation. Large language models are included only where they advance or evaluate a core NLP task or capability (tokenization, multilinguality, MT, summarization, NER, QA, factuality, probing, distillation). General-purpose chatbots, agent frameworks, prompt-template repositories, code-generation tools, and RAG application starter kits live in other lists — see See Also.
- Research Summaries and Trends
- Prominent NLP Research Labs
- Tutorials
- Libraries
- Services
- Annotation Tools
- Tasks and Methods
- Text Embeddings
- Tokenization, Morphology, and Segmentation
- POS Tagging and Dependency Parsing
- Named Entity Recognition and Information Extraction
- Coreference Resolution
- Text Classification and Sentiment Analysis
- Topic Modeling
- Summarization
- Machine Translation
- Question Answering and Reading Comprehension
- Information Extraction Beyond NER
- Retrieval and Embeddings
- Speech and Text
- Datasets
- Multilingual NLP Frameworks
- Language Models for NLP
- Pretraining and Adaptation
- Multilingual and Cross-Lingual Models
- Evaluation and Benchmarks
- Reasoning and Test-Time Compute
- Long Context and Alternative Architectures
- Factuality, Hallucination, Calibration
- Probing and Interpretability
- Efficient and Small Language Models
- Instruction Tuning and Preference Optimization
- Bias, Fairness, Safety in NLP
- NLP per Language
- See Also
- Citation
Where to follow current NLP research:
- ACL Anthology - canonical archive of papers from ACL, EMNLP, NAACL, EACL, COLING, and related venues.
- NLP-Progress - tracks state-of-the-art results across common NLP tasks and datasets.
- Papers With Code: NLP - papers, benchmarks, and leaderboards for NLP tasks.
- Sebastian Ruder's newsletter - regular roundups of NLP research and trends.
- ACL Rolling Review - the rolling review process feeding ACL-affiliated venues.
- The Gradient - long-form essays on ML and NLP research.
- Visual NLP Paper Summaries - illustrated summaries of recent papers.
- NLP's ImageNet moment has arrived - 2018 essay on the rise of pretrained language models.
- Survey of the State of the Art in Natural Language Generation - 2017 NLG survey.
- The Illustrated Transformer and The Illustrated BERT, ELMo, and co. - canonical visual explanations.
- The Berkeley NLP Group - Notable contributions include a tool to reconstruct long-dead languages, referenced here, which takes corpora from 637 languages currently spoken in Asia and the Pacific and reconstructs their common ancestor.
- Language Technologies Institute, Carnegie Mellon University - Notable projects include the Avenue Project, a syntax-driven machine translation system for endangered languages like Quechua and Aymara, and previously Noah's Ark, which created AQMAR to improve NLP tools for Arabic.
- NLP research group, Columbia University - Responsible for creating BOLT (interactive error handling for speech translation systems) and an unnamed project to characterize laughter in dialogue.
- The Center for Language and Speech Processing, Johns Hopkins University - Recently in the news for developing speech recognition software to create a diagnostic test for Parkinson's disease, here.
- Computational Linguistics and Information Processing Group, University of Maryland - Notable contributions include Human-Computer Cooperation for Word-by-Word Question Answering and modeling the development of phonetic representations.
- Penn Natural Language Processing, University of Pennsylvania - famous for creating the Penn Treebank and the Penn Discourse Treebank.
- The Stanford Natural Language Processing Group - One of the top NLP research labs in the world, notable for creating Stanford CoreNLP and their coreference resolution system.
General Machine Learning
- Machine Learning 101 from Google's Senior Creative Engineer explains Machine Learning for engineers and executives alike
- AI Playbook - a16z AI playbook is a great link to forward to your managers or content for your presentations
- Sebastian Ruder's Newsletter for commentary on the best of NLP research.
- How To Label Data guide to managing larger linguistic annotation projects
- Depends on the Definition collection of blog posts covering a wide array of NLP topics with detailed implementation
Introductions and Guides to NLP
- Understand & Implement Natural Language Processing
- NLP in Python - Collection of Github notebooks
- Natural Language Processing: An Introduction - Oxford
- NLP from Scratch with PyTorch
- Hands-On NLTK Tutorial - NLTK Tutorials, Jupyter notebooks
- Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit - An online and print book introducing NLP concepts using NLTK. The book's authors also wrote the NLTK library.
- Train a new language model from scratch - Hugging Face 🤗
- Advanced NLP with spaCy - Free online course covering text processing, large-scale data analysis, processing pipelines, and training neural network models for custom NLP tasks.
- Kaggle NLP Learning Guide - Beginner-friendly tutorials including getting started guides, deep learning for NLP, and visual explanations of techniques like BERT, GloVe, and TF-IDF.
Blogs and Newsletters
- Deep Learning, NLP, and Representations
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) and The Illustrated Transformer
- Natural Language Processing by Hal Daumé III
- arXiv: Natural Language Processing (Almost) from Scratch
- Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks
- Machine Learning Mastery: Deep Learning for Natural Language Processing
- Visual NLP Paper Summaries
- Advanced Natural Language Processing - CS 685, UMass Amherst CS
- Deep Natural Language Processing - Lectures series from Oxford
- Deep Learning for Natural Language Processing (CS224n) - Richard Socher and Christopher Manning's Stanford Course
- Neural Networks for NLP - Carnegie Mellon Language Technology Institute
- Deep NLP Course by Yandex Data School, covering key ideas from text embeddings to machine translation, including sequence modeling and language models.
- fast.ai Code-First Intro to Natural Language Processing - This covers a blend of traditional NLP topics (including regex, SVD, naive bayes, tokenization) and recent neural network approaches (including RNNs, seq2seq, GRUs, and the Transformer), as well as addressing urgent ethical issues, such as bias and disinformation. Find the Jupyter Notebooks here
- Machine Learning University - Accelerated Natural Language Processing - Lectures go from introduction to NLP and text processing to Recurrent Neural Networks and Transformers. Material can be found here.
- Applied Natural Language Processing - Lecture series from IIT Madras, taking you from the basics all the way to autoencoders. The GitHub notebooks for this course are also available here
- DeepLearning.AI Natural Language Processing Specialization - 4-course program covering sentiment analysis, word embeddings, RNNs, LSTMs, attention mechanisms, and Transformer models like BERT and T5 for tasks including machine translation and summarization.
- Stanford CS336: Language Modeling from Scratch - end-to-end course on building language models, including data, tokenization, training, and evaluation.
- Stanford CS25: Transformers United - seminar series with guest lectures from authors of recent transformer and NLP research.
- Cohere LLM University - free course on LLMs, embeddings, semantic search, and NLP applications.
- Hugging Face NLP Course - hands-on NLP with Transformers, Datasets, and Tokenizers libraries.
- Speech and Language Processing - free, by Prof. Dan Jurafsky
- Natural Language Processing - free, NLP notes by Dr. Jacob Eisenstein at GeorgiaTech
- NLP with PyTorch - Delip Rao and Brian McMahan
- Text Mining in R
- Natural Language Processing with Python
- Practical Natural Language Processing
- Natural Language Processing with Spark NLP
- Deep Learning for Natural Language Processing by Stephan Raaijmakers
- Real-World Natural Language Processing - by Masato Hagiwara
- Natural Language Processing in Action, Second Edition - by Hobson Lane and Maria Dyshel
- Transformers in Action - by Nicole Koenigstein
- The Math Behind Artificial Intelligence - by Tiago Monteiro | A free FreeCodeCamp book teaching the math behind AI in plain English from an engineering point of view. It covers linear algebra, calculus, probability & statistics, and optimization theory with analogies, real-life applications, and Python code examples.
Node.js and JavaScript - Node.js Libraries for NLP | Back to Top
- Twitter-text - A JavaScript implementation of Twitter's text processing library
- Knwl.js - A Natural Language Processor in JS
- Retext - Extensible system for analyzing and manipulating natural language
- NLP Compromise - Natural Language processing in the browser
- Natural - general natural language facilities for node
- Poplar - A web-based annotation tool for natural language processing (NLP)
- NLP.js - An NLP library for building bots
- node-question-answering - Fast and production-ready question answering w/ DistilBERT in Node.js
Python - Python NLP Libraries | Back to Top
- sentimental-onix - Sentiment models for spaCy using ONNX
- TextAttack - Adversarial attacks, adversarial training, and data augmentation in NLP
- TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of Natural Language Toolkit (NLTK) and Pattern, and plays nicely with both 👍
- spaCy - Industrial strength NLP with Python and Cython 👍
- textacy - Higher level NLP built on spaCy
- gensim - Python library to conduct unsupervised semantic modelling from plain text 👍
- scattertext - Python library to produce d3 visualizations of how language differs between corpora
- GluonNLP (archived) - A deep learning toolkit for NLP, built on MXNet/Gluon.
- AllenNLP (archived) - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.
- PyTorch-NLP - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, common NLP metrics such as BLEU
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
- PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python, handles some specific formats like ARPA language models, Moses phrasetables, GIZA++ alignments.
- foliapy - Python library for working with FoLiA, an XML format for linguistic annotation.
- PySS3 - Python package implementing the SS3 white-box text classifier; ships with interactive visualization tools that explain predictions.
- jPTDP - A toolkit for joint part-of-speech (POS) tagging and dependency parsing. jPTDP provides pre-trained models for 40+ languages.
- BigARTM - a fast library for topic modelling
- Snips NLU - A production ready library for intent parsing
- Chazutsu - A library for downloading & parsing standard NLP research datasets
- Word Forms - Word forms can accurately generate all possible forms of an English word
- Multilingual Latent Dirichlet Allocation (LDA) - A multilingual and extensible document clustering pipeline
- Natural Language Toolkit (NLTK) - A library containing a wide variety of NLP functionality, supporting over 50 corpora.
- NLP Architect - A library for exploring the state-of-the-art deep learning topologies and techniques for NLP and NLU
- Flair - A very simple framework for state-of-the-art multilingual NLP built on PyTorch. Includes BERT, ELMo and Flair embeddings.
- Kashgari - Simple, Keras-powered multilingual NLP framework, allows you to build your models in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS) and text classification tasks. Includes BERT and word2vec embedding.
- FARM - Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
- Haystack - End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and the State-of-the-Art of NLP. Supports DPR, Elasticsearch, HuggingFace’s Modelhub, and much more!
- Rita DSL - a DSL, loosely based on RUTA on Apache UIMA. Allows you to define language patterns (rule-based NLP) which are then translated into spaCy or, if you prefer fewer features and something lightweight, regex patterns.
- Transformers - Natural Language Processing for TensorFlow 2.0 and PyTorch.
- Tokenizers - Tokenizers optimized for Research and Production.
- fairseq - Facebook AI Research implementations of SOTA seq2seq models in PyTorch.
- corex_topic - Hierarchical Topic Modeling with Minimal Domain Knowledge
- Sockeye - Neural Machine Translation (NMT) toolkit that powers Amazon Translate.
- DL Translate - A deep learning-based translation library for 50 languages, built on Hugging Face Transformers and Facebook's mBART Large.
- Jury - Evaluation of NLP model outputs offering various automated metrics.
- python-ucto - Unicode-aware regular-expression based tokenizer for various languages. Python binding to C++ library, supports FoLiA format.
- Pearmut - Human annotation tool for multilingual NLP tasks, such as machine translation.
- Stanza - Stanford NLP's Python toolkit for tokenization, POS, lemma, dependency parsing, and NER across 70+ languages.
- Sentence-Transformers - sentence/document embeddings, semantic search, and re-ranking; current standard for retrieval-style NLP.
- Argilla - open-source data annotation and feedback collection platform for LLM and NLP datasets.
- HuggingFace Datasets - standardized loaders and processing for thousands of NLP datasets.
- HuggingFace Evaluate - reference implementations for NLP metrics.
- sacrebleu - reproducible BLEU/chrF/TER scoring for machine translation.
- COMET - learned MT metrics, current de-facto standard.
- LangTest - 60+ test types for NLP model robustness, bias, and fairness.
C++ - C++ Libraries | Back to Top
- InsNet - A neural network library for building instance-dependent NLP models with padding-free dynamic batching.
- MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction
- CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
- CRFsuite - CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data.
- BLLIP Parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
- colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
- ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
- libfolia - C++ library for the FoLiA format
- frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
- MeTA - ModErn Text Analysis: a C++ data sciences toolkit for mining big text data.
- MeCab (Japanese)
- Moses
- StarSpace - a library from Facebook for creating embeddings of word-level, paragraph-level, document-level and for text classification
- QSMM - adaptive probabilistic top-down and bottom-up parsers
Java - Java NLP Libraries | Back to Top
- Stanford NLP
- OpenNLP
- NLP4J
- Word2vec in Java
- ReVerb Web-Scale Open Information Extraction
- OpenRegex An efficient and flexible token-based regular expression language and engine.
- CogcompNLP - Core libraries developed in the U of Illinois' Cognitive Computation Group.
- MALLET - MAchine Learning for LanguagE Toolkit - package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- RDRPOSTagger - A robust POS tagging toolkit available (in both Java & Python) together with pre-trained models for 40+ languages.
Scala - Scala NLP Libraries | Back to Top
- Saul - Library for developing NLP systems, including built in modules like SRL, POS, etc.
- ATR4S - Toolkit with state-of-the-art automatic term recognition methods.
- tm - Implementation of topic modeling based on regularized multilingual PLSA.
- word2vec-scala - Scala interface to word2vec model; includes operations on vectors like word-distance and word-analogy.
- Epic - Epic is a high performance statistical parser written in Scala, along with a framework for building complex structured prediction models.
- Spark NLP - Spark NLP is a natural language processing library built on top of Apache Spark ML that provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
R - R NLP Libraries | Back to Top
- text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
- wordVectors - An R package for creating and exploring word2vec and other word embedding models
- RMallet - R package to interface with the Java machine learning tool MALLET
- dfr-browser - Creates d3 visualizations for browsing topic models of text in a web browser.
- dfrtopics - R package for exploring topic models of text.
- sentiment_classifier - Sentiment Classification using Word Sense Disambiguation and WordNet Reader
- jProcessing - Japanese Natural Language Processing Libraries, with Japanese sentiment classification
- corporaexplorer - An R package for dynamic exploration of text collections
- tidytext - Text mining using tidy tools
- spacyr - R wrapper to spaCy NLP
- CRAN Task View: Natural Language Processing
Clojure - Clojure NLP Libraries | Back to Top
- Clojure-openNLP - Natural Language Processing in Clojure (opennlp)
- Inflections-clj - Rails-like inflection library for Clojure and ClojureScript
- postagga - A library to parse natural language in Clojure and ClojureScript
Rust - Rust NLP Libraries | Back to Top
- whatlang — Natural language recognition library based on trigrams
- rust-bert - Ready-to-use NLP pipelines and Transformer-based models
- snips-nlu-rs (archived — Snips was discontinued) - A production ready library for intent parsing
NLP++ - NLP++ Language | Back to Top
- VSCode Language Extension - NLP++ Language Extension for VSCode
- nlp-engine - NLP++ engine to run NLP++ code on Linux including a full English parser
- VisualText - Homepage for the NLP++ Language
- NLP++ Wiki - Wiki entry for the NLP++ language
Julia - Julia NLP Libraries | Back to Top
- CorpusLoaders - A variety of loaders for various NLP corpora
- Languages - A package for working with human languages
- TextAnalysis - Julia package for text analysis
- TextModels - Neural Network based models for Natural Language Processing
- WordTokenizers - High performance tokenizers for natural language processing and other related tasks
- Word2Vec - Julia interface to word2vec
NLP as API with higher level functionality such as NER, Topic tagging and so on | Back to Top
- Wit-ai - Natural Language Interface for apps and devices
- IBM Watson's Natural Language Understanding - API and Github demo
- Amazon Comprehend - NLP and ML suite covers most common tasks like NER, tagging, and sentiment analysis
- Google Cloud Natural Language API - Syntax analysis, NER, sentiment analysis, and content tagging in at least 9 languages, including English and Chinese (Simplified and Traditional).
- ParallelDots - High level Text Analysis API Service ranging from Sentiment Analysis to Intent Analysis
- Microsoft Cognitive Service
- TextRazor
- Rosette
- Textalytic - Natural Language Processing in the Browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more
- NLP Cloud - SpaCy NLP models (custom and pre-trained ones) served through a RESTful API for named entity recognition (NER), POS tagging, and more.
- Cloudmersive - Unified and free NLP APIs that perform actions such as speech tagging, text rephrasing, language translation/detection, and sentence parsing
- GATE - General Architecture for Text Engineering, 15+ years old, free and open source
- Anafora is a free and open-source, web-based raw text annotation tool
- brat - brat rapid annotation tool is an online environment for collaborative text annotation
- doccano - doccano is free, open-source, and provides annotation features for text classification, sequence labeling and sequence to sequence
- INCEpTION - A semantic annotation platform offering intelligent assistance and knowledge management
- Prodigy is an annotation tool powered by active learning, costs $
- LightTag - Hosted and managed text annotation tool for teams, costs $
- rstWeb - open source local or online tool for discourse tree annotations
- GitDox - open source server annotation tool with GitHub version control and validation for XML data and collaborative spreadsheet grids
- Datasaur supports various NLP tasks for individuals or teams, freemium based
- Konfuzio - team-first hosted and on-prem text, image and PDF annotation tool powered by active learning, freemium based, costs $
- UBIAI - Easy-to-use text annotation tool for teams with the most comprehensive auto-annotation features. Supports NER, relations, and document classification, as well as OCR annotation for invoice labeling, costs $
- Shoonya - Shoonya is a free and open-source data annotation platform with a wide variety of organization- and workspace-level management features. Shoonya is data agnostic and can be used by teams to annotate data at scale with various levels of verification stages.
- Annotation Lab - Free End-to-End No-Code platform for text annotation and DL model training/tuning. Out-of-the-box support for Named Entity Recognition, Classification, Relation extraction and Assertion Status Spark NLP models. Unlimited support for users, teams, projects, documents. Not FOSS.
- FLAT - FLAT is a web-based linguistic annotation environment based around the FoLiA format, a rich XML-based format for linguistic annotation. Free and open source.
- Argilla - open-source platform for collecting human feedback, building NLP and LLM datasets, and curating preference data.
- Label Studio - open-core multi-modal labeling platform; widely used for NLP labeling.
NLP tasks organized by linguistic problem. Each subsection lists foundational/classical work first, then neural approaches, then LLM-based methods where relevant. For modern LM-specific research (pretraining, evaluation, retrieval, reasoning, etc.) see Language Models for NLP.
Static word embeddings (foundational):
- word2vec - implementation - explainer blog
- GloVe - explainer blog
- fastText - implementation; subword n-grams handle OOV well, still useful for low-resource languages.
- sense2vec - word sense disambiguation.
- Paragraph Vectors / doc2vec
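The word-analogy arithmetic these static embeddings popularized ("king - man + woman ≈ queen") is just vector addition plus cosine similarity. A minimal stdlib sketch with hand-made 3-d vectors (illustrative toy values, not trained embeddings):

```python
import math

# Toy 3-d "embeddings" chosen by hand so the analogy works;
# real word2vec/GloVe vectors have 100-300 dimensions.
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word w maximizing cosine(vec(b) - vec(a) + vec(c), vec(w))."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "woman", "king"))  # -> queen
```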
Contextual embeddings:
- ELMo - deep contextualized word representations.
- CoVe - contextualized vectors learned from MT.
- ULMFiT - language-model fine-tuning for text classification.
- InferSent - sentence representations from NLI.
Modern sentence and document embeddings: see Retrieval for NLP (Sentence-Transformers, E5, BGE-M3, Nomic, GritLM) and MTEB for current leaderboards.
- SentencePiece - language-agnostic subword tokenization.
- BPE and Unigram LM - the two dominant subword schemes.
- Stanza - tokenization, lemma, and morphology for 70+ languages.
- UDPipe - tokenization, tagging, lemmatization, parsing for Universal Dependencies.
- Morfessor - unsupervised morphological segmentation.
Tokenizer research and architecture (also see Language Models):
- Byte-Pair Encoding (Sennrich et al.) - subword units for neural MT; foundation of modern tokenizers.
- SentencePiece - language-agnostic subword tokenization (BPE and Unigram).
- Tokenizers - fast Rust implementations of BPE, WordPiece, Unigram.
- ByT5 - tokenizer-free byte-level model.
- CANINE - tokenization-free encoder operating on Unicode characters.
- How Good is Your Tokenizer? - tokenizer fairness across languages.
- Byte Latent Transformer (BLT) (Meta, 2024) - dynamic byte-level patching that matches BPE-tokenized models at scale; revives the tokenizer-free direction.
- SuperBPE (2025) - superword tokenization that improves on BPE for downstream tasks.
- Over-Tokenized Transformer (ICML 2025) - decouples input and output vocabularies; shows a log-linear relationship between input vocabulary size and training loss, scaling vocabulary independently of model size.
- Foundations of Tokenization (ICLR 2025) - first formal unified framework for tokenizer models using stochastic-map category theory; establishes conditions for statistical consistency.
- The Token Tax: Systematic Bias in Multilingual Tokenization (2025) - quantifies how tokenization fertility predicts model accuracy across languages, exposing structural cost penalties for morphologically complex and low-resource languages.
- Reducing Tokenization Premiums for Low-Resource Languages (2026) - post-hoc vocabulary additions that coalesce multi-token character sequences for low-resource languages, reducing inference cost without retraining.
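The core BPE training loop (Sennrich et al.) is small enough to sketch in plain Python: count adjacent symbol pairs over a word-frequency dictionary, merge the most frequent pair into one symbol, repeat. A toy version, assuming a tiny three-word corpus (real tokenizers such as SentencePiece and Hugging Face Tokenizers add normalization, byte fallback, and fast data structures):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, each word split into characters plus an end-of-word marker.
vocab = {tuple("low") + ("</w>",): 5,
         tuple("lower") + ("</w>",): 2,
         tuple("lowest") + ("</w>",): 3}

merges = []
for _ in range(3):
    most_frequent = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(vocab, most_frequent)
    merges.append(most_frequent)

print(merges)  # first learned merge is ('l', 'o'), seen in all 10 word tokens
```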
- Universal Dependencies - cross-linguistically consistent treebanks, 100+ languages.
- spaCy and Stanza - production parsers across many languages.
- Deep Biaffine Attention for Neural Dependency Parsing - foundational neural parsing architecture.
- Trankit - light-weight transformer-based multilingual NLP toolkit.
- Self-Attentive Constituency Parsing (Kitaev & Klein) - strong neural constituency parser.
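Graph-based parsers like the biaffine model score every candidate (head, dependent) pair and then decode a tree. A minimal sketch of the decoding idea with hand-made scores (real parsers compute the score matrix with a neural network and use an MST algorithm to guarantee a well-formed tree; greedy argmax is shown here for brevity):

```python
# scores[dep][head]: plausibility that `head` governs `dep`.
# Hand-made numbers for the sentence "she ate fish" (illustrative only).
scores = {
    "she":  {"ROOT": 0.10, "ate": 0.80, "fish": 0.10},
    "ate":  {"ROOT": 0.90, "she": 0.05, "fish": 0.05},
    "fish": {"ROOT": 0.10, "she": 0.10, "ate": 0.80},
}

# Greedy decoding: every word independently picks its highest-scoring head.
heads = {dep: max(candidates, key=candidates.get)
         for dep, candidates in scores.items()}
print(heads)  # -> {'she': 'ate', 'ate': 'ROOT', 'fish': 'ate'}
```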
Foundational and neural:
- CoNLL-2003 NER - canonical English NER benchmark.
- Neural Architectures for NER (Lample et al.) - BiLSTM-CRF, the long-time go-to NER architecture.
- Flair - contextual string embeddings, strong NER across languages.
- spaCy NER - production-ready.
Open and instruction-following IE:
- Universal NER - instruction-tuned LM for open-set NER across languages.
- GLiNER (2023) - small, generalist NER model that handles arbitrary entity types at inference.
- GoLLIE - guideline-following information extraction with LMs.
- REBEL - end-to-end relation extraction as seq2seq.
LLM-based:
- GPT-NER - LLMs for named entity recognition.
- Can LLMs Replace Sentence-Level NER? (2024) - cost-quality tradeoffs.
- Generative NER in the Era of LLMs (2026) - eight open LLMs across four NER benchmarks; PEFT with structured outputs matches encoder-based NER.
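NER benchmarks like CoNLL-2003 are scored with exact-match span F1 rather than per-token accuracy: a predicted entity counts only if both its boundaries and its type match the gold span. A simplified scorer over BIO tags (the official conlleval script additionally handles malformed tag sequences):

```python
def extract_spans(tags):
    """Turn a BIO tag sequence into a set of (type, start, end) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.add((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # "I-" tags simply continue the current span
    return spans

def span_f1(gold_tags, pred_tags):
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    if not true_pos:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]  # right span, wrong type
print(span_f1(gold, pred))  # -> 0.5
```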
- End-to-End Neural Coreference (Lee et al.) - foundation for modern neural coreference.
- SpanBERT - span-based pretraining; strong coreference baseline.
- coref-hoi - higher-order inference coreference.
- maverick-coref (2024) - efficient coreference matching the best larger systems.
- LingMess - linguistically-motivated category-based coreference scoring.
LLM-based:
- LLMs for Coreference Resolution - prompting and fine-tuning for coreference.
- Multilingual Coreference Shared Task: Can LLMs Dethrone Traditional Approaches? (2025) - 9 systems across 4 LLM-based and 5 traditional approaches; traditional methods still lead but LLMs are closing the gap.
- fastText classifier - strong, fast linear baseline.
- Sentiment Treebank (SST) - canonical fine-grained sentiment dataset.
- SetFit - few-shot text classification without prompts.
- FastFit - fast few-shot for many-class settings.
- SST / IMDB / AG News with DeBERTa-v3 - current encoder-fine-tuning baseline.
- PySS3 - white-box, interpretable text classifier.
- LLMs as Annotators - using LLMs for text classification labeling, with caveats.
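Before reaching for fine-tuned encoders, the classical baselines listed here (fastText-style linear models, Naive Bayes over bags of words) are worth keeping as a mental model. A tiny multinomial Naive Bayes with add-one smoothing, on a hand-made toy corpus (illustrative only):

```python
import math
from collections import Counter, defaultdict

train = [("loved this movie great acting", "pos"),
         ("great fun loved it", "pos"),
         ("terrible boring plot", "neg"),
         ("boring waste of time terrible", "neg")]

word_counts = defaultdict(Counter)   # per-class word frequencies
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    def log_score(label):
        prior = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        # Add-one (Laplace) smoothing so unseen words don't zero out a class.
        likelihood = sum(
            math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in text.split() if w in vocab)
        return prior + likelihood
    return max(class_counts, key=log_score)

print(predict("loved the acting"))     # -> pos
print(predict("boring and terrible"))  # -> neg
```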
- Latent Dirichlet Allocation (Blei et al.) - foundational topic model.
- gensim - LDA, LSI, HDP in Python.
- BigARTM - fast regularized topic modeling.
- BERTopic - clustering-based topic modeling on top of contextual embeddings; common modern default.
- Top2Vec - jointly learns topic and document vectors.
- CorEx Topic - hierarchical topic modeling with anchor words.
- TextRank - extractive graph-based summarization.
- Pointer-Generator Networks (See et al.) - foundational neural abstractive summarization.
- PEGASUS - gap-sentences pretraining for summarization.
- BART - widely used denoising seq2seq baseline.
- BookSum and SCROLLS - long-document summarization benchmarks.
LLM-based:
- Benchmarking LLMs for News Summarization - LLMs vs fine-tuned summarizers.
- Element-Aware Summarization with LLMs - structured prompting for summarization.
- Understanding LLM Reasoning for Abstractive Summarization (2025) - explicit reasoning improves fluency but hurts factual grounding; longer reasoning budgets can harm faithfulness.
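TextRank's core idea, ranking sentences by how much vocabulary they share with the rest of the document, can be sketched without the full PageRank iteration by scoring each sentence by its total word overlap with the others (a simplification: real TextRank normalizes the similarities and runs PageRank over the resulting graph):

```python
def words(sentence):
    return set(sentence.lower().replace(".", "").split())

def overlap(a, b):
    """Shared-word count; TextRank additionally normalizes by sentence length."""
    return len(words(a) & words(b))

doc = [
    "The model translates text between many languages.",
    "Translation quality is measured with automatic metrics.",
    "The model supports many languages and high quality text.",
    "Cats are popular pets.",
]

# Score each sentence by its total similarity to the others; the most
# "central" sentence becomes a one-line extractive summary.
scores = {s: sum(overlap(s, t) for t in doc if t is not s) for s in doc}
summary = max(doc, key=scores.get)
print(summary)  # -> "The model supports many languages and high quality text."
```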
Statistical and foundational neural:
- Moses - reference statistical MT system.
- Attention Is All You Need - transformer; reset the field.
- Marian NMT - efficient C++ NMT framework.
- Fairseq - PyTorch sequence modeling toolkit.
Massively multilingual:
- NLLB-200 - MT for 200 languages.
- MADLAD-400 - 400+ language MT.
- SeamlessM4T - speech and text MT, 100+ languages.
Evaluation:
- COMET - learned MT metric; current de-facto standard alongside chrF.
- sacrebleu - reproducible BLEU/chrF/TER scoring.
- BERTScore - similarity-based generation metric.
LLM-based:
- Is ChatGPT a Good Translator? - LLMs as machine translation systems.
- Adapting LLMs for Document-Level MT (2024) - LLMs for context-aware translation.
- GPT-4 vs Human Translators - quality comparison on professional MT.
- Multilingual MT with Open LLMs at Practical Scale (2025) - benchmarks sub-10B open LLMs on 28-language MT; matches GPT-4-turbo and Google Translate.
- Bridging the Linguistic Divide: Survey on LLMs for MT (2025) - survey of how instruction-following, in-context learning, and preference alignment have restructured MT methodology.
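Of the metrics above, chrF is simple enough to sketch: it is an F-score over character n-grams of hypothesis and reference. A simplified single-order version (real chrF, as implemented in sacrebleu, averages over n-gram orders 1-6 and weights recall with beta=2):

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")  # chrF ignores whitespace by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_single_order(hypothesis, reference, n=2):
    """F1 over character n-grams (simplified chrF: one order, beta=1)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    matched = sum((hyp & ref).values())  # clipped n-gram matches
    if not matched:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(chrf_single_order("the cat sat", "the cat sat"))  # -> 1.0
print(chrf_single_order("the cat sat", "a dog ran"))    # -> 0.0
```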
Datasets and foundational systems:
- SQuAD / SQuAD 2.0 - extractive reading comprehension.
- Natural Questions - real-user questions over Wikipedia.
- HotpotQA - multi-hop reasoning.
- TriviaQA - distantly-supervised QA.
- DrQA - open-domain QA over Wikipedia.
- Document-QA - multi-paragraph reading comprehension.
Modern open-domain QA:
- DPR and FiD - retrieve-then-read; the standard pre-LLM open-domain QA pipeline.
- Atlas - retrieval-augmented LM for few-shot QA.
- See also Retrieval for NLP.
LLM-era:
- GPT-4 with retrieval on TriviaQA / NQ
- Self-RAG (2023) - retrieval, generation, and self-critique.
- GAIA - general AI assistant benchmark including multi-step QA.
- OpenIE 6 - schema-free open information extraction.
- Template-Based Information Extraction without the Templates
- Privee: An Architecture for Automatically Analyzing Web Privacy Policies
- REBEL - end-to-end relation extraction.
- DocRED - document-level relation extraction benchmark.
- LLMs for Semantic Role Labeling (2025) - generative LLMs with RAG and self-correction surpass encoder-decoder BERT-style models on SRL in English and Chinese.
- Adapting LLMs for Minimal-edit GEC (2025) - decoder-only LLMs with a novel error-rate adaptation schedule set new SOTA on BEA-test grammatical error correction.
Dense and late-interaction retrieval, increasingly the substrate for QA and IR:
- DPR (Dense Passage Retrieval) - dual-encoder retrieval baseline.
- ColBERT and ColBERTv2 - late-interaction retrieval; strong on out-of-domain data.
- E5 and E5-Mistral - widely used dense embedding families.
- BGE and BGE-M3 (2024) - multilingual, multi-functionality embeddings; top of MTEB across languages.
- Nomic Embed (2024) - fully open, reproducible embedding model.
- Matryoshka Representation Learning - nested embeddings supporting variable dimensionality at inference.
- GritLM (2024) - unified generation and embedding from one model.
- RAG (Retrieval-Augmented Generation) - the original retrieval-augmented framework; foundation for modern QA pipelines.
- Gemini Embedding (2025) - Gemini-derived dense embeddings; SOTA on MMTEB across 250+ languages and on cross-lingual retrieval (XOR-Retrieve, XTREME-UP).
- Qwen3-Embedding (2025) - decoder-based embedding series (0.6B-8B) built on Qwen3; #1 on MTEB Multilingual and MTEB Code, surpassing prior proprietary models.
- Rank1 (2025) - first reranking model trained with test-time compute via DeepSeek-R1 reasoning-trace distillation; SOTA on instruction-following and OOD retrieval.
- ReasonEmbed (2025) - embedding model for reasoning-intensive retrieval with ReMixer data synthesis and Redapter adaptive training; record nDCG@10 of 38.1 on BRIGHT.
- ColBERT-Att (2026) - extends late-interaction retrieval by integrating query and document attention weights into ColBERT scoring; improves recall on MS-MARCO, BEIR, and LoTTE.
Embedding and retrieval benchmarks:
- MMTEB (2025) - community expansion of MTEB to 500+ tasks across 250+ languages.
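As a sketch of how dual-encoder retrieval and Matryoshka-style truncation interact: passages are ranked by cosine similarity to the query embedding, and nested embeddings let the same ranking be re-run on just a prefix of the dimensions. The vectors below are toy values, not real model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy embeddings standing in for dual-encoder outputs.
query = [0.9, 0.1, 0.3, 0.2]
passages = {
    "p1": [0.8, 0.2, 0.3, 0.1],
    "p2": [0.1, 0.9, 0.0, 0.4],
}

# Full-dimensional ranking, as in DPR-style dense retrieval.
ranked = sorted(passages, key=lambda p: cosine(query, passages[p]), reverse=True)

# Matryoshka-style truncation: score with only the first 2 dimensions.
# With nested training, short prefixes remain usable for cheaper retrieval.
ranked_k2 = sorted(passages, key=lambda p: cosine(query[:2], passages[p][:2]), reverse=True)
```

In practice the truncated ranking is used for a fast first pass, with full-dimensional scoring reserved for reranking.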
A short pointer set, since this borders adjacent fields:
- Whisper - multilingual ASR; the modern open default.
- SeamlessM4T - unified speech and text translation.
- Canary (NVIDIA, 2024) - top open multilingual ASR model.
- Wav2Vec 2.0 - foundational self-supervised speech pretraining.
- Coqui TTS and VieNeu-TTS - open TTS.
Dataset hubs and lists:
- HuggingFace Datasets Hub - the central index for modern NLP datasets, with versioned, streamable loaders.
- nlp-datasets - large collection of NLP datasets.
- gensim-data - data repository for pretrained NLP models and NLP corpora.
Pretraining-scale corpora (open):
- The Pile - 825 GiB diverse text corpus.
- RedPajama / RedPajama-V2 (2023-2024) - reproductions of LLaMA pretraining data; V2 is 30T tokens with quality signals.
- Dolma (AI2, 2023-2024) - 3T-token open pretraining corpus with documented filtering pipeline.
- FineWeb / FineWeb-Edu (2024) - 15T-token cleaned web corpus; FineWeb-Edu filters for educational quality.
- CulturaX - 6.3T tokens across 167 languages.
- Common Corpus (2024) - 2T-token open-license multilingual corpus.
Task and instruction datasets:
- Universal Dependencies - cross-linguistically consistent treebank annotation, 100+ languages.
- Tülu 3 SFT Mixture (2024) - open instruction-tuning data behind Tülu 3.
- tiny_qa_benchmark_pp - tiny multilingual QA datasets plus a library to generate your own synthetic copies.
- UDPipe - trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing of Universal Dependencies treebanks and other CoNLL-U files. Written primarily in C++, it offers fast, reliable multilingual processing.
- NLP-Cube - NLP pipeline covering sentence splitting, tokenization, lemmatization, part-of-speech tagging, and dependency parsing. Written in Python with DyNet 2.0; offers standalone use (CLI/Python bindings) and a REST API server.
- UralicNLP - NLP library focused on endangered Uralic languages such as the Sami, Mordvin, Mari, and Komi languages; it also supports non-endangered languages such as Finnish, and non-Uralic languages such as Swedish and Arabic. Provides morphological analysis, generation, lemmatization, and disambiguation.
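UDPipe and the Universal Dependencies tooling above consume the CoNLL-U format: comment lines starting with `#`, then one token per line with ten tab-separated columns. A minimal reader, with column names per the UD specification:

```python
def parse_conllu(block):
    """Parse a CoNLL-U sentence block into a list of token dicts."""
    # The ten CoNLL-U columns, in order, per the Universal Dependencies spec.
    cols = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]
    tokens = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        tokens.append(dict(zip(cols, line.split("\t"))))
    return tokens

sample = """# text = Dogs bark.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
"""
toks = parse_conllu(sample)
```

This ignores multiword-token ranges (IDs like `1-2`) and empty nodes, which a production reader would need to handle.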
Pretrained language models and the research around them, scoped to NLP tasks and linguistic phenomena. For general-purpose LLM tooling, agents, or RAG application kits, see See Also.
Encoders (still the workhorse for classical NLP tasks):
- BERT - bidirectional transformer pretraining; foundation for most encoder-based NLP work since 2018.
- RoBERTa - robustly optimized BERT pretraining; common encoder baseline.
- DeBERTa / DeBERTa-v3 - disentangled attention; strong on classification, NER, NLI.
- ELECTRA - replaced-token-detection pretraining, sample-efficient.
- ModernBERT (2024) - modernized encoder with rotary embeddings, FlashAttention, 8K context; current go-to encoder for classification, NER, retrieval.
- NeoBERT (2025) - 250M-parameter encoder integrating modern architecture improvements (RoPE, 4K context, optimized depth-to-width); state of the art on MTEB, surpasses ModernBERT and RoBERTa-large under identical fine-tuning.
Encoder-decoder and seq2seq:
- T5 and FLAN-T5 - text-to-text framing for NLP tasks; strong instruction-tuned encoder-decoder baselines.
- BART - denoising seq2seq pretraining; widely used for summarization and generation.
Open decoder-only LMs (used as substrate for NLP tasks):
- Llama 3 / 3.1 / 3.3 (Meta, 2024-2025) - widely adopted open-weight family; default base for fine-tuning across NLP tasks.
- Qwen 2.5 / Qwen 3 (Alibaba, 2024-2025) - strong multilingual coverage, especially Chinese; often top open model on multilingual benchmarks.
- DeepSeek-V3 (2024) - efficient MoE pretraining; competitive open base model.
- OLMo 2 (AI2, 2025) - fully open: weights, training data, code; reproducibility benchmark.
- Gemma 2 / Gemma 3 (Google, 2024-2025) - open small/mid-size models with strong NLP-task performance.
- Mistral / Mixtral - efficient dense and sparse-MoE open models.
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? - encoder vs decoder vs encoder-decoder for NLP transfer.
- XLM-R - cross-lingual masked LM trained on CommonCrawl, 100 languages.
- mT5 - multilingual T5 covering 101 languages.
- BLOOM - 176B-parameter open multilingual LM, 46 natural languages.
- Aya 23 / Aya Expanse (Cohere For AI, 2024) - massively multilingual instruction-tuned models covering 23-101 languages.
- Glot500 - encoder for 500+ languages, focus on low-resource.
- NLLB-200 - No Language Left Behind: MT for 200 languages.
- MADLAD-400 - 400+ language MT model and 3T-token multilingual corpus.
- SeamlessM4T / Seamless (Meta, 2023-2024) - multilingual and multimodal speech-text translation, 100+ languages.
- SEA-LION / SeaLLM (2024-2025) - LMs targeting Southeast Asian languages.
- Babel (2025) - open multilingual LLMs (9B and 83B) covering the top 25 languages by speaker population (~90% of global speakers); surpasses comparably-sized open multilingual models on XCOPA, XNLI, MGSM, FLORES-200.
- Lugha-Llama (Princeton/Mila, 2025) - Llama-3.1-8B adapted for low-resource African languages via the curated WURA corpus; SOTA open-source results on IrokoBench and AfriQA.
- AfriqueLLM (McGill, 2026) - suite of open LLMs (4B-14B) continued-pretrained on 26B tokens across 20 African languages with a comprehensive empirical study of data mixing.
- TranslateGemma (Google, 2026) - open translation-specialized models built on Gemma 3, covering 55 language pairs via SFT and RL with quality-reward models.
- MiLMMT-46 (Xiaomi, 2026) - open multilingual MT scaled across 46 languages, matching commercial systems like Google Translate and Gemini 3 Pro.
NLU and cross-lingual:
- GLUE and SuperGLUE - English NLU benchmarks.
- XTREME and XGLUE - cross-lingual NLU.
- XNLI - cross-lingual natural language inference, 15 languages.
- FLORES-200 - MT evaluation across 200 languages.
- MTEB - Massive Text Embedding Benchmark; standard for sentence/document encoders.
- BEIR - heterogeneous IR benchmark for retrieval models.
Modern LM evaluation (2023-2026):
- HELM - holistic evaluation across NLP tasks, accuracy and beyond.
- BIG-bench - 200+ tasks probing language model capabilities.
- MMLU - multitask knowledge evaluation across 57 subjects.
- MMLU-Pro (2024) - harder, more discriminative successor to MMLU.
- GPQA - graduate-level Q&A, "Google-proof" reasoning evaluation.
- IFEval - verifiable instruction-following evaluation.
- Chatbot Arena (LMSYS) - human-preference ELO leaderboard for chat models.
- LiveBench (2024) - contamination-resistant benchmark with monthly refresh.
- LM Evaluation Harness - unified framework for LM benchmark evaluation.
- MMLU-ProX (2025) - multilingual extension of MMLU-Pro to 29 typologically diverse languages; reveals up to 24.3% performance gap between high- and low-resource languages.
- MultiChallenge (2025) - multi-turn conversational benchmark exposing simultaneous instruction-following and in-context-reasoning failures; all tested frontier models score below 50%.
- FRAMES (2025) - unified RAG evaluation: 824 multi-hop questions requiring factuality, retrieval accuracy, and cross-document reasoning together.
Long-context evaluation:
- Needle in a Haystack - retrieval probe for long-context windows.
- RULER (2024) - synthetic long-context tasks beyond simple retrieval.
- LongBench - bilingual long-context benchmark across NLP tasks.
- LongBench v2 (2025) - 503 expert-crafted multiple-choice questions spanning 8K-2M-word contexts with deep multi-hop reasoning; humans score 53.7% under time pressure.
- U-NIAH (2025) - extends needle-in-haystack with multi-needle and nested configurations; shows RAG mitigates lost-in-the-middle for smaller LLMs but degrades reasoning models.
A trend-defining direction in 2024-2026: models that produce explicit reasoning traces and benefit from extra inference compute.
- Chain-of-Thought Prompting - foundational result; intermediate reasoning steps improve performance.
- Self-Consistency - majority vote over sampled CoT chains.
- Tree of Thoughts - search over reasoning trees.
- Self-Refine and Reflexion - self-correction at inference time.
- Large Language Models are Zero-Shot Reasoners - chain-of-thought for NLP reasoning tasks.
- Let's Verify Step by Step - process-supervised reward models for reasoning.
- DeepSeek-R1 (2025) - open reasoning model trained with pure RL; replicated o1-style behavior in the open.
- OpenAI o1 / o3 (2024-2025) - test-time-compute reasoning systems.
- Scaling LLM Test-Time Compute Optimally (2024) - systematic study of inference-time compute tradeoffs.
- s1: Simple Test-Time Scaling (2025) - small open reasoning recipe via budget-forcing.
- Kimi k1.5 (2025) - long-context RL with policy optimization (no MCTS, no PRM) reaching o1-level performance; introduces long-CoT distillation into short-CoT models.
- rStar-Math (2025) - small policy model paired with a process preference model trained via MCTS rollouts; enables small LMs to bootstrap reasoning without distilling from larger models.
- DAPO (2025) - open GRPO-based RL training system with four key improvements (decoupled clipping, dynamic sampling, token-level loss, entropy bonus); reproduces and surpasses DeepSeek-R1-Zero-level reasoning.
- VAPO (2025) - value-model-based RL with length-adaptive GAE and token-level clipping; surpasses value-free GRPO methods on AIME 2024 with stable training.
- ThinkPRM (2025) - generative process reward models that produce chain-of-thought verification per step, matching discriminative PRMs with 1% of the supervision labels.
- OpenThoughts (2025) - 1000+ controlled experiments on data recipes for open reasoning models; SOTA on AIME 2025 matching closed distillation baselines.
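Self-consistency, from the list above, is simple enough to state in a few lines: sample several chains-of-thought at nonzero temperature, extract each chain's final answer, and take a majority vote. The sampled answers below are illustrative; a real implementation would call an LM and parse each chain.

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over final answers extracted from sampled CoT chains."""
    winner, _count = Counter(final_answers).most_common(1)[0]
    return winner

# Pretend five chains were sampled at temperature > 0 and their answers parsed out.
sampled = ["42", "41", "42", "42", "17"]
voted = self_consistency(sampled)
```

The vote marginalizes over reasoning paths: individual chains may err, but errors tend to disagree while correct chains tend to converge.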
- Mamba and Mamba-2 - selective state-space models, linear-time long-context alternative to attention.
- RWKV - RNN-transformer hybrid scaling to large parameter counts.
- Jamba (2024) - hybrid Mamba-Transformer-MoE architecture.
- RoPE and YaRN - rotary position embeddings and context-length extension.
- Position Interpolation - extending context windows with minimal fine-tuning.
- Lost in the Middle - long-context degradation patterns in NLP tasks.
- RAG vs Long-Context LLMs (2024) - tradeoffs for QA over long inputs.
- Titans: Learning to Memorize at Test Time (2025) - neural long-term memory module that learns to memorize historical context at test time; scales beyond 2M tokens, outperforms transformers and modern linear-recurrent models on language modeling and reasoning.
- MiniMax-01 (2025) - 456B-parameter hybrid combining lightning (linear) attention with sparse softmax attention; matches GPT-4o-level NLP performance at up to 4M-token inference contexts.
- Native Sparse Attention (NSA) (2025) - trainable sparse attention combining coarse-grained compression with fine-grained selection; large speedups at 64K with no NLP-benchmark degradation.
- LongRoPE2 (2025) - identifies undertraining of high-frequency RoPE dimensions and applies evolutionary-search rescaling; extends LLaMA3-8B to 128K with 80x fewer training tokens than Meta's recipe.
- Characterizing SSM and Hybrid LM Long-Context Performance (2025) - first comprehensive memory and speed analysis of transformer, SSM, and hybrid models up to 220K tokens; SSMs are up to 4x faster, hybrids balance recall and efficiency.
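Rotary position embeddings, central to the context-extension work above (YaRN, Position Interpolation, LongRoPE2), rotate each 2-D slice of a query or key by an angle proportional to its position, so attention dot products depend only on the relative offset between positions. A minimal sketch of that property:

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even length."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        # Each (even, odd) pair rotates at its own frequency base**(-i/d).
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q, k = [1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0]
# Relative-position property: offsets (5-3) and (2-0) give the same score.
a = dot(rope(q, 3), rope(k, 5))
b = dot(rope(q, 0), rope(k, 2))
```

Context-extension methods like Position Interpolation and YaRN work by rescaling the per-pair angles so positions beyond the training length map back into the trained angular range.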
- Survey of Hallucination in Natural Language Generation - taxonomy and mitigation strategies.
- TruthfulQA - benchmark for truthfulness in question answering.
- FActScore - fine-grained factual precision in long-form generation.
- LongFact / SAFE (2024) - long-form factuality benchmark and search-augmented evaluator.
- SelfCheckGPT - sampling-based hallucination detection.
- RAGAS - reference-free evaluation for RAG and QA pipelines.
- Lookback Lens (2024) - attention-pattern-based hallucination detection in long-context generation.
- Calibration of LLMs on Multiple Choice (2024) - calibration analysis under format effects.
- HalluLens (2025) - hallucination benchmark with extrinsic/intrinsic taxonomy and dynamic test-set regeneration to resist data leakage.
- Atomic Calibration (2025) - claim-level calibration analysis for long-form generation; models are substantially worse-calibrated on extended outputs than on single claims.
- FRANQ (2025) - faithfulness-aware uncertainty quantification for RAG fact-checking; formally separates faithfulness from factuality.
- MUCH (2025) - multilingual claim-hallucination benchmark across English, French, Spanish, German with token-level logits released for principled UQ evaluation.
- HalluHard (2026) - hard multi-turn hallucination benchmark for citation-required responses; ~30% hallucination rates persist even with web search.
- CURE: Think Through Uncertainty (2026) - trains models to reason about claim-level uncertainty before generating; large gains on biography factuality and FactBench AUROC.
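The sampling-based detection idea behind SelfCheckGPT can be caricatured in a few lines: resample the model several times and flag claims the resamples fail to support. SelfCheckGPT itself scores support with NLI or QA models; the keyword-overlap heuristic here is only an illustrative stand-in.

```python
def selfcheck_score(claim_keywords, resampled_outputs):
    """Fraction of resampled generations that mention none of the claim's keywords.

    Higher score = the claim is less consistent across samples = more suspect.
    """
    unsupported = sum(
        1 for out in resampled_outputs
        if not any(kw.lower() in out.lower() for kw in claim_keywords)
    )
    return unsupported / len(resampled_outputs)

samples = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower stands in Paris, France.",
    "The Eiffel Tower is in Lyon.",
]
score = selfcheck_score(["Paris"], samples)
```

The underlying assumption is that facts the model genuinely knows recur across samples, while hallucinations vary from draw to draw.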
- A Primer in BERTology - what BERT learns about language.
- Probing Classifiers (Belinkov) - methodology, limitations, alternatives.
- Locating and Editing Factual Associations in GPT (ROME) - causal tracing of factual recall.
- The Pyramid of NLP Probes - structural probing for linguistic knowledge.
- Toy Models of Superposition - foundation for the sparse-feature view of transformer representations.
- Towards Monosemanticity / Scaling Monosemanticity (Anthropic, 2024) - sparse autoencoders extracting interpretable features from production-scale LMs.
- Sparse Autoencoders Find Highly Interpretable Features - SAE methodology for LM interpretability.
- Neuronpedia - open platform browsing SAE features across models.
- Influence Functions Scale to LLMs (2023) - identifying training examples driving model behavior.
- Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic, 2025) - introduces cross-layer transcoders and attribution graphs to construct an interpretable replacement model; enables prompt-level circuit tracing of feature-to-feature causal interactions.
- On the Biology of a Large Language Model (Anthropic, 2025) - applies attribution graphs to Claude 3.5 Haiku across multi-hop reasoning, rhyme planning, and jailbreak case studies.
- Transcoders Beat Sparse Autoencoders for Interpretability (2025) - shows transcoders (reconstructing layer outputs from inputs) yield more interpretable features than SAEs; introduces skip transcoders.
- Survey on Sparse Autoencoders for LLM Interpretability (EMNLP 2025) - reference survey of SAE architectures, training strategies, feature explanation, and evaluation.
- Finding Highly Interpretable Prompt-Specific Circuits (2026) - identifies circuits at the per-prompt level (rather than per-task); reveals mechanism clustering by prompt family.
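A probing classifier in the sense of the Belinkov survey above is just a small supervised model trained on frozen representations: if a linear readout recovers a property, the representation encodes it linearly. A toy perceptron probe over made-up 2-D "representations":

```python
def train_probe(reps, labels, lr=0.1, epochs=50):
    """Perceptron probe: labels in {-1, +1}, reps are frozen feature vectors."""
    w, b = [0.0] * len(reps[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(reps, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy "frozen representations" where the probed property is linearly decodable.
reps = [[2.0, 0.1], [1.8, 0.0], [0.0, 2.1], [0.2, 1.9]]
labels = [1, 1, -1, -1]
w, b = train_probe(reps, labels)
```

A standing caveat from the probing literature: high probe accuracy shows the information is present, not that the model uses it; hence the causal methods (ROME, circuit tracing) listed above.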
Distillation and small models:
- DistilBERT and MiniLM - distilled encoders for production NLP.
- Phi-3 / Phi-4 (Microsoft, 2024) - small models trained on curated data, competitive with much larger ones on NLP benchmarks.
- SmolLM2 (HuggingFace, 2025) - fully open small-LM family with reproducible training data.
- SmolLM3 (HuggingFace, 2025) - 3B fully open decoder pretrained on 11.2T tokens with NoPE and YaRN for 128K context; competitive with 4B-class models.
- Gemma 3 Technical Report (Google, 2025) - 1B-27B open models with high local-to-global attention ratio to keep KV-cache tractable at 128K context.
- Qwen3 Technical Report (Alibaba, 2025) - dense and MoE models 0.6B-235B with unified thinking/non-thinking modes; the 30B-A3B MoE matches larger dense models while activating only 3B parameters.
- Apple Intelligence Foundation Language Models (Apple, 2025) - on-device 3B model using KV-cache sharing and 2-bit QAT for 37.5% cache memory reduction without accuracy loss.
- Sentence-Transformers - sentence and paragraph embeddings via Siamese BERT.
- SetFit - few-shot text classification without prompts.
- FastFit - fast few-shot classification for many-class settings.
- GTE, BGE, and Stella - compact text embedding models near the top of MTEB.
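Sentence-Transformers-style models typically derive a sentence vector by mean-pooling token embeddings over non-padding positions. The recipe, with made-up token vectors standing in for encoder outputs:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where the attention mask is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    sums, count = [0.0] * dim, 0
    for emb, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, v in enumerate(emb):
                sums[i] += v
    return [s / count for s in sums]

# Three token positions; the last one is padding and must not dilute the mean.
sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

Masking matters: averaging over padding positions would shift every sentence vector by an amount that depends on sequence length rather than content.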
Quantization and serving (relevant when deploying NLP models at scale):
- GPTQ - post-training quantization for transformers.
- AWQ - activation-aware weight quantization.
- KVTuner (ICML 2025) - sensitivity-aware layer-wise mixed-precision KV-cache quantization; up to 21% throughput improvement over uniform KV8.
- GGUF / llama.cpp - portable quantized inference.
- vLLM - PagedAttention-based high-throughput LM serving.
- SGLang - structured generation and efficient serving.
- Text Generation Inference (TGI) - HF production serving for LMs.
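At its core, weight-only quantization maps floats onto a small integer grid plus a scale; GPTQ and AWQ refine how the scale and rounding are chosen, but the round-trip looks like this symmetric per-tensor int8 sketch:

```python
def quantize_int8(weights):
    """Symmetric quantization: w ≈ scale * q with integer q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [scale * v for v in q]

weights = [0.4, -1.27, 0.03, 0.9]
q, scale = quantize_int8(weights)
reconstructed = dequantize(q, scale)
```

Rounding error is bounded by half a quantization step (scale / 2), which is why methods like AWQ rescale salient channels before quantizing: a smaller effective scale on important weights means a tighter error bound where it matters.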
Parameter-efficient fine-tuning:
- LoRA and QLoRA - low-rank adapters and quantized fine-tuning; the standard for adapting LMs to NLP tasks on modest hardware.
- DoRA (2024) - weight-decomposed low-rank adaptation.
- PEFT - HuggingFace library bundling LoRA, prefix tuning, IA3, and others.
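The LoRA update is small enough to write out: the frozen weight W is left untouched, and a low-rank product A·B (scaled by alpha/r) is added to the layer output. A pure-Python sketch; with B initialized to zero, as in the paper, the adapted layer starts out identical to the frozen one:

```python
def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha=16, r=2):
    # y = x @ W + (alpha / r) * x @ A @ B ; only A (d x r) and B (r x k) train.
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    s = alpha / r
    return [[bb + s * u for bb, u in zip(br, ur)] for br, ur in zip(base, update)]

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity, for illustration)
A = [[0.3], [0.1]]             # d x r with r = 1
B_zero = [[0.0, 0.0]]          # r x k, zero-initialized as in LoRA
y = lora_forward(x, W, A, B_zero, alpha=16, r=1)
```

Because only A and B receive gradients, the trainable parameter count is 2·d·r per adapted matrix instead of d·k, which is what makes fine-tuning feasible on modest hardware.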
- FLAN - finetuned language models as zero-shot learners.
- InstructGPT - training LMs to follow instructions with human feedback.
- Self-Instruct - bootstrapping instruction data from LMs.
- Super-NaturalInstructions - 1600+ NLP tasks with instructions.
- Constitutional AI - training LMs with AI-generated feedback against a written constitution.
- Direct Preference Optimization - simpler alternative to RLHF; widely adopted.
- Tülu 3 (AI2, 2024) - fully open post-training recipe with state-of-the-art results among open models.
- LIMA - "less is more for alignment"; small high-quality SFT data goes a long way.
- TRL - reference library for SFT, DPO, GRPO, and RLHF.
- Magpie (2024-2025) - synthesizes high-quality instruction-response pairs by prompting aligned LMs with nothing; SFT on the filtered subset matches official Llama-3-Instruct.
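The DPO objective needs only sequence log-probabilities from the policy and a frozen reference model: no reward model and no sampling loop. A sketch of the per-pair loss, with made-up log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(log pi_c - log ref_c) - (log pi_r - log ref_r)])"""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does: low loss.
good = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0, ref_chosen=-11.0, ref_rejected=-12.0)
# No preference margin yet: loss is exactly log(2).
neutral = dpo_loss(-11.0, -12.0, -11.0, -12.0)
```

The reference log-probs anchor the update, penalizing drift from the base model the same way the KL term does in RLHF, while beta controls how sharply preferences are enforced.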
- StereoSet - measuring stereotypical bias in pretrained LMs.
- CrowS-Pairs - social bias measurement in masked LMs.
- WinoBias - gender bias in coreference resolution.
- HolisticBias - bias measurement across many demographic axes.
- RealToxicityPrompts - toxicity in LM generation.
- Sycophancy in Language Models - models tailoring answers to user beliefs.
- Alignment Faking in Large Language Models (Anthropic, 2024) - models strategically complying during training.
- WildGuard (2024) - open safety moderation model and benchmark.
- Emergent Misalignment (2025) - finetuning on a narrow task (insecure code) unexpectedly produces broad alignment failures across unrelated domains.
- SafeDialBench (2025) - multilingual (Chinese/English) safety benchmark of 4000+ multi-turn dialogues across 22 scenarios and 7 jailbreak strategies.
- TeleAI-Safety (2025) - modular jailbreak evaluation framework integrating 19 attacks, 29 defenses, and 19 evaluation methods across 14 models and 12 risk categories.
- IndicSafe (2026) - multilingual safety benchmark across 12 Indic languages; reveals 12.8% cross-language agreement, with over-refusal in low-resource scripts.
- VLAF: Value-Conflict Alignment Faking (2026) - alignment faking occurs in models as small as 7B in 37% of cases when policy conflicts with internalized values; steering-vector mitigation reduces it 94%.
Resources organized by human language. Click a section to expand.
- CAMeL Tools - Python toolkit for Arabic NLP including dialect ID, morphology, NER.
- goarabic - Go package for Arabic text processing.
- jsastem - JavaScript Arabic stemmer.
- PyArabic - Python library for Arabic.
- RFTokenizer - trainable segmenter for Arabic, Hebrew, and Coptic.
- Farasa - QCRI segmentation, POS tagging, and NER for Arabic.
- AraBERT - Arabic BERT family.
- CAMeLBERT - BERT models for MSA, dialectal, and Classical Arabic.
- AraELECTRA - efficient Arabic pretraining (released alongside AraBERT).
- Jais (2023-2024) - bilingual Arabic-English open LM family.
- ALLaM (SDAIA, 2024) - Arabic-first foundation models.
- Multidomain Datasets - largest available multi-domain Arabic sentiment analysis resources.
- LABR - large Arabic book reviews dataset.
- Arabic Stopwords - aggregated Arabic stopwords.
- ArabicMMLU (2024) - Arabic MMLU benchmark.
- jieba - Python package for Chinese word segmentation.
- SnowNLP - Python package for Chinese NLP.
- FudanNLP - Java library for Chinese text processing.
- HanLP - multilingual NLP library with strong Chinese support.
- LTP - HIT Language Technology Platform: segmentation, POS, NER, parsing.
- Chinese-BERT-wwm - whole-word masking BERT for Chinese.
- MacBERT - improved Chinese BERT with MLM-as-correction pretraining.
- Qwen 2.5 / Qwen 3 - Alibaba's open Chinese-strong LM family.
- ChatGLM3 / GLM-4 - Tsinghua's bilingual Chinese-English LMs.
- Baichuan 2 - open Chinese LM.
- Yi - 01.AI's bilingual open LMs.
- DeepSeek-V3 - efficient open MoE model with strong Chinese.
- funNLP - large collection of Chinese NLP tools and resources.
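Dictionary-based Chinese word segmentation can be sketched with forward maximum matching, the classic greedy baseline. This is only the textbook version: jieba itself builds a DAG over dictionary matches and falls back to an HMM for out-of-vocabulary words.

```python
def forward_max_match(text, vocab, max_word_len=4):
    """Greedy forward maximum matching over a word dictionary."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                words.append(piece)
                i += length
                break
    return words

vocab = {"自然语言", "处理", "自然"}
segmented = forward_max_match("自然语言处理", vocab)
```

Greedy longest-match fails on overlap ambiguities, which is exactly the gap the statistical and neural segmenters listed above were built to close.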
- Named Entity Recognition for Danish
- DaNLP - NLP resources in Danish.
- Awesome Danish - curated list of resources for Danish language technology.
- python-frog - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatization, dependency parsing, NER).
- SimpleNLG_NL - Dutch surface realiser for natural language generation, based on the SimpleNLG implementation.
- Alpino - dependency parser for Dutch (also does POS tagging and lemmatization).
- Kaldi NL - Dutch speech-recognition models based on Kaldi.
- spaCy Dutch model - industrial-strength NLP with a Dutch pipeline.
- German-NLP - curated list of open-access, open-source, and off-the-shelf resources and tools developed with a focus on German.
- awesome-hungarian-nlp - curated list of free resources for Hungarian NLP.
- Hindi Dependency Treebank - A multi-representational multi-layered treebank for Hindi and Urdu
- Universal Dependencies Treebank in Hindi
- Parallel Universal Dependencies Treebank in Hindi - A smaller part of the above-mentioned treebank.
- ISI FIRE Stopwords List (Hindi and Bangla)
- Peter Graham's Stopwords List
- NLTK Corpus 60k Words POS Tagged, Bangla, Hindi, Marathi, Telugu
- Hindi Movie Reviews Dataset ~1k Samples, 3 polarity classes
- BBC News Hindi Dataset 4.3k Samples, 14 classes
- IIT Patna Hindi ABSA Dataset 5.4k Samples, 12 Domains, 4k aspect terms, aspect and sentence level polarity in 4 classes
- Bangla ABSA 5.5k Samples, 2 Domains, 10 aspect terms
- IIT Patna Movie Review Sentiment Dataset 2k Samples, 3 polarity labels
- SAIL 2015 Twitter and Facebook labelled sentiment samples in Hindi, Bengali, Tamil, Telugu.
- IIT Bombay CFILT Resources - Sentiwordnet, parallel labelled corpora, sense-annotated corpora, and Marathi polarity-labelled corpus.
- TDIL-IC - aggregates many useful Indic-language resources and provides access to otherwise gated datasets.
- Hindi2Vec and nlp-for-hindi - ULMFiT-style language models for Hindi.
- IIT Patna Bilingual Word Embeddings Hi-En
- fastText word embeddings for 157 languages, trained on Common Crawl and Wikipedia.
- Hindi and Bengali Word2Vec
- Hindi and Urdu Elmo Model
- Sanskrit Albert Trained on Sanskrit Wikipedia and OSCAR corpus
- Multi-Task Deep Morphological Analyzer - deep morphological parser for Hindi and Urdu.
- Indic NLP Library - tokenization, transliteration, MT helpers across 18 Indic languages.
- SivaReddy's Dependency Parser (Python3 port) - dependency parsing and POS tagging for Kannada, Hindi, and Telugu.
- iNLTK - NLP toolkit for Indic languages on PyTorch/Fastai.
- AI4Bharat IndicNLP Suite - tools, datasets, and models across 22 Indic languages.
- IndicBERT v2 (2022-2024) - multilingual BERT for 23 Indic languages.
- IndicTrans2 (2023-2024) - high-quality MT for 22 Indic languages.
- OpenHathi (Sarvam AI, 2023) - bilingual Hindi-English LLaMA continuation.
- Airavata (2024) - instruction-tuned Hindi LLM.
- Sarvam-1 (2024) - multilingual LM trained from scratch on 10 Indic languages.
- BharatGPT / Krutrim (2024) - Indic-focused foundation models.
- bahasa - natural language toolkit for Indonesian.
- Indonesian Word Embedding
- Indonesian fastText trained on Wikipedia.
- IndoBERT (IndoNLU) - pretrained Indonesian LM with the IndoNLU benchmark suite.
- IndoBERT (IndoLEM) - alternative IndoBERT with the IndoLEM benchmark.
- NusaCrowd / Cendol (2023-2024) - large-scale community datasets and Cendol instruction-tuned LMs for Indonesian and regional languages.
- Sailor - open Southeast-Asian LMs covering Indonesian.
- SEA-LION (2024) - Singapore AI's open Southeast-Asian LM with strong Indonesian.
- Kompas and Tempo collections at ILPS
- PANL10N for POS tagging - 39K sentences and 900K word tokens.
- IDN for POS tagging - 10K sentences and 250K word tokens.
- Indonesian Treebank and Universal Dependencies-Indonesian
- IndoSum - text summarization and classification.
- Wordnet-Bahasa - large, free, semantic dictionary.
- KoNLPy - Python package for Korean natural language processing.
- Mecab (Korean) - C++ library for Korean NLP.
- KoalaNLP - Scala library for Korean NLP.
- KoNLP - R package for Korean NLP.
- kss - Korean sentence splitter.
- Kiwi - fast Korean morphological analyzer.
- KoBERT - Korean BERT from SKT.
- KLUE-RoBERTa - models trained on the KLUE benchmark.
- Polyglot-Ko - open Korean LMs.
- EXAONE 3.5 (LG, 2024) - bilingual Korean-English open LM family.
- HyperCLOVA X - Naver's Korean foundation model.
- KAIST Corpus - corpus from the Korea Advanced Institute of Science and Technology in Korean.
- Naver Sentiment Movie Corpus in Korean
- Chosun Ilbo archive - dataset in Korean from a major South Korean newspaper.
- Chat data - chatbot data in Korean.
- Petitions - expired petition data from the Blue House National Petition Site.
- Korean Parallel corpora - NMT dataset for Korean to French and Korean to English.
- KorQuAD - Korean SQuAD dataset (v1.0 and v2.1) with Wiki HTML source.
- Hazm - Persian NLP toolkit.
- Parsivar - Persian language processing toolkit.
- Perke - Persian keyphrase extraction.
- Perstem - Persian stemmer, morphological analyzer, and partial POS tagger.
- ParsiAnalyzer - Persian analyzer for Elasticsearch.
- virastar - Persian text cleaning.
- ParsBERT - Persian BERT.
- PersianMind (2023-2024) - Persian instruction-tuned LM.
- Dorna (Part AI, 2024) - Llama-3-based Persian instruction model.
- Bijankhan Corpus - tagged corpus suitable for Persian (Farsi) NLP research, ~2.6M manually tagged words across 40 POS tags.
- Uppsala Persian Corpus (UPC) - large freely available Persian corpus, 2.7M tokens annotated with 31 POS tags.
- Large-Scale Colloquial Persian - LSCP: 120M sentences from 27M casual Persian tweets with dependency, POS, and sentiment annotations.
- ArmanPersoNERCorpus - 250K tokens, 7,682 sentences with NER tags in IOB format.
- FarsiYar PersianNER - ~25M tokens, ~1M Persian sentences from Persian Wikipedia Corpus.
- PERLEX - first Persian dataset for relation extraction (translated SemEval-2010 Task 8).
- Persian Syntactic Dependency Treebank - 29,982 annotated sentences covering most verbs of the Persian valency lexicon.
- Uppsala Persian Dependency Treebank (UPDT) - dependency-based syntactically annotated corpus.
- Hamshahri - standard reliable Persian text collection used at CLEF 2008-2009.
- Polish-NLP - curated list of resources dedicated to Polish NLP: models, tools, and datasets.
- Portuguese-nlp - curated list of Portuguese NLP resources and tools.
- spanlp - Python library to detect, censor, and clean profanity, hate speech, and bullying in Spanish, with data from 21 Spanish-speaking countries.
- Colombian Political Speeches
- Copenhagen Treebank
- Spanish Billion Words Corpus with Word2Vec embeddings
- Compilation of Spanish Unannotated Corpora
- BETO - BERT for Spanish.
- RoBERTa-bne - Spanish RoBERTa trained on the Spanish National Library corpus.
- Latxa (2024) - open foundation LM for Basque, also covers Spanish.
- Salamandra (BSC, 2024) - multilingual LM with strong Spanish coverage from the Barcelona Supercomputing Center.
- RigoChat (2024) - Spanish-instruction-tuned open model.
- Spanish Word Embeddings (multiple methods/corpora)
- Spanish fastText Embeddings
- Spanish sent2vec Sentence Embeddings
- PyThaiNLP - Thai NLP in Python.
- JTCC - character cluster library in Java.
- CutKum - word segmentation with deep learning in TensorFlow.
- Thai Language Toolkit - tokenization and POS tagging.
- SynThai - word segmentation and POS tagging using deep learning.
- WangchanBERTa - pretrained Thai language model.
- Typhoon (SCB 10X, 2024) - open Thai LLM family.
- OpenThaiGPT (2023-2024) - open Thai instruction-tuned models.
- Sailor - open Southeast-Asian LM family covering Thai.
- Inter-BEST - text corpus with 5M words and word segmentation.
- Prime Minister 29 - dataset of speeches by Thailand's 29th Prime Minister.
- awesome-ukrainian-nlp - curated list of Ukrainian NLP datasets, models, etc.
- UkrainianLT - curated list focused on machine translation and speech processing.
- underthesea - Vietnamese NLP toolkit.
- vn.vitk - Vietnamese text processing toolkit.
- VnCoreNLP - Vietnamese NLP toolkit.
- pyvi - Python Vietnamese core NLP toolkit.
- VieNeu-TTS - on-device Vietnamese text-to-speech with voice cloning.
- PhoBERT - pretrained LM for Vietnamese.
- BARTpho - sequence-to-sequence pretrained model for Vietnamese.
- PhoGPT (VinAI, 2023-2024) - open generative LM for Vietnamese.
- Vistral (2024) - Mistral-based Vietnamese chat model.
- Sailor (2024) - open multilingual LM family covering Vietnamese, Thai, Indonesian, and other Southeast Asian languages.
- Vietnamese Treebank - 10K sentences for the constituency parsing task.
- BKTreeBank - Vietnamese dependency treebank.
- UD_Vietnamese - Vietnamese Universal Dependency Treebank.
- VIVOS - free Vietnamese speech corpus, 15 hours of recorded speech (HCMUS AILab).
- VNTQcorpus(big).txt - 1.75M news sentences.
- ViText2SQL - Vietnamese Text-to-SQL semantic parsing dataset (EMNLP-2020 Findings).
- EVB Corpus - 20M words across 15 bilingual books, 100 parallel English-Vietnamese texts, 250 parallel law texts, 5K news articles, and 2K film subtitles.
- Russian: pymorphy2 - morphological analyzer and POS tagger for Russian.
- Asian Languages: ICU Tokenizer implementation in Elasticsearch for Thai, Lao, Chinese, Japanese, and Korean.
- Ancient Languages: CLTK - the Classical Language Toolkit, a Python library and collection of texts for NLP in ancient languages.
- Hebrew: NLPH_Resources - collection of papers, corpora, and linguistic resources for NLP in Hebrew.
Adjacent curated lists for topics out of scope here:
- awesome-llm - general-purpose large language model resources.
- awesome-generative-ai - generative AI across modalities.
- awesome-rag - retrieval-augmented generation systems and tooling.
- awesome-prompt-engineering - prompting techniques and template libraries.
- awesome-mlops - production ML, including LLM serving.
If you find this repository useful, please consider citing this list:
@misc{awesome-nlp,
title = {Awesome NLP},
author = {Kim, Keon Woo},
year = {2018},
url = {https://github.com/keon/awesome-nlp},
note = {GitHub repository}
}
License - CC0