awesome-nlp


A curated list of resources dedicated to Natural Language Processing

Please read the contribution guidelines before contributing, and add your favourite NLP resource by raising a pull request.

Scope

This list covers natural language processing — linguistic analysis, multilingual tooling, classical and neural methods, datasets, and evaluation. Large language models are included only where they advance or evaluate a core NLP task or capability (tokenization, multilinguality, MT, summarization, NER, QA, factuality, probing, distillation). General-purpose chatbots, agent frameworks, prompt-template repositories, code-generation tools, and RAG application starter kits live in other lists — see See Also.

Contents

Research Summaries and Trends

Where to follow current NLP research:

Historical highlights

Prominent NLP Research Labs

Back to Top

Tutorials

Back to Top

Reading Content

General Machine Learning

Introductions and Guides to NLP

Blogs and Newsletters

Videos and Online Courses

Back to Top

Books

Libraries

Back to Top

  • Node.js and JavaScript - Node.js Libraries for NLP | Back to Top

    • Twitter-text - A JavaScript implementation of Twitter's text processing library
    • Knwl.js - A Natural Language Processor in JS
    • Retext - Extensible system for analyzing and manipulating natural language
    • NLP Compromise - Natural Language processing in the browser
    • Natural - General natural language facilities for Node.js
    • Poplar - A web-based annotation tool for natural language processing (NLP)
    • NLP.js - An NLP library for building bots
    • node-question-answering - Fast and production-ready question answering w/ DistilBERT in Node.js
  • Python - Python NLP Libraries | Back to Top

    • sentimental-onix - Sentiment models for spaCy using ONNX
    • TextAttack - Adversarial attacks, adversarial training, and data augmentation in NLP
    • TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of Natural Language Toolkit (NLTK) and Pattern, and plays nicely with both 👍
    • spaCy - Industrial strength NLP with Python and Cython 👍
      • textacy - Higher level NLP built on spaCy
    • gensim - Python library to conduct unsupervised semantic modelling from plain text 👍
    • scattertext - Python library to produce d3 visualizations of how language differs between corpora
    • GluonNLP (archived) - A deep learning toolkit for NLP, built on MXNet/Gluon.
    • AllenNLP (archived) - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.
    • PyTorch-NLP - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, common NLP metrics such as BLEU
    • Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
    • PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python, handles some specific formats like ARPA language models, Moses phrasetables, GIZA++ alignments.
    • foliapy - Python library for working with FoLiA, an XML format for linguistic annotation.
    • PySS3 - Python package implementing the SS3 white-box text classifier; ships with interactive visualization tools that explain predictions.
    • jPTDP - A toolkit for joint part-of-speech (POS) tagging and dependency parsing. jPTDP provides pre-trained models for 40+ languages.
    • BigARTM - a fast library for topic modelling
    • Snips NLU - A production ready library for intent parsing
    • Chazutsu - A library for downloading & parsing standard NLP research datasets
    • Word Forms - Word forms can accurately generate all possible forms of an English word
    • Multilingual Latent Dirichlet Allocation (LDA) - A multilingual and extensible document clustering pipeline
    • Natural Language Toolkit (NLTK) - A library containing a wide variety of NLP functionality, supporting over 50 corpora.
    • NLP Architect - A library for exploring the state-of-the-art deep learning topologies and techniques for NLP and NLU
    • Flair - A very simple framework for state-of-the-art multilingual NLP built on PyTorch. Includes BERT, ELMo and Flair embeddings.
    • Kashgari - Simple, Keras-powered multilingual NLP framework, allows you to build your models in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS) and text classification tasks. Includes BERT and word2vec embedding.
    • FARM - Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
    • Haystack - End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and state-of-the-art NLP. Supports DPR, Elasticsearch, HuggingFace’s Model Hub, and much more!
    • Rita DSL - a DSL loosely based on RUTA on Apache UIMA. Allows you to define language patterns (rule-based NLP) that are then translated into spaCy or, if you prefer fewer features and a lightweight option, regex patterns.
    • Transformers - Natural Language Processing for TensorFlow 2.0 and PyTorch.
    • Tokenizers - Tokenizers optimized for Research and Production.
    • fairseq - Facebook AI Research implementations of SOTA seq2seq models in PyTorch.
    • corex_topic - Hierarchical Topic Modeling with Minimal Domain Knowledge
    • Sockeye - Neural Machine Translation (NMT) toolkit that powers Amazon Translate.
    • DL Translate - A deep learning-based translation library for 50 languages, built on transformers and Facebook's mBART Large.
    • Jury - Evaluation of NLP model outputs offering various automated metrics.
    • python-ucto - Unicode-aware regular-expression based tokenizer for various languages. Python binding to C++ library, supports FoLiA format.
    • Pearmut - Human annotation tool for multilingual NLP tasks, such as machine translation.
    • Stanza - Stanford NLP's Python toolkit for tokenization, POS, lemma, dependency parsing, and NER across 70+ languages.
    • Sentence-Transformers - sentence/document embeddings, semantic search, and re-ranking; current standard for retrieval-style NLP.
    • Argilla - open-source data annotation and feedback collection platform for LLM and NLP datasets.
    • HuggingFace Datasets - standardized loaders and processing for thousands of NLP datasets.
    • HuggingFace Evaluate - reference implementations for NLP metrics.
    • sacrebleu - reproducible BLEU/chrF/TER scoring for machine translation.
    • COMET - learned MT metrics, current de-facto standard.
    • LangTest - 60+ test types for NLP model robustness, bias, and fairness.
  • C++ - C++ Libraries | Back to Top

    • InsNet - A neural network library for building instance-dependent NLP models with padding-free dynamic batching.
    • MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction
    • CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
    • CRFsuite - CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data.
    • BLLIP Parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
    • colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
    • ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
    • libfolia - C++ library for the FoLiA format
    • frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
    • MeTA - ModErn Text Analysis: a C++ data sciences toolkit for mining big text data.
    • Mecab (Japanese)
    • Moses
    • StarSpace - a library from Facebook for creating embeddings of word-level, paragraph-level, document-level and for text classification
    • QSMM - adaptive probabilistic top-down and bottom-up parsers
  • Java - Java NLP Libraries | Back to Top

    • Stanford NLP
    • OpenNLP
    • NLP4J
    • Word2vec in Java
    • ReVerb Web-Scale Open Information Extraction
    • OpenRegex - An efficient and flexible token-based regular expression language and engine.
    • CogcompNLP - Core libraries developed in the U of Illinois' Cognitive Computation Group.
    • MALLET - MAchine Learning for LanguagE Toolkit - package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
    • RDRPOSTagger - A robust POS tagging toolkit available (in both Java & Python) together with pre-trained models for 40+ languages.
  • Kotlin - Kotlin NLP Libraries | Back to Top

    • Lingua - A language detection library for Kotlin and Java, suitable for long and short text alike
    • Kotidgy - an index-based text data generator written in Kotlin
  • Scala - Scala NLP Libraries | Back to Top

    • Saul - Library for developing NLP systems, including built in modules like SRL, POS, etc.
    • ATR4S - Toolkit with state-of-the-art automatic term recognition methods.
    • tm - Implementation of topic modeling based on regularized multilingual PLSA.
    • word2vec-scala - Scala interface to word2vec model; includes operations on vectors like word-distance and word-analogy.
    • Epic - Epic is a high performance statistical parser written in Scala, along with a framework for building complex structured prediction models.
    • Spark NLP - Spark NLP is a natural language processing library built on top of Apache Spark ML that provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
  • R - R NLP Libraries | Back to Top

    • text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
    • wordVectors - An R package for creating and exploring word2vec and other word embedding models
    • RMallet - R package to interface with the Java machine learning tool MALLET
    • dfr-browser - Creates d3 visualizations for browsing topic models of text in a web browser.
    • dfrtopics - R package for exploring topic models of text.
    • sentiment_classifier - Sentiment Classification using Word Sense Disambiguation and WordNet Reader
    • jProcessing - Japanese Natural Language Processing Libraries, with Japanese sentiment classification
    • corporaexplorer - An R package for dynamic exploration of text collections
    • tidytext - Text mining using tidy tools
    • spacyr - R wrapper to spaCy NLP
    • CRAN Task View: Natural Language Processing
  • Clojure | Back to Top

    • Clojure-openNLP - Natural Language Processing in Clojure (opennlp)
    • Inflections-clj - Rails-like inflection library for Clojure and ClojureScript
    • postagga - A library to parse natural language in Clojure and ClojureScript
  • Ruby | Back to Top

  • Rust | Back to Top

    • whatlang - Natural language recognition library based on trigrams
    • rust-bert - Ready-to-use NLP pipelines and Transformer-based models
    • snips-nlu-rs (archived — Snips was discontinued) - A production ready library for intent parsing
  • NLP++ - NLP++ Language | Back to Top

  • Julia | Back to Top

    • CorpusLoaders - A variety of loaders for various NLP corpora
    • Languages - A package for working with human languages
    • TextAnalysis - Julia package for text analysis
    • TextModels - Neural Network based models for Natural Language Processing
    • WordTokenizers - High performance tokenizers for natural language processing and other related tasks
    • Word2Vec - Julia interface to word2vec

Services

NLP as API with higher level functionality such as NER, Topic tagging and so on | Back to Top

  • Wit-ai - Natural Language Interface for apps and devices
  • IBM Watson's Natural Language Understanding - API and Github demo
  • Amazon Comprehend - NLP and ML suite covers most common tasks like NER, tagging, and sentiment analysis
  • Google Cloud Natural Language API - Syntax Analysis, NER, Sentiment Analysis, and Content tagging in at least 9 languages including English and Chinese (Simplified and Traditional).
  • ParallelDots - High level Text Analysis API Service ranging from Sentiment Analysis to Intent Analysis
  • Microsoft Cognitive Service
  • TextRazor
  • Rosette
  • Textalytic - Natural Language Processing in the Browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more
  • NLP Cloud - SpaCy NLP models (custom and pre-trained ones) served through a RESTful API for named entity recognition (NER), POS tagging, and more.
  • Cloudmersive - Unified and free NLP APIs that perform actions such as speech tagging, text rephrasing, language translation/detection, and sentence parsing

Annotation Tools

  • GATE - General Architecture for Text Engineering is 15+ years old, free and open source
  • Anafora is a free and open-source, web-based raw text annotation tool
  • brat - brat rapid annotation tool is an online environment for collaborative text annotation
  • doccano - doccano is free, open-source, and provides annotation features for text classification, sequence labeling and sequence to sequence
  • INCEpTION - A semantic annotation platform offering intelligent assistance and knowledge management
  • prodigy is an annotation tool powered by active learning, costs $
  • LightTag - Hosted and managed text annotation tool for teams, costs $
  • rstWeb - open source local or online tool for discourse tree annotations
  • GitDox - open source server annotation tool with GitHub version control and validation for XML data and collaborative spreadsheet grids
  • Datasaur supports various NLP tasks for individuals or teams, freemium based
  • Konfuzio - team-first hosted and on-prem text, image and PDF annotation tool powered by active learning, freemium based, costs $
  • UBIAI - Easy-to-use text annotation tool for teams with most comprehensive auto-annotation features. Supports NER, relations and document classification as well as OCR annotation for invoice labeling, costs $
  • Shoonya - Shoonya is a free and open-source data annotation platform with a wide variety of organization- and workspace-level management features. Shoonya is data agnostic and can be used by teams to annotate data at scale with various levels of verification stages.
  • Annotation Lab - Free End-to-End No-Code platform for text annotation and DL model training/tuning. Out-of-the-box support for Named Entity Recognition, Classification, Relation extraction and Assertion Status Spark NLP models. Unlimited support for users, teams, projects, documents. Not FOSS.
  • FLAT - FLAT is a web-based linguistic annotation environment based around the FoLiA format, a rich XML-based format for linguistic annotation. Free and open source.
  • Argilla - open-source platform for collecting human feedback, building NLP and LLM datasets, and curating preference data.
  • Label Studio - open-core multi-modal labeling platform; widely used for NLP labeling.

Tasks and Methods

NLP tasks organized by linguistic problem. Each subsection lists foundational/classical work first, then neural approaches, then LLM-based methods where relevant. For modern LM-specific research (pretraining, evaluation, retrieval, reasoning, etc.) see Language Models for NLP.

Text Embeddings

Back to Top

Static word embeddings (foundational):

Contextual embeddings:

  • ELMo - deep contextualized word representations.
  • CoVe - contextualized vectors learned from MT.
  • ULMFiT - language-model fine-tuning for text classification.
  • InferSent - sentence representations from NLI.

Modern sentence and document embeddings: see Retrieval for NLP (Sentence-Transformers, E5, BGE-M3, Nomic, GritLM) and MTEB for current leaderboards.

Tokenization, Morphology, and Segmentation

Back to Top

  • SentencePiece - language-agnostic subword tokenization.

  • BPE and Unigram LM - the two dominant subword schemes.

  • Stanza - tokenization, lemma, and morphology for 70+ languages.

  • UDPipe - tokenization, tagging, lemmatization, parsing for Universal Dependencies.

  • Morfessor - unsupervised morphological segmentation.

Tokenizer research and architecture (also see Language Models):

  • Byte-Pair Encoding (Sennrich et al.) - subword units for neural MT; foundation of modern tokenizers.

  • SentencePiece - language-agnostic subword tokenization (BPE and Unigram).

  • Tokenizers - fast Rust implementations of BPE, WordPiece, Unigram.

  • ByT5 - tokenizer-free byte-level model.

  • CANINE - tokenization-free encoder operating on Unicode characters.

  • How Good is Your Tokenizer? - tokenizer fairness across languages.

  • Byte Latent Transformer (BLT) (Meta, 2024) - dynamic byte-level patching that matches BPE-tokenized models at scale; revives the tokenizer-free direction.

  • SuperBPE (2025) - superword tokenization that improves on BPE for downstream tasks.

  • Over-Tokenized Transformer (ICML 2025) - decouples input and output vocabularies; shows a log-linear relationship between input vocabulary size and training loss, scaling vocabulary independently of model size.

  • Foundations of Tokenization (ICLR 2025) - first formal unified framework for tokenizer models using stochastic-map category theory; establishes conditions for statistical consistency.

  • The Token Tax: Systematic Bias in Multilingual Tokenization (2025) - quantifies how tokenization fertility predicts model accuracy across languages, exposing structural cost penalties for morphologically complex and low-resource languages.

  • Reducing Tokenization Premiums for Low-Resource Languages (2026) - post-hoc vocabulary additions that coalesce multi-token character sequences for low-resource languages, reducing inference cost without retraining.
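
Several of the entries above build on byte-pair encoding. The core merge loop is simple enough to sketch in plain Python — a toy illustration of the algorithm from Sennrich et al., not the optimized implementations in SentencePiece or Tokenizers:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a {word: count} dict (toy sketch)."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the vocabulary with the winning pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
```

After two merges on this toy corpus the learner fuses "l"+"o" and then "lo"+"w", so "low" becomes a single token — the same frequency-driven behavior that gives real tokenizers their subword vocabularies.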

POS Tagging and Dependency Parsing

Back to Top

Named Entity Recognition and Information Extraction

Back to Top

Foundational and neural:

Open and instruction-following IE:

  • Universal NER - instruction-tuned LM for open-set NER across languages.
  • GLiNER (2023) - small, generalist NER model that handles arbitrary entity types at inference.
  • GoLLIE - guideline-following information extraction with LMs.
  • REBEL - end-to-end relation extraction as seq2seq.

LLM-based:

Coreference Resolution

Back to Top

Text Classification and Sentiment Analysis

Back to Top

Topic Modeling

Back to Top

Summarization

Back to Top

Machine Translation

Back to Top

Statistical and foundational neural:

Massively multilingual:

Evaluation:

  • COMET - learned MT metric; current de-facto standard alongside chrF.
  • sacrebleu - reproducible BLEU/chrF/TER scoring.
  • BERTScore - similarity-based generation metric.
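
The string-based metrics above share one building block: clipped n-gram precision. A toy sketch of that core (omitting BLEU's brevity penalty, smoothing, and corpus-level aggregation — use sacrebleu for real scoring):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the building block of BLEU (toy sketch)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    # Clip each candidate n-gram's count by its count in the reference,
    # so repeating a reference word cannot inflate the score.
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = max(sum(cand_ngrams.values()), 1)
    return overlap / total

p1 = ngram_precision("the cat sat on the mat", "the cat is on the mat", 1)
```

Here five of the six candidate unigrams appear in the reference ("sat" does not), giving a unigram precision of 5/6; BLEU combines such precisions over n = 1..4.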

LLM-based:

Question Answering and Reading Comprehension

Back to Top

Datasets and foundational systems:

Modern open-domain QA:

  • DPR and FiD - retrieve-then-read; the standard pre-LLM open-domain QA pipeline.
  • Atlas - retrieval-augmented LM for few-shot QA.
  • See also Retrieval for NLP.

LLM-era:

Information Extraction Beyond NER

Back to Top

Retrieval and Embeddings

Back to Top

Dense and late-interaction retrieval, increasingly the substrate for QA and IR:

  • DPR (Dense Passage Retrieval) - dual-encoder retrieval baseline.

  • ColBERT and ColBERTv2 - late-interaction retrieval; strong on out-of-domain.

  • E5 and E5-Mistral - widely used dense embedding families.

  • BGE and BGE-M3 (2024) - multilingual, multi-functionality embeddings; top of MTEB across languages.

  • Nomic Embed (2024) - fully open, reproducible embedding model.

  • Matryoshka Representation Learning - nested embeddings supporting variable dimensionality at inference.

  • GritLM (2024) - unified generation and embedding from one model.

  • RAG (Retrieval-Augmented Generation) - the original retrieval-augmented framework; foundation for modern QA pipelines.

  • Gemini Embedding (2025) - Gemini-derived dense embeddings; SOTA on MMTEB across 250+ languages and on cross-lingual retrieval (XOR-Retrieve, XTREME-UP).

  • Qwen3-Embedding (2025) - decoder-based embedding series (0.6B-8B) built on Qwen3; #1 on MTEB Multilingual and MTEB Code, surpassing prior proprietary models.

  • Rank1 (2025) - first reranking model trained with test-time compute via DeepSeek-R1 reasoning-trace distillation; SOTA on instruction-following and OOD retrieval.

  • ReasonEmbed (2025) - embedding model for reasoning-intensive retrieval with ReMixer data synthesis and Redapter adaptive training; record nDCG@10 of 38.1 on BRIGHT.

  • ColBERT-Att (2026) - extends late-interaction retrieval by integrating query and document attention weights into ColBERT scoring; improves recall on MS-MARCO, BEIR, and LoTTE.

Embedding and retrieval benchmarks:

  • MMTEB (2025) - community expansion of MTEB to 500+ tasks across 250+ languages.
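
At inference time, the dual-encoder pattern shared by DPR, E5, BGE, and the other families above reduces to nearest-neighbor search over normalized vectors. A minimal sketch with hand-written toy vectors — a real system would get embeddings from a trained encoder and search them with an ANN index such as FAISS:

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def retrieve(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query (dual-encoder style)."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        # Dot product of unit vectors == cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy 3-d "embeddings"; doc and query encoders are assumed, not shown.
docs = {"d1": [1.0, 0.0, 0.0], "d2": [0.7, 0.7, 0.0], "d3": [0.0, 0.0, 1.0]}
top = retrieve([1.0, 0.1, 0.0], docs, k=2)
```

Late-interaction models like ColBERT replace the single dot product with a per-token max-similarity sum, trading index size for out-of-domain robustness.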

Speech and Text

Back to Top

A short pointer set, since this borders adjacent fields:

Datasets

Back to Top

Dataset hubs and lists:

Pretraining-scale corpora (open):

  • The Pile - 825 GiB diverse text corpus.
  • RedPajama / RedPajama-V2 (2023-2024) - reproductions of LLaMA pretraining data; V2 is 30T tokens with quality signals.
  • Dolma (AI2, 2023-2024) - 3T-token open pretraining corpus with documented filtering pipeline.
  • FineWeb / FineWeb-Edu (2024) - 15T-token cleaned web corpus; FineWeb-Edu filters for educational quality.
  • CulturaX - 6.3T tokens across 167 languages.
  • Common Corpus (2024) - 2T-token open-license multilingual corpus.

Task and instruction datasets:

Multilingual NLP Frameworks

Back to Top

  • UDPipe is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files. Primarily written in C++, it offers a fast and reliable solution for multilingual NLP processing.
  • NLP-Cube - Natural Language Processing pipeline: sentence splitting, tokenization, lemmatization, part-of-speech tagging and dependency parsing. A new platform, written in Python with DyNet 2.0. Offers standalone (CLI/Python bindings) and server functionality (REST API).
  • UralicNLP is an NLP library mostly for endangered Uralic languages such as the Sami, Mordvin, Mari and Komi languages. Some non-endangered languages are also supported, such as Finnish, along with non-Uralic languages such as Swedish and Arabic. UralicNLP can do morphological analysis, generation, lemmatization and disambiguation.

Language Models for NLP

Back to Top

Pretrained language models and the research around them, scoped to NLP tasks and linguistic phenomena. For general-purpose LLM tooling, agents, or RAG application kits, see See Also.

Pretraining and Adaptation

Encoders (still the workhorse for classical NLP tasks):

  • BERT - bidirectional transformer pretraining; foundation for most encoder-based NLP work since 2018.
  • RoBERTa - robustly optimized BERT pretraining; common encoder baseline.
  • DeBERTa / DeBERTa-v3 - disentangled attention; strong on classification, NER, NLI.
  • ELECTRA - replaced-token-detection pretraining, sample-efficient.
  • ModernBERT (2024) - modernized encoder with rotary embeddings, FlashAttention, 8K context; current go-to encoder for classification, NER, retrieval.
  • NeoBERT (2025) - 250M-parameter encoder integrating modern architecture improvements (RoPE, 4K context, optimized depth-to-width); state of the art on MTEB, surpasses ModernBERT and RoBERTa-large under identical fine-tuning.

Encoder-decoder and seq2seq:

  • T5 and FLAN-T5 - text-to-text framing for NLP tasks; strong instruction-tuned encoder-decoder baselines.
  • BART - denoising seq2seq pretraining; widely used for summarization and generation.

Open decoder-only LMs (used as substrate for NLP tasks):

Multilingual and Cross-Lingual Models

  • XLM-R - cross-lingual masked LM trained on CommonCrawl, 100 languages.
  • mT5 - multilingual T5 covering 101 languages.
  • BLOOM - 176B-parameter open multilingual LM, 46 natural languages.
  • Aya 23 / Aya Expanse (Cohere For AI, 2024) - massively multilingual instruction-tuned models covering 23-101 languages.
  • Glot500 - encoder for 500+ languages, focus on low-resource.
  • NLLB-200 - No Language Left Behind: MT for 200 languages.
  • MADLAD-400 - 400+ language MT model and 3T-token multilingual corpus.
  • SeamlessM4T / Seamless (Meta, 2023-2024) - multilingual and multimodal speech-text translation, 100+ languages.
  • SEA-LION / SeaLLM (2024-2025) - LMs targeting Southeast Asian languages.
  • Babel (2025) - open multilingual LLMs (9B and 83B) covering the top 25 languages by speaker population (~90% of global speakers); surpasses comparably-sized open multilingual models on XCOPA, XNLI, MGSM, FLORES-200.
  • Lugha-Llama (Princeton/Mila, 2025) - Llama-3.1-8B adapted for low-resource African languages via the curated WURA corpus; SOTA open-source results on IrokoBench and AfriQA.
  • AfriqueLLM (McGill, 2026) - suite of open LLMs (4B-14B) continued-pretrained on 26B tokens across 20 African languages with a comprehensive empirical study of data mixing.
  • TranslateGemma (Google, 2026) - open translation-specialized models built on Gemma 3, covering 55 language pairs via SFT and RL with quality-reward models.
  • MiLMMT-46 (Xiaomi, 2026) - open multilingual MT scaled across 46 languages, matching commercial systems like Google Translate and Gemini 3 Pro.

Evaluation and Benchmarks

NLU and cross-lingual:

  • GLUE and SuperGLUE - English NLU benchmarks.
  • XTREME and XGLUE - cross-lingual NLU.
  • XNLI - cross-lingual natural language inference, 15 languages.
  • FLORES-200 - MT evaluation across 200 languages.
  • MTEB - Massive Text Embedding Benchmark; standard for sentence/document encoders.
  • BEIR - heterogeneous IR benchmark for retrieval models.

Modern LM evaluation (2023-2026):

  • HELM - holistic evaluation across NLP tasks, accuracy and beyond.
  • BIG-bench - 200+ tasks probing language model capabilities.
  • MMLU - multitask knowledge evaluation across 57 subjects.
  • MMLU-Pro (2024) - harder, more discriminative successor to MMLU.
  • GPQA - graduate-level Q&A, "Google-proof" reasoning evaluation.
  • IFEval - verifiable instruction-following evaluation.
  • Chatbot Arena (LMSYS) - human-preference ELO leaderboard for chat models.
  • LiveBench (2024) - contamination-resistant benchmark with monthly refresh.
  • LM Evaluation Harness - unified framework for LM benchmark evaluation.
  • MMLU-ProX (2025) - multilingual extension of MMLU-Pro to 29 typologically diverse languages; reveals up to 24.3% performance gap between high- and low-resource languages.
  • MultiChallenge (2025) - multi-turn conversational benchmark exposing simultaneous instruction-following and in-context-reasoning failures; all tested frontier models score below 50%.
  • FRAMES (2025) - unified RAG evaluation: 824 multi-hop questions requiring factuality, retrieval accuracy, and cross-document reasoning together.

Long-context evaluation:

  • Needle in a Haystack - retrieval probe for long-context windows.
  • RULER (2024) - synthetic long-context tasks beyond simple retrieval.
  • LongBench - bilingual long-context benchmark across NLP tasks.
  • LongBench v2 (2025) - 503 expert-crafted multiple-choice questions spanning 8K-2M-word contexts with deep multi-hop reasoning; humans score 53.7% under time pressure.
  • U-NIAH (2025) - extends needle-in-haystack with multi-needle and nested configurations; shows RAG mitigates lost-in-the-middle for smaller LLMs but degrades reasoning models.

Reasoning and Test-Time Compute

A trend-defining direction in 2024-2026: models that produce explicit reasoning traces and benefit from extra inference compute.

  • Chain-of-Thought Prompting - foundational result; intermediate reasoning steps improve performance.
  • Self-Consistency - majority vote over sampled CoT chains.
  • Tree of Thoughts - search over reasoning trees.
  • Self-Refine and Reflexion - self-correction at inference time.
  • Large Language Models are Zero-Shot Reasoners - chain-of-thought for NLP reasoning tasks.
  • Let's Verify Step by Step - process-supervised reward models for reasoning.
  • DeepSeek-R1 (2025) - open reasoning model trained with pure RL; replicated o1-style behavior in the open.
  • OpenAI o1 / o3 (2024-2025) - test-time-compute reasoning systems.
  • Scaling LLM Test-Time Compute Optimally (2024) - systematic study of inference-time compute tradeoffs.
  • s1: Simple Test-Time Scaling (2025) - small open reasoning recipe via budget-forcing.
  • Kimi k1.5 (2025) - long-context RL with policy optimization (no MCTS, no PRM) reaching o1-level performance; introduces long-CoT distillation into short-CoT models.
  • rStar-Math (2025) - small policy model paired with a process preference model trained via MCTS rollouts; enables small LMs to bootstrap reasoning without distilling from larger models.
  • DAPO (2025) - open GRPO-based RL training system with four key improvements (decoupled clipping, dynamic sampling, token-level loss, entropy bonus); reproduces and surpasses DeepSeek-R1-Zero-level reasoning.
  • VAPO (2025) - value-model-based RL with length-adaptive GAE and token-level clipping; surpasses value-free GRPO methods on AIME 2024 with stable training.
  • ThinkPRM (2025) - generative process reward models that produce chain-of-thought verification per step, matching discriminative PRMs with 1% of the supervision labels.
  • OpenThoughts (2025) - 1000+ controlled experiments on data recipes for open reasoning models; SOTA on AIME 2025 matching closed distillation baselines.
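
Of the methods above, self-consistency is simple enough to sketch end to end: sample several reasoning chains, extract each final answer, and take the majority vote. The chains below are stubbed strings standing in for LM outputs; a real setup would sample the model at temperature > 0:

```python
from collections import Counter

def self_consistency(sampled_chains, extract_answer):
    """Majority vote over final answers from sampled chain-of-thought outputs."""
    answers = [extract_answer(chain) for chain in sampled_chains]
    # most_common(1) returns [(answer, count)]; keep just the answer.
    return Counter(answers).most_common(1)[0][0]

# Stubbed samples standing in for three sampled reasoning chains.
chains = [
    "... so the total is 42.",
    "... therefore the answer is 41.",
    "... which gives 42.",
]
# Hypothetical extractor: take the last whitespace-delimited token.
answer = self_consistency(chains, lambda c: c.rstrip(".").split()[-1])
```

Two of the three sampled chains agree on "42", so the vote returns it; the whole benefit comes from marginalizing out individual faulty reasoning paths.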

Long Context and Alternative Architectures

  • Mamba and Mamba-2 - selective state-space models, linear-time long-context alternative to attention.
  • RWKV - RNN-transformer hybrid scaling to large parameter counts.
  • Jamba (2024) - hybrid Mamba-Transformer-MoE architecture.
  • RoPE and YaRN - rotary position embeddings and context-length extension.
  • Position Interpolation - extending context windows with minimal fine-tuning.
  • Lost in the Middle - long-context degradation patterns in NLP tasks.
  • RAG vs Long-Context LLMs (2024) - tradeoffs for QA over long inputs.
  • Titans: Learning to Memorize at Test Time (2025) - neural long-term memory module that learns to memorize historical context at test time; scales beyond 2M tokens, outperforms transformers and modern linear-recurrent models on language modeling and reasoning.
  • MiniMax-01 (2025) - 456B-parameter hybrid combining lightning (linear) attention with sparse softmax attention; matches GPT-4o-level NLP performance at up to 4M-token inference contexts.
  • Native Sparse Attention (NSA) (2025) - trainable sparse attention combining coarse-grained compression with fine-grained selection; large speedups at 64K with no NLP-benchmark degradation.
  • LongRoPE2 (2025) - identifies undertraining of high-frequency RoPE dimensions and applies evolutionary-search rescaling; extends LLaMA3-8B to 128K with 80x fewer training tokens than Meta's recipe.
  • Characterizing SSM and Hybrid LM Long-Context Performance (2025) - first comprehensive memory and speed analysis of transformer, SSM, and hybrid models up to 220K tokens; SSMs are up to 4x faster, hybrids balance recall and efficiency.

Factuality, Hallucination, Calibration

  • Survey of Hallucination in Natural Language Generation - taxonomy and mitigation strategies.
  • TruthfulQA - benchmark for truthfulness in question answering.
  • FActScore - fine-grained factual precision in long-form generation.
  • LongFact / SAFE (2024) - long-form factuality benchmark and search-augmented evaluator.
  • SelfCheckGPT - sampling-based hallucination detection.
  • RAGAS - reference-free evaluation for RAG and QA pipelines.
  • Lookback Lens (2024) - attention-pattern-based hallucination detection in long-context generation.
  • Calibration of LLMs on Multiple Choice (2024) - calibration analysis under format effects.
  • HalluLens (2025) - hallucination benchmark with extrinsic/intrinsic taxonomy and dynamic test-set regeneration to resist data leakage.
  • Atomic Calibration (2025) - claim-level calibration analysis for long-form generation; models are substantially worse-calibrated on extended outputs than on single claims.
  • FRANQ (2025) - faithfulness-aware uncertainty quantification for RAG fact-checking; formally separates faithfulness from factuality.
  • MUCH (2025) - multilingual claim-hallucination benchmark across English, French, Spanish, German with token-level logits released for principled UQ evaluation.
  • HalluHard (2026) - hard multi-turn hallucination benchmark for citation-required responses; ~30% hallucination rates persist even with web search.
  • CURE: Think Through Uncertainty (2026) - trains models to reason about claim-level uncertainty before generating; large gains on biography factuality and FactBench AUROC.
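SelfCheckGPT-style detection above rests on a simple intuition: facts the model actually knows tend to recur across independently sampled responses, while hallucinations do not. A toy sketch of that consistency check, using unigram overlap as the support signal (the paper's variants use stronger scorers such as BERTScore, NLI, or QA; the example texts here are illustrative):

```python
def consistency_score(claim: str, samples: list[str]) -> float:
    """Average lexical support for a claim across sampled responses.

    Low scores flag likely hallucinations. Real SelfCheckGPT variants
    replace this unigram overlap with BERTScore, NLI, or QA checks.
    """
    claim_tokens = set(claim.lower().split())

    def support(sample: str) -> float:
        toks = set(sample.lower().replace(".", "").split())
        return len(claim_tokens & toks) / len(claim_tokens)

    return sum(support(s) for s in samples) / len(samples)

samples = [
    "Marie Curie won two Nobel Prizes in physics and chemistry.",
    "Curie received Nobel Prizes in both physics and chemistry.",
    "Marie Curie was awarded two Nobel Prizes.",
]
well_supported = consistency_score("curie won two nobel prizes", samples)
unsupported = consistency_score("curie won an oscar in 1911", samples)
assert well_supported > unsupported
```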

Probing and Interpretability

Efficient and Small Language Models

Distillation and small models:

  • DistilBERT and MiniLM - distilled encoders for production NLP.
  • Phi-3 / Phi-4 (Microsoft, 2024) - small models trained on curated data, competitive with much larger ones on NLP benchmarks.
  • SmolLM2 (HuggingFace, 2025) - fully open small-LM family with reproducible training data.
  • SmolLM3 (HuggingFace, 2025) - 3B fully open decoder pretrained on 11.2T tokens with NoPE and YaRN for 128K context; competitive with 4B-class models.
  • Gemma 3 Technical Report (Google, 2025) - 1B-27B open models with high local-to-global attention ratio to keep KV-cache tractable at 128K context.
  • Qwen3 Technical Report (Alibaba, 2025) - dense and MoE models 0.6B-235B with unified thinking/non-thinking modes; the 30B-A3B MoE matches larger dense models while activating only 3B parameters.
  • Apple Intelligence Foundation Language Models (Apple, 2025) - on-device 3B model using KV-cache sharing and 2-bit QAT for 37.5% cache memory reduction without accuracy loss.
  • Sentence-Transformers - sentence and paragraph embeddings via Siamese BERT.
  • SetFit - few-shot text classification without prompts.
  • FastFit - fast few-shot classification for many-class settings.
  • GTE, BGE, and Stella - compact text embedding models near the top of MTEB.
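DistilBERT and MiniLM above both rest on knowledge distillation: a small student is trained to match a large teacher's temperature-softened output distribution. A minimal sketch of the KL term (the full recipes add hard-label and hidden-state losses; the logits here are made up for illustration):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_kl(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
good_student = [2.9, 1.1, 0.3]   # close to the teacher's distribution
bad_student = [0.2, 1.0, 3.0]    # disagrees with the teacher
assert distill_kl(teacher, good_student) < distill_kl(teacher, bad_student)
```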

Quantization and serving (relevant when deploying NLP models at scale):

  • GPTQ - post-training quantization for transformers.
  • AWQ - activation-aware weight quantization.
  • KVTuner (ICML 2025) - sensitivity-aware layer-wise mixed-precision KV-cache quantization; up to 21% throughput improvement over uniform KV8.
  • GGUF / llama.cpp - portable quantized inference.
  • vLLM - PagedAttention-based high-throughput LM serving.
  • SGLang - structured generation and efficient serving.
  • Text Generation Inference (TGI) - HF production serving for LMs.
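GPTQ, AWQ, and the GGUF quant types above all refine the same basic round-trip: map float weights onto a small integer grid plus a scale. A minimal symmetric per-tensor int8 sketch in pure Python (real methods quantize per-group, calibrate on activations, and solve for error-minimizing rounding; the weights here are arbitrary):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * v for v in q]

w = [0.12, -0.53, 0.91, -0.07, 0.44]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Without clipping, the absolute error is bounded by half a quantization step.
assert max(abs(a - b) for a, b in zip(w, w_hat)) <= scale / 2 + 1e-12
```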

Parameter-efficient fine-tuning:

  • LoRA and QLoRA - low-rank adapters and quantized fine-tuning; the standard for adapting LMs to NLP tasks on modest hardware.
  • DoRA (2024) - weight-decomposed low-rank adaptation.
  • PEFT - HuggingFace library bundling LoRA, prefix tuning, IA3, and others.
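The LoRA idea in the list above is compact enough to state directly: freeze W and learn a low-rank update ΔW = (α/r)·BA, so only r·(d_in + d_out) parameters train instead of d_in·d_out. A minimal pure-Python forward pass (illustrative only; real adapters sit inside the attention and MLP projections of a transformer):

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = Wx + (alpha/r) * B(Ax); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # rank-r bottleneck
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# 3x3 frozen weight, rank-2 adapter: A is 2x3, B is 3x2.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # B initialized to zero
x = [1.0, 2.0, 3.0]

# Zero-init of B means the adapter starts as a no-op, as in the paper.
assert lora_forward(W, A, B, x) == matvec(W, x)
```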

Instruction Tuning and Preference Optimization

  • FLAN - finetuned language models as zero-shot learners.
  • InstructGPT - training LMs to follow instructions with human feedback.
  • Self-Instruct - bootstrapping instruction data from LMs.
  • Super-NaturalInstructions - 1600+ NLP tasks with instructions.
  • Constitutional AI - training LMs with AI-generated feedback against a written constitution.
  • Direct Preference Optimization - simpler alternative to RLHF; widely adopted.
  • Tülu 3 (AI2, 2024) - fully open post-training recipe with state-of-the-art results among open models.
  • LIMA - "less is more for alignment"; small high-quality SFT data goes a long way.
  • TRL - reference library for SFT, DPO, GRPO, and RLHF.
  • Magpie (2024-2025) - synthesizes high-quality instruction-response pairs by prompting aligned LMs with only their chat template (no seed instructions); SFT on the filtered subset matches official Llama-3-Instruct.
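Part of the appeal of Direct Preference Optimization above is that the whole objective fits in one line: increase the policy's log-probability margin on the chosen response over the rejected one, relative to a frozen reference model. A sketch of the per-pair loss, assuming summed response log-probabilities are already computed (the numbers below are made up for illustration):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO per-pair loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more strongly than the reference does:
low = dpo_loss(-10.0, -14.0, -12.0, -13.0)
# Policy prefers the rejected response: loss is higher.
high = dpo_loss(-14.0, -10.0, -12.0, -13.0)
assert low < high
```

`beta` controls how far the policy may drift from the reference; at a zero margin the loss is exactly log 2, so training only pays off when the margin grows.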

Bias, Fairness, Safety in NLP

  • StereoSet - measuring stereotypical bias in pretrained LMs.
  • CrowS-Pairs - social bias measurement in masked LMs.
  • WinoBias - gender bias in coreference resolution.
  • HolisticBias - bias measurement across many demographic axes.
  • RealToxicityPrompts - toxicity in LM generation.
  • Sycophancy in Language Models - models tailoring answers to user beliefs.
  • Alignment Faking in Large Language Models (Anthropic, 2024) - models strategically complying during training.
  • WildGuard (2024) - open safety moderation model and benchmark.
  • Emergent Misalignment (2025) - finetuning on a narrow task (insecure code) unexpectedly produces broad alignment failures across unrelated domains.
  • SafeDialBench (2025) - multilingual (Chinese/English) safety benchmark of 4000+ multi-turn dialogues across 22 scenarios and 7 jailbreak strategies.
  • TeleAI-Safety (2025) - modular jailbreak evaluation framework integrating 19 attacks, 29 defenses, and 19 evaluation methods across 14 models and 12 risk categories.
  • IndicSafe (2026) - multilingual safety benchmark across 12 Indic languages; reveals 12.8% cross-language agreement, with over-refusal in low-resource scripts.
  • VLAF: Value-Conflict Alignment Faking (2026) - alignment faking occurs in models as small as 7B in 37% of cases when policy conflicts with internalized values; steering-vector mitigation reduces it 94%.

NLP per Language

Back to Top

Resources organized by human language.

NLP in Arabic

Back to Top

Libraries

  • CAMeL Tools - Python toolkit for Arabic NLP including dialect ID, morphology, NER.
  • goarabic - Go package for Arabic text processing.
  • jsastem - JavaScript Arabic stemmer.
  • PyArabic - Python library for Arabic.
  • RFTokenizer - trainable segmenter for Arabic, Hebrew, and Coptic.
  • Farasa - QCRI segmentation, POS tagging, and NER for Arabic.

Models and Embeddings

  • AraBERT - Arabic BERT family.
  • CAMeLBERT - BERT models for MSA, dialectal, and Classical Arabic.
  • AraELECTRA - efficient Arabic pretraining (released alongside AraBERT).
  • Jais (2023-2024) - bilingual Arabic-English open LM family.
  • ALLaM (SDAIA, 2024) - Arabic-first foundation models.

Datasets

NLP in Chinese

Back to Top

Libraries

  • jieba - Python package for Chinese word segmentation.
  • SnowNLP - Python package for Chinese NLP.
  • FudanNLP - Java library for Chinese text processing.
  • HanLP - multilingual NLP library with strong Chinese support.
  • LTP - HIT Language Technology Platform: segmentation, POS, NER, parsing.

Models and Embeddings

Anthology

  • funNLP - large collection of Chinese NLP tools and resources.

NLP in Danish

Back to Top

NLP in Dutch

Back to Top

  • python-frog - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatization, dependency parsing, NER).
  • SimpleNLG_NL - Dutch surface realiser for natural language generation, based on the SimpleNLG implementation.
  • Alpino - dependency parser for Dutch (also does POS tagging and lemmatization).
  • Kaldi NL - Dutch speech-recognition models based on Kaldi.
  • spaCy Dutch model - industrial-strength NLP with a Dutch pipeline.

NLP in German

Back to Top

  • German-NLP - curated list of open-access, open-source, and off-the-shelf resources and tools developed with a focus on German.

NLP in Hungarian

Back to Top

NLP in Indic Languages

Back to Top

Data, Corpora and Treebanks

Access to corpora/datasets that require a login can be requested via email.

Libraries and Tooling

Models and Embeddings

  • IndicBERT v2 (2022-2024) - multilingual BERT for 23 Indic languages.
  • IndicTrans2 (2023-2024) - high-quality MT for 22 Indic languages.
  • OpenHathi (Sarvam AI, 2023) - bilingual Hindi-English LLaMA continuation.
  • Airavata (2024) - instruction-tuned Hindi LLM.
  • Sarvam-1 (2024) - multilingual LM trained from scratch on 10 Indic languages.
  • BharatGPT / Krutrim (2024) - Indic-focused foundation models.

NLP in Indonesian

Back to Top

Libraries and Embeddings

Models

  • IndoBERT (IndoNLU) - pretrained Indonesian LM with the IndoNLU benchmark suite.
  • IndoBERT (IndoLEM) - alternative IndoBERT with the IndoLEM benchmark.
  • NusaCrowd / Cendol (2023-2024) - large-scale community datasets and Cendol instruction-tuned LMs for Indonesian and regional languages.
  • Sailor - open Southeast-Asian LMs covering Indonesian.
  • SEA-LION (2024) - Singapore AI's open Southeast-Asian LM with strong Indonesian.

Datasets

NLP in Korean

Back to Top

Libraries

  • KoNLPy - Python package for Korean natural language processing.
  • Mecab (Korean) - C++ library for Korean NLP.
  • KoalaNLP - Scala library for Korean NLP.
  • KoNLP - R package for Korean NLP.
  • kss - Korean sentence splitter.
  • Kiwi - fast Korean morphological analyzer.

Models and Embeddings

Blogs and Tutorials

Datasets

NLP in Persian

Back to Top

Libraries

  • Hazm - Persian NLP toolkit.
  • Parsivar - Persian language processing toolkit.
  • Perke - Persian keyphrase extraction.
  • Perstem - Persian stemmer, morphological analyzer, and partial POS tagger.
  • ParsiAnalyzer - Persian analyzer for Elasticsearch.
  • virastar - Persian text cleaning.

Models

  • ParsBERT - Persian BERT.
  • PersianMind (2023-2024) - Persian instruction-tuned LM.
  • Dorna (Part AI, 2024) - Llama-3-based Persian instruction model.

Datasets

NLP in Polish

Back to Top

  • Polish-NLP - curated list of resources dedicated to Polish NLP: models, tools, and datasets.

NLP in Portuguese

Back to Top

  • Portuguese-nlp - curated list of Portuguese NLP resources and tools.

Models

  • BERTimbau - BERT for Brazilian Portuguese.
  • Sabiá (Maritaca AI, 2023-2024) - Portuguese-focused open LMs.
  • Albertina (PORTULAN, 2023-2024) - encoder-only Portuguese LMs for both PT-PT and PT-BR.

NLP in Spanish

Back to Top

Libraries

  • spanlp - Python library to detect, censor, and clean profanity, hate speech, and bullying in Spanish, with data from 21 Spanish-speaking countries.

Data

Models and Embeddings

NLP in Thai

Back to Top

Libraries

  • PyThaiNLP - Thai NLP in Python.
  • JTCC - character cluster library in Java.
  • CutKum - word segmentation with deep learning in TensorFlow.
  • Thai Language Toolkit - tokenization and POS tagging.
  • SynThai - word segmentation and POS tagging using deep learning.

Models

  • WangchanBERTa - pretrained Thai language model.
  • Typhoon (SCB 10X, 2024) - open Thai LLM family.
  • OpenThaiGPT (2023-2024) - open Thai instruction-tuned models.
  • Sailor - open Southeast-Asian LM family covering Thai.

Data

  • Inter-BEST - text corpus with 5M words and word segmentation.
  • Prime Minister 29 - dataset of speeches by Thailand's 29th Prime Minister, Prayut Chan-o-cha.

NLP in Ukrainian

Back to Top

NLP in Urdu

Back to Top

Libraries

Datasets

NLP in Vietnamese

Back to Top

Libraries

  • underthesea - Vietnamese NLP toolkit.
  • vn.vitk - Vietnamese text processing toolkit.
  • VnCoreNLP - Vietnamese NLP toolkit.
  • pyvi - Python Vietnamese core NLP toolkit.
  • VieNeu-TTS - on-device Vietnamese text-to-speech with voice cloning.

Models and Embeddings

  • PhoBERT - pretrained LM for Vietnamese.
  • BARTpho - sequence-to-sequence pretrained model for Vietnamese.
  • PhoGPT (VinAI, 2023-2024) - open generative LM for Vietnamese.
  • Vistral (2024) - Mistral-based Vietnamese chat model.
  • Sailor (2024) - open multilingual LM family covering Vietnamese, Thai, Indonesian, and other Southeast Asian languages.

Data

  • Vietnamese Treebank - 10K sentences for the constituency parsing task.
  • BKTreeBank - Vietnamese dependency treebank.
  • UD_Vietnamese - Vietnamese Universal Dependency Treebank.
  • VIVOS - free Vietnamese speech corpus, 15 hours of recorded speech (HCMUS AILab).
  • VNTQcorpus(big).txt - 1.75M news sentences.
  • ViText2SQL - Vietnamese Text-to-SQL semantic parsing dataset (EMNLP-2020 Findings).
  • EVB Corpus - 20M words across 15 bilingual books, 100 parallel English-Vietnamese texts, 250 parallel law texts, 5K news articles, and 2K film subtitles.

Other Languages

  • Russian: pymorphy2 - morphological analyzer and POS tagger for Russian.
  • Asian Languages: ICU Tokenizer implementation in Elasticsearch for Thai, Lao, Chinese, Japanese, and Korean.
  • Ancient Languages: CLTK - the Classical Language Toolkit, a Python library and collection of texts for NLP in ancient languages.
  • Hebrew: NLPH_Resources - a collection of papers, corpora, and linguistic resources for NLP in Hebrew.

Back to Top

See Also

Adjacent curated lists for topics out of scope here:

Citation

If you find this repository useful, please consider citing this list:

@misc{awesome-nlp,
  title  = {Awesome NLP},
  author = {Kim, Keon Woo},
  year   = {2018},
  url    = {https://github.com/keon/awesome-nlp},
  note   = {GitHub repository}
}

License

License - CC0
