Natural language processing

spaCy & Fairseq are interesting libraries. Natural Language Processing with Transformers Book is a nice book. Hugging Face NLP Course is probably the best NLP intro out there.

DALL·E 2 is fascinating. I'm trying to understand the DALL-E in PyTorch implementation.

Getting started with NLP for absolute beginners is a nice intro.

Notes

Figuring out correctly when/what to escalate to a human would change customer service more than anything else.
GPT-3 was created by mining a human-written internet that will never again exist, thanks to the creation of GPT-3.

Links

spaCy - Industrial-strength Natural Language Processing (NLP) with Python and Cython. (HN: spaCy 3.0 (2021))
Adding voice control to your projects
Increasing data science productivity; founders of spaCy & Prodigy
Course materials for "Natural Language" course
NLP progress - Track the progress in Natural Language Processing (NLP) and give an overview of the state-of-the-art across the most common NLP tasks and their corresponding datasets. (Web)
Natural - General natural language facilities for Node.
YSDA Natural Language Processing course (2018)
PyText - Natural language modeling framework based on PyTorch.
FlashText - Extract Keywords from sentence or Replace keywords in sentences.
BERT PyTorch implementation
LASER Language-Agnostic SEntence Representations - Library to calculate and use multilingual sentence embeddings.
StanfordNLP - Python NLP Library for Many Human Languages.
nlp-tutorial - Tutorial for those studying NLP (Natural Language Processing) using TensorFlow and PyTorch.
Better Language Models and Their Implications (2019)
gpt-2 - Code for the paper "Language Models are Unsupervised Multitask Learners".
Lingvo - Framework for building neural networks in TensorFlow, particularly sequence models.
Fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Stanford CS224N: NLP with Deep Learning (2019) - Course page. (HN)
Advanced NLP with spaCy: Free Course (Web) (HN)
Code for Stanford Natural Language Understanding course, CS224u (2019)
Awesome Reinforcement Learning for Natural Language Processing
ParlAI - Framework for training and evaluating AI models on a variety of openly available dialogue datasets.
Training language GANs from Scratch (2019)
Olivia - Your new best friend built with an artificial neural network.
Learn-Natural-Language-Processing-Curriculum
This repository recorded my NLP journey
Project Alias - Open-source parasite to train custom wake-up names for smart home devices while disturbing their built-in microphone.
Cornell Tech NLP Code
Cornell Tech NLP Publications
Thinc - spaCy's Machine Learning library for NLP in Python. (Docs)
Knowledge is embedded in language neural networks but can they reason? (2019)
NLP Best Practices
Transfer NLP library - Framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP.
FARM - Fast & easy transfer learning for NLP. Harvesting language models for the industry.
Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. (Web)
NLP Roadmap 2019
Flair - Very simple framework for state-of-the-art NLP. Developed by Zalando Research.
Unsupervised Data Augmentation - Semi-supervised learning method which achieves state-of-the-art results on a wide variety of language and vision tasks.
Rasa - Open source machine learning framework to automate text- and voice-based conversations.
T5 - Text-To-Text Transfer Transformer.
100 Must-Read NLP Papers (HN)
Awesome NLP
NLP Library - Curated collection of papers for the NLP practitioner.
spacy-transformers - spaCy pipelines for pre-trained BERT, XLNet and GPT-2.
AllenNLP - Open-source NLP research library, built on PyTorch. (Announcing AllenNLP 1.0)
GloVe - Global Vectors for Word Representation.
Botpress - Open-source Virtual Assistant platform.
Mycroft - Hackable open source voice assistant. (HN)
VizSeq - Visual Analysis Toolkit for Text Generation Tasks.
Awesome Natural Language Generation
How I used NLP (Spacy) to screen Data Science Resume (2019)
Introduction to Natural Language Processing book - Survey of computational methods for understanding, generating, and manipulating human language, which offers a synthesis of classical representations and algorithms with contemporary machine learning techniques.
Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning (Code)
Tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production. (Article)
Example Notebook using BERT for NLP with Keras (2020)
NLP 2019/2020 Highlights
Overview of Modern Deep Learning Techniques Applied to Natural Language Processing
Language Identification from Very Short Strings (2019)
SentenceRepresentation - Code accompanying the paper 'Learning Sentence Representations from Unlabelled Data' by Felix Hill, KyungHyun Cho and Anna Korhonen (2016).
Deep Learning for Language Processing course
Megatron LM - Ongoing research training transformer language models at scale, including: BERT & GPT-2. (Megatron with FastMoE) (Fork)
XLNet - New unsupervised language representation learning method based on a novel generalized permutation language modeling objective.
ALBERT - Lite BERT for Self-supervised Learning of Language Representations.
BERT - TensorFlow code and pre-trained models for BERT.
Multilingual Denoising Pre-training for Neural Machine Translation (2020)
List of NLP tutorials built on PyTorch
sticker - Sequence labeler that uses either recurrent neural networks, transformers, or dilated convolution networks.
sticker-transformers - Pretrained transformer models for sticker.
pke - Python Keyphrase Extraction module.
How to train a new language model from scratch using Transformers and Tokenizers (2020)
Interactive Attention Visualization - Small example of an interactive visualization for attention values as being used by transformer language models like GPT2 and BERT.
The Annotated GPT-2 (2020)
GluonNLP - Toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your NLP research.
Finetune - Scikit-learn style model finetuning for NLP.
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages (2020) (HN)
NLP Newsletter
NLP Paper Summaries
Advanced NLP with spaCy
Myle Ott's research
Natural Language Toolkit (NLTK) - Suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. (Web) (Book)
NLP 100 Exercise - Bootcamp designed for learning skills for programming, data analysis, and research activities. (Code)
The Transformer Family (2020)
Minimalist Implementation of a BERT Sentence Classifier
fastText - Library for efficient text classification and representation learning. (Code) (Article) (HN) (Fork)
Awesome NLP Paper Discussions - Papers & presentations from Hugging Face's weekly science day.
SynST: Syntactically Supervised Transformers
The Cost of Training NLP Models: A Concise Overview (2020)
Tutorial - Transformers (Tweet)
TTS - Deep learning for Text to Speech.
MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer (2020)
gpt-2-simple - Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts.
BERTScore - BERT score for text generation.
ML and NLP Paper Discussions
NLP Index - Collection of NLP resources.
NLP Datasets
Word Embeddings (2017)
NLP from Scratch: Annotated Attention (2020)
This Word Does Not Exist - Allows people to train a variant of GPT-2 that makes up words, definitions and examples from scratch. (Code) (HN)
Ultimate guide to choosing an online course covering practical NLP (2020)
HuggingFace nlp library - Quick overview (2020) (Twitter)
aitextgen - Robust Python tool for text-based AI training and generation using GPT-2. (HN)
Self Supervised Representation Learning in NLP (2020) (HN)
Synthetic and Natural Noise Both Break Neural Machine Translation (2017)
Inferbeddings - Injecting Background Knowledge in Neural Models via Adversarial Set Regularisation.
UCL Natural Language Processing group
Interactive Lecture Notes, Slides and Exercises for Statistical NLP
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
CMU LTI Low Resource NLP Bootcamp 2020
GPT-3: Language Models Are Few-Shot Learners (2020) (HN) (Code)
nlp - Lightweight and extensible library to easily share and access datasets and evaluation metrics for NLP.
Brainsources for NLP enthusiasts
Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper)
NLP Resources
TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables (Article) (HN)
vtext - NLP in Rust with Python bindings.
Language Technology Lab @ University of Cambridge
The Natural Language Processing Dictionary
Introduction to NLP using Fastai (2020)
Gwern on GPT-3 (HN)
Semantic Machines - Solving conversational artificial intelligence. Part of Microsoft.
The Reformer – Pushing the limits of language modeling (HN)
GPT-3 Creative Fiction (2020) (HN)
Classifying 200k articles in 7 hours using NLP (2020) (HN)
HN: Using GPT-3 to generate user interfaces (2020)
Thread of GPT-3 use cases (2020)
GPT-3 Code Experiments (Examples)
How GPT3 Works - Visualizations and Animations (2020) (Lobsters) (HN)
What is GPT-3? written in layman's terms (2020) (HN)
GPT3 Examples (HN)
DQI: Measuring Data Quality in NLP (2020)
Humanloop - Train and deploy NLP. (HN)
Do NLP Beyond English (2020) (HN)
Giving GPT-3 a Turing Test (2020) (HN)
Neural Network Methods for Natural Language Processing (2017)
Tempering Expectations for GPT-3 and OpenAI’s API (2020)
Philosophers on GPT-3 (2020) (HN)
GPT-3 Explorer - Power tool for experimenting with GPT-3. (Code)
Recent Advances in Natural Language Processing (2020) (HN)
Project Insight - NLP as a Service. (Forum post)
Bob Coecke: Quantum Natural Language Processing (QNLP) (2020) (Article)
Language-Agnostic BERT Sentence Embedding (2020)
Language Interpretability Tool (LIT) - Interactively analyze NLP models for model understanding in an extensible and framework agnostic interface.
Booste Pre Trained Models - Free-to-use GPT-2 API. (HN)
Context-theoretic Semantics for Natural Language: an Algebraic Framework (2007)
THUNLP (Natural Language Processing Lab at Tsinghua University) research
AI training method exceeds GPT-3 performance with fewer parameters (2020) (HN)
BERT Attention Analysis
Neural Modules and Models for Conversational AI (2020)
BERTopic - Topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
NLP Pandect - Comprehensive reference for all topics related to Natural Language Processing.
Practical Natural Language Processing book (Code)
NLP Research Project: Best Practices for Finetuning Large Transformer Language models (2020)
Deep Learning for NLP notes (2020)
Modern Practical Natural Language Processing course
LXMERT: Learning Cross-Modality Encoder Representations from Transformers in PyTorch
Awesome software for Text ML
Pretrained Transformers for Text Ranking: BERT and Beyond (2020)
spaCy v3.0 Nightly (2020) (HN) (Tweet)
Explore trained spaCy v3.0 pipelines
spacy-streamlit - spaCy building blocks for Streamlit apps. (Tweet)
Informers - State-of-the-art natural language processing for Ruby.
How to Structure and Manage Natural Language Processing (NLP) Projects (2020)
Sentence-BERT for spaCy - Wraps sentence-transformers (also known as sentence-BERT) directly in spaCy.
Lingua Franca - Mycroft's multilingual text parsing and formatting library.
Simple Transformers - Based on the Transformers library by HuggingFace. Lets you quickly train and evaluate Transformer models.
Deep Bidirectional Transformers for Language Understanding (2020) - Explains a legendary paper, BERT. (HN)
EasyTransfer - Designed to make the development of transfer learning in NLP applications easier.
LambdaBERT - Transformers-style implementation of BERT using LambdaNetworks instead of self-attention.
DialoGPT - State-of-the-Art Large-scale Pretrained Response Generation Model.
Neural reading comprehension and beyond - Danqi Chen's Thesis (2020) (Code)
LAMA: LAnguage Model Analysis - Probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
awesome-2vec - Curated list of 2vec-type embedding models.
Rethinking Attention with Performers (2020) (HN)
BERT Research - Key Concepts & Sources (2019)
The Pile - Large, diverse, open source language modelling data set that consists of many smaller datasets combined together.
Bort - Companion code for the paper "Optimal Subarchitecture Extraction for BERT."
Vector AI - Encode And Deploy Vectors At The Edge. (Code)
KeyBERT - Minimal keyword extraction with BERT. (Web)
Multimodal Transformer for Unaligned Multimodal Language Sequences - In PyTorch.
The Illustrated GPT-2 (Visualizing Transformer Language Models) (2020)
A Primer in BERTology: What we know about how BERT works (2020) (HN)
GPT Neo - Open-source GPT model, with pretrained 1.3B & 2.7B weight models. (HN)
TextSynth - Bellard's free GPT-NeoX-20B, GPT-J playground and paid API. (Playground) (HN)
How to Go from NLP in 1 Language to NLP in N Languages in One Shot (2020)
Contextualized Topic Models - Family of topic models that use pre-trained representations of language (e.g., BERT) to support topic modeling.
Language Style Transfer - Code for Style Transfer from Non-Parallel Text by Cross-Alignment paper.
NLU - Power of Spark NLP, the Simplicity of Python. 1 line for hundreds of NLP models and algorithms.
PyTorch Implementation of Google BERT
High Performance Natural Language Processing (2020)
duoBERT - Multi-stage passage ranking: monoBERT + duoBERT.
Awesome GPT-3
SMAC3 - Sequential Model-based Algorithm Configuration.
Semantic Experiences by Google - Experiments in understanding language.
Long-Range Arena - Systematic evaluation of efficient transformer models.
PaddleHub - Awesome pre-trained models toolkit based on PaddlePaddle.
DeepSPIN (Deep Structured Prediction in Natural Language Processing) (GitHub)
Multi-Task Learning in NLP
FastSeq - Provides efficient implementation of popular sequence models (e.g. Bart, ProphetNet) for text generation, summarization, translation tasks etc.
Sentence Embeddings with BERT & XLNet
FastFormers - Provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Understanding (NLU).
Adversarial NLI - Adversarial Natural Language Inference Benchmark.
textract - Extract text from any document. No muss. No fuss. (Docs)
NLP and Named Entity Recognition (2020)
Big Bird: Transformers for Longer Sequences
NLP PyTorch Tutorial
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
CrossWeigh: Training Named Entity Tagger from Imperfect Annotations (2019) (Code)
Does GPT-2 Know Your Phone Number? (2020)
Towards Fully Automated Manga Translation (2020)
Text Classification Models - All kinds of text classification models and more with deep learning.
Awesome Text Summarization
Shortformer: Better Language Modeling using Shorter Inputs (2020) (HN)
huggingface_hub - Client library to download and publish models and other files on the huggingface.co hub.
Embeddings from the Ground Up (2020)
Ecco - Tools to visualize and explore NLP language models. (Web) (HN)
Interfaces for Explaining Transformer Language Models (2020)
DALL·E: Creating Images from Text (2021) (HN) (Reddit)
CLIP: Connecting Text and Images (2021) (HN) (Paper) (Code)
OpenNRE - Open-Source Package for Neural Relation Extraction (NRE).
Princeton NLP Group (GitHub)
Must-read papers on neural relation extraction (NRE)
FewRel Dataset, Toolkits and Baseline Models
Tree Transformer: Integrating Tree Structures into Self-Attention (2019) (Code)
SentEval: evaluation toolkit for sentence embeddings
gpt-scrolls - Collaborative collection of open-source safe GPT-3 prompts that work well.
SLING - A natural language frame semantics parser - Built to learn to read and understand Wikipedia articles in many languages for the purpose of knowledge base completion.
Awesome Neural Adaptation in NLP
Natural language generation: The commercial state of the art in 2020 (HN)
Non-Autoregressive Generation Progress
Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
VecMap - Framework to learn cross-lingual word embedding mappings.
Kiri - Natural Language Engine. (Web)
GPT3 List - List of things that people claim are enabled by GPT3.
DeBERTa - Decoding-enhanced BERT with Disentangled Attention.
Sockeye - Open-source sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet. (Docs)
Robustness Gym - Python evaluation toolkit for natural language processing.
State-of-the-Art Conversational AI with Transfer Learning
GPT-Neo - GPT-3-sized model, open source and free. (HN) (Code)
Deep Daze - Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network).
Notebooks using the Hugging Face libraries
NLP Cloud - Serve spaCy pre-trained models, and your own custom models, through a RESTful API.
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters (2020) (Code)
jiant - Multitask and transfer learning toolkit for NLP. (Web)
Must-read Papers on Textual Adversarial Attack and Defense
Reranker - Build Text Rerankers with Deep Language Models.
rust-bert - Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...).
rust-tokenizers - Offers high-performance tokenizers for modern language models.
Replicating GPT-2 at Home (2021) (HN)
Shifterator - Interpretable data visualizations for understanding how texts differ at the word level.
CMU Neural Networks for NLP Course (2021) (Videos)
minnn - Exercise in developing a minimalist neural network toolkit for NLP.
Controllable Sentence Simplification (2019) (Code)
Awesome Relation Extraction
retext - Natural language processor powered by plugins, part of the unified collective. (Awesome)
CLIP Playground - Try OpenAI's CLIP model in your browser.
GPT-3 Demo - GPT-3 Examples, Demos, Showcase, and NLP Use-cases.
Big Sleep - Simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.
Beyond the Imitation Game Benchmark (BIG-bench) - Collaborative benchmark intended to probe large language models, and extrapolate their future capabilities.
AutoNLP - Automatic way to train, evaluate and deploy state-of-the-art NLP models for different tasks.
DeText - Deep Neural Text Understanding Framework for Ranking and Classification Tasks.
Paragraph Vectors in PyTorch
NeuSpell: A Neural Spelling Correction Toolkit
Natural Language YouTube Search - Search inside YouTube videos using natural language.
Accelerate - Simple way to train and use NLP models with multi-GPU, TPU, mixed-precision.
Classical Language Toolkit (CLTK) - Python library offering natural language processing (NLP) for pre-modern languages. (Web)
Guide: Finetune GPT2-XL
GENRE (Generative ENtity REtrieval) - Uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on fine-tuned BART architecture.
Teachable NLP - GPT-2 Training as a Service.
DensePhrases - Provides answers to your natural language questions from the entire Wikipedia in real-time.
How to use GPT-3 recursively to solve general problems (2021)
Podium - Framework agnostic Python NLP library for data loading and preprocessing.
Prompts - Advanced GPT-3 playground. (Code)
TextFlint - Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing.
Awesome Text Summarization
SimCSE: Simple Contrastive Learning of Sentence Embeddings (2021) (Code)
Berkeley Neural Parser - High-accuracy NLP parser with models for 11 languages. (Web)
nlpaug - Data augmentation for NLP.
Top2Vec - Learns jointly embedded topic, document and word vectors.
Focused Attention Improves Document-Grounded Generation (2021) (Code)
NLPretext - All the go-to functions you need to handle NLP use-cases.
spaCy + UDPipe
adapter-transformers - Friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models.
TextAttack - Generating adversarial examples for NLP models.
GPT-NeoX - Implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library.
Transfer Learning in Natural Language Processing (2019) (Code)
Cohere - Help computers understand language. (Tweet)
Transformers Interpret - Model explainability tool designed to work exclusively with the transformers package.
Whatlang - Natural language detection library for Rust. (Web)
Category Theory + NLP Papers
UniLM - Pre-trained models for natural language understanding (NLU) and generation (NLG) tasks.
AutoNLP - Faster and easier training and deployments of SOTA NLP models.
TAble PArSing (TAPAS) - End-to-end neural table-text understanding models.
Replacing Bert Self-Attention with Fourier Transform: 92% Accuracy, 7X Faster (2021)
FNet: Mixing Tokens with Fourier Transforms (2021) (Tweet)
True Few-Shot Learning with Language Models (2021) (Tweet) (Code)
End-to-end NLP workflows from prototype to production (Web)
Haystack - End-to-end Python framework for building natural language search interfaces to data. (HN)
PLMpapers - Must-read Papers on pre-trained language models.
English-to-Spanish translation with a sequence-to-sequence Transformer in Keras
Evaluation Harness for Large Language Models - Framework for few-shot evaluation of autoregressive language models.
MLP GPT - Jax - GPT, made only of MLPs, in Jax.
Few-Shot Question Answering by Pretraining Span Selection (2021) (Code)
Neural Extractive Search (2021) (Demo)
Hugging Face NLP Course (Code)
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation.
LoRA: Low-Rank Adaptation of Large Language Models (2021) (Code)
PromptPapers - Must-read papers on prompt-based tuning for pre-trained language models.
Obsei - Automation tool for text analysis needs.
Evaluating Large Language Models Trained on Code (2021) (Code)
Survey of Surveys for Natural Language Processing (SOS4NLP)
CLIP guided diffusion
Data driven literary analysis
DALL·E Mini - Generate images from a text prompt.
Jury - Evaluation for Natural Language Generation.
Rubrix - Free and open-source tool to explore, label, and monitor data for NLP projects.
Knowledge Neurons in Pretrained Transformers (2021) (Code) (Code)
OpenCLIP - Open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).
Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning (2021) (Code)
Can a Fruit Fly Learn Word Embeddings? (2021)
Spark NLP - Natural Language Processing library built on top of Apache Spark ML. (Web)
Spark NLP Workshop - Notebooks and code showing how to use Spark NLP in Python and Scala.
ConceptNet Numberbatch - Set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings.
OpenAI Codex - AI system that translates natural language to code. (HN)
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
NL-Augmenter - Collaborative Repository of Natural Language Transformations.
wevi - Word embedding visual inspector. (Code)
clip-retrieval - Easily compute CLIP embeddings and build a CLIP retrieval system with them.
NVIDIA NeMo - Toolkit for conversational AI.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
BEIR - Heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.
UER-py - Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo.
ExplainaBoard - Explainable Leaderboard for NLP.
Fast-BERT - Super easy library for BERT based NLP models.
Genie Toolkit - Generator of Natural Language Parsers for Compositional Virtual Assistants. (Paper)
Quantum Stat - Your NLP Model Training Platform.
Mistral - Framework for transparent and accessible large-scale language model training, built with Hugging Face. (Docs)
NERDA - Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks.
Data Augmentation Techniques for NLP
Feed forward VQGAN-CLIP model
Yet Another Keyword Extractor (Yake) - Unsupervised Approach for Automatic Keyword Extraction using Text Features.
Challenges in Detoxifying Language Models (2021) (Tweet)
TextBrewer - PyTorch-based model distillation toolkit for natural language processing.
GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain (2021)
PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models (2021) (Code)
VQGAN-CLIP Overview - Repo for running VQGAN+CLIP locally.
TLDR: Extreme Summarization of Scientific Documents (2020) (Code)
Can Language Models be Biomedical Knowledge Bases? (2021)
ColBERT: Contextualized Late Interaction over BERT (2020)
Investigating Pretrained Language Models for Graph-to-Text Generation (2020) (Code)
Ubiquitous Knowledge Processing Lab (GitHub)
DedupliPy - Python package for deduplication/entity resolution using active learning.
Flexible Generation of Natural Language Deductions (2021) (Code)
Machine Translation Reading List
Compressive Transformers for Long-Range Sequence Modelling (2020) (Code)
pyxclib - Tools for multi-label classification problems.
ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.
OpenPrompt - Open-Source Toolkit for Prompt-Learning.
Unsupervised Neural Machine Translation with Generative Language Models Only (2021) (Tweet)
Grounding Spatio-Temporal Language with Transformers (2021) (Code)
Fast Sentence Embeddings (fse) - Compute Sentence Embeddings Fast.
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models (2021)
Surge AI - Build powerful NLP datasets using our global labeling force and platform. (Python SDK)
Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels (Code)
ogen - OpenAPI v3 code generator for Go.
PromptSource - Toolkit for collecting and applying prompts to NLP datasets. (Web) (HN)
Creating User Interface Mock-ups from High-Level Text Descriptions with Deep-Learning Models (2021)
Filtlong - Tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset.
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System (2021) (Code)
xFormers - Hackable and optimized Transformers building blocks, supporting a composable construction.
Language Models As or For Knowledge Bases (2021)
Wikipedia2Vec - Tool for learning vector representations of words and entities from Wikipedia. (Code)
Reflections on Foundation Models (2021) (Tweet)
textacy - NLP, before and after spaCy.
Natural Language Processing Specialization Course (Tweet)
Hugging Face on Amazon SageMaker Workshop
CS224N: Natural Language Processing with Deep Learning | Winter 2021 - YouTube
GPT-3 creates geofoam, but out of text (2021)
Towards Efficient NLP: A Standard Evaluation and A Strong Baseline (2021) (Code)
Hierarchical Transformers Are More Efficient Language Models (2021) (HN) (Code)
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration (2021) (Code)
GPT-3 is no longer the only game in town (2021) (HN)
PatrickStar - Parallel Training of Large Language Models via Chunk-based Memory Management.
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) (2021)
Text2Art - AI Powered Text-to-Art Generator.
Emergent Communication of Generalizations (2021) (Code)
Awesome Pretrained Models for Information Retrieval
SummerTime - Text Summarization Toolkit for Non-experts.
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (2021) (Code)
Differentially Private Fine-tuning of Language Models (2021) (Tweet)
TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning (2021) (Code)
Aphantasia - CLIP + FFT/DWT/RGB = text to image/video.
OpenAI’s API Now Available with No Waitlist (2021) (HN)
Recent trends of Entity Linking, Disambiguation, and Representation
Intro to Large Language Models with Cohere
spacy-experimental - Cutting-edge experimental spaCy components and features.
AdaptNLP - High level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end to end tasks. (Docs)
Reading list for Awesome Sentiment Analysis papers
Aspect-Based-Sentiment-Analysis: Transformer & Explainable ML (TensorFlow)
Deploy optimized transformer based models in production
PyConverse - Conversational text Analysis using various NLP techniques.
KILT - Library for Knowledge Intensive Language Tasks.
RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) (Code)
N-grammer: Augmenting Transformers with latent n-grams (2021) (Code)
textsearch - Find strings/words in text; convenience and C speed.
Mastering spaCy Book (2021) (Code)
sense2vec - Contextually-keyed word vectors.
Pureformer: Do We Even Need Attention? (2021)
Knover - Toolkit for knowledge grounded dialogue generation based on PaddlePaddle.
Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval | DeepMind (2021) (HN)
CMU Advanced NLP 2021 - YouTube
whatlies - Toolkit to help understand "what lies" in word embeddings. Also benchmarking.
CLIP-Guided-Diffusion
Factual Probing Is [MASK]: Learning vs. Learning to Recall (2021) (Code)
Improving Compositional Generalization with Latent Structure and Data Augmentation (2021)
PORORO - Platform Of neuRal mOdels for natuRal language prOcessing.
PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2021) (Code)
To Understand Language Is to Understand Generalization (2021) (HN)
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020) (Code)
Multimodal Transformers | Transformers with Tabular Data (Article)
Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering (2021) (Code)
Improving Language Models by Retrieving from Trillions of Tokens (2021)
Open Information Extraction (OIE) Resources
Deeper Text Understanding for IR with Contextual Neural Language Modeling (2019) (Code)
x-clip - Concise but complete implementation of CLIP with various experimental improvements from recent papers.
Calamity - Self-hosted GPT playground.
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (2021) (Code)
Transactions of the Association for Computational Linguistics (2021) (Code)
DocEE - Toolkit for document-level event extraction, containing some SOTA model implementations.
Autoregressive Entity Retrieval (2020)
Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation (2020)
A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition (2021)
Deduplicating Training Data Makes Language Models Better (2021) (Code)
Transformers without Tears: Improving the Normalization of Self-Attention (2019) (Code)
CTCDecoder - Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
Custom Named Entity Recognition with Spacy3
BARTScore: Evaluating Generated Text as Text Generation (2021) (Code)
minDALL-E on Conceptual Captions - PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs.
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation (2021) (Code)
Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) (Code)
spaCy models - Models for the spaCy Natural Language Processing (NLP) library.
Awesome Huggingface
SyntaxDot - Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
STriP Net - Semantic Similarity of Scientific Papers (S3P) Network.
Small-Text - Active Learning for Text Classification in Python.
Plug and Play Language Models: A Simple Approach to Controlled Text Generation (2020) (Code)
RuDOLPH - One Hyper-Modal Transformer that can be as creative as DALL-E and as smart as CLIP.
PLM papers - Paper list of pre-trained language models (PLMs).
Megatron-LM - Ongoing research training transformer language models at scale, including BERT & GPT-2.
Improving language models by retrieving from trillions of tokens (2022) (Code)
EntitySeg Toolbox - Towards precise and open-world image segmentation.
Aligning Language Models to Follow Instructions (2022) (Tweet) (Code)
Simple Questions Generate Named Entity Recognition Datasets (2021) (Code)
KRED: Knowledge-Aware Document Representation for News Recommendations (2019) (Code)
Stanford Open Information Extraction
Python3 wrapper for Stanford OpenIE
I-BERT: Integer-only BERT Quantization (2021) (Code)
spaCy-wrap - Wrapping fine-tuned transformers in spaCy pipelines.
DeepMatcher - Python package for performing Entity and Text Matching using Deep Learning.
Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond (2020) (Code)
medspacy - Library for clinical NLP with spaCy.
Natural Language Processing with Transformers Book (Code)
blurr - Library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.
HanLP - Multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x.
Awesome Text-to-Image
NLP News Newsletter
Named Entity Recognition as Dependency Parsing (2020) (Code)
Multilingual-CLIP - OpenAI CLIP text encoders for any language.
FasterTransformer - Transformer related optimization, including BERT, GPT.
Papers about Causal Inference and Language
EET (Easy and Efficient Transformer) - Efficient PyTorch inference plugin focused on Transformer-based models with large model sizes and long sequences.
Measuring Massive Multitask Language Understanding (2021) (Code)
A Theoretical Analysis of the Repetition Problem in Text Generation (2021) (Code)
TransformerSum - Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
Natural Language Processing with Transformers Book
Transformer Memory as a Differentiable Search Index (2022) (HN) (Tweet)
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (2020) (Code)
spaCy + Stanza - Use the latest Stanza (StanfordNLP) research models directly in spaCy.
Awesome Document Understanding
Sequential Transformer - Code for training Transformers on sequential tasks such as language modeling.
bert-as-service - Mapping a variable-length sentence to a fixed-length vector using BERT model.
A Contrastive Framework for Neural Text Generation (2022) (Code)
Parallax - Tool for interactive embeddings visualization.
Serve PyTorch model as an API using AWS + serverless framework
Neural reality of argument structure constructions (2022)
DeepNet: Scaling Transformers to 1,000 Layers (2022) (HN)
Large Models of Source Code - Guide to using pre-trained large language models of source code.
HyperMixer: An MLP-based Green AI Alternative to Transformers (2022)
NLP Course Material & QA
Survey of Surveys (NLP & ML) - Collection of 700+ survey papers on Natural Language Processing (NLP) and Machine Learning (ML).
Awesome CLIP - Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
MAGMA - GPT-style multimodal model that can understand any combination of images and language.
Timexy - spaCy custom component that extracts and normalizes temporal expressions.
New Capabilities for GPT-3: Edit and Insert (2022) (HN)
Which hardware to train a 176B parameters model? (2022) (Tweet)
Fundamentals of NLP - Series of hands-on notebooks for learning the fundamentals of NLP.
BertViz - Visualize Attention in Transformer Models (BERT, GPT2, BART, etc.).
Attention Is All You Need (2017) (Code) (PyTorch Code)
Word2Vec Explained. Explaining the Intuition of Word2Vec (2021) (HN)
imgbeddings - Python package to generate image embeddings with CLIP without PyTorch/TensorFlow.
Linking Emergent and Natural Languages via Corpus Transfer (2022)
Transformer Inference Arithmetic (2022)
Training Compute-Optimal Large Language Models (2022) (Tweet)
KeyphraseVectorizers - Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix.
Gramformer - Framework for detecting, highlighting and correcting grammatical errors on natural language text.
Classy Classification - Easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
Sphere - Web-scale retrieval for knowledge-intensive NLP.
muTransformers - Common Huggingface transformers in maximal update parametrization (µP).
Event Extraction papers - List of NLP resources focused on event extraction task.
Summarization Papers
GLID-3 - Combination of OpenAI GLIDE, Latent Diffusion and CLIP.
Optimum Transformers - Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.
Pathways Language Model (PaLM): Scaling to 540B parameters (2022) (HN) (Code) (Code)
A Divide-and-Conquer Approach to the Summarization of Long Documents (2020) (Code)
Resources for learning about Text Mining and Natural Language Processing
LinkBERT: Pretraining Language Models with Document Links (2022) (Code)
DALL·E 2 (2022) (HN) (Tweet) (Tweet) (Code) (Code) (Code) (Tweet) (Tweet) (HN) (Video Summary) (HN) (Tweet)
Variations of the Similarity Function of TextRank for Automated Summarization (2016) (Code)
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (2020) (Code)
Awesome Knowledge Distillation
You Only Look at One Sequence (2021)
Towards Understanding and Mitigating Social Biases in Language Models (2021) (Code)
DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization (2021) (Code)
Humanloop Programmatic - Create large high-quality datasets for NLP in minutes. No hand labelling required. (HN)
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language (2022)
Second order effects of the rise of large language models (2022)
Simple Annotated implementation of GPT-NeoX in PyTorch
BLEURT: Learning Robust Metrics for Text Generation (2020) (Code)
Bootleg - Self-supervised named entity disambiguation (NED) system that links mentions in text to entities in a knowledge base. (Code)
DALL-E in Mesh-TensorFlow
A few things to try with DALL·E (2022) (HN)
Google's 540B PaLM Language Model & OpenAI's DALL-E 2 Text-to-Image Revolution (2022)
Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution (2021) (Code)
Simple and Effective Multi-Paragraph Reading Comprehension (2017) (Code)
Researchers Glimpse How AI Gets So Good at Language Processing (2022)
Cornell Conversational Analysis Toolkit (ConvoKit) - Toolkit for extracting conversational features and analyzing social phenomena in conversations.
UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models (2022) (Code)
exBERT - Visual Analysis Tool to Explore Learned Representations in Transformers Models.
How DALL-E 2 Works (2022) (HN)
Getting started with NLP for absolute beginners (2022)
EasyNLP - Comprehensive and Easy-to-use NLP Toolkit.
Reframing Human-AI Collaboration for Generating Free-Text Explanations (2021) (Tweet)
Detoxify - Comment Classification with PyTorch Lightning and Transformers.
DLATK - End-to-end human text analysis package, specifically suited for social media and social scientific applications.
Language modeling via stochastic processes (2022) (Code)
An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling (2022) (Code)
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (2021) (Code)
DataLab - Unified platform that allows NLP researchers to perform a number of data-related tasks in an efficient and easy-to-use manner.
Limitations of DALL-E (HN)
AutoPrompt - Automatic Prompt Construction for Masked Language Models.
DALL·E Flow - Human-in-the-Loop workflow for creating HD images from text.
Recon NER - Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data.
CausalNLP - Practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.
OPT: Open Pre-trained Transformer Language Models (2022) - Meta's 175B parameter language model. (Reddit) (Tweet)
Bert Extractive Summarizer - Easy to use extractive text summarization with BERT.
Dialogue Response Ranking Training with Large-Scale Human Feedback Data (2020) (Code)
LM-Debugger - Interactive tool for inspection and intervention in transformer-based language models.
100 Pages of raw notes released with the language model OPT-175 (HN)
Unsupervised Cross-Task Generalization via Retrieval Augmentation (2022) (Code)
On Continual Model Refinement in Out-of-Distribution Data Streams (2022)
GLID-3-XL - 1.4B latent diffusion model from CompVis back-ported to the guided diffusion codebase.
Neutralizing Subjectivity Bias with HuggingFace Transformers (2022)
Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists (2022) (Code) (Tweet)
gse - Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese and others.
BERTopic: The Future of Topic Modeling (2022) (HN)
Unifying Language Learning Paradigms (2022) (Code)
GLM: General Language Model Pretraining with Autoregressive Blank Infilling (2021) (Code)
GPT-3 limitations (2022)
Natural Language Processing Demystified
Concise Concepts - Contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
Dynamic language understanding: adaptation to new knowledge in parametric and semi-parametric models (2022) (Tweet)
nlprule - Fast, low-resource Natural Language Processing and Text Correction library written in Rust.
Quark: Controllable Text Generation with Reinforced Unlearning (2022) (Tweet)
DALL-E 2 has a secret language (HN) (Tweet) (HN)
AdaTest - Find and fix bugs in natural language machine learning models using adaptive testing.
Diffusion-LM Improves Controllable Text Generation (2022) (Code) (Tweet)
RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering (2021) (Code)
Neural Prompt Search - Searching prompt modules for parameter-efficient transfer learning.
makemore - Most accessible way of tinkering with a GPT - one hackable script.
DALL-E Playground - Playground for DALL-E enthusiasts to tinker with the open-source version of OpenAI's DALL-E, based on DALL-E Mini.
Craiyon - AI model drawing images from any prompt. Formerly DALL-E mini.
Contrastive Learning for Natural Language Processing
MSCTD: A Multimodal Sentiment Chat Translation Dataset (Code)
Auto-Lambda: Disentangling Dynamic Task Relationships (2022) (Code)
Concepts in Neural Networks for NLP
DinkyTrain - Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration.
Pretrained Language Models
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing (2020) (Code)
YaLM 100B - GPT-like neural network for generating and processing text by Yandex. (HN)
Pathways Autoregressive Text-to-Image model (Parti) - Autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. (Web) (HN)
How Imagen Actually Works (2022)
First impressions of DALL-E, generating images from text (2022) (Lobsters)
Meta is inviting researchers to pick apart the flaws in its version of GPT-3 (2022) (HN)
'Making Moves' In DALL·E mini (2022)
min(DALL·E) - Minimal implementation of DALL·E Mini. It has been stripped to the bare essentials necessary for doing inference, and converted to PyTorch.
Awesome Document Similarity Measures
RETRO Is Blazingly Fast (2022)
LightOn - Unlock Extreme-Scale Machine Intelligence. Most repos are focused on the use of photonic hardware. (GitHub)
Minerva: Solving Quantitative Reasoning Problems with Language Models (2022) (Paper)
winkNLP - Developer friendly Natural Language Processing. (Docs)
Facebook Low Resource (FLoRes) MT Benchmark
Using GPT-3 to explain how code works (2022) (Lobsters) (HN)
Awesome Topic Models
Introducing The World’s Largest Open Multilingual Language Model: BLOOM
The DALL·E 2 Prompt Book (HN) (Tweet)
RWKV - RNN with Transformer-level performance, which can also be directly trained like a GPT transformer (parallelizable).
Kern AI - Open-source IDE for data-centric NLP. Combining programmatic labeling, extensive data management and neural search capabilities. (Code) (HN)
spaCy fishing - spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata.
DALL·E Now Available in Beta (2022) (HN)
Inside language models (from GPT-3 to PaLM)
Timeline of AI and language models
Cascades - Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference, and more.
Awesome Neural Symbolic
Towards Knowledge-Based Recommender Dialog System (2019) (Code)
Asent - Rule-based sentiment analysis library for Python made using spaCy.
extractacy - Pattern extraction and named entity linking for spaCy.
A Hazard Analysis Framework for Code Synthesis Large Language Models (2022)
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) (Code)
A Frustratingly Easy Approach for Entity and Relation Extraction (2021) (Code)
Chinchilla's Wild Implications (2022) (HN)
DALL·E 2 prompt book (2022) (HN)
GLM-130B - Open Bilingual Pre-Trained Model.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (2022) (Code)
DALL-E + GPT-3 = ♥ (2022) (HN)
Run your own DALL-E-like image generator (2022) (HN)
Stable Diffusion launch announcement (2022) (HN)
Stable Diffusion
MidJourney Styles and Keywords Reference
Spent $15 in DALL·E 2 credits creating this AI image (2022) (HN)
Phraser - Better way to generate prompts.
Seminar on Large Language Models (2022)
