Natural language processing
spaCy & Fairseq are interesting libraries. Natural Language Processing with Transformers Book is a nice book. Hugging Face NLP Course is probably the best NLP intro out there.
DALL·E 2 is fascinating. Trying to understand it through the DALL-E in PyTorch implementation.
Getting started with NLP for absolute beginners is a nice intro.
Notes
- Figuring out correctly when/what to escalate to a human would change customer service more than anything else.
- GPT-3 was created by mining a human-written internet that will never again exist thanks to the creation of GPT-3.
Links
- spaCy - Industrial-strength Natural Language Processing (NLP) with Python and Cython. (HN: spaCy 3.0 (2021))
- Adding voice control to your projects
- Increasing data science productivity; founders of spaCy & Prodigy
- Course materials for "Natural Language" course
- NLP progress - Track the progress in Natural Language Processing (NLP) and give an overview of the state-of-the-art across the most common NLP tasks and their corresponding datasets. (Web)
- Natural - General natural language facilities for Node.
- YSDA Natural Language Processing course (2018)
- PyText - Natural language modeling framework based on PyTorch.
- FlashText - Extract keywords from sentences or replace keywords in sentences.
- BERT PyTorch implementation
- LASER Language-Agnostic SEntence Representations - Library to calculate and use multilingual sentence embeddings.
- StanfordNLP - Python NLP Library for Many Human Languages.
- nlp-tutorial - Tutorial for those studying NLP (Natural Language Processing) using TensorFlow and PyTorch.
- Better Language Models and Their Implications (2019)
- gpt-2 - Code for the paper "Language Models are Unsupervised Multitask Learners".
- Lingvo - Framework for building neural networks in TensorFlow, particularly sequence models.
- Fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
- Stanford CS224N: NLP with Deep Learning (2019) - Course page. (HN)
- Advanced NLP with spaCy: Free Course (Web) (HN)
- Code for Stanford Natural Language Understanding course, CS224u (2019)
- Awesome Reinforcement Learning for Natural Language Processing
- ParlAI - Framework for training and evaluating AI models on a variety of openly available dialogue datasets.
- Training language GANs from Scratch (2019)
- Olivia - Your new best friend built with an artificial neural network.
- Learn-Natural-Language-Processing-Curriculum
- This repository recorded my NLP journey
- Project Alias - Open-source parasite to train custom wake-up names for smart home devices while disturbing their built-in microphone.
- Cornell Tech NLP Code
- Cornell Tech NLP Publications
- Thinc - spaCy's Machine Learning library for NLP in Python. (Docs)
- Knowledge is embedded in language neural networks but can they reason? (2019)
- NLP Best Practices
- Transfer NLP library - Framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP.
- FARM - Fast & easy transfer learning for NLP. Harvesting language models for the industry.
- Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. (Web)
- NLP Roadmap 2019
- Flair - Very simple framework for state-of-the-art NLP. Developed by Zalando Research.
- Unsupervised Data Augmentation - Semi-supervised learning method which achieves state-of-the-art results on a wide variety of language and vision tasks.
- Rasa - Open source machine learning framework to automate text- and voice-based conversations.
- T5 - Text-To-Text Transfer Transformer.
- 100 Must-Read NLP Papers (HN)
- Awesome NLP
- NLP Library - Curated collection of papers for the NLP practitioner.
- spacy-transformers - spaCy pipelines for pre-trained BERT, XLNet and GPT-2.
- AllenNLP - Open-source NLP research library, built on PyTorch. (Announcing AllenNLP 1.0)
- GloVe - Global Vectors for Word Representation.
- Botpress - Open-source Virtual Assistant platform.
- Mycroft - Hackable open source voice assistant. (HN)
- VizSeq - Visual Analysis Toolkit for Text Generation Tasks.
- Awesome Natural Language Generation
- How I used NLP (Spacy) to screen Data Science Resume (2019)
- Introduction to Natural Language Processing book - Survey of computational methods for understanding, generating, and manipulating human language, which offers a synthesis of classical representations and algorithms with contemporary machine learning techniques.
- Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning (Code)
- Tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production. (Article)
- Example Notebook using BERT for NLP with Keras (2020)
- NLP 2019/2020 Highlights
- Overview of Modern Deep Learning Techniques Applied to Natural Language Processing
- Language Identification from Very Short Strings (2019)
- SentenceRepresentation - Code accompanying the paper 'Learning Sentence Representations from Unlabelled Data' by Felix Hill, KyungHyun Cho and Anna Korhonen (2016).
- Deep Learning for Language Processing course
- Megatron LM - Ongoing research training transformer language models at scale, including: BERT & GPT-2. (Megatron with FastMoE) (Fork)
- XLNet - New unsupervised language representation learning method based on a novel generalized permutation language modeling objective.
- ALBERT - Lite BERT for Self-supervised Learning of Language Representations.
- BERT - TensorFlow code and pre-trained models for BERT.
- Multilingual Denoising Pre-training for Neural Machine Translation (2020)
- List of NLP tutorials built on PyTorch
- sticker - Sequence labeler that uses either recurrent neural networks, transformers, or dilated convolution networks.
- sticker-transformers - Pretrained transformer models for sticker.
- pke - Python Keyphrase Extraction module.
- How to train a new language model from scratch using Transformers and Tokenizers (2020)
- Interactive Attention Visualization - Small example of an interactive visualization for attention values as being used by transformer language models like GPT2 and BERT.
- The Annotated GPT-2 (2020)
- GluonNLP - Toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your NLP research.
- Finetune - Scikit-learn style model finetuning for NLP.
- Stanza: A Python Natural Language Processing Toolkit for Many Human Languages (2020) (HN)
- NLP Newsletter
- NLP Paper Summaries
- Advanced NLP with spaCy
- Myle Ott's research
- Natural Language Toolkit (NLTK) - Suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. (Web) (Book)
- NLP 100 Exercise - Bootcamp designed for learning skills for programming, data analysis, and research activities. (Code)
- The Transformer Family (2020)
- Minimalist Implementation of a BERT Sentence Classifier
- fastText - Library for efficient text classification and representation learning. (Code) (Article) (HN) (Fork)
- Awesome NLP Paper Discussions - Papers & presentations from Hugging Face's weekly science day.
- SynST: Syntactically Supervised Transformers
- The Cost of Training NLP Models: A Concise Overview (2020)
- Tutorial - Transformers (Tweet)
- TTS - Deep learning for Text to Speech.
- MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer (2020)
- gpt-2-simple - Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts.
- BERTScore - BERT score for text generation.
- ML and NLP Paper Discussions
- NLP Index - Collection of NLP resources.
- NLP Datasets
- Word Embeddings (2017)
- NLP from Scratch: Annotated Attention (2020)
- This Word Does Not Exist - Allows people to train a variant of GPT-2 that makes up words, definitions and examples from scratch. (Code) (HN)
- Ultimate guide to choosing an online course covering practical NLP (2020)
- HuggingFace nlp library - Quick overview (2020) (Twitter)
- aitextgen - Robust Python tool for text-based AI training and generation using GPT-2. (HN)
- Self Supervised Representation Learning in NLP (2020) (HN)
- Synthetic and Natural Noise Both Break Neural Machine Translation (2017)
- Inferbeddings - Injecting Background Knowledge in Neural Models via Adversarial Set Regularisation.
- UCL Natural Language Processing group
- Interactive Lecture Notes, Slides and Exercises for Statistical NLP
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList
- CMU LTI Low Resource NLP Bootcamp 2020
- GPT-3: Language Models Are Few-Shot Learners (2020) (HN) (Code)
- nlp - Lightweight and extensible library to easily share and access datasets and evaluation metrics for NLP.
- Brainsources for NLP enthusiasts
- Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper)
- NLP Resources
- TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables (Article) (HN)
- vtext - NLP in Rust with Python bindings.
- Language Technology Lab @ University of Cambridge
- The Natural Language Processing Dictionary
- Introduction to NLP using Fastai (2020)
- Gwern on GPT-3 (HN)
- Semantic Machines - Solving conversational artificial intelligence. Part of Microsoft.
- The Reformer – Pushing the limits of language modeling (HN)
- GPT-3 Creative Fiction (2020) (HN)
- Classifying 200k articles in 7 hours using NLP (2020) (HN)
- HN: Using GPT-3 to generate user interfaces (2020)
- Thread of GPT-3 use cases (2020)
- GPT-3 Code Experiments (Examples)
- How GPT3 Works - Visualizations and Animations (2020) (Lobsters) (HN)
- What is GPT-3? written in layman's terms (2020) (HN)
- GPT3 Examples (HN)
- DQI: Measuring Data Quality in NLP (2020)
- Humanloop - Train and deploy NLP. (HN)
- Do NLP Beyond English (2020) (HN)
- Giving GPT-3 a Turing Test (2020) (HN)
- Neural Network Methods for Natural Language Processing (2017)
- Tempering Expectations for GPT-3 and OpenAI’s API (2020)
- Philosophers on GPT-3 (2020) (HN)
- GPT-3 Explorer - Power tool for experimenting with GPT-3. (Code)
- Recent Advances in Natural Language Processing (2020) (HN)
- Project Insight - NLP as a Service. (Forum post)
- Bob Coecke: Quantum Natural Language Processing (QNLP) (2020) (Article)
- Language-Agnostic BERT Sentence Embedding (2020)
- Language Interpretability Tool (LIT) - Interactively analyze NLP models for model understanding in an extensible and framework agnostic interface.
- Booste Pre Trained Models - Free-to-use GPT-2 API. (HN)
- Context-theoretic Semantics for Natural Language: an Algebraic Framework (2007)
- THUNLP (Natural Language Processing Lab at Tsinghua University) research
- AI training method exceeds GPT-3 performance with fewer parameters (2020) (HN)
- BERT Attention Analysis
- Neural Modules and Models for Conversational AI (2020)
- BERTopic - Topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
- NLP Pandect - Comprehensive reference for all topics related to Natural Language Processing.
- Practical Natural Language Processing book (Code)
- NLP Research Project: Best Practices for Finetuning Large Transformer Language models (2020)
- Deep Learning for NLP notes (2020)
- Modern Practical Natural Language Processing course
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers in PyTorch
- Awesome software for Text ML
- Pretrained Transformers for Text Ranking: BERT and Beyond (2020)
- spaCy v3.0 Nightly (2020) (HN) (Tweet)
- Explore trained spaCy v3.0 pipelines
- spacy-streamlit - spaCy building blocks for Streamlit apps. (Tweet)
- Informers - State-of-the-art natural language processing for Ruby.
- How to Structure and Manage Natural Language Processing (NLP) Projects (2020)
- Sentence-BERT for spaCy - Wraps sentence-transformers (also known as sentence-BERT) directly in spaCy.
- Lingua Franca - Mycroft's multilingual text parsing and formatting library.
- Simple Transformers - Based on the Transformers library by HuggingFace. Lets you quickly train and evaluate Transformer models.
- Deep Bidirectional Transformers for Language Understanding (2020) - Explains a legendary paper, BERT. (HN)
- EasyTransfer - Designed to make the development of transfer learning in NLP applications easier.
- LambdaBERT - Transformers-style implementation of BERT using LambdaNetworks instead of self-attention.
- DialoGPT - State-of-the-Art Large-scale Pretrained Response Generation Model.
- Neural reading comprehension and beyond - Danqi Chen's Thesis (2020) (Code)
- LAMA: LAnguage Model Analysis - Probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
- awesome-2vec - Curated list of 2vec-type embedding models.
- Rethinking Attention with Performers (2020) (HN)
- BERT Research - Key Concepts & Sources (2019)
- The Pile - Large, diverse, open source language modelling data set that consists of many smaller datasets combined together.
- Bort - Companion code for the paper "Optimal Subarchitecture Extraction for BERT."
- Vector AI - Encode And Deploy Vectors At The Edge. (Code)
- KeyBERT - Minimal keyword extraction with BERT. (Web)
- Multimodal Transformer for Unaligned Multimodal Language Sequences - In PyTorch.
- The Illustrated GPT-2 (Visualizing Transformer Language Models) (2020)
- A Primer in BERTology: What we know about how BERT works (2020) (HN)
- GPT Neo - Open-source GPT model, with pretrained 1.3B & 2.7B weight models. (HN)
- TextSynth - Bellard's free GPT-NeoX-20B, GPT-J playground and paid API. (Playground) (HN)
- How to Go from NLP in 1 Language to NLP in N Languages in One Shot (2020)
- Contextualized Topic Models - Family of topic models that use pre-trained representations of language (e.g., BERT) to support topic modeling.
- Language Style Transfer - Code for Style Transfer from Non-Parallel Text by Cross-Alignment paper.
- NLU - Power of Spark NLP, the Simplicity of Python. 1 line for hundreds of NLP models and algorithms.
- PyTorch Implementation of Google BERT
- High Performance Natural Language Processing (2020)
- duoBERT - Multi-stage passage ranking: monoBERT + duoBERT.
- Awesome GPT-3
- SMAC3 - Sequential Model-based Algorithm Configuration.
- Semantic Experiences by Google - Experiments in understanding language.
- Long-Range Arena - Systematic evaluation of efficient transformer models.
- PaddleHub - Awesome pre-trained models toolkit based on PaddlePaddle.
- DeepSPIN (Deep Structured Prediction in Natural Language Processing) (GitHub)
- Multi-Task Learning in NLP
- FastSeq - Provides efficient implementation of popular sequence models (e.g. Bart, ProphetNet) for text generation, summarization, translation tasks etc.
- Sentence Embeddings with BERT & XLNet
- FastFormers - Provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Understanding (NLU).
- Adversarial NLI - Adversarial Natural Language Inference Benchmark.
- textract - Extract text from any document. No muss. No fuss. (Docs)
- NLP and Named Entity Recognition (2020)
- Big Bird: Transformers for Longer Sequences
- NLP PyTorch Tutorial
- EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- CrossWeigh: Training Named Entity Tagger from Imperfect Annotations (2019) (Code)
- Does GPT-2 Know Your Phone Number? (2020)
- Towards Fully Automated Manga Translation (2020)
- Text Classification Models - All kinds of text classification models and more with deep learning.
- Awesome Text Summarization
- Shortformer: Better Language Modeling using Shorter Inputs (2020) (HN)
- huggingface_hub - Client library to download and publish models and other files on the huggingface.co hub.
- Embeddings from the Ground Up (2020)
- Ecco - Tools to visualize and explore NLP language models. (Web) (HN)
- Interfaces for Explaining Transformer Language Models (2020)
- DALL·E: Creating Images from Text (2021) (HN) (Reddit)
- CLIP: Connecting Text and Images (2021) (HN) (Paper) (Code)
- OpenNRE - Open-Source Package for Neural Relation Extraction (NRE).
- Princeton NLP Group (GitHub)
- Must-read papers on neural relation extraction (NRE)
- FewRel Dataset, Toolkits and Baseline Models
- Tree Transformer: Integrating Tree Structures into Self-Attention (2019) (Code)
- SentEval: evaluation toolkit for sentence embeddings
- gpt-scrolls - Collaborative collection of open-source safe GPT-3 prompts that work well.
- SLING - A natural language frame semantics parser - Built to learn to read and understand Wikipedia articles in many languages for the purpose of knowledge base completion.
- Awesome Neural Adaptation in NLP
- Natural language generation: The commercial state of the art in 2020 (HN)
- Non-Autoregressive Generation Progress
- Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
- VecMap - Framework to learn cross-lingual word embedding mappings.
- Kiri - Natural Language Engine. (Web)
- GPT3 List - List of things that people are claiming are enabled by GPT3.
- DeBERTa - Decoding-enhanced BERT with Disentangled Attention.
- Sockeye - Open-source sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet. (Docs)
- Robustness Gym - Python evaluation toolkit for natural language processing.
- State-of-the-Art Conversational AI with Transfer Learning
- GPT-Neo - GPT-3-sized model, open source and free. (HN) (Code)
- Deep Daze - Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network).
- Notebooks using the Hugging Face libraries
- NLP Cloud - Serve spaCy pre-trained models, and your own custom models, through a RESTful API.
- CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters (2020) (Code)
- jiant - Multitask and transfer learning toolkit for NLP. (Web)
- Must-read Papers on Textual Adversarial Attack and Defense
- Reranker - Build Text Rerankers with Deep Language Models.
- rust-bert - Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...).
- rust-tokenizers - Offers high-performance tokenizers for modern language models.
- Replicating GPT-2 at Home (2021) (HN)
- Shifterator - Interpretable data visualizations for understanding how texts differ at the word level.
- CMU Neural Networks for NLP Course (2021) (Videos)
- minnn - Exercise in developing a minimalist neural network toolkit for NLP.
- Controllable Sentence Simplification (2019) (Code)
- Awesome Relation Extraction
- retext - Natural language processor powered by plugins, part of the unified collective. (Awesome)
- CLIP Playground - Try OpenAI's CLIP model in your browser.
- GPT-3 Demo - GPT-3 Examples, Demos, Showcase, and NLP Use-cases.
- Big Sleep - Simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.
- Beyond the Imitation Game Benchmark (BIG-bench) - Collaborative benchmark intended to probe large language models, and extrapolate their future capabilities.
- AutoNLP - Automatic way to train, evaluate and deploy state-of-the-art NLP models for different tasks.
- DeText - Deep Neural Text Understanding Framework for Ranking and Classification Tasks.
- Paragraph Vectors in PyTorch
- NeuSpell: A Neural Spelling Correction Toolkit
- Natural Language YouTube Search - Search inside YouTube videos using natural language.
- Accelerate - Simple way to train and use NLP models with multi-GPU, TPU, mixed-precision.
- Classical Language Toolkit (CLTK) - Python library offering natural language processing (NLP) for pre-modern languages. (Web)
- Guide: Finetune GPT2-XL
- GENRE (Generative ENtity REtrieval) - Uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on fine-tuned BART architecture.
- Teachable NLP - GPT-2 Training as a Service.
- DensePhrases - Provides answers to your natural language questions from the entire Wikipedia in real-time.
- How to use GPT-3 recursively to solve general problems (2021)
- Podium - Framework agnostic Python NLP library for data loading and preprocessing.
- Prompts - Advanced GPT-3 playground. (Code)
- TextFlint - Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing.
- SimCSE: Simple Contrastive Learning of Sentence Embeddings (2021) (Code)
- Berkeley Neural Parser - High-accuracy NLP parser with models for 11 languages. (Web)
- nlpaug - Data augmentation for NLP.
- Top2Vec - Learns jointly embedded topic, document and word vectors.
- Focused Attention Improves Document-Grounded Generation (2021) (Code)
- NLPretext - All the go-to functions you need to handle NLP use-cases.
- spaCy + UDPipe
- adapter-transformers - Friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models.
- TextAttack - Generating adversarial examples for NLP models.
- GPT-NeoX - Implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library.
- Transfer Learning in Natural Language Processing (2019) (Code)
- Cohere - Help computers understand language. (Tweet)
- Transformers Interpret - Model explainability tool designed to work exclusively with the transformers package.
- Whatlang - Natural language detection library for Rust. (Web)
- Category Theory + NLP Papers
- UniLM - Pre-trained models for natural language understanding (NLU) and generation (NLG) tasks.
- AutoNLP - Faster and easier training and deployments of SOTA NLP models.
- TAble PArSing (TAPAS) - End-to-end neural table-text understanding models.
- Replacing Bert Self-Attention with Fourier Transform: 92% Accuracy, 7X Faster (2021)
- FNet: Mixing Tokens with Fourier Transforms (2021) (Tweet)
- True Few-Shot Learning with Language Models (2021) (Tweet) (Code)
- End-to-end NLP workflows from prototype to production (Web)
- Haystack - End-to-end Python framework for building natural language search interfaces to data. (HN)
- PLMpapers - Must-read Papers on pre-trained language models.
- English-to-Spanish translation with a sequence-to-sequence Transformer in Keras
- Evaluation Harness for Large Language Models - Framework for few-shot evaluation of autoregressive language models.
- MLP GPT - Jax - GPT, made only of MLPs, in Jax.
- Few-Shot Question Answering by Pretraining Span Selection (2021) (Code)
- Neural Extractive Search (2021) (Demo)
- Hugging Face NLP Course (Code)
- SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation.
- LoRA: Low-Rank Adaptation of Large Language Models (2021) (Code)
- PromptPapers - Must-read papers on prompt-based tuning for pre-trained language models.
- Obsei - Automation tool for text analysis needs.
- Evaluating Large Language Models Trained on Code (2021) (Code)
- Survey of Surveys for Natural Language Processing (SOS4NLP)
- CLIP guided diffusion
- Data driven literary analysis
- DALL·E Mini - Generate images from a text prompt.
- Jury - Evaluation for Natural Language Generation.
- Rubrix - Free and open-source tool to explore, label, and monitor data for NLP projects.
- Knowledge Neurons in Pretrained Transformers (2021) (Code) (Code)
- OpenCLIP - Open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).
- Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning (2021) (Code)
- Can a Fruit Fly Learn Word Embeddings? (2021)
- Spark NLP - Natural Language Processing library built on top of Apache Spark ML. (Web)
- Spark NLP Workshop - Notebooks and code showing how to use Spark NLP in Python and Scala.
- ConceptNet Numberbatch - Set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings.
- OpenAI Codex - AI system that translates natural language to code. (HN)
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
- NL-Augmenter - Collaborative Repository of Natural Language Transformations.
- wevi - Word embedding visual inspector. (Code)
- clip-retrieval - Easily compute CLIP embeddings and build a CLIP retrieval system with them.
- NVIDIA NeMo - Toolkit for conversational AI.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- BEIR - Heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.
- UER-py - Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo.
- ExplainaBoard - Explainable Leaderboard for NLP.
- Fast-BERT - Super easy library for BERT based NLP models.
- Genie Toolkit - Generator of Natural Language Parsers for Compositional Virtual Assistants. (Paper)
- Quantum Stat - Your NLP Model Training Platform.
- Mistral - Framework for transparent and accessible large-scale language model training, built with Hugging Face. (Docs)
- NERDA - Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks.
- Data Augmentation Techniques for NLP
- Feed forward VQGAN-CLIP model
- Yet Another Keyword Extractor (Yake) - Unsupervised Approach for Automatic Keyword Extraction using Text Features.
- Challenges in Detoxifying Language Models (2021) (Tweet)
- TextBrewer - PyTorch-based model distillation toolkit for natural language processing.
- GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain (2021)
- PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models (2021) (Code)
- VQGAN-CLIP Overview - Repo for running VQGAN+CLIP locally.
- TLDR: Extreme Summarization of Scientific Documents (2020) (Code)
- Can Language Models be Biomedical Knowledge Bases? (2021)
- ColBERT: Contextualized Late Interaction over BERT (2020)
- Investigating Pretrained Language Models for Graph-to-Text Generation (2020) (Code)
- Ubiquitous Knowledge Processing Lab (GitHub)
- DedupliPy - Python package for deduplication/entity resolution using active learning.
- Flexible Generation of Natural Language Deductions (2021) (Code)
- Machine Translation Reading List
- Compressive Transformers for Long-Range Sequence Modelling (2020) (Code)
- pyxclib - Tools for multi-label classification problems.
- ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.
- OpenPrompt - Open-Source Toolkit for Prompt-Learning.
- Unsupervised Neural Machine Translation with Generative Language Models Only (2021) (Tweet)
- Grounding Spatio-Temporal Language with Transformers (2021) (Code)
- Fast Sentence Embeddings (fse) - Compute Sentence Embeddings Fast.
- Symbolic Knowledge Distillation: from General Language Models to Commonsense Models (2021)
- Surge AI - Build powerful NLP datasets using our global labeling force and platform. (Python SDK)
- Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels (Code)
- ogen - OpenAPI v3 code generator for Go.
- PromptSource - Toolkit for collecting and applying prompts to NLP datasets. (Web) (HN)
- Creating User Interface Mock-ups from High-Level Text Descriptions with Deep-Learning Models (2021)
- Filtlong - Tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset.
- Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System (2021) (Code)
- xFormers - Hackable and optimized Transformers building blocks, supporting a composable construction.
- Language Models As or For Knowledge Bases (2021)
- Wikipedia2Vec - Tool for learning vector representations of words and entities from Wikipedia. (Code)
- Reflections on Foundation Models (2021) (Tweet)
- textacy - NLP, before and after spaCy.
- Natural Language Processing Specialization Course (Tweet)
- Hugging Face on Amazon SageMaker Workshop
- CS224N: Natural Language Processing with Deep Learning | Winter 2021 - YouTube
- GPT-3 creates geofoam, but out of text (2021)
- Towards Efficient NLP: A Standard Evaluation and A Strong Baseline (2021) (Code)
- Hierarchical Transformers Are More Efficient Language Models (2021) (HN) (Code)
- Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration (2021) (Code)
- GPT-3 is no longer the only game in town (2021) (HN)
- PatrickStar - Parallel Training of Large Language Models via a Chunk-based Memory Management.
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) (2021)
- Text2Art - AI Powered Text-to-Art Generator.
- Emergent Communication of Generalizations (2021) (Code)
- Awesome Pretrained Models for Information Retrieval
- SummerTime - Text Summarization Toolkit for Non-experts.
- NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (2021) (Code)
- Differentially Private Fine-tuning of Language Models (2021) (Tweet)
- TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning (2021) (Code)
- Aphantasia - CLIP + FFT/DWT/RGB = text to image/video.
- OpenAI’s API Now Available with No Waitlist (2021) (HN)
- Recent trends of Entity Linking, Disambiguation, and Representation
- Intro to Large Language Models with Cohere
- spacy-experimental - Cutting-edge experimental spaCy components and features.
- AdaptNLP - High level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end to end tasks. (Docs)
- Reading list for Awesome Sentiment Analysis papers
- Aspect-Based-Sentiment-Analysis: Transformer & Explainable ML (TensorFlow)
- Deploy optimized transformer based models in production
- PyConverse - Conversational text Analysis using various NLP techniques.
- KILT - Library for Knowledge Intensive Language Tasks.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) (Code)
- N-grammer: Augmenting Transformers with latent n-grams (2021) (Code)
- textsearch - Find strings/words in text; convenience and C speed.
- Mastering spaCy Book (2021) (Code)
- sense2vec - Contextually-keyed word vectors.
- Pureformer: Do We Even Need Attention? (2021)
- Knover - Toolkit for knowledge grounded dialogue generation based on PaddlePaddle.
- Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval | DeepMind (2021) (HN)
- CMU Advanced NLP 2021 - YouTube
- whatlies - Toolkit to help understand "what lies" in word embeddings. Also benchmarking.
- CLIP-Guided-Diffusion
- Factual Probing Is [MASK]: Learning vs. Learning to Recall (2021) (Code)
- Improving Compositional Generalization with Latent Structure and Data Augmentation (2021)
- PORORO - Platform Of neuRal mOdels for natuRal language prOcessing.
- PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2021) (Code)
- To Understand Language Is to Understand Generalization (2021) (HN)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020) (Code)
- Multimodal Transformers | Transformers with Tabular Data (Article)
- Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering (2021) (Code)
- Improving Language Models by Retrieving from Trillions of Tokens (2021)
- Open Information Extraction (OIE) Resources
- Deeper Text Understanding for IR with Contextual Neural Language Modeling (2019) (Code)
- x-clip - Concise but complete implementation of CLIP with various experimental improvements from recent papers.
- Calamity - Self-hosted GPT playground.
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (2021) (Code)
- Transactions of the Association for Computational Linguistics (2021) (Code)
- DocEE - Toolkit for document-level event extraction, containing some SOTA model implementations.
- Autoregressive Entity Retrieval (2020)
- Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation (2020)
- A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition (2021)
- Deduplicating Training Data Makes Language Models Better (2021) (Code)
- Transformers without Tears: Improving the Normalization of Self-Attention (2019) (Code)
- CTCDecoder - Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
- Custom Named Entity Recognition with spaCy 3
- BARTScore: Evaluating Generated Text as Text Generation (2021) (Code)
- minDALL-E on Conceptual Captions - PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs.
- Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation (2021) (Code)
- Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) (Code)
- spaCy models - Models for the spaCy Natural Language Processing (NLP) library.
- Awesome Huggingface
- SyntaxDot - Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
- STriP Net - Semantic Similarity of Scientific Papers (S3P) Network.
- Small-Text - Active Learning for Text Classification in Python.
- Plug and Play Language Models: A Simple Approach to Controlled Text Generation (2020) (Code)
- RuDOLPH - One Hyper-Modal Transformer can be as creative as DALL-E and as smart as CLIP.
- PLM papers - Paper list of pre-trained language models (PLMs).
- Ongoing research training transformer language models at scale, including: BERT & GPT-2
- Improving language models by retrieving from trillions of tokens (2022) (Code)
- EntitySeg Toolbox - Towards precise and open-world image segmentation.
- Aligning Language Models to Follow Instructions (2022) (Tweet) (Code)
- Simple Questions Generate Named Entity Recognition Datasets (2021) (Code)
- KRED: Knowledge-Aware Document Representation for News Recommendations (2019) (Code)
- Stanford Open Information Extraction
- Python3 wrapper for Stanford OpenIE
- I-BERT: Integer-only BERT Quantization (2021) (Code)
- spaCy-wrap - Wrapping fine-tuned transformers in spaCy pipelines.
- DeepMatcher - Python package for performing Entity and Text Matching using Deep Learning.
- Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond (2020) (Code)
- medspacy - Library for clinical NLP with spaCy.
- Natural Language Processing with Transformers Book (Code)
- blurr - Library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.
- HanLP - Multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x.
- Awesome Text-to-Image
- NLP News Newsletter
- Named Entity Recognition as Dependency Parsing (2020) (Code)
- Multilingual-CLIP - OpenAI CLIP text encoders for any language.
- FasterTransformer - Transformer related optimization, including BERT, GPT.
- Papers about Causal Inference and Language
- EET (Easy and Efficient Transformer) - Efficient PyTorch inference plugin focused on Transformer-based models with large model sizes and long sequences.
- Measuring Massive Multitask Language Understanding (2021) (Code)
- A Theoretical Analysis of the Repetition Problem in Text Generation (2021) (Code)
- TransformerSum - Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
- Natural Language Processing with Transformers Book
- Transformer Memory as a Differentiable Search Index (2022) (HN) (Tweet)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (2020) (Code)
- spaCy + Stanza - Use the latest Stanza (StanfordNLP) research models directly in spaCy.
- Awesome Document Understanding
- Sequential Transformer - Code for training Transformers on sequential tasks such as language modeling.
- bert-as-service - Mapping a variable-length sentence to a fixed-length vector using BERT model.
- A Contrastive Framework for Neural Text Generation (2022) (Code)
- Parallax - Tool for interactive embeddings visualization.
- Serve PyTorch model as an API using AWS + serverless framework
- Neural reality of argument structure constructions (2022)
- DeepNet: Scaling Transformers to 1,000 Layers (2022) (HN)
- Large Models of Source Code - Guide to using pre-trained large language models of source code.
- HyperMixer: An MLP-based Green AI Alternative to Transformers (2022)
- NLP Course Material & QA
- Survey of Surveys (NLP & ML) - Collection of 700+ survey papers on Natural Language Processing (NLP) and Machine Learning (ML).
- Awesome CLIP - Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
- MAGMA - GPT-style multimodal model that can understand any combination of images and language.
- Timexy - spaCy custom component that extracts and normalizes temporal expressions.
- New Capabilities for GPT-3: Edit and Insert (2022) (HN)
- Which hardware to train a 176B parameters model? (2022) (Tweet)
- Fundamentals of NLP - Series of hands-on notebooks for learning the fundamentals of NLP.
- BertViz - Visualize Attention in Transformer Models (BERT, GPT2, BART, etc.).
- Attention Is All You Need (2017) (Code) (PyTorch Code)
- Word2Vec Explained. Explaining the Intuition of Word2Vec (2021) (HN)
- imgbeddings - Python package to generate image embeddings with CLIP without PyTorch/TensorFlow.
- Linking Emergent and Natural Languages via Corpus Transfer (2022)
- Transformer Inference Arithmetic (2022)
- Training Compute-Optimal Large Language Models (2022) (Tweet)
- KeyphraseVectorizers - Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix.
- Gramformer - Framework for detecting, highlighting and correcting grammatical errors in natural language text.
- Classy Classification - Easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
- Sphere - Web-scale retrieval for knowledge-intensive NLP.
- muTransformers - Common Huggingface transformers in maximal update parametrization (µP).
- Event Extraction papers - List of NLP resources focused on event extraction task.
- Summarization Papers
- GLID-3 - Combination of OpenAI GLIDE, Latent Diffusion and CLIP.
- Optimum Transformers - Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.
- Pathways Language Model (PaLM): Scaling to 540B parameters (2022) (HN) (Code) (Code)
- A Divide-and-Conquer Approach to the Summarization of Long Documents (2020) (Code)
- Resources for learning about Text Mining and Natural Language Processing
- LinkBERT: Pretraining Language Models with Document Links (2022) (Code)
- DALL·E 2 (2022) (HN) (Tweet) (Tweet) (Code) (Code) (Code) (Tweet) (Tweet) (HN) (Video Summary) (HN) (Tweet)
- Variations of the Similarity Function of TextRank for Automated Summarization (2016) (Code)
- Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (2020) (Code)
- Awesome Knowledge Distillation
- You Only Look at One Sequence (2021)
- Towards Understanding and Mitigating Social Biases in Language Models (2021) (Code)
- DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization (2021) (Code)
- Humanloop Programmatic - Create large high-quality datasets for NLP in minutes. No hand labelling required. (HN)
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language (2022)
- Second order effects of the rise of large language models (2022)
- Simple Annotated implementation of GPT-NeoX in PyTorch
- BLEURT: Learning Robust Metrics for Text Generation (2020) (Code)
- Bootleg - Self-supervised named entity disambiguation (NED) system that links mentions in text to entities in a knowledge base. (Code)
- DALL-E in Mesh-TensorFlow
- A few things to try with DALL·E (2022) (HN)
- Google's 540B PaLM Language Model & OpenAI's DALL-E 2 Text-to-Image Revolution (2022)
- Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution (2021) (Code)
- Simple and Effective Multi-Paragraph Reading Comprehension (2017) (Code)
- Researchers Glimpse How AI Gets So Good at Language Processing (2022)
- Cornell Conversational Analysis Toolkit (ConvoKit) - Toolkit for extracting conversational features and analyzing social phenomena in conversations.
- UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models (2022) (Code)
- exBERT - Visual Analysis Tool to Explore Learned Representations in Transformers Models.
- How DALL-E 2 Works (2022) (HN)
- Getting started with NLP for absolute beginners (2022)
- EasyNLP - Comprehensive and Easy-to-use NLP Toolkit.
- Reframing Human-AI Collaboration for Generating Free-Text Explanations (2021) (Tweet)
- Detoxify - Comment Classification with PyTorch Lightning and Transformers.
- DLATK - End-to-end human text analysis package, specifically suited for social media and social scientific applications.
- Language modeling via stochastic processes (2022) (Code)
- An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling (2022) (Code)
- SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (2021) (Code)
- DataLab - Unified platform that allows NLP researchers to perform a number of data-related tasks in an efficient and easy-to-use manner.
- Limitations of DALL-E (HN)
- AutoPrompt - Automatic Prompt Construction for Masked Language Models.
- DALL·E Flow - Human-in-the-Loop workflow for creating HD images from text.
- Recon NER - Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data.
- CausalNLP - Practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.
- OPT: Open Pre-trained Transformer Language Models (2022) - Meta's 175B parameter language model. (Reddit) (Tweet)
- Bert Extractive Summarizer - Easy to use extractive text summarization with BERT.
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data (2020) (Code)
- LM-Debugger - Interactive tool for inspection and intervention in transformer-based language models.
- 100 Pages of raw notes released with the language model OPT-175B (HN)
- Unsupervised Cross-Task Generalization via Retrieval Augmentation (2022) (Code)
- On Continual Model Refinement in Out-of-Distribution Data Streams (2022)
- GLID-3-XL - 1.4B latent diffusion model from CompVis back-ported to the guided diffusion codebase.
- Neutralizing Subjectivity Bias with HuggingFace Transformers (2022)
- Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists (2022) (Code) (Tweet)
- gse - Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese and others.
- BERTopic: The Future of Topic Modeling (2022) (HN)
- Unifying Language Learning Paradigms (2022) (Code)
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling (2021) (Code)
- GPT-3 limitations (2022)
- Natural Language Processing Demystified
- Concise Concepts - Contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
- Dynamic language understanding: adaptation to new knowledge in parametric and semi-parametric models (2022) (Tweet)
- nlprule - Fast, low-resource Natural Language Processing and Text Correction library written in Rust.
- Quark: Controllable Text Generation with Reinforced Unlearning (2022) (Tweet)
- DALL-E 2 has a secret language (HN) (Tweet) (HN)
- AdaTest - Find and fix bugs in natural language machine learning models using adaptive testing.
- Diffusion-LM Improves Controllable Text Generation (2022) (Code) (Tweet)
- RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering (2021) (Code)
- Neural Prompt Search - Searching prompt modules for parameter-efficient transfer learning.
- makemore - Most accessible way of tinkering with a GPT - one hackable script.
- DALL-E Playground - Playground for DALL-E enthusiasts to tinker with the open-source version of OpenAI's DALL-E, based on DALL-E Mini.
- Craiyon - AI model drawing images from any prompt. Formerly DALL-E mini.
- Contrastive Learning for Natural Language Processing
- MSCTD: A Multimodal Sentiment Chat Translation Dataset (Code)
- Auto-Lambda: Disentangling Dynamic Task Relationships (2022) (Code)
- Concepts in Neural Networks for NLP
- DinkyTrain - Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration.
- Pretrained Language Models
- BERT-of-Theseus: Compressing BERT by Progressive Module Replacing (2020) (Code)
- YaLM 100B - GPT-like neural network for generating and processing text by Yandex. (HN)
- Pathways Autoregressive Text-to-Image model (Parti) - Autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. (Web) (HN)
- How Imagen Actually Works (2022)
- First impressions of DALL-E, generating images from text (2022) (Lobsters)
- Meta is inviting researchers to pick apart the flaws in its version of GPT-3 (2022) (HN)
- 'Making Moves' In DALL·E mini (2022)
- min(DALL·E) - Minimal implementation of DALL·E Mini. It has been stripped to the bare essentials necessary for doing inference, and converted to PyTorch.
- Awesome Document Similarity Measures
- RETRO Is Blazingly Fast (2022)
- LightOn - Unlock Extreme-Scale Machine Intelligence. Most repos are focused on the use of photonic hardware. (GitHub)
- Minerva: Solving Quantitative Reasoning Problems with Language Models (2022) (Paper)
- winkNLP - Developer friendly Natural Language Processing. (Docs)
- Facebook Low Resource (FLoRes) MT Benchmark
- Using GPT-3 to explain how code works (2022) (Lobsters) (HN)
- Awesome Topic Models
- Introducing The World’s Largest Open Multilingual Language Model: BLOOM
- The DALL·E 2 Prompt Book (HN) (Tweet)
- RWKV - RNN with Transformer-level performance, which can also be directly trained like a GPT transformer (parallelizable).
- Kern AI - Open-source IDE for data-centric NLP. Combining programmatic labeling, extensive data management and neural search capabilities. (Code) (HN)
- spaCy fishing - spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata.
- DALL·E Now Available in Beta (2022) (HN)
- Inside language models (from GPT-3 to PaLM)
- Timeline of AI and language models
- Cascades - Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference, and more.
- Awesome Neural Symbolic
- Towards Knowledge-Based Recommender Dialog System (2019) (Code)
- Asent - Rule-based sentiment analysis library for Python made using spaCy.
- extractacy - Pattern extraction and named entity linking for spaCy.
- A Hazard Analysis Framework for Code Synthesis Large Language Models (2022)
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) (Code)
- A Frustratingly Easy Approach for Entity and Relation Extraction (2021) (Code)
- Chinchilla's Wild Implications (2022) (HN)
- DALL·E 2 prompt book (2022) (HN)
- GLM-130B - Open Bilingual Pre-Trained Model.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (2022) (Code)
- DALL-E + GPT-3 = ♥ (2022) (HN)
- Run your own DALL-E-like image generator (2022) (HN)
- Stable Diffusion launch announcement (2022) (HN)
- Stable Diffusion
- MidJourney Styles and Keywords Reference
- Spent $15 in DALL·E 2 credits creating this AI image (2022) (HN)
- Phraser - Better way to generate prompts.
- Seminar on Large Language Models (2022)