Final Project - inspirations, datasets and papers
Interesting papers
To inspire project ideas, here are some cool NLP papers:
- Attention is All You Need
- Quasi-Recurrent Neural Networks
- Semi-supervised Sequence Learning
- A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
- Semi-supervised sequence tagging with bidirectional language models
- Deep Biaffine Attention for Neural Dependency Parsing
- Generating Sentences from a Continuous Space
- Improving Neural Language Models with a Continuous Cache
- Reasoning about Entailment with Neural Attention
- Ultradense Word Embeddings by Orthogonal Transformation
To inspire project ideas, here are some cool Computer Vision papers:
- Object recognition: [Krizhevsky et al.], [Russakovsky et al.], [Szegedy et al.], [Simonyan et al.], [He et al.]
- Object detection: [Girshick et al.], [Sermanet et al.], [Erhan et al.]
- Image segmentation: [Long et al.]
- Video classification: [Karpathy et al.], [Simonyan and Zisserman]
- Scene classification: [Zhou et al.]
- Face recognition: [Taigman et al.]
- Depth estimation: [Eigen et al.]
- Image-to-sentence generation: [Karpathy and Fei-Fei], [Donahue et al.], [Vinyals et al.]
- Visualization and optimization: [Szegedy et al.], [Nguyen et al.], [Zeiler and Fergus], [Goodfellow et al.], [Schaul et al.]
Interesting datasets
NLP datasets:
- Sequence Tagging: Named Entity Recognition and Chunking
- Dependency Parsing
- Quora Question Pairs
- Sentence-Level Sentiment Analysis and Document-Level Sentiment Analysis
- Textual Entailment
- Machine Translation (Ambitious)
- Yelp Reviews
- WikiText Language Modeling
- Fake News Challenge
- Toxic Comment Classification
Computer vision datasets:
- Meta Pointer: A large collection organized by CV Datasets.
- Yet another Meta pointer
- ImageNet: a large-scale image dataset for visual recognition organized by WordNet hierarchy
- SUN Database: a benchmark for scene recognition and object detection with annotated scene categories and segmented objects
- Places Database: a scene-centric database with 205 scene categories and 2.5 million labelled images
- NYU Depth Dataset v2: an RGB-D dataset of segmented indoor scenes
- Microsoft COCO: a new benchmark for image recognition, segmentation and captioning
- Flickr100M: 100 million Creative Commons Flickr images
- Labeled Faces in the Wild: a dataset of 13,000 labeled face photographs
- Human Pose Dataset: a benchmark for articulated human pose estimation
- YouTube Faces DB: a face video dataset for unconstrained face recognition in videos
- UCF101: an action recognition data set of realistic action videos with 101 action categories
- HMDB-51: a large human motion dataset of 51 action classes
Others: you can always explore the Kaggle Datasets for various types of datasets.
Default project
See Stanford's CS224n default project page.
Sample projects
We have created a few sample projects to give you an idea of the types of challenges you can tackle:
- Troll tweet / toxic comment detection
  - goal: detect toxic / troll comments on Facebook or on other social media / forums
  - max team size: 3 people
  - datasets:
    - https://www.kaggle.com/vikasg/russian-troll-tweets
    - https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
- 3D human pose from a 2D image
  - goal: implement a cutting-edge research paper, which you can see here
  - max team size: 5 people
  - datasets:
    - The DensePose dataset (expected to be released before mid-June)
    - Unite the People dataset
- Growing NNs during training (similar to GANs)
  - goal: apply progressive growing of neural networks during training (borrowed from the GAN paper by Nvidia)
  - max team size: 4 people
  - dataset: the idea can be applied to various problems; the dataset will be chosen depending on the selected problem
- Fake news
  - goal: attempt to develop a mathematical / heuristic model plus a Deep Learning approach for detecting fake news
  - max team size: 4 people
  - datasets:
    - https://github.com/FakeNewsChallenge/fnc-1
- Predicting song popularity
  - goal: attempt to predict how popular a song can be, based on sound data, lyrics and other criteria
  - max team size: 5 people
  - datasets:
    - https://labrosa.ee.columbia.edu/millionsong/
    - https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
    - https://www.kaggle.com/mousehead/songlyrics
    - https://www.kaggle.com/artimous/every-song-you-have-heard-almost
- Predicting bitcoin prices (based on financial indicators and sentiment)
  - goal: a highly experimental topic; the main idea is to explore the feasibility of such an algorithm
  - max team size: 5 people
  - datasets:
    - https://www.kaggle.com/mczielinski/bitcoin-historical-data
    - http://eventregistry.org/
    - https://www.kaggle.com/bigquery/bitcoin-blockchain
    - https://www.kaggle.com/jessevent/all-crypto-currencies
    - https://www.kaggle.com/snapcrack/all-the-news