{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Tdj1XLuceOk-"
},
"source": [
"# Week 03: Word Representation\n",
"The assignment this week is to distinguish between good and bad phrases containing the word \"**earn**\" (e.g., earn money). In the process, you will practice using word2vec, one of the methods covered today.\n",
"\n",
"Data used in this assignment: \n",
"https://drive.google.com/drive/folders/1qTIrefo4EFbsVF3LXhKbiahbIrvCLUBJ?usp=sharing\n",
"\n",
"* train.tsv: Phrases with labels, used to train and validate the classification model. There are only two labels: 1 means *good*; 0 means *bad*.\n",
"* test.tsv: Same format as train.tsv. It is used to test your model.\n",
"* GoogleNews-vectors-negative300.bin.gz: a word2vec model pre-trained by Google ([source](https://code.google.com/archive/p/word2vec/))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3GzvI76xeOlH"
},
"source": [
"## Requirements\n",
"* pandas\n",
"* tensorflow\n",
"* scikit-learn (imported as `sklearn`)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jk5Xag5ueOlI"
},
"source": [
"## Read Data\n",
"We store the data in a pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UYsjz2eCeOlI"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def loadData(path):\n",
"    # Each line holds a phrase and its label (1 = good, 0 = bad), separated by a tab.\n",
"    ngram = []\n",
"    _class = []\n",
"    with open(path) as f:\n",
"        for line in f:\n",
"            line = line.strip(\"\\n\").split(\"\\t\")\n",
"            ngram.append(line[0])\n",
"            _class.append(int(line[1]))\n",
"    return pd.DataFrame({\"phrase\": ngram, \"class\": _class})\n",
"\n",
"train = loadData(\"train.tsv\")\n",
"print(train.head())\n",
"test = loadData(\"test.tsv\")\n",
"print(test.head())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sl8pGQx2eOlL"
},
"source": [
"## Load word2vec model\n",
"**[ TODO ]** Please load the [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g) model and check the embedding of the word `language`.\n",
"\n",
"* package `gensim` is a good choice (Look up the documentation [here](https://radimrehurek.com/gensim/models/word2vec.html))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DQjCDqZyeOlO"
},
"outputs": [],
"source": [
"w2v_model = ......\n",
"#### print \"language\" embedding "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fap51QcAeOlO"
},
"source": [
"Expected output: \n",
"\n",
"> \\[ 2.30712891e-02 1.68457031e-02 1.54296875e-01 1.27929688e-01\n",
"> -2.67578125e-01 3.51562500e-02 1.19140625e-01 2.48046875e-01\n",
"> 1.93359375e-01 -7.95898438e-02 1.46484375e-01 -1.43554688e-01\n",
"> -3.04687500e-01 3.46679688e-02 -1.85546875e-02 1.06933594e-01\n",
"> -1.52343750e-01 2.89062500e-01 2.35595703e-02 -3.80859375e-01\n",
"> 1.09863281e-01 4.41406250e-01 3.75976562e-02 -1.22680664e-02\n",
"> 1.62353516e-02 -2.24609375e-01 7.61718750e-02 -3.12500000e-02\n",
"> -2.16064453e-02 1.49414062e-01 -4.02832031e-02 -4.46777344e-02\n",
"> -1.72851562e-01 3.32031250e-02 1.50390625e-01 -5.05371094e-02\n",
"> 2.72216797e-02 3.00781250e-01 -1.33789062e-01 -7.56835938e-02\n",
"> 1.93359375e-01 -1.98242188e-01 -1.27563477e-02 4.19921875e-01\n",
"> -2.19726562e-01 1.44531250e-01 -3.93066406e-02 1.94335938e-01\n",
"> -3.12500000e-01 1.84570312e-01 1.48773193e-04 -1.67968750e-01\n",
"> -7.37304688e-02 -3.12500000e-02 1.57226562e-01 3.30078125e-01\n",
"> -1.42578125e-01 -3.16406250e-01 -7.32421875e-02 -5.76171875e-02\n",
"> 1.02050781e-01 -1.08886719e-01 1.24023438e-01 -2.50244141e-02\n",
"> -2.49023438e-01 1.25976562e-01 -1.79687500e-01 3.32031250e-01\n",
"> 7.14111328e-03 2.51953125e-01 4.34570312e-02 -4.34570312e-02\n",
"> -3.90625000e-01 1.76757812e-01 -1.13525391e-02 -1.97753906e-02\n",
"> 2.79296875e-01 2.36328125e-01 1.19140625e-01 5.59082031e-02\n",
"> 1.73828125e-01 -1.10839844e-01 -4.95605469e-02 2.13867188e-01\n",
"> 6.17675781e-02 1.38671875e-01 -4.45556641e-03 2.55859375e-01\n",
"> 1.80664062e-01 5.88378906e-02 -6.59179688e-02 -2.08007812e-01\n",
"> -1.19140625e-01 -1.57226562e-01 5.02929688e-02 -6.29882812e-02\n",
"> 5.00488281e-02 -7.27539062e-02 1.74560547e-02 -3.56445312e-02\n",
"> -1.93359375e-01 3.93066406e-02 -3.36914062e-02 -1.07421875e-01\n",
"> 5.78613281e-02 -8.20312500e-02 1.74560547e-02 -1.65039062e-01\n",
"> 1.46484375e-01 -3.08837891e-02 -3.86718750e-01 2.49023438e-01\n",
"> 8.74023438e-02 -2.15820312e-01 -4.10156250e-02 1.60156250e-01\n",
"> 1.85546875e-01 -2.27050781e-02 -3.73535156e-02 7.86132812e-02\n",
"> -1.46484375e-01 6.78710938e-02 1.26953125e-01 3.30078125e-01\n",
"> 1.11328125e-01 9.27734375e-02 -3.45703125e-01 -1.41601562e-01\n",
"> -5.29785156e-02 -1.50390625e-01 -7.81250000e-02 -1.27929688e-01\n",
"> -4.02343750e-01 -1.41601562e-01 8.44726562e-02 1.08398438e-01\n",
"> -4.44335938e-02 3.73535156e-02 5.61523438e-02 -1.91406250e-01\n",
"> 1.54296875e-01 -5.12695312e-02 -6.49414062e-02 -8.30078125e-02\n",
"> 7.17773438e-02 -1.33789062e-01 1.05468750e-01 3.33984375e-01\n",
"> -1.08398438e-01 1.91650391e-02 2.14843750e-01 2.15820312e-01\n",
"> -1.05468750e-01 -1.44531250e-01 4.32128906e-02 -2.71484375e-01\n",
"> -3.78906250e-01 1.09863281e-01 -8.15429688e-02 -6.12792969e-02\n",
"> -1.33789062e-01 9.71679688e-02 -1.04370117e-02 -1.21093750e-01\n",
"> -2.44140625e-01 1.02050781e-01 1.10839844e-01 -1.00585938e-01\n",
"> 1.71875000e-01 -3.61328125e-02 -4.39453125e-02 2.83203125e-01\n",
"> -8.93554688e-02 -1.70898438e-01 2.46093750e-01 1.16699219e-01\n",
"> 8.39843750e-02 -1.32812500e-01 -1.61132812e-01 -1.39648438e-01\n",
"> -8.59375000e-02 -1.37695312e-01 -9.32617188e-02 -1.33789062e-01\n",
"> 1.65039062e-01 4.93164062e-02 -1.21093750e-01 -2.11914062e-01\n",
"> 1.61132812e-01 -1.07421875e-01 -3.97949219e-02 -3.51562500e-01\n",
"> -5.02929688e-02 1.46484375e-01 -4.68750000e-02 4.17480469e-02\n",
"> -1.27929688e-01 -9.76562500e-02 -2.46093750e-01 6.78710938e-02\n",
"> -2.30468750e-01 1.80664062e-02 3.54003906e-02 7.32421875e-02\n",
"> -2.23632812e-01 -1.25976562e-01 2.12890625e-01 -3.93066406e-02\n",
"> -2.41699219e-02 -9.61914062e-02 7.51953125e-02 -1.46484375e-01\n",
"> -1.49414062e-01 -8.83789062e-02 -4.88281250e-02 2.32421875e-01\n",
"> 3.30078125e-01 1.59179688e-01 -2.35351562e-01 -1.25976562e-01\n",
"> 2.68554688e-02 -5.29785156e-02 -6.59179688e-02 -2.17773438e-01\n",
"> -6.37817383e-03 -2.53906250e-01 2.28515625e-01 4.93164062e-02\n",
"> 3.54003906e-02 1.66992188e-01 -7.27539062e-02 -2.53906250e-01\n",
"> -1.34765625e-01 3.69140625e-01 1.83593750e-01 -1.64062500e-01\n",
"> 2.26562500e-01 -8.88671875e-02 3.69140625e-01 5.54199219e-02\n",
"> -3.63769531e-02 -1.48437500e-01 9.13085938e-02 2.47955322e-04\n",
"> 2.67578125e-01 -1.63085938e-01 1.19628906e-01 2.77343750e-01\n",
"> -1.49414062e-01 1.33789062e-01 -8.25195312e-02 -1.74804688e-01\n",
"> -1.77734375e-01 2.06054688e-01 5.07812500e-02 -2.08007812e-01\n",
"> -1.74804688e-01 9.66796875e-02 6.98242188e-02 -5.79833984e-04\n",
"> 9.22851562e-02 7.95898438e-02 1.41601562e-01 8.72802734e-03\n",
"> -8.05664062e-02 4.80957031e-02 2.49023438e-01 -1.64062500e-01\n",
"> -4.66308594e-02 -2.81250000e-01 -1.66015625e-01 -2.22656250e-01\n",
"> -2.32421875e-01 1.32812500e-01 4.15039062e-02 1.15234375e-01\n",
"> -7.66601562e-02 -1.10839844e-01 -1.97265625e-01 3.06396484e-02\n",
"> -1.03515625e-01 2.49023438e-02 -2.52685547e-02 3.39355469e-02\n",
"> 4.29687500e-02 -1.44531250e-01 2.12402344e-02 2.28271484e-02\n",
"> -1.88476562e-01 3.22265625e-01 -1.13281250e-01 -7.61718750e-02\n",
"> 2.94921875e-01 -1.33789062e-01 -1.80664062e-02 -6.25610352e-03\n",
"> -1.62353516e-02 5.98144531e-02 1.21582031e-01 4.17480469e-02\\] "
]
},
{
"cell_type": "markdown",
"source": [
"**[ TODO ]** You can also find the top-N most similar words. Try it! "
],
"metadata": {
"id": "RL4Gqhyw56oX"
}
},
{
"cell_type": "code",
"source": [
"#### print top 5 most similar words to \"school\""
],
"metadata": {
"id": "zq-Jwhxe5jDy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Expected output: \n",
"> \n",
"[('elementary', 0.7868632078170776),
\n",
" ('schools', 0.7411909103393555),
\n",
" ('shool', 0.6692329049110413),
\n",
" ('elementary_schools', 0.6597153544425964),
\n",
" ('kindergarten', 0.6529811024665833)]"
],
"metadata": {
"id": "yUUOFU4J4Anl"
}
},
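{
"cell_type": "markdown",
"metadata": {},
"source": [
"The similarity scores above are cosine similarities between word vectors. As a rough illustration of how such a ranking works (not gensim's actual implementation), here is a minimal sketch in plain Python. The three-dimensional vectors and the names `toy_vectors`, `cosine`, and `top_n` are made up for this example; the real embeddings have 300 dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"# Made-up 3-dimensional vectors standing in for real 300-dimensional embeddings.\n",
"toy_vectors = {\n",
"    'school': [0.9, 0.1, 0.0],\n",
"    'schools': [0.8, 0.2, 0.1],\n",
"    'kindergarten': [0.7, 0.3, 0.2],\n",
"    'banana': [0.0, 0.1, 0.9],\n",
"}\n",
"\n",
"def cosine(u, v):\n",
"    # Cosine similarity: dot product divided by the product of the vector norms.\n",
"    dot = sum(a * b for a, b in zip(u, v))\n",
"    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))\n",
"\n",
"def top_n(word, vectors, n=2):\n",
"    # Score every other word against the query word and keep the n highest scores.\n",
"    scores = [(other, cosine(vectors[word], vec)) for other, vec in vectors.items() if other != word]\n",
"    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]\n",
"\n",
"print(top_n('school', toy_vectors))"
]
},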
{
"cell_type": "markdown",
"metadata": {
"id": "zIk5hWfGeOlR"
},
"source": [
"## Preprocessing\n",
"Preprocess the two tsv files here."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rUKN7pEKeOlS"
},
"source": [
"#### adjust the ratio of the two classes of training data\n",
"In the training data, the ratio of good phrases to bad phrases is about one to thirty. Such an imbalance tends to hurt the training of the classifier, so we need to adjust the ratio. Reducing the number of bad phrases and adding more good phrases are both common approaches.\n",
"\n",
"**[ TODO ]** Please adjust the ratio of good phrases to bad phrases however you think is best, and print the number of examples in each class for the demo.\n",
"\n",
"You need to explain why you chose this ratio and how you did it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gpleKkC9eOlS"
},
"outputs": [],
"source": [
".......\n",
"train = ......\n",
"#### print the number of training data of two classes"
]
},
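{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to reduce the bad phrases is random undersampling. The cell below is only a sketch on made-up data, in plain Python; the list `toy`, the function `undersample`, and the ratio of 2 are assumptions for illustration, not a recommended setting. With a DataFrame, the same idea could be written with `DataFrame.sample`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"# Hypothetical toy data: 2 good phrases (label 1) and 6 bad phrases (label 0).\n",
"toy = [('earn money', 1), ('earn a living', 1)] + [('earn bad %d' % i, 0) for i in range(6)]\n",
"\n",
"def undersample(rows, ratio=1, seed=42):\n",
"    # Keep every good phrase and at most `ratio` bad phrases per good phrase.\n",
"    good = [r for r in rows if r[1] == 1]\n",
"    bad = [r for r in rows if r[1] == 0]\n",
"    rng = random.Random(seed)\n",
"    kept_bad = rng.sample(bad, min(len(bad), ratio * len(good)))\n",
"    balanced = good + kept_bad\n",
"    rng.shuffle(balanced)  # mix the two classes again\n",
"    return balanced\n",
"\n",
"balanced = undersample(toy, ratio=2)\n",
"n_good = sum(1 for _, label in balanced if label == 1)\n",
"n_bad = sum(1 for _, label in balanced if label == 0)\n",
"print('good:', n_good, 'bad:', n_bad)"
]
},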
{
"cell_type": "markdown",
"metadata": {
"id": "-V40xvY1eOlT"
},
"source": [
"#### number words\n",
"Give each word a unique number."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "N2g3NLJ9eOlT"
},
"outputs": [],
"source": [
"from tensorflow.keras.preprocessing.text import Tokenizer\n",
"tok = Tokenizer()\n",
"tok.fit_on_texts(pd.concat([train,test],ignore_index=True)['phrase'])\n",
"vocab_size = len(tok.word_index) + 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sj0V1G_AeOlT"
},
"source": [
"#### convert phrases into numbers\n",
"Your model can't understand words, so we first have to convert each phrase into a sequence of numbers.\n",
"\n",
"Each word must get the same number it was given in the previous step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FdwSF-1JeOlU"
},
"outputs": [],
"source": [
"train_encoded_phrase = tok.texts_to_sequences(train['phrase'])\n",
"test_encoded_phrase = tok.texts_to_sequences(test['phrase'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-hhTKdm7eOlV"
},
"source": [
"#### **[ TODO ]** padding\n",
"Make all phrases the same length. The longest phrases in the two tsv files have five tokens. Hence, we should add zeroes to all the phrases that are shorter than five. \n",
"- we suggest using `pad_sequences`, but you can do it however you like"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LF8rQwmneOlV"
},
"outputs": [],
"source": [
"from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
"X_train = ...\n",
"X_test = ...\n",
"print(X_train[:5])"
]
},
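{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of padding concrete, here is a plain-Python sketch on made-up sequences. The helper `pad_pre` is hypothetical; it imitates the default behaviour of `pad_sequences`, which adds zeros at the front (`padding='pre'`) and, for sequences that are too long, keeps the last `maxlen` items (`truncating='pre'`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pad_pre(sequences, maxlen, value=0):\n",
"    # Left-pad short sequences with `value`; keep the last `maxlen` items of long ones.\n",
"    padded = []\n",
"    for seq in sequences:\n",
"        seq = list(seq)[-maxlen:]\n",
"        padded.append([value] * (maxlen - len(seq)) + seq)\n",
"    return padded\n",
"\n",
"toy_padded = pad_pre([[4, 7], [1, 2, 3, 4, 5], [9]], maxlen=5)\n",
"print(toy_padded)"
]
},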
{
"cell_type": "markdown",
"metadata": {
"id": "gbE9uyk0eOlW"
},
"source": [
"#### **[ TODO ]** one hot encode the labels\n",
"- we suggest using `to_categorical`, but again, you can use whatever you like"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SJkFyC8_eOlX"
},
"outputs": [],
"source": [
"from tensorflow.keras.utils import to_categorical\n",
"y_train = ...\n",
"y_test = ...\n",
"print(y_train[:5])"
]
},
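{
"cell_type": "markdown",
"metadata": {},
"source": [
"One-hot encoding turns each label into a vector with a single 1 at the position of the class. The sketch below shows this on made-up labels in plain Python; the helper `one_hot` is hypothetical and only approximates what `to_categorical` returns (which is an array rather than a list of lists)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def one_hot(labels, num_classes=2):\n",
"    # Build one row per label, with a 1.0 at the index of the class and 0.0 elsewhere.\n",
"    rows = []\n",
"    for label in labels:\n",
"        row = [0.0] * num_classes\n",
"        row[label] = 1.0\n",
"        rows.append(row)\n",
"    return rows\n",
"\n",
"toy_labels = one_hot([1, 0, 0])\n",
"print(toy_labels)"
]
},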
{
"cell_type": "markdown",
"metadata": {
"id": "4AJnIDUSeOlX"
},
"source": [
"#### split training data into train and validation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "r32haPqreOlY"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train,X_val,y_train,y_val=train_test_split(X_train,y_train,test_size=0.20,random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PKthy0kTeOlY"
},
"source": [
"#### **[ TODO ]** creating the embedding matrix\n",
"The embedding matrix is used by the classification model. It should be a list of lists: each sub-list is the embedding vector of one word, and the order of the vectors must be the same as the word index numbering from the *tokenizer*. The tokenizer numbers words starting from 1, so row 0 is left for the padding value and the matrix needs `vocab_size` rows. The tokenizer output is stored in a dictionary. You can check it using `tok.word_index.items()`.\n",
"\n",
"Make the embedding matrix. Our example model needs one, but you can skip it if the classification model you're using doesn't need it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7zBQDNmmeOlZ"
},
"outputs": [],
"source": [
"embedding_matrix = ......"
]
},
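{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row layout described above can be sketched on made-up data as follows. Everything here is an assumption for illustration: `toy_word_index` stands in for `tok.word_index`, `toy_w2v` for the word2vec model, and the vectors have 3 dimensions instead of 300. Giving a zero vector to words that are missing from the pre-trained model is one possible choice, not the only one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical stand-ins: a tokenizer-style word index (numbering starts at 1)\n",
"# and a tiny vector lookup in place of the pre-trained word2vec model.\n",
"toy_word_index = {'earn': 1, 'money': 2, 'zzzunknown': 3}\n",
"toy_w2v = {'earn': [0.1, 0.2, 0.3], 'money': [0.4, 0.5, 0.6]}\n",
"\n",
"def build_embedding_matrix(word_index, vectors, dim):\n",
"    # Row 0 stays all zeros because index 0 is used for padding.\n",
"    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]\n",
"    for word, idx in word_index.items():\n",
"        if word in vectors:  # words missing from the lookup keep a zero row\n",
"            matrix[idx] = list(vectors[word])\n",
"    return matrix\n",
"\n",
"toy_matrix = build_embedding_matrix(toy_word_index, toy_w2v, dim=3)\n",
"print(toy_matrix)"
]
},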
{
"cell_type": "markdown",
"metadata": {
"id": "0c9JvMZaeOlZ"
},
"source": [
"## Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GpgeqgmWeOlZ"
},
"source": [
"#### build model\n",
"**[ TODO ]** Please build your classification model with ***keras*** here. Don't worry if you don't know how; just use the one given below. Feel free to make any changes or even build your own.\n",
"\n",
"You **must** use the pre-trained word2vec model to represent the words of phrases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XRlAGm9teOla"
},
"outputs": [],
"source": [
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.layers import Dense, Flatten, Embedding, LSTM, ReLU, Dropout\n",
"from tensorflow.keras.initializers import Constant\n",
"from tensorflow.keras.optimizers import RMSprop\n",
"\n",
"model = Sequential()\n",
"# Embedding layer initialized with the pre-trained word2vec vectors (300 dimensions, 5 tokens per phrase)\n",
"model.add(Embedding(input_dim=vocab_size, output_dim=300, input_length=5, embeddings_initializer=Constant(embedding_matrix)))\n",
"model.add(LSTM(64, return_sequences=False))\n",
"model.add(Flatten())\n",
"model.add(Dense(2, activation='sigmoid'))\n",
"# `learning_rate` replaces the deprecated `lr` argument\n",
"model.compile(optimizer=RMSprop(learning_rate=1e-3), loss='binary_crossentropy', metrics=['accuracy'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7EVyMofjeOla"
},
"outputs": [],
"source": [
"print(model.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GSG0J5P7eOla"
},
"source": [
"#### train\n",
"Train the classification model here.\n",
"\n",
"**[ TODO ]** Adjust the hyperparameters to optimize the validation accuracy and validation loss.\n",
"\n",
"* The higher the validation accuracy, the better; the lower the validation loss, the better.\n",
"* **Number of epochs** and **batch size** are the most important.\n",
"    * Start with a small number of epochs: training time grows with the number of epochs, and you don't want to spend too much time waiting!\n",
"    * A larger batch size usually makes each epoch faster, but the batch size you can use depends on your computing power, and a larger batch does not always give better accuracy. Start small and increase gradually. It is recommended to use powers of 2 (2, 4, 8, 16, ...) for batch size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MSs4f9ELeOlb"
},
"outputs": [],
"source": [
"model.fit(X_train,y_train,validation_data=(X_val,y_val),......)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VJIEsewSeOlc"
},
"source": [
"#### test\n",
"\n",
"**[ TODO ]** Test your model on test.tsv and output the accuracy. Beat the accuracy baseline of **0.98** for extra points."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nDzmeBV4eOlc"
},
"outputs": [],
"source": [
"accuracy = model.evaluate(X_test,y_test)\n",
"print(accuracy[1])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xxVkZUuFeOlc"
},
"source": [
"## Show wrong prediction results\n",
"Inspecting the wrong predictions may help you improve your model.\n",
"\n",
"**[ TODO ]** Show the wrong prediction results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lVqH4lvleOld"
},
"outputs": [],
"source": []
},
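{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of how wrong predictions can be collected, the cell below compares true labels with predicted classes on made-up values. The lists `toy_phrases`, `toy_true`, and `toy_probs` are hypothetical stand-ins for the test phrases, their labels, and a two-column model output; the predicted class is assumed to be the column with the larger score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical toy values standing in for the test phrases, the true labels,\n",
"# and the two-column output of a classifier.\n",
"toy_phrases = ['earn money', 'earn a car', 'earn respect']\n",
"toy_true = [1, 0, 1]\n",
"toy_probs = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]\n",
"\n",
"def wrong_predictions(phrases, true_labels, probs):\n",
"    # The predicted class is the index of the larger of the two scores.\n",
"    wrong = []\n",
"    for phrase, label, p in zip(phrases, true_labels, probs):\n",
"        predicted = max(range(len(p)), key=lambda i: p[i])\n",
"        if predicted != label:\n",
"            wrong.append((phrase, label, predicted))\n",
"    return wrong\n",
"\n",
"toy_wrong = wrong_predictions(toy_phrases, toy_true, toy_probs)\n",
"for phrase, label, predicted in toy_wrong:\n",
"    print(phrase, 'true:', label, 'predicted:', predicted)"
]
},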
{
"cell_type": "markdown",
"metadata": {
"id": "KprbNm7KeOle"
},
"source": [
"## TA's Notes\n",
"\n",
"If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=258852025) to reserve a demo time.\n",
"\n",
"The score is only given after the TAs review your implementation, so **make sure you make an appointment with a TA before the deadline**.\n",
"\n",
"After the demo, please upload your assignment to elearn. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.\n",
"\n",
"Note that **late submissions will not be allowed**."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IQl6lTDzeOlf"
},
"source": [
"## Learning Resource\n",
"[Deep Learning with Python](https://tanthiamhuat.files.wordpress.com/2018/03/deeplearningwithpython.pdf)\n",
"\n",
"[Classification on IMDB](https://keras.io/examples/nlp/bidirectional_lstm_imdb/)"
]
}
]
}