{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Week 9: Sentence Level Classification with BERT\n",
        "\n",
        "Your goal this week is to train a classifier that can predict the CEFR level of any given sentence. In this notebook we will guide you through the process of using 🤗[Hugging Face](https://huggingface.co/) and its transformers library as the training framework, with [Pytorch](https://pytorch.org/) as the deep learning backend, but feel free to use [TensorFlow](https://www.tensorflow.org) if that's what you are more familiar with.\n",
        "\n",
        "For this assignment we will provide a dataset containing sentences with the corresponding CEFR level, and you have to use BERT and train a sentence classifier with this dataset."
      ],
      "metadata": {
        "id": "7L3cYlZl15_K"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Prepare your environment\n",
        "\n",
        "As always, we highly recommend that you install all packages with a virtual environment manager, like [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html), to prevent version conflicts of different packages.  "
      ],
      "metadata": {
        "id": "G8TbxtroCxM8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Install CUDA\n",
        "Deep learning is a computionally extensive process. It takes lots of time if relying only on the CPU, especially when it's trained on a large dataset. That's why using GPU instead is generally recommended.  \n",
        "To use GPU for computation, you have to install [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) as well as the [cuDNN library](https://developer.nvidia.com/cudnn) provided by NVIDIA.  \n",
        "\n",
        "If you already had CUDA installed on your machine, then great! You're done here.  \n",
        "If you don't, you can refer to [Appendix](#Appendix-1-Install-CUDA) to see how to do so."
      ],
      "metadata": {
        "id": "wSHP0CPoXj7Z"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "### Install python packages\n",
        "The following python packages will be used in this tutorial:\n",
        "\n",
        "1. `numpy`: for matrix operation\n",
        "2. `scikit-learn`: for label encoding\n",
        "3. `datasets`: for data preparation\n",
        "4. `transformers`: for model loading and finetuing\n",
        "5. `pytorch`: the backend DL framework\n",
        "  - Note that the pt version must support the CUDA version you've installed if you want to use GPU."
      ],
      "metadata": {
        "id": "f78jLXeZfPyH"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Select GPU(s) for your backend\n",
        "\n",
        "Skip this section if you have no intension of using GPU with tensorflow/pytorch."
      ],
      "metadata": {
        "id": "3cS9CxjyfQ-F"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "\n",
        "# select your GPU. Note that this should be set before you load tensorflow or pytorch.\n",
        "os.environ['CUDA_VISIBLE_DEVICES'] = '1'\n",
        "\n",
        "# To use multiple GPUs, combine all GPU ID with commas\n",
        "# e.g. >>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'"
      ],
      "metadata": {
        "id": "sEJ8Y8SCfWp_"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "# Check if any GPU is used\n",
        "torch.cuda.is_available()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "p1p_qQKbfcCH",
        "outputId": "d25bdd87-959c-435d-8c57-22f5c350098f"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Prepare the dataset\n",
        "\n",
        "Before starting the training, we need to load and process our dataset - but wait, let's decide which model we want to use first.  \n",
        "\n",
        "In the highly unlikely chance you've never heard of it, [BERT](https://arxiv.org/abs/1810.04805) (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers) is a language model proposed by Google AI in 2018, and it's currently one of the most popular models used in NLP.  \n",
        "You can learn more about it here:\n",
        "- [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/) by Samia, 2019.\n",
        "\n",
        "\n",
        "However, we will not directly use BERT in this tutorial, because it's large and takes too long to train. Instead, we'll be using [DistilBert](https://medium.com/huggingface/distilbert-8cf3380435b5), a version of BERT that while light-weight, reserves 95% of its original accuracy.\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "-P0foxjBDQSu"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# the model you want to use. Available models can be found here: https://huggingface.co/models\n",
        "MODEL_NAME = 'distilbert-base-uncased'"
      ],
      "metadata": {
        "id": "Lqb2cHCmDIEp"
      },
      "execution_count": 18,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Load data\n",
        "\n",
        "Similar to the `transformers` library, `datasets` is also a package by huggingface. It contains many public datasets online and can help us with the data processing.  \n",
        "We can use `load_dataset` function to read the input `.csv` file provided for this assignment.\n",
        "\n",
        "Reference:\n",
        " - [Official datasets document](https://huggingface.co/docs/datasets)\n",
        " - [datasets.load_dataset](https://huggingface.co/docs/datasets/loading.html)"
      ],
      "metadata": {
        "id": "QxSUKsTkDSxJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] load the data using the load_dataset function\n",
        "\n",
        "dataset = ..."
      ],
      "metadata": {
        "id": "hjY9HMNIt5jf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# print(dataset['train'])\n",
        "# print(dataset['train'][1])\n",
        "# print(dataset['train']['text'][:5])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "npepcv7GfMHI",
        "outputId": "02ed2940-370f-444c-b748-abcd623891bd"
      },
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Dataset({\n",
            "    features: ['text', 'level'],\n",
            "    num_rows: 20720\n",
            "})\n",
            "{'text': 'You can contact me by e-mail.', 'level': 'A1'}\n",
            "['My mother is having her car repaired.', 'You can contact me by e-mail.', 'He had a break for the weekend, and he called me: \"I am in London, so, if you want to see me, it\\'s the time!\"', \"Research shows that 40 percent of the program's viewers are aged over 55.\", \"I'd guess she's about my age.\"]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Preprocessing\n",
        "\n",
        "As always, texts should be tokenized, embedded, and padded before being put into the model.  \n",
        "But not to worry, there are libraries from huggingface to help with this, too."
      ],
      "metadata": {
        "id": "E3olKw19uuJQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Sentence processing\n",
        "\n",
        "Different pre-trained language models may have their own preprocessing models, and that's why we should use the tokenizers trained along with that model. In our case, we are using distilBERT, so we should use the distilBERT tokenizer.  \n",
        "\n",
        "With huggingface, loading different tokenizers is extremely easy: just import the AutoTokenizer from `transformers` and tell it what model you plan to use, and it will handle everything for you.\n",
        "\n",
        "Reference:\n",
        " - [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer)"
      ],
      "metadata": {
        "id": "x3ANJAV7wn2R"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] load the distilBERT tokenizer using AutoTokenizer\n",
        "\n",
        "tokenizer = ..."
      ],
      "metadata": {
        "id": "DALrmSh6wpmy"
      },
      "execution_count": 22,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Label processing\n",
        "\n",
        "Our labels also need to be processed, so let's do that next.\n",
        "\n",
        "For this tutorial, we'll use the OneHotEncoder provided by scikit-learn.\n",
        "\n",
        "For now, just declare a new encoder and use `fit` to learn the data. Hint: you should still end up with 6 labels.\n",
        "\n",
        "Documents:\n",
        " - [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder)"
      ],
      "metadata": {
        "id": "Jd43REmNxCOV"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] declare a new encoder and let it learn from the dataset\n",
        "\n",
        "encoder = ..."
      ],
      "metadata": {
        "id": "q3ml0_WxxFh5"
      },
      "execution_count": 23,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# check if you still have 6 labels\n",
        "LABEL_COUNT = len(encoder.categories_[0])\n",
        "print(LABEL_COUNT)"
      ],
      "metadata": {
        "id": "BRY2GDJa1MxF",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "df210570-a2f8-478d-8902-e6f6c4f33af5"
      },
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "6\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Process the data\n",
        "\n",
        "To make things easier, we can write a function to process our dataset in batches. "
      ],
      "metadata": {
        "id": "qrfO8c4R1eWO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def preprocess(dataslice):\n",
        "    \"\"\" Input: a batch of your dataset\n",
        "        Example: { 'text': [['sentence1'], ['setence2'], ...],\n",
        "                   'label': ['label1', 'label2', ...] }\n",
        "    \"\"\"\n",
        "    \n",
        "    # [ TODO ] use your tokenizor and encoder to get sentence embeddings and encoded labels\n",
        "    ...\n",
        "\n",
        "    \"\"\" Output: a batch of processed dataset\n",
        "        Example: { 'input_ids': ...,\n",
        "                   'attention_masks': ...,\n",
        "                   'label': ... }\n",
        "    \"\"\""
      ],
      "metadata": {
        "id": "cvwnYLah1dbN"
      },
      "execution_count": 25,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# map the function to the whole dataset\n",
        "processed_data = dataset.map(preprocess,    # your processing function\n",
        "                             batched = True # Process in batches so it can be faster\n",
        "                            )"
      ],
      "metadata": {
        "id": "Nr0-Y1bR2efQ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# print(processed_data)\n",
        "# processed_data['train'][0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_G5ftJbjfa4i",
        "outputId": "c88d0fc1-85ca-4cdb-e191-a426e29fcfc2"
      },
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "DatasetDict({\n",
            "    train: Dataset({\n",
            "        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],\n",
            "        num_rows: 20720\n",
            "    })\n",
            "})\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'text': 'My mother is having her car repaired.',\n",
              " 'level': 'B1',\n",
              " 'input_ids': [101, 2026, 2388, 2003, 2383, 2014, 2482, 13671, 1012, 102],\n",
              " 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
              " 'label': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]}"
            ]
          },
          "metadata": {},
          "execution_count": 28
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### DataCollator\n",
        "\n",
        "You might have noticed that we skipped padding the sentences. That's because we are going to do it during training.  \n",
        "\n",
        "To do training-time processing, we can use the DataCollator Class provided by `transformers`. And guess what - transformers has a class that will handle padding for us, too!\n",
        "\n",
        " - [transformers.DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding)"
      ],
      "metadata": {
        "id": "G9z7ZMtP22b9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] declare a collator to do padding during traning\n",
        "\n",
        "data_collator = ..."
      ],
      "metadata": {
        "id": "x5orGYjN39dz"
      },
      "execution_count": 29,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Training\n",
        "\n",
        "Finally, we can move on to training."
      ],
      "metadata": {
        "id": "di75QVgv4V81"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Preparation\n",
        "\n",
        "We can load the pretrained model from `transformers`.  \n",
        "Generally, you need to build your own model on top of BERT if you want to use BERT for some downstream tasks, but again, sequence classification is a popular topic. With the support from `transformers` library, it can be done in two lines of codes: \n",
        "\n",
        "1. Load `AutoModelForSequenceClassification` Class.\n",
        "2. Load the pretrained model."
      ],
      "metadata": {
        "id": "53dO3pg85u0n"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import AutoModelForSequenceClassification\n",
        "\n",
        "model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased',\n",
        "                                                           num_labels = LABEL_COUNT)"
      ],
      "metadata": {
        "id": "UyDyv7wp5qdD"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Split train/val data\n",
        "\n",
        "The `Dataset` class we prepared before has a `train_test_split` method. You can use it to split your (processed) dataset.\n",
        "\n",
        "Document:\n",
        " - [datasets.Dataset - Sort, shuffle, select, split, and shard](https://huggingface.co/docs/datasets/process.html#sort-shuffle-select-split-and-shard)"
      ],
      "metadata": {
        "id": "DtmP4TEh6XiB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] choose a validation size and split your data\n",
        "train_val_dataset = ..."
      ],
      "metadata": {
        "id": "NnbD1KW16YWn"
      },
      "execution_count": 31,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# print(train_val_dataset)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6u93fCpofgbe",
        "outputId": "65b7044c-6a01-4c07-f456-75d0e83d9d70"
      },
      "execution_count": 33,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "DatasetDict({\n",
            "    train: Dataset({\n",
            "        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],\n",
            "        num_rows: 18648\n",
            "    })\n",
            "    test: Dataset({\n",
            "        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],\n",
            "        num_rows: 2072\n",
            "    })\n",
            "})\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Setup training parameters\n",
        "\n",
        "We are using the TrainerAPI to do the training. Trainer is yet another utility provided by huggingface, which helps you train the model with ease.  \n",
        "\n",
        "Document:\n",
        "- [transformers.TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments)\n",
        "- [transformers.Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer)"
      ],
      "metadata": {
        "id": "YzmSbOaD7O0F"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import TrainingArguments, Trainer"
      ],
      "metadata": {
        "id": "ABqlinlO76Ax"
      },
      "execution_count": 34,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] set and tune your training properties\n",
        "OUTPUT_DIR = \n",
        "LEARNING_RATE = ...\n",
        "BATCH_SIZE = ...\n",
        "EPOCH = ...\n",
        "training_args = TrainingArguments(\n",
        "    output_dir = OUTPUT_DIR,\n",
        "    learning_rate = LEARNING_RATE,\n",
        "    per_device_train_batch_size = BATCH_SIZE,\n",
        "    per_device_eval_batch_size = BATCH_SIZE,\n",
        "    num_train_epochs = EPOCH,\n",
        "    # you can set more parameters here if you want\n",
        ")\n",
        "\n",
        "# now give all the information to a trainer\n",
        "trainer = Trainer(\n",
        "    \n",
        "    # set your parameters here\n",
        "\n",
        ")"
      ],
      "metadata": {
        "id": "5lzXTG1y7q7n"
      },
      "execution_count": 35,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Training\n",
        "\n",
        "This is the easy part. Simply ask the trainer to train the model for you!"
      ],
      "metadata": {
        "id": "gE0DpS3s7rhg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "trainer.train()"
      ],
      "metadata": {
        "id": "wsrQOyJCiFas"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Save for future use\n",
        "\n",
        "Hint: try using `save_pretrained`"
      ],
      "metadata": {
        "id": "4JBUA0pe-S9b"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] practice saving your model for future use\n"
      ],
      "metadata": {
        "id": "7e30uXCf-cXc",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f003dfa4-060a-4419-9ba6-f559029e5402"
      },
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Configuration saved in model/finetuned/config.json\n",
            "Model weights saved in model/finetuned/pytorch_model.bin\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Prediction\n",
        "\n",
        "Now we know exactly how to train a model, but how do we use it for predicting results?"
      ],
      "metadata": {
        "id": "7QSZjlcG9fOk"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Load finetuned model"
      ],
      "metadata": {
        "id": "O7akP5Hh9ugG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] load the model that you saved\n",
        "\n",
        "mymodel = ..."
      ],
      "metadata": {
        "id": "Lfb_zJGm9vJP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Get the prediction\n",
        "\n",
        "Here are a few example sentences:"
      ],
      "metadata": {
        "id": "a6Vs3iBE_nck"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "examples = [\n",
        "    # A2\n",
        "    \"Remember to write me a letter.\",\n",
        "    # B2\n",
        "    \"Strawberries and cream - a perfect combination.\",\n",
        "    \"This so-called \\\"Perfect Evening\\\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again.\",\n",
        "    # C1\n",
        "    \"Some may altogether give up their studies, which I think is a disastrous move.\",\n",
        "]"
      ],
      "metadata": {
        "id": "FtLt6IBi_pLF"
      },
      "execution_count": 39,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "All we need to do is to transform them to embeddings, and then we can get predictions by calling your finetuned model.  \n",
        "\n",
        "Since we don't have a DataCollator to pad the sentence and do the matrix transformation this time, we have to pad and transform the matrice on our own."
      ],
      "metadata": {
        "id": "_xhk7ZRX_2U2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Transform the sentences into embeddings\n",
        "input = tokenizer(examples, truncation=True, padding=True, return_tensors=\"pt\")\n",
        "# Get the output\n",
        "logits = mymodel(**input).logits\n",
        "logits"
      ],
      "metadata": {
        "id": "tWONG_lCAkyN",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "0c408ae7-997d-4c0c-8f49-c6196397602f"
      },
      "execution_count": 40,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([[  3.4183,  -3.9602,  -4.7323,  -6.0863,  -5.3447,  -6.0282],\n",
              "        [ -9.9472, -10.1832,  -9.0002,  -4.8197,   4.0705,  -4.7691],\n",
              "        [-10.2303,  -8.6569,  -6.1997,   4.7033,  -5.0795,  -5.6949],\n",
              "        [-11.8315, -11.4786,  -9.6928,  -0.7405,   0.0399,  -3.8067]],\n",
              "       grad_fn=<AddmmBackward0>)"
            ]
          },
          "metadata": {},
          "execution_count": 40
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Logits aren't very readable for us. Let's use softmax \n",
        "activation to transform them into more probability-like numbers."
      ],
      "metadata": {
        "id": "vBMUBD-1BFW1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from torch import nn\n",
        "\n",
        "predicts = nn.functional.softmax(logits, dim = -1)\n",
        "predicts"
      ],
      "metadata": {
        "id": "yjiKxLaBBGah",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "006d24be-ed4f-48cb-ab2c-fa7b70b44dd0"
      },
      "execution_count": 41,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([[9.9878e-01, 6.2377e-04, 2.8820e-04, 7.4412e-05, 1.5621e-04, 7.8865e-05],\n",
              "        [8.1670e-07, 6.4506e-07, 2.1055e-06, 1.3770e-04, 9.9971e-01, 1.4485e-04],\n",
              "        [3.2685e-07, 1.5765e-06, 1.8400e-05, 9.9989e-01, 5.6403e-05, 3.0483e-05],\n",
              "        [4.7224e-06, 6.7206e-06, 4.0086e-05, 3.0967e-01, 6.7584e-01, 1.4431e-02]],\n",
              "       grad_fn=<SoftmaxBackward0>)"
            ]
          },
          "metadata": {},
          "execution_count": 41
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Transform logits back to labels\n",
        "\n",
        "Now you've got the output. Write a function to map it back into labels!"
      ],
      "metadata": {
        "id": "zAqgoJTFBchb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] try to process the result"
      ],
      "metadata": {
        "id": "YvcyotBfBifR",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "50cae16f-06ed-4f04-c684-496044f37233"
      },
      "execution_count": 42,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(tensor([0, 4, 3, 4]), array(['A1', 'C1', 'B2', 'C1'], dtype='<U2'))"
            ]
          },
          "metadata": {},
          "execution_count": 42
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Evaluation\n",
        "\n",
        "Let's see how you did!  \n",
        "Load the testing data and calculate your accuracy.\n",
        "\n",
        "We want you to calculate the three kinds of accuracy mentioned in the lecture, which will also be explained in the following section."
      ],
      "metadata": {
        "id": "gfCuP95IBvP-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] \n",
        "# load test data\n",
        "# preprocess\n",
        "# get predictions\n",
        "# transform predictions back into labels"
      ],
      "metadata": {
        "id": "C8hUc4sLCW40"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#  try printing out some predictions to check if the outputs are reasonable and if you need to adjust your model at the end of every step.\n",
        "\n",
        "# for idx, (sent, level) in enumerate(zip(test_data['text'], predict_label)):\n",
        "#     if idx >= 10: break\n",
        "#     print(f'{level}: {sent}') "
      ],
      "metadata": {
        "id": "AoC2EAHxCXdj",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "cc711f3c-c789-4d0f-d92f-adf04dc54432"
      },
      "execution_count": 46,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "C2: No longer a remote, backward, unimportant country, it became a force to be reckoned with in Europe.\n",
            "B2: Unfortunately he was too fast and I couldn't keep up with him.\n",
            "B2: Most mushrooms are totally harmless, but some are poisonous.\n",
            "C2: This provided solid evidence that he committed the crime.\n",
            "C1: You can't just accept everything you read in the newspapers at face value.\n",
            "A1: Remember to write me a letter.\n",
            "B1: She has long blond hair and blue eyes. She has a good figure.\n",
            "B2: Nowadays the aim in clothing is not just for covering and protecting ourselves.\n",
            "A2: Take two tablets, three times a day.\n",
            "C2: Well, you will be if you saw our slide show and talk - members can hardly forget that relaxing afternoon when we unfolded the sails on the lake and enjoyed the tranquility of the area.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Six Level Accuracy\n",
        "\n",
        "Exact accuracy is probably what you're most familiar with:\n",
        "\n",
        "$\n",
        "accuracy = \\frac{\\#exactly\\:the\\:same\\:levels}{\\#total}\n",
        "$\n",
        "\n",
        "Example:\n",
        "```\n",
        "Prediction:   A1 A2 B1 B2 C1 C2\n",
        "Ground truth: A2 B1 B1 B2 B2 C2\n",
        "                    ^  ^     ^\n",
        "```\n",
        "\n",
        "The six level accuracy is $\\frac{3}{6} = 0.5$\n",
        "\n",
        "As the requirement, <u>your exact accuracy should be higher than $0.5$</u>."
      ],
      "metadata": {
        "id": "PBnEzAFhC7ZN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] calculate accuracy"
      ],
      "metadata": {
        "id": "kR2oVECyC8vD",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "72cebbe0-5ada-446b-b5f7-6a9849292792"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.5617391304347826\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Three Level Accuracy\n",
        "\n",
        "Three Level Accuracy is used when you only want a more general sense of right or wrong.\n",
        "\n",
        "$\n",
        "accuracy = \\frac{\\#the\\:same\\:ABC\\:levels}{\\#total}\n",
        "$\n",
        "\n",
        "Example:\n",
        "```\n",
        "Prediction:   A1 A2 B1 B2 C1 C2\n",
        "Ground truth: A2 B1 B1 B2 B2 C2\n",
        "              ^     ^  ^     ^\n",
        "```\n",
        "\n",
        "The three level accuracy is $\\frac{4}{6} = 0.667$\n",
        "\n",
        "As the requirement, <u>your exact accuracy should be higher than $0.6$</u>."
      ],
      "metadata": {
        "id": "We865ayVC_N7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] calculate accuracy"
      ],
      "metadata": {
        "id": "DCAKM9MRDCBk",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f599130e-de22-4b28-a521-f378afd52711"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.6752173913043479\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Fuzzy accuracy\n",
        "\n",
        "However, the level of a sentence is relatively subjective. Generally speaking, $\\pm1$ errors are allowed in the real evaluation in linguistic area.  \n",
        "\n",
        "For example, if the actual label is 'B1', we'll also consider the prediction 'right' if the model predicts 'B2' or 'A2'.\n",
        "\n",
        "Hence, the fuzzy accuracy is\n",
        "\n",
        "$\n",
        "accuracy = \\frac{\\#good\\:enough\\:answers}{\\#total}\n",
        "$\n",
        "\n",
        "Example:\n",
        "```\n",
        "Prediction:   0 1 2 3 4 5\n",
        "Ground truth: 0 1 1 3 3 3\n",
        "              ^ ^ ^ ^ ^\n",
        "```\n",
        "\n",
        "The fuzzy accuracy is $\\frac{5}{6} = 0.833$\n",
        "\n",
        "As the requirement, <u>your accuracy should be higher than $0.8$</u>."
      ],
      "metadata": {
        "id": "3YSo46NX7nvb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# [ TODO ] calculate accuracy"
      ],
      "metadata": {
        "id": "27fP0BTc73Al",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "e792e699-4e04-4a82-ebff-d133aead0b60"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.8665217391304347\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## TA's Note\n",
        "\n",
        "Congratulations, you made it to the end of the tutorial! Make sure you make an appointment to show your work and turn in your finished assignment before next week's lesson. We will ask you to run your code, so double check that everything is working and that your model is saved. Don't worry if you didn't pass the evaluation requirements, you'll still get partial points for trying."
      ],
      "metadata": {
        "id": "BPrSLQEDDfeE"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Appendix "
      ],
      "metadata": {
        "id": "aL7CAtRQbR7s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "<a name=\"Appendix-1-Install-CUDA\"></a>\n",
        "\n",
        "### Appendix 1 - Install CUDA\n",
        "\n",
        "1. Check your GPU vs. CUDA compatibility:\n",
        "   - [NVIDIA -> Your GPU Compute Capability](https://developer.nvidia.com/cuda-gpus) -> GeForce and TITAN Products\n",
        "2. Check library vs. CUDA compatibility: \n",
        "   - Pytorch: [Previous PyTorch Versions](https://pytorch.org/get-started/previous-versions/)\n",
        "   - Tensorflow: [Linux/MacOX](https://www.tensorflow.org/install/source#tested_build_configurations) or [Windows](https://www.tensorflow.org/install/source_windows#tested_build_configurations)\n",
        "3. Note the highest CUDA version that fits your system.\n",
        "\n",
        "#### >> for conda/mamba users\n",
        "\n",
        "You can directly install CUDA library with the selected CUDA version.\n",
        "1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)\n",
        "2. `conda/mamba install -c conda-forge cudatoolkit=${VERSION}`\n",
        "\n",
        "#### >> for non-conda users\n",
        "\n",
        "1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)\n",
        "2. Download and install [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive)\n",
        "3. Download and install [cuDNN Library](https://developer.nvidia.com/rdp/cudnn-archive)"
      ],
      "metadata": {
        "id": "lqqUYYPEanAF"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Appendix 2 - Further Readings\n",
        "\n",
        "1. [Huggingface Official Tutorials](https://github.com/huggingface/notebooks/tree/master/examples)\n",
        "2. How to use Bert with other downstream tasks: [How to use BERT from the Hugging Face transformer library](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209): \n",
        "3. Training with pytorch backend: [transformers-tutorials](https://github.com/abhimishra91/transformers-tutorials)\n",
        "4. A more complicated example that include manual data/training processing with Pytorch: [Transformers for Multi-Label Classification made simple](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1)\n",
        "5. [Text Classification with tensorflow](https://github.com/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb): tensorflow example"
      ],
      "metadata": {
        "id": "8hFAM1Cya4_c"
      }
    }
  ]
}