{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "RVPIBjwjx6pO" }, "source": [ "# Word Count, Phrase Analysis, Cross-Corpus Analysis\n", "\n", "Which English words and phrases look overused or rarely used depends on the corpus they are measured against. Here, we will use word counts, phrase analysis, and cross-corpus analysis to identify the phrases that learners overuse.\n", "

\n", "One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/), a corpus drawn from Lang-8, a language exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.\n", "\n", "\n", "https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xotpb7p5x6pd" }, "source": [ "## N-gram counting\n", "We will tokenize the text and calculate frequencies. The tokenization rules in this lab are:\n", " 1. Ignore case (e.g., \"The\" is the same as \"the\")\n", " 2. Split on whitespace and punctuation\n", " 3. Ignore all punctuation\n", "

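The rules above, together with frequency counting and n-gram generation, can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference solution: the names `tokenize_sketch` and `ngrams_sketch` are hypothetical, and punctuation is deleted before splitting on whitespace. That choice matches tokens such as `dont` in the sample output further down, but it is only one reading of rule 2.

```python
import string
from collections import Counter

# Assumption: punctuation is deleted outright, not treated as a delimiter.
# Translation table that removes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def tokenize_sketch(text):
    # Lowercase, delete punctuation, then split on runs of whitespace.
    return text.lower().translate(_PUNCT_TABLE).split()


def ngrams_sketch(tokens, n=2):
    # Join each window of n consecutive tokens with a single space.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize_sketch('This is an example. This is fine.')
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams_sketch(tokens, n=2))
```

Under this reading, `Don't` becomes the single token `dont`; treating punctuation as a delimiter instead would yield `don` and `t`.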
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GC_wab2p2Pam" }, "outputs": [], "source": [ "import os\n", "import re\n", "import string" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8iLjwEwBx6ph" }, "outputs": [], "source": [ "\n", "def tokenize(text):\n", " \"\"\"\n", " Input:\n", " \"This is an example.'\n", "\n", " Sample output: \n", " ['this', 'is', 'an', 'example', '.']\n", " \"\"\" \n", " #### [ TODO ] transform text to lower case\n", " #### [ TODO ] seperate the words by white space\n", " \n", "from collections import Counter\n", "\n", "def calculate_frequency(tokens):\n", " \"\"\"\n", " Input:\n", " ['this', 'is', 'an', 'example', ...]\n", "\n", " Sample output: \n", " {\n", " 'the': 79809, \n", " 'project': 288,\n", " ...\n", " }\n", " \"\"\"\n", " #### [ TODO ] \n", " \n", "\n", "\n", "def get_ngram(tokens, n=2):\n", " \"\"\"\n", " Input:\n", " ['this', 'is', 'an', 'example', ...]\n", "\n", " Sample output: \n", " ['this is', 'is an', 'an example', ...]\n", " \"\"\"\n", " #### [TODO] \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Sj3-ZuWP2Pao" }, "outputs": [], "source": [ "file_path = os.path.join('data', 'bnc.txt')\n", "BNC_unigram = []\n", "BNC_unigram_counter = Counter()\n", "#### [ TODO ] generate BNC unigrams and calculate document frequency of unigram in BNC\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eZlkiCuQx6pt" }, "outputs": [], "source": [ "# Read lang-8 Data\n", "file_path = os.path.join('data','lang8.txt')\n", "lang_unigram = []\n", "lang_unigram_counter = Counter()\n", "\n", "#### [ TODO ] generate lang8 unigrams and calculate document frequency of unigram in lang8\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lPBceKOax6pt" }, "source": [ "## Rank\n", "Rank unigrms by their frequencies. The higher the frequency, the higher the rank. (The most frequent unigram ranks 1.)
\n", "[ TODO ] Rank unigrams for Lang-8 and BNC." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SH9xlXpBx6pu" }, "outputs": [], "source": [ "lang_unigram_Rank = []\n", "\n", "#### [ TODO ] Rank unigrams for Lang-8\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rN3MQTebx6pv" }, "outputs": [], "source": [ "BNC_unigram_Rank = []\n", "\n", "#### [ TODO ] Rank unigrams for BNC\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pm26VfkDx6pv" }, "source": [ "## Calculate Rank Ratio\n", "In this step, you need to match each unigram across the two datasets and calculate its Rank Ratio.
Please follow the formula for calculating Rank Ratio:
\n", "
\n", "\n", "$\\text{Rank Ratio} = \\frac{\\text{Rank in BNC}}{\\text{Rank in Lang-8}}$\n", "

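The formula can be sketched in code as below. This is a minimal sketch, not the reference solution: the helper names `rank_ratio` and `top_k`, the `missing_rank` parameter, and the toy rank tables are all hypothetical, and the fallback to rank 1 for a unigram missing from BNC follows the rule stated in this section.

```python
def rank_ratio(lang8_rank, bnc_rank, missing_rank=1):
    # For every Lang-8 n-gram: (rank in BNC) / (rank in Lang-8).
    # An n-gram absent from BNC falls back to missing_rank.
    return {
        gram: bnc_rank.get(gram, missing_rank) / rank
        for gram, rank in lang8_rank.items()
    }


def top_k(ratios, k=30):
    # Sort by ratio, largest first, and keep the first k entries.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy rank tables (hypothetical values, for illustration only).
toy_lang8_rank = {'i': 1, 'dont': 2, 'the': 3, 'todayi': 4}
toy_bnc_rank = {'the': 1, 'i': 2, 'dont': 500}

toy_ratios = rank_ratio(toy_lang8_rank, toy_bnc_rank)
for position, (gram, ratio) in enumerate(top_k(toy_ratios, k=3), start=1):
    print(position, gram, round(ratio, 3))
```

Under the rank-1 fallback, a unigram missing from BNC can never score above 1, so it will not appear near the top of the list.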
\n", "If a unigram doesn't appear in BNC, its BNC rank is treated as 1.\n", "\n", "[ TODO ] Please calculate the Rank Ratio of every unigram in Lang-8." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kNSj6gbU2Paq" }, "outputs": [], "source": [ "#### [ TODO ] Calculate Rank Ratio" ] }, { "cell_type": "markdown", "metadata": { "id": "7U08oh2Ex6pw" }, "source": [ "## Sort the result\n", "[ TODO ] Please show the 30 unigrams with the highest Rank Ratio, together with their Rank Ratio values, in this format: \n", "
\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MhyGW1jC2Paq", "outputId": "f3c349ba-6859-4d68-dff8-4e02d3846c77" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rank\tunigram\t\t\t\tRank Ratio\n", "1\tdont\t\t\t\t1647.883\n", "2\twanna\t\t\t\t965.38\n", "3\tthats\t\t\t\t795.332\n", "4\tdidnt\t\t\t\t658.563\n", "5\tdoesnt\t\t\t\t503.039\n", "6\thavent\t\t\t\t497.181\n", "7\tisnt\t\t\t\t396.261\n", "8\tfavorite\t\t\t352.281\n", "9\tenglish\t\t\t\t338.979\n", "10\tive\t\t\t\t327.974\n", "11\ttodayi\t\t\t\t313.543\n", "12\tjapanese\t\t\t293.676\n", "13\tim\t\t\t\t279.914\n", "14\tcant\t\t\t\t275.829\n", "15\teveryday\t\t\t246.353\n", "16\thadnt\t\t\t\t245.969\n", "17\thes\t\t\t\t233.413\n", "18\tvacation\t\t\t232.454\n", "19\twasnt\t\t\t\t185.347\n", "20\tjapan\t\t\t\t172.697\n", "21\titll\t\t\t\t166.082\n", "22\tosaka\t\t\t\t165.945\n", "23\tjapans\t\t\t\t160.445\n", "24\ttheres\t\t\t\t154.79\n", "25\tsomeones\t\t\t153.884\n", "26\tarent\t\t\t\t152.039\n", "27\thasnt\t\t\t\t151.918\n", "28\tawesome\t\t\t\t149.661\n", "29\tinternet\t\t\t148.291\n", "30\tsemester\t\t\t146.371\n" ] } ], "source": [ "#### [ TODO ] " ] }, { "cell_type": "markdown", "metadata": { "id": "nOllPQ9-x6px" }, "source": [ "## For Bigrams\n", "[ TODO ] Do the same thing for bigrams. \n", "Hint: \n", "1. generate all bigrams for BNC / lang8 \n", "2. calculate the frequency of each bigram \n", "3. rank bigrams by frequency \n", "4. calculate the rank ratio of each bigram\n", "5. 
print out the 30 bigrams with the highest rank ratio " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zNR5m63D8Zf2" }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xjBx-rcU2Par", "outputId": "d9da3005-d9bf-4af1-b045-def8bfde6194" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rank\tbigram\t\t\t\tRank Ratio\n", "1\ti dont\t\t\t\t363080.167\n", "2\tstudy english\t\t\t35188.888\n", "3\tso im\t\t\t\t15431.061\n", "4\ti didnt\t\t\t\t15370.125\n", "5\tmeet you\t\t\t13878.429\n", "6\tim very\t\t\t\t12457.973\n", "7\tlearn english\t\t\t11868.028\n", "8\ti cant\t\t\t\t8989.845\n", "9\ti havent\t\t\t8578.976\n", "10\tmy family\t\t\t7718.84\n", "11\tim so\t\t\t\t7385.273\n", "12\tmy diary\t\t\t6630.556\n", "13\ti wont\t\t\t\t6090.669\n", "14\tive been\t\t\t5649.957\n", "15\tgood night\t\t\t5608.794\n", "16\tcant understand\t\t\t5537.622\n", "17\tthey dont\t\t\t5516.94\n", "18\tby myself\t\t\t4984.605\n", "19\tmy home\t\t\t\t4818.499\n", "20\tthan before\t\t\t4106.673\n", "21\tmy english\t\t\t4010.532\n", "22\tin japan\t\t\t3990.14\n", "23\tim sorry\t\t\t3897.65\n", "24\tplease correct\t\t\t3738.287\n", "25\tim glad\t\t\t\t3428.818\n", "26\tim afraid\t\t\t3306.523\n", "27\tdont you\t\t\t3303.248\n", "28\tmy room\t\t\t\t3276.736\n", "29\tgood morning\t\t\t3173.661\n", "30\tim trying\t\t\t3031.639\n" ] } ], "source": [ "#### [ TODO ] " ] }, { "cell_type": "markdown", "metadata": { "id": "ef-_B3bnx6py" }, "source": [ "## TA's Notes\n", "\n", "Once you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve a demo time. \n", "Scores are given only after a TA reviews your implementation, so **make sure you make an appointment with a TA before the deadline**.
After the demo, please upload your assignment to the e-learn website. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.\n", "
Note that **late submission will not be allowed**. " ] } ], "metadata": { "colab": { "collapsed_sections": [], "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }