{ "cells": [ { "cell_type": "markdown", "id": "f9f9b469", "metadata": {}, "source": [ "## From BNC to Ngram - clarification \n", "There seems to be general confusion about this homework assignment. This notebook gives more complete instructions, with hints for some of the problems encountered during class.\n", "\n", "## Objective\n", "The end goal of this assignment is to generate the rank and rank ratio between BNC and clang8 for all bigrams of the form \"adj. accident\" (an adjective followed by the word \"accident\").\n", "\n", "\n", "#### BNC Data: \n", "https://drive.google.com/file/d/1mKX1DLHDIqKph4e4k1MnYOV3iWtvT7-E/view?usp=sharing\n", "\n", "#### Cleaned Lang 8 Data:\n", "https://drive.google.com/file/d/11wxKJr-VpmrHZ-MR41i4E097uY-T9okH/view?usp=sharing\n", "\n", "\n", "### A. Processing BNC Data\n", "### 1.1 Extract lines containing id, title, classcode, keywords, sentences from each BNC part (ABCDEFGHJK)\n", "\n", "Use grep / egrep to match a regular expression and extract the relevant data.\n", "\n", "Reference\n", "https://www.twblogs.net/a/5d26d705bd9eee1e5c84509d" ] }, { "cell_type": "code", "execution_count": 1, "id": "afc2d87d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "egrep -o -h Texts/*/*/A*.xml > BNC.A.txt 237.80s user 1.29s system 99% cpu 4:00.15 total\r\n" ] } ], "source": [ "# extract data from raw data\n", "! egrep -o -h \\\n", "'(.*?|.*?||.*?||.*?|.*?||

|

)' \\\n", "Texts/*/*/A*.xml > BNC.A.txt" ] }, { "cell_type": "markdown", "id": "04fc4b8f", "metadata": {}, "source": [ "**NOTE:** From the lab, it seems most systems output \n", "\n", "\\\\FACTSHEET \\\\WHAT \\\\IS \\\\AIDS\\...\\ \n", " \n", "as one line. For line_to_token() to work, please separate the words so that each word (\\ ~ \\) is on its own line." ] }, { "cell_type": "code", "execution_count": 1, "id": "4e74d7dd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [ACET factsheets & newsletters]. Sample containing about 6688 words of miscellanea (domain: social science) \r\n", "A00\r\n", " [ACET factsheets & newsletters]. \r\n", "W nonAc: medicine\r\n", " Health Sex \r\n", "\r\n", "FACTSHEET \r\n", "WHAT \r\n", "IS \r\n", "AIDS\r\n" ] } ], "source": [ "! head BNC.A.txt" ] }, { "cell_type": "markdown", "id": "a20bcf30", "metadata": {}, "source": [ " ### 2. Convert sentences to bigrams (for all sections A to K, no I)\n", " After you extract all the BNC data (BNC.A.txt, BNC.B.txt, BNC.C.txt ...), you need to process the XML lines into tokens and then into bigrams.\n", " \n", " ### 2.1 Convert line to word tokens" ] }, { "cell_type": "code", "execution_count": 141, "id": "5b00de4c", "metadata": {}, "outputs": [], "source": [ "import re\n", "from pprint import pprint\n", "\n", "def line_to_token(line):\n", " if line.startswith(' ', '', '') \n", " elif line.startswith('', '', '') \n", " elif line.startswith('discounted \n", " match = re.findall('(.*?)', line)\n", " return (match[0][2].strip(), match[0][0].upper(), match[0][1]) # word, tag, lemma\n", " elif line.startswith('(.*?)', line)\n", " if not match:\n", " return '??? 
line'\n", " return (match[0], match[0], match[0])\n", "\n", "def tokens_to_bigram(tokens):\n", " result = []\n", " for i in range(len(tokens)-1):\n", " if i == 1:\n", " word2tag2lemma2 = [tokens[i][j].lower()+' '+tokens[i+1][j] for j in range(3)]\n", " else:\n", " word2tag2lemma2 = [tokens[i][j]+' '+tokens[i+1][j] for j in range(3)]\n", " if word2tag2lemma2[0][0].isalpha() or word2tag2lemma2[0][0] == '<': \n", " result = result + [ '\\t'.join(word2tag2lemma2) ]\n", " return result" ] }, { "cell_type": "markdown", "id": "fc5eb1cc", "metadata": {}, "source": [ "### 2.2 Convert token stream to bigram stream" ] }, { "cell_type": "code", "execution_count": 142, "id": "bb7a275e", "metadata": {}, "outputs": [], "source": [ "def word_to_bigram(wordfile, bigramfile):\n", " \n", " def Batch_to_ngram(batch, fileout): \n", " with open(wordfile.format(batch)) as filein:\n", " lines = filein.readlines()\n", " for i, line in enumerate(lines):\n", " if line.startswith(' count) " ] }, { "cell_type": "code", "execution_count": 103, "id": "b268aac8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sort: No such file or directory\r\n", "sort BNC.B.2w.txt 0.00s user 0.00s system 54% cpu 0.005 total\r\n", "uniq -c 0.00s user 0.00s system 49% cpu 0.004 total\r\n", "awk '{ gsub(/^[ ]*/, \"\"); print }' 0.00s user 0.00s system 49% cpu 0.004 total\r\n", "awk '{print substr($0, index($0, \" \")+1) \"\\t\" $1}' 0.00s user 0.00s system 52% cpu 0.004 total\r\n", "egrep -v '\\t1$' > BNC.B.2w.c2+.txt 0.00s user 0.00s system 56% cpu 0.003 total\r\n" ] } ], "source": [ "#1 BNC.2w.txt ==> BNC.2w.c.txt\n", "# We sort the bigrams and count identical bigrams\n", "# NOTE: BNC.2w.c.txt should be considerably smaller than BNC.2w.txt\n", "# NOTE: Since we only care about \"adj. accident\", you can filter BNC.2w.c.txt into a much\n", "# smaller file by extracting bigrams that fit \"adj. accident\"\n", "\n", "! 
time sort BNC.2w.txt | uniq -c | \\\n", "awk '{ gsub(/^[ ]*/, \"\"); print }' | awk '{print substr($0, index($0, \" \")+1) \"\\t\" $1}' > BNC.2w.c.txt" ] }, { "cell_type": "code", "execution_count": 134, "id": "12341e3c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "big accident\tAJ0 NN1\tbig accident\t3\n", "fatal accident\tAJ0 NN1\tfatal accident\t82\n", "fatal accident\taj0 NN1\tfatal accident\t3\n", "serious accident\tAJ0 NN1\tserious accident\t61\n" ] } ], "source": [ "# Example data format for BNC.2w.c.txt\n", "# bigram pos lemmas count\n", "\n", "! egrep '^(big|serious|fatal) accident\\t' BNC.2w.c.txt" ] }, { "cell_type": "markdown", "id": "189a0c2b", "metadata": {}, "source": [ "### B. Processing clang8 Data\n", "During the lab, we found that the lang8 dataset we provided last week did not contain enough \"adj. accident\" bigrams for this assignment. \n", "Please process the clang8 data provided above and extract bigrams. \n", "**NOTE:** We are only interested in the rank / rank ratio for \"adj. accident\" bigrams. Please extract the \"adj. accident\" bigrams from both BNC and clang8 first, then calculate the rank and rank ratio. " ] }, { "cell_type": "markdown", "id": "d03d5117", "metadata": {}, "source": [ "Target output format: \n", "https://drive.google.com/file/d/1xM46aaDIeu4Z0FkikGOcmDoq7u2O47tY/view?usp=sharing\n", "\n", "Demo time sign-up for Lab 3:\n", "https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit?usp=sharing\n", "\n", "For the demo, please print out all the \"adj. 
accident\" bigrams in descending rank ratio order.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2be0d0b0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6b157e00", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }