{ "cells": [ { "cell_type": "markdown", "id": "f9f9b469", "metadata": { "id": "f9f9b469" }, "source": [ "## From BNC to Ngram \n", "\n", "### BNC Data: \n", "https://drive.google.com/file/d/1mKX1DLHDIqKph4e4k1MnYOV3iWtvT7-E/view?usp=sharing\n", "\n", "### 1. Extract lines containing id, title, classcode, keywords, sentences from each BNC parts\n", "\n", "grep (global search regular RE)\n", "grep是很常見也很常用的命令,它的主要功能是進行字符串數據的比較,然後符合用戶需求的字符串打印出來,但是注意,grep在數據中查找一個字符串時,是以“整行”爲單位進行數據篩選的。\n", "\n", "egrep (extended RE)\n", "\n", "Reference\n", "https://www.twblogs.net/a/5d26d705bd9eee1e5c84509d" ] }, { "cell_type": "code", "execution_count": null, "id": "afc2d87d", "metadata": { "id": "afc2d87d", "outputId": "1c59e1a7-61b5-4122-8c70-ebe302b529b2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "egrep -o -h BNC/Texts/*/*/A*.xml > BNC.A.txt 270.42s user 6.79s system 97% cpu 4:44.83 total\r\n" ] } ], "source": [ "! time ! egrep -o -h \\\n", "'(.*?|.*?||.*?||.*?|.*?||

|

)' \\\n", "BNC/Texts/*/*/A*.xml > BNC.A.txt" ] }, { "cell_type": "markdown", "id": "c572029d", "metadata": { "id": "c572029d" }, "source": [ "\n", "#### Repeat Step 1 for all sections A, B, C, D, E, F, G, H, J, and K " ] }, { "cell_type": "markdown", "id": "a20bcf30", "metadata": { "id": "a20bcf30" }, "source": [ " ### 2. Convert sentences to bigram (for all sections A to K, no I)\n", " ### 2.1 Convert line to word tokens" ] }, { "cell_type": "code", "execution_count": null, "id": "5b00de4c", "metadata": { "id": "5b00de4c" }, "outputs": [], "source": [ "import re\n", "from pprint import pprint\n", "\n", "def line_to_token(line):\n", " if line.startswith(' ', '', '') \n", " elif line.startswith('', '', '') \n", " elif line.startswith('discounted \n", " match = re.findall('(.*?)', line)\n", " return (match[0][2].strip(), match[0][0].upper(), match[0][1]) # lemma, tag, word\n", " elif line.startswith('(.*?)', line)\n", " if not match:\n", " return '??? line'\n", " return (match[0], match[0], match[0])\n", "\n", "def tokens_to_bigram(tokens):\n", " result = []\n", " for i in range(len(tokens)-1):\n", " if i == 1:\n", " word2tag2lemma2 = [tokens[i][j].lower()+' '+tokens[i+1][j] for j in range(3)]\n", " else:\n", " word2tag2lemma2 = [tokens[i][j]+' '+tokens[i+1][j] for j in range(3)]\n", " if word2tag2lemma2[0][0].isalpha() or word2tag2lemma2[0][0] == '<': \n", " result = result + [ '\\t'.join(word2tag2lemma2) ]\n", " return result" ] }, { "cell_type": "markdown", "id": "fc5eb1cc", "metadata": { "id": "fc5eb1cc" }, "source": [ "### 2.2 Convert token stream to bigram stream" ] }, { "cell_type": "code", "execution_count": null, "id": "bb7a275e", "metadata": { "id": "bb7a275e" }, "outputs": [], "source": [ "def word_to_bigram(wordfile, bigramfile):\n", " \n", " def Batch_to_ngram(batch, fileout): \n", " with open(wordfile.format(batch)) as filein:\n", " lines = filein.readlines()\n", " for i, line in enumerate(lines):\n", " if line.startswith(' count) " ] }, { "cell_type": "code", "execution_count": null, "id": "b268aac8", "metadata": { "id": "b268aac8", "outputId": "20af1767-a0b3-429c-9b1b-d8dac6a33110" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sort: No such file or directory\r\n", "sort BNC.B.2w.txt 0.00s user 0.00s system 54% cpu 0.005 total\r\n", "uniq -c 0.00s user 0.00s system 49% cpu 0.004 total\r\n", "awk '{ gsub(/^[ ]*/, \"\"); print }' 0.00s user 0.00s system 49% cpu 0.004 total\r\n", "awk '{print substr($0, index($0, \" \")+1) \"\\t\" $1}' 0.00s user 0.00s system 52% cpu 0.004 total\r\n", "egrep -v '\\t1$' > BNC.B.2w.c2+.txt 0.00s user 0.00s system 56% cpu 0.003 total\r\n" ] } ], "source": [ "#1 BNC.2w.txt ==> BNC.2w.c.txt\n", "! time sort BNC.2w.txt | uniq -c | \\\n", "awk '{ gsub(/^[ ]*/, \"\"); print }' | awk '{print substr($0, index($0, \" \")+1) \"\\t\" $1}' > BNC.2w.c.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "12341e3c", "metadata": { "id": "12341e3c", "outputId": "6819ee0e-7a6e-4495-ab9c-6d23225695fc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "big accident\tAJ0 NN1\tbig accident\t3\n", "fatal accident\tAJ0 NN1\tfatal accident\t82\n", "fatal accident\taj0 NN1\tfatal accident\t3\n", "serious accident\tAJ0 NN1\tserious accident\t61\n" ] } ], "source": [ "! egrep '^(big|serious|fatal) accident\\t' BNC.2w.c.txt" ] }, { "cell_type": "markdown", "source": [ "Target output: \n", "https://drive.google.com/file/d/1xM46aaDIeu4Z0FkikGOcmDoq7u2O47tY/view?usp=sharing" ], "metadata": { "id": "4y3_Vuf3NBlS" }, "id": "4y3_Vuf3NBlS" }, { "cell_type": "code", "source": [], "metadata": { "id": "fvWU9t9GNZhP" }, "id": "fvWU9t9GNZhP", "execution_count": null, "outputs": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "colab": { "provenance": [], "collapsed_sections": [] } }, "nbformat": 4, "nbformat_minor": 5 }