{
"cells": [
{
"cell_type": "markdown",
"id": "f9f9b469",
"metadata": {
"id": "f9f9b469"
},
"source": [
"## From BNC to Ngram \n",
"\n",
"### BNC Data: \n",
"https://drive.google.com/file/d/1mKX1DLHDIqKph4e4k1MnYOV3iWtvT7-E/view?usp=sharing\n",
"\n",
"### 1. Extract lines containing id, title, classcode, keywords, sentences from each BNC parts\n",
"\n",
"grep (global search regular RE)\n",
"grep是很常見也很常用的命令,它的主要功能是進行字符串數據的比較,然後符合用戶需求的字符串打印出來,但是注意,grep在數據中查找一個字符串時,是以“整行”爲單位進行數據篩選的。\n",
"\n",
"egrep (extended RE)\n",
"\n",
"Reference\n",
"https://www.twblogs.net/a/5d26d705bd9eee1e5c84509d"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afc2d87d",
"metadata": {
"id": "afc2d87d",
"outputId": "1c59e1a7-61b5-4122-8c70-ebe302b529b2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"egrep -o -h BNC/Texts/*/*/A*.xml > BNC.A.txt 270.42s user 6.79s system 97% cpu 4:44.83 total\r\n"
]
}
],
"source": [
"! time ! egrep -o -h \\\n",
"'(
|
)' \\\n", "BNC/Texts/*/*/A*.xml > BNC.A.txt" ] }, { "cell_type": "markdown", "id": "c572029d", "metadata": { "id": "c572029d" }, "source": [ "\n", "#### Repeat Step 1 for all sections A, B, C, D, E, F, G, H, J, and K " ] }, { "cell_type": "markdown", "id": "a20bcf30", "metadata": { "id": "a20bcf30" }, "source": [ " ### 2. Convert sentences to bigram (for all sections A to K, no I)\n", " ### 2.1 Convert line to word tokens" ] }, { "cell_type": "code", "execution_count": null, "id": "5b00de4c", "metadata": { "id": "5b00de4c" }, "outputs": [], "source": [ "import re\n", "from pprint import pprint\n", "\n", "def line_to_token(line):\n", " if line.startswith('