{
"cells": [
{
"cell_type": "markdown",
"id": "f9f9b469",
"metadata": {},
"source": [
"## From BNC to Ngram - clarification \n",
"There seems to be a general sense of confusion regarding this homeword assignment. I would try to provide a more comprehensive instruction with some hints to handle some of the problems encountered during class.\n",
"\n",
"## Objective\n",
"The end goal of this assignment is to generate the rank and rank ratio between BNC and clang8 for all the bigrams with the format \"adj. accident\"\n",
"\n",
"\n",
"#### BNC Data: \n",
"https://drive.google.com/file/d/1mKX1DLHDIqKph4e4k1MnYOV3iWtvT7-E/view?usp=sharing\n",
"\n",
"#### Cleaned Lang 8 Data:\n",
"https://drive.google.com/file/d/11wxKJr-VpmrHZ-MR41i4E097uY-T9okH/view?usp=sharing\n",
"\n",
"\n",
"### A. Processing BNC Data\n",
"### 1.1 Extract lines containing id, title, classcode, keywords, sentences from each BNC parts (ABCDEFGHJK)\n",
"\n",
"using grep / egrep to match regular expression and extract relavent data \n",
"\n",
"Reference\n",
"https://www.twblogs.net/a/5d26d705bd9eee1e5c84509d"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "afc2d87d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"egrep -o -h Texts/*/*/A*.xml > BNC.A.txt 237.80s user 1.29s system 99% cpu 4:00.15 total\r\n"
]
}
],
"source": [
"# extract data from raw data\n",
"! egrep -o -h \\\n",
"'(
|
)' \\\n", "Texts/*/*/A*.xml > BNC.A.txt" ] }, { "cell_type": "markdown", "id": "04fc4b8f", "metadata": {}, "source": [ "**NOTE:** From the lab, it seems most systems output \n", "\n", "\\