{"cells":[{"cell_type":"markdown","id":"03db8fca","metadata":{"id":"03db8fca"},"source":["# Week 06: Dependency Parser and spacy\n","The assignment this week is to identify the grammar pattern VERB-PREP-NOUN using two different methods. You will practice the various functionalities of spacy in the process. \n","\n","Data used in this assignment: \n","https://drive.google.com/file/d/1OIZPsDezgLaBjw3OX30YFyeFkzegtwP8/view?usp=sharing\n","\n","* sentences.s2orc.txt\n","\n","spacy tutorials: \n","https://www.machinelearningplus.com/spacy-tutorial-nlp/#phrasematcher \n","https://spacy.io/usage/linguistic-features#entity-linking\n","\n","## Requirements\n","* pandas\n","* spacy\n","\n"]},{"cell_type":"markdown","id":"8da24123","metadata":{"id":"8da24123"},"source":["### Installation of spacy"]},{"cell_type":"code","execution_count":null,"id":"4f503a42","metadata":{"id":"4f503a42"},"outputs":[],"source":["! pip install spacy\n","! python -m spacy download en_core_web_sm"]},{"cell_type":"markdown","id":"55d6736c","metadata":{"id":"55d6736c"},"source":["### Read Data"]},{"cell_type":"code","execution_count":null,"id":"4d97fd4d","metadata":{"id":"4d97fd4d","outputId":"35b55654-7592-47d2-d975-31ea423e4617"},"outputs":[{"name":"stdout","output_type":"stream","text":[" sentence\n","0 Meanwhile, an analysis of the literature shows...\n","1 Meanwhile, this list can be supplemented with ...\n","2 At the same time, in many cases, several instr...\n","3 It is not possible to give a systematic assess...\n","4 Correlation was calculated for the years, wher...\n"]}],"source":["def loadData(path):\n"," with open(path) as f:\n"," sents = []\n"," for line in f.readlines():\n"," line = line.strip(\"\\n\").split(\"\\t\")\n"," sents.append(line[1])\n"," return pd.DataFrame({\"sentence\":sents})\n","data = loadData(\"lab_sentences.s2orc.txt\")\n","print(data.head())\n"]},{"cell_type":"code","execution_count":null,"id":"f2c07c81","metadata":{"scrolled":true,"id":"f2c07c81"},"outputs":[],"source":["import re\n","import pandas as pd\n","import spacy\n","nlp = spacy.load('en_core_web_sm')"]},{"cell_type":"markdown","id":"aef7d493","metadata":{"id":"aef7d493"},"source":["### Spacy example\n","If you have any probelm, look up the documentation [here](https://spacy.io/usage/linguistic-features)\n"]},{"cell_type":"code","execution_count":null,"id":"998b263c","metadata":{"id":"998b263c"},"outputs":[],"source":["example_text = \"\"\"The economic situation of the country is on edge , as the stock \n","market crashed causing loss of millions. Citizens who had their main investment \n","in the share-market are facing a great loss. 
{"cell_type":"markdown","id":"728770d5","metadata":{"id":"728770d5"},"source":["**[ TODO ]** Please print out the 2nd sentence in the example_text"]},
{"cell_type":"code","execution_count":null,"id":"03a85784","metadata":{"scrolled":true,"id":"03a85784","outputId":"785657b6-5167-46e4-a9ea-2c7973b89d3e"},"outputs":[{"name":"stdout","output_type":"stream","text":["Citizens who had their main investment in the share-market are facing a great loss.\n"]}],"source":["sents = ...\n","print(sents[1])"]},
{"cell_type":"markdown","id":"02f1e521","metadata":{"id":"02f1e521"},"source":["Let's start with some simple linguistic features we have been dealing with.\n","\n","**[ TODO ]** Please print out the following token features of the first sentence in example_text: \n","text, lemma, POS"]},
{"cell_type":"code","execution_count":null,"id":"e1320b34","metadata":{"id":"e1320b34","outputId":"feeeb223-9ccb-406b-8b68-807b70dd23e7"},"outputs":[{"name":"stdout","output_type":"stream","text":["The the DET\n","economic economic ADJ\n","situation situation NOUN\n","of of ADP\n","the the DET\n","country country NOUN\n","is be AUX\n","on on ADP\n","edge edge NOUN\n",", , PUNCT\n","as as SCONJ\n","the the DET\n","stock stock NOUN\n","market market NOUN\n","crashed crash VERB\n","causing cause VERB\n","loss loss NOUN\n","of of ADP\n","millions million NOUN\n",". . PUNCT\n"]}],"source":["for token in sents[0]:\n","    print(...)"]},
{"cell_type":"markdown","id":"82bba130","metadata":{"id":"82bba130"},"source":["**[ TODO ]** Data Process 1: Please run the s2orc data through spacy and store the result in data_doc"]},
{"cell_type":"code","execution_count":null,"id":"3d4fb283","metadata":{"id":"3d4fb283"},"outputs":[],"source":["data_doc = []\n","..."]},
{"cell_type":"code","execution_count":null,"id":"05334eeb","metadata":{"id":"05334eeb","outputId":"a500c2be-8c53-4885-f69b-0ef1f1550508"},"outputs":[{"data":{"text/plain":["Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed."]},"execution_count":135,"metadata":{},"output_type":"execute_result"}],"source":["data_doc[0]"]},
{"cell_type":"markdown","id":"3aa91c05","metadata":{"id":"3aa91c05"},"source":["### Named Entity Recognition\n","Named Entity: a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. \n","\n","The following is an example of named entity recognition using spacy"]},
{"cell_type":"code","execution_count":null,"id":"5d901938","metadata":{"id":"5d901938","outputId":"719095a5-1df0-43ad-938e-86e167d4e3dd"},"outputs":[{"name":"stdout","output_type":"stream","text":["Ada Lovelace PERSON\n","New York GPE\n","Thanksgiving DATE\n"]}],"source":["ner_doc = nlp(\"Ada Lovelace was born in New York at Thanksgiving.\")\n","\n","# Document level\n","for e in ner_doc.ents:\n","    print(e.text, e.label_)"]},
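{"cell_type":"markdown","id":"ed-note-02","metadata":{},"source":["Each entity is a Span that also records its character offsets (start_char, end_char), and every token carries ent_iob_ / ent_type_ attributes. The minimal sketch below only illustrates these attributes; they may be handy for Data Process 2."]},
{"cell_type":"code","execution_count":null,"id":"ed-code-02","metadata":{},"outputs":[],"source":["# Entity spans know where they sit in the original text\n","for e in ner_doc.ents:\n","    print(e.text, e.label_, e.start_char, e.end_char)\n","\n","# Token level: IOB scheme (B = begins an entity, I = inside, O = outside)\n","for token in ner_doc:\n","    print(token.text, token.ent_iob_, token.ent_type_)"]},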
\n","\n"," Ada Lovelace\n"," PERSON\n","\n"," was born in \n","\n"," New York\n"," GPE\n","\n"," at \n","\n"," Thanksgiving\n"," DATE\n","\n",".
"],"text/plain":[""]},"metadata":{},"output_type":"display_data"}],"source":["from spacy import displacy\n","displacy.render(ner_doc,style='ent',jupyter=True)"]},{"cell_type":"markdown","id":"6a00997d","metadata":{"id":"6a00997d"},"source":["**[ TODO ]** Data Process 2: Please replace all named entities in data_doc with their labels. \n","For example, \n","\"Ada Lovelace was born in New York at Thanksgiving.\" should be adjusted to \n","\"PERSON was born in GPE at DATE.\""]},{"cell_type":"code","execution_count":null,"id":"9200b473","metadata":{"id":"9200b473"},"outputs":[],"source":["data_doc = ..."]},{"cell_type":"markdown","id":"efa97686","metadata":{"id":"efa97686"},"source":["### Dependency Parser\n","\n","If you have probelms concerning the dependency parser tags, look up the documentation [here](https://universaldependencies.org/en/dep/index.html). \n"]},{"cell_type":"code","execution_count":null,"id":"4ca13c64","metadata":{"id":"4ca13c64","outputId":"ced50ef0-728e-41d0-bc44-53d21db06db4"},"outputs":[{"name":"stdout","output_type":"stream","text":["Many companies might lay off thousands of people to reduce labor cost.\n","Many amod\n","companies nsubj\n","might aux\n","lay ROOT\n","off prt\n","thousands dobj\n","of prep\n","people pobj\n","to aux\n","reduce advcl\n","labor compound\n","cost dobj\n",". punct\n"]}],"source":["# Example of Dependency Parser\n","print(sents[2])\n","for token in sents[2]:\n"," print(token.text, token.dep_)"]},{"cell_type":"code","execution_count":null,"id":"cc85cf22","metadata":{"id":"cc85cf22","outputId":"1738d42a-284a-490e-ed0c-f40bdccd7af0"},"outputs":[{"data":{"text/html":["\n","\n"," Many\n"," ADJ\n","\n","\n","\n"," companies\n"," NOUN\n","\n","\n","\n"," might\n"," AUX\n","\n","\n","\n"," lay\n"," VERB\n","\n","\n","\n"," off\n"," ADP\n","\n","\n","\n"," thousands\n"," NOUN\n","\n","\n","\n"," of\n"," ADP\n","\n","\n","\n"," people\n"," NOUN\n","\n","\n","\n"," to\n"," PART\n","\n","\n","\n"," reduce\n"," VERB\n","\n","\n","\n"," labor\n"," NOUN\n","\n","\n","\n"," cost.\n"," NOUN\n","\n","\n","\n"," \n"," \n"," amod\n"," \n"," \n","\n","\n","\n"," \n"," \n"," nsubj\n"," \n"," \n","\n","\n","\n"," \n"," \n"," aux\n"," \n"," \n","\n","\n","\n"," \n"," \n"," prt\n"," \n"," \n","\n","\n","\n"," \n"," \n"," dobj\n"," \n"," \n","\n","\n","\n"," \n"," \n"," prep\n"," \n"," \n","\n","\n","\n"," \n"," \n"," pobj\n"," \n"," \n","\n","\n","\n"," \n"," \n"," aux\n"," \n"," \n","\n","\n","\n"," \n"," \n"," advcl\n"," \n"," \n","\n","\n","\n"," \n"," \n"," compound\n"," \n"," \n","\n","\n","\n"," \n"," \n"," dobj\n"," \n"," \n","\n",""],"text/plain":[""]},"metadata":{},"output_type":"display_data"}],"source":["from spacy import displacy\n","\n","displacy.render(sents[2], style=\"dep\")"]},{"cell_type":"markdown","id":"cb6e1a82","metadata":{"id":"cb6e1a82"},"source":["To traverse a dependency tree, use the following properties of token object. \n","token.children, token.lefts, token.rights \n","\n","If you have any probelms, please check [here](https://spacy.io/api/token#children)"]},{"cell_type":"markdown","id":"75901966","metadata":{"id":"75901966"},"source":["**[ TODO ]** Please identify a VERB-PREP-NOUN grammar structure in sent[2] by traversing the dependency tree. 
\n","Expected output: \n","(lay, off, thousands)\n"]},{"cell_type":"code","execution_count":null,"id":"9cec92b7","metadata":{"id":"9cec92b7"},"outputs":[],"source":[]},{"cell_type":"markdown","id":"1f4e46d2","metadata":{"id":"1f4e46d2"},"source":["**[ TODO ]** Please identify all VERB-PREP-NOUN grammar structure in data_doc by traversing the dependency trees and save the results in a list of tuples dep_gp.\n"]},{"cell_type":"code","execution_count":null,"id":"c25d7c17","metadata":{"id":"c25d7c17"},"outputs":[],"source":["dep_gp = ..."]},{"cell_type":"markdown","id":"3e4b5c69","metadata":{"id":"3e4b5c69"},"source":["**[ TODO ]** Please print out all VERB-PREP-NOUN grammar patterns in dep_gp with the verb \"charge\".\n"]},{"cell_type":"code","execution_count":null,"id":"48da0488","metadata":{"id":"48da0488"},"outputs":[],"source":[]},{"cell_type":"markdown","id":"5ee87eef","metadata":{"id":"5ee87eef"},"source":["### Rule Based Methods \n","We can also custom build rules for spacy to match patterns. \n","[Documentation](https://spacy.io/api/matcher)"]},{"cell_type":"code","execution_count":null,"id":"74664296","metadata":{"id":"74664296"},"outputs":[],"source":["from spacy.matcher import Matcher "]},{"cell_type":"code","execution_count":null,"id":"abc9b3d4","metadata":{"id":"abc9b3d4"},"outputs":[],"source":["# Example text\n","text = \"\"\"I visited Manali last time . Around same budget trips ? I was visiting Ladakh this summer . I have planned visiting New York and other abroad places for next year. Have you ever visited Kodaikanal? \"\"\"\n","text = re.sub('\\n', '', text)\n","match_doc = nlp(text)"]},{"cell_type":"code","execution_count":null,"id":"9d053b2f","metadata":{"id":"9d053b2f","outputId":"0d9e93f9-4603-43e8-fc99-645a9fe1dad5"},"outputs":[{"name":"stdout","output_type":"stream","text":[" matches found: 4\n","Match found: visited Manali\n","Match found: visiting Ladakh\n","Match found: visiting New\n","Match found: visited Kodaikanal\n"]}],"source":["# Initialize the matcher\n","matcher = Matcher(nlp.vocab)\n","\n","# Write a pattern that matches a form of \"visit\" + place\n","my_pattern = [{\"LEMMA\": \"visit\"}, {\"POS\": \"PROPN\"}]\n","\n","# Add the pattern to the matcher and apply the matcher to the doc\n","matcher.add(\"Visting_places\", [my_pattern])\n","matches = matcher(doc)\n","\n","# Counting the no of matches\n","print(\" matches found:\", len(matches))\n","\n","# Iterate over the matches and print the span text\n","for match_id, start, end in matches:\n"," print(\"Match found:\", doc[start:end].text)"]},{"cell_type":"markdown","id":"065b258c","metadata":{"id":"065b258c"},"source":["**[ TODO ]** Please identify all VERB-PREP-NOUN grammar structure in data_doc by applying a matcher rule and store the results in a list of tuples rule_gp. \n"]},{"cell_type":"code","execution_count":null,"id":"7076373d","metadata":{"id":"7076373d"},"outputs":[],"source":["rule_gp"]},{"cell_type":"markdown","id":"a4a03652","metadata":{"id":"a4a03652"},"source":["**[ TODO ]** Please print out all VERB-PREP-NOUN grammar patterns in rule_gp with the verb \"charge\".\n"]},{"cell_type":"code","execution_count":null,"id":"918eeda9","metadata":{"id":"918eeda9"},"outputs":[],"source":[]},{"cell_type":"markdown","id":"69efbb4d","metadata":{"id":"69efbb4d"},"source":["## TA's Notes\n","\n","If you complete the Assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=258852025) to reserve demo time. 
\n","The score is only given after TAs review your implementation, so **make sure you make a appointment with a TA before you miss the deadline** .
After demo, please upload your assignment to elearn. You just need to hand in this ipynb file and rename it as XXXXXXXXX(Your student ID).ipynb.\n","
Note that **late submission will not be allowed**."]}],"metadata":{"kernelspec":{"display_name":"gm-transformer-venv","language":"python","name":"gm-transformer-venv"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.10"},"colab":{"provenance":[],"collapsed_sections":[]}},"nbformat":4,"nbformat_minor":5}