Beyond tagging: segmentation+labeling tasks Intro to NLP - ETHZ - 25/03/2013
Summary Information Extraction: NER and related tasks Segmentation & Labeling: Models Features Shallow parsing Entity disambiguation Hand out: mid term exam
Information Extraction Goal: identify structured data in unstructured text. Identify mentions of instances of predefined classes (names of people, locations, organizations, etc.): named-entity recognition (NER), co-reference resolution. Identify and normalize temporal expressions: normalize temporal expressions, associate events with points in time, timeline generation. Relation detection and classification. Event detection.
IE: input "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."
IE: NER "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."
IE: co-reference resolution "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: temporal expressions "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: relation extraction "Citing high fuel prices, United Airlines1 said Friday it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, [a unit of] AMR Corp.3, [immediately matched the move], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: event extraction "Citing high fuel prices, United Airlines1 said Friday [it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers.]1 American Airlines2, [a unit of] AMR Corp.3, [immediately matched [the move]1], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said [the increase]1 took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: (desired) output
AIRLINE FARE-RAISE:
  LEAD AIRLINE: UA
  AMOUNT: 6$
  EFFECTIVE DATE: 2006-10-26
  FOLLOWER: AA
1. Populate a database
2. Compare with historical stock prices
How? - Structured prediction: e.g., sequence models - Supervised and semi-supervised methods - Challenging problems - Limitations and new perspectives
Sequence classifiers Choose your favorite sequence classifier: HMM, CRF, MEMM, SVM, Perceptron,... We abstract from the specifics going forward
Beyond POS tagging Sequence labeling tasks: Assign one (POS) label to each word: Cats/NNS love/VBP fish/NN ./. Given: labeled data, sequence classifier. Sequence segmentation + labeling: Information extraction, biomedical IE, shallow (partial) parsing. Task: labels can span several words: "U.N. Security Council tries to end Syria conflict"
Segmentation + labeling "U.N. Security Council tries to end Syria conflict" "U.N./ORG Security/ORG Council/ORG tries to end Syria/LOC conflict" How many organizations?
Segmentation + labeling "U.N. Security Council tries to end Syria conflict" "[U.N. Security Council]ORG tries to end [Syria]LOC conflict" One organization...
Segmentation + Labeling 1. Identify the boundaries of a segment: a. "[U.N. Security Council] tries to end Syria conflict" 2. Label the segment: a. "[U.N. Security Council]ORG tries to end Syria conflict" 3. How? a. Reduce the problem to simple labeling b. BIO encoding: split each label X into i. B-X: beginning of a segment labeled X ii. I-X: inside a segment labeled X iii. O: outside any segment (NULL label)
BIO encoding
1. Encode the data in BIO format:
  a. [U.N. Security Council]ORG tries to end [Syria]LOC conflict.
  b. U.N./B-ORG Security/I-ORG Council/I-ORG tries/O to/O end/O Syria/B-LOC conflict/O
  c. No termination ambiguity: B-X -> I-X continues the same entity; anything else starts a new entity
2. Train and use your favorite sequence classifier as usual
  a. No guarantee that the output will be consistent
  b. Set the probability of B-X -> I-Y transitions to 0 if necessary
3. Evaluate appropriately: P/R/F1
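The encoding and decoding steps can be sketched as two small helpers. This is a minimal illustration, not a reference implementation: the function names `spans_to_bio` and `bio_to_spans` and the (start, end, label) span representation are assumptions made for the example. The decoder is deliberately lenient: an inconsistent I-Y after B-X is treated as the start of a new segment, which is one possible repair among several.

```python
from typing import List, Tuple

# A segment is (start, end_exclusive, label) over token positions.
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn non-overlapping labeled segments into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover segments from BIO tags, repairing inconsistent I-X tags."""
    spans: List[Span] = []
    start, label = None, None
    # The extra "O" is a sentinel that closes a segment ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label is not None and tag[2:] == label:
            continue  # I-X directly after B-X or I-X: same segment
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            # B-X opens a segment; a stray I-Y is leniently treated the same way.
            start, label = i, tag[2:]
    return spans


tokens = "U.N. Security Council tries to end Syria conflict".split()
gold = [(0, 3, "ORG"), (6, 7, "LOC")]
tags = spans_to_bio(len(tokens), gold)
for token, tag in zip(tokens, tags):
    print(f"{token}/{tag}", end=" ")
print()
```

Decoding the tags with `bio_to_spans(tags)` gives back the two gold segments, which is what makes segment-level P/R/F1 possible after plain token tagging.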
Features
Words:
  a. Target token identity, stem; w=corp., sw=corp.
  b. Same for words within a window around target; w-1=amr, w+1=,
PoS tags:
  a. Of target and neighbors; pos=nnp, pos+1=,
Shape:
  a. Regexp-like features; s=xxp, s+1=p
Prefixes/Suffixes:
  a. Target token and tokens within window; suf=., suf=p., pre=c, pre=co,...
Gazetteers: labeled lists of phrases (locations, cities, first names etc.):
  a. gX = target is in list X; for example, Tim -> gfirstname=true
  b. Crucial features out-of-domain
  c. Features on segments? Semi-Markov models (SemiMM)
Label features:
  a. label-1=b-org,...
  b. label+1=o (bidirectional models)
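A possible feature extractor for one token position is sketched below. The function names, the feature-key spelling (`w-1`, `pos+1`, `suf2`, ...) and the toy gazetteer are assumptions for illustration; stems and label features are left out to keep the sketch short. The shape function here is case-sensitive, so "Corp." maps to `Xxp` rather than the lowercased `xxp`.

```python
import re
from typing import Dict, List, Set


def word_shape(token: str) -> str:
    """Map characters to classes (X, x, d, p) and collapse repeated classes."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    shape = re.sub(r"[^Xxd]", "p", shape)  # everything else counts as punctuation
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens: List[str], pos_tags: List[str], i: int,
                   gazetteers: Dict[str, Set[str]], window: int = 1) -> Dict[str, str]:
    """Word, PoS, shape, affix and gazetteer features for the token at position i."""
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        j = i + off
        if not 0 <= j < len(tokens):
            continue
        suffix = "" if off == 0 else f"{off:+d}"  # "", "-1", "+1", ...
        feats[f"w{suffix}"] = tokens[j].lower()
        feats[f"pos{suffix}"] = pos_tags[j].lower()
        feats[f"s{suffix}"] = word_shape(tokens[j])
    word = tokens[i].lower()
    for n in (1, 2):
        feats[f"pre{n}"] = word[:n]
        feats[f"suf{n}"] = word[-n:]
    for name, entries in gazetteers.items():
        if word in entries:
            feats[f"g{name}"] = "true"
    return feats


# Toy gazetteer, purely illustrative.
gaz = {"firstname": {"tim"}}
feats = token_features(["AMR", "Corp.", ","], ["NNP", "NNP", ","], 1, gaz)
print(feats)
```

Each feature dictionary can then be fed to whichever sequence classifier is used, typically after one-hot encoding the key=value pairs.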
NER Identifying mentions (spans of text) referring to instances of a pre-defined set of classes
Entity ambiguity NOTE: The problem of the actual "identity" of an entity is not addressed in classic IE, unless approximated by coreference.
NER as sequence classification
Evaluation Precision/Recall/F1: micro/macro, mention/type level. A prediction is correct when both (1) the entity boundaries are correctly identified: [New York]GPE State vs. [New York State]GPE, and (2) the label is correct: [New York State]LOC vs. [New York State]GPE. F1 > 90% in the supervised, in-domain case! Significant degradation out-of-domain: domain adaptation, semi-supervised learning etc.
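The exact-match criterion can be made concrete with a small mention-level scorer. This is a sketch under the assumption that gold and predicted mentions are given as sets of (start, end, label) triples pooled over the test data (micro-averaging); the function name `prf1` is hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where a hit needs exact boundaries and label."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# "New York State ..." with a second mention further along (toy positions).
gold = {(0, 3, "GPE"), (5, 6, "PER")}
boundary_error = {(0, 2, "GPE"), (5, 6, "PER")}  # [New York]GPE State
label_error = {(0, 3, "LOC"), (5, 6, "PER")}     # [New York State]LOC

print(prf1(gold, boundary_error))
print(prf1(gold, label_error))
```

Both error types cost one false positive and one false negative at the same time, which is why boundary mistakes are penalized twice under this metric.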
Relation extraction Relation: a set of ordered tuples over a domain. E.g., President-of = {(Obama, US), (Sarkozy, France),...} 1. Identify named entities 2. For each pair (ei,ej) a. Predict-Relation(ei,ej) [possibly NULL] Task-specific features: 1. Text between entities: BOW 2. Syntactic structure
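The pairwise loop can be sketched as follows. Entities are assumed to come from an upstream NER step as (start, end, label) spans, and `predict_relation` is a toy rule standing in for a trained classifier; both the rule and the `part-of` label are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def between_bow(tokens: List[str], a: Span, b: Span) -> Counter:
    """Bag of lowercased words located between two non-overlapping mentions."""
    lo, hi = (a[1], b[0]) if a[1] <= b[0] else (b[1], a[0])
    return Counter(t.lower() for t in tokens[lo:hi])


def predict_relation(bow: Counter, a: Span, b: Span) -> Optional[str]:
    """Toy stand-in for a classifier: appositive 'a unit of' signals part-of."""
    if a[0] < b[0] and bow["unit"] and bow["of"]:
        return "part-of"
    return None  # NULL relation


def extract_relations(tokens: List[str], entities: List[Span]):
    """Score every ordered entity pair and keep the non-NULL predictions."""
    found = []
    for a, b in permutations(entities, 2):
        relation = predict_relation(between_bow(tokens, a, b), a, b)
        if relation is not None:
            found.append((" ".join(tokens[a[0]:a[1]]), relation,
                          " ".join(tokens[b[0]:b[1]])))
    return found


tokens = "American Airlines , a unit of AMR Corp. , immediately matched the move".split()
entities = [(0, 2, "ORG"), (6, 8, "ORG")]
relations = extract_relations(tokens, entities)
print(relations)
```

Since every ordered pair is scored, the number of candidates grows quadratically with the number of entities in a sentence, so in practice pairs are often filtered by distance or entity type first.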
Relation Extraction Appositive constructions: part-of, is-a etc.
Distant supervision for RE Supervised approaches are limited Idea: exploit existing repositories such as Wikipedia (infoboxes), Freebase (relations) to bootstrap models for RE: 1. Start with seed set of pairs 2. Repeat: a. Build extraction patterns from data: b. Extract new pairs Issues: semantic drift, evaluation
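The seed-and-repeat loop can be sketched on toy data. Here a "pattern" is simply the literal text between two entity mentions, and the corpus is a hand-made list of (arg1, middle text, arg2) triples; both are simplifying assumptions, and the function name `bootstrap` is hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (arg1, text between the mentions, arg2)
Pair = Tuple[str, str]


def bootstrap(sentences: List[Triple], seeds: Set[Pair], rounds: int = 3):
    """Alternate between inducing patterns from known pairs and extracting new pairs."""
    pairs = set(seeds)
    patterns: Set[str] = set()
    for _ in range(rounds):
        # a. Build extraction patterns from sentences that mention a known pair.
        patterns |= {mid for a, mid, b in sentences if (a, b) in pairs}
        # b. Extract every pair that occurs with a known pattern.
        new_pairs = {(a, b) for a, mid, b in sentences if mid in patterns}
        if new_pairs <= pairs:
            break  # nothing new was found
        pairs |= new_pairs
    return pairs, patterns


# Toy corpus, invented for illustration.
corpus = [
    ("Obama", "is the president of", "US"),
    ("Sarkozy", "is the president of", "France"),
    ("Obama", "flew back to", "US"),
    ("Merkel", "flew back to", "Germany"),
]
pairs, patterns = bootstrap(corpus, {("Obama", "US")})
print(sorted(pairs))
print(sorted(patterns))
```

The toy run also shows semantic drift: the seed pair licenses the uninformative pattern "flew back to", which then pulls (Merkel, Germany) into the President-of relation.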
Temporal expressions More regular patterns than in other tasks Challenges: Coverage False positives Normalization: map to interval or point in time
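As an illustration of normalization, the sketch below resolves a bare weekday name to a calendar date relative to a document date. It assumes past-tense news text, so the weekday is mapped to its most recent occurrence on or before the document date; the document date 2006-10-27 is itself an assumption chosen to agree with the effective date in the fare-raise example, and `resolve_weekday` is a hypothetical helper.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(expression: str, doc_date: date) -> date:
    """Map a weekday name to its latest occurrence on or before doc_date."""
    target = WEEKDAYS.index(expression.lower())  # Monday is 0, as in date.weekday()
    days_back = (doc_date.weekday() - target) % 7
    return doc_date - timedelta(days=days_back)


doc_date = date(2006, 10, 27)  # assumed publication date (a Friday)
print(resolve_weekday("Friday", doc_date).isoformat())
print(resolve_weekday("Thursday", doc_date).isoformat())
```

Real systems also need tense and context to decide between the previous and the next occurrence, which is one source of the false positives and coverage problems listed above.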
Partial parsing Goal: Identify the basic non-recursive (N/V/A/P) phrases of a sentence (chunking): a flat, non-overlapping segmentation+labeling task. Base phrases do not contain constituents of the same type. Include the head of the phrase. Exclude post-head modifiers. "[The morning flight]NP from [Denver]NP [has arrived]VP"
BIO encoding American Airlines, a unit of AMR Inc.
BIO encoding American Airlines, a unit of AMR Inc. [American Airlines]NP, [a unit]NP of [AMR Inc.]NP
BIO encoding American Airlines, a unit of AMR Inc. [American Airlines]NP, [a unit]NP of [AMR Inc.]NP American/B-NP Airlines/I-NP ,/O a/B-NP unit/I-NP of/O AMR/B-NP Inc./I-NP
From Trees to Chunks [American Airlines]NP, [a unit]NP of [AMR Inc.]NP
Partial parsing
PP as sequence classification
Limitations of NER 1. Washington was born at Haywood Farms near Oak Grove, Virginia. 2. Washington attended Yale Law School for a year and then studied at the University of Oxford. 3. Washington served as army chief of staff under William Tubman. 4. Historians laud Washington for his selection and supervision of his generals. 5. Washington was the founder of the town of Centralia, Washington.
Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC.
Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. 5 different people!
Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. [George] 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. [George] 3. WashingtonPER served as army chief of staff under William Tubman. [George] 4. Historians laud WashingtonPER for his selection and supervision of his generals. [George] 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. [George] With the same first name!
From strings to "things"
Polysemy A term or phrase can have multiple interpretations
Mercury?
Synonymy CAL University of California Berkeley College Univ. of Berkely U. of CA at Berkeley UC (Berkeley) Berkeley Bears U. of Ca. Berkeley U. of Cal. Bear's lair Univ. c UC Berkeley Berkeley University UCB California U. of CAL The Bear's Lair Cal Bears... The same concept can be expressed in different ways!
Entity disambiguation Segmentation and labeling problem Output = Wikipedia, Freebase, YAGO, etc. Updated and maintained by thousands of users Good coverage of popular topics How? Classic supervised approach dubious Exploit social aspects (link graph) One example: TagMe More @Google/ETH open house
Motivations
(a large) problem space Entities: English Wikipedia: 4 million; YAGO: 10 million; Freebase: 40 million; Web knowledge graphs: hundreds of millions. Facts: Billions of tuples: link graphs, infoboxes, Freebase relations etc.
Input
Mentions identification
Entity disambiguation
Ingredients-1 A distribution of entities (Wikipedia articles) given phrases (mentions), P(k|w): How likely is entity k given phrase w? Assume segmentation is given as a start, e.g., via NER (can do without). Estimate from Wikipedia (anchors/titles)
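One plausible way to estimate this distribution is by relative frequency over link anchors: count how often anchor text w points to article k. The sketch assumes the (anchor text, target article) pairs have already been harvested from a Wikipedia dump; the toy pairs and the name `estimate_priors` are illustrative only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def estimate_priors(anchors: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Relative-frequency estimate of P(entity | phrase) from anchor/target pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, target in anchors:
        counts[text.lower()][target] += 1
    return {
        phrase: {entity: n / sum(targets.values()) for entity, n in targets.items()}
        for phrase, targets in counts.items()
    }


# Toy anchors, invented for illustration.
anchors = [
    ("Washington", "George Washington"),
    ("Washington", "George Washington"),
    ("Washington", "George Washington"),
    ("Washington", "Washington (state)"),
    ("Centralia", "Centralia, Washington"),
]
priors = estimate_priors(anchors)
print(priors["washington"])
```

Always picking the most likely entity under this prior is a common baseline for entity disambiguation, and it ignores the surrounding context entirely.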
Ingredients-2 A measure of entity similarity: How related/mutually relevant are two entities? Link graph: large graph O(100M) many pairs to evaluate efficient computation Google distance:
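The formula on this slide did not survive extraction. A commonly used link-based measure of this kind is the Milne and Witten relatedness, derived from the normalized Google distance and computed over the sets of articles linking to each entity; the sketch below assumes that form, so it may differ in detail from what was shown in class. `relatedness` and its arguments are hypothetical names.

```python
import math
from typing import Set


def relatedness(in_a: Set[int], in_b: Set[int], n_articles: int) -> float:
    """1 minus the normalized Google distance over in-link sets, clipped at 0."""
    common = len(in_a & in_b)
    if common == 0:
        return 0.0  # no shared in-links: treated as unrelated
    larger = max(len(in_a), len(in_b))
    smaller = min(len(in_a), len(in_b))
    if n_articles <= smaller:
        return 1.0  # degenerate case: every article links to both entities
    distance = (math.log(larger) - math.log(common)) / (
        math.log(n_articles) - math.log(smaller))
    return max(0.0, 1.0 - distance)


# Toy in-link sets over a "Wikipedia" of 100 articles.
planet = {1, 2, 3, 4}
nasa = {3, 4, 5, 6}
print(relatedness(planet, nasa, 100))
```

Only set sizes and one intersection are needed per pair, which matters when many candidate pairs must be scored over a graph with on the order of 100M links.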
Ingredients-3 A measure of local coherence: Find entity assignments that obey document-level topic coherence Voting scheme: Aggregated relevance:
Basic TagMe Disambiguate mention i: 1. compute candidates: for all k such that: 2. choose the entity with highest lambda in the candidate set: NOTE: two hyperparameters: simple/efficient/effective
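Because the formulas on this slide were lost, the following is only a sketch of a TagMe-style voting scheme as described by Ferragina & Scaiella, under stated assumptions: `tau` prunes candidates with a low prior, every other mention votes for a candidate with the prior-weighted average relatedness of its own candidates, candidates scoring within a fraction `epsilon` of the best score are shortlisted, and the shortlisted candidate with the highest prior wins. The default values, the function names and the toy priors and relatedness table are all illustrative.

```python
from typing import Callable, Dict, List

Priors = Dict[str, Dict[str, float]]  # phrase -> {entity: P(entity | phrase)}


def disambiguate(mentions: List[str], priors: Priors,
                 rel: Callable[[str, str], float],
                 tau: float = 0.02, epsilon: float = 0.3) -> Dict[int, str]:
    """Pick one entity per mention by collective voting, then by prior."""
    # Filter with tau: drop senses that are too unlikely a priori.
    cands = {i: {k: p for k, p in priors.get(m, {}).items() if p >= tau}
             for i, m in enumerate(mentions)}
    result: Dict[int, str] = {}
    for i, own in cands.items():
        if not own:
            continue  # no candidate entity for this mention
        score = {}
        for k in own:
            total = 0.0
            for j, other in cands.items():
                if j == i or not other:
                    continue
                # Vote of mention j: average relatedness weighted by its priors.
                total += sum(rel(k, k2) * p2 for k2, p2 in other.items()) / len(other)
            score[k] = total
        best = max(score.values())
        # Shortlist near-top candidates, then choose argmax P(k|w).
        shortlist = [k for k in own if score[k] >= best * (1 - epsilon)]
        result[i] = max(shortlist, key=lambda k: own[k])
    return result


# Toy priors and relatedness values, invented for illustration.
toy_priors = {
    "mercury": {"Mercury (planet)": 0.4, "Mercury (element)": 0.5,
                "Mercury (mythology)": 0.1},
    "nasa": {"NASA": 1.0},
}
TOY_REL = {
    frozenset(("Mercury (planet)", "NASA")): 0.8,
    frozenset(("Mercury (element)", "NASA")): 0.1,
    frozenset(("Mercury (mythology)", "NASA")): 0.05,
}


def toy_rel(a: str, b: str) -> float:
    return TOY_REL.get(frozenset((a, b)), 0.0)


linked = disambiguate(["mercury", "nasa"], toy_priors, toy_rel)
print(linked)
```

With "nasa" in the context the planet wins even though the element has the higher prior; with no other mention to vote, all scores are zero and the choice falls back to the prior.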
Input
Priors on mention
Filter with tau
Compute candidates
Choose argmax P(k|w)
Eval
Evaluation on manually annotated data: CoNLL-Aida dataset
Metrics:
  micro@1: fraction of entity mentions correctly disambiguated
  macro@1: per-document accuracy, averaged over documents
Some results (CoNLL-Aida test-b):
            Baseline   Aida-best   TagMe*
  macro@1   72.74      81.66       78.21
  micro@1   69.82      82.54       78.64
*My own vanilla implementation of TagMe
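The difference between the two metrics can be seen on a toy example. The sketch assumes each document is given as a pair of aligned lists (gold entity, predicted entity) with one entry per mention, and that documents without mentions are skipped; `micro_macro` is a hypothetical helper, not the official scorer.

```python
from typing import List, Tuple


def micro_macro(docs: List[Tuple[List[str], List[str]]]) -> Tuple[float, float]:
    """Return (micro@1, macro@1) over per-document gold/predicted entity lists."""
    correct = total = 0
    per_doc = []
    for gold, pred in docs:
        if not gold:
            continue  # assumption: documents without mentions are ignored
        hits = sum(g == p for g, p in zip(gold, pred))
        correct += hits
        total += len(gold)
        per_doc.append(hits / len(gold))
    return correct / total, sum(per_doc) / len(per_doc)


# Toy data: a long document that is fully correct, a short one that is wrong.
docs = [
    (["A", "B", "C", "D"], ["A", "B", "C", "D"]),
    (["E"], ["X"]),
]
micro, macro = micro_macro(docs)
print(micro, macro)
```

Micro-averaging weights every mention equally, so long documents dominate; macro-averaging weights every document equally, so one bad short document pulls the score down much further.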
Applications Summarization? Choose sentences/snippets containing important entities in the document? Need data annotation...
Summary Information Extraction: NER and related tasks Segmentation & Labeling: Models Features Shallow parsing Entity disambiguation Hand out: mid term exam Next class: lexical semantics and LDA
References
- Book: J&M, Ch. 13.5 (Partial parsing), Ch. 22 (Information extraction)
- Entity disambiguation:
  Ferragina & Scaiella. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). CIKM 2010.
  Hoffart et al. Robust Disambiguation of Named Entities in Text. EMNLP 2011.
- Tools for the project: Stanford CoreNLP, NLTK, LingPipe, OpenNLP, etc.