Beyond tagging: segmentation+labeling tasks. Intro to NLP - ETHZ - 25/03/2013



Summary
- Information Extraction: NER and related tasks
- Segmentation & Labeling: Models, Features
- Shallow parsing
- Entity disambiguation
- Hand out: mid term exam

Information Extraction
Goal: identify structured data in unstructured text.
- Identify mentions of instances of predefined classes (names of people, locations, organizations, etc.):
  - Named-entity recognition (NER)
  - Co-reference resolution
- Identify and normalize temporal expressions:
  - Normalize temporal expressions
  - Associate events with points in time
  - Timeline generation
- Relation detection and classification:
  - Event detection

IE: input "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."

IE: NER "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."

IE: co-reference resolution "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: temporal expressions "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: relation extraction "Citing high fuel prices, United Airlines1 said Friday it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, [a unit of] AMR Corp.3, [immediately matched the move], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: event extraction "Citing high fuel prices, United Airlines1 said Friday [it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers.]1 American Airlines2, [a unit of] AMR Corp.3, [immediately matched [the move]1], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said [the increase]1 took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: (desired) output
AIRLINE FARE-RAISE:
  LEAD AIRLINE: UA
  AMOUNT: 6$
  EFFECTIVE DATE: 2006-10-26
  FOLLOWER: AA
1. Populate a database
2. Compare with historical stock prices

How? - Structured prediction: e.g., sequence models - Supervised and semi-supervised methods - Challenging problems - Limitations and new perspectives

Sequence classifiers Choose your favorite sequence classifier: HMM, CRF, MEMM, SVM, Perceptron,... We abstract from the specifics going forward

Beyond POS tagging
Sequence labeling tasks:
- Assign one (POS) label to each word: Cats/NNS love/VBP fish/NN ./.
- Given: labeled data, sequence classifier
Sequence segmentation + labeling:
- Information extraction, biomedical IE, shallow (partial) parsing
- Task: labels can span several words: "U.N. Security Council tries to end Syria conflict"

Segmentation + labeling "U.N. Security Council tries to end Syria conflict" "U.N./ORG Security/ORG Council/ORG tries to end Syria/LOC conflict" How many organizations?

Segmentation + labeling "U.N. Security Council tries to end Syria conflict" "[U.N. Security Council]ORG tries to end [Syria]LOC conflict" One organization...

Segmentation + Labeling
1. Identify the boundaries of a segment:
   a. "[U.N. Security Council] tries to end Syria conflict"
2. Label the segment:
   a. "[U.N. Security Council]ORG tries to end Syria conflict"
3. How?
   a. Reduce the problem to simple labeling
   b. BIO encoding: split the labels into
      i. B-X: beginning of a segment labeled X
      ii. I-X: inside a segment labeled X
      iii. O: outside any segment (NULL label)

BIO encoding
1. Encode the data in BIO format:
   a. [U.N. Security Council]ORG tries to end [Syria]LOC conflict.
   b. U.N./B-ORG Security/I-ORG Council/I-ORG tries/O to/O end/O Syria/B-LOC conflict/O
   c. No termination ambiguity: B-X -> I-X continues the same entity, anything else starts a new one
2. Train and use your favorite sequence classifier as usual
   a. No guarantee that output will be consistent
   b. Set B-X -> I-Y transitions to 0 if necessary
3. Evaluate appropriately: P/R/F1
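
The encoding step can be made concrete with a small sketch. The function names `spans_to_bio` and `bio_to_spans` and the (start, end, label) span format with exclusive end are assumptions made for illustration; they are not from the lecture. The decoder also shows one possible way to cope with inconsistent classifier output (an I-X tag that does not continue an open X segment is treated as the start of a new segment).

```python
def spans_to_bio(tokens, spans):
    """Turn labeled segments into one BIO tag per token.

    spans: list of (start, end, label) with `end` exclusive (assumed format).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def bio_to_spans(tags):
    """Recover (start, end, label) segments from a BIO tag sequence."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes a segment that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Close the open segment unless this tag continues it.
        if start is not None and tag != "I-" + label:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Inconsistent output (I-X with no matching open segment):
            # repair by opening a new segment here.
            start, label = i, tag[2:]
    return spans


tokens = "U.N. Security Council tries to end Syria conflict".split()
tags = spans_to_bio(tokens, [(0, 3, "ORG"), (6, 7, "LOC")])
print(list(zip(tokens, tags)))
print(bio_to_spans(tags))
```

Repairing at decoding time is only one option; the slide's alternative is to forbid B-X -> I-Y transitions inside the sequence model itself.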

Features
Words:
a. Target token identity, stem; w=corp., sw=corp.
b. Same for words within a window around target; w-1=amr, w+1=,
PoS tags:
a. Of target and neighbors; pos=nnp, pos+1=,
Shape:
a. Regexp-like features; s=xxp, s+1=p
Prefixes/Suffixes:
a. Target token and tokens within window; suf=., suf=p., pre=c, pre=co, ...
Gazetteers: labeled lists of phrases (locations, cities, first names etc.):
a. gx = target is in list X; example: Tim -> gfirstname=true
b. Crucial features out-of-domain
c. Features on segments? SemiMM
Label features:
a. label-1=b-org, ...
b. label+1=o, ... (bidirectional models)
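
A minimal sketch of how such features might be computed for one token. The feature names follow the slide (w, s, suf, pre, gX), but the function names, the window of one token on each side, and the exact shape alphabet (X, x, d, p) are assumptions. Stems, PoS tags and label features are left out because they need a stemmer, a tagger and the decoder state.

```python
import re


def word_shape(token):
    """Map a token to a coarse shape, e.g. 'Corp.' -> 'Xxp' (assumed alphabet)."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    shape = re.sub(r"[^Xxd]", "p", shape)   # everything else counts as punctuation
    return re.sub(r"(.)\1+", r"\1", shape)  # collapse runs: 'Xxxxp' -> 'Xxp'


def token_features(tokens, i, gazetteers):
    """Features for tokens[i]; gazetteers maps a list name to a set of entries."""
    tok = tokens[i]
    feats = {
        "w": tok.lower(),
        "s": word_shape(tok),
        "suf1": tok[-1:],
        "suf2": tok[-2:],
        "pre1": tok[:1].lower(),
        "pre2": tok[:2].lower(),
    }
    # Same word and shape features for the neighbors in a +/-1 window.
    for offset in (-1, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats["w%+d" % offset] = tokens[j].lower()
            feats["s%+d" % offset] = word_shape(tokens[j])
    # Gazetteer membership, e.g. gfirstname=True for 'Tim'.
    for name, entries in gazetteers.items():
        if tok in entries:
            feats["g" + name] = True
    return feats


sentence = ["a", "unit", "of", "AMR", "Corp.", ",", "immediately"]
print(token_features(sentence, 4, {"firstname": {"Tim"}}))
```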

NER Identifying mentions (spans of text) referring to instances of a pre-defined set of classes

Entity ambiguity NOTE: The problem of the actual "identity" of an entity is not addressed in classic IE, unless approximated by coreference.

NER as sequence classification

Evaluation
Precision/Recall/F1:
- micro/macro; mention/type level
A prediction is correct when:
- both entity boundaries are correctly identified: [New York]GPE State vs. [New York State]GPE
- the label is correct: [New York State]LOC vs. [New York State]GPE
F1 > 90% in the supervised case, in-domain!
- Significant degradation out-of-domain
- Domain adaptation, semi-supervised learning etc.
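
A short sketch of mention-level scoring under the exact-match criterion above: a predicted segment counts only if both boundaries and the label agree with a gold segment. The function name `span_prf1` and the (start, end, label) span format are assumptions for illustration.

```python
def span_prf1(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)  # boundaries AND label must match
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# "New York State": predicting only [New York] gets no credit at all.
gold = [(0, 3, "GPE")]
print(span_prf1(gold, [(0, 2, "GPE")]))  # wrong boundary
print(span_prf1(gold, [(0, 3, "LOC")]))  # wrong label
print(span_prf1(gold, [(0, 3, "GPE")]))  # exact match
```

Pooling the counts over all documents before dividing gives the micro average; computing the scores per document (or per type) and averaging them gives the macro average.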

Relation extraction
Relation: a set of ordered tuples over a domain, e.g., President-of = {(Obama, US), (Sarkozy, France), ...}
1. Identify named entities
2. For each pair (ei, ej):
   a. Predict-Relation(ei, ej) [possibly NULL]
Task-specific features:
1. Text between entities: BOW
2. Syntactic structure

Relation Extraction Appositive constructions: part-of, is-a etc.

Distant supervision for RE
Supervised approaches are limited
Idea: exploit existing repositories such as Wikipedia (infoboxes), Freebase (relations) to bootstrap models for RE:
1. Start with seed set of pairs
2. Repeat:
   a. Build extraction patterns from data
   b. Extract new pairs
Issues: semantic drift, evaluation
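
The seed-and-repeat loop can be sketched as follows. This is a deliberately naive illustration, not the method of any particular system: a "pattern" is just the literal text found between the two members of a known pair, and candidate arguments are runs of capitalized tokens. The function name `bootstrap` and the regular expression are assumptions.

```python
import re

# Crude stand-in for NER: one or more capitalized tokens (assumption).
CAP_PHRASE = r"([A-Z][\w.]*(?: [A-Z][\w.]*)*)"


def bootstrap(sentences, seeds, rounds=2):
    """Grow a set of (x, y) pairs from seed pairs via literal context patterns."""
    pairs = set(seeds)
    for _ in range(rounds):
        # a. Build extraction patterns: the text between x and y in a sentence.
        patterns = set()
        for sent in sentences:
            for x, y in pairs:
                ix, iy = sent.find(x), sent.find(y)
                if ix != -1 and iy != -1 and ix < iy:
                    middle = sent[ix + len(x):iy]
                    if middle.strip():  # skip empty or overlapping contexts
                        patterns.add(middle)
        # b. Extract new pairs wherever a pattern links two capitalized phrases.
        for pattern in patterns:
            regex = re.compile(CAP_PHRASE + re.escape(pattern) + CAP_PHRASE)
            for sent in sentences:
                for match in regex.finditer(sent):
                    pairs.add((match.group(1), match.group(2)))
    return pairs


corpus = [
    "American Airlines, a unit of AMR Corp., immediately matched the move.",
    "United, a unit of UAL Corp., said the increase took effect Thursday.",
]
print(bootstrap(corpus, {("American Airlines", "AMR Corp.")}))
```

Nothing in this loop checks whether a learned pattern still expresses the seed relation, which is how semantic drift creeps in: one overly general pattern admits wrong pairs, and those pairs generate more bad patterns in the next round.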

Temporal expressions
More regular patterns than in other tasks
Challenges:
- Coverage
- False positives
- Normalization: map to interval or point in time
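
As an illustration of normalization, the sketch below resolves bare weekday names against a document date. It assumes the news story in the running example was written on Friday 2006-10-27 (consistent with the effective date 2006-10-26 in the desired output) and that a weekday name refers to the most recent such day on or before the document date. Both the function name and that resolution policy are assumptions; real systems need many more patterns and tense cues.

```python
import datetime
import re

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]  # same order as date.weekday()
WEEKDAY_RE = re.compile(r"\b(" + "|".join(WEEKDAYS) + r")\b", re.IGNORECASE)


def normalize_weekdays(text, anchor):
    """Map each weekday mention to an ISO date on or before `anchor`."""
    results = []
    for match in WEEKDAY_RE.finditer(text):
        target = WEEKDAYS.index(match.group(1).lower())
        days_back = (anchor.weekday() - target) % 7
        day = anchor - datetime.timedelta(days=days_back)
        results.append((match.group(1), day.isoformat()))
    return results


doc_date = datetime.date(2006, 10, 27)  # assumed document date (a Friday)
text = "United Airlines said Friday the increase took effect Thursday."
print(normalize_weekdays(text, doc_date))
```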

Partial parsing
Goal: identify the basic non-recursive (N/V/A/P) phrases of a sentence (chunking): a flat, non-overlapping segmentation+labeling task
- Base phrases do not contain constituents of the same type
- Include head of phrase
- Exclude post-head modifiers
"[The morning flight]NP from [Denver]NP [has arrived]VP"

BIO encoding American Airlines, a unit of AMR Inc.

BIO encoding American Airlines, a unit of AMR Inc. [American Airlines]NP, [a unit]NP of [AMR Inc.]NP

BIO encoding American Airlines, a unit of AMR Inc. [American Airlines]NP, [a unit]NP of [AMR Inc.]NP American/B-NP Airlines/I-NP ,/O a/B-NP unit/I-NP of/O AMR/B-NP Inc./I-NP

From Trees to Chunks [American Airlines]NP, [a unit]NP of [AMR Inc.]NP

Partial parsing

PP as sequence classification

Limitations of NER 1. Washington was born at Haywood Farms near Oak Grove, Virginia. 2. Washington attended Yale Law School for a year and then studied at the University of Oxford. 3. Washington served as army chief of staff under William Tubman. 4. Historians laud Washington for his selection and supervision of his generals. 5. Washington was the founder of the town of Centralia, Washington.

Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC.

Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. 5 different people!

Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. [George] 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. [George] 3. WashingtonPER served as army chief of staff under William Tubman. [George] 4. Historians laud WashingtonPER for his selection and supervision of his generals. [George] 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. [George] With the same first name!

From strings to "things"

Polysemy A term or phrase can have multiple interpretations

Mercury?


Synonymy CAL University of California Berkeley College Univ. of Berkely U. of CA at Berkeley UC (Berkeley) Berkeley Bears U. of Ca. Berkeley U. of Cal. Bear's lair Univ. c UC Berkeley Berkeley University UCB California U. of CAL The Bear's Lair Cal Bears... The same concept can be expressed in different ways!

Entity disambiguation
Segmentation and labeling problem
Output = Wikipedia, Freebase, YAGO, etc.
- Updated and maintained by thousands of users
- Good coverage of popular topics
How?
- Classic supervised approach dubious
- Exploit social aspects (link graph)
- One example: TagMe
More @Google/ETH open house

Motivations

(a large) problem space
Entities:
- English Wikipedia: 4 million
- YAGO: 10 million
- Freebase: 40 million
- Web knowledge graphs: hundreds of millions
Facts:
- Billions of tuples: link graphs, infoboxes, Freebase relations etc.

Input

Mentions identification

Entity disambiguation

Ingredients-1
A distribution of entities (Wikipedia articles) given phrases (mentions):
- How likely is entity k given phrase w?
- Assume segmentation is given as a start, e.g., via NER (can do without)
- Estimate from Wikipedia (anchors/titles)
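
A minimal sketch of the estimation step, assuming P(k|w) is the relative frequency with which anchor text w links to article k. The function name, the lower-casing of phrases, and the toy counts are illustrative assumptions, not real Wikipedia statistics.

```python
from collections import Counter, defaultdict


def estimate_priors(anchor_links):
    """Estimate P(entity | phrase) from (anchor_text, target_article) pairs."""
    counts = defaultdict(Counter)
    for phrase, entity in anchor_links:
        counts[phrase.lower()][entity] += 1
    priors = {}
    for phrase, entity_counts in counts.items():
        total = sum(entity_counts.values())
        priors[phrase] = {k: c / total for k, c in entity_counts.items()}
    return priors


# Toy anchors (made up): three links to the person, one to the state.
toy_anchors = (
    [("Washington", "George_Washington")] * 3
    + [("Washington", "Washington_(state)")]
)
print(estimate_priors(toy_anchors))
```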

Ingredients-2
A measure of entity similarity:
- How related/mutually relevant are two entities?
- Link graph: large graph O(100M), many pairs to evaluate, efficient computation
- Google distance:
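
The formula on this slide did not survive the transcription. As a hedged stand-in, the sketch below implements the link-based relatedness of Milne and Witten, a normalized-Google-distance variant computed over the sets of pages that link to each entity; that this is the measure the slide showed is an assumption. The function name and the clamping of the result to [0, 1] are also illustrative choices.

```python
import math


def link_relatedness(inlinks_a, inlinks_b, num_pages):
    """Relatedness in [0, 1] from the pages linking to two entities.

    distance = (log max(|A|, |B|) - log |A & B|) / (log W - log min(|A|, |B|))
    relatedness = 1 - distance, clamped at 0; W is the total number of pages.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared in-links: treat as unrelated
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    denominator = math.log(num_pages) - math.log(smaller)
    if denominator <= 0.0:
        return 1.0  # degenerate case: every page links to both entities
    distance = (math.log(larger) - math.log(len(common))) / denominator
    return max(0.0, 1.0 - distance)


print(link_relatedness({1, 2, 3, 4}, {3, 4}, num_pages=16))
```

Only set sizes and one intersection are needed per pair, which matters when many candidate pairs have to be scored on a graph of this size.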

Ingredients-3
A measure of local coherence:
- Find entity assignments that obey document-level topic coherence
- Voting scheme:
- Aggregated relevance:

Basic TagMe Disambiguate mention i: 1. compute candidates: for all k such that: 2. choose the entity with highest lambda in the candidate set: NOTE: two hyperparameters: simple/efficient/effective
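
The formulas for the two steps are missing from this transcript, so the sketch below is a reading of the voting scheme rather than the lecturer's exact definition. It assumes that (1) entities with prior P(k|w) below a threshold tau are dropped, (2) every other mention votes for a candidate with the prior-weighted average relatedness of its own candidates, (3) candidates within a tolerance epsilon of the best vote are shortlisted, and (4) the shortlisted entity with the highest prior wins. The names `tau`, `epsilon`, `toy_rel` and all toy numbers are assumptions.

```python
def tagme_disambiguate(mentions, priors, rel, tau=0.02, epsilon=0.3):
    """Choose one entity per mention by collective voting (hedged sketch).

    mentions: list of distinct mention strings
    priors:   dict mention -> {entity: P(entity | mention)}
    rel:      function (entity, entity) -> relatedness in [0, 1]
    """
    # Prune entities whose prior falls below tau.
    kept = {w: {k: p for k, p in priors.get(w, {}).items() if p >= tau}
            for w in mentions}
    chosen = {}
    for w in mentions:
        votes = {}
        for k in kept[w]:
            total = 0.0
            for other in mentions:
                if other == w or not kept[other]:
                    continue
                # Vote of `other`: prior-weighted relatedness of its
                # candidates to k, averaged over those candidates.
                weighted = sum(rel(k2, k) * p2 for k2, p2 in kept[other].items())
                total += weighted / len(kept[other])
            votes[k] = total
        if not votes:
            chosen[w] = None  # no candidate survived the tau filter
            continue
        best = max(votes.values())
        shortlist = [k for k, v in votes.items() if v >= best * (1.0 - epsilon)]
        chosen[w] = max(shortlist, key=lambda k: kept[w][k])
    return chosen


# Toy relatedness table and priors (made-up numbers).
TOY_REL = {
    frozenset(["Venus", "Mercury_(planet)"]): 0.8,
    frozenset(["Venus", "Mercury_(element)"]): 0.1,
}


def toy_rel(a, b):
    return TOY_REL.get(frozenset([a, b]), 0.0)


toy_priors = {
    "mercury": {"Mercury_(planet)": 0.5, "Mercury_(element)": 0.5},
    "venus": {"Venus": 1.0},
}
print(tagme_disambiguate(["mercury", "venus"], toy_priors, toy_rel))
```

With a single mention there are no votes, every candidate ties at zero, and the choice falls back to the entity with the highest prior.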

Input

Priors on mention

Filter with tau

Compute candidates

Choose argmax P(k|w)

Eval
Evaluation on manually annotated data: CoNLL-Aida dataset
Metrics:
- micro@1: fraction of entity mentions correctly disambiguated
- macro@1: accuracy averaged by number of documents
Some results:

CoNLL-Aida test-b   Baseline   Aida-best   TagMe*
macro@1             72.74      81.66       78.21
micro@1             69.82      82.54       78.64

*My own vanilla implementation of TagMe
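
The difference between the two metrics is easiest to see in code. This sketch assumes each document is given as a pair of aligned lists (gold entities, predicted entities), one entry per mention; the function name and the toy data are illustrative.

```python
def micro_macro_at_1(documents):
    """Return (micro@1, macro@1) for a list of (gold, predicted) list pairs."""
    correct = total = 0
    per_doc_accuracy = []
    for gold, predicted in documents:
        hits = sum(g == p for g, p in zip(gold, predicted))
        correct += hits
        total += len(gold)
        if gold:  # documents without mentions do not count toward macro
            per_doc_accuracy.append(hits / len(gold))
    micro = correct / total if total else 0.0
    macro = sum(per_doc_accuracy) / len(per_doc_accuracy) if per_doc_accuracy else 0.0
    return micro, macro


# One long document that is mostly right, one short document that is wrong.
docs = [
    (["A", "B", "C", "D"], ["A", "B", "C", "X"]),
    (["E"], ["Y"]),
]
print(micro_macro_at_1(docs))
```

Micro weights every mention equally, so long documents dominate; macro weights every document equally, so a short document with one wrong mention pulls the score down as much as a long one.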

Applications Summarization? Choose sentences/snippets containing important entities in the document? Need data annotation...

Summary
- Information Extraction: NER and related tasks
- Segmentation & Labeling: Models, Features
- Shallow parsing
- Entity disambiguation
- Hand out: mid term exam
Next class: lexical semantics and LDA

References
- Book: J&M: Ch. 13.5 (Partial parsing), Ch. 22 (Information extraction)
- Entity disambiguation:
  - TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). Ferragina & Scaiella, CIKM 2010
  - Robust Disambiguation of Named Entities in Text. Hoffart et al., EMNLP 2011
- Tools for the project: Stanford CoreNLP, NLTK, LingPipe, OpenNLP, etc.