WP7 Multiple Domains
Presented by: Jacek Maślankowski, Statistics Poland; Sónia Quaresma, INE Portugal
Carried out by five partners: Poland (leader), Ireland, Netherlands, Portugal (second wave) and United Kingdom
14-15.05.2018, Sofia
AGENDA
- Brief overview of the results
- Main findings
- Future perspectives
Three statistical domains
- Population: PL, PT, UK, NL
- Agriculture: PL, IE
- Tourism: PL, NL, PT
Intra- vs. inter-domain data combining
Population
- Life Satisfaction by Twitter (Github)
- Everyday Citizen Satisfaction by Facebook (Github)
- Morbidity areas and personal well-being with Google Trends by ONS UK (wiki)
- Life Satisfaction classification used
Population: Life satisfaction replicating
Population: replicating the Portuguese experience
- Approximately 3000 tweets posted from mainland Portugal were extracted.
- A list of synonyms for each sentiment (1.happy, 3.calm, 4.upset, 5.depressed and 6.discouraged) was used as key-words on api.search. Category 7.indeterminate was discarded.
- Geocode was used to extract tweets exclusively from Portugal, thus excluding other countries where Portuguese is an official language (such as Brazil).
- The training test resulted in a precision of 88%, with the 2.neutral category having the worst performance.
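The key-word step and the precision figure above can be illustrated with a minimal sketch. It is not the pilot's code (that is shared on Github): the synonym lists below are hypothetical placeholders, the fallback to 2.neutral when no key-word matches is an assumption, and the precision helper assumes a hand-labelled sample is available.

```python
from collections import Counter

# Hypothetical synonym lists; the pilot used its own Portuguese key-words.
SYNONYMS = {
    "1.happy": ["feliz", "contente"],
    "3.calm": ["calmo", "tranquilo"],
    "4.upset": ["chateado", "irritado"],
    "5.depressed": ["deprimido"],
    "6.discouraged": ["desanimado"],
}


def label_tweet(text):
    """Return the first category with a key-word among the whitespace-separated words.

    Assumption: a tweet without any key-word falls back to 2.neutral.
    """
    words = text.lower().split()
    for category, keywords in SYNONYMS.items():
        if any(keyword in words for keyword in keywords):
            return category
    return "2.neutral"


def precision_by_category(true_labels, predicted_labels):
    """Precision per predicted category: correct predictions / all predictions."""
    predicted = Counter(predicted_labels)
    correct = Counter(p for t, p in zip(true_labels, predicted_labels) if t == p)
    return {category: correct[category] / predicted[category] for category in predicted}


# Usage with made-up tweets and made-up manual labels.
tweets = ["Hoje estou feliz", "estou muito irritado", "nada de especial"]
predicted = [label_tweet(t) for t in tweets]
manual = ["1.happy", "4.upset", "3.calm"]
print(precision_by_category(manual, predicted))
```

Computing precision per category, rather than one overall figure, is what makes it possible to single out the weakest category, as the slide does for 2.neutral.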
Population: Life satisfaction results
Population (cont.)
- Morbidity areas by ONS UK
- Everyday citizen satisfaction by ONS UK
Tourism
- Tourism accommodation establishments (data sources: various portals)
- Border movement (road traffic, air traffic, train traffic)
Tourism (accommodation establishments)
Data sources: tripadvisor.com, agroturystyka.pl, hotels.com, booking.com
Main findings:
- It is difficult to determine whether a given object meets the mandatory requirements of the official classification for a given type of facility.
- This data has to be combined with the official register of tourism accommodation establishments.
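Combining scraped listings with the official register could look roughly like the sketch below. It is an illustration under stated assumptions, not the pilot's method: the field names (`name`, `city`, `official_type`) and the example records are hypothetical, and matching on a normalized (name, city) key is only the simplest possible linkage rule.

```python
def normalize(text):
    """Case-fold and keep only letters and digits so minor spelling variants match."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def link_to_register(scraped, register):
    """Attach the official type to each scraped record that matches on (name, city).

    Returns (linked, unmatched); unmatched records still need manual or fuzzy review.
    """
    index = {(normalize(r["name"]), normalize(r["city"])): r for r in register}
    linked, unmatched = [], []
    for record in scraped:
        key = (normalize(record["name"]), normalize(record["city"]))
        match = index.get(key)
        if match is None:
            unmatched.append(record)
        else:
            linked.append({**record, "official_type": match["official_type"]})
    return linked, unmatched


# Hypothetical example records.
scraped = [
    {"name": "Hotel Pod Lasem", "city": "Zakopane"},
    {"name": "Agro-Chata", "city": "Sopot"},
]
register = [{"name": "HOTEL POD LASEM", "city": "zakopane", "official_type": "hotel"}]
linked, unmatched = link_to_register(scraped, register)
print(linked, unmatched)
```

Records left in `unmatched` are exactly the objects for which the official type of facility cannot be determined from the portal data alone.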
Tourism (accommodation establishments)
Classification is an issue.
[Figure: Price category of agritourism accommodation establishments. Categories: 20 PLN and less; 20.01-40 PLN; 40.01-60 PLN; 60.01-80 PLN; 80.01-100 PLN; 100.01 PLN and more. Shares as extracted from the chart: 2%, 1%, 2%, 14%, 8%, 73%.]
Tourism (internal border crossing): road traffic
[Figure: quarterly road traffic at Słubice, quarters I-IV of 2003-2007; y axis from 0 to 1,100,000; series labels: BASt, Border Guards]
Country          Vehicles from BigData   Coverage
Czech Republic   29 575 585              91%
Lithuania        3 691 245               64%
Germany          51 158 035              75%
Slovakia         7 485 785               49%
People / Vehicles from BigData and People / Vehicles from survey (values in extracted order): 1.88 2.09 2.13 4.19 1.72 1.34 1.60 2.04
Tourism (internal border crossing): air traffic
Web scraping of:
- Origin and destination airport (IATA and ICAO codes)
- Flight number
- Type of the aircraft
- Date of arrival and departure
[Figure: shares by destination, y axis from 0% to 40%; series: "BigData and Sample Survey" and "Sample Survey Solely". Destinations: Dominican Republic, Mexico, Cuba, Brazil, Argentina, Chile, Colombia, Panama, Curacao, Sint Maarten, Suriname, Venezuela, Jamaica, Martinique, Ecuador, Guadeloupe, Aruba, Caribbean Netherlands, Bahamas, Trinidad and Tobago, Puerto Rico, Barbados, Costa Rica]
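The scraped attributes listed above map naturally onto a small record type. The sketch below is only an illustration: the class name, field names, flight numbers and the airport-to-country lookup are assumptions, and counting flights (not passengers) per destination country is a simplification of whatever the pilot actually computed.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flight:
    """One scraped flight; field names are hypothetical."""
    origin_iata: str
    origin_icao: str
    destination_iata: str
    destination_icao: str
    flight_number: str
    aircraft_type: str
    departure: date
    arrival: date


# Hypothetical lookup from destination IATA code to country.
AIRPORT_COUNTRY = {"PUJ": "Dominican Republic", "CUN": "Mexico"}


def destination_shares(flights):
    """Share of flights per destination country (flights, not passengers)."""
    counts = Counter(AIRPORT_COUNTRY.get(f.destination_iata, "unknown") for f in flights)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


# Made-up flights for illustration only.
flights = [
    Flight("AMS", "EHAM", "PUJ", "MDPC", "XX100", "B788", date(2017, 7, 1), date(2017, 7, 1)),
    Flight("AMS", "EHAM", "PUJ", "MDPC", "XX102", "B788", date(2017, 7, 2), date(2017, 7, 2)),
    Flight("AMS", "EHAM", "CUN", "MMUN", "XX200", "A332", date(2017, 7, 3), date(2017, 7, 3)),
]
print(destination_shares(flights))
```

Keeping the aircraft type on each record leaves room to weight flights by seat capacity later, which is one plausible way to move from flights to travellers.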
Tourism (internal border crossing): train traffic
Agriculture
Crop types identification by PL and IE
Methodology:
- raw satellite image processing
- segmentation of processed data
- object based image classification (machine learning algorithms)
Three data sources:
- satellite data from Sentinel-1 and Sentinel-2
- in-situ survey
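The idea behind object based classification, as opposed to pixel based, is that features are first aggregated per segment and the classifier then labels whole segments. The sketch below illustrates only that idea. The slide does not name the machine learning algorithm used, so the nearest-centroid rule here is a stand-in, and the feature values, segment ids and crop centroids are hypothetical (in practice the training labels would come from the in-situ survey).

```python
from collections import defaultdict
from math import dist


def segment_means(pixels):
    """Average the feature vectors of all pixels that share a segment id.

    pixels: iterable of (segment_id, feature_vector) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for segment_id, features in pixels:
        if segment_id not in sums:
            sums[segment_id] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[segment_id][i] += value
        counts[segment_id] += 1
    return {s: [v / counts[s] for v in sums[s]] for s in sums}


def nearest_centroid(features, centroids):
    """Return the crop whose centroid is closest in Euclidean distance (stand-in classifier)."""
    return min(centroids, key=lambda crop: dist(features, centroids[crop]))


# Hypothetical two-band features per pixel, already assigned to segments.
pixels = [(1, [0.8, 0.1]), (1, [0.6, 0.3]), (2, [0.1, 0.9])]
# Hypothetical per-crop centroids, e.g. estimated from surveyed fields.
centroids = {"wheat": [0.7, 0.2], "maize": [0.2, 0.8]}

labels = {s: nearest_centroid(f, centroids) for s, f in segment_means(pixels).items()}
print(labels)
```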
Agriculture
Intra-domain data combining: Agritourism
Impact of rural areas on the development of agritourism places
Three data sources:
- Web scraping of agritourist lodgings
- Satellite data on agricultural fields
- Land and buildings register at NUTS 5 level
Output:
- agritourist lodgings are mostly located in hilly areas
- their number is moderately correlated with the area of forest land, meadows and pastures, and weakly correlated with the area of arable land, lakes and rivers
- the number of agritourists is mostly related to the area of lakes and rivers, then to forest land; it is negatively correlated with the area of urban and arable land
- we may conclude that the conditions for agritourism development are not the same as for agriculture
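Statements such as "moderately correlated" or "negatively correlated" rest on a correlation coefficient computed across territorial units. The slide does not say which coefficient was used; the sketch below assumes Pearson's r, and all numbers in it are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values per NUTS 5 unit (not taken from the pilot).
lodgings = [3, 8, 1, 12, 5]            # scraped agritourist lodgings
forest_area = [120, 300, 80, 410, 150] # forest land, hectares
urban_area = [90, 40, 110, 20, 70]     # urban land, hectares

print(round(pearson(lodgings, forest_area), 2))
print(round(pearson(lodgings, urban_area), 2))
```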
Intra-domain data combining
Lasso-based estimates on agritourism in Poland in the third quarter of 2017
NUTS 2 region         Nights spent   Visitors   Expenditures on accommodation
dolnośląskie          157299         49154      14679858
kujawsko-pomorskie    136604         36714      10747222
lubelskie             70444          22043      5726209
lubuskie              128360         21880      5419688
łódzkie               59530          11610      3784183
małopolskie           287019         93947      31192865
mazowieckie           8560           13869      174132
opolskie              98382          13508      5796491
podkarpackie          236122         73300      23019659
podlaskie             42345          4429       443955
pomorskie             187210         51759      14071677
śląskie               103083         20885      6587026
świętokrzyskie        100661         16825      6588852
warmińsko-mazurskie   190729         60776      15414115
wielkopolskie         91135          32685      6505026
zachodniopomorskie    226147         68740      17433312
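For readers unfamiliar with Lasso, the sketch below shows the estimator in its simplest form: least squares with an L1 penalty, solved by coordinate descent with soft-thresholding. It is a generic textbook illustration, not the model behind the table above. The objective assumed is (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept, and the data are made up.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the residual that ignores feature j
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Made-up data: y depends only on the first feature (y = 2 * x1).
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.1], [4.0, -0.2]]
y = [2.0, 4.0, 6.0, 8.0]

print(lasso_coordinate_descent(X, y, alpha=0.0))  # no penalty: ordinary least squares
print(lasso_coordinate_descent(X, y, alpha=1.0))  # penalty shrinks the first weight
```

The penalty shrinks coefficients and sets irrelevant ones exactly to zero, which is why Lasso is attractive when many candidate predictors (for example land-use areas or scraped attributes) compete to explain a small number of regional totals.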
Main findings
- Population domain data sources according to data quality and accessibility.
- Two use cases in the Tourism domain can be implemented with success: tourism accommodation establishments and border movement.
- The Agriculture domain has one successful project: crop types identification, which is compliant with the in-situ survey on crop types.
- The Population domain in Big Data is strongly related to social statistics.
How to use WP7 Experience in Official Statistics?
- Life satisfaction pilots are shared on Github.
- Tourism accommodation establishments can be implemented with the use of web scraping methods.
- The border movement pilot can be implemented with a set of scripts using entropy econometrics.
- The Agriculture domain can be implemented with two approaches, by Statistics Poland and Statistics Ireland.
Results
(1) Life satisfaction of social media users based on Twitter, by countries
    Countries involved: PL, PT, UK
    Classification used: 1.happy, 2.neutral, 3.calm, 4.upset, 5.depressed and 6.discouraged, according to EU-SILC classification
(2) Tourism accommodation establishments by cities
    Countries involved: PL
    Classification used: price category, type of the property, facilities, city, country
(3) Border movement
    Countries involved: PL
    Classification used: type of transportation, country
(4) Crop types identification
    Countries involved: PL, IE
    Classification used: crop type, region, country
Future perspectives (1/2)
- Mobile Call Records, Call Detail Records: reliable information on Population (e.g., day and night population, number of commuters) as well as on the Tourism domain (e.g., number of tourists).
- Tourism accommodation establishments: classification.
- Border movement: add additional data sources, e.g., AIS data.
- Agriculture: extending to all regions.
Future perspectives (2/2)
- Day and night population
- Population commuting to work
- Mobile phone data
- Population spending spare time (e.g., attending cultural and sport events like in EU-SILC)
- Population consumer confidence
Presenters: Jacek Maślankowski, Statistics Poland; Sónia Quaresma, INE Portugal
Statistics Poland (leader), CBS NL, CSO IE, INE PT, ONS UK
14-15.05.2018, Sofia