Sustaining quality of services through service reliability and availability
  Karthikh Pandian, Chinnathurai
  Cognizant Technology Solutions
  Karthikh.pandian@cognizant.com
  January-4-17
  ASTR 2016, Sep 9-11, St. Cambridge, MA
Agenda
  Industry Challenges
  Case organization & Problem
  Approach
  Findings
  Recommendations
  Next steps
www.ieee-astr.org September 28-30 2016, Pensacola Beach, Florida
Market Overview
  Consumer Electronics Industry Challenges
    Evolving technology
    Cost
    People
  Expectations for user experience have risen significantly
  A growing number of suppliers provide very specialized and cost-effective services
  Constrained resources and budgets leave no room to go the extra mile
What Do the Analysts Say?
  An estimated 200,000-plus new products are launched annually; 80% of products fail (Forbes)
  Most experts consider quality to be among the leading contributors to product failure
Release Drivers
  Pressure for speed to market remains intense, which pushes organizations to release premature products and leaves the reliability testing to customers.
  Inherent interdependency of services
  Business value drivers
  Schedule and budget constraints
  Technical feasibility
Current Industry - IT Transformation
  Continuous Delivery
  DevOps
    Enabling a predictable, routine cycle that can be performed on demand
  Digital
Case organization
  Advanced TV service provider of consumer digital video recorder (DVR) related products
  Multi-million subscriber base across geographies
  More than 10 multi-system operators (MSOs) / channel partners
  Features: scheduling, advanced searches, television show downloads, etc.
Problem statement
  Approx. $0.5 M per year is spent on release-related activities, of which over 20% is paid as penalty for downtime
  Multiple service layers overlap and are synchronized, causing interdependencies
  Complex rollback plans
  [Diagram labels: consumer DVR software stack release, release branches, service releases, multi-system operator network]
Approach
  Develop additional assets to monitor and accurately report outages
  Develop a framework on historical data for service outages
  Analyze data per geography / channel partner and per service / feature
  Develop the release and build of the weakest link to ensure containment of an outage
  Identify the least reliable services
What do we measure?
  "Measurement is the first step that leads to control and eventually to improvement. If you can't measure something, you can't understand it. If you can't understand it, you can't control it. If you can't control it, you can't improve it."
  - H. James Harrington
Service availability and reliability, customized
  Service Reliability = Total uptime (days) / Number of service interruptions (incidents)
  Service Availability = (Total available time - Downtime) / Total available time
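The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the case organization's tooling: the function names, the handling of a period with zero incidents, and the sample figures (a 90-day period with 3 incidents and 0.5 days of downtime) are assumptions made for the example.

```python
def service_reliability(total_uptime_days: float, interruptions: int) -> float:
    """Days of uptime per service interruption (Total uptime / Number of incidents)."""
    if interruptions < 0:
        raise ValueError("interruptions must not be negative")
    if interruptions == 0:
        # Assumption: with no incidents the ratio is undefined; report infinity.
        return float("inf")
    return total_uptime_days / interruptions


def service_availability(total_available_time: float, downtime: float) -> float:
    """Fraction of time the service was up: (Total available time - Downtime) / Total available time."""
    if total_available_time <= 0:
        raise ValueError("total_available_time must be positive")
    return (total_available_time - downtime) / total_available_time


# Hypothetical figures: 90-day period, 0.5 days of downtime, 3 incidents.
period_days = 90.0
downtime_days = 0.5
incidents = 3

reliability = service_reliability(period_days - downtime_days, incidents)
availability = service_availability(period_days, downtime_days)

print(f"Reliability: {reliability:.2f} days per incident")  # 29.83
print(f"Availability: {availability:.2%}")                  # 99.44%
```

Both times passed to `service_availability` must use the same unit (days here, but hours or minutes work equally well).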
Level of service impact
  A very common scenario is that some services are completely unaffected
  Availability & Reliability
  ITSM tool
  MSO / channel partner services (e.g. provisioning, staging)
  Customer services (e.g. search and browse, TVE)
Reporting Framework (1 of 2)
  Calculation logic:
    Service level uptime (%): based on the sum of that service's downtime
    Overall service uptime (%): based on the sum of all services' downtime
    Reliability (days), P1: based on the number of P1 service interruptions
    Reliability (days), P1+P2: based on the number of P1+P2 service interruptions
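One way the calculation logic could be turned into a report is sketched below. The `Incident` record, the `build_report` function, the service names and the sample data are all hypothetical. The sketch also assumes that overall uptime is measured against the reporting period multiplied by the number of services, which the slides do not state.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str           # affected service (hypothetical names below)
    priority: str          # "P1" or "P2"
    downtime_hours: float  # outage duration attributed to this incident


def uptime_pct(available_hours: float, downtime_hours: float) -> float:
    """Uptime as a percentage of the available time."""
    return 100.0 * (available_hours - downtime_hours) / available_hours


def build_report(incidents, services, period_days):
    """Return (per-service metrics, overall uptime %) for one reporting period."""
    period_hours = period_days * 24.0
    report = {}
    for service in services:
        own = [i for i in incidents if i.service == service]
        downtime = sum(i.downtime_hours for i in own)
        uptime_days = (period_hours - downtime) / 24.0
        p1 = sum(1 for i in own if i.priority == "P1")
        p1_p2 = sum(1 for i in own if i.priority in ("P1", "P2"))
        report[service] = {
            "uptime_pct": uptime_pct(period_hours, downtime),
            # None marks "no interruptions of that priority in the period".
            "reliability_p1_days": uptime_days / p1 if p1 else None,
            "reliability_p1_p2_days": uptime_days / p1_p2 if p1_p2 else None,
        }
    all_downtime = sum(i.downtime_hours for i in incidents if i.service in services)
    # Assumption: total available time = period x number of services.
    overall = uptime_pct(period_hours * len(services), all_downtime)
    return report, overall


# Hypothetical 30-day sample.
sample_incidents = [
    Incident("search", "P1", 12.0),
    Incident("search", "P2", 12.0),
    Incident("provisioning", "P2", 24.0),
]
report, overall = build_report(sample_incidents, ["search", "provisioning"], 30)
for name, row in report.items():
    print(name, row)
print("overall uptime %:", round(overall, 2))
```

Keeping P1 and P1+P2 reliability as separate columns, as on the slide, shows whether a service breaks rarely but severely or often but mildly.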
Reporting Framework (2 of 2)
  Sample table
Key services performance
  Identify the services that are key to the business, based on the business mission of the organization
  [Chart: key services performance; axis unit: %]
Performance Trend: Service 3
  Trend: service breakage showed up after every major release
  [Charts; axis units: days, %, %]
Some truths that apply
  "The significant problems we face cannot be solved with the same level of thinking we were at when we created them."
  - Albert Einstein
Root cause analysis (1 of 3)
  Significant increase in load due to reboots
    Mass reboots of devices increased the number of calls made to the services, causing an overload that resulted in service outages
  Single point of failure
    The load balancer diverted traffic to a single server rather than distributing the load
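The effect of the second root cause can be illustrated with a toy comparison of peak per-server load. This is a sketch under stated assumptions: the server names, the request count, and the choice of round-robin as the distribution policy are illustrative only, and the slides do not say which balancing policy was used or recommended.

```python
from collections import Counter
from itertools import cycle


def route_all_to_one(requests: int, servers: list) -> Counter:
    """Misconfigured balancer: every request lands on the first server."""
    return Counter({servers[0]: requests})


def route_round_robin(requests: int, servers: list) -> Counter:
    """Simple round-robin: requests are handed to the servers in turn."""
    load = Counter()
    targets = cycle(servers)
    for _ in range(requests):
        load[next(targets)] += 1
    return load


# Hypothetical burst: 10,000 calls from rebooting devices, 4 servers.
servers = ["srv-a", "srv-b", "srv-c", "srv-d"]
burst = 10_000

single = route_all_to_one(burst, servers)
spread = route_round_robin(burst, servers)

print("peak load, single server:", max(single.values()))  # 10000
print("peak load, round-robin:", max(spread.values()))    # 2500
```

With the same burst, the peak load on any one server drops by a factor equal to the number of servers once the traffic is actually distributed.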
Root cause analysis (2 of 3)
  Monitoring of released software
    Continuous monitoring of restart rates
    Automated workflows assist robust continuous delivery
  Impact assessment
    Change management and road mapping
    Documentation of system configuration and interdependencies
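Continuous monitoring of restart rates could be as simple as comparing the share of devices that restarted in each interval against a threshold. The sketch below is only an illustration: the function name, the hourly interval, the 5% threshold and the sample counts are assumptions, not values from the case study.

```python
def restart_rate_alerts(restarts_per_hour, device_count, threshold=0.05):
    """Return (hour index, restart rate) for every hour whose rate exceeds the threshold.

    The rate is the number of restarts in the hour divided by the device population.
    """
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    alerts = []
    for hour, restarts in enumerate(restarts_per_hour):
        rate = restarts / device_count
        if rate > threshold:
            alerts.append((hour, rate))
    return alerts


# Hypothetical data: 100,000 devices, a restart spike in the third hour after a release.
hourly_restarts = [120, 150, 9000, 200]
alerts = restart_rate_alerts(hourly_restarts, device_count=100_000)

for hour, rate in alerts:
    print(f"hour {hour}: restart rate {rate:.1%} exceeds threshold")
```

Wired into an automated release workflow, an alert like this could pause a rollout before a mass reboot overloads the services.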
Root cause analysis (3 of 3)
  Mirror staging and production environments
    Gaps between the production and staging environments prevent teams from identifying issues ahead of time
  QA environment specific to partner and geography
    Inability to reproduce a partner environment
Recommendations
  Distributing software downloads
  Load balancer configuration based on Core/Non-Core
  Beta & Alpha testing
  Automation in the release process
  QA signoff on the staging environment
  Test labs
What Do I Do Next? A Few Tips
  Zero-downtime deployments
    Active and passive environments to release software without affecting users
  Real-time performance validation
    Service validation based on patterns of service and usage
  Collaboration between product managers
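The active/passive idea behind zero-downtime deployments can be sketched as follows. The class and method names, the environment names "blue" and "green", and the health-check callback are hypothetical. A real setup would switch traffic at the load balancer or DNS level rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str


class ActivePassiveRouter:
    """Users are always served by `active`; releases go to `passive` first."""

    def __init__(self, active: Environment, passive: Environment):
        self.active = active
        self.passive = passive

    def release(self, new_version: str,
                is_healthy: Callable[[Environment], bool]) -> bool:
        """Deploy to the passive environment and swap only if it validates."""
        self.passive.version = new_version
        if not is_healthy(self.passive):
            # Users never saw the new version; the active side is untouched.
            return False
        # Swap roles: the old active environment becomes the rollback target.
        self.active, self.passive = self.passive, self.active
        return True


router = ActivePassiveRouter(Environment("blue", "1.0"), Environment("green", "1.0"))

ok = router.release("1.1", is_healthy=lambda env: True)
print(ok, router.active)    # green now serves 1.1, blue keeps 1.0 for rollback

failed = router.release("1.2", is_healthy=lambda env: False)
print(failed, router.active)  # validation failed, green 1.1 keeps serving
```

Because the previous version stays deployed on the passive side, a rollback is a second swap rather than a redeployment, which speaks to the complex rollback plans named in the problem statement.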
Questions?
  Karthikh, Chinnathurai
  Karthikh.pandian@cognizant.com