New Project Flash Update: Advancing Azure Virtual Machine availability monitoring | Azure Blog and Updates

“Earlier this year, we introduced Project Flash in the Advancing Reliability blog series, to reaffirm our commitment to empowering Azure customers in monitoring virtual machine (VM) availability in a robust and comprehensive manner. Today, we’re excited to share the progress we’ve made since then in developing holistic monitoring offerings to meet customers’ distinct needs. I’ve asked Senior Technical Program Manager, Pujitha Desiraju, from the Azure Core Production Quality Engineering team to share the latest investments as part of Project Flash, to deliver the best monitoring experience for customers.” - Mark Russinovich, CTO, Azure.


Flash, as the project is internally known, is a collection of efforts across Azure Engineering that aims to evolve Azure’s virtual machine (VM) availability monitoring ecosystem into a centralized, holistic, and intelligible solution customers can rely on to meet their specific observability needs. As part of this multi-year endeavor, we’re excited to announce the:

  • General availability of VM availability information in Azure Resource Graph for efficient and at-scale monitoring, convenient for detailed downtime investigations and impact assessment.
  • Public preview of a VM availability metric in Azure Monitor for quick debugging, trend analysis of VM availability over time, and setting up threshold-based alerts on scenarios that impact workload performance.
  • Preview of VM availability status change events via Azure Event Grid for instantaneous notifications on critical changes in VM availability, to quickly trigger remediation actions to prevent end-user impact.

We remain committed to maintaining data consistency and similarly rigorous quality standards across all the monitoring solutions that are part of Flash, including existing solutions like Resource Health or Activity Log, so we deliver a consistent and cohesive experience to customers.

VM availability information in Azure Resource Graph for at-scale analysis

In addition to already flowing VM availability states, we recently published VM health annotations to Azure Resource Graph (ARG) for detailed failure attribution and downtime analysis, along with enabling a 14-day change tracking mechanism to trace historical changes in VM availability for quick debugging. With these new additions, we’re excited to announce the general availability of VM availability information in the HealthResources dataset in ARG! With this offering users can:

  • Efficiently query the latest snapshot of VM availability across all Azure subscriptions at once and at low latencies for periodic and fleetwide monitoring.
  • Accurately assess the impact to fleetwide business SLAs and quickly trigger decisive mitigation actions, in response to disruptions and type of failure signature.
  • Set up custom dashboards to supervise the comprehensive health of applications by joining VM availability information with additional resource metadata present in ARG.
  • Track relevant changes in VM availability across a rolling 14-day window, by using the change-tracking mechanism for conducting detailed investigations.

Getting started

Users can query ARG via PowerShell, REST API, Azure CLI, or even the Azure Portal. The following steps detail how data can be accessed from Azure Portal.

  1. Once on the Azure Portal, navigate to Resource Graph Explorer, which will look like the image below:

Figure 1: Azure Resource Graph Explorer landing page on Azure Portal.

  2. Select the Table tab and (single) click on the HealthResources table to retrieve the latest snapshot of VM availability information (availability state and health annotations).

Portal view of Azure Resource Graph displaying both VM availability states and annotations across all resources at once in the results window, along with showcasing the 2 event types in the HealthResources table.

Figure 2: Azure Resource Graph Explorer Window depicting the latest VM availability states and VM health annotations in the HealthResources table.
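For readers who prefer the REST route mentioned earlier over the portal, here is a minimal sketch of how a HealthResources query could be packaged as a Resource Graph request. It only builds the URL and JSON body; authentication and the HTTP call are left out. The API version, the projected column names, and the helper name build_arg_request are assumptions for illustration, so check them against the current Resource Graph documentation before use.

```python
import json

# Assumed API version for the Resource Graph "resources" endpoint; verify before use.
ARG_API_VERSION = "2021-03-01"
ARG_URL = (
    "https://management.azure.com/providers/Microsoft.ResourceGraph/resources"
    f"?api-version={ARG_API_VERSION}"
)

# Kusto query over the HealthResources table: latest availability state per VM.
AVAILABILITY_QUERY = """
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| project targetResourceId = tostring(properties.targetResourceId),
          availabilityState = tostring(properties.availabilityState),
          previousAvailabilityState = tostring(properties.previousAvailabilityState),
          occurredTime = todatetime(properties.occurredTime)
""".strip()


def build_arg_request(subscription_ids, query=AVAILABILITY_QUERY):
    """Return (url, json_body) for a query scoped to the given subscriptions.

    Hypothetical helper: it prepares the request but does not send it.
    """
    if not subscription_ids:
        raise ValueError("at least one subscription id is required")
    body = {"subscriptions": list(subscription_ids), "query": query}
    return ARG_URL, json.dumps(body)


# Placeholder subscription id, as used in the samples in this post.
url, body = build_arg_request(["<subscriptionId>"])
print(url)
print(body)
```

The body would be sent as an authenticated POST to the printed URL; passing several subscription ids in one request is what makes the fleetwide snapshot a single call.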

There will be two types of events populated in the HealthResources table:

Portal view of the left-hand pane in Azure Resource Graph displaying the 2 types of events within the HealthResources table along with the type of all fields embedded within each type.

 

Figure 3: Snapshot of the type of events present in the HealthResources table, as shown in Resource Graph Explorer on the Azure Portal.

The microsoft.resourcehealth/availabilitystatuses event denotes the latest availability status of a VM, based on the health checks performed by the underlying Azure platform. Below are the availability states we currently emit for VMs:

  • Available: The VM is up and running as expected.
  • Unavailable: We’ve detected disruptions to the normal functioning of the VM and therefore applications will not run as expected.
  • Unknown: The platform is unable to accurately detect the health of the VM. Users can usually check back in a few minutes for an updated state.

To poll the latest VM availability state, refer to the properties field, which contains the details below:

Sample

{
      "targetResourceType": "Microsoft.Compute/virtualMachines",
      "previousAvailabilityState": "Available",
      "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
      "occurredTime": "2022-10-11T11:13:59.9570000Z",
      "availabilityState": "Unavailable"
}

Property descriptions

| Field | Description | Corresponding RHC field |
| --- | --- | --- |
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource Id | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| previousAvailabilityState | Previous availability state of the VM | previousHealthStatus |
| availabilityState | Current availability state of the VM | currentHealthStatus |
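Teams that already consume RHC-style field names may want to translate the ARG properties before feeding them into existing tooling. The sketch below applies the field mapping from the table above and also flags a fresh transition into Unavailable. The function names are hypothetical, and passing unknown keys through unchanged is a design choice made here, not documented platform behavior.

```python
# Mapping from ARG property names to the corresponding RHC field names,
# taken from the property-description table above.
ARG_TO_RHC = {
    "targetResourceType": "resourceType",
    "targetResourceId": "resourceId",
    "occurredTime": "eventTimestamp",
    "previousAvailabilityState": "previousHealthStatus",
    "availabilityState": "currentHealthStatus",
}


def to_rhc_fields(properties):
    """Rename known ARG keys to RHC names; unknown keys pass through unchanged."""
    return {ARG_TO_RHC.get(key, key): value for key, value in properties.items()}


def is_new_downtime(properties):
    """True when the VM moved from Available to Unavailable."""
    return (
        properties.get("previousAvailabilityState") == "Available"
        and properties.get("availabilityState") == "Unavailable"
    )


# Same shape as the sample properties payload shown earlier.
sample = {
    "targetResourceType": "Microsoft.Compute/virtualMachines",
    "previousAvailabilityState": "Available",
    "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>"
    "/providers/Microsoft.Compute/virtualMachines/<VMName>",
    "occurredTime": "2022-10-11T11:13:59.9570000Z",
    "availabilityState": "Unavailable",
}

print(to_rhc_fields(sample)["currentHealthStatus"])
print(is_new_downtime(sample))
```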

Refer to this doc for a list of starter queries to further explore this data.

The microsoft.resourcehealth/resourceannotations event contextualizes any changes to VM availability, by detailing important failure attributes to help users investigate and mitigate the disruption as needed. See the full list of VM health annotations emitted by the platform.

These annotations can be broadly classified into three buckets:

  • Downtime Annotations: These annotations are emitted when the platform detects VM availability transitioning to Unavailable (for example, during unexpected host crashes or rebootful repair operations).
  • Informational Annotations: These annotations are emitted during control plane activities with no impact to VM availability (such as VM allocation/Stop/Delete/Start). Usually, no additional customer action is required in response.
  • Degraded Annotations: These annotations are emitted when VM availability is detected to be at risk (for example, when failure prediction models predict a degraded hardware component that can cause the VM to reboot at any given time). We strongly urge users to redeploy by the deadline specified in the annotation message, to avoid any unanticipated loss of data or downtime.

To poll the associated VM health annotations for a resource, if any, refer to the properties field, which contains the following details:

Sample

{
     "targetResourceType": "Microsoft.Compute/virtualMachines",
     "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
     "annotationName": "VirtualMachineHostRebootedForRepair",
     "occurredTime": "2022-09-25T20:21:37.5280000Z",
     "category": "Unplanned",
     "summary": "We're sorry, your virtual machine isn't available because of an unexpected failure on the host server. Azure has begun the auto-recovery process and is currently rebooting the host server. No additional action is required from you at this time. The virtual machine will be back online after the reboot completes.",
     "context": "Platform Initiated",
     "reason": "Unexpected host failure"
}

Property descriptions

| Field | Description | Corresponding RHC field |
| --- | --- | --- |
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource Id | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| annotationName | Name of the annotation emitted | eventName |
| reason | Brief overview of the availability impact observed by the customer | title |
| category | Denotes whether the platform activity triggering the annotation was either planned maintenance or unplanned repair. This field is not applicable to customer/VM-initiated events. Possible values: Planned, Unplanned, Not Applicable, Null | category |
| context | Denotes whether the activity triggering the annotation was due to an authorized user or process (customer-initiated), due to the Azure platform (platform-initiated), or even due to activity in the guest OS that has resulted in availability impact (VM-initiated). Possible values: Platform-initiated, User-initiated, VM-initiated, Not Applicable, Null | context |
| summary | Statement detailing the cause for annotation emission, along with remediation steps that can be taken by users | summary |
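One practical use of the category and context fields is to separate unplanned, platform-initiated disruptions from everything else. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, the second annotation in the list is invented purely for illustration, and the normalization step exists because the sample payload spells the context as "Platform Initiated" while the table lists "Platform-initiated".

```python
def _normalize(value):
    """Lower-case and drop spaces/hyphens so 'Platform Initiated' matches 'Platform-initiated'."""
    if value is None:
        return ""
    return value.replace("-", "").replace(" ", "").lower()


def is_unplanned_platform_event(annotation):
    """True for annotations whose category is Unplanned and context is platform-initiated."""
    return (
        _normalize(annotation.get("category")) == "unplanned"
        and _normalize(annotation.get("context")) == "platforminitiated"
    )


annotations = [
    {
        # Taken from the sample payload above.
        "annotationName": "VirtualMachineHostRebootedForRepair",
        "category": "Unplanned",
        "context": "Platform Initiated",
        "reason": "Unexpected host failure",
    },
    {
        # Hypothetical annotation, made up for this example only.
        "annotationName": "ExampleUserInitiatedStop",
        "category": "Not Applicable",
        "context": "User-initiated",
        "reason": "Example only",
    },
]

needs_review = [a["annotationName"] for a in annotations if is_unplanned_platform_event(a)]
print(needs_review)
```

Filtering on these two fields keeps routine customer-initiated operations out of an on-call queue while still surfacing host-level failures.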

Refer to this doc for a list of starter queries to further explore this data.

Looking ahead to 2023, we have multiple enhancements planned for the annotation metadata that is surfaced in the HealthResources dataset. These enrichments will give users access to richer failure attributes to decisively prepare a response to a disruption. In parallel, we aim to extend the duration of historical lookback to a minimum of 30 days so users can comprehensively track past changes in VM availability.

VM availability metric in Azure Monitor Preview

We’re excited to share that the out-of-box VM availability metric is now available as a public preview for all users! This metric displays the trend of VM availability over time, so users can:

  • Set up threshold-based metric alerts on dipping VM availability to quickly trigger appropriate mitigation actions.
  • Correlate the VM availability metric with existing platform metrics like memory, network, or disk for deeper insights into concerning changes that impact the overall performance of workloads.
  • Easily interact with and chart metric data across any relevant time window on Metrics Explorer, for quick and easy debugging.
  • Route metrics to downstream tooling like Grafana dashboards, for constructing custom visualizations and dashboards.

Getting started

Users can either consume the metric programmatically via the Azure Monitor REST API or directly from the Azure Portal. The following steps highlight metric consumption from the Azure Portal.

Once on the Azure Portal, navigate to the VM overview blade. The new metric will display as VM Availability (Preview), along with other platform metrics under the Monitoring tab.

Portal view of the VM overview page, with the newly added VM availability metric highlighted.

Figure 4: View the newly added VM Availability Metric on the VM overview page on Azure Portal.

Select (single click) the VM availability metric chart on the overview page, to navigate to Metrics Explorer for further analysis.

Portal view of VM availability metric on Metric Explorer, displaying availability as a trend in the form of a blue line, over time with occasional dips.

Figure 5: View the newly added VM availability Metric on Metrics Explorer on Azure Portal.
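For the programmatic route through the Azure Monitor REST API, the request is a GET against the metrics endpoint of the VM resource. The sketch below only assembles that URL. The metric name VmAvailabilityMetric, the API version, and the default interval are assumptions not stated in this post, and build_metric_url is a hypothetical helper, so confirm all of them against the Azure Monitor documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

METRIC_NAME = "VmAvailabilityMetric"  # assumed REST name of the preview metric
API_VERSION = "2018-01-01"            # assumed Azure Monitor metrics API version


def build_metric_url(vm_resource_id, timespan, interval="PT5M", aggregation="Average"):
    """Build the metrics URL for one VM (hypothetical helper; sends nothing).

    timespan is an ISO 8601 interval written as "<start>/<end>".
    """
    if not vm_resource_id.startswith("/subscriptions/"):
        raise ValueError("expected a full ARM resource id starting with /subscriptions/")
    query = urlencode(
        {
            "api-version": API_VERSION,
            "metricnames": METRIC_NAME,
            "timespan": timespan,
            "interval": interval,
            "aggregation": aggregation,
        }
    )
    return (
        f"https://management.azure.com{vm_resource_id}"
        f"/providers/Microsoft.Insights/metrics?{query}"
    )


vm_id = (
    "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>"
    "/providers/Microsoft.Compute/virtualMachines/<VMName>"
)
url = build_metric_url(vm_id, "2022-10-11T00:00:00Z/2022-10-12T00:00:00Z")
print(url)
# Decode the query string again to show the parameters that would be sent.
print(parse_qs(urlsplit(url).query))
```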

Metric description:

Display Name: VM Availability (preview)

Metric Values:
  • 1 during expected behavior; corresponds to VM in Available state.
  • 0 when VM is impacted by rebootful disruptions; corresponds to VM in Unavailable state.
  • NULL (shows a dotted or dashed line on charts) when the Azure service that is emitting the metric is down or is unaware of the exact status of the VM; corresponds to VM in Unknown state.

Aggregation: The default aggregation of the metric is Average, for prioritized investigations based on extent of downtime incurred. The other aggregations available are:
  • Min, to immediately pinpoint all the times where the VM was Unavailable.
  • Max, to immediately pinpoint all the instances where the VM was Available.
Refer here for more details on chart range, granularity, and data aggregation.

Data Retention: Data for the VM availability metric will be stored for 93 days to assist in trend analysis and historical lookback.

Pricing: Please refer to the Pricing breakdown, specifically in the “Metrics” and “Alert Rules” sections.
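To make the three aggregations concrete, the sketch below applies Average, Min, and Max to a short series of metric samples and evaluates a simple threshold check of the kind a metric alert performs. Treating NULL samples as gaps that are skipped is an assumption made for this example, as are the function names and the threshold value; the platform's own handling of NULL may differ.

```python
def aggregate(samples, how="Average"):
    """Aggregate availability samples (1, 0, or None for NULL).

    None samples are skipped; returns None when no known samples remain.
    """
    known = [s for s in samples if s is not None]
    if not known:
        return None
    if how == "Average":
        return sum(known) / len(known)
    if how == "Min":
        return min(known)
    if how == "Max":
        return max(known)
    raise ValueError(f"unsupported aggregation: {how}")


def should_alert(samples, threshold=1.0):
    """True when the average availability over the window dips below the threshold."""
    average = aggregate(samples, "Average")
    return average is not None and average < threshold


# One-minute samples: available, available, unavailable, unknown, available.
window = [1, 1, 0, None, 1]

print(aggregate(window, "Average"))  # 0.75
print(aggregate(window, "Min"))      # 0
print(aggregate(window, "Max"))      # 1
print(should_alert(window))          # True
```

Average reflects the extent of downtime in the window, while Min drops to 0 as soon as any sample was Unavailable, which is why Min suits alerts that should fire on any disruption.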

Looking ahead to 2023, we plan to include impact details (user vs platform initiated, planned vs unplanned) as dimensions to the metric, so users are well equipped to interpret dips and set up much more targeted metric alerts. With the emission of dimensions in 2023, we also anticipate transitioning the offering to a general availability status.

Introducing instantaneous notifications on changes in VM availability via Event Grid

We’re thrilled to introduce our latest monitoring offering, the private preview of VM availability status change events in an Event Grid System Topic, which uses the low-latency technology of Azure Event Grid! Users can now subscribe to the system topic and route these events to their downstream tooling using any of the available event handlers (such as Azure Functions, Logic Apps, Event Hubs, and Storage queues). This solution uses an event-driven architecture to communicate scoped changes in VM availability to end users in less than 5 seconds from the disruption occurrence. This empowers users to take instantaneous mitigation actions to prevent end user impact.

As part of the private preview, we’ll emit events scoped to changes in VM availability states, with the sample schema below:

Sample

{
  "id": "4c70abbc-4aeb-4cac-b0eb-ccf06c7cd102",
  "topic": "/subscriptions/<subscriptionId>",
  "subject": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
  "data": {
    "resourceInfo": {
      "id": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
      "properties": {
        "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
        "targetResourceType": "Microsoft.Compute/virtualMachines",
        "occurredTime": "2022-09-25T20:21:37.5280000Z",
        "previousAvailabilityState": "Available",
        "availabilityState": "Unavailable"
      }
    },
    "apiVersion": "2020-09-01"
  },
  "eventType": "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged",
  "dataVersion": "1",
  "metadataVersion": "1",
  "eventTime": "2022-09-25T20:21:37.5280000Z"
}

The properties field is fully consistent with the microsoft.resourcehealth/availabilitystatuses event in ARG. The Event Grid solution offers near-real-time alerting capabilities on the data present in ARG.
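An event handler downstream of the system topic mostly needs to read the nested properties field and decide whether to act. The sketch below shows one possible shape for that logic, independent of any particular handler runtime. The event is trimmed to the fields the handler reads, the function names and the remediation callback are hypothetical, and acting only on transitions into Unavailable is a choice made for this example.

```python
EXPECTED_EVENT_TYPE = (
    "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged"
)

VM_ID = (
    "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>"
    "/providers/Microsoft.Compute/virtualMachines/<VMName>"
)

# Trimmed copy of the sample event above, keeping only the fields used here.
SAMPLE_EVENT = {
    "id": "4c70abbc-4aeb-4cac-b0eb-ccf06c7cd102",
    "eventType": EXPECTED_EVENT_TYPE,
    "data": {
        "resourceInfo": {
            "properties": {
                "targetResourceId": VM_ID,
                "occurredTime": "2022-09-25T20:21:37.5280000Z",
                "previousAvailabilityState": "Available",
                "availabilityState": "Unavailable",
            }
        }
    },
}


def extract_transition(event):
    """Return (vm_id, previous_state, current_state) from an availability event."""
    props = event["data"]["resourceInfo"]["properties"]
    return (
        props["targetResourceId"],
        props.get("previousAvailabilityState"),
        props["availabilityState"],
    )


def handle_event(event, remediate):
    """Call remediate(vm_id) when a VM newly becomes Unavailable; return True if called."""
    if event.get("eventType") != EXPECTED_EVENT_TYPE:
        return False
    vm_id, previous, current = extract_transition(event)
    if current == "Unavailable" and previous != "Unavailable":
        remediate(vm_id)
        return True
    return False


# Stand-in remediation: record which VMs would be acted on.
remediated = []
handled = handle_event(SAMPLE_EVENT, remediated.append)
print(handled, remediated)
```

Passing the remediation step in as a callback keeps the decision logic testable without touching real infrastructure.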

We’re currently releasing the preview to a small subset of users to rigorously test the solution and collect iterative feedback. This approach enables us to preview and even announce the general availability of a high-quality and well-rounded offering in 2023. As we look towards the general availability of this solution, users can expect to receive events when annotations and automated RCAs are emitted by the platform.

What’s next?

We’ll be heavily focused on strengthening our monitoring platform to continuously improve the experience for customers based on ongoing feedback collected from the community (such as aggregated VMSS health showing degraded inaccurately, VM unavailable for 15 minutes, Missing VM downtimes in Activity Log). By streamlining our internal message pipeline, we aim to not only improve data quality, but also maintain data consistency across our offerings and expand the scope of failure scenarios surfaced.

Introducing Degraded VM Availability state

In light of our upcoming efforts to centralize our monitoring architecture, we’ll be well-positioned to introduce a Degraded VM availability state for virtual machines in 2023. This state will be extremely useful in setting up targeted alerts on predicted hardware failure scenarios where there is imminent risk to VM availability. This state will also allow users to efficiently track cases of degraded hardware or software failures needing to redeploy, which today do not cause a corresponding change in VM availability. We will also aim to emit reminder annotations through the duration of the VM being marked Degraded, to prevent users from overlooking the request to redeploy.

Expand scope of failure attribution to include application freeze events

In 2023, we plan to expand our scope of failure attribution and emission to also include application freeze events that may be caused by network agent updates, host OS updates lasting thirty seconds, and freeze-causing repair operations. This will ensure users have enhanced visibility into freeze impact and will be applied across our monitoring offerings, including Resource Health and Activity Logs.

Learn More

Please stay tuned for more announcements on the Flash initiative, by monitoring updates to the Advancing Reliability Series!


