NOC Engineer

Tech Ops, Bangalore, India


LogMeIn simplifies how people connect with each other and the world around them to drive meaningful interactions, deepen relationships, and create better outcomes for individuals and businesses. One of the world’s top 10 public SaaS companies, and a market leader in communication & conferencing, identity & access, and customer engagement & support solutions, LogMeIn has millions of customers spanning virtually every country across the globe. LogMeIn is headquartered in Boston, with additional locations across North America, Europe, the Middle East, Asia, and Australia.

As a Site Reliability Administrator, you will contribute to a global team responsible for first-level Event, Incident, and Problem management in a complex and highly technical environment, working on a variety of issues across multiple network elements in a predominantly Linux environment. This position also helps with continuous product improvements, assists with non-negotiable projects that support the Division Goals, performs system-wide upgrades, and occasionally acts as an individual contributor on special Ops projects. The Site Reliability Administrator is responsible for monitoring multiple data centers and/or cloud environments, locally and worldwide, in a 24/7/365 production environment. The Site Reliability Administrator also works closely with our Inbound Customer Care, Operations, and Development teams to prioritize the resolution of production issues, and is responsible for sending pre-defined intra-company outage communications and updates.

  • Confirms and troubleshoots all alerts from remote monitoring tools, Inbound Care, Ops, or Dev, and works to resolve all L1 and L2 issues related to our data center, cloud environments, network infrastructure, hardware, and/or applications.
  • Verifies that all Incident and Problem Management tickets created for our data center or cloud environments are accurate, entered within SLA, and kept up to date. Acts as a first or second tier of response and technical support for incident, problem management, and service request resolution.
  • Responds to all incoming operations incidents in the ticketing system. Prioritizes and escalates any unresolved issues to the appropriate on-call staff so that tickets can be closed in a timely manner.
  • Sends first-level Incident Communications regarding any Outages, Major Incidents, or Service Center Issues to Ops Managers and key stakeholders within SLA (Service Level Agreement) parameters via our internal tools. Will sometimes perform timely notification updates to middle and senior management electronically and via telephone for extended outages and Maintenance Windows.
  • Performs timely notification updates to upper management electronically and via telephone within SLA parameters. Maintains the Outage and Maintenance database, the official Outage Announcement Templates, and all other associated reports and documentation.
  • Responsible for defining and driving continuous improvements for the Tech Ops Service and Support teams, the Operations department, and Dev-Ops support teams based on IM, PM, and RCA process executions.
  • Meets occasionally or when requested with operations teams, development, and other Site Reliability staff to prioritize future stage and live application, deployment, or project tasks.
  • Responsible for adopting and recommending continuous process improvement measures for the organization based on findings during troubleshooting and triage of L1 Incidents. Collaborates with members of the Ops team on how best to support their applications.
  • Aggressively follows up with Site Reliability Engineers or Engineering staff on ticket resolution and information updates so that tickets can be closed in a timely fashion.
  • Assists other Operations departments with multi-level support to resolve complex, technical problems.
  • Evaluates and recommends updates, patches, replacements, or upgrades to current Site Reliability software tools and monitoring systems. Responsible for researching and developing new Site Reliability monitoring tools as they become available.
  • Works with other Department leads to develop, validate, and properly catalog SOP documents on the internal Opswiki and Knowledge Base.
  • Creates and updates incident and problem management procedures to be used by the 1st Level and 2nd Level Site Reliability Technicians.
  • Regularly participates in the Shift Handover process with previous and incoming shift teams to help sync and transfer any ongoing issues or outages.
  • Available for on-call and emergency response rotation as needed.
  • Maintains the Escalation contact matrix and processes to ensure that all levels of the Support Organization are listed, and audits this list frequently. Works with other staff and team members to maintain the on-call status of other Operations and Development personnel.
  • Responds to any additional needs coming from their direct management.
  • Ensures that the other members of the team follow and enforce the Ops Change Control procedures and immediately escalate any violations to Ops management.

  • Bachelor's degree or equivalent experience required.
  • 3-5 years of experience in a technical or network operations support environment.
  • Knowledge of Remedy, TeamTrack, Track-It!, SDE, or other ticketing systems a plus.
  • Expertise with enterprise monitoring tools such as BMC Event Manager, Remstats, HP OpenView, HP Insight Manager, Nagios, etc., desired.
  • Proven understanding of TCP/IP networking, SNMP, UNIX/Linux/Windows Server operating systems, HTTP/HTTPS, SMB, NFS, SMTP, IMAP, SSH, DNS, NTP, and Microsoft Office products is preferred.
  • Strong written and verbal communication skills are necessary.
  • Linux certification or equivalent experience required, with demonstrated understanding of command-line tools to create, move, view, and otherwise investigate files and directories.
  • Ability to update and configure Linux systems and packages.
  • Linux scripting to automate system maintenance tasks.
Be Accountable - even when no one is looking
Thrive Together - greatness comes from unlocking each other’s potential
Advance Confidently - we find opportunity and act on it
Collaborate Openly - our whole is greater than the sum of our parts
Engage Fearlessly - we speak up and listen