We impact the lives of over 40 million consumers daily by working with clients in the Baltics, the USA, Central and South America, and the Caribbean. We operate one of the largest big data environments in the Baltics, with one of the most diverse data sets, and we tackle some of the most challenging analytical and AI problems in the industry.
Exacaster's team is looking for a Senior DevOps Engineer to build and operate the platforms that power our big data and Generative AI solutions for clients in telecommunications, finance, and investment management.
About the role
We are looking for a DevOps engineer who is excited to work at the intersection of big data infrastructure and the new wave of AI engineering. You will design, deploy, and operate the platforms that run our distributed data systems and our production-grade, LLM-powered applications - RAG systems, AI agents, MCP servers, and ML pipelines.
This role is ideal for an engineer with strong DevOps and cloud fundamentals who is curious about Generative AI and wants to build the foundations that AI engineers, data scientists, and product teams rely on every day. Hands-on experience with AI/LLM workloads is a plus but not required - strong infrastructure, automation, and operational skills are more important. You will also actively use AI tools (Claude, AI coding agents, MCP-based assistants) to accelerate your own work.
In this role, you will take care of
Infrastructure for AI & data platforms: Design, deploy, and maintain the cloud and on-prem infrastructure that runs our big data platforms (Cloudera, Spark, Hadoop) and Generative AI workloads (AWS Bedrock, Azure OpenAI, vector databases, AI agents, MCP servers).
Cloud architecture: Build secure, scalable AWS or Azure environments - multi-account/landing-zone setups, networking, IAM and identity integration for both data and AI products.
AI platform operations: Deploy, scale, and operate LLM-powered services in production: agent frameworks, RAG pipelines, vector stores, and orchestration layers. Ensure they are reliable, observable, and cost-efficient (latency, token usage, model spend).
Automation & IaC: Use Terraform and Ansible for repeatable, scalable deployments across data, AI, and supporting services. Manage secrets, keys, and AI provider credentials securely (AWS Secrets Manager, Vault, KMS).
CI/CD pipeline management: Build and maintain CI/CD pipelines (GitLab CI, ArgoCD) for infrastructure, data applications, and AI services - including AI-assisted code review, automated testing, static analysis, and release management.
Kubernetes management: Deploy and operate Kubernetes clusters (EKS/AKS) for big data and AI workloads, ensuring efficient resource utilization for GPU/CPU inference and batch jobs.
Monitoring and alerting: Implement and optimize observability with Prometheus, Zabbix, Grafana, and cloud-native tools, extended to LLM-specific signals: model latency, error rates, prompt/response quality, and cost.
Networking & security: Collaborate with network and security teams to ensure secure connectivity for data and AI services (VPNs, Transit Gateway, private endpoints), and apply best practices for compliance in regulated industries.
Performance optimization: Identify and resolve performance bottlenecks across big data clusters and AI inference workloads - tuning resources, caching, and scaling strategies.
AI-augmented DevOps: Use AI tools (e.g. Claude Code, MCP-based assistants) in your daily workflow for code review, IaC generation, runbook automation, and incident response, and help the team adopt these practices.
We are looking for a person who has
3+ years of hands-on experience in DevOps or SRE roles, ideally on big data, distributed systems, or modern AI/ML platforms.
Strong knowledge of Linux (RHEL) systems - scripting, system administration, and troubleshooting.
Hands-on experience with cloud environments, particularly AWS or Azure, including deployment of cloud-native services and infrastructure.
Expertise in deploying, managing, and scaling applications using Kubernetes.
Proficiency with Terraform and Ansible for infrastructure automation and configuration management.
Experience with CI/CD pipelines and tools such as GitLab CI, ArgoCD, or similar.
Proficiency with monitoring systems such as Prometheus or Zabbix for metrics and alerting.
Strong understanding of networking - security, VPNs, and performance tuning in hybrid environments.
Strong analytical and problem-solving skills, with a track record of resolving complex platform and performance issues.
Curiosity about Generative AI and a willingness to apply AI tools (coding agents, MCP) to your own work and to the platforms you operate.
Nice to have
Experience deploying or integrating MCP servers or AI agent frameworks.
MLOps practices - model versioning, deployment, monitoring, and lifecycle management.
Experience managing Cloudera or similar on-prem big data platforms.
Experience with Hadoop, Spark, or similar data processing frameworks.
Snowflake administration and automation (Terraform provider, key-pair auth, SCIM/SSO).
Certifications in cloud platforms (AWS, Azure) or Kubernetes.
Security best practices and tools for cloud and on-prem environments - including AI-specific concerns (data leakage, prompt injection, model access control).
We promise
Monthly salary for this position ranges from 5000 EUR to 7300 EUR gross for a full-time role.
Direct involvement in building the infrastructure behind real, production Generative AI products - not pilots or PoCs.
Daily access to and active use of modern AI tooling as part of your engineering toolkit.
Participation in the company's stock options program.
Flexible benefits and a personal learning budget.
10 Growth Days per year - dedicated time for learning and development.
Ownership and dynamics in your role.
Hybrid work environment.
All the support you need from our experienced team to become an even better professional.
And the most important thing - you will be part of a great international team!