Oleg

Mid-Level LLM Engineer / CUDA / C++

Moscow (🇺🇸 US) · Open to Relocation
Posted May 15, 2025
Expected Annual Salary
$60000
Full Time
Market Value
72

About Me

Achievements
- 1st place at the GeoVisionHack hackathon.
- 3rd place at the PhystechRadar Challenge hackathon.
- Led the academic Sandbox project on porting classical computations to low-qubit quantum computers.

Developed Solutions
- Designed a high-performance CUDA application for accelerated DNA sequence alignment, integrated with Bakta, using C++20 and the CUDA Runtime API; achieved a 10x speedup on genomic data (10^6 nucleotides) through optimized direct-convolution kernels and FFT-based convolution with cuFFT.
- Developed a custom DNAVector container in C++20 for safe genomic data management with CUDA integration; optimized GPU memory management with Unified Memory, memory pools, and FP16 on tensor cores (Ampere), improving data-transfer performance by 30%.
- Implemented CUDA kernels and PyTorch modules, including custom operators that accelerate DNA convolution and attention mechanisms in LLMs, applied to bioinformatics and NLP tasks.
- Created a tool for automating bioinformatics pipelines in C++, using Boost.Graph for process modeling, Qt for a drag-and-drop GUI, SeqAn for FASTA sequence analysis, and ONNX Runtime for integrating machine-learning models.
- Built full-cycle NLP pipelines (text classification, NER, and QA models based on BERT and GPT), including Llama fine-tuning, prompt engineering, pruning, quantization, and ONNX/TensorRT optimization for acceleration in C++ and CUDA.
- Developed data-preparation systems for NLP, including data slicing, prompt selection, and integration with crowdsourcing platforms for annotation, using PyTorch, NumPy, scikit-learn, and pandas.
- Implemented CUDA kernels for RL models (Qwen, GRPO), achieving a 2.5–3.5x speedup on forward/backward passes with FP16 and tensor cores on an RTX 4090, leveraging shared memory and WMMA for optimized GEMM.
- Optimized reward computation in GRPO, achieving a 3–5x speedup through coalesced memory access and batch parallelization (1024 trajectories) using shared memory.
- Implemented dataset preprocessing (GSM8K) with Thrust and CUDA Streams, increasing tokenization and normalization throughput by 5–10x.
- Integrated custom PyTorch modules with CUDA for LLM attention mechanisms, applying a tile-based approach and cuDNN, reaching >70% SM utilization (measured with Nsight Compute).
- Adapted solutions for ROCm/NPU targets with C++20 and OpenMPI, ensuring scalability and high performance on cluster architectures.
- Applied multithreading (std::thread, std::async) and OpenMPI for parallel data processing, NCCL for distributed GPU computing, and ROCm for cross-platform optimization.
- Configured reproducible builds with CMake and vcpkg; profiled performance using NVIDIA Nsight Compute, CUDA-GDB, and Visual Profiler.

Technology Stack
- Languages and frameworks: C++ (C++11/17/20), Python, PyTorch, TensorFlow, SQL.
- CUDA and HPC: CUDA C/C++ (kernels, memory management, tiling), cuBLAS, cuDNN, cuFFT, Thrust, NCCL, NVIDIA HPC SDK, OpenMPI.
- GPU optimization: shared memory, coalescing, thread-divergence minimization, FP16/BF16 on tensor cores, ROCm.
- Bioinformatics: SeqAn, STAR, HISAT2, StringTie, DESeq2, Salmon, Kallisto, Cufflinks, RSEM, Bioconductor, Nextflow.
- Population genetics: PLINK, VCFtools, BCFtools, SAMtools, GATK, ANGSD, STRUCTURE, ADMIXTURE, PopGenome, SNPRelate, fastStructure, ONNX Runtime, CWL/WDL/Nextflow, R, and related tools.
- NLP and ML: BERT, GPT, Llama, LangChain, LlamaIndex, prompt engineering, pruning, quantization, TensorRT.
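The tiled direct-convolution and shared-memory techniques listed above can be illustrated with a minimal sketch. This is a generic 1-D convolution kernel, not the original project code; all names and sizes (TILE, KMAX, conv1d_tiled) are illustrative assumptions. Each block stages its input tile plus a (k-1)-element halo in shared memory, so global loads are coalesced and each element is reused k times.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 256;  // threads per block = outputs per tile (illustrative)
constexpr int KMAX = 32;   // assumed maximum filter length

__global__ void conv1d_tiled(const float* __restrict__ seq, int n,
                             const float* __restrict__ filt, int k,
                             float* __restrict__ out) {
    // Tile plus halo staged in shared memory for coalesced, reused loads.
    __shared__ float tile[TILE + KMAX - 1];
    int gid = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = (gid < n) ? seq[gid] : 0.0f;
    // The first k-1 threads also load the halo past the tile's right edge.
    if (threadIdx.x < k - 1) {
        int h = blockIdx.x * TILE + TILE + threadIdx.x;
        tile[TILE + threadIdx.x] = (h < n) ? seq[h] : 0.0f;
    }
    __syncthreads();

    // Only positions with a full filter overlap produce an output.
    if (gid <= n - k) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j)
            acc += tile[threadIdx.x + j] * filt[j];
        out[gid] = acc;
    }
}
```

A host would launch this with `(n + TILE - 1) / TILE` blocks of `TILE` threads; for long filters, the FFT-based path with cuFFT mentioned above becomes the better trade-off, since direct convolution cost grows linearly with filter length.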

Professional Details

Job Category
Software Engineer
Experience Level
Mid-Level
Work Mode
Remote
Location
Moscow (🇺🇸 US)
Contract Type
Full Time

Skills

CUDA, NLP, C++

Preferred Work Areas

USA, Australia, EU, UK
