Data Engineer (Python + PySpark + Cloudera)
Company Description
Talan is an international advisory group focused on innovation and transformation through technology, with 5,000 employees and a turnover of €600M.
We offer our customers a continuum of services to support them at each key stage of their organization's transformation, built around 4 main activities:
- CONSULTING in management and innovation: supporting business, managerial, cultural, and technological transformations.
- DATA & TECHNOLOGY to implement major transformation projects.
- CLOUD & APPLICATION SERVICES to build or integrate software solutions.
- SERVICE CENTERS of EXCELLENCE to support the above activities through technology, innovation, agility, sustainability of skills, and cost optimization.
Talan accelerates its clients' transformation through innovation and technology. By understanding their challenges and supporting them with innovation, technology, and data, we enable them to become more efficient and resilient.
We believe that only a human-oriented practice of technology will make the new digital age an era of progress for all. Together, let's commit!
Job Description
Please note: this role is based in Warsaw, Poland, and relocation is required.
We are looking for a skilled Data Engineer with expertise in Python, PySpark, and Cloudera to join our team. The ideal candidate will be responsible for developing and optimizing big data pipelines while ensuring efficiency and scalability. Experience with Databricks is a plus. Additionally, familiarity with Git, GitHub, Jira, and Confluence is highly valued for effective collaboration and version control.
Key Responsibilities:
- Design, develop, and maintain ETL pipelines using Python and PySpark (a short illustrative sketch follows this list).
- Work with Cloudera Hadoop ecosystem to manage and process large-scale datasets.
- Ensure data integrity, performance, and reliability across distributed systems.
- Collaborate with data scientists, analysts, and business stakeholders to deliver data-driven solutions.
- Implement best practices for data governance, security, and performance tuning.
- Use Git and GitHub for version control and efficient code collaboration.
- Track and manage tasks using Jira, and document processes in Confluence.
- (Optional) Work with Databricks for cloud-based big data processing.
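To give a concrete sense of the day-to-day work, here is a minimal sketch of the kind of PySpark ETL job this role involves. It is purely illustrative: all paths, table names, and column names below are hypothetical placeholders, not an actual Talan project.

    from pyspark.sql import SparkSession, functions as F

    # Hypothetical locations -- placeholders for illustration only.
    RAW_PATH = "hdfs:///data/raw/transactions"
    CURATED_TABLE = "curated.daily_transactions"

    spark = (
        SparkSession.builder
        .appName("daily-transactions-etl")
        .enableHiveSupport()  # read/write Hive tables on a Cloudera cluster
        .getOrCreate()
    )

    # Extract: read raw Parquet files from HDFS.
    raw = spark.read.parquet(RAW_PATH)

    # Transform: drop malformed rows, then aggregate per customer per day.
    daily = (
        raw.dropna(subset=["customer_id", "amount"])
        .withColumn("txn_date", F.to_date("txn_ts"))
        .groupBy("customer_id", "txn_date")
        .agg(
            F.count("*").alias("txn_count"),
            F.sum("amount").alias("total_amount"),
        )
    )

    # Load: write the result back as a partitioned Hive table,
    # queryable from Hive or Impala.
    daily.write.mode("overwrite").partitionBy("txn_date").saveAsTable(CURATED_TABLE)

    spark.stop()

In practice, such jobs are versioned in Git/GitHub, tracked in Jira, documented in Confluence, and often scheduled through an orchestrator such as Airflow or Oozie, as described in the qualifications below.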
Qualifications
Required Skills & Experience:
- Strong programming skills in Python.
- Hands-on experience with PySpark for distributed data processing.
- Expertise in Cloudera Hadoop ecosystem (HDFS, Hive, Impala).
- Experience with SQL and working with large datasets.
- Knowledge of Git and GitHub for source code management.
- Experience with Jira for task tracking and Confluence for documentation.
- Strong problem-solving and analytical skills.
Preferred Qualifications:
- Basic knowledge of Databricks for cloud-based big data solutions.
- Experience with workflow orchestration tools (e.g., Airflow, Oozie).
- Understanding of cloud platforms (AWS, Azure, or GCP).
- Exposure to Kafka or other real-time streaming technologies.
Additional Information
What do we offer you?
- Permanent, full-time contract
- Training and career development
- Benefits and perks such as private medical insurance, a lunch pass card, and a MultiSport Plus card
- The opportunity to be part of a multicultural team and work on international projects
- Hybrid position based in Warsaw, Poland
- Support with managing work permits