Job Location: Chennai
Experience: 2+ Years
Primary Skill: Hadoop
Responsibilities:
- Design, build, test and maintain scalable and stable off-the-shelf applications to support distributed processing using the Hadoop ecosystem
- Implement ETL and data processes for structured and unstructured data
- Build pipelines for optimal extraction of data from a wide variety of data sources, covering ingestion, transformation, conversion and validation
- Conduct root cause analysis and advanced performance tuning for complex business processes and functionality
- Review frameworks and design principles for suitability in the project context
- Client orientation:
  - Propose the right solutions to the client by identifying and understanding critical pain points
  - Contribute to the entire implementation process, including driving the definition of improvements based on business needs and architectural improvements
  - Propose, pitch, sell, implement and prove success in continuous improvement initiatives
  - Work and collaborate with multiple teams and stakeholders
- Agile orientation:
  - Take part in Agile ceremonies to groom stories and develop defect-free code for them
  - Review code for quality and implementation best practices
  - Promote coding, testing and deployment best practices through hands-on research and demonstration
  - Write testable code that enables extremely high levels of code coverage
  - Mentor junior engineers and guide them to become great engineers
Desired Skills/Experience:
- Preferably 4 to 7 years of experience
- Highly skilled in:
  - PySpark and Spark
  - PySpark SQL and DataFrame APIs (see the sketch after this list)
  - Interpreting the Spark execution DAG as displayed in the ApplicationMaster
  - Writing optimal PySpark code, plus deep knowledge of Spark parameter tuning for execution optimization
  - Python (2 and 3), including knowledge of libraries like NumPy, Pandas, etc.
  - Writing Sqoop scripts for ETL from Teradata
  - SQL and analytical thinking
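As an illustration of the PySpark SQL and DataFrame skills above, here is a minimal sketch; the dataset path, column names and configuration values are hypothetical, and the tuning settings are examples of the kind of parameters one would tweak, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical session: the shuffle-partition and memory settings are
# illustrative examples of Spark parameter tuning, not recommended values.
spark = (
    SparkSession.builder
    .appName("orders-etl-sketch")
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Hypothetical input: an orders dataset stored as Parquet.
orders = spark.read.parquet("/data/orders")

# DataFrame API: revenue per customer, highest first.
revenue = (
    orders
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy(F.desc("total_amount"))
)

# Equivalent Spark SQL over a temporary view.
orders.createOrReplaceTempView("orders")
revenue_sql = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_amount "
    "FROM orders GROUP BY customer_id ORDER BY total_amount DESC"
)

revenue.show(10)
```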
- Strong understanding of:
  - Hadoop and Spark architectures and the MapReduce framework
  - Big data stores like HDFS, HBase, Cassandra
  - Data formats like Avro, Parquet, ORC, etc.
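To make the data-format point concrete, here is a minimal, self-contained sketch of writing and reading two of the columnar formats named above; the paths and sample rows are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-sketch").getOrCreate()

# Hypothetical DataFrame built inline so the sketch is self-contained.
df = spark.createDataFrame(
    [("c1", 120.0), ("c2", 75.5)],
    ["customer_id", "total_amount"],
)

# Write the same data in two of the columnar formats named above.
df.write.mode("overwrite").parquet("/tmp/revenue_parquet")
df.write.mode("overwrite").orc("/tmp/revenue_orc")

# Reading back preserves the schema in both formats.
# (Avro requires the external spark-avro package, so it is omitted here.)
spark.read.parquet("/tmp/revenue_parquet").printSchema()
spark.read.orc("/tmp/revenue_orc").printSchema()
```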
- Exposure to at least one big data platform like Hortonworks, Cloudera, HDP, AWS EMR, MapR, etc.
- Prior experience with:
  - Using monitoring and administration tools like Ambari, Ganglia, etc.
  - Scheduling big data applications using Oozie (including workflow and coordinator properties)
- Good OO skills, including good design patterns knowledge
- Good understanding of technologies like Hive, Pig, Presto, Impala, etc.
- Prior experience in building Spark infrastructure (cluster setup, administration, performance tuning), on-premise (bare metal) and/or cloud-based
- Knowledge of software best practices, like Test-Driven Development (TDD) and Continuous Integration (CI)
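As a hedged illustration of the TDD and testable-code expectations above, here is a minimal pytest-style unit test for a small PySpark transformation; the function, fixture and sample data are hypothetical.

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_total_with_tax(df, rate):
    # Transformation under test: add a tax-inclusive total column.
    return df.withColumn("total_with_tax", F.col("amount") * (1 + rate))


@pytest.fixture(scope="module")
def spark():
    # Local session so the test runs without a cluster.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("tdd-sketch")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_add_total_with_tax(spark):
    df = spark.createDataFrame([("o1", 100.0)], ["order_id", "amount"])
    result = add_total_with_tax(df, rate=0.1)
    assert result.first()["total_with_tax"] == pytest.approx(110.0)
```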