A few words about the project
We’re looking for a Senior Python Developer (Warsaw or remote) with strong SQL and database knowledge and experience.
The project involves the development of open-source data science tools (some are already in production, but much remains to be done). Our client is based in the USA, in California near Los Angeles, and specializes in developing its own products: data-related tools that help organisations with data loss prevention, diffing, optimization, monitoring, testing, and migrations. Your tasks will revolve around an open-source project that is live but still needs a lot of development. It is a command-line tool and Python library that efficiently diffs rows across two different databases (e.g. PostgreSQL -> Snowflake), works for tables with tens of billions of rows, verifies 25M+ rows in under 10 seconds and 1B+ rows in about 5 minutes, and bridges column types of different formats and levels of precision (e.g. double ⇆ float ⇆ decimal).
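To give a feel for the tool, here is a minimal sketch of how a cross-database comparison looks with data-diff's public Python API (connect_to_table and diff_tables); the connection strings, table names, and key columns below are placeholders, not real credentials.

```python
from data_diff import connect_to_table, diff_tables

# Placeholder URIs, table names, and key column -- replace with your own.
# CLI equivalent: data-diff <db1_uri> <table1> <db2_uri> <table2> -k id
table1 = connect_to_table("postgresql://user:pass@localhost:5432/app", "orders", "id")
table2 = connect_to_table("snowflake://user:pass@account/DWH/PUBLIC?warehouse=WH", "ORDERS", "ID")

# diff_tables yields ('-', row) for rows found only in the first table
# and ('+', row) for rows found only in the second.
for sign, row in diff_tables(table1, table2):
    print(sign, row)
```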
There is no overlap requirement: you can work Polish working hours (flexible), with no need to adjust to US working hours.
What does the recruitment process look like?
Technical interview, which is also a good moment for your initial questions about the project (1 hour)
Meeting with the Project Lead (1 hour)
Meeting with the CTO (1 hour)
Decision
All steps take place online, of course
You will be responsible for...
Answering issues and pull requests on GitHub, and questions on Slack
Reaching out to existing/potential users to assist in adoption
Implementing new features, fixing bugs, suggesting improvements
Writing more tests, anticipating more edge cases
Improving the CI flow on GitHub to support testing against more databases
Assisting in the development of new modules (e.g. same-db data-diff)
Improving documentation and writing tutorials.
Common use cases of data-diff
Verifying that all data was copied during a critical data migration, for example when migrating from Heroku PostgreSQL to Amazon RDS.
Verifying data pipelines, e.g. moving data from a relational database to a warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
Alerting and maintaining data integrity SLOs. You can create and monitor an SLO of e.g. 99.999% data integrity and alert your team when data is missing.
Debugging complex data pipelines. When data gets lost in pipelines that may span a half-dozen systems, it is extremely difficult to track down where a row got lost without verifying each intermediate datastore.
Detecting hard deletes in an updated_at-based pipeline. If you’re copying data to your warehouse based on an updated_at-style column, you’ll miss hard deletes, which data-diff can find for you.
Making your replication self-healing. You can use the diff output to write or update rows in the target database, as sketched below.
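To illustrate the SLO-monitoring and self-healing ideas from the list above, here is a rough sketch. The sign convention follows data-diff's documented diff output; the upsert_row and delete_row helpers are hypothetical stand-ins for whatever write path your target database actually uses.

```python
from data_diff import connect_to_table, diff_tables

# Placeholder URIs, table names, and key column -- replace with your own.
source = connect_to_table("postgresql://user:pass@localhost:5432/app", "orders", "id")
target = connect_to_table("snowflake://user:pass@account/DWH/PUBLIC?warehouse=WH", "ORDERS", "ID")


def upsert_row(row):
    # Hypothetical helper: in practice, an INSERT ... ON CONFLICT (Postgres)
    # or MERGE (Snowflake) that writes the missing row into the target.
    print("would upsert into target:", row)


def delete_row(row):
    # Hypothetical helper for rows hard-deleted in the source, which an
    # updated_at-based pipeline would otherwise miss.
    print("would delete from target:", row)


missing = extra = 0
for sign, row in diff_tables(source, target):
    if sign == "-":      # present in the source, missing from the target
        missing += 1
        upsert_row(row)
    elif sign == "+":    # present in the target, gone from the source
        extra += 1
        delete_row(row)

# A crude integrity check in the spirit of a 99.999% SLO:
# alert (here, just print) whenever any rows diverge.
if missing or extra:
    print(f"{missing} missing / {extra} extra rows -- data integrity SLO at risk")
```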
What's important for us?
Must have
Senior level, with at least 7 years of commercial experience.
Strong knowledge of and experience in Python programming (ideally for data solutions, but that is not required).
Strong knowledge of and experience with SQL and databases.
Readiness to work independently and take ownership of assigned tasks.
Experience in writing technical documentation.
Fluent English (C1).
Ideally you have a one-month notice period or are available ASAP, but we can also wait longer for you.
Nice to have
Experience working directly with customers and experience contributing to open-source projects.