Web Scraping Engineer
Company
Legalist
Location
Remote
Type
Full Time
Job Description
Legalist is an institutional alternative asset management firm. Founded in 2016 and incubated at Y Combinator, the firm uses data-driven technology to invest in credit assets at scale. We are always looking for talented people to join our team.
Where You Come In:
- Help to design and implement the architecture of a large-scale crawling system
- Design, implement, and maintain various components of our data acquisition infrastructure (building new crawlers; maintaining existing crawlers, data cleaners, and loaders)
- Develop tools to facilitate scraping at scale, monitor crawler health, and ensure the data quality of scraped items
- Collaborate with our product and business teams to understand and anticipate requirements, striving for greater functionality and impact in our data-gathering systems
What you’ll be bringing to the team:
- 3+ years of experience with Python for data wrangling and cleaning
- 2+ years of experience with data crawling & scraping at scale (at least 100 spiders)
- Production experience with Scrapy is mandatory; distributed crawling and advanced Scrapy experience are a plus
- Familiarity with scraping libraries and monitoring tools is highly recommended (BeautifulSoup, XPath, Selenium, Puppeteer, Splash)
- Familiarity with data pipelining to integrate scraped items into existing data pipelines.
- Experience extracting data from multiple disparate sources including HTML, XML, REST, GraphQL, PDF, and spreadsheets.
- Experience running, monitoring and maintaining a large set of broad crawlers (100+ spiders)
- Sound knowledge of bypassing bot detection techniques
- Experience using techniques to protect web scrapers against site bans, IP leaks, browser crashes, CAPTCHAs, and proxy failures
- Experience with cloud environments such as GCP and AWS, containerization tools like Docker, and orchestration systems such as Kubernetes
- Ability to maintain all aspects of a scraping pipeline end to end (building and maintaining spiders, evading bot detection, data cleaning and pipelining, monitoring spider health and performance)
- Basics of OOP, SQL, and the Django ORM
Even better if you have, but not necessary:
- Experience with microservices architecture
- Familiarity with message brokers such as Kafka, RabbitMQ, etc.
- Experience with DevOps
- Expertise in data warehouse maintenance, specifically with Google BigQuery (ETLs, data sourcing, modeling, cleansing, documentation, and maintenance)
- Familiarity with job scheduling & orchestration frameworks - e.g. Jenkins, Dagster, Prefect
Date Posted
04/26/2024