Is it possible to build an efficient, focused web crawler using Flink? That question led to the creation of the open-source flink-crawler project. In this talk, I’ll discuss how we use Flink’s support for AsyncFunctions and iterations to create a scalable web crawler that continuously and efficiently performs a focused web crawl with no additional infrastructure. I’ll also discuss some of the testing and debugging challenges we encountered when using features such as AsyncFunctions and iterations.
Ken Krugler is an Apache Tika committer, a member of the Apache Software Foundation, and a long-time contributor to the big data open-source community. He’s been using Hadoop in anger for over a decade and is also an expert on big data search & analytics using Solr and Elasticsearch. More recent projects include continuous analysis of display advertising for https://www.adbeat.com/ and creating a digital assistant for sales reps using Akka & TensorFlow as part of https://saleshero.ai/.