
Today we’re excited to unveil a new series of blog posts that describe how Paper.li works from a technological standpoint. These posts will cover the technology, components and processes that take place behind the scenes before a paper goes live. Although this series of posts is technical in nature, we’ll do our best to keep it understandable and interesting for all.

This series will be composed of multiple articles:

  • How we ingest and process social content data (today’s article, the Input Pipeline)
  • How we generate paper editions
  • How we serve papers to users and curators
  • Which technologies we use for our key components: storage, search, messaging, and configuration management

There is a lot of cool stuff happening behind the scenes before a new edition is published. If this series doesn’t answer all of your questions or cover the topics you would like to learn more about, please feel free to make a suggestion!

Today we will cover how our backend works and, more specifically, how we process social data.

At Paper.li, we are driven by the data we extract and analyze. Every day, we analyze around 250 million posts on social media and follow the articles they reference to find valuable information for our publishers.

The main goal of our backend is to allow our users to easily create daily automated content digests capturing the best of what is happening on the social web. Paper.li data ingestion & publishing can be represented by two main pipelines, each of them event-driven thanks to the Apache Kafka distributed messaging system (a minimal sketch of this hand-off follows the list below):

  • The first is the Input pipeline, whose role is to keep user-specified sources up to date and crawl the web for articles
  • The second is the Publishing pipeline, which publishes timely digests for our users; it will be covered in a future article
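
To make this event-driven hand-off more concrete, here is a minimal Clojure sketch that publishes a scheduling event through the Java Kafka producer. The topic name, the message key and the EDN payload are hypothetical illustrations rather than our actual schema, and the sketch uses the modern Kafka client API for readability.

(ns sketch.kafka
  (:import (org.apache.kafka.clients.producer KafkaProducer ProducerRecord)))

;; Plain string key/value producer; the serializers ship with the Kafka client.
(def producer
  (KafkaProducer.
    {"bootstrap.servers" "localhost:9092"
     "key.serializer"    "org.apache.kafka.common.serialization.StringSerializer"
     "value.serializer"  "org.apache.kafka.common.serialization.StringSerializer"}))

(defn publish!
  "Send an EDN payload to a topic, keyed so that related events stay ordered."
  [topic k payload]
  (.send producer (ProducerRecord. topic k (pr-str payload))))

;; Hypothetical example: the scheduler tells the fetchers that a source is due.
(publish! "input.sources.due"
          "9d6ea034de36e8da5b71b61ef4bd3a649983e9a7"
          {:identifier "9d6ea034de36e8da5b71b61ef4bd3a649983e9a7"
           :input "#kdtree"
           :type "twitter_search"})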

The input pipeline actively monitors more than 1 million sources of posts, which lead us to articles and images that can then be selected to be part of our users’ papers. This input pipeline process is composed of multiple steps:

  1. Scheduling
  2. Source fetching
  3. Normalization
  4. Extraction

Those processes interact with our Cassandra cluster to save posts, articles, images and videos. The Cassandra workload for fetchers, normalizers, and extractors is composed only of writes; no read operations are performed, which gives us better performance.
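
As an illustration of this write-only access pattern, the sketch below saves a normalized post with a single CQL INSERT; since an INSERT in Cassandra behaves as an upsert, no read-before-write is ever needed. The table layout and the execute! helper are hypothetical and only stand in for a real CQL driver call.

;; Hypothetical write-only save: an INSERT in Cassandra is an upsert,
;; so the workers never have to read an existing row before writing.
(defn save-post!
  [execute! post]   ; execute! stands in for a real CQL driver call
  (execute!
    "INSERT INTO posts (identifier, type, published_at, payload)
     VALUES (?, ?, ?, ?)"
    [(:identifier post)
     (:type post)
     (:published_at post)
     (pr-str post)]))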

Data Flow example

The input pipeline process starts with the scheduling of a source, stored in Cassandra like this:

{:identifier "9d6ea034de36e8da5b71b61ef4bd3a649983e9a7"
 :input "#kdtree"
 :type "twitter_search"}
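
Purely as an illustration of the scheduling step, the sketch below picks the sources that are due for a refresh and hands them over to the fetchers; the due? rule, the default 5-minute interval and the publish-fetch-job! hand-off are all hypothetical.

;; Hypothetical scheduler tick: on every run, sources that are due for a
;; refresh are pushed to the fetchers (in our case through a Kafka topic).
(defn due? [now source]
  (<= (+ (:last-fetched-at source 0)
         (:fetch-interval-ms source (* 5 60 1000)))
      now))

(defn schedule! [publish-fetch-job! sources]
  (let [now (System/currentTimeMillis)]
    (doseq [source (filter #(due? now %) sources)]
      (publish-fetch-job! source))))

;; Usage with the twitter_search source shown above:
(schedule! println
           [{:identifier "9d6ea034de36e8da5b71b61ef4bd3a649983e9a7"
             :input "#kdtree"
             :type "twitter_search"}])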

This source can then yield multiple posts, which are sent directly to the normalizer process. The normalizer receives a post as input, normalizes it and stores it in Cassandra. A typical post looks like this:

{:contributor_id "twitter:424242424242424"
 :identifier "tweet:4242"
 :meta_data {:retweet_count 0 :hashtags ["kdtree"]}
 :published_at #inst "2014-05-01T10:21:25.000-00:00"
 :status "look http://t.co/lb8teemAgR #kdtree"
 :type "tweet"
 :urls ("http://oliverro.com/algorithms/2011/06/21/practical-approach-kd-trees.html")
 :version "2"}
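
To illustrate the normalization step, here is a sketch that maps a raw tweet onto the internal format shown above. The raw field names follow the public Twitter API, but the exact mapping is an assumption made for this example, and date parsing is skipped for brevity.

;; Hypothetical normalizer for tweets: raw Twitter API fields in,
;; the internal post format shown above out.
(defn normalize-tweet [raw]
  {:contributor_id (str "twitter:" (get-in raw [:user :id_str]))
   :identifier     (str "tweet:" (:id_str raw))
   :meta_data      {:retweet_count (:retweet_count raw 0)
                    :hashtags      (mapv :text (get-in raw [:entities :hashtags]))}
   :published_at   (:created_at raw)   ; a real tweet carries a string to be parsed
   :status         (:text raw)
   :type           "tweet"
   :urls           (map :expanded_url (get-in raw [:entities :urls]))
   :version        "2"})

(normalize-tweet
  {:id_str        "4242"
   :text          "look http://t.co/lb8teemAgR #kdtree"
   :created_at    #inst "2014-05-01T10:21:25.000-00:00"
   :retweet_count 0
   :user          {:id_str "424242424242424"}
   :entities      {:hashtags [{:text "kdtree"}]
                   :urls     [{:expanded_url "http://oliverro.com/algorithms/2011/06/21/practical-approach-kd-trees.html"}]}})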

Posts yielded by a stream are ordered by publication time. To achieve this, we use a Cassandra wide row and leverage comparators so that we can run time-range queries:

input_streams_posts :
RowKey: 06431447f55557bc512e0715416abcdab0368255
=> (column=l@1398866895000:s@tweet:4242, value=, timestamp=1398866915761000)
=> (column=l@1398866891000:s@tweet:42429, value=, timestamp=1398866915761001)
=> (column=l@1398866890000:s@tweet:42423, value=, timestamp=1398866915761002)
=> (column=l@1398866889000:s@tweet:42421, value=, timestamp=1398866915761003)
=> (column=l@1398866888000:s@tweet:42423, value=, timestamp=1398866915761004)
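
The layout of this row is explained just below; as a small aside, here is a pure-Clojure, in-memory stand-in showing what such a time-range slice gives us (the real query is of course a Cassandra column slice):

;; In-memory stand-in for the wide row above: each column name is a
;; [publication-timestamp post-id] composite, kept sorted by time.
(def stream-columns
  [[1398866895000 "tweet:4242"]
   [1398866891000 "tweet:42429"]
   [1398866890000 "tweet:42423"]])

(defn posts-between
  "Hypothetical time-range query: keep the post ids whose publication
   timestamp falls inside [from-ms, to-ms]."
  [columns from-ms to-ms]
  (for [[ts post-id] columns
        :when (<= from-ms ts to-ms)]
    post-id))

(posts-between stream-columns 1398866890000 1398866895000)
;; => ("tweet:4242" "tweet:42429" "tweet:42423")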

The row key is our stream id, and the column names are composites containing the publication timestamp and the post id. After normalization, each URL contained in the post is sent to our extraction process, which does the following (a sketch follows the list):

  • fetch the URL
  • resolve potential shortened URLs
  • extract content, title, images, …
  • semantically analyze extracted content
  • produce a result entity called a paper_item, which is stored in Cassandra
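
The sketch below strings these steps together as plain functions. It is only a rough illustration: resolve-redirects, extract-article and analyze-topics are hypothetical stand-ins for our real components, and the page is fetched with Clojure's built-in slurp purely for brevity.

;; Hypothetical extraction pipeline: URL in, paper_item-shaped map out.
(defn resolve-redirects [url] url)       ; stand-in: follow t.co & other shorteners
(defn extract-article [html]             ; stand-in: boilerplate removal, title,
  {:title nil :content html})            ; image extraction, ...
(defn analyze-topics [content] [])       ; stand-in: semantic analysis

(defn extract-paper-item [url]
  (let [final-url (resolve-redirects url)
        html      (slurp final-url)      ; fetch the page
        article   (extract-article html)]
    {:url         final-url
     :initial_url url
     :title       (:title article)
     :content     (:content article)
     :topics      (analyze-topics (:content article))
     :type        "link"
     :crawled_at  (java.util.Date.)}))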

These paper_items can then be assembled and ranked by the publishing pipeline in order to produce Paper.li editions for our users. A stored paper_item looks like the following example:

{:category_score "0.55309199690795715"
 :content "I recently discovered and ..."
 :crawled_at "2014-05-05T09:52:42.445-00:00"
 :source_id "oliverro.com"
 :language "en"
 :title "Practical approach and requirements for a KdTree implementation"
 :language_score "0.9999973492568127"
 :url "http://oliverro.com/algorithms/2011/06/21/practical-approach-kd-trees.html"
 :uuid "5c70a487f28f47941790db1dd6c2ac8d144c20ed"
 :type "link"
 :provider_url "http://oliverro.com/"
 :initial_url "http://oliverro.com/algorithms/2011/06/21/practical-approach-kd-trees.html"
 :provider_name "oliverro.com"
 :published_at "2011-06-21T00:00:00.000-00:00"
 :topics ({:score 0.1167776386719197 :id "Algorithm"}
          {:score 0.09158502577338368 :id "Polygon"}
          {:score 0.0650094530719798 :id "Query_string"}
          {:score 0.06112491054227576 :id "Big_O_notation"}
          {:score 0.058464017522055656 :id "Sorting_algorithm"})
 :category "Science"}

Statistics

We listen to over 1 million streams added manually by our users and process a fairly comfortable amount of data daily. We aggregate stats using Graphite and sampling, which gives us important metrics and allows us to monitor the input pipeline’s behaviour via Nagios. On average, we schedule the fetching of 18,000 streams every 5 minutes, which can lead to up to 3 million posts seen by the input pipeline. Those posts are then analyzed, and we extract up to 400,000 URLs of potential articles to crawl in each 5-minute window.
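
For reference, Graphite accepts data points over a simple plaintext protocol (one "path value timestamp" line per metric, TCP port 2003 by default). The sketch below reports a single, hypothetical sampled counter; the metric name is made up for illustration, and a real sampler would also scale the reported value.

(ns sketch.graphite
  (:import (java.net Socket)
           (java.io PrintWriter)))

(defn send-metric!
  "Push a single data point to Graphite's plaintext listener."
  [host port path value]
  (with-open [socket (Socket. host (int port))
              out    (PrintWriter. (.getOutputStream socket) true)]
    (.println out (format "%s %s %d" path (str value)
                          (quot (System/currentTimeMillis) 1000)))))

;; Hypothetical sampled counter: report roughly 1 event out of 10.
(when (< (rand) 0.1)
  (send-metric! "graphite.example.org" 2003 "paperli.input.posts_seen" 1))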

Our input pipeline supports multiple languages: a wide variety of languages is encountered while crawling the URLs shared on social networks, although Russian and some other languages aren’t currently in production.

Coming next

Our next article in this series will focus on the publishing pipeline with details outlining the technical process for generating new editions for our users. Leave a comment below with your thoughts, questions or experiences. We look forward to hearing from you.

Interested in working with us?

We’d love to meet you. Check out our current job openings and drop us an email.

Authors: Olivier Bohrer, Reynald Borer

Image: the Paper.li word cloud was created from the 1,000 most-used English words in a pool of 2 million articles.
