Editor’s Note: This post is a continuation of the conversation started in “Taming Big Data with Smart Contexts” by Edouard Lambelet.


Last week, Ed gave you a glimpse of what’s coming and how we are approaching data overload at Paper.li.

The responses were thought-provoking and, just as we hoped, your input fueled a few discussions both online and offline. There were lots of ideas about data and providing real value that I thought would be interesting to share.

Let’s go through some of them.


Joining the data discussion


To begin with, let me highlight Jeremy Silver’s response, as he touched on various points including analytics, ethical data sales, connecting brands and users, and semantic analysis. These are all interesting ways to unleash the value in data, some just down the road for us and others certainly a bit further off. I will come back specifically to the point on semantics a little later in the post.

Paulo Caldeira’s reference to McLuhan was totally on point. Studying the medium and the context around it can give us some new directions in piecing the puzzle together. Paulo also had a very simple statement on how data could be given back:

“Well, to give the data back to people we don’t need more “underwares”, all we need are apps and projects that really get all connected giving context and localization.”

I agree with Paulo. There are many ways data can be given back to users, and his suggestion is a great idea. This is certainly where the “web-ecosystem” is headed and, in a way, already is. Realizing the full potential of such an approach may require some new industry standards to manage the discovery and interaction of these services. That means time… but we’ll be sure to keep this in mind and hopefully be able to participate, in our small way, in this global undertaking.


‘JW or What just changed’ touched on many of the questions we have been asking ourselves over the past while:

“What is the type of content readers like to see? what is the point of difference that appeals to them’ and ‘what of all the sources you index for us to use is the most current – i.e. content sources that are always refreshing their collections not leaving them to languish and are quality assured in some way”

This brings up something we have worked on over the past two months: finding the best way to profile the quality of a source and the average engagement potential of the articles it publishes. This is certainly another great way to bring data back to users, and we are looking to integrate just that in our new project… stay tuned.
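To make the idea of a source profile a little more concrete, here is a minimal sketch in Python. It is not our production code: the `Article` fields, the use of share counts as the engagement signal and the example sources are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Article:
    source: str  # hypothetical source identifier, e.g. a domain name
    shares: int  # hypothetical engagement signal (shares, retweets, ...)


def profile_sources(articles):
    """Return {source: average shares per article} for the given articles."""
    shares_by_source = defaultdict(list)
    for article in articles:
        shares_by_source[article.source].append(article.shares)
    return {source: mean(shares) for source, shares in shares_by_source.items()}


# Made-up sample data, for illustration only.
articles = [
    Article("example-news.com", 40),
    Article("example-news.com", 60),
    Article("example-blog.org", 5),
]
profiles = profile_sources(articles)
best_source = max(profiles, key=profiles.get)
```

A real profile would also need to account for how fresh a source is and how engagement changes over time; a plain average is simply the easiest starting point to reason about.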


Peter Young also made a great point regarding the “value” of the data which made us really think:

“What does all of this “collection” equal in money and “time saved” and a variety of other tangible metric points.”

While we would have to dig deeper into these questions in order to provide some meaningful answers, we can say that we are currently processing upwards of 1,000 articles daily per paper. Just imagine the time needed to find, browse and read so many articles… every day. That’s a lot of time saved!


Boyink raised a valid question about framing the discussion: what exactly are we thinking of?

Our thoughts at this moment are open… we know we can do many things with the data we have. Rather than just thinking about it, we believe it is best to try out new ideas. We are humble enough not to believe we have all the answers, but nimble enough to iterate through quick trials – to learn by doing. We are looking to develop some simple products, on a regular basis, which would bring value outside of the publishing context and of course bring new features to the Paper.li service itself in the long run.


Food for thought


So, as Boyink asked, what exactly are we thinking of?

As Ed said, we have built a platform that processes a LOT of content and social signals from various sources, including Twitter, Facebook, Google+ and RSS. For us, Twitter remains the largest provider of all that data, so to get started we decided to look at the Twitter data ecosystem from new angles.

Up until now, we have focused on solving the issue of “social overload” for our users by analysing the never-ending and ever-increasing stream of tweets, articles, photos and videos. Once analysed, we pick the most important and relevant items and publish them in a newspaper format.

But now it’s time to ask the question:

“What else should, and can, be done to improve the quality and relevancy of the chosen content?”

In the Twitter ecosystem it’s all about followers… followers interact with the shared content. Interesting content keeps them listening… irrelevant content ends up in unfollows.

How well does the average Twitter user know their community? Just take a moment to think about that yourself. How well do you know your followers? Are you aware of their interests, what they want to read, retweet and what inspires them to share? This important part of the equation has been somewhat overlooked, until now.

Matching high-quality, relevant content to what audiences want to read, based on analysis and not speculation, is something that would bring great value to the content world.

The final layer in improving the shareability of content has to be the source. Where does the article come from? How relevant and engaging are the articles from that source over a given period of time? Factoring in both audience and sources would certainly bring that extra value to those looking for content to share.

We believe this is something we can do and, more to the point, something we have actually started to work on. Probably the biggest part of solving this is finding a way to capture the interests of a user’s followers in such a way that we are able to find matching content. This has to do with semantic analysis. For the past few months, we have been working on adding a much more fine-grained topical analysis of articles, starting with English content. This should allow us to create more precise topical interest profiles for any and all users. And from there, well, you will have to wait just a bit longer to see what we have in store.
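For the curious, matching content to an audience can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of our actual system: it assumes each follower and each article is already described by a hypothetical dictionary of topic weights, and it swaps in plain cosine similarity as the matching measure.

```python
from collections import Counter
from math import sqrt


def audience_profile(follower_topics):
    """Sum per-follower topic weights into one audience-level profile."""
    profile = Counter()
    for topics in follower_topics:
        profile.update(topics)  # adds weights topic by topic
    return profile


def cosine(a, b):
    """Cosine similarity between two {topic: weight} mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_articles(profile, articles):
    """Return article titles sorted from best to worst match."""
    return sorted(
        articles,
        key=lambda title: cosine(profile, articles[title]),
        reverse=True,
    )


# Made-up topic weights, for illustration only.
followers = [
    {"cycling": 1.0, "travel": 0.5},
    {"cycling": 0.5, "cooking": 1.0},
]
articles = {
    "Tax law update": {"finance": 1.0},
    "Alpine bike routes": {"cycling": 1.0, "travel": 1.0},
}
profile = audience_profile(followers)
ranked = rank_articles(profile, articles)
```

The hard part, of course, is producing good topic weights in the first place, which is where the fine-grained topical analysis comes in.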

To get back to Jeremy’s initial mention of semantics, I believe that this is where the fun starts…


What do you think about knowing your followers? How important are the sources for you?



With a little help from my friends


Thanks again to all of you for your great input on Ed’s post. I hope more of you will be encouraged to share your thoughts and ideas over the coming weeks. We are committed to making your data work better and harder for you, and we would like to continue the discussion. To do just that, I’d like to invite you to join me and some of the team at a Google Hangout on Feb 11th at 2 PM ET (20:00 CET). Take a moment to meet some of the team, hear our ideas about what we could do with data and, importantly, share your ideas with us. If you are interested in joining the hangout, just add a comment to this post or drop us a mail at backstage@paper.li.


If you haven’t done so yet, please sign up to Backstage at Paper.li to keep the discussions going and help shape the future together.


Iskander Pols
Natural resource...Paper.li co-founder...telemarker


One Response to “Opportunities in Big Data”

  1. Paul Zecher

    I don’t know. To me there is too much talk about talk. Or rather data. I mean, I think it gets sort of repetitive. It’s sort of like the freedom of speech talk. We need to protect our freedom of speech…which is sort of funny since they are usually way off, since it’s really an inalienable right. But, for instance: I have been reading Singapore Straits – I know it was called that – it’s a paper that the British had in Singapore when they were practicing their Adam Smith…it’s a great read and the “data” (I would never call it that, since it’s much more: it has depth and humor and, of course, a healthy dose of sarcasm). And it presents a very, very, very different view from our own, and one that actually presents facts and is honest about what happened or what’s going on. And it’s quite an eye opener….but here they recently changed the masthead and later some of it is in Chinese….At least last time everything before July 23rd 1937 was in English and everything after was Chinese (just the instructions)….but the paper itself had obviously been tampered with. It was obvious and it was sloppily done…which is often the case with people who work on “Structure” and “Orders”. They can be amazingly efficient, but if the orders are odd, or the one carrying them out is not clear on the order…or (I have a case I’m thinking of) where it’s two people and the orders seem to contradict, they can actually totally confuse people with the result, since on one hand it’s done with precision and on the other it’s like – how did they screw that up….if you think…orders, then it makes perfect sense. I mean, the Singapore Straits ALWAYS had ads in the front. Never had them anywhere else….well…always in the front, which is sort of cool since I can actually see myself perusing the ads if a paper today did it that way. You’d have to take pride in your town…like Boston or San Fran…and Singapore. But the tone also is different and the print is too obviously newish.
    I don’t know if I managed to push back on this, but it seems at least some of it is in its original form. (Always yell fire) when trying to get attention…don’t yell help. Fire!
    But my thing is that data is really not much. I mean, what has data done to help us solve Alzheimer’s…I knew someone working on Alzheimer’s…I grew up in Cambridge and he was excited about solving this puzzle…to me there’s got to be so much data that points to something obvious…it’s sort of like cigarettes, there was so much data that they didn’t need a “smoking gun”. And even though it’s maybe a bit more complicated, it seems that we should have a pretty darn accurate understanding of what is causing it. Not a cure, but a very good clue…since these patients are perfect for supplying facts about their lives, since it is perhaps the wife or son or daughter that can give a complete history….maybe more data than needed.
    I just wish we would stop talking in generalities and deal with specifics…even something as obvious as teaching math slowly…so kids can think a little. Or a different way to present data than textbooks…it doesn’t have to be controversial….but talk about data is, like I said, like talking about the need for free speech…to wit, I always say that they just put that in to make sure the dim ones knew that!!! It’s an inherent right, which causes confusion…like the Bill of Rights…remember they only put that in to appease some states. If you use critical thinking, at least 8 or 9 of them are stuff that they already thought was obvious, based on English law!

