From page layout to database: Enhancing content for FPinfomart
Posted by Jennifer Stein on March 3, 2010
Earlier in the week, I posted a description of how the sources that make up FPinfomart are selected. Today’s post addresses the technical aspect of “getting content into FPinfomart.”
As I was composing the first post in this series on “how content gets into FPinfomart,” I had intended to describe only the licensing issue. As I wrote, however, it occurred to me that the actual process of moving content from a variety of media into a single, uniform format may also be a bit of a mystery to many of our users. Today’s post attempts to lift the veil and give you some insight into just how we make 4,700+ sources searchable, consistently formatted, and usable in a database.
We have many sources for our data. Some of it comes to us from third parties (publishers, aggregators, etc.), while other content is produced here. Some content arrives pre-formatted, some is run through various automatic conversion programs, and some is converted through more manual processes.
This post will focus on a process called “Enhancing,” which is how we turn a page formatted for a printed newspaper into separate articles in the FPinfomart database. We Enhance all the Canwest-owned newspapers, as well as close to 100 additional community papers for whom we provide this service (without which they would have no digital archive of their own newspaper).
Once the paper is sent to the printing press, an electronic copy is sent to us in either PDF or Quark format. Since this file is laid out for printing, we need to do some work not only to separate out all the individual stories, images, and ads, but also to ensure that the various parts of each story are correctly identified: the headline, the byline, the page number, the name of the section in which it appeared, and so on. These parts correspond to the fields in our database.
To do this, we use a piece of custom-built software. We run the formatted pages through this software, which automatically detects upwards of 80% of this information and places it in the correct field. The data that the software can’t detect (and some that needs correcting) is handled by our skilled team of Enhancers, individuals trained to get the right data to the right place quickly, accurately, and often in the middle of the night. We have Enhancers working around the clock, because the news never sleeps!
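For the technically curious, here is a rough sketch of the idea in Python. It is not our actual Enhancing software: the `ArticleRecord` class, the `fields_needing_review` function, and the sample values are all hypothetical, and only the field names (headline, byline, page, section) come from the description above. It simply shows how a record with some fields filled in automatically might be checked for the gaps that an Enhancer would then complete.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical article record; not the real FPinfomart schema."""
    headline: Optional[str] = None
    byline: Optional[str] = None
    page: Optional[str] = None
    section: Optional[str] = None
    body: Optional[str] = None


def fields_needing_review(record: ArticleRecord) -> List[str]:
    """Return the names of fields that automatic detection left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example: the software found everything except the byline.
detected = ArticleRecord(
    headline="City council approves budget",
    page="A3",
    section="News",
    body="...",
)
print(fields_needing_review(detected))  # ['byline']
```

In a setup like this, anything the function reports would be routed to a person for completion, while fully populated records could go straight to a correctness check.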
Many of the larger papers that we Enhance here arrive in our Enhancers’ inbox shortly after 2:00 a.m. and are online by 4:00 a.m., which often means you can read a copy of these papers on FPinfomart before they’re available in print.
There are over 70 million documents in the FPinfomart database, and we add between 30,000 and 40,000 new ones every single day. And, for each of these documents, you can search in standardized fields – such as headline, byline, source, and section. It’s an amazing undertaking. I’ll take a moment to thank our Enhancing team and our technical staff for providing us with some order in this potential chaos. It is because of the routines they’ve created that we are able to find the data we need, without being lost in information overload!
Are there any other mysteries of FPinfomart that you’d like to see described here? I’d love to hear your suggestions on what other inner workings you’re wondering about. Use the comments to let me know what interests you! Consider this your backstage pass.
3 Responses to “From page layout to database: Enhancing content for FPinfomart”



Michelle said
As a throw-back to the first article in this series, I’m interested in learning more about which content is included in Infomart. For example, Infomart offers the National Post as a source, but not all articles from the National Post are in the database. Can you tell me what the criteria are for whether articles are included? Does it always have to do with copyright? Is there any way that an Infomart user can determine whether articles from a subscribed source are missing?
Thanks very much!
fpinfomart said
Thanks, Michelle – that’s a great idea for a post. I’ll try to put something together on this topic. Watch for it later this week.
Coverage and Copyright « Intelligence you can count on said
[...] by Jennifer Stein on March 11, 2010 A reader posed the following question in response to our previous post on enhancing FPinfomart content: “As a throw-back to the [...]