What is Document Content Analysis? How it differs from Document Management. (Part 1)

Fergal McGovern

CEO & Founder


(This is the first of two blog posts.)

We work with corporate entities in both the IT space and the bid and proposal management space. With few exceptions, these organizations currently use, or are looking to use, Microsoft SharePoint. VisibleThread can use its own repository, based on Subversion, or can equally integrate with SharePoint, so we work with many SharePoint customers.

In any case, we’ve seen a number of organizations deploy SharePoint in ‘vanilla’ form or with low levels of structural enforcement. They often take the basic free edition of SharePoint, called Windows SharePoint Services, and deploy it with fairly minimal configuration.

This tends to lead to a ‘wild west’ scenario, where pretty much anything goes. The almost certain indicator that you are there is when you ask a colleague for a document and the response is: “it’s out on SharePoint”. You have no clue as to where it might be. What site? What folder? You may just throw your hands in the air, sigh in exasperation and end up trawling through your e-mail threads or hard drive for the document. Well, at least you found a version of the document, right?

So, if we don’t apply logical categorization structures and signpost them well, a content management system on its own will add little value, because the vast majority of people will bypass recommended procedures. You really can’t blame them. Your data hygiene, however, will quickly go down the toilet. Visibility, efficiency and consistency suffer when we cannot be sure where content lives. This is not the fault of the document management system; rather, it is a human problem.

Take me as an example. The file names of our own internal collateral on the VisibleThread marketing file share might look OK at first glance, but they are in fact quite inconsistent:

  • ‘VT’ versus ‘VisibleThread’ (e.g. ‘VT-Bid-Proposal-Brochure-Dec-2010.pdf’ as against ‘VisibleThread – Feature-List.pdf’)
  • date-based versioning as against numbered versioning (v1, v2)
  • and so on, can you spot the other anomalies?

So, as much as I might consider our organization fairly structured, with ‘reasonable’ naming conventions, I am in fact failing to ensure consistency, even though the documents are under my control and the setting is relatively small.
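The two inconsistencies listed above are the kind of thing a small script can flag. What follows is only a rough sketch in Python, not anything VisibleThread ships: the function names, the regular expressions and the third file name are assumptions made up for illustration. Only the first two file names come from the list above.

```python
import re

# Illustrative patterns (assumptions): "Dec-2010" style dates and "v2" style numbers.
DATE_VERSION = re.compile(
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4}", re.IGNORECASE
)
NUMBER_VERSION = re.compile(r"(?<![A-Za-z0-9])v\d+(?![A-Za-z0-9])", re.IGNORECASE)


def describe(filename):
    """Return the prefix and versioning convention a file name appears to follow."""
    if filename.startswith("VisibleThread"):
        prefix = "VisibleThread"
    elif filename.startswith("VT"):
        prefix = "VT"
    else:
        prefix = "other"

    if DATE_VERSION.search(filename):
        version = "date"
    elif NUMBER_VERSION.search(filename):
        version = "number"
    else:
        version = "none"
    return {"prefix": prefix, "version": version}


def find_inconsistencies(filenames):
    """Report each convention on which the given file names disagree."""
    report = {}
    for key in ("prefix", "version"):
        styles = {describe(name)[key] for name in filenames}
        if len(styles) > 1:
            report[key] = sorted(styles)
    return report


files = [
    "VT-Bid-Proposal-Brochure-Dec-2010.pdf",  # from the list above
    "VisibleThread – Feature-List.pdf",       # from the list above
    "VT-Datasheet-v2.pdf",                    # hypothetical, for illustration only
]
print(find_inconsistencies(files))
# {'prefix': ['VT', 'VisibleThread'], 'version': ['date', 'none', 'number']}
```

A check like this only tells you that conventions have drifted. Deciding which convention wins, and getting people to follow it, is still the human problem described earlier.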

Scale this up to larger organizations with many people, possibly geographically dispersed, and you quickly arrive at a major ‘discovery’ headache.

So now, let’s assume you’ve had the consultants in or have spent time configuring the SharePoint deployment internally. You now have a reasonable library/folder structure, conventions are in place for versioning, and everyone knows where every document is neatly stored. Great, job done!

Wait, not so fast. How do we actually know that the documents are any good?

  • Can we easily understand the intent of documents?
  • How do we know we have adequate coverage of key concepts?
  • What are the dependencies between documents?
  • In a bid proposal scenario, do we have the right document heading structure?
  • Do we have enough content relating to ‘win themes’ for bids?

What remains hidden is the content within the documents. Is it really any good? Will it deliver a great IT system, or win the bid?
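To make the ‘win themes’ question a little more concrete, here is a deliberately naive Python sketch that counts how often each theme phrase appears in a piece of text. It is plain keyword counting and not how any particular product works. The function names, the sample text and the theme phrases are all assumptions invented for illustration.

```python
import re


def theme_coverage(text, themes):
    """Count case-insensitive, whole-phrase mentions of each theme in the text."""
    counts = {}
    for theme in themes:
        pattern = r"\b" + re.escape(theme) + r"\b"
        counts[theme] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def missing_themes(text, themes, minimum=1):
    """List the themes mentioned fewer than `minimum` times."""
    counts = theme_coverage(text, themes)
    return [theme for theme in themes if counts[theme] < minimum]


# Hypothetical proposal excerpt and win themes, for illustration only.
proposal = (
    "Our low risk approach reduces cost. The team has proven delivery "
    "experience and a low risk transition plan."
)
win_themes = ["low risk", "proven delivery", "innovation"]

print(theme_coverage(proposal, win_themes))
# {'low risk': 2, 'proven delivery': 1, 'innovation': 0}
print(missing_themes(proposal, win_themes))
# ['innovation']
```

Counting phrases says nothing about intent, dependencies or heading structure, so it covers only one of the questions above, and only crudely.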

The next post will consider this further, focusing on ‘discovery’ and ‘concept mining’. It will show how document content analysis works and how it complements document management to ensure top-class documents. Stay tuned.

