Sunday, 3 June 2018

Why data goes dark and why this matters



There is an interesting parallel between the pair matter and dark matter on the one hand and the pair data and dark data on the other. The counterpart of "dark matter" is not "light matter" or "bright matter", but simply "matter". The contrast is thus not between "dark" and "light", but between dark matter and matter proper. In fact, the modifier "dark" does not characterize a property of matter itself, but describes an epistemic or cognitive state: a gap in our ability to understand the nature of this kind of matter, in contrast to the matter we can perceive with our senses.
Let’s turn to data. Data, seen as an accumulation of information, has existed for as long as humankind. It was crucial for our survival to produce, collect and store data in our brains: data about our environment, about the nearest source of water, the color of edible fruits; data about the shape of a face and the shape of a fang. Data is the most natural, basic element of survival. Data as we understand it today has moved from the brain to the computer, which serves as a kind of extension of our brain with its limited capacity. Shifting its place of storage and its means of production, however, has not affected its importance.
Data accumulated over time and across experiences still forms the basis for informed decision making. When data becomes so abundant and so hard to handle, so varied, heterogeneous and massive that it exceeds our cognitive capabilities, it becomes "dark": it escapes our senses and represents a gap in our knowledge, because we cannot get access to it. Thus, much like the modifier "dark" in "dark matter", the term "dark" in "dark data" does not characterize a property of data itself, but describes our inability to make sense of data.

Data thus becomes dark when we lose the ability to handle it. But when does this happen? Let’s have a closer look at the nature of data:

Types of data can be placed on a continuum with three regions:

<——————————————————————————————————————>

structured              semi-structured            unstructured

Structured data is found, for instance, in database tables or Excel spreadsheets.
  • This type of data is formatted into a data model with a formal structure, so that its elements can be addressed, organized and accessed in various combinations. This rigid organization makes structured data easily accessible to rather simple search algorithms. It is thus easy to evaluate and to exploit.

Semi-structured data can be found, for instance, in XML or JSON.
  • This type of data is neither organized in the formal structure of a data model nor completely unstructured. It may contain elements that enforce hierarchies of records or delimit fields within the data. These elements can be tags or markers, e.g. to separate semantic elements. This kind of structure is also known as self-describing, because the structure emerges from the composition of the data itself rather than from a rigid, prescribed model into which the data is formatted (as is the case for structured data).

Unstructured data comprises documents, in particular any texts including emails, as well as graphics, sensor data, videos, images, etc.
  • This type of data is organized neither in a data model nor in any other predefined manner. Typically, it contains a lot of text, but there may also be numbers, dates and other facts. This hodgepodge of different data entails irregularities and ambiguities, which make it hard for traditional analysis tools to make sense of the data.
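To make the three regions of the continuum more concrete, here is a minimal Python sketch. All records, field names and the sample sentence are invented for illustration and are not taken from any real dataset. It shows that structured data can be addressed by column, semi-structured data by self-describing keys that may differ between records, and free text by neither.

```python
import csv
import io
import json

# Structured: a fixed schema, every row has the same columns (hypothetical data).
structured_csv = "patient_id,drug,dose_mg\n1,aspirin,100\n2,ibuprofen,400\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
doses = [int(row["dose_mg"]) for row in rows]  # address a column directly

# Semi-structured: keys describe the data, but records may differ in shape.
semi_structured = json.loads(
    '[{"id": 1, "drug": "aspirin", "notes": {"side_effects": ["nausea"]}},'
    ' {"id": 2, "drug": "ibuprofen"}]'
)
# Fields can be missing, so access has to tolerate gaps.
side_effects = [
    record.get("notes", {}).get("side_effects", []) for record in semi_structured
]

# Unstructured: free text with no fields to address at all.
unstructured = "Took 100 mg aspirin yesterday, felt a bit sick afterwards."
# A naive keyword lookup misses the side effect, although a human reader sees it.
mentions_nausea = "nausea" in unstructured.lower()

print(doses)
print(side_effects)
print(mentions_nausea)
```

The last check is the interesting one: the sentence reports a side effect, yet a simple string match does not find it. That gap is where methods for natural language text come in.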

Now that we have examined data, we can understand how it becomes dark:
Generally, dark data can emerge from any part of the data continuum: structured, semi-structured or unstructured. However, most dark data emerges from the area of unstructured data. This is because data from unstructured sources cannot be processed using standard data operations such as filtering, projecting, joining, aggregating or averaging. Hence, the valuable information enclosed in the data is not processed. Modern techniques that are still under development, such as data mining, natural language processing and text analysis, are needed to find patterns and interpret this type of data. Textual content is particularly challenging, as it requires so-called machine reading, that is, the ability of a machine to make sense of natural language text.
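The standard operations named above can be sketched in a few lines of plain Python. The two small tables and all field names are hypothetical and serve only to illustrate the point: each operation relies on rows having known keys, which is exactly what free text lacks.

```python
from statistics import mean

# Hypothetical structured tables: lists of rows with a fixed set of keys.
patients = [
    {"patient_id": 1, "age": 34},
    {"patient_id": 2, "age": 61},
    {"patient_id": 3, "age": 47},
]
prescriptions = [
    {"patient_id": 1, "drug": "A", "dose_mg": 100},
    {"patient_id": 2, "drug": "A", "dose_mg": 200},
    {"patient_id": 3, "drug": "B", "dose_mg": 50},
]

# Filtering: keep only the rows that satisfy a condition.
over_40 = [p for p in patients if p["age"] > 40]

# Projecting: keep only selected columns.
drug_column = [r["drug"] for r in prescriptions]

# Joining: combine rows from both tables on a shared key.
age_by_id = {p["patient_id"]: p["age"] for p in patients}
joined = [{**r, "age": age_by_id[r["patient_id"]]} for r in prescriptions]

# Aggregating and averaging: group rows by drug, then average the dose.
doses_by_drug = {}
for row in joined:
    doses_by_drug.setdefault(row["drug"], []).append(row["dose_mg"])
mean_dose = {drug: mean(doses) for drug, doses in doses_by_drug.items()}

# The same facts as free text offer no keys or columns to operate on.
note = "Patient two, aged sixty-one, was given 200 mg of drug A."

print(over_40, drug_column, mean_dose)
```

None of the operations above can be applied to `note` directly. The text first has to be turned into rows like those in `prescriptions`, which is the task of machine reading.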


Data thus becomes dark when:

  • it is left behind by processes
  • it is dismissed as valueless although it would be highly valuable if analyzed properly
  • there is no tool to capture and unlock the hidden information
  • the sheer amount exceeds the capacity for analysis
  • the available or feasible methods of analysis can only access structured data sets


Dark matter is matter that is invisible, or dark, to standard astronomical equipment because it does not seem to interact with observable electromagnetic radiation such as light. This is similar to "dark data", which is likewise invisible to the standard tools and instruments for analytical processing, which were developed mainly for structured data. And there is a further interesting parallel: while dark matter is believed to account for roughly 80% of all the matter in the universe, unstructured data is commonly estimated to make up about 80% of a company’s data.

Are you ready to bring light into your dark data? Semalytix can help.




Saturday, 24 March 2018

Dark Data Understanding for Extracting Real World Evidence

Real world data (RWD) and real world evidence (RWE) are playing an increasing role in health care decisions as they provide an understanding of the value of a medical product beyond the controlled settings of clinical studies and clearly circumscribed subpopulations.

In the era of value-based care, the focus of players in the health care system is shifting from regulatory approval by agencies such as the FDA towards a comprehensive assessment of the added value of a medical product in terms of patient outcomes.

Real-world evidence can help to identify which patients will get the most value from a therapy, based on their genetic, social and lifestyle footprint that is typically not captured in clinical trials. In particular, RWE has the potential to deliver a more comprehensive picture and deeper understanding of the safety, effectiveness and economics of a drug product.  Ultimately, it might yield insights that contribute to improving medical products based on unmet needs.

Real World Evidence can help to
  • monitor postmarket safety and adverse events
  • contribute to understanding the value of a drug for patients, HCPs, insurers and regulatory agencies
  • generate data to support coverage decisions and to develop guidelines and decision support tools for use in clinical practice
  • support the design of clinical trials

However, real-world evidence is typically not readily available, but is hidden in diverse datasets including:

•    Electronic health records (EHRs)
•    Lab Data: Genomic Data, Tissue Pathology, Lab Tests, etc.
•    Medical Claim Data, Insurance Company Data
•    Social Media and Patient Networks (e.g. Twitter, Blogs, PatientsLikeMe)
•    HCP Interview Data
•    Patient-reported outcomes
•    Health-monitoring devices

Most of these datasets are unstructured, containing large amounts of raw text that needs to be analyzed, structured and homogenized to make it accessible for analysis and to distill the "evidence" from it. This requires natural language processing methods that are able to extract key insights.

Semalytix excels in this respect, having substantial experience in analyzing unstructured data in the healthcare domain to generate insights that allow a deeper understanding of:

  • Self-reported patient value: the value of a drug for patients as self-reported on various platforms including experience with adverse events and how they dealt with them
  • HCP assessment of level of evidence: the assessment of HCPs of their experience with drugs, reporting on their level of trust, assessment of evidence, problems, lessons learned, etc.
  • Data on special populations: the performance of drug products in special subpopulations that are not considered in studies, e.g. patients with mixed conditions or several unsuccessful prior therapies
  • Patient histories: fostering an understanding of which drugs are typically taken in which phase of a disease, together with which other drugs, etc.


The technology stack developed by Semalytix puts evidence contained in data sources as mentioned above at your direct disposal using easy-to-understand visual analytics.

Wednesday, 13 December 2017

Conference on Semantics, Data and Analytics

Last week I attended the Bayer-hosted conference on Semantics, Data and Analytics. It was a high-profile event with many interesting invited speakers including Harald Sack from Karlsruhe Institute of Technology (giving a nice intro to the Semantic Web), Steffen Lohmann from Fraunhofer IAIS talking about visual analytics, Martin Hoffmann-Apitius talking about data integration for biology as well as myself talking on vertical knowledge graphs.

The question of how to create knowledge graphs cost-effectively for the purpose of data integration was very much in the air. I was impressed to see how present the topic of data integration was, not only at Bayer but at all the pharmaceutical companies represented there. In my talk, titled 'Domain-specific knowledge graphs for knowledge management and knowledge discovery', I emphasized that all data integration activities require clear use cases and competency questions in order to scope a project adequately and get the most out of the data. But cost is an issue. Semantics and ontologies are key to data integration, and they give rise to five principles for semantic data integration:


  • Normalization: Semantics is inherently reductionist, abstracting from details and focusing on commonalities. Practically, this is achieved by mapping data to an agreed-upon set of IDs and vocabulary elements.
  • Reuse: Normalization is achieved by reusing vocabulary and IDs. Do not invent your own IDs or vocabulary elements if suitable elements already exist. Otherwise, integration will simply not happen.
  • Commitment: Commitment is about agreeing to understand a certain concept in the same way as the stakeholder who introduced that concept. This works without formal axiomatic definitions, just as it works when we speak: we can exchange messages in natural language without formally agreeing on the definition of every single word.
  • Grouping: Normalization and typing make it possible to group different entities together at a certain level of abstraction. This is key for aggregation (see below), that is, computing summaries for data that are grouped according to some criterion.
  • Aggregation: The ultimate goal of any semantic data integration exercise. At the end of the day, we are less interested in the single data point than in aggregating all data points or entities that share certain characteristics or features, and in providing informative summaries or statistics for the aggregated elements.
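As a rough illustration of how normalization, grouping and aggregation build on one another, consider the following Python sketch. Everything in it is hypothetical: the `DRUG:...` identifiers are invented placeholders (in practice one would reuse IDs from an existing vocabulary, as the reuse principle demands), and the records and the `normalize` helper are made up for the example.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific labels to shared IDs.
# The "DRUG:..." IDs are invented; real projects should reuse existing IDs.
SHARED_IDS = {
    "aspirin": "DRUG:0001",
    "acetylsalicylic acid": "DRUG:0001",
    "asa": "DRUG:0001",
    "ibuprofen": "DRUG:0002",
}

# Hypothetical records from three sources that name the same drug differently.
records = [
    {"source": "ehr", "drug": "Aspirin", "adverse_events": 2},
    {"source": "claims", "drug": "acetylsalicylic acid", "adverse_events": 1},
    {"source": "forum", "drug": "ASA", "adverse_events": 4},
    {"source": "ehr", "drug": "Ibuprofen", "adverse_events": 3},
]


def normalize(label):
    """Map a raw label to its shared ID, or None if the label is unknown."""
    return SHARED_IDS.get(label.strip().lower())


# Grouping: collect the records that refer to the same normalized entity.
groups = defaultdict(list)
for record in records:
    groups[normalize(record["drug"])].append(record)

# Aggregation: compute a summary per group instead of looking at single points.
summary = {
    drug_id: {
        "records": len(group),
        "adverse_events": sum(r["adverse_events"] for r in group),
    }
    for drug_id, group in groups.items()
}

print(summary)
```

Without the normalization step, the three spellings of the same drug would end up in three separate groups and the aggregate would be misleading.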


I also talked about the challenges of incorporating unstructured, textual information into knowledge graphs via text mining. Errors are unavoidable when using machine reading or information extraction techniques. On the other hand, by deploying machine reading, we are able to ingest information from text at a speed and scale that no single human could ever match.

So where is the sweet spot in the trade-off between being able to "machine read" a large number of documents and having to live with errors?

The panel after the talks on the morning of the 7th of December was very informative and lively. There were very interesting discussions on the role of foundational or upper ontologies in data integration, the cost of integrating data using knowledge graphs compared to a standard data warehouse approach, the challenge of dealing with datasets and vocabularies that constantly evolve, and the questions of how to implement quality assurance and quality control over an evolving knowledge graph and how to effectively involve users in this process. Big questions!

 It was a great conference. It was a pleasure to speak to the audience, a very interesting and knowledgeable audience indeed. The post-its all around and all the brainstorming going on were really inspiring and fruitful. When is the next edition?

Sunday, 1 October 2017

The impact of AI on customer relationship management

In a recent report, the International Data Corporation (IDC) estimates that artificial intelligence (AI) technology applied to customer relationship management (CRM) might boost global business revenue by about $1.1 trillion from 2017 to 2021.

In particular, AI-driven CRM might lead to the creation of 800,000 direct jobs and 2 million indirect jobs. The year 2018 is likely to turn out to be the major year for AI adoption.

Significant amount of work activities replaceable by analytics and machine learning

In its 2015 study "Four fundamentals of workplace automation", McKinsey found that 45% of 2,000 work activities performed across all occupations in the economy, associated with $14.6 trillion in wages, have the potential to be automated using machine learning technology.

Potential of analytics remains high, but progress is slow

In its report "The age of analytics: Competing in a data-driven world" from December 2016, McKinsey concludes that the potential of analytics technologies remains as high as identified in its 2011 report "Big data: The next frontier for innovation, competition, and productivity". Nevertheless, progress and adoption have been slower than that report anticipated.

Thursday, 25 May 2017

Detecting right-wing extremism online

Hate speech, fake news and right-wing agitation: anyone who follows discussions on social networks or in comment sections quickly comes across questionable posts. Right-wing extremists have long since discovered Web 2.0 as an instrument for their propaganda. Social media helps them to network and to recruit new followers.

For our cooperation partners and clients at the Kompetenzzentrum Rechtsextremismus (KomRex) of the University of Jena, social networks are therefore an interesting field of research.