RWF Consulting LLC
Advancing Business Processes Worldwide

Big Data

Cancer’s Big Data Problem


The US Department of Energy (DOE) and the National Cancer Institute will use DOE supercomputers to help fight cancer by building sophisticated models based on population, patient, and molecular data, explains “Cancer’s Big Data Problem,” from IEEE Computing in Science & Engineering.

PDF Link to Cancer’s Big Data Problem


Becoming a Data Scientist

Data science, machine learning, big data analytics, cognitive computing: all of us have been buried under an avalanche of articles, skills-demand infographics, and points of view on these topics (yawn!). One thing is for sure: you cannot become a data scientist overnight. It’s a journey, and certainly a challenging one. But how do you go about becoming one? Where do you start? When do you start seeing light at the end of the tunnel? What is the learning roadmap? What tools and techniques do you need to know? How will you know when you have achieved your goal?

Given how critical visualization is for data science, it is ironic that I could find only a few pragmatic yet visual representations of what it takes to become a data scientist. So here is my modest attempt at creating a curriculum, a learning plan that you can use on the journey to becoming a data scientist. I took inspiration from metro maps and used one to depict the learning path. I organized the overall plan progressively into the following areas / domains:

  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Each area / domain is represented as a “metro line,” with the stations depicting the topics you must learn, master, or understand in a progressive fashion. The idea is that you pick a line, catch a train, and go through all the stations (topics) until you reach the final destination or switch to the next line. I have numbered the lines 1 through 10 to indicate the order in which you travel. You can use this as an individual learning plan to identify the areas you most want to develop and then acquire those skills. This is by no means the end, but it is a solid start.


Delivering value from big data with Microsoft R Server and Hadoop


Businesses are continuing to invest in Hadoop to manage analytic data stores due to its flexibility, scalability, and relatively low cost. However, Hadoop’s native tooling for advanced analytics is immature; this makes it difficult for analysts to use without significant additional training and limits the ability of businesses to deliver value from growing data assets.

Microsoft R Server leverages open-source R, the standard platform for modern analytics. R has a thriving community of more than two million users who are trained and ready to deliver results.

Microsoft R Server runs in Hadoop. It enables users to perform statistical and predictive analysis on big data while working exclusively in the familiar R environment. The software uses Hadoop’s ability to apportion large computations, transparently distributing work across the nodes of a Hadoop cluster. Microsoft R Server works inside Hadoop clusters without the complex programming typically associated with parallelizing analytical computation.
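To make the idea of apportioning a large computation concrete, here is a minimal sketch of the general split, summarize, and combine pattern. It is not Microsoft R Server's API and does not use Hadoop. It is written in Python rather than R, local threads stand in for cluster nodes, and all function names (`split`, `partial_stats`, `combine`, `distributed_mean`) are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple


def split(data: List[float], n_chunks: int) -> List[List[float]]:
    """Cut the data into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk: List[float]) -> Tuple[int, float]:
    """Per-worker step: summarize one chunk as (count, sum)."""
    return len(chunk), sum(chunk)


def combine(parts: Iterable[Tuple[int, float]]) -> float:
    """Final step: merge the per-chunk summaries into a global mean."""
    total_n = 0
    total_sum = 0.0
    for n, s in parts:
        total_n += n
        total_sum += s
    if total_n == 0:
        raise ValueError("no data")
    return total_sum / total_n


def distributed_mean(data: List[float], n_workers: int = 4) -> float:
    """Compute a mean by farming chunks out to workers.

    Threads are an assumption standing in for cluster nodes; only the
    small (count, sum) summaries travel back to the caller.
    """
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(partial_stats, chunks))
    return combine(parts)
```

Calling `distributed_mean(values, n_workers=3)` gives the same mean as a single pass over `values`. The caller never handles the chunking, which is the kind of transparency the paragraph above describes.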

By leveraging Microsoft R Server in Hadoop, organizations tap a large and growing community of analysts, and all of the capabilities in R, with a true cross-platform, open-standards architecture.



Accelerating R analytics with Spark and Microsoft R Server for Hadoop


Analysts predict that the Hadoop market will reach $50.2 billion USD by 2020 [1]. Applications driving these large expenditures are some of the most important workloads for businesses today, including:

• Analyzing clickstream data, including site-side clicks and web media tags.

• Measuring sentiment by scanning product feedback, blog feeds, social media comments, and Twitter streams.

• Analyzing behavior and risk by capturing vehicle telematics.

• Optimizing product performance and utilization by gathering data from built-in sensors.

• Tracking and analyzing people and material movement with location-aware systems.

• Identifying system performance and intrusion attempts by analyzing server and network logs.

• Enabling automatic document and speech categorization.

• Extracting learning from digitized images, voice, video, and other media types.

Predictive analytics on large data sets provides organizations with a key opportunity to improve a broad variety of business outcomes, and many have embraced Apache Hadoop as the platform of choice.

In the last few years, large businesses have adopted Apache Hadoop as a next-generation data platform, one capable of managing large data assets in a way that is flexible, scalable, and relatively low cost. However, to realize predictive benefits of big data, organizations must be able to develop or hire individuals with the requisite statistics skills, then provide them with a platform for analyzing massive data assets collected in Hadoop “data lakes.”

As users adopted Hadoop, many discovered that performance and complexity limited its use for broad predictive analytics. In response, the Hadoop community has focused on the Apache Spark platform to provide Hadoop with significant performance improvements. With Spark atop Hadoop, users can leverage Hadoop’s big-data management capabilities while achieving new performance levels by running analytics in Apache Spark.

What remains is a challenge—conquering the complexity of Hadoop when developing predictive analytics applications.

In this white paper, we’ll describe how Microsoft R Server helps data scientists, actuaries, risk analysts, quantitative analysts, product planners, and other R users to capture the benefits of Apache Spark on Hadoop by providing a straightforward platform that eliminates much of the complexity of using Spark and Hadoop to conduct analyses on large data assets.