Archive for November, 2011|Monthly archive page

Fun with @SignalR

Background

For the past several months I have noticed David Fowler, Damian Edwards, Phil Haack and a few other MS techies I follow on Twitter tweeting excitedly about a new framework called SignalR. I didn’t fully understand the excitement and, to be honest, didn’t try too hard to figure it out.

Yesterday Scott Hanselman published this article and re-distributed a link to an article he wrote at the end of August 2011. After going through both, I really wanted to take a closer look at SignalR, and while trying to get his ‘…monkey typing Shakespeare…’ code working I finally had the ‘Ah ha!’ moment.

I just had to come up with this quick and dirty sample to demonstrate a use for SignalR, and since I want the bragging rights for it, I am going to throw it up as is without bothering about the niceties of a NuGet package or sample.

UPDATE: I have deployed a working sample of this application at http://apps.apphb.com/funwithsignalr

The Idea

The first time I saw Google Wave’s technology demo, where they showed multiple people editing and reviewing a document at the same time, it blew me away. I was fascinated by it, but told myself I was too ‘javascriptically challenged’ to design such an async client-side framework.

When my tubelight** finally came on, SignalR was the obvious way to do it. Following is my first rough cut that as of now only syncs content. We’ll get to the fancy colored carets at a later point (hopefully in another article).

**In India one is referred to as a tubelight if something strikes late. Back in the days of non-electronic ballasts for fluorescent lighting, the tubes used to flicker for a few seconds before turning on.

Please go through Scott’s articles or SignalR documentation to understand SignalR better. I make no attempts to explain SignalR here (partly because I know precious little and partly it’s such a fantastic framework that you need very little up front to get going).

Head-first into SignalR

To build a scaffolding for SignalR framework you can use plain html with a backend server side SignalR Hub class as Scott demonstrated. But I chose to start off with an ASP.NET MVC 3 project. (Actually my first attempt at ASP.NET MVC 4 using VS 2011 Dev Preview didn’t quite go so well, so I rolled back to the stable releases of ASP.NET MVC 3 on Win 7 using VS 2010).

Setting up a support app as an ASP.NET MVC 3 web project

image

image

I called it ‘FunWithSignalR’ and set it up as an Intranet Application

Adding SignalR to your project using Package Manager Console

image

The Package Manager Console helps you download NuGet packages, among other things. If you don’t have it visible you can bring it up from the above menu (in VS 2010).

SignalR is available as a NuGet package, and all you have to do to include it is run the following command in the Package Manager Console:

Install-package SignalR

The above command will get all the dependencies you need for SignalR and, if your jQuery script files are not up to date, will get the latest jQuery libraries too (the SignalR client uses jQuery, hence the dependency check results in the upgrade). The output will look something like this:

image

Update jQuery dependencies in Views\Shared\_Layout.cshtml by pointing to the most recently updated jQuery version. As per above image the latest version for me today is jquery-1.6.4.min.js because that’s the one the package manager installed.

Set up the Code First EF Model

We’ll set up a very simple Code First model by adding one entity called BlogPost in our Models folder.

image

Add the DbContext for the BlogPost.

image

Build the solution at this point.

Scaffold the Controller and UI

To use the default scaffold tooling that comes with MVC 3, right click on the Controllers folder and select Add –> Controller.

image

Select the Template, Model and Data Context values as shown above and click Add.

At this point you will have the scaffolding necessary to Add/Edit/Delete BlogPosts.

Adding the ‘Review’ Action

Open the BlogPostController and copy and paste the GET and POST action methods for the Edit action. Rename the copies from Edit to Review as shown below.

image

Copy Edit.cshtml and paste it back into the same view folder. Rename the copy to Review.cshtml.

Update the Index.cshtml such that a ‘Review’ link comes up in the Index

image

Getting SignalR into the game

The Server Side

In your web project create a folder called SignalR. (You can call it anything you want; in a real-life project this is a server-side component and could very well reside in a DLL of its own.)

In the SignalR folder add a class, BlogHub.cs.

Add the following code to it


using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using SignalR.Hubs;

namespace FunWithSignalR.SignalR
{
    [HubName("blogHub")]
    public class BlogHub : Hub
    {
        /// <summary>
        /// The method called from SignalR client (JS)
        /// </summary>
        /// <param name="message">String Data</param>
        /// <param name="sessnId">Session ID: Used in this sample to track users</param>
        /// <param name="append">A boolean value used in the example indicating
        /// if incoming value should be appended or overwritten when sent back to
        /// other clients</param>
        public void Send(string message, string sessnId, bool append)
        {
            // Broadcast to every connected client by invoking its addMessage function
            Clients.addMessage(message, sessnId, append);
        }
    }
}
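If the broadcast idea is not obvious from the hub class, here is a rough in-memory sketch of what it does conceptually. This is not the SignalR API: createHub and createClient are names I made up for illustration, and there is no network involved. The one thing it shows is that a call to send is relayed to every connected client, including the sender, which is why the client script has to filter on session ID.

```javascript
// Hypothetical in-memory stand-in for a hub; NOT the SignalR API.
function createHub() {
  const clients = [];
  return {
    // Register a client object exposing addMessage(message, sessnId, append).
    connect(client) {
      clients.push(client);
    },
    // Mirrors the idea of BlogHub.Send: relay the call to every connected
    // client, including the one that sent it.
    send(message, sessnId, append) {
      clients.forEach((client) => client.addMessage(message, sessnId, append));
    },
  };
}

// A hypothetical client that simply records what it receives.
function createClient() {
  const received = [];
  return {
    received,
    addMessage(message, sessnId, append) {
      received.push({ message, sessnId, append });
    },
  };
}

const hub = createHub();
const alice = createClient();
const bob = createClient();
hub.connect(alice);
hub.connect(bob);
hub.send("h", "session-alice", true);
console.log(alice.received.length, bob.received.length); // prints "1 1"
```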

The Client Side

  • Open Review.cshtml and add the following script references

image

Note: json2.min.js is not packaged with the SignalR nuget. I have packaged it as a part of my code. I got it from Scott’s ‘Shakespeare demo’.

Note 2: /signalr/hubs is dynamically generated at runtime, so don’t worry if Visual Studio gives it a green squiggly right now. It works fine at runtime. Without this reference the SignalR client won’t work.

  • Change the Html helper for the Post property from Html.EditorFor(…) to Html.TextAreaFor(…).
  • Add a hidden field to store the session ID (this may be a security hole; I need to investigate best practices)


<input id="sessinId" type="hidden" value="@Session.SessionID" />

  • Drop the following script in

<script type="text/javascript">
$(function () {
    // Proxy created on the fly
    var blog = $.connection.blogHub;

    // Declare a function on the blog hub so the server can invoke it
    blog.addMessage = function (message, sessnId, append) {
        var sessId = $('#sessinId').val();

        // Ignore messages that originated from this session
        if (sessId != sessnId) {
            var post = $('#Post');
            if (append) {
                // A textarea's content is its value, so extend it with val() rather than append()
                post.val(post.val() + message);
            }
            else {
                post.val(message);
            }
        }
    };

    // Start the connection
    $.connection.hub.start();

    $("#Post").keyup(function (event) {
        var sessId = $("#sessinId").val();

        // keyCode 65-90 covers the letter keys A-Z
        if (event.keyCode >= 65 && event.keyCode <= 90) {
            if (event.shiftKey) {
                blog.send(String.fromCharCode(event.keyCode), sessId, true);
            }
            else {
                // keyCode is the upper-case character code; add 32 for lower case
                blog.send(String.fromCharCode(event.keyCode + 32), sessId, true);
            }
        }
        else {
            // Anything else (space, backspace, digits, ...): send the whole text to overwrite
            blog.send($("#Post").val(), sessId, false);
        }
    });
});
</script>

  • If you look closely, this is pretty similar to Scott’s 11 lines of code to get a chat client going.

Changes I have made are:

1. Send message to server on KeyUp event of the Post text area, instead of an explicit button push

2. Send more meta information: the current session ID, and whether the keystroke means the receiving client should append to its text or overwrite it.
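To make the append-versus-overwrite rule easier to reason about (and test without a browser), the two decisions can be pulled out into plain functions. This is only a sketch: keystrokeToMessage and applyRemoteMessage are hypothetical names, not part of the sample or of SignalR, and treating only key codes 65 to 90 (the letter keys) as appendable is my assumption.

```javascript
// Sender side: turn a keyup event into the arguments for blog.send(...).
// Assumption: only key codes 65-90 (letters A-Z) are sent as single
// characters; every other key triggers a full resync of the text.
function keystrokeToMessage(keyCode, shiftKey, fullText) {
  if (keyCode >= 65 && keyCode <= 90) {
    const upper = String.fromCharCode(keyCode);
    return { message: shiftKey ? upper : upper.toLowerCase(), append: true };
  }
  return { message: fullText, append: false };
}

// Receiver side: compute the new textarea content for an incoming message.
function applyRemoteMessage(currentText, localSessionId, message, senderSessionId, append) {
  if (senderSessionId === localSessionId) {
    return currentText; // our own keystroke echoed back by the hub
  }
  return append ? currentText + message : message;
}

// Session "A" types a capital H; session "B" applies it to an empty textarea.
const typed = keystrokeToMessage(72, true, "H");
console.log(applyRemoteMessage("", "B", typed.message, "A", typed.append)); // prints "H"
```

Keeping the logic free of jQuery means the same two functions can be called from the keyup handler and from addMessage.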

Let it Roll

Run the application and navigate to the BlogPost Index

image

Add a new Post

image

Click on Review

image

Press Ctrl+N to start a new browser instance with the same page. Hit F5 so the session id (the text field next to Save button) changes. Arrange the two browsers side by side.

Type in one Post text area and watch the other one change almost simultaneously

image

As Scott says, Kabooom! brain explodes…..

In Conclusion

To wrap up, we saw how mind-numbingly easy it is to use SignalR.

It’s a fantastic abstraction over the various techniques available for persistent connections over HTTP. Under the hood it can use WebSockets or long polling depending on what’s available.

You can get infinitely creative with it and build fantastic collaborative ASP.NET apps using a SignalR backend.

Some day (hopefully) in the near future, I will have overcome my JavaScript challenges and built a real collaborative editor with all the fancy bells and whistles of Google Docs.

Finally, David Fowler and team, a million thanks for such a fantastic framework. I am loving the ‘just works’ motto :-). And thanks to Scott Hanselman for providing the ‘Ah ha!’ moment, without which I probably would not have given SignalR a look yet.

The Code

The code can be downloaded from here (size 844 KB) (skydrive). It is provided AS IS, with no warranties expressed or implied. You are free to use it without any attribution (but I would love it if you do feed my ego ;-)).

I have now made the code available on bitbucket Mercurial repository. You can get it from here https://bitbucket.org/sumitmaitra/funwithsignalr

Update

The code in bitbucket now uses Google’s diff-match-patch implementation and is hence Apache licensed now.

Big Data and Introduction to Hadoop

Last weekend (October 29, 2011) I attended a training on Hadoop, arranged by my employer. It started on Friday afternoon and ended on Sunday evening, in 6 sessions of 4 hours each. In the end, all 20 attendees had their brains spilling out of their ears, but each one of us had a blast! It was a fabulous series!

The following (semi-technical) account is my 101-level take-away from some of the sessions.

What is Hadoop?

Cloudera, the leading vendor of Hadoop distributions, defines it as follows on their website:

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

There you have it, from the horse’s mouth!

A little bit of history: Hadoop’s development was sponsored by Yahoo! (yes, you read that right), with Doug Cutting being the principal architect of the project, and it is developed as an Apache project. Doug later left Yahoo! and joined Cloudera, which is now considered the ‘Red Hat’ of Hadoop distributions. If you really haven’t read on Wikipedia how Hadoop got its name, it got its name from Doug’s son’s toy elephant!

More trivia: the Hadoop project is based on Google’s papers on GFS and MapReduce, which Google uses internally. (HBase, in turn, is modeled on Google’s Bigtable.)

If it is just a file system + a technique how is it related to the cloud hoopla?

Well, when we say Hadoop in the context of a cloud we mean things on top of HDFS and MapReduce. Basically Hadoop is the entire ecosystem built on top of the ‘classic definition’. It consists of HBase as the database, Hive as the data warehouse and Pig as the query language, all built on top of HDFS and the MapReduce framework.

HDFS is designed from the ground up to scale seamlessly as you throw hardware at it. That’s its strength! Anyone designing server farms will agree that scaling horizontally is non-trivial in most cases. For HDFS, the problem of scale is solved simply by throwing more hardware at the farm. A lot of it is because actions on HDFS are asynchronous. But the concept of throwing hardware at a farm and getting scaling automatically is what endears Hadoop to cloud computing.

Okay, how is this different from SQL Server 2008 R2 running on top of Windows Server 2008 R2’s NTFS?

Ah ha! Now that’s a loaded question. Let’s try to go one by one (list below is pretty random with respect to importance or concepts, I am just doing a brain dump here)

1. Data is not stored in the traditional table-column format. At best some of the database layers mimic this, but deep in the bowels of HDFS there are no tables, no primary keys, no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized to recognize the <Key, Value> mode of storage. Everything maps down to <Key, Value> pairs.

2. HDFS supports forward-only parsing. So you are either reading ahead or appending to the end. There is no concept of ‘Update’ or ‘Insert’.

3. Databases built on HDFS don’t guarantee ACID properties, especially ‘Consistency’. They offer what is called ‘Eventual Consistency’, meaning data will be saved eventually, but because of the highly asynchronous nature of the file system you are not guaranteed at what point it will finish. So HDFS-based systems are NOT ideal for OLTP systems. RDBMSs still rock there.

4. Taking code to the data. In traditional systems you fire a query to get data and then write code on it to manipulate it. In MapReduce, you write code and send it to Hadoop’s data store and get back the manipulated data. Essentially you are sending code to the data.

5. Traditional databases like SQL Server scale better vertically, so more cores, more memory, faster cores is the way to scale. However Hadoop by design scales horizontally. Keep throwing hardware at it and it will scale.

I am beginning to get it, why is it said Hadoop deals with unstructured data? How do we store data actually?

Unstructured is a slight misnomer in the very basic sense. By unstructured, Hadoop implies it doesn’t know about column names, column data types, column sizes or even the number of columns. Also there is no implicit concept of a table. Data is stored in flat files! Flat files with some kind of delimiter that needs to be agreed upon by all users of the data store. So it could be comma delimited, pipe delimited or tab delimited. Line feed, as a rule of thumb, is always treated as the end of a record. So there is a method to the madness (‘unstructured-ness’) but there is no hard binding as employed by traditional databases. When you are dealing with data from Hadoop you are on your own with respect to data cleansing.

Data input in Hadoop is as simple as loading your data file into HDFS, and loading is very close to copying files in the usual sense on any OS.
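To make the ‘flat file plus agreed delimiter’ idea concrete, here is a tiny sketch in plain JavaScript. It has nothing to do with Hadoop’s actual APIs: parseRecords and the sample city|year|temperature data are made up for illustration. It only shows the convention of one record per line feed and fields split on a delimiter that every reader has to know in advance.

```javascript
// Split a flat file into records (one per line feed) and fields (by delimiter).
// Nothing in the file says what the columns mean; the reader has to know.
function parseRecords(text, delimiter) {
  return text
    .split("\n")
    .filter((line) => line.length > 0) // skip blank lines such as a trailing line feed
    .map((line) => line.split(delimiter));
}

// Hypothetical sample data: city|year|temperature
const sample = "Pune|2011|31\nDelhi|2011|38\nPune|2010|29\n";
const records = parseRecords(sample, "|");
console.log(records.length); // prints 3
console.log(records[1]);     // the second record: Delhi, 2011, 38
```

Note that every field comes back as a string; converting "38" to a number, or coping with a missing field, is the data cleansing you are left to do yourself.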

Okay, so there is no SQL, no Tables, no Columns, once I load my data how do I get it back?

In Short: Write code to do Map-Reduce.

Huh! Seriously? Map-Reduce… wha…?

Yes. You have to write code to get data from a Hadoop system. The abstractions on top of Hadoop are few, and all are sub-optimal. So the best way to get data is to write Java code that calls the MapReduce framework, which slices and dices the stored data for you on the fly.

The Map-Reduce framework works in two steps, (no points for guessing), step 1 is Map and step 2 is Reduce.

Mapping Data: If it is plain delimited text data, you have the freedom to pick your selection of keys and values from the record (remember, records are typically linefeed separated) and tell the framework what your key is and what values that key will hold. MR will deal with the actual creation of the map. While the map is being created you can control which keys to include and which values to filter out. In the end you end up with a giant hashtable of filtered key-value pairs. Now what?

Reducing Data: Once the map phase is complete code moves on to the reduce phase. The reduce phase works on mapped data and can potentially do all the aggregation and summation activities.

Finally you get a blob of the mapped and reduced data.
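Here is how the two phases hang together, as a toy single-process sketch in JavaScript. Real Hadoop jobs are written against the Java MapReduce API and run distributed across a cluster; this mapReduce function, the sample lines and the ‘max temperature per city’ job are all my own illustration of the idea, not Hadoop code.

```javascript
// Toy, single-process illustration of the map and reduce phases.
function mapReduce(records, mapFn, reduceFn) {
  // Map phase: each record may emit zero or more [key, value] pairs.
  const groups = new Map();
  for (const record of records) {
    for (const [key, value] of mapFn(record)) {
      // Collect all values emitted for the same key (the "giant hashtable").
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  // Reduce phase: collapse each key's list of values into one result.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

// Hypothetical records: city|year|temperature (the last one has no reading).
const lines = ["Pune|2011|31", "Delhi|2011|38", "Pune|2010|29", "Delhi|2010|"];
const maxTemps = mapReduce(
  lines,
  (line) => {
    const [city, , temp] = line.split("|");
    // Filter out records with a missing reading by emitting nothing.
    return temp === "" ? [] : [[city, Number(temp)]];
  },
  (city, temps) => Math.max(...temps)
);
console.log(maxTemps); // { Pune: 31, Delhi: 38 }
```

The map function picks the key (city) and value (temperature) and does the filtering; the reduce function does the aggregation, which is the split of responsibilities described above.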

But… but… Do I really have to write Java?

Well, if you are that scared of Java, then you have Pig. No, I am not calling names here. Pig is a querying engine that has more ‘business-friendly’ syntax but spits out MapReduce code in the backend and does all the dirty work for you. The syntax for Pig is called, of course, Pig Latin.

When you write queries in Pig Latin, Pig converts it into MapReduce and sends it off to Hadoop, then retrieves the results and hands it back to you.

Analysis shows you get about half the performance of raw, optimal, hand-written MapReduce Java code, but that hand-written code takes more than 10 times as long to write as the equivalent Pig query.

If you are in the mood for a start-up idea, generating optimal MapReduce code from Pig Latin is a topic to consider ;-)

For those in the .NET world, Pig Latin is very similar syntactically to LINQ.

Okay, my head is now spinning, where do Hive and HBase fit in?

Describing Hive and HBase requires full articles of their own. A very brief intro to them is as follows:

HBase

HBase is a key-value store that sits on top of HDFS. It is a NoSQL database.

It is a very thin veneer over raw HDFS, wherein it mandates that data is grouped in a table that has rows of data.

Each row can have multiple ‘Column Families’ and each ‘Column Family’ can contain multiple columns.

Each column name is the key and it has its corresponding column value.

So a column of data can be represented as

row[family][column] = value

Each row need not have the same number of columns. Think of each row as a horizontal linked list, that links to a column family and then each column family links to multiple columns as <Key, Value> pairs.

row1 -> family1 -> col A = val A
-> family2 -> col B = val B
and so on.
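The row[family][column] = value shape can be mimicked with nested maps. The sketch below is plain JavaScript and not the HBase client API; createTable, put, get and columns are names I picked for illustration. It only shows that each row carries its own set of columns within each family.

```javascript
// row -> column family -> column -> value, using nested Maps.
// Hypothetical helper; NOT the HBase client API.
function createTable() {
  const rows = new Map();
  return {
    put(rowKey, family, column, value) {
      if (!rows.has(rowKey)) rows.set(rowKey, new Map());
      const families = rows.get(rowKey);
      if (!families.has(family)) families.set(family, new Map());
      families.get(family).set(column, value);
    },
    get(rowKey, family, column) {
      const families = rows.get(rowKey);
      const cols = families && families.get(family);
      return cols ? cols.get(column) : undefined;
    },
    // List the column names present in one family of one row.
    columns(rowKey, family) {
      const families = rows.get(rowKey);
      const cols = families && families.get(family);
      return cols ? Array.from(cols.keys()) : [];
    },
  };
}

const table = createTable();
table.put("row1", "family1", "colA", "valA");
table.put("row1", "family2", "colB", "valB");
table.put("row2", "family1", "colC", "valC"); // row2 has different columns than row1
console.log(table.get("row1", "family2", "colB")); // prints "valB"
console.log(table.get("row2", "family1", "colA")); // prints undefined
```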

Hive

Hive is a little closer to traditional RDBMS systems. In fact it is a data warehousing system that sits on top of HDFS but maintains a meta layer that helps with data summation, ad-hoc queries and analysis of large data stores in HDFS.

Hive supports a high-level language called Hive Query Language that looks like SQL but is restricted in a few ways; for example, no updates or deletes are allowed. However, Hive has a concept of partitioning that can be used to update information, which essentially means re-writing a chunk of data whose granularity depends on the schema design.
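The ‘update by re-writing a partition’ idea can be pictured with a small in-memory sketch. This is plain JavaScript, not Hive or its query language: createPartitionedTable, the month-based partition keys and the sales rows are all assumptions of mine. The point is only that a single-row correction costs a rewrite of the whole partition that holds the row, so the partition granularity you choose in the schema decides how expensive an ‘update’ is.

```javascript
// Hypothetical partitioned table: partition key -> array of rows.
function createPartitionedTable() {
  const partitions = new Map();
  return {
    // Append-only load into a partition.
    load(partitionKey, rows) {
      const existing = partitions.get(partitionKey) || [];
      partitions.set(partitionKey, existing.concat(rows));
    },
    // "Update" by replacing the whole partition with a rewritten copy.
    overwritePartition(partitionKey, rewriteFn) {
      const existing = partitions.get(partitionKey) || [];
      partitions.set(partitionKey, existing.map(rewriteFn));
    },
    read(partitionKey) {
      return partitions.get(partitionKey) || [];
    },
  };
}

const sales = createPartitionedTable();
sales.load("2011-10", [{ id: 1, amount: 100 }, { id: 2, amount: 250 }]);
sales.load("2011-11", [{ id: 3, amount: 75 }]);

// Correct one row in October by rewriting the entire October partition.
sales.overwritePartition("2011-10", (row) =>
  row.id === 2 ? { ...row, amount: 200 } : row
);
console.log(sales.read("2011-10"));
```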

Hive can actually sit on top of HBase and perform join operations between HBase tables.

I heard Hadoop only requires ‘Commodity Hardware’ to run. Can I bunch together the 486 machines gathering dust in my garage and bring up a Hadoop cluster?

In short: NO!

When Google originally set out to build its search index, ‘powerful’ computers implied room-sized Cray supercomputers that cost a pretty penny and were available only to the CIA!

So commodity hardware implies ‘non-supercomputers’ that can be purchased by everybody. Today you can string together 10-12 high-end blade servers, each with about 24 GB of RAM, 12-24 TB of disk space and as many cores as you can get, to build an entry-level production-ready Hadoop cluster.

That said, you can run code samples on a VM that will run OK on a laptop with the latest Core processors and approx. 8 GB of RAM. But that’s only good for code samples! Even for PoCs, spinning up an EC2 cluster is the best way to go.


Okay, with that I conclude this article. In upcoming articles we’ll see installation as well as some real world use cases of big data on Hadoop!
