Thursday, September 3, 2009

How Many Bits Does it Take to Make Me?

After reading Bunnie's article, "On Influenza A (H1N1)", I came up with the idea for this post. You should definitely read his article, but here's a brief summary:

DNA is composed of pairs of molecules, and each pair can be one of four choices. In computer science lingo, each pair is represented by 2 bits. The H1N1 virus takes 26,022 bits to encode, or about 3.2 kbytes.

How much data is 3.2 kbytes? Not very much. A compiled Hello World program in C is around 12 kbytes! It takes nearly four times as much data to print "Hello World" to the screen as it does to kill a human! Here are some other interesting data points:

HIV: ~2.4 kilobytes
One page typed: ~2 kilobytes

The entire HIV genome takes up less space than the cell phone bill I pay each month!

Human DNA: ~730 megabytes
Windows XP Installation Disc: ~580 megabytes
Windows Vista Installation Disc: ~1.9 gigabytes

At some point between Windows XP and Windows Vista, operating systems became more complicated than people!

Fruit Fly: ~40 megabytes
Damn Small Linux: ~50 megabytes

Even the smallest of Linux distributions is put to shame by a fruit fly.

Rice: ~105 megabytes
Canine: ~600 megabytes
Puppy Linux: ~103 megabytes

Puppy Linux is actually much closer in size to the genome of a grain of rice than that of an actual dog. Perhaps they should rename the project?

I chose to measure the size of the installation media used for each of these operating systems because this represents the minimum size required to create a functioning system. I felt that this measure is most similar to that of DNA. While your DNA does not describe every single detail about you, it does contain the necessary information to create you.

Disclaimer: I got the numbers for all the different OS sizes and genome lengths just from googling. I am probably wrong on some of them, but unfortunately I forgot to keep track of where they all came from. If you see something wrong with my numbers, let me know. I converted number of base pairs to megabytes by typing the following into google: (# of base pairs) * 2 bits in megabytes
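The conversion above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are made up for this example, and since the post doesn't say whether the calculator treated a megabyte as 10^6 or 2^20 bytes, both are offered. The 3 billion figure in the usage lines is a round placeholder, not one of the numbers quoted above.

```python
def base_pairs_to_bytes(base_pairs, bits_per_pair=2):
    """Each base pair is one of four choices, so it needs 2 bits."""
    return base_pairs * bits_per_pair / 8


def to_megabytes(num_bytes, binary=False):
    """Convert bytes to megabytes (10**6 bytes, or 2**20 if binary=True)."""
    divisor = 2 ** 20 if binary else 10 ** 6
    return num_bytes / divisor


# Usage with a round, hypothetical genome length of 3 billion base pairs
genome_bytes = base_pairs_to_bytes(3_000_000_000)
print(to_megabytes(genome_bytes))               # decimal megabytes
print(to_megabytes(genome_bytes, binary=True))  # binary megabytes
```

The two conventions differ by about 5%, which is one more reason to treat the numbers in this post as rough.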

Saturday, August 15, 2009

StackOverflow Experiment Results

Thanks to many of the StackOverflow.com users for pointing me to the official data dump, available here, I was able to complete my experiment.

Using the number of questions asked with a specific tag, I measured the activity of various programming languages throughout the week. My hypothesis is: Newer dynamic languages like Ruby and Python will see a rise in questions asked on the weekend while more corporate languages like C# and Java will see a dropoff in activity on the weekend.

My theory is that programmers choose to use languages like Python and Ruby for their personal projects, despite their weaknesses, because these languages are more fun to program in. Since programmers tend to work on these projects at night and on the weekends, they will probably be asking questions related to their projects during these times.

Fortunately, the results supported my hypothesis. A plot (made using Python) of the relative number of questions asked per day of the week is shown below. The values were computed by calculating the percentage of questions asked for each topic relative to the total number of questions being asked. This controls for the overall drop in traffic to StackOverflow.com on the weekends.
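Here is a minimal sketch of that normalization, not the actual script. It assumes each question has already been reduced to a (date, set of tags) pair; the real data dump is laid out differently, and the name weekday_share is made up for this example.

```python
from collections import Counter
from datetime import date


def weekday_share(questions, tag):
    """Percentage of each weekday's questions that carry `tag`.

    `questions` is an iterable of (date, set_of_tags) pairs.
    Returns {weekday: percent}, with Monday = 0 and Sunday = 6.
    """
    totals = Counter()
    tagged = Counter()
    for asked_on, tags in questions:
        day = asked_on.weekday()
        totals[day] += 1
        if tag in tags:
            tagged[day] += 1
    # Dividing by the day's total removes the overall weekend traffic drop
    return {day: 100.0 * tagged[day] / totals[day] for day in sorted(totals)}


# Tiny made-up sample: two questions on a Monday, one on a Saturday
sample = [
    (date(2009, 8, 10), {"python"}),
    (date(2009, 8, 10), {"c#"}),
    (date(2009, 8, 15), {"python", "django"}),
]
print(weekday_share(sample, "python"))
```

Because every day is divided by its own total, a language can gain share on the weekend even if its raw question count falls.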

Python and Ruby both have a sharp rise on the weekend, while C# and Java both fall off. The fall of C# is quite a bit more pronounced than that of Java, but the effect is still clear. Another interesting note is that the two "workweek languages" both have a rise in activity on Mondays. Maybe programmers leave work Friday and continue to mull over problems at work during the weekend, then ask their questions early Monday morning.

Even though the relative activity of Python and Ruby rises on the weekend, it is important to note that C# still sees around three times as much activity. There are still more people using C# than Python on the weekend, just not as many as during the week.

I'm not too sure exactly what the implications of these results are. Let me know what you think.

Wednesday, July 29, 2009

StackOverflow Experiment

I've been an active member of StackOverflow.com for a little over a month now. I've gotten some great help from the community on a variety of problems, both work related and personal. This gave me an idea to run a little experiment, piggy-backing off of the SO community.

The goal of this experiment is to determine if certain computer languages are used primarily at work, while others are used primarily for freetime/personal projects. Basically, I want to validate the stereotype that languages like C++ and Java are used in corporate settings while programmers tend to choose languages like Python or Ruby for their own projects.

I am attempting to measure this based on the assumption that most work-related projects are worked on during the week from around 9AM-5PM, and that freetime projects are worked on during evenings and on weekends. I created a script that will assess the number of questions asked on a topic at various times throughout the day and week.

My hypothesis is that the number of questions asked per hour on languages like Ruby and Python will be higher on evenings and weekends than during the work week, while the number of questions asked per hour on C++ and Java type languages will be highest during the work week.

I am going to define the work-week as 6am to 8pm Eastern Time, Monday through Friday. I feel that this time range is broad enough to capture work times across the United States, but not so broad that it includes much evening time for the East Coast. I also realize that the total activity on SO can vary throughout the day, so I am measuring the total number of questions asked as well to act as a control.
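That definition of the work-week is easy to express in code. The sketch below is an illustration rather than the script itself. It assumes timestamps have already been converted to Eastern Time, and the name is_work_week is made up for this example.

```python
from datetime import datetime, time

WORK_START = time(6, 0)   # 6am Eastern
WORK_END = time(20, 0)    # 8pm Eastern


def is_work_week(eastern_time):
    """True if the timestamp falls Monday-Friday between 6am and 8pm.

    Assumes `eastern_time` is a datetime already expressed in Eastern Time.
    """
    is_weekday = eastern_time.weekday() < 5  # Monday = 0 ... Friday = 4
    return is_weekday and WORK_START <= eastern_time.time() < WORK_END


print(is_work_week(datetime(2009, 7, 29, 10, 0)))  # a Wednesday morning
print(is_work_week(datetime(2009, 8, 1, 10, 0)))   # a Saturday morning
```

Each question can then be counted in a "work" or "free time" bucket, with the same buckets applied to the total question count as the control.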

I also acknowledge that the number of questions asked on a topic is not the very best measure of a topic's activity. However, people do usually ask questions about what they're currently working on. I might post statistics for both questions asked and page views.

I'll have the results in a day or so, whenever I get the time to make the pretty charts.

Friday, July 17, 2009

Semantic Web Blues

At a talk I attended last week, Ralph Swick from the W3C described the current state of Semantic Web technology, and where the W3C would like to take it. The talk was great, but a couple of things Ralph said really stood out as problems in taking the Semantic Web forward.

While describing how the Semantic Web works, Ralph used the phrase "One man's metadata is another man's data." This really struck me. The metadata that we generate automatically while taking pictures on our cameras, saving documents in Word and reading emails can be incredibly valuable to the Semantic Web.

An image of a building is not that useful on its own, but when you add the name of the photographer, the time the image was taken and the exact Lat/Long coordinates of the camera, a computer might be able to figure out what the name of the building is. Standardized metadata like this is going to be key in making the Semantic Web useful.

Unfortunately, the culture of the web today doesn't recognize this. Metadata is considered useless. Companies like Google and Yahoo even recommend stripping it from images to decrease page loading times. In their view, the cost of moving a couple extra bits over the wire outweighs the context gained from knowing where an image was taken.

This culture of minimization on the web has to change before the Semantic Web can take off. Next time you start to strip the metadata from your files to save space, remember that one man's garbage is another man's treasure. The few bytes you're throwing away could be incredibly useful to someone else. With today's hard drive prices, keeping an extra 10 or 100 megabytes around isn't costing you very much.

Tuesday, July 14, 2009

G-Code Interpreter

As part of my 3D-Printer project, I'm writing a G-Code interpreter to run on an Arduino. My machine will support a fairly limited subset of G-Code, as well as a few extra commands specific to my machine.

Using the partial G-Code spec available from LinuxCNC, I developed a rudimentary parser. Here's the control flow of the parser:
  1. Split G-Code file into lines (blocks)
  2. Search each block for valid words in specific order: X, Y, Z, S, F, T, M, G
  3. Handle each word appropriately (set feed, set speed, move, etc.)

There are some notes here:

Location values must be maintained statically so code like this will run:
G01 Z1.0 F.05 X0.0
Speeds and feeds must be read before the G word is executed so code like this will run:
G01 Z1.0 F.05
Since my printer only has one toolhead, I changed the T word to represent temperature. I realize this is an awful and confusing decision, and will probably change it eventually.

Here's my (messy) code.
#include <stdio.h>
#include <string.h>

void Printer::parseBlock(char *string)
{
    // Location must persist between blocks
    static Location location = Location(0, 0, 0);

    // Control words: extrusion speed, feed rate and temperature
    int s = 0;
    float f = 0;  // float so feeds like F.05 parse (assumes F() accepts a float)
    int t = 0;

    // G mode is persistent
    static int g = 0;

    // Machine option is not persistent
    int m = 0;

    // Search the block for accepted words, then execute them.
    // Order is important, be careful when changing.
    char *p = string;

    // Position first
    if ((p = strchr(string, 'X')))
        sscanf(p, "X%f", &(location.iX));
    if ((p = strchr(string, 'Y')))
        sscanf(p, "Y%f", &(location.iY));
    if ((p = strchr(string, 'Z')))
        sscanf(p, "Z%f", &(location.iZ));

    // Speed, feed and temp: only act when a value was actually parsed
    if ((p = strchr(string, 'S')) && sscanf(p, "S%d", &s) == 1)
        S(s);
    if ((p = strchr(string, 'F')) && sscanf(p, "F%f", &f) == 1)
        F(f);
    if ((p = strchr(string, 'T')) && sscanf(p, "T%d", &t) == 1)
        T(t);

    // Operation codes (M)
    // Conflicts are resolved by order, AKA conflicts are not resolved
    p = string;
    while ((p = strchr(p, 'M')))
    {
        if (sscanf(p, "M%d", &m) == 1)
            M(m);
        p++;  // move past this word and look for the next one
    }

    // G codes
    // Conflicts are resolved in order, AKA not resolved
    p = string;
    while ((p = strchr(p, 'G')))
    {
        if (sscanf(p, "G%d", &g) == 1)
            G(g, location);
        p++;
    }
}

Thursday, July 9, 2009

Simmons LED Display Part 4: Django Web App


This is part four of my series on the Simmons LED Display. I'm going to describe the implementation of the web-based front end.

Here are some of the specs I wrote up for the web application:
  1. Users go to the Simmons LED website to enter messages.
  2. The Simmons LED website tells the user when their message will be displayed.
  3. The Simmons LED server displays the message at the given time for a fixed time (2 minutes).
  4. Only Simmons residents will be able to post messages.
  5. Messages will be filtered for profanity, obscenity and other inappropriate content.
It's a pretty simple idea. The only non-straightforward aspect is a layer of indirection in the form of a database. The database separates the display code from the web page code. Here's the flow of the program:
  1. A user visits the Simmons LED website.
  2. The user is allowed access if they are a Simmons resident.
  3. A message gets entered to the website.
  4. The website runs the message through the profanity filter.
  5. The website saves the message, with the current time, into the database.
  6. The website asks the database how many messages there are in line.
  7. The website tells the user when their message will be displayed.
  8. The display service repeatedly checks the database for new messages.
  9. The display service displays the oldest message for two minutes, then marks it as displayed.
The purpose of the database is to buffer the inputs, in case tons of users post messages all at once.
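The buffering idea can be sketched without Django at all. The example below is a simplified in-memory stand-in for the database, not the real models or views. The class name, method names and the way the display time is estimated are all assumptions made for illustration. In particular, it treats every waiting message as needing a full two-minute slot, so the estimate is approximate.

```python
from datetime import datetime, timedelta

DISPLAY_SLOT = timedelta(minutes=2)  # each message is shown for two minutes


class MessageQueue:
    """In-memory stand-in for the message table, oldest message first."""

    def __init__(self):
        self._messages = []

    def submit(self, text, now):
        """Store a message and return the estimated time it will be shown."""
        waiting = sum(1 for m in self._messages if not m["displayed"])
        self._messages.append(
            {"text": text, "submitted": now, "displayed": False}
        )
        return now + waiting * DISPLAY_SLOT

    def next_message(self):
        """Return the oldest undisplayed message and mark it as displayed.

        The real display service would mark it only after showing it.
        """
        for message in self._messages:
            if not message["displayed"]:
                message["displayed"] = True
                return message["text"]
        return None


queue = MessageQueue()
noon = datetime(2009, 7, 9, 12, 0)
print(queue.submit("Hello Simmons", noon))   # nothing waiting, shown right away
print(queue.submit("Dinner at 6", noon))     # one message ahead in line
print(queue.next_message())
```

The website only ever calls something like submit, and the display service only ever calls something like next_message, which is the separation the database provides.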

The Django code is shown below:

forms.py
models.py

views.py
Note: this code doesn't contain the profanity filter. I am planning on adding this as a validator in the message model.
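As a rough idea of what such a validator might look like, here is a plain-Python sketch. It is not the planned implementation: the word list is a harmless placeholder and the function name is made up. Django validators are callables that raise django.core.exceptions.ValidationError, so ValueError is used here only to keep the example free of dependencies.

```python
import re

# Hypothetical placeholder list; a real filter would need a proper word list
BLOCKED_WORDS = {"darn", "heck"}


def validate_clean(message):
    """Raise ValueError if the message contains a blocked word."""
    words = re.findall(r"[a-z']+", message.lower())
    found = sorted(set(words) & BLOCKED_WORDS)
    if found:
        raise ValueError("Message contains blocked words: " + ", ".join(found))


validate_clean("Welcome to Simmons!")  # passes silently
try:
    validate_clean("Well, DARN it")
except ValueError as error:
    print(error)
```

Matching whole words keeps the check simple, but it will miss creative spellings, so this is a starting point rather than a complete filter.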

The (very messy) html templates are coming soon.

Wednesday, July 8, 2009

3D Printing

Even though 3D printing seems like something out of a science fiction novel, the technology has been around for quite a few years now. Machines exist today which can take computer-generated models such as the mouse shown below (from Wikipedia), and turn them into real-life objects.
The quality of these printed objects is on par with most other manufacturing and prototyping processes and their ease of use is unparalleled. Companies like Dimension, Objet and Desktop Factory all make and sell plug and play 3D printers.

If this technology is so great and easy to use, why doesn't everyone have a 3D printer? The problem is cost. The cheapest printers made by Dimension and Desktop Factory cost around $5000! These companies are advertising these machines as breakthrough devices, and yet they cost 10% of the average US household income.

These companies need to take a step back if they want to put a 3D printer in every home in America. A (completely unscientific) survey I conducted recently on the Amazon Mechanical Turk suggested that people would jump at the opportunity to buy a desktop 3D printer in the price range of $200-$500, even if it meant lower print quality than the existing machines.

A device like this would be invaluable for home use. Coupled with an online object library such as Thingiverse and the rise of easy to use 3D CAD software like Google Sketchup, home repairs will be easier than ever. The low cost would pay for itself in fewer trips to the hardware store in no time. Need a new kitchen utensil? Break the battery cover off of your TV remote? No problem. Click print and you're all set.

This is the exact goal of the RepRap project: to put an inexpensive 3D printer in every home. This group has made tremendous progress in designing and building self-reproducing 3D printers: printers that can print other printers. This idea is pretty alluring; imagine building a device that can build a copy of itself. The growth would be exponential.

The only problem is that the technology to print parts like electronics and motors simply doesn't exist yet. This means that at best, the RepRap can currently print around 60% of its own parts. Unfortunately, a 3D printer that can print 60% of its parts is just as useful as one that can print 0% of its parts. Until this technology arrives, I feel the RepRap will be stuck in engineering labs and hobbyists' basements.

This is where a project I've been working on since last January comes in. Geoff Tsai and I have been working to design and build a low-cost printer ($200-$500), similar to the RepRap, that will be manufactured using traditional techniques rather than 3D printing. We built our first prototype in May and were successful in printing a few small test parts. I'm now working on improving our software and the reliability of our machine.

I hope to have a second prototype completed by the end of the fall. More updates will be coming!