Friday, April 27, 2012

Meet Variety, a Schema Analyzer for MongoDB


Variety is a lightweight tool which gives a feel for an application’s schema, as well as any schema outliers. It is particularly useful for


• quickly learning how data is structured, if inheriting a codebase with a production data dump


• finding all rare keys in a given collection


An Easy Example


We’ll make a collection, within the MongoDB shell:


db.users.insert({name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!"});

db.users.insert({name: "Dick", bio: "I swordfight."}); db.users.insert({name: "Harry", pets: "egret"});

db.users.insert({name: "Geneviève", bio: "Ça va?"}); END JAVASCRIPT

Let’s use Variety on this collection, and see what it can tell us:


$ mongo test --eval "var collection = 'users'" variety.js

The above is executed from the terminal. "test" is the database containing the collection we are analyzing.


Variety’s output:


{ "_id" : { "key" : "_id" }, "value" : { "types" : [ "object" ] }, "totalOccurrences" : 4, "percentContaining" : 100 }

{ "_id" : { "key" : "name" }, "value" : { "types" : [ "string" ] }, "totalOccurrences" : 4, "percentContaining" : 100 }

{ "_id" : { "key" : "bio" }, "value" : { "types" : [ "string" ] }, "totalOccurrences" : 3, "percentContaining" : 75 }

{ "_id" : { "key" : "pets" }, "value" : { "types" : [ "string", "array" ] }, "totalOccurrences" : 2, "percentContaining" : 50 }

{ "_id" : { "key" : "someWeirdLegacyKey" }, "value" : { "type" : "string" }, "totalOccurrences" : 1, "percentContaining" : 25 }

Every document in the “users” collection has a “name” and “_id”. Most, but not all, have a “bio”. Interestingly, it looks like “pets” can be either an array or a string. The application code really only expects arrays of pets. Have we discovered a bug, or a remnant of a previous schema? The first document created has a weird legacy key I’ve never seen before; the people who built the prototype didn’t clean up after themselves. These rare keys, whose contents are never used, have a strong potential to confuse developers, and could be removed once we verify our findings. For future use, results are also stored in a varietyResults database.
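That means you can reopen the analysis later from the shell without re-running Variety. A minimal sketch, assuming the results for the users collection land in a collection called usersKeys (the exact collection name may differ):

// hedged sketch: Variety writes its output to the varietyResults database;
// the usersKeys collection name below is an assumption
use varietyResults
db.usersKeys.find().sort({ totalOccurrences: -1 })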


Learn More!


Learn more about Variety now, including


• How to download Variety


• How to set a limit on the number of documents analyzed from a collection


• How to contribute, and report issues


Variety is free, open source, and written in 100% JavaScript. Check it out on Github.


-by James Cropcho


Source : http://blog.mongodb.org/post/21923016898/meet-variety-a-schema-analyzer-for-mongodb

MongoDB and Node.js at 10gen

With their strong roots in JavaScript, Node.js and MongoDB have always been a natural fit, and the Node.js community has embraced MongoDB with a number of open source projects. To support the community’s efforts, 10gen is happy to announce that the MongoDB Node.js driver will join the existing set of 12 officially supported drivers for MongoDB.

The Node.js driver was born out of necessity. Christian Kvalheim started using Node.js in early 2010. He had heard good things about MongoDB but was disappointed to discover that no native driver had yet been developed. So, he got to work. Over the past two years, Christian has done amazing work in his driver, and it has matured through the contributions of a large community and the rigors of production. For some time now, the driver has been on par with 10gen’s officially supported MongoDB drivers.  So we were naturally thrilled to welcome Christian full time at 10gen to continue his work on the Node.js driver.

If you’ve used MongoDB with Node.js, you’re probably familiar with Mongoose, a popular object-document mapper that’s used atop the Node.js driver. We’re really excited to announce that Aaron Heckmann, one of the authors of Mongoose, has also joined the 10gen team.

10gen is fully committed to making MongoDB a first class citizen in the Node.js ecosystem. We’ve partnered with Node.js advocates Joyent and Microsoft. We’ve seen fantastic companies built on Node.js and MongoDB like Learnboost, Ordr.in, and Trello. And with Aaron and Christian on board, we see a very promising future for developers wanting to use MongoDB with Node.js.

Curious about the driver and how it works? Check out the documentation and watch some sessions on MongoDB’s integration with Node.js.

Github
Node.js Language Center
MongoDB and Node.js Presentations

Interested in working on MongoDB and Node.js? Drop us a line. We continue to look for other great engineers to join us in this effort.

Thursday, April 26, 2012

NoSQL Architecture

The NoSQL movement continues to gain momentum as developers grow weary of traditional SQL-based database management and look for advancements in storage technology. A recent article provided a great roundup of some of the new technologies in this area, particularly focusing on the different approaches to replication and partitioning. There are excellent new technologies available, but using a NoSQL database is not just a straight substitute for a SQL server. NoSQL changes the rules in many ways, and using a NoSQL database is best accompanied by a corresponding change in application architecture.

The NoSQL database approach is characterized by a move away from the complexity of SQL-based servers. The logic of validation, access control, mapping queryable indexed data, correlating related data, conflict resolution, maintaining integrity constraints, and triggered procedures is moved out of the database layer. This enables NoSQL database engines to focus on exceptional performance and scalability. Of course, these fundamental data concerns of an application don’t go away, but rather must move to a programmatic layer. One of the key advantages of the NoSQL-driven architecture is that this logic can now be codified in our own familiar, powerful, flexible Turing-complete programming languages, rather than relying on the vast assortment of complex APIs and languages in a SQL server (data column definitions, queries, stored procedures, etc.).



In this article, we’ll explore the different aspects of data management and suggest an architecture that uses a data management tier on top of NoSQL databases, where this tier focuses on the concerns of handling and managing data like validation, relation correlation, and integrity maintenance. Further, I believe this architecture also suggests a more user-interface-focused lightweight version of the model-viewer-controller (MVC) for the next tier. I then want to demonstrate how the Persevere 2.0 framework is well suited to be a data management layer on top of NoSQL databases. Let’s look at the different aspects of databases and how NoSQL engines affect our handling of data and architecture.

Architecture with NoSQL


In order to understand how to properly architect applications with NoSQL databases you must understand the separation of concerns between data management and data storage. The past era of SQL-based databases attempted to satisfy both concerns within the database. This is very difficult, and inevitably applications would take on part of the task of data management, providing certain validation tasks and adding modeling logic. One of the key concepts of the NoSQL movement is to have DBs focus on the task of high-performance scalable data storage, and provide low-level access to a data management layer in a way that allows data management tasks to be conveniently written in the programming language of choice rather than having data management logic spread across Turing-complete application languages, SQL, and sometimes even DB-specific stored procedure languages.

Data Management Architecture

Complex Data Structures


One important capability that most NoSQL databases provide is hierarchical nested structures in data entities. Hierarchical data and data with list-type structures are easily described with JSON and other formats used by NoSQL databases, where multiple tables with relations would be necessary in traditional SQL databases to describe these data structures. Furthermore, JSON (or alternatives) provides a format that much more closely matches the data structures of common programming languages, greatly simplifying object mapping. The ability to easily store object-style structures without impedance mismatch is a big attraction of NoSQL.
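As a hedged illustration, a blog post with its tags and comments, which would need several related tables in a normalized SQL schema, fits in a single document (collection and field names here are hypothetical):

db.posts.insert({
    title: "Architecture with NoSQL",
    tags: ["nosql", "architecture"],          // a list, no join table required
    comments: [                               // nested child records
        { name: "Tom", text: "Nice overview." },
        { name: "Ana", text: "What about relations?" }
    ]
});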

Nested data structures work elegantly in situations where the children/substructures are always accessed from within a parent document. Object-oriented and RDF databases also work well with data structures that are uni-directional: one object is accessed from another, but not vice versa. However, if the data entities may need to be individually accessed and updated, or relations are bi-directional, real relations become necessary. For example, if we had a database of employees and employers, we could easily envision scenarios where we would start with an employee and want to find their employer, or start with an employer and find all their employees. It may also be desirable to individually update an employee or employer without having to worry about updating all the related entities.

In some situations, nested structures can eliminate unnecessary bi-directional relations and greatly simplify database design, but there are still critical parts of real applications where relations are essential.

Handling Relational Data


NoSQL-style databases have often been termed non-relational databases. This is an unfortunate term. These databases can certainly be used with data that has relations, which is actually extremely important. In fact, real data almost always has relations. Truly non-relational data management would be virtually worthless. Understanding how to deal with relations has not always been well addressed in NoSQL discussions and is perhaps one of the most important issues for real application development on top of NoSQL databases.

The handling of relations with traditional RDBMSs is very well understood. Table structures are defined by data normalization, and data is retrieved through SQL queries that often make extensive use of joins to leverage the relations of data to aggregate information from multiple normalized tables. The benefits of normalization are also clear. How then do we model relations and utilize them with NoSQL databases?

There are a couple of approaches. First, we can retain normalization strategies and avoid any duplication of data. Alternatively, we can choose to de-normalize data, which can improve query performance.

With normalized data we can preserve key invariants, making it easy to maintain consistent data, without having to worry about keeping duplicated data in sync. However, normalization can often push the burden of effort onto queries to aggregate information from multiple records and can often incur substantial performance costs. Substantial effort has been put into providing high-performance JOINs in RDBMSs to provide optimally efficient access to normalized data. However, in the NoSQL world, most DBs do not provide any ad-hoc JOIN type of query functionality. Consequently, performing a query that aggregates information across tables often requires application-level iteration, or creative use of map-reduce functions. Queries that utilize joining for filtering across different mutable records often cannot be properly addressed with map-reduce functions, and must use application-level iteration.
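To make that concrete with the earlier employee/employer example, an application-level “join” in the mongo shell might look like the following sketch (collection and field names are hypothetical):

// normalized documents: each employee references its employer by id
var employer = db.employers.findOne({ name: "Acme" });
db.employees.find({ employerId: employer._id }).forEach(function (employee) {
    // the lookup (or reuse of the already-fetched employer) happens once per
    // employee in application code, replacing what a SQL JOIN does in one query
    print(employee.name + " works for " + employer.name);
});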

NoSQL advocates might suggest that the lack of JOIN functionality is beneficial; it encourages de-normalization that provides much more efficient query-time data access. All aggregation happens for each (less frequent) write, thus allowing queries to avoid any O(n) aggregation operations. However, de-normalization can have serious consequences. De-normalization means that data is prone to inconsistencies. Generally, this means duplication of data; when that data is mutated, applications must rely on synchronization techniques to avoid having copies become inconsistent. This invariant can easily be violated by application code. While it is typically suitable for multiple applications to access database management servers, with de-normalized data, database access becomes fraught with invariants that must be carefully understood.
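A hedged sketch of that de-normalized alternative makes the trade-off visible: the employer’s name is copied into every employee document for fast reads, and each rename must then be propagated by hand or the copies drift apart (names are again hypothetical; the update call uses the old positional upsert/multi arguments):

// fast to read, but "Acme" now lives in many documents
db.employees.insert({ name: "Tom", employerName: "Acme" });
db.employees.insert({ name: "Ana", employerName: "Acme" });

// a rename must touch every copy, or the data becomes inconsistent
db.employers.update({ name: "Acme" }, { $set: { name: "Acme Corp" } });
db.employees.update({ employerName: "Acme" },
                     { $set: { employerName: "Acme Corp" } },
                     false /* upsert */, true /* multi */);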

These hazards do not negate the value of database de-normalization as an optimization and scalability technique. However, with such an approach, database access should be viewed as an internal aspect of implementation rather than a reusable API. The management of data consistency becomes an integral complement to the NoSQL storage as part of the whole database system.

The NoSQL approach is headed in the wrong direction if it is attempting to invalidate the historic pillars of data management, established by Edgar Codd. These basic rules for maintaining consistent data are timeless, but with the proper architecture a full NoSQL-based data management system does not need to contradict these ideas. Rather, it couples NoSQL data storage engines with database management logic, allowing these rules to be fulfilled in much more natural ways. In fact, Codd himself, the undisputed father of relational databases, was opposed to SQL. Most likely, he would find a properly architected database management application layer combined with a NoSQL storage engine to fit much closer to his ideals of a relational database than the traditional SQL database.

Network or In-process Programmatic Interaction?


With the vastly different approach of NoSQL servers, it is worth considering whether the traditional network-based, out-of-process interaction approach of SQL servers is truly optimal for NoSQL servers. Interestingly, both of the approaches to relational data point to the value of more direct in-process programmatic access to indexes rather than the traditional query-request-over-TCP style of communication. JOIN-style queries over normalized data are very doable with NoSQL databases, but they rely on iterating through data sets with lookups during each loop. These lookups can be very cheap at the index level, but can incur a lot of overhead at the TCP handling and query parsing level. Direct programmatic interaction with the database sidesteps the unnecessary overhead, allowing for reasonably fast ad-hoc relational queries. This does not hinder clustering or replication across multiple machines; the data management layer can be connected to the storage system on each box.

De-normalization approaches also work well with in-process programmatic access. Here the reasons are different. Now, access to the database should be funneled through a programmatic layer that handles all data synchronization needs to preserve invariants so that multiple higher level application modules can safely interact with the database (whether programmatically or a higher level TCP/IP based communication such as HTTP). With programmatic-only access, the data can be more safely protected from access that might violate integrity expectations.

Browser vendors have also come to similar conclusions about programmatic access to indexes rather than query-based access in the W3C process to define the browser-based database API. Earlier efforts to provide browser-based databases, spurred by Google Gears and later implemented in Safari, were SQL-based. But the obvious growing dissatisfaction with SQL among developers, and the impedance mismatches between RDBMS-style data structures and JavaScript-style data structures, have led the W3C, with a proposal from Oracle (and support from Mozilla and Microsoft), to orient towards a NoSQL-style indexed key-value document database API modeled after the Berkeley DB API.

Schemas/Validation


Most NoSQL databases could also be called schema-free databases, as this is often one of the most highly touted aspects of these types of databases. The key advantage of schema-free design is that it allows applications to quickly upgrade the structure of data without expensive table rewrites. It also allows for greater flexibility in storing heterogeneously structured data. But while applications may benefit greatly from freedom from storage schemas, this certainly does not eliminate the need to enforce data validity and integrity constraints.

Moving the validity/integrity enforcement to the data management layer has significant advantages. SQL databases had very limited, rigid schemas, whereas we have much more flexibility enforcing constraints with a programming language. We can enforce complex rules, mix strict type enforcement on certain properties, and leave other properties free to carry various types or be optional. Validation can even employ access to external systems to verify data. By moving validation out of the storage layer, we can centralize validation in our data management layer and have the freedom to create rich data structures and evolve our applications without storage-system-induced limitations.

ACID/BASE and Relaxing Consistency Constraints


One aspect of the NoSQL movement has been a move away from trying to maintain completely perfect consistency across distributed servers (everyone has the same view of data) due to the burden this places on databases, particularly in distributed systems. The now famous CAP theorem states that of consistency, availability, and partition tolerance, only two can be guaranteed at any time. Traditional relational databases have kept strict transactional semantics to preserve consistency, but many NoSQL databases are moving towards a more scalable architecture that relaxes consistency. Relaxing consistency is often called eventual consistency. This permits much more scalable distributed storage systems where writes can occur without using two-phase commits or system-wide locks.

However, relaxing consistency does lead to the possibility of conflicting writes. When multiple nodes can accept modifications without expensive lock coordination, concurrent writes can occur in conflict. Databases like CouchDB will put objects into a conflict state when this occurs. However, it is inevitably the responsibility of the application to deal with these conflicts. Again, our suggested data management layer is naturally the place for the conflict resolution logic.
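What that logic looks like is application-specific; as a minimal sketch, assuming the conflicting revisions of one logical document have already been fetched into an array, it might pick a winner and merge in any fields the winner is missing:

// hedged sketch: "revisions" is an array of conflicting versions of one document
function resolveConflict(revisions) {
    // naive policy: the revision with the newest application timestamp wins...
    var winner = revisions.sort(function (a, b) {
        return b.updatedAt - a.updatedAt;
    })[0];
    // ...but fields only present in losing revisions are merged in
    revisions.forEach(function (rev) {
        for (var key in rev) {
            if (winner[key] === undefined) winner[key] = rev[key];
        }
    });
    return winner;
}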

Data management can also be used to customize the consistency level. In general, one can implement more relaxed consistency-based replication systems on top of individual database storage systems based on stricter transactional semantics. Customized replication and consistency enforcements can be very useful for applications where some updates may require higher integrity and some may require the higher scalability of relaxed consistency.

Customizing replication can also be useful for determining exactly what constitutes a conflict. Multi-Version Concurrency Control (MVCC) style conflict resolution like that of CouchDB can be very naive. MVCC assumes the precondition for any update is the version number of the previous version of the document. This is not necessarily always the correct precondition, and many times unexpected inconsistent data may be due to updates that were based on other record/document states. Creating the proper update logs and correctly finding conflicts during synchronization can often involve application-level design decisions that a storage engine can’t make on its own.

Persevere


Persevere is a RESTful persistence framework; version 2.0 is designed for NoSQL databases while maintaining the architectural principles of the relational model, providing a solid complementary approach. Persevere’s persistence system, Perstore, uses a data store API that is directly modeled after the W3C’s NoSQL-inspired indexed database API. Combined with Persevere’s RESTful HTTP handler (called Pintura), data can be efficiently and scalably stored in NoSQL storage engines and accessed through a data modeling logic layer that allows users to access data in familiar RESTful terms with appropriate data views, while preserving data consistency. Persevere provides JSON Schema based validation, data modification, creation, and deletion handlers, notifications, and a facet-based security system.

Persevere’s evolutionary schema approach leans on the convenience of JSON schema to provide a powerful set of validation tools. In Persevere 2.0, we can define a data model:

var Model = require("model").Model;
// productStore provides the API for accessing the storage DB.
Model("Product", productStore, {
  properties: {
    name: String, // we can easily define type constraints
    price: { // we can create more sophisticated constraints
      type: "number",
      minimum: 0
    },
    productCode: {
      set: function(value){
        // or even programmatic validation
      }
    }
  },
  // note that we have not restricted additional properties from being added
  // we could restrict additional properties with:
  // additionalProperties: false
  // we can specify different action handlers. These are optional, they will
  // pass through to the store if not defined.
  query: function(query, options){
    // we can specify how queries should be handled and
    // delivered to the storage system
    productStore.query(query, options);
  },
  put: function(object, directives){
    // we could specify any access checks, or updates to other objects
    // that need to take place
    productStore.put(object, directives);
  }
});

The Persevere 2.0 series of articles provides more information about creating data models as well as using facets for controlling access to data.

Persevere’s Relational Links


Persevere also provides relation management. This is also based on the JSON Schema specification, and has a RESTful design based on the link relation mechanism that is used in HTML and Atom. JSON Schema link definitions provide a clean technique for defining bi-directional links in a way that gives link visibility to clients. Building on our product example, we could define a Manufacturer model, and link products to their manufacturers:

Model("Product", productStore, {   properties: {      ...      manufacturerId: String   },   links: [     {        rel: "manufacturer",        href: "Manufacturer/{manufacterId}"     }   ],...});Model("Manufacturer", manufacturerStore, {   properties: {      name: String,      id: String,      ...   },   links: [     {        rel: "products",        href: "Product/?manufacterId={id}"     }   ]});

With this definition we have explicitly defined how one can traverse back and forth (bi-directionally) between a product and a manufacturer, using a standard normalized foreign key (with no extra effort in synchronizing bi-directional references).

The New mVC


By implementing a logically complete data management system, we have effectively implemented the “model” of the MVC architecture. This actually allows the MVC layer to stay more focused and lightweight. Rather than having to handle data modeling and management concerns, the MVC can focus on the user interface (the viewer and controller) and a minimal model connector that interfaces with the full model implementation, the data management layer. I’d suggest this type of user interface layer be recapitalized as mVC to denote the de-emphasis on data modeling concerns in the user interface logic. This type of architecture facilitates multiple mVC UIs connecting to a single data management system, or, vice versa, a single mVC connecting to multiple data management systems, aggregating data for the user. This decoupling improves the opportunity for independent evolution of components.

Conclusion


The main conceptual ideas that I believe are key to the evolution of NoSQL-based architectures:


  • Database management needs to move to a two layer architecture, separating the concerns of data modeling and data storage. Persevere demonstrates this data modeling type of framework with a web-oriented RESTful design that complements the new breed of NoSQL servers.

  • With this two layered approach, the data storage server should be coupled to a particular data model manager that ensures consistency and integrity. All access must go through this data model manager to protect the invariants enforced by the managerial layer.

  • With a coupled management layer, storage servers are most efficiently accessed through a programmatic API, preferably keeping the storage system in-process to minimize communication overhead.

  • The W3C Indexed Database API fits this model well, with an API that astutely abstracts the storage (including indexing storage) concerns from the data modeling concerns. This is applicable on the server just as well as the client-side. Kudos to the W3C for an excellent NoSQL-style API.

Source : http://www.sitepen.com/blog/2010/05/11/nosql-architecture

10 things you should know about NoSQL databases




August 26, 2010, 8:26 AM PDT


Takeaway: The relational database model has prevailed for decades, but a new type of database — known as NoSQL — is gaining attention in the enterprise. Here’s an overview of its pros and cons.



For a quarter of a century, the relational database (RDBMS) has been the dominant model for database management. But, today, non-relational, “cloud,” or “NoSQL” databases are gaining mindshare as an alternative model for database management. In this article, we’ll look at the 10 key aspects of these non-relational NoSQL databases: the top five advantages and the top five challenges.


Five advantages of NoSQL


1: Elastic scaling


For years, database administrators have relied on scale up — buying bigger servers as database load increases — rather than scale out — distributing the database across multiple hosts as load increases. However, as transaction rates and availability requirements increase, and as databases move into the cloud or onto virtualized environments, the economic advantages of scaling out on commodity hardware become irresistible.

RDBMS might not scale out easily on commodity clusters, but the new breed of NoSQL databases are designed to expand transparently to take advantage of new nodes, and they’re usually designed with low-cost commodity hardware in mind.

2: Big data


Just as transaction rates have grown out of recognition over the last decade, the volumes of data that are being stored also have increased massively. O’Reilly has cleverly called this the “industrial revolution of data.” RDBMS capacity has been growing to match these increases, but as with transaction rates, the constraints of data volumes that can be practically managed by a single RDBMS are becoming intolerable for some enterprises. Today, the volumes of “big data” that can be handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the biggest RDBMS.

3: Goodbye DBAs (see you later?)


Despite the many manageability improvements claimed by RDBMS vendors over the years, high-end RDBMS systems can be maintained only with the assistance of expensive, highly trained DBAs. DBAs are intimately involved in the design, installation, and ongoing tuning of high-end RDBMS systems.

NoSQL databases are generally designed from the ground up to require less management:  automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements — in theory. In practice, it’s likely that rumors of the DBA’s death have been slightly exaggerated. Someone will always be accountable for the performance and availability of any mission-critical data store.

4: Economics


NoSQL databases typically use clusters of cheap commodity servers to manage the exploding data and transaction volumes, while RDBMS tends to rely on expensive proprietary servers and storage systems. The result is that the cost per gigabyte or transaction/second for NoSQL can be many times less than the cost for RDBMS, allowing you to store and process more data at a much lower price point.

5: Flexible data models


Change management is a big headache for large production RDBMS. Even minor changes to the data model of an RDBMS have to be carefully managed and may necessitate downtime or reduced service levels.

NoSQL databases have far more relaxed — or even nonexistent — data model restrictions. NoSQL Key Value stores and document databases allow the application to store virtually any structure it wants in a data element. Even the more rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new columns to be created without too much fuss.
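In a document database such as MongoDB, for example, “changing the schema” is usually nothing more than writing documents in the new shape; a quick, purely illustrative shell sketch:

// old and new document shapes coexist in the same collection
db.orders.insert({ item: "book", qty: 2 });
db.orders.insert({ item: "book", qty: 1, giftWrap: true, coupon: "SPRING" });

// older application code simply never asks for the fields it does not know about
db.orders.find({ giftWrap: true });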

The result is that application changes and database schema changes do not have to be managed as one complicated change unit. In theory, this will allow applications to iterate faster, though, clearly, there can be undesirable side effects if the application fails to manage data integrity.

Five challenges of NoSQL


The promise of the NoSQL database has generated a lot of enthusiasm, but there are many obstacles to overcome before they can appeal to mainstream enterprises. Here are a few of the top challenges.

1: Maturity


RDBMS systems have been around for a long time. NoSQL advocates will argue that their advancing age is a sign of their obsolescence, but for most CIOs, the maturity of the RDBMS is reassuring. For the most part, RDBMS systems are stable and richly functional. In comparison, most NoSQL alternatives are in pre-production versions with many key features yet to be implemented.

Living on the technological leading edge is an exciting prospect for many developers, but enterprises should approach it with extreme caution.

2: Support


Enterprises want the reassurance that if a key system fails, they will be able to get timely and competent support. All RDBMS vendors go to great lengths to provide a high level of enterprise support.

In contrast, most NoSQL systems are open source projects, and although there are usually one or more firms offering support for each NoSQL database, these companies often are small start-ups without the global reach, support resources, or credibility of an Oracle, Microsoft, or IBM.

3: Analytics and business intelligence


NoSQL databases have evolved to meet the scaling demands of modern Web 2.0 applications. Consequently, most of their feature set is oriented toward the demands of these applications. However, data in an application has value to the business that goes beyond the insert-read-update-delete cycle of a typical Web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies.

NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL.

Some relief is provided by the emergence of solutions such as HIVE or PIG, which can provide easier access to data held in Hadoop clusters and perhaps eventually, other NoSQL databases. Quest Software has developed a product — Toad for Cloud Databases — that can provide ad-hoc query capabilities to a variety of NoSQL databases.

4: Administration


The design goals for NoSQL may be to provide a zero-admin solution, but the current reality falls well short of that goal. NoSQL today requires a lot of skill to install and a lot of effort to maintain.

5: Expertise


There are literally millions of developers throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in a learning mode. This situation will resolve itself naturally over time, but for now, it’s far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.

Conclusion


NoSQL databases are becoming an increasingly important part of the database landscape, and when used appropriately, can offer real benefits. However, enterprises should proceed with caution with full awareness of the legitimate limitations and issues that are associated with these databases.



About the author


Guy Harrison is the director of research and development at Quest Software. A recognized database expert with more than 20 years of experience in application and database administration, performance tuning, and software development, Guy is the author of several books and many articles on database technologies and a regular speaker at technical conferences.

Great article. I think 2012 is going to be the prime time for NoSQL.
In a random email from Dice posting about a video game website looking for a noSQL person. The same posting had the most obscure technologies ever. They are a web company that is looking for a Python / NoSQL expert. Good luck with that one...
If you want to see a NoSQL-based system that is mature, stable and highly usable, have a look at EPIC or the V.A.'s VistA. Both are medical software systems based on the MUMPS hierarchical database that has been in use since the late '80s. As pointed out by others, this schema-free system can be used to model just about any kind of database you want - including relational - with huge scalability of data content and user number. It scores on the first nine points above, and an effort is now underway to address number 10 by training a new cadre of MUMPS programmers.



Martin Mendelson, MD, PhD
Source : http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772

Introducing MySQL to MongoDB Replication

The last article on this blog described our planned MySQL to MongoDB replication hackathon at the recent Open DB Camp in Sardinia. Well, it worked, and the code is now checked into the Tungsten Replicator project. This article describes exactly what we did to write the code and set up replication. You can view it as a kind of cookbook both for implementing new database types in Tungsten as well as setting up replication to MongoDB.




The Team



MySQL to MongoDB replication was a group effort with three people: Flavio Percoco, Stephane Giron, and me. Flavio has worked on MongoDB for a couple of years and is extremely well-informed both on database setup as well as application design. Stephane Giron is a replication engineer at Continuent and has done a substantial amount of the work on data extraction from MySQL, especially row replication. I work on the core execution framework as well as performance.




Getting Started with MongoDB



There were a couple of talks on MongoDB during the first morning of Open DB camp (Saturday May 7th), which Stephane and I dutifully attended to raise our consciousness.  We got cracking on implementation around 2pm that afternoon.   The first step was to bring up MongoDB 1.8.1 and study its habits with help from Flavio.



MongoDB is definitely easy to set up. You get binary builds from the MongoDB download page. Here is a minimal set of commands to unpack MongoDB 1.8.1 and start the mongod using directory data to hold tables.




$ tar -xvzf mongodb-osx-x86_64-1.8.1.tgz
$ cd mongodb-osx-x86_64-1.8.1
$ mkdir data
$ bin/mongod --dbpath data
(... messages ...)




You connect to mongod using the mongo client.  Here's an example of connecting and creating a table with a single row.




$ bin/mongo localhost:27017
MongoDB shell version: 1.8.1
connecting to: localhost:27017/test
> use mydb
switched to db mydb
> db.test.insert({"test": "test value", "anumber" : 5 })                
> db.test.find()
{ "_id" : ObjectId("4dce9a4f3d6e186ffccdd4bb"), "test" : "test value", "anumber" : 5 }
> exit




This is schema-less programming in action. You just insert BSON documents (BSON = Binary JSON) into collections, which is Mongolese for tables. MongoDB creates the collection for you as soon as you put something in it. The automatic materialization is quite addictive once you get used to it, which takes about 5 minutes.



The MongoDB client language is really handy. It is based on JavaScript. There are what seem to be some non-JavaScript commands like "show dbs" to show databases or "show collections" to list the tables. Everything else is object-oriented and easy to understand. For example, to find all the records in collection test, as we saw above, you just connect to the database and issue a command on the local db object. Collections appear as properties of db, and operations on the collection are methods.
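For example, filtering and updating follow the same object-and-method pattern; a small, illustrative continuation of the earlier shell session:

> db.test.find({ anumber: { $gt: 3 } })
{ "_id" : ObjectId("4dce9a4f3d6e186ffccdd4bb"), "test" : "test value", "anumber" : 5 }
> db.test.update({ test: "test value" }, { $set: { anumber: 6 } })
> db.test.count()
1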



It helps that the MongoDB folks provide very accessible documentation, for example a SQL to MongoDB translation chart. I put together a little practice program using the MongoDB Java driver to insert, referring to the Javadoc for the class library when in doubt about API calls. There are also a couple of very helpful examples, like this one, included with the driver.



All told, setup and orientation took us about 45 minutes.  It helped enormously that Flavio is a MongoDB expert, which minimized flail considerably.




Implementing Basic Replication from MySQL to MongoDB



After setup we proceeded to implement replication.  Here is an overview of the replicator pipeline to move data from MySQL to MongoDB.  Pipelines are message processing flows within the replicator.








Direct pipelines move data from one DBMS to another within a single replicator. They are already a standard part of Tungsten Replicator and most of the code shown above already exists, except for the parts shown in red. Before we started, we therefore needed to set up a replicator with a direct pipeline.



We first built the code according to the instructions on the Tungsten project wiki, uploaded the binary to our test host, and configured the replicator. First, we ran the Tungsten configure script to set defaults for the MySQL server (user name, extract method, etc.). Next we ran the configure-service command to set up the direct pipeline configuration file. Both commands together look like the following:



./configure
./configure-service -C --role=direct mongodb


The second command created a file called tungsten-replicator/conf/static-mongodb.properties with all the information about the direct pipeline implementation but of course nothing yet about MongoDB.



Now we could start the implementation.   To move data to MongoDB, we needed two new components:



  1. A Tungsten RawApplier to apply row updates to MongoDB.  RawApplier is the basic interface you implement to create an applier to a database.  

  2. A Tungsten Filter to stick column names on row updates after extracting from MySQL.  MySQL row replication does not do this automatically, which makes it difficult to construct JSON at the other end because you do not have the right property names. 



To get started on the applier I implemented a very simple class named MongoApplier that could take an insert from MySQL, turn it into a BSON document, and add it to an equivalently named database and collection in MongoDB. I added this to the replicator code tree, then built and uploaded tungsten-replicator.jar. (Use 'ant dist' in the replicator directory to build the JAR.)
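The applier itself is Java, but the heart of the transformation is simple enough to sketch in shell JavaScript: pair each column name with its value and insert the resulting document (the helper below is purely illustrative and not part of the Tungsten API):

// illustrative only: turn one row event into a document, given column metadata
function rowToDocument(columnNames, columnValues) {
    var doc = {};
    for (var i = 0; i < columnNames.length; i++) {
        doc[columnNames[i]] = columnValues[i];
    }
    return doc;
}

// e.g. an INSERT into foo(id, data) becomes:
db.foo.insert(rowToDocument(["id", "data"], ["25", "data value"]));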



To start using the new MongoDB applier, we needed to edit the service properties file to use this component instead of the standard MySQL applier that the configuration adds by default. To do this, you can open up static-mongodb.properties with your favorite editor. Add the following properties at the bottom of the APPLIERS section.



# MongoDB applier.  You must specify a connection string for the server.
# This currently supports only a single server.
replicator.applier.mongodb=com.continuent.tungsten.replicator.applier.MongoApplier
replicator.applier.mongodb.connectString=localhost:27017




Next, you need to fix up the direct pipeline so that the last stage uses the new applier.  We located the direct pipeline definition (around line 208 in the properties file) and set the applier to mongodb as shown in the following example.



# Write from parallel queue to database.
replicator.stage.d-pq-to-dbms=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.d-pq-to-dbms.extractor=parallel-q-extractor
replicator.stage.d-pq-to-dbms.applier=mongodb
replicator.stage.d-pq-to-dbms.filters=mysqlsessions
replicator.stage.d-pq-to-dbms.taskCount=${replicator.global.apply.channels}
replicator.stage.d-pq-to-dbms.blockCommitRowCount=${replicator.global.buffer.size}



We then started the replicator using 'replicator start.'  At that point we could do the following on MySQL:



mysql> create table foo(id int primary key, msg varchar(35));
Query OK, 0 rows affected (0.05 sec)
mysql> insert into foo values(1, 'hello from MySQL!');
Query OK, 1 row affected (0.00 sec)




...And within a second we could see the following over in MongoDB:



> show collections
foo
system.indexes
> db.foo.find();
{ "_id" : ObjectId("4dc55e45ad90a25b9b57909d"), "1" : "1", "2" : "hello from MySQL!" }








This kind of progress was very encouraging. It took roughly 2 hours to move the first inserts across. Compared to replicating to a new SQL database like Oracle, that's lightning fast. However, there were still no property names because we were not adding column names to row updates.



Meanwhile, Stephane had finished the column name filter (ColumnNameFilter) and checked it in. I rebuilt and refreshed the replicator code, then edited static-mongodb.properties as follows to add the filter. First put in the filter definition in the FILTERS section:



# Column name filter.  Adds column name metadata to row updates.  This is
# required for MySQL row replication if you have logic that requires column
# names.
replicator.filter.colnames=com.continuent.tungsten.replicator.filter.ColumnNameFilter


Next, make the first stage of the direct pipeline use the filter:



# Extract from binlog into queue.
replicator.stage.d-binlog-to-q=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.d-binlog-to-q.extractor=mysql
replicator.stage.d-binlog-to-q.filters=colnames,pkey
replicator.stage.d-binlog-to-q.applier=queue


We then restarted the replicator.  Thereupon, we started to see inserts like the following, complete with property names:




> db.foo.find()
{ "_id" : ObjectId("4dc77bacad9092bd1aef046d"), "id" : "25", "data" : "data value" }


That was better, much better!   To this point we had put in exactly 2 hours and 45 minutes wall clock time.  It was enough to prove the point and more than enough for a demo the next day.   The hackathon was a rousing success.







Further Development



Over the next couple of days I rounded out the MongoApplier to add support for UPDATE and DELETE operations, as well as to implement restart.  The full implementation is now checked in on code.google.com, so you can repeat our experiences by downloading code and building yourself or by grabbing one of the Tungsten nightly builds.



Restart is an interesting topic. Tungsten uses a table to store the sequence number of the last transaction it applied. We do this by creating an equivalent collection in MongoDB, which is updated after each commit. There is a problem in that MongoDB does not have transactions. Each update is effectively auto-commit, much like the MyISAM table type in MySQL. This means that while Tungsten can restart properly after a clean shutdown, slave replication is not crash safe. Lack of atomic transactions is a bigger issue with MongoDB and other NoSQL databases that goes far beyond replication. For now, this is just how Tungsten's MongoDB support works.
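As a hedged illustration of that bookkeeping (the collection and field names below are assumptions, not necessarily what Tungsten actually uses): after each commit the applier upserts the last applied sequence number, and on startup it reads that document back to know where to resume.

// after applying a transaction (names are illustrative)
db.trep_commit_seqno.update({ task_id: 0 },
                            { $set: { seqno: 4231, epoch: 0 } },
                            true /* upsert */);

// on restart, resume from the stored position
db.trep_commit_seqno.findOne({ task_id: 0 });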



Speaking of things that don't work, the current implementation is a prototype only.  We have not tested it with more than a few data types.  It only works with a single MongoDB daemon.  It does not set keys properly or specify indexes on tables.  There are no guarantees about performance, except to say that if you had more than a small amount of data it would be quite slow.  (OK, that's a guarantee after all.)




Epilog



Overall, the hackathon was a great success, not to mention lots of fun. It went especially well because we had a relatively small problem and three people (Stephane, Flavio, and Robert) with complementary skills that we could combine easily for a quick solution. That seems to be a recipe for succeeding on future hackathons.



From a technical point of view, it helped that MongoDB is schema-less.  Unlike SQL databases, just adding a document materializes the table in MongoDB.  This made our applier implementation almost trivially easy, because processing row updates takes only a few dozen lines of Java code in total.  It also explains why a lot of people are quite attached to the NoSQL programming model.



I am looking forward to learning a lot more about MongoDB and other NoSQL databases.  It would take two or three weeks of work to get our prototype to work with real applications.  Also, it looks as if we can implement replication going from MongoDB to MySQL.  According to Flavio there is a way to search the transaction log of MongoDB as a regular collection.  By appropriately transforming BSON objects back to SQL tuples, we can offer replication back to MySQL.
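For the curious, a 1.8-era master/slave setup exposes that log as the local.oplog.$main collection (replica sets use local.oplog.rs), so a quick look at recent operations from the shell is roughly:

> use local
switched to db local
> db.getCollection("oplog.$main").find().sort({ $natural: -1 }).limit(3)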



There are many other lessons about MongoDB and NoSQL in general but it seems best to leave them for a future article when I have more experience and actually know what I'm talking about.  Meanwhile, you are welcome to try out our newest Tungsten replication feature.

Source : http://scale-out-blog.blogspot.com/2011/05/introducing-mysql-to-mongodb.html

Monday, April 23, 2012

How to Write Title Tags for SEO

You can tell which search engine optimization specialists are afraid to trust their content: they always force feed keywords into their title tags, or they put some sort of number into their titles. You can easily find many SEO tutorials that tell you how to write title tags for search engine optimization. The formula is pretty simple:


  1. Do some keyword research

  2. Write a title using the keywords

  3. Go back, Jack, and Do It Again …

With 500 SEO experts telling you how to write title tags for SEO you should reasonably expect all 500 of them to rank 1st for the query “How to write title tags for SEO”.

Let’s hope Google buries THIS article on — I dunno — PAGE TWELVE or something like that. I mean, Google, this article should NOT be ranked in the top 100 results for “How to write title tags for SEO”. That would not be fair.

Real search engine optimization doesn’t care about a SINGLE keyword. If you write a page of content it should be ranking 1st for 5, 10, 15, maybe 100 SEO expressions. If the best you can do with an article is to get on the 1st page of SERPs for 1-3 expressions your SEO sucks. Get another job.

The Search Optimization Community Is Too Title-Obsessed


Does putting keywords in your page titles help? Absolutely. We all know I deliberately used “how to write title tags for SEO” in my title because there are actually people searching on that query (although I suspect it’s mostly SEOs who want to see how well they rank for the expression, especially since they’re not sure how many of their links are helping now).

We tend to be very mechanical in our SEO advice. There are multiple articles on SEO Theory going back AT LEAST to 2007 where I advise people to put their keywords in the page titles. So if you can read this schlock advice on SEO Theory, it must be pretty damn good schlock advice, right?

It’s not schlock advice — but neither is it a very advanced SEO technique. If you’re sitting there bored to tears because all your boss wants you to do is put his favorite keywords in your page titles, you can slip one past him by optimizing those same pages for other keywords AT THE SAME TIME.

Page titles and links are two cornerstones of really bad SEO advice. They leave nothing to the imagination.

How to Change Your SEO Page Title Philosophy


If you’re an SEO blogger who has been writing “How To …” and “5 Reasons/Ways” articles today is the day you should stop walking on that treadmill and try something different. Here’s one suggestion for how you can go for a walk in the garden and enjoy beautiful rays of sunshine.

For the next 10 articles you write on your SEO blog, use titles that start out with “What I Think About …” and then put some topic in there. Don’t do anything stupid like go check your keyword tools to see which variations on “What I think About” people are searching for.

If you cannot resist temptation then start your titles with “Here Is My Opinion On …” — you MUST use all four words.

Better yet, try starting a page title with “If I May Intrude On Your Thoughts For a Moment …”. Do this 10 times in a row for your next 10 SEO blog posts. Whatever it is you want to say has to come after the expression “If I May Intrude On Your Thoughts For a Moment …”.

Do NOT even comment on THIS blog post until you have written at least ONE such blog post. I mean it. I will delete your comment if you have not published an article on your SEO blog that starts out with “If I May Intrude On Your Thoughts For a Moment …”. You don’t deserve to share your opinion on this article if you’re too chicken to use a title like that.

Why You Should Stop Optimizing Page Titles


Journalists are struggling to find their true identity in the SEO-driven world of Google News and Bing News. Their organizations are being told by high-falutin’ SEO gurus that they have to inject keywords into every article title. Why? Because people use keywords to search for content.

What those SEO experts neglect to mention, however, is that people get their ideas for those keywords from some published source. By my best guestimate about 1/4 to 1/2 of the search traffic to all my blogs is driven by the article titles I write. Some of those article titles are keyword-enrichened SEO wonders like “20 Hard Core SEO Tips”. Some of the titles are actually based on oddball queries people use to find content on this site.

Here is an example: “anchor text for kindergarten writing”. That phrase surfaced this barely relevant page and enticed someone to spend a fair amount of time poking around SEO Theory. What a great title for an SEO article: “Anchor Text for Kindergarten Writing”. I have no idea of what it could possibly mean but writing such an article would be fun.

If you want to play in the big leagues of SEO you have to stop CHASING keywords and start MAKING them. That is the special power of journalists. If they publish articles with memorable headlines, or which say especially witty things, people will search for THOSE ARTICLES. If you want to own the rankings then just build the query spaces for yourself. I’ve been doing this for years.

I would much rather see this article rank 1st for “Anchor Text for Kindergarten Writing” than “How to Write Title Tags for SEO”. Anyone can compete for a “How to …” page title but to create value in a truly unique title no one else thinks is worth using — THAT is powerful SEO. It’s more powerful than all the link anchor text in the world.

How to Use Page Titles in SEO


Nothing confuses an SEO more than seeing a listing in a SERP that “doesn’t make sense” or which “violates the rule”. You’ve all still got your dog-eared copies of “Mechanical SEO Copywriting 101”. Turn to page 58 where it says, “A well-optimized page uses keywords in the title, meta description, and page URL”. Okay, now that we know this universal rule is carven in stone and that Google’s stock price is tied to making it always be true, let’s throw the book out.

Screw using page titles for SEO. Or, rather, screw using keyword research to decide what your page titles should be. Keyword research should not be making our search optimization decisions. It should be your search optimization decision to do keyword research TODAY but not necessarily TOMORROW. Keyword research is OPTIONAL for the truly advanced practitioner of search engine optimization.

Your page title is a blank canvas. You can use it to create art or you can paint by numbers there. It really doesn’t matter to me. If you’re too afraid to NOT use keywords in your page title you’re not ready for advanced SEO.

If you’re too afraid to publish an article that doesn’t have a snowball’s chance in Hell of getting to the top of the SERPs for a competitive expression, you don’t deserve any real search referral traffic. All you’ve earned with that attitude is a very small fraction of the traffic that is available to you.

Your page title should set the reader’s expectation. It should create the framework for the value you are offering to the reader. Anyone can write a dumb article titled “10 Great SEO Blogs You Should Be Reading Today”. All they have to do is grab their bookmarks and flesh them out with 1-2 sentences of drool-laden fanboy copy.

But if you want to prove your search optimization Kung Fu doesn’t suck you need to write “My Favorite SEO Blogs Have Never Used Keywords In Article Titles Because They Glow in the Dark” and each paragraph you write about a blog has to sing itself off the page and into the SERPs for a year. That is, every paragraph you write needs to bring in visitors today, tomorrow, and the next 365 days. THAT is true search engine optimization.

There’s no keyword research involved. You don’t build links. You can put numbers in your page URLs.

Fear Undermines Your SEO More Than Competition


It’s your lack of faith in your ability to create something interesting and valuable that holds you back. Here are a few examples of powerful faith-in-self SEO blogs. You should be able to figure out why just by looking at these queries.


I could go on — perhaps I should — but if you don’t see the point by now then you need to go back and reread your SEO 101 book. You’re not ready for Sophomoric SEO, let alone the Advanced Stuff.

But seriously — my point is that you create value in the page title. It’s not the other way around. Sure, there are times when you fall back on a mechanical approach. Maybe you’re tired. Maybe you’re not feeling well. Maybe you feel like people are not getting what they should be getting when they come to your blog for some crazy expression. There are GOOD reasons to chase the keywords, but none of them include “Because that is what an SEO is supposed to do.”

What Is An SEO Supposed To Do With A Page Title?


If your client is on the phone demanding to know why they’re not ranking number 1 for “How to Write Title Tags for SEO” even though you put the keywords into the page title and URL, you need a backup plan. Notice how I say that quite often? SEO is more about backup plans than links or title tags because, frankly, shit happens and you need to know how to deal with it. Page Titles are not magic bullets. Nor are links. Look at all the grief and anguish link-dependent SEOs put themselves (and their clients) through every time Google burps. It’s stupid.

Give the client more revenue through the search results and he generally shuts up. Sure, he wants that vanity query but if you show him the money he may just fall in love with a new query. So what if there was no money in it before you came along? That’s no reason to waste your time and resources trying to please a client who hired you to do what he supposedly doesn’t know how to do.

As long as you let the client call the shots on how to optimize for search you’re setting yourself up for failure. If the client really knows so much about SEO he doesn’t need you nearly as much as you need him. What kind of value are you bringing to a business relationship like that?

So here is what you need to do with page titles: forget they even exist. Heck, just let the search engines figure out what they want to put there. Use the titles to say something important and relevant but assume no one will see them. What would you want to say if you knew you were the only person on Earth who would hear it?

The reason why this is important is that if you say something ELSE that is worthwhile, people will go back and check that title so they can remember it. And then again, if people WANT to remember your article title, which will be easier: a title that 100 other schmucks are trying to rank for or a title that YOU own, in which YOU created the first and most value?

If you have to think about that you’re not ready for this article. That’s okay. Lots of people are not ready for advanced search engine optimization. That is, after all, why they are searching with queries like “How to Write Title Tags for SEO”.

Bless their SEO 101 hearts.



Michael Martinez was previously the Director of Search Strategies for Visible Technologies, Inc. A former moderator at SEO forums such as JimWorld and Spider-food, Michael has been active in search engine optimization since 1998 and Web site design and promotion since 1996. Michael was a regular contributor to Suite101 (1998-2003) and SEOmoz (2006).






Source : http://www.seo-theory.com/2012/04/20/did-you-really-expect-keywords-here/

Thứ Sáu, 20 tháng 4, 2012

0 Big Cloud and Big Data at Structure San Francisco

GigaOM presents the 5th annual Structure conference, June 20-21 in San Francisco, bringing together the leaders innovating, shaping and defining the ongoing cloud computing evolution, including a number of our technology partners: Appfog, Joyent, and Amazon.

The shift to the cloud isn’t just a change in the way IT is organized and delivered; it is also the impetus for a new way of building chips, servers, networking and the other building blocks of the computing industry. The cloud allows us to think and do things never before possible. Find out how the cloud is affecting the way you work and do business around the world.

Featured speakers include:
Lucas Carlson, CEO, Appfog
Tate Cantrell, CTO, Verne Global
Jason Hoffman, Founder and CTO, Joyent
Solomon Hykes, Co-Founder and CEO, dotCloud
Luke Kanies, CEO, Puppet Labs
Joshua McKenty, CEO, Piston Cloud Computing
Manoj Saxena, GM, IBM Watson Solutions
Werner Vogels, CTO and VP, Amazon
Joe Weinman, SVP, Telx

New speakers are being added weekly, and you can see the full list here.

Over two days, GigaOM’s workshops and in-depth discussions will give you access to the people and information you need to learn about the technologies powering the rise of cloud computing, internet infrastructure and big data applications.


We hope you can make it to the event in San Francisco, June 20-21, as a follow-up to MongoSF on May 4th!


Register here for MongoSF


Register here for 15% off Structure 2012

Thứ Năm, 19 tháng 4, 2012

0 Cache Reheating - Not to be Ignored

Cache Reheating - Not to be Ignored



   



An important aspect to keep in mind with databases is the cost of cache reheating after a server restart. Consider the following diagram, which shows several cache servers (e.g., memcached) in front of a database server.

[Diagram: a tier of cache servers sitting in front of a single database server]

This sort of setup is common and can work quite well when appropriate; it removes read load from the database and allows more RAM to be utilized for scaling (when the database doesn’t scale horizontally). But what happens if all the cache servers restart at the same time, say, after a power glitch in a data center?
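
A minimal sketch of the cache-aside pattern those cache servers imply (plain Node.js-style JavaScript; the in-memory object stands in for memcached, and loadUserFromDb() is a hypothetical database read — both are assumptions, not part of the original post):

var cache = {};               // stand-in for memcached
var TTL_MS = 60 * 1000;       // keep entries for one minute

function getUser(userId, loadUserFromDb, callback) {
  var hit = cache[userId];
  if (hit && hit.expires > Date.now()) {
    return callback(null, hit.value);             // served from cache, no database read
  }
  loadUserFromDb(userId, function (err, value) {  // cache miss falls through to the database
    if (err) return callback(err);
    cache[userId] = { value: value, expires: Date.now() + TTL_MS };
    callback(null, value);
  });
}

After a restart the cache object is empty, so every request falls through to the database until the cache is warm again; that fall-through is the reheating cost the diagram is about.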

Thứ Tư, 18 tháng 4, 2012

0 Don’t Become a One-trick Architect

We are near the dawn of a new workload: BigData. Some people say that “it is always darkest just before the dawn”. I beg to differ: I think it is darkest just before it goes pitch black. Have a cup of wakeup coffee, get your eyes adjusted to the new light, and to flying blind a bit, because the next couple of years are going to be really interesting.

In this post, I will be sharing my views on where we have been and a bit about where we are heading in the enterprise architecture space. I will say in advance that my opinions on BigData are just crystallizing, and it is likely that I will be adjusting them and changing my mind.

Yet, I think it will be useful to go back in history and try to learn from it. Bear with me, as I reflect on the past and walk down memory lane.

The Dawn of Computing: Mainframes


If you were to look at the single most successful computing platform of ancient times (that would be before 1990), you are stuck with the choice between the Apple II, the C64 or the IBM S/360 mainframe. The first two are consumer devices, and I may get back to those in another blog post. Today, let us look at the heavy lifting server platforms, since we are after all going to talk about data.

Under the guidance of Fred Brooks, IBM created one of the most durable, highly available and best performing computing platforms in the history of mankind. Today, mainframes are challenged by custom-built supercomputers, x86/x64 scale-up platforms and massive scale-out systems (ex: HADOOP). But even now, the good old S/360 still holds on to a very unique value proposition. No, it is not the fact that some of these machines almost never need to reboot. It is not the prophetic beauty of JCL (a job scheduler that “gets” parallelism) or the intricacies of COBOL or PL/I…

In fact, it is not the mainframe itself that gives it an edge, it is the idea of MIPS: Paying for compute cycles!

When you pay for compute cycles, every CPU instruction you use counts. The desire to write better code goes from being a question of aesthetics, to a question about making business sense (i.e. money). Good programmers, who can reduce MIPS count, can easily visualize their value proposition to business people, and justify extraordinary consulting fees.

As we shall see, it took the rest of us a long time to realize this.

The Early 90ies: Cataclysm


My programming career really took off in the late 80ies and early 90ies – before I got bitten by the database bug. I used to write C/C++, LISP, Perl, Assembler (various), ML and even a bit of Visual Basic (sorry!) back then. Pardon my nostalgia, but in those “old days” it was expected that you were fluent in more than one programming language.

There were some common themes back then.

First of all, we took pride in killing mainframes. We saw the old green/black terminals as an early, and failed, attempt at computing – dinosaurs that had to be killed off by the cataclysm of cheap, x86 compute power (or maybe RISC processors, though I never got around to using them). We embraced new programming languages and the IDE with open arms. We thought we succeeded: IBM entered a major crisis in the 80ies for the first time in their long and proud history. However, it could be argued that it was IBM as a company that failed, not the mainframe. There are still a lot of mainframes alive today, some of them have not been turned off since the 70ies and they run a large part of what we like to call civilisation. As a computer scientist, you have to tip your hat to that.

Another theme was a general sense of quick money. IBM had a lot of fat on their organization, and all that cost had to go somewhere: It ended up as MIPS charges to the customer. This made mainframes so expensive that it was easy to compete. It was the era of the “shallow port”. ERP systems running on ISAM style “databases” were ported 1-1 to relational databases on “decentral platforms” – aka: affordable machines. Back then, this was much cheaper than running on the mainframes and it required relatively little coding effort.

The result of shallow ports was code strewn with cursors. I suspect that a lot of our hatred towards cursors is from that time. People would do incredibly silly things with cursors, because that was the easy way to port mainframe code. Oracle supported this shallow port paradigm by improving the speed of cursors and introducing the notion of sequences and Row ID, which allows even the database ignoramus to get decent, or should we say: “does not suck completely”, performance out of a shallow port. If you hook up a profiling tool to SAP or Axapta today, you can still see the leftovers from those times.

Late 90ies: All Abstractions are Leaky


Just as proper relational data models were beginning to take off, and we began realizing the benefits of cost based optimizers and set based operations, something happened that would throw us into a dark decade of computer science. As with all such things, it started well intentioned: a brilliant movement of programmers began to worry about encapsulation and maintainability of large code bases. Object oriented (OO) programming truly took off; it was older than that, but it now had critical mass.

The KoolAid everyone drank was that garbage collection, high level (OO) languages, IDE support and platform independence (P code/Byte Code) would massively boost productivity and that we never had to worry about low level code, storage and hardware again. To some extent, this was true: we could now train armies of cheap programmers and throw them at the problem (in that ironic way history plays tricks on us, the slang term was: the Mongolian Horde Technique). We also had less need to worry about ugly memory allocations and pointers – in the garbage collector we trusted. Everything was a class, or an object, and the compilers better damn well keep up with that idea (they didn’t). The champion of them all, JAVA, suffered the most under this delusion. C/C++ programmers became an endangered species. Visual Basic programmers became almost well respected. And, to please people who would give up pointers and malloc, but not curly brackets, a brilliant Dane invented C#.

At my university, Danes, proud of our heritage, even invented our own languages supporting this new world (Mjølner BETA, if anyone is still interested). Everyone was high on OO. A natural consequence of this was that relational databases had to keep up. If I could not squeeze a massive object graph into a relational database, that was a problem with the database, not with OO. Relational databases were not really taught at universities; it was bourgeois.

This was the dawn of object databases and the formalization of the 3-tier architecture. Central to this architecture was the notion that scale-out was mostly about two things:

  1. Scaling out the business logic (ex: COM+, Entity Beans)
  2. Scaling out the rendering logic (ex: HTML, Java Applets)

We still hadn’t figured out how to fully scale databases, though most database vendors were working on it, and there were expensive implementations that already had viable solutions (DB2 and Tandem for example). What we called “Scale-out” in the 3-tier stack was functional scale, not data scale: Divide up the app into separate parts and run those parts on different machines.

I suspect we scaled out business logic because we believed (I did!) that this was where the heavy lifting had to be done. There was also a sense that component reuse was a good thing, and that highly abstract implementations of business logic (libraries of classes) facilitated this. Of course, we did not see the dark side: that taking a dependency on a generic component tightly coupled you to the release cycles of that component. And thus were born “DLL-hell” and an undergrowth of JAVA protocols for distributed communication (ex: CORBA, JMS, RMI).

Moving business logic to components outside the database also created a demand for distributed transactions and frameworks for handling it (ex: DTC). People started arguing that code like this was the Right Way™ of doing things:




Begin Distributed Transaction
  MyAccount = DataAccessLayer.ReadAccount()
  if withdrawAmount <= MyAccount.Balance then
    MyAccount.Balance = MyAccount.Balance - withdrawAmount
    MyTransaction = DataAccessLayer.CreateTran()
    MyTransaction.Debit = withdrawAmount
    MyTransaction.Target = MyAccount
    MyTransaction.Credit = withdrawAmount
    MyTransaction.Source = OtherAccount
    MyTransaction.Commit
  else
    Rollback Distributed Transaction
    Throw "You do not have enough money!"
  end
Commit Distributed Transaction


You get the idea… Predictably, this led to chatty interfaces. Network administrators started worrying about latency, people were buying dark fibers like there was no tomorrow. Database administrators were told to fix the problem and tune the database, which was mostly seen as a poorly performing file system. Looking back, it was a miracle that noSQL wasn’t invented back then.

Since this was the dawn of the Internet, scale out rendering made a lot of sense. It was all about moving fat HTML code over stateless protocols. It was not unreasonable to assume that you needed web farms to handle this rendering, so we naturally scaled out this part of the application. Much later, someone would invent AJAX and improve browsers to a point where this would become less of a concern.

We were high on compute power and coding OO like crazy. But like the Titanic and the final end of the Victorian technology bubble, we never saw what was about to hit us.

Y2K: Mediocre, But Ambitious


The new millennium had dawned (in 2000 or 2001, depending on how you count) and people generally survived. The mainframes didn’t break down, life went on. But a lot of programmers found themselves out of work.

In this light, it is interesting to consider that programmers were considered the most expensive part of running successful software back then. JAVA didn’t live up to its promise of cross platform delivery – it became: “Write once, debug everywhere” and people hated the plug-ins for the browser. While productivity gains from OO were clearly delivered, I personally think that IntelliSense was the most significant productivity advance between 1995 and 2005 (it takes work away from typing, so I can spend it on thinking).

Professional managers, as the good glory hunters they are, quickly sniffed out the money to be made in computers during the tech bubble. As these things go, they invented a new term: “IT” that they could use to convince themselves that once you name something, you understand it. It was argued that we needed to “build computer systems like we build bridges”, but the actions we took looked more like “building computer systems like we butcher animals”. The capitalist conveyor belt metaphor, made so popular by Henry Ford and “enriched” by the Jack Welch'ish horror regime of the 10-80-10 curve, eventually led to the predictable result: Outsourcing.

Make no mistake, I am a great fan of outsourcing. I truly believe in the world-wide, utilitarian view of giving away our technology, knowledge, money and work to poor countries. It is a great way to equalize opportunities in the world and help nations grow out of poverty – fast. In fact, I think we need to start outsourcing more to Africa now. Outsourcing is the ultimate global socialist principle.

The problem with outsourcing isn’t that we gave work away to unqualified people in Asia and Russia – because we didn’t – those people quickly became great, and some of the best minds in computer science were created there. Chances are that you are reading this blog on software written by Indians.

The problem with outsourcing was that it led to a large portion of Westerners artificially inflating the demand for Feng Shui decorators, lifestyle coaches, Pilates instructors, postmodern “writers”, public servants and middle level management. These days, Europe is waking up to this problem.

But then again, if we send the jobs out of the country, all the unemployed have to do something with their time. Perhaps it is no coincidence that the computer game industry soon outgrew the movie industry. You can after all waste a lot more time playing Skyrim (amazing by the way) than watching the latest Mission Impossible, and you don’t have to support Tom Cruise’s delusions in the process.

Sorry, I ramble, somebody stop me. Yet, the fact is that a lot of companies wasted an enormous amount of money sending work outside the country, dealing with cultural differences and building very poor software. One of the results (again a big equalizer, this time of life expectancy) was that numerous people must have died from malpractice as the doctors were busy arguing about the latest way to AVOID building a centralized, and properly implemented, Electronic Patient Journal/Record system. As far as I know, that battle is still with us.

The Rise of the East: Nothing interesting to say about XML


Once you have latched on to the idea of a fully componentized code base, it makes sense to standardize on the protocols used for interoperability. Just when we thought we couldn’t waste any more compute cycles – someone came up with SOAP and XML.

This led to a new generation of architects arguing that we need Service Oriented Architectures (SOA). There are very few of those today (both the architects and the systems), but the idea still rings true for the customers who have adopted it. And it IS very true: a lot of things become easier if you can agree on XML as an interchange format. We also saw the rise of database agnostic data access layers (DAL). Everyone was really afraid to bind themselves to a monopoly provider.

I don’t think it is a coincidence that open standards and the rise of Asia/Eastern Europe/Russia coincided with *nix becoming very popular. The critical mass of programmers, as well as the interoperability offered by new and open standards, made it viable to build new businesses and software. The East has a cultural advantage of teaching mathematics in school at a level unknown to most Westerners – I suspect it will be the end of our Western culture if we don’t adapt. Good riddance.

And the problem was again the over-interpretation of a new idea. Just because XML is a great interchange format does not mean it is a good idea to store data in that format. We saw the rise (and fall) of XML databases and a massive amount of programmer effort wasted on turning things that work perfectly well (SQL, for example) into XML-based, slower versions of the same thing (XQuery, for example). But something was true about this, and it lingered in our minds that SQL didn’t address everything; there WAS a gap: unstructured data…

All the while, we were still high on compute power. Racking blade machines with SAN storage like there was no tomorrow. The term “server sprawl” was invented – but no one who wrote code cared. We continued to believe all that compute power was really cheap. Moore’s Law just kept on giving.

The end of the Free Lunch: Multi Core


But around 2005, the free lunch came to an end. CPU clock speed stopped increasing. Throwing hardware at bad code was no longer an option. In fact, throwing hardware at bad code made it scale WORSE.

Those of us who had flirted with scale-out in the 90ies, and failed miserably, got flashbacks. Predictably, we acted like grumpy old men: angry and bitter that we didn’t get the blonde while she was still young and hot. Oracle became hugely successful with RAC, and the idea of good data modeling that actually has to run on hardware came back on the table, snatched from the hands of the OO fans while they were distracted by UML. People started arguing that perhaps, some of that business logic we all loved scaling out (functional scale-out, mind you), DID belong in the database after all. Thus came SQLCLR and JAVA support in Oracle and DB2. Opportunistic companies started publishing data models and selling them for good money, it was believed that if ONLY we could do 3NF modeling of data, all would be well. Inmon followers were high on data models!

People demanded automatic scale, and everyone in the know knew that this was not going to work without a fair amount of reengineering. But of course, we continued to pay homage to the idea – people don’t like their illusions broken.

While everyone was busy wrapping their head around scale-out, the infrastructure guys were beginning to show signs of panic. Big banks were complaining (and still are) that their server rooms were overflowing with machines. Their network switches were not keeping up anymore and the SAN guys went into denial: “You may measure a problem in Windows, but that is Windows that can’t do I/O; I see no problem in the SAN”. All our XML, OO, lazy code and CPU greed had started to take their toll on the environments. Somebody started talking about “Green Computing” – damn hippies!

Flashback: Kimball’s Revenge


While Teradata was busy selling “magic scale out” engines that had made great progress in this space, Kimball followers were spraying out Data Marts and delivering value in less time than it took the “EDW people” to do a vendor evaluation. The Kimball front got themselves a nice weapon of mass destruction with the Westmere and Nehalem CPU cores: cheap, powerful CPU power that took up very little space in the server room and didn’t need any special network components. It was all in-process on the same mainboard. Itanium (IA-64) went the way of the Alpha chip, the final realization that most code out there truly sucks: high clock speed beat elegant architectures.

Finally, a single machine with lots of compute power, no magic scale-out tricks, could solve a major subset of all warehouse problems (and interestingly most of the classic OLTP problems too). Vendors started digging out old 1970s storage formats that had a great fit with Kimball. Column stores and bitmap indexing got a revival. We saw the rise of Netezza, Red Brick, Analysis Services, TM1, DB2 OLAP Views and ESSBase. This trend continues today, for example with Vertica and the Microsoft VertiPaq engines. There is even a great website dedicated to it: The BI Verdict.

Data kept growing though, and today we are seeing an interest in MPP, Kimball-style warehouses promising truly massive scale. Scale-out and scale-up combined in beautiful harmony.

But of course, social media beat us all to it. No matter how much data we tried to cram into the warehouse, the world could spam it faster than we could consume it.

BigData: More Pictures of Cats than you can Consume


Our tour of history is nearly at an end – bear with me a little longer. Today, the old dream of handling unstructured data is becoming reality – but perhaps a nightmare for some.

With BigData and HADOOP, we have a new architecture that for the first time makes some promises we can start believing:

  • Storage is TRULY cheap and massive (But you have to live with SATA DAS, not SAN)
  • Unstructured data, and queries on it, can be run in decent time (Not optimally fast, but good enough)
  • Semi automated scale is doable (but only if you know how to write Map/Reduce or use expression trees – a minimal sketch follows at the end of this section)

This means that we can now drink from the barrage of data coming from the Internet. Of course, human nature being what it is, a lot of this data is pure noise. In fact, the signal-to-noise ratio on the Internet is scary – I can think of no better word for it.

Among all the noise of pictures of cats, tweets and re-tweets about toilet visits, and the latest news about Britney Spears’ haircut, there is something we treasure: knowledge about human nature. The problem that BigData helps us address is that we don’t always know how to find the needle in the haystack. If we build BigData installations, we can keep the data around, unafraid of losing it or burning all our money, while we dig for that insight we so desperately seek about our customers and the world around us.

I have written elsewhere about the great fit between BigData and Data Warehouses. But I would like to reinforce my point here: Once you know the structure of the data you are looking for, by all means model it and put it in a warehouse for fast consumption. If you don’t know what you are looking for yet (or if you can wait for it), put it in BigData.

Just be careful, there is a price associated with choosing temporary ignorance over knowledge seeking.
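
As promised above, here is a minimal word-count style sketch of the Map/Reduce shape, written in the MongoDB shell (the tweets collection and its tags array are hypothetical examples, not from the original post; a HADOOP job has the same map/emit/reduce structure, just written in Java):

var mapFn = function () {
  // emit one (tag, 1) pair per tag occurrence
  this.tags.forEach(function (tag) { emit(tag, 1); });
};
var reduceFn = function (key, values) {
  return Array.sum(values);   // total occurrences per tag
};
db.tweets.mapReduce(mapFn, reduceFn, { out: "tag_counts" });
// results land in the tag_counts collection, e.g. { "_id" : "bigdata", "value" : 42 }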

The Cloud: Grow or Pay!


Remember those infrastructure people who complained about server rooms overflowing? Do I hear your growls of anger? As architect/programmer/DBA, I think we have often felt the frustration of dealing with infrastructure departments (in my case, the SAN guys take the prize). “Why can’t those people just rack the damn server and be done with it – instead of making me wait 3 months?” we ask ourselves as we again bang our fists on the table in frustration.

But the fact of the matter is that those server rooms ARE overflowing and overheating all over the world. Handling all the complicated logistics of racking and patching servers is expensive, and that is a direct consequence of our greed for compute power. Those infrastructure people are just trying to survive the onslaught of the server sprawl – give them a break. They respond by virtualizing (SAN and virtual machines) and they fight back when we ask for hardware that is non-standard or “not certified”.

Perhaps it is time we look at ourselves in the mirror and ask if we really need to waste all those CPU cycles on overly complex architectures that protect us from ourselves and our inability to properly analyse what we plan to do with data. Every time we rack a new server, power consumption goes up in the server room and it gets just a little warmer in there. We are incurring a long term cost in power and cooling by writing poor code and designing overly flexible architectures. At the end of the day, a conditional branch in the code (an “if statement”), metadata and dynamically typed, interpreted code all cost CPU, and if we want to be flexible, we often need a lot of that.

If there is one thing we have learned from mainframes, it is that code lives nearly forever. Especially when it is so bad that no one dares to touch it, or so complex that no one understands it.

If it makes your day, think of this as a green initiative: you are saving the planet by writing better code and saving power. But of course, such tree hugger mentality gets us nowhere in the modern business world, so how about this instead: with the dawn (overcast?) of the cloud, compute cycles will once again be billed directly to you: the coder and the DBA!

MIPS are back in a new form, and it again makes good business sense to write efficient code. In fact, it always made good sense (even in the 90ies), but all those costs you were incurring from writing bad code or forcing data into storage formats that are unnatural (e.g. XML) have so far been hidden in your corporate balance sheet. With the cloud, that cost is about to uncloak itself like a Klingon battleship – and this might leave you in trouble!

Summary: What History Taught Us


If you are still with me here, thank you for putting up with my tantrums. It is time to wrap this up. What has history taught us about architectures and how we go about storing data?

I like to think of computing as evolutionary, not revolutionary. There are very few major breakthroughs in computer science (perhaps with the aforementioned exception of IntelliSense) that don’t arrive with a lot of advance warning, and they grow up slowly from many failed attempts at implementation. For example, column stores are super hot today, but they were invented back in the 70ies.

Let us have a look at the “evolutionary tree of life” for data storage engines:

[Diagram: the evolutionary tree of life for data storage engines – image not preserved]

Above is of course a gross over-simplification, and you could argue about where the arrows should connect. But no matter where those arrows lead, it makes the point I want to make: That nearly all of these technologies are still around today and very much alive (perhaps with the exception of XML databases, but those were pretty silly to begin with). In fact, most large organisations have most of them around.

Are all these technologies just alive because they have not yet been replaced with the latest and greatest? Is noSQL, HADOOP or some other new product bound to replace them all over time?

I think the different technologies are there for a good reason. A reason very similar to why there are many species of animals in the world. Each technology is evolved to solve a certain problem, and to live well in a specific environment. When you target the right technology to a certain problem (for example column stores towards dimensional models) you solve that problem elegantly and with the least amount of compute cycles – preserving energy. Very soon, solving problems with a low number of compute cycles in elegant ways is going to really count.

While you might be able to solve most of the structured DW data problems with HADOOP, you eventually have to ask: Is a generic map/reduce paradigm really the way to go about that?

Humans have a strange way of wanting a single answer to the big and complex questions in life, and we waste significant time searching for it. Entire cults of management theories (and I use that term lightly) and religions are built around these over-simplifications. We seek simple answers to complex questions, and as IT architects, this can lead us down a path of believing in “killer architectures”.

What history has taught us is that such killer architectures do not exist, but that the optimal strategy (just like in nature) depends on which environment you find yourself in. Being an IT architect means rising above all these technologies, to see that big picture, and resisting the temptation to find a single solution/platform for all problems in the enterprise. The key is to understand the environment you live in, and design for it using a palette of technologies.

Staying with the metaphor, this leads me to another conclusion: Just like evolution made us store fat on our bodies in different places depending on the usage (thank goodness), you also have to consider the option of storing your data in more than one place. Having a bit of padding all over the body is a lot more charming (and healthy) than a beer belly :)



 
