Saturday, August 3, 2013

Book review: Apache Solr for Indexing Data How-to

A few days ago I was kindly sent a copy of the book “Apache Solr for Indexing Data How-to” by Alexandre Rafalovitch for review. Here are my impressions of it.

Solr, by now a nine-year-old project, is a powerful piece of software, with lots of high-level features and facilities for text-centric data. It builds on Lucene, itself an 11-year-old stand-alone project.

At 80 pages, “Apache Solr for Indexing Data How-to” doesn’t try to cover all the features. Instead, it focuses on indexing, that is, getting data from some source (relational databases, text files, etc.) into Solr. This is of course a major part of using Solr.

When starting out with Solr, most people first follow the official tutorial, but then feel lost when faced with real-world requirements. The official wiki docs have greatly improved in the last few years, but there’s still a large gap between the tutorial and the docs. The reference guide is also great, but for a novice it may seem daunting at first. You can see this in many questions on Stack Overflow. This book helps close that gap a bit, at least the part about getting your data into Solr.

You can read it like a cookbook, as guidance for specific indexing scenarios. As in any good “how-to” book, each section starts with a short introduction, followed by step-by-step guidance on how to get to the goal and a “how it works” section explaining everything. An additional section adds tips and further references about each subject.

Of course you can also read it like a regular book. It starts with the most basic scenario, picking up where the tutorial leaves off, and then dives into more complex scenarios. All examples are on GitHub, so you can follow along on a concrete instance of Solr while reading. The book is written for Solr 4.3. As of now Solr 4.4 is already out and 4.5 is coming soon, but don’t worry: the dev team seems to follow Semantic Versioning, so there shouldn’t be any breaking changes.

One problem with this kind of book is that it often can’t focus just on the main topic (in this case, indexing) without at least touching on other topics. Indexing is related to the Solr schema, which in turn is a function of the search needs of your application. This book dabbles in faceting and searching when the scenario demands it, but otherwise acknowledges its limited scope and refers the reader to other books or the reference documentation when appropriate, so you never feel lost.

Another issue is the simplification of some scenarios in order to focus on operative topics and avoid scope creep. For example, the section on indexing data from a relational database uses an example where the database has only one table and no foreign keys. In most real-world scenarios you’ll have lots of related database tables, which you’ll have to denormalize and flatten depending on your search needs.
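
To give an idea of what “denormalize and flatten” means in practice, here is a minimal sketch in Python. It is not taken from the book. The book/author tables, the field names and the helper functions are all made up for illustration, and it assumes a Solr schema with a multi-valued “authors” field. It joins the related tables and collapses the result into one flat document per book, which is roughly the shape you would then send to Solr as JSON.

```python
import json
import sqlite3


def make_sample_db():
    """Create a hypothetical in-memory database with a many-to-many relation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book_author (book_id INTEGER, author_id INTEGER);
        INSERT INTO book VALUES (1, 'Book One'), (2, 'Book Two');
        INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO book_author VALUES (1, 1), (1, 2), (2, 2);
        """
    )
    return conn


def build_docs(conn):
    """Flatten the joined rows into one dict per book.

    Each book becomes a single document; its authors are gathered into
    a list (assumed to map to a multi-valued field in the Solr schema).
    """
    rows = conn.execute(
        """
        SELECT b.id, b.title, a.name
        FROM book b
        LEFT JOIN book_author ba ON ba.book_id = b.id
        LEFT JOIN author a ON a.id = ba.author_id
        ORDER BY b.id, a.name
        """
    )
    docs = {}
    for book_id, title, author in rows:
        # One join row per (book, author) pair: reuse the book's document.
        doc = docs.setdefault(
            book_id, {"id": str(book_id), "title": title, "authors": []}
        )
        if author is not None:  # books without authors keep an empty list
            doc["authors"].append(author)
    return list(docs.values())


if __name__ == "__main__":
    print(json.dumps(build_docs(make_sample_db()), indent=2))
```

Which tables you join and which fields you keep depends entirely on what your users need to search and facet on, so treat this only as a starting point.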

Overall, I think “Apache Solr for Indexing Data How-to” is great for a novice in Solr. It’s a simple, concrete guide to indexing, which is one of the first things you do with Solr. Just don’t expect it to be comprehensive: it doesn’t cover all scenarios, and you should read it alongside the docs to truly understand the concepts at work. It’s designed to help you move forward when, as a beginner, everything looks too complex and you have no idea what to do.

The tutorial will get you started, but this book will get you going.
