Full-Text Search is a very powerful feature that can be used for a wide range of business scenarios, such as building a search engine. The problem is that Full-Text Indexes aren't supported in the current version of SQL Azure, so we have to use Lucene.NET instead. At the time of writing, Lucene.NET does not work on Azure with the latest Azure SDK out of the box, so that had to be fixed first. In this post I will describe how I got Lucene.NET 3.0.3.0 working on Azure with SDK 1.8.
Full-Text Search
What is Full-Text Search? I think that MSDN has a better explanation than I can write:
Full-text search is applicable to a wide range of business scenarios such as e-businesses - searching for items on a web site; law firms - searching for case histories in a legal-data repository; or human resources departments - matching job descriptions with stored resumes. The basic administrative and development tasks of full-text search are equivalent regardless of business scenarios. However, in a given business scenario, full-text index and queries can be honed to meet business goals. For example, for an e-business maximizing performance might be more important than ranking of results, recall accuracy (how many of the existing matches are actually returned by a full-text query), or supporting multiple languages. For a law firm, returning every possible hit (total recall of information) might be the most important consideration.
I use Full-Text Search to build a search engine, so I benefit from a couple of features:
- Searching one or more specific words or phrases (to find a book by a small portion of the title for example)
- Searching a word or phrase close to another word or phrase (to give results even though the words the user typed are not directly next to each other in the text)
- Using weighted values (for instance: if a specific word is found in a title, it is more important than when a word is found in the name of a publisher)
Many more useful kinds of search queries can be constructed; MSDN lists them all. Another big advantage of Full-Text Search over a simple SQL LIKE statement is performance. Full-Text Search is FAST. A LIKE query with a leading wildcard cannot use a regular index, so it has to scan every row, and on millions of records that can take seconds or minutes. A Full-Text Search query looks the words up in a prebuilt index and can search through millions of records in a few milliseconds.
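To make the speed difference a bit more concrete, here is a minimal sketch of the idea behind a full-text index: an inverted index that maps every word to the set of records containing it. This is not how SQL Server or Lucene implement it internally; the class name TinyInvertedIndex and its tokenizer are made up purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a toy inverted index, not a real search engine.
public class TinyInvertedIndex
{
    // Maps each lower-cased word to the ids of the records that contain it.
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>();

    // Indexing happens once, up front, for every record.
    public void Add(int id, string text)
    {
        foreach (string term in Tokenize(text))
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(term, out ids))
            {
                ids = new HashSet<int>();
                _postings[term] = ids;
            }
            ids.Add(id);
        }
    }

    // Returns the ids of records containing ALL words of the query, in ascending order.
    // Each word is a dictionary lookup, so no record text is scanned at query time.
    public IEnumerable<int> Search(string query)
    {
        HashSet<int> result = null;
        foreach (string term in Tokenize(query))
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(term, out ids))
            {
                return Enumerable.Empty<int>(); // One word is missing everywhere: no match
            }
            if (result == null)
            {
                result = new HashSet<int>(ids);
            }
            else
            {
                result.IntersectWith(ids);
            }
        }
        return result == null ? Enumerable.Empty<int>() : result.OrderBy(id => id);
    }

    // Very naive tokenizer: lower-case and split on a few separators.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new[] { ' ', ',', '.', ':', ';', '-' },
            StringSplitOptions.RemoveEmptyEntries);
    }
}
```

After adding "The Hobbit" as record 1 and "The Lord of the Rings" as record 2, a search for "the rings" only has to look up two words and intersect two small sets, no matter how many records there are. A real engine adds ranking, stemming, stop words and compact on-disk storage on top of this idea.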
Lucene.NET
Full-Text Search works fine on SQL Server, but not on SQL Azure, the Azure version of SQL Server. This means that we have to make do with Lucene.NET. Lucene.NET is a direct port of Lucene, which is described as follows:
Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
It supports the following features:
- Ranked searching - best results returned first
- Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
- Fielded searching (e.g., title, author, contents)
- Date-range searching
- Sorting by any field
- Multiple-index searching with merged results
- Allows simultaneous update and searching
The current version of Lucene.NET is 3.0.3.0 and can be simply downloaded via NuGet (but don't do this yet, because that version doesn't work).
Lucene.NET on Azure

The 'best' way to get Lucene.NET working on Azure is to have a worker role run Lucene.NET, create an index periodically (it needs to be refreshed from time to time to handle new entries) and store the index in BLOB storage, which acts as the file system. The BLOB storage can be accessed through a REST API or through the Azure API for .NET (or other languages). There is a project that does this automatically for you: it's called AzureDirectory and it simply stores your index in the BLOB storage.
The BLOB storage automatically replicates the index across other servers in the same location (local replication) or in other countries/states (geo replication).
Role instances can't access each other's local disks, which is why the index is placed in the BLOB storage. Once the index has been created there, other servers can read its contents from the BLOB storage and use it to handle search requests.
In the image above you can see my setup: I have 1 worker role (there's no use having multiple instances of it, because only 1 worker can create the index) which creates an index in the BLOB storage, and I have a web role that has read-only access to the index to handle search requests.
The problem

There is a problem however. The AzureDirectory project is outdated and does not work with Azure SDK version 1.8. No matter what version of AzureDirectory or Lucene.NET you try, the examples from the internet won't work because you will get many build errors and DLLs that can't be found by Visual Studio.
I literally read ALL the links that I found on Google and I asked questions on StackOverflow, but nobody knew the answer. Until I met Richard Astbury at Microsoft, who delved into the problem and released a fix.
The fix was quite simple: the AzureDirectory project used old dependencies (like the old Azure storage client), so he updated the Lucene.NET reference to version 3.0.3.0 and the Azure storage client to version 2.0 (which comes with Azure SDK 1.8). His source can be found on GitHub. Simply build the source (in release mode) and include the generated DLLs in your project.
My setup
Below is a simplified version of the code that I've got working. This will probably be changed later, but it's a good start:
WorkerRole:
using System.Net;
using Lucene.Net.Store.Azure;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;

public class WorkerRole : RoleEntryPoint
{
    private AzureDirectory _directory;

    public override void Run()
    {
        Indexer indexer = new Indexer(_directory); // Create an Indexer class (which I made myself)
        while (true)
        {
            indexer.CreateIndex(); // Build the index; CreateIndex() sleeps after a successful run
        }
    }

    public override bool OnStart()
    {
        // Set the maximum number of concurrent connections
        ServicePointManager.DefaultConnectionLimit = 12;

        // Set up the Azure directory to store the full-text index
        string connectionString = CloudConfigurationManager.GetSetting("ConnectionString");
        CloudStorageAccount storageAccount = CloudStorageAccount.Parse(connectionString);
        _directory = new AzureDirectory(storageAccount, "BookCatalog");

        return base.OnStart();
    }
}
What we see here is quite simple: we get the connection string for the storage emulator from ServiceConfiguration.Local.cscfg with CloudConfigurationManager.GetSetting(), create a CloudStorageAccount and create an Azure directory called "BookCatalog". Then we simply create an indexer and start working.
Indexer:
using System;
using System.Threading;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store.Azure;
using Version = Lucene.Net.Util.Version; // Avoid the clash with System.Version

class Indexer
{
    private readonly AzureDirectory _directory;

    // Placeholder for your own data source; BookService is not shown in this post
    private readonly BookService _bookService = new BookService();

    public Indexer(AzureDirectory directory)
    {
        _directory = directory;
    }

    public void CreateIndex()
    {
        // Create the IndexWriter; 'true' means the index is rebuilt from scratch on every run
        IndexWriter indexWriter = new IndexWriter(_directory, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            foreach (Book book in _bookService.List()) // Use whatever data source you like
            {
                Document document = CreateBookDocument(book); // Create a 'Document' for each row of your data
                indexWriter.AddDocument(document);
            }
            indexWriter.Optimize(); // Call optimize once to improve performance
            indexWriter.Dispose(); // Commit and dispose the object
        }
        catch (Exception)
        {
            // Something failed: discard the changes and return, so the worker immediately restarts the process
            indexWriter.Rollback();
            indexWriter.Dispose();
            return;
        }
        Thread.Sleep(TimeSpan.FromMinutes(10)); // Sleep 10 minutes when the index is created successfully
    }

    private Document CreateBookDocument(Book book)
    {
        Document document = new Document();
        // We want to store the ID so we can retrieve data for the ID, but we don't want to index it because it is useless searching through the IDs
        document.Add(new Field("Id", book.Id.ToString(), Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
        document.Add(new Field("Title", book.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
        document.Add(new Field("Publisher", book.Publisher, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
        document.Add(new Field("Isbn10", book.Isbn10, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
        return document;
    }
}
So the first thing you do is create an IndexWriter with the supplied Directory (where the index is going to be stored), create Document objects for all the entries that you want to index, add them to the IndexWriter, and call Dispose(), which commits the documents. When creating a Document you can choose if you want to store fields, analyze them or not, and if you want to vectorize them or not. I'm not going into that in this post.
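Without going into all the options, the difference between storing and indexing a field can be illustrated with a small sketch. It uses an in-memory RAMDirectory instead of AzureDirectory so it runs without any Azure setup; the class name FieldOptionsSketch and the sample values are made up for this illustration.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Illustrative only: shows stored-but-not-indexed versus stored-and-analyzed fields.
public static class FieldOptionsSketch
{
    // Returns { hits when searching on "Id", hits when searching on "Title" }
    // and hands back the stored Id of the title hit through 'storedId'.
    public static int[] CountHits(out string storedId)
    {
        using (var directory = new RAMDirectory())
        {
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);
            using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                var document = new Document();
                // Stored, but not indexed: retrievable from a hit, not searchable
                document.Add(new Field("Id", "42", Field.Store.YES, Field.Index.NO));
                // Stored and analyzed: the analyzer lower-cases and tokenizes the title
                document.Add(new Field("Title", "The Hobbit", Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(document);
            } // Disposing the writer commits the document

            using (var searcher = new IndexSearcher(directory, true))
            {
                TopDocs byId = searcher.Search(new TermQuery(new Term("Id", "42")), 1);
                TopDocs byTitle = searcher.Search(new TermQuery(new Term("Title", "hobbit")), 1);

                storedId = byTitle.TotalHits > 0
                    ? searcher.Doc(byTitle.ScoreDocs[0].Doc).Get("Id")
                    : null;
                return new[] { byId.TotalHits, byTitle.TotalHits };
            }
        }
    }
}
```

Searching on "Id" should find nothing because that field was never indexed, while the search on "Title" finds the book and still gives access to the stored Id. That is exactly the combination used for the Id field in the indexer.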
The reader, BookRepository:
private AzureDirectory _directory;
private IndexSearcher _searcher;
private readonly CloudStorageAccount _storageAccount;

public BookRepository(EfContext context) : base(context)
{
    // Get Azure settings
    string connectionString = CloudConfigurationManager.GetSetting("ConnectionString");
    _storageAccount = CloudStorageAccount.Parse(connectionString);
}

public IEnumerable<Book> Search(string search)
{
    _directory = new AzureDirectory(_storageAccount, "BookCatalog"); // Get the directory where the index is stored
    using (IndexReader reader = IndexReader.Open(_directory, true)) // Open it in read-only mode
    using (_searcher = new IndexSearcher(reader)) // Create an IndexSearcher which will search the index
    {
        Query query = new MultiFieldQueryParser(Version.LUCENE_30, GetFields(), new StandardAnalyzer(Version.LUCENE_30)).Parse(search); // Construct the query using a StandardAnalyzer()
        TopDocs results = _searcher.Search(query, 10); // Get the top 10 best matching hits
        return ParseResults(results); // Convert the TopDocs to usable entities before the searcher is disposed
    }
}

private string[] GetFields()
{
    // List the searchable fields in the index ("Id" is stored but not indexed, so it is left out)
    return new[] { "Title", "Publisher", "Isbn10" };
}

private IEnumerable<Book> ParseResults(TopDocs results)
{
    ScoreDoc[] hits = results.ScoreDocs;
    List<Book> books = hits.Select(hit => _searcher.Doc(hit.Doc)).Select(document => new Book
    {
        Id = Guid.Parse(document.Get("Id")),
        Title = document.Get("Title"),
        Publisher = document.Get("Publisher"),
        Isbn10 = document.Get("Isbn10"),
    }).ToList();
    return books;
}
So that's that. Pretty straightforward stuff I think. Just get the Directory again, read the data from the directory with an IndexReader, construct a query (I used MultiFieldQueryParser because I want to search in multiple columns), and parse the results. You can also use other analyzers than the StandardAnalyzer, but that's out of the scope for this blog post.
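The weighted values mentioned at the start of this post can be added at query time by passing boosts to MultiFieldQueryParser. The sketch below is not part of my actual setup: it uses an in-memory RAMDirectory and two made-up books, and the boost values (4 for the title, 1 for the publisher) are arbitrary assumptions you would have to tune yourself.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Illustrative only: a title match weighs more than a publisher match.
public static class BoostedSearchSketch
{
    // Indexes two sample books in memory and returns the Ids of the hits, best match first.
    public static List<string> Search(string text)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                writer.AddDocument(CreateBook("1", "Azure Basics", "Example Press"));
                writer.AddDocument(CreateBook("2", "Cooking Basics", "Azure Books"));
            }

            // Per-field boosts: a hit in "Title" counts four times as much as one in "Publisher"
            var boosts = new Dictionary<string, float> { { "Title", 4f }, { "Publisher", 1f } };
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Title", "Publisher" }, analyzer, boosts);
            Query query = parser.Parse(text);

            using (var searcher = new IndexSearcher(directory, true))
            {
                var ids = new List<string>();
                foreach (ScoreDoc hit in searcher.Search(query, 10).ScoreDocs)
                {
                    ids.Add(searcher.Doc(hit.Doc).Get("Id"));
                }
                return ids;
            }
        }
    }

    private static Document CreateBook(string id, string title, string publisher)
    {
        var document = new Document();
        document.Add(new Field("Id", id, Field.Store.YES, Field.Index.NO));
        document.Add(new Field("Title", title, Field.Store.YES, Field.Index.ANALYZED));
        document.Add(new Field("Publisher", publisher, Field.Store.YES, Field.Index.ANALYZED));
        return document;
    }
}
```

Searching for "azure" matches both books, but book 1 should come out on top because the word is in its title, while book 2 only has it in the publisher name. The same boosts dictionary could be passed to the MultiFieldQueryParser in the repository.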
Summary
Running Full-Text Search on Azure isn't supported out of the box, and the third-party solution (Lucene.NET) doesn't work out of the box either. Luckily there's a relatively easy fix: download the source for the AzureDirectory project, update the references, build it in release mode, and include the DLLs in your project. When you've got it working, you have a blazing fast search monster that can handle everything you throw at it.