Full-Text Search on Azure with Lucene.NET

by Leon Cullens, 18 November 2012 15:50

Full-Text Search is a very powerful feature that can be used for a wide range of business scenarios, such as building a search engine. The problem is that Full-Text Indexes aren't supported in the current version of SQL Azure, so we have to use Lucene.NET instead. At the time of writing, Lucene.NET does not work with the latest Azure SDK out of the box, so we had to fix that. In this post I will describe how I got Lucene.NET 3.0.3.0 working on Azure with SDK 1.8.

Full-Text Search

What is Full-Text Search? I think that MSDN has a better explanation than I can write:

Full-text search is applicable to a wide range of business scenarios such as e-businesses - searching for items on a web site; law firms - searching for case histories in a legal-data repository; or human resources departments - matching job descriptions with stored resumes. The basic administrative and development tasks of full-text search are equivalent regardless of business scenarios. However, in a given business scenario, full-text index and queries can be honed to meet business goals. For example, for an e-business maximizing performance might be more important than ranking of results, recall accuracy (how many of the existing matches are actually returned by a full-text query), or supporting multiple languages. For a law firm, returning every possible hit (total recall of information) might be the most important consideration.

I use Full-Text Search to build a search engine, so I benefit from a couple of features:

  • Searching one or more specific words or phrases (to find a book by a small portion of the title for example)
  • Searching a word or phrase close to another word or phrase (to give results even though the user made a typo)
  • Using weighted values (for instance: if a specific word is found in a title, it is more important than when a word is found in the name of a publisher)

Many more useful search queries can be constructed, as can be found on MSDN. Another big advantage of Full-Text Search over a simple SQL LIKE statement is performance. Full-Text Search is FAST: where a LIKE query on millions of records can take seconds or minutes, a Full-Text Search query can search through millions of records in a few milliseconds.

Lucene.NET

Full-Text Search works fine on SQL Server, but not on SQL Azure, the Azure version of SQL Server. This means that we have to make do with Lucene.NET. Lucene.NET is a direct port of Lucene, which is:

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

It supports the following features:

  • Ranked searching - best results returned first
  • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • Fielded searching (e.g., title, author, contents)
  • Date-range searching
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching
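
To make the query types in this list a bit more concrete, here is a minimal sketch of how Lucene's query syntax maps to query objects. It assumes a project that references Lucene.Net 3.0.3; the class name QuerySyntaxDemo and the default field "Title" are only illustrative and not part of my actual setup.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version; // avoid the clash with System.Version

// Hypothetical helper, for illustration only.
public static class QuerySyntaxDemo
{
    // Parses a query string; terms without an explicit field prefix
    // are searched in the "Title" field (an assumed field name).
    public static Query Parse(string queryText)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "Title", analyzer);
        return parser.Parse(queryText);
    }
}
```

With this helper, Parse("\"lord rings\"~3") should give a PhraseQuery (a proximity search), Parse("hobit~0.7") a FuzzyQuery, Parse("hob*") a PrefixQuery, and Parse("Title:tolkien^4 Publisher:tolkien") a BooleanQuery in which the title clause carries a higher weight.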

The current version of Lucene.NET is 3.0.3.0 and can be simply downloaded via NuGet (but don't do this yet, because that version doesn't work).

Lucene.NET on Azure

The 'best' way to get Lucene.NET working on Azure is to have a worker role run Lucene.NET, create an index periodically (it needs to be refreshed from time to time to handle new entries), and store the index in BLOB storage. The BLOB storage can be accessed through a REST API or through the Azure API for .NET (or other languages). There is a project that does this automatically for you: it's called AzureDirectory and simply stores your index in the BLOB storage.

The BLOB storage automatically replicates the index across other servers in the same location (local replication) or in other countries/states (geo replication).

Once the index is in the BLOB storage, other servers can read it from there and use it to handle search requests. Servers can't access each other's local data, which is why we place the index in the BLOB storage.

In the image above you can see my setup: I have 1 worker role (there's no use having multiple instances of it, because only 1 worker can create the index) which creates an index in the BLOB storage, and I have a web role that has read-only access to the index to handle search requests.

The problem

There is a problem however. The AzureDirectory project is outdated and does not work with Azure SDK version 1.8. No matter what version of AzureDirectory or Lucene.NET you try, the examples from the internet won't work because you will get many build errors and DLLs that can't be found by Visual Studio.

I literally read ALL the links that I found on Google and asked questions on StackOverflow, but nobody knew the answer. Until I met Richard Astbury at Microsoft, who delved into the problem and released a fix.

The fix was quite simple: the AzureDirectory project used old dependencies (like the old Azure storage engine), so he updated the Lucene.NET reference to version 3.0.3.0 and the Azure storage client to version 2.0 (which comes with Azure SDK 1.8). His source can be found on GitHub. Simply build the source (in release mode), and include the generated DLLs in your project.

My setup

Below is a simplified version of the code that I've got working. This will probably be changed later, but it's a good start:

WorkerRole:

public class WorkerRole : RoleEntryPoint
{
    private AzureDirectory _directory;

    public override void Run()
    {
        Indexer indexer = new Indexer(_directory); // Create an Indexer class (which I made myself)

        while (true)
        {
            indexer.CreateIndex(); // Start creating the index
        }
    }

    public override bool OnStart()
    {
        // Set the maximum number of concurrent connections 
        ServicePointManager.DefaultConnectionLimit = 12;

        // Set up the Azure directory to store full text index
        string connectionString = CloudConfigurationManager.GetSetting("ConnectionString");
        CloudStorageAccount storageAccount = CloudStorageAccount.Parse(connectionString);
        _directory = new AzureDirectory(storageAccount, "BookCatalog");

        return base.OnStart();
    }
}

What we see here is quite simple: we get the connection string for the storage emulator from the ServiceConfiguration.Local.cscfg with CloudConfigurationManager.GetSetting(), create a CloudStorageAccount and create an Azure directory called "BookCatalog". Then we simply create an indexer and start working.

Indexer:

class Indexer
{
    private readonly AzureDirectory _directory;

    public Indexer(AzureDirectory directory)
    {
        _directory = directory;
    }

    public void CreateIndex()
    {
        IndexWriter indexWriter = new IndexWriter(_directory, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); // Create the IndexWriter (true = create a new index, replacing any existing one)

        try
        {
            foreach (Book book in _bookService.List()) // Use whatever data source you like (_bookService is not shown here)
            {
                Document document = CreateBookDocument(book); // Create a 'Document' for each row of your data

                indexWriter.AddDocument(document);
            }

            indexWriter.Optimize(); // Call optimize once to improve performance
            indexWriter.Dispose(); // Commit and dispose the object

            Thread.Sleep(60 * 10 * 1000); // Sleep 10 minutes when the index is created successfully, otherwise immediately restart the process
        }
        catch (Exception)
        {
            // A failure while adding documents now ends up here as well, instead of escaping from Run()
            indexWriter.Rollback(); // Discard the uncommitted changes
            indexWriter.Dispose();
        }
    }

    private Document CreateBookDocument(Book book)
    {
        Document document = new Document();
        document.Add(new Field("Id", book.Id.ToString(), Field.Store.YES, Field.Index.NO, Field.TermVector.NO)); // We want to store the ID so we can retrieve data for the ID, but we don't want to index it because it is useless searching through the IDs
        document.Add(new Field("Title", book.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
        document.Add(new Field("Publisher", book.Publisher, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
        document.Add(new Field("Isbn10", book.Isbn10, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));

        return document;
    }
}

So the first thing you do is create an IndexWriter with the supplied Directory (where the index is going to be stored), create Document objects for all the entries that you want to index, add them to the IndexWriter, and call Dispose(), which commits the documents. When creating a Document you can choose if you want to store fields, analyze them or not, and if you want to vectorize them or not. I'm not going into that in this post.
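
As a rough illustration of what the store/index choices mean in practice, here is a small self-contained sketch. It is not part of my setup: it uses an in-memory RAMDirectory instead of AzureDirectory, the class name FieldOptionsDemo and the sample values are made up, and the Isbn10 field is indexed with Field.Index.NOT_ANALYZED (unlike in my Indexer above) to show exact matching.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version; // avoid the clash with System.Version

// Hypothetical helper, for illustration only.
public static class FieldOptionsDemo
{
    // Indexes one sample book in memory and counts the hits for an exact term in a field.
    public static int CountHits(string field, string term)
    {
        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                var document = new Document();
                // Stored but not indexed: can be read back from a hit, but cannot be searched
                document.Add(new Field("Id", "42", Field.Store.YES, Field.Index.NO));
                // Analyzed: split into lower-cased tokens, stop words such as "the" are dropped
                document.Add(new Field("Title", "The Hobbit", Field.Store.YES, Field.Index.ANALYZED));
                // Not analyzed: indexed as one exact term (made-up sample value)
                document.Add(new Field("Isbn10", "1234567890", Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(document);
            } // Disposing the writer commits the document

            using (var searcher = new IndexSearcher(directory, true)) // read-only
            {
                return searcher.Search(new TermQuery(new Term(field, term)), 10).TotalHits;
            }
        }
    }
}
```

In this sketch CountHits("Title", "hobbit") and CountHits("Isbn10", "1234567890") each find the book, while CountHits("Id", "42") finds nothing because the Id field is stored but not indexed.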

The reader, BookRepository:

private AzureDirectory _directory;
private IndexSearcher _searcher;
private readonly CloudStorageAccount _storageAccount;

public BookRepository(EfContext context) : base(context)
{
    // Get Azure settings
    string connectionString = CloudConfigurationManager.GetSetting("ConnectionString");
    _storageAccount = CloudStorageAccount.Parse(connectionString);
}

public IEnumerable<Book> Search(string search)
{
    _directory = new AzureDirectory(_storageAccount, "BookCatalog"); // Get the directory where the index is stored

    IndexReader reader = IndexReader.Open(_directory, true); // Open it in read-only mode

    _searcher = new IndexSearcher(reader); // Create an IndexSearcher which will search the index through the reader
    Query query = new MultiFieldQueryParser(Version.LUCENE_30, GetFields(), new StandardAnalyzer(Version.LUCENE_30)).Parse(search); // Construct the query using a StandardAnalyzer()

    TopDocs results = _searcher.Search(query, 10); // Get the top 10 best matching hits

    return ParseResults(results); // Convert the TopDocs to usable entities
}

private string[] GetFields()
{
    return new[] { "Id", "Title", "Publisher", "Isbn10" }; // List the fields in the index
} 

private IEnumerable<Book> ParseResults(TopDocs results)
{
    ScoreDoc[] hits = results.ScoreDocs;

    List<Book> books = hits.Select(hit => _searcher.Doc(hit.Doc)).Select(document => new Book
    {
        Id = Guid.Parse(document.Get("Id")),
        Title = document.Get("Title"),
        Publisher = document.Get("Publisher"),
        Isbn10 = document.Get("Isbn10"),
    }).ToList();

    return books;
}

So that's that. Pretty straightforward stuff I think. Just get the Directory again, read the data from the directory with an IndexReader, construct a query (I used MultiFieldQueryParser because I want to search in multiple columns), and parse the results. You can also use analyzers other than the StandardAnalyzer, but that's outside the scope of this blog post.
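
The weighted values I mentioned at the start (a hit in the title counts for more than a hit in the publisher name) can be sketched with the MultiFieldQueryParser overload that takes per-field boosts. This is only a sketch under assumptions: it runs against an in-memory RAMDirectory rather than the AzureDirectory index, and the class name BoostDemo, the boost values and the sample books are made up.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version; // avoid the clash with System.Version

// Hypothetical helper, for illustration only.
public static class BoostDemo
{
    // Indexes two sample books in memory and returns the title of the best match (or null).
    public static string TopTitle(string search)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                writer.AddDocument(CreateBook("Fantasy worlds", "Tolkien Press"));  // match in the publisher only
                writer.AddDocument(CreateBook("Tolkien biography", "Acme Books")); // match in the title only
            }

            // A match in the title weighs four times as much as a match in the publisher (assumed values)
            var boosts = new Dictionary<string, float> { { "Title", 4f }, { "Publisher", 1f } };
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Title", "Publisher" }, analyzer, boosts);

            using (var searcher = new IndexSearcher(directory, true)) // read-only
            {
                TopDocs results = searcher.Search(parser.Parse(search), 10);
                return results.TotalHits == 0 ? null : searcher.Doc(results.ScoreDocs[0].Doc).Get("Title");
            }
        }
    }

    private static Document CreateBook(string title, string publisher)
    {
        var document = new Document();
        document.Add(new Field("Title", title, Field.Store.YES, Field.Index.ANALYZED));
        document.Add(new Field("Publisher", publisher, Field.Store.YES, Field.Index.ANALYZED));
        return document;
    }
}
```

Here BoostDemo.TopTitle("tolkien") matches both books, but the one with the word in its title should come out on top because of the higher boost on the Title field.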

Summary

Running Full-Text Search on Azure isn't supported out of the box, and the third-party solution (Lucene.NET) doesn't work out of the box either. Luckily there's a relatively easy fix: download the source for the AzureDirectory project, update the references, build it in release mode, and include the DLLs in your project. When you've got it working, you have a blazing fast search monster that can handle everything you throw at it.


Comments (10)

Jim (United Kingdom), 19-11-2012 16:16:57

Leon, I've managed to do something similar to your blog post in the last couple of days, and it all works fine in the dev emulator for ~3million rows of data.

As soon as I run it in Azure, the indexing process crashes after about 500,000 records, no matter how much try-catch error handling I put in, I can never catch where it fails.

The Worker role just falls over, and re-starts.

I have Diagnostics set up, logging quite a lot to try and see where it fails, but it just eventually stops (nothing from my OnStop trace line), then the role restarts.

Have you seen anything like this with your implementation?

Very frustrating!

Leon Cullens (Netherlands), 19-11-2012 16:39:45

Hi Jim,

I haven't tested my implementation with millions of records yet, currently I'm testing it against a couple of hundred records. Please check if you're not using too much disk or RAM space, that can be one of the reasons the indexing fails. It's hard to tell you what the problem is without seeing the code, especially since the AzureDirectory code is coded pretty poorly (especially the error handling).

If you can't figure out the problem, feel free to ask a question on StackOverflow, maybe there's someone else who experienced similar problems.

Good luck!

Jim (United Kingdom), 19-11-2012 16:51:21

Cheers Leon.

How can I check that? Do I need to remote into the instance?

I'll be asking a Q on SO soon I think :)

Leon Cullens (Netherlands), 19-11-2012 17:17:23

Hi Jim,

There is no need to enable remote control, this is a task that perfectly fits Windows Azure Diagnostics, which involves creating a couple of performance counters that simply log some data.

See the Azure documentation on Azure Diagnostics: www.windowsazure.com/.../, specific information on performance counters: msdn.microsoft.com/.../hh411520.aspx and this StackOverflow thread that answers your question: stackoverflow.com/.../how-to-view-report-on-windows-azure-cpu-and-memory-usage.

I think you'll figure it out :)

Mike (Russia), 21-11-2012 14:14:26

It seems that you have a typo in github.com/.../7238a5e07ac8ed34ae6dc89ffd069fec4378cdca

Leon Cullens (Netherlands), 21-11-2012 16:11:44

Hi Mike,

I can't see it. Where is the typo?

Mike (Russia), 21-11-2012 16:21:03

It should be:
blob.ReleaseLease(new AccessCondition() { LeaseId = tempLease});
(line 47)

Sergey (United States), 21-11-2012 21:16:08

Very Nice post! Thanks for putting it out there. Just minor improvement I would move Thread.Sleep to WorkerRole.Run method since CreateIndex method should not be responsible of delaying the service, I think it's worker role's job.

Cheers

mandy (India), 29-11-2012 6:47:08

hi.. I have tried the solutions which you gave.. and those work. Can you please give source code of this code? the one which you have created with worker role?

Thanks

Joe (United States), 16-12-2012 2:41:15

Thanks Leon. I tried Lucene based search on Windows Azure about a year ago. I was able to get it to work using the Azure Library for Lucene.Net (code.msdn.microsoft.com/.../Azure-Library-for-83562538). However, there were a bunch of limitations in terms of index writer and just overall management.

I even emailed Scott Guthrie about Full Text Search Service on Windows Azure and he thought that I was talking about Full-Text Search in SQL Azure.

Anyway, I had since given up on the project until a better option came along and it did. It is Amazon CloudSearch service (http://aws.amazon.com/cloudsearch). I waited for a while to hear anything from Azure team regarding a similar CloudSearch service for lower cost than Amazon's but nothing (I even emailed Scott Guthrie about Amazon's CloudSearch offering).

Finally, I am starting my project again and while I will be writing everything in Windows Azure, I will be utilizing Amazon CloudSearch through JSON results. Since low latency and very fast response is my key requirement, SQL Azure Full-Text Search when available will still be very slow for my needs. So I have no choice but to use Amazon CloudSearch (I don't want to use other search providers such as the ones based on Solr or others at the moment because I want to stick to one or two cloud platforms at the most.)

about

Name: Leon Cullens
Country: The Netherlands
Job: Software Engineer / Entrepreneur
Studied: Computer Science 
Main skills: Microsoft technology (Azure, ASP.NET MVC, Windows 8, C#, SQL Server, Entity Framework), software architecture (enterprise architecture, design patterns), Marketing, growth hacking, entrepreneurship
