Solr Search, Stop Words and DisMax Search Handler

Yesterday I ran across an interesting problem while troubleshooting Solr (the Lucene-based search server) for a client. The client has a CQ 5.3 implementation but, for various reasons, decided to use Solr for search on their site instead of the built-in JCR search. No biggie, but they eventually noticed the following: say a page is named “The Black Cat was Crossing the Street”. If a user entered “the black cat was crossing the street”, they would get no results. However, if they entered “black cat crossing street”, they would get the expected page, and if they entered “the black cat was crossing the street” as a phrase search (with actual double quotes surrounding it), they would also get the expected result.

I was tasked with troubleshooting this, and what I found was quite interesting: a bit obscure, but it makes perfect sense in hindsight. First of all, I was told that the client was using a “stopwords” filter to keep the Solr index from growing too large, so “the”, “was” and the second instance of “the” should have been filtered out of both the indexed title AND the query input. I soon found that if I took out two of the three stopwords (either both “the”s, or one “the” and the “was”), I would get the desired result… odd.

Without going into everything I had to do to figure this one out, here’s what the issue was. In addition to using stopwords, the client was also using the DisMax Search Handler, which has some additional configuration parameters and can search across multiple fields so that each one can be given its own “weighting”. This particular implementation searched on “title”, “tag”, “author” and “text” (defined in the “qf” config parameter). The “title” and “text” fields were defined in “schema.xml” as a simple “text” type WITH stopwords, while “tag” and “author” were defined as “lowercase” fields WITHOUT stopwords. DisMax turns each word of the query into its own clause that is checked against every field in “qf”, and the “mm” (minimum should match) parameter decides how many of those clauses have to match. For a word like “the”, the “title” and “text” analyzers throw the word away, but the “tag” and “author” analyzers keep it, so the clause survives and can only ever match against “tag” or “author”. No tag or author contains “the”, so each stopword in the query becomes a required clause that can never match, and enough of them will push the query below the “mm” threshold. That discrepancy in how the four fields were analyzed is what led to the issue.
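To make the mechanism concrete, here is a toy model of it in Python. This is not Solr code, just a sketch of the clause-counting logic. The stopword list, the whitespace tokenizer, the sample values for “text”, “tag” and “author”, and the max_missing parameter (a crude stand-in for “mm”, whose real value on the client’s site is not given here) are all assumptions made for illustration.

```python
# Toy model of how DisMax combines per-field analysis with "mm".
# Everything here is illustrative; none of it is Solr's actual code.

STOPWORDS = {"the", "was"}  # assumed stopword list


def analyze(text, remove_stopwords):
    """Lowercase, split on whitespace, optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def matches(query, doc, fields, max_missing=1):
    """fields maps field name -> True if its analyzer removes stopwords.

    max_missing is a simplified stand-in for "mm": the number of
    query clauses that are allowed to go unmatched.
    """
    # Index every field of the document with that field's own analyzer.
    index = {f: set(analyze(doc.get(f, ""), sw)) for f, sw in fields.items()}

    clauses = []
    for term in query.split():
        # One clause per query word, analyzed separately for each field.
        sub = [(f, tok) for f, sw in fields.items() for tok in analyze(term, sw)]
        if sub:  # a word dropped by EVERY field's analyzer vanishes entirely
            clauses.append(sub)

    matched = sum(any(tok in index[f] for f, tok in c) for c in clauses)
    return len(clauses) - matched <= max_missing


# Hypothetical document; only the title comes from the example above.
DOC = {
    "title": "The Black Cat was Crossing the Street",
    "text": "A story about a cat",
    "tag": "cats",
    "author": "Jane Doe",
}
MIXED = {"title": True, "text": True, "tag": False, "author": False}
CONSISTENT = {"title": True, "text": True, "tag": True, "author": True}

full = "the black cat was crossing the street"
print(matches(full, DOC, MIXED))                          # False
print(matches("black cat crossing street", DOC, MIXED))   # True
print(matches(full, DOC, CONSISTENT))                     # True
```

With the mixed configuration, the three stopwords survive as clauses that can only match “tag” or “author”, so the full query fails. Under this assumed threshold, leaving just one stopword in the query still matches, which lines up with what I saw. With stopwords applied to all four fields, the stopword clauses disappear entirely and the full query matches.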

Now, although I was able to find many posts online from people who have encountered this issue, I found only one that explains exactly WHY it happens and that gave me what I needed to solve it completely. Everyone else (of those I found) was basically satisfied to hack around it, but this wouldn’t do for my client. So I decided to post about the issue, link to the site that ultimately helped me solve it, and hopefully give other developers some more relevant Google results.

I will say that, while the above site was helpful, I disagree that the ultimate solution is to simply not use stopwords. Stopwords are very useful for paring down the size of your search index if you have a huge amount of content on your site, and they are a valid answer to indexing and search-performance concerns. As such, you should be able to use them in tandem with the DisMax Search Handler. In my humble opinion, the solution is to make sure that the same stopwords configuration is used on every field that your DisMax Search Handler searches (everything listed in “qf”). That way you get the best of both worlds: an efficient search index as well as the ability to fine-tune the weightings of the different fields in your search.
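If you want to check your own setup for this kind of mismatch, something like the following Python sketch could help. It parses a schema and reports, for each “qf” field, whether its field type includes a stop filter. The embedded schema is a made-up, cut-down example shaped like the configuration described above, not the client’s real “schema.xml”, and the helper name stopword_usage is my own invention.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down schema for illustration only.
SCHEMA = """
<schema name="example" version="1.2">
  <types>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="lowercase" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="title" type="text"/>
    <field name="text" type="text"/>
    <field name="tag" type="lowercase"/>
    <field name="author" type="lowercase"/>
  </fields>
</schema>
"""


def stopword_usage(schema_xml, qf_fields):
    """Return {field name: True if its field type has a stop filter}."""
    root = ET.fromstring(schema_xml)

    # Which field types contain a StopFilterFactory in any analyzer?
    type_has_stop = {
        ft.get("name"): any(
            flt.get("class") == "solr.StopFilterFactory"
            for flt in ft.iter("filter")
        )
        for ft in root.iter("fieldType")
    }
    # Which type does each declared field use?
    field_type = {f.get("name"): f.get("type") for f in root.iter("field")}

    return {name: type_has_stop[field_type[name]] for name in qf_fields}


usage = stopword_usage(SCHEMA, ["title", "tag", "author", "text"])
for name, has_stop in usage.items():
    print(name, "stopwords" if has_stop else "NO stopwords")

if len(set(usage.values())) > 1:
    print("Warning: qf fields do not share the same stopword handling")
```

The check only looks for the presence of a stop filter. It does not compare the stopword files themselves, and it assumes “qf” is passed as plain field names without boosts, so treat it as a starting point rather than a complete validator.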
