Solr Buckets

I have been meaning to write this post for a couple of months now. I came across this topic a while back when I was creating some custom facets for a website running eZ Find, which is powered by Solr.

The problem: I needed to bucket all the authors together by last name (e.g. A-C, D-F and so on). I also wanted to do this entirely within Solr, since the book I was using for reference said it was possible and I did not want to use JavaScript (I wanted to take advantage of Solr’s speed and let Solr do the work).

Now, the book I was using, Apache Solr Enterprise Search Server, was very good at explaining how to create my new custom fieldType. Where it was lacking was in how to declare the synonyms for the SynonymFilterFactory. So for starters, let's look at the fieldType:

<fieldType name="bucketFirstLetterGroup" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="^([a-zA-Z]).*" group="1" />
    <filter class="solr.SynonymFilterFactory" synonyms="mb_letterBuckets.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

It is actually pretty straightforward. It uses a PatternTokenizerFactory to grab only the first letter of the value, in either case, and then the SynonymFilterFactory (with ignoreCase="true") looks at a text file to see how to group the letters. That is where the book got fuzzy. It told me to look at the default synonyms.txt file for reference and sent me on my merry way. I scoured the internet for the answer to my dilemma and finally found it through trial and error. The default synonyms.txt file looks like this:

#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz

# Some synonym groups specific to this example
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.

# Synonym mappings can be used for spelling correction too
pixima => pixma

From this file I concluded that my buckets should be declared as a-c=>a b c and so on. But that is not the case: it will appear to bucket the letters but also leave every letter as its own bucket at the same time (A-C 150 results, A 50 results and so on). It actually works the opposite way. What is on the left of the => is the token coming out of the fieldType's tokenizer (the single letter), and what is on the right is the bucket/synonym that replaces it. To work, it must be a,b,c=>a-c.
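
For reference, a complete mapping file along these lines might look like the sketch below. Only the a,b,c=>a-c rule comes from what I described above; the remaining groupings, and lumping y and z together at the end, are assumptions you should adjust to your own buckets. The file name is the one given in the synonyms attribute of the fieldType (mb_letterBuckets.txt).

```text
# mb_letterBuckets.txt
# Hypothetical bucket layout; only the first rule is confirmed in this post.
# Left of "=>": the single letter emitted by the tokenizer.
# Right of "=>": the bucket that replaces it in the index.
a,b,c => a-c
d,e,f => d-f
g,h,i => g-i
j,k,l => j-l
m,n,o => m-o
p,q,r => p-r
s,t,u => s-u
v,w,x => v-x
y,z => y-z
```

Because these are explicit => mappings, each letter is replaced by its bucket rather than kept alongside it, which is what stops the single letters from showing up as facets of their own.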


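So yeah, I have been meaning to post this for a while as I could find no meaningful help on the matter elsewhere on the internet. Also, remember to reindex! The synonym filter runs at index time, so existing documents keep their old tokens until they are indexed again.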