When creating a Lucene index, the person compiling it is presented with a number of choices about how the indexable content is to be analysed for subsequent searching and classification. The analysis chosen should support the types of search that are likely to occur but, more importantly, should be amenable to the classification rules that will most effectively classify the content.
There are a number of ways that the analysis can be done, and the choice made has a profound effect on the search results that are returned. One choice is whether words are indexed in their un-stemmed form, which supports exact-match searches; for this the Standard Analyser would be selected for indexing. Alternatively, the Snowball Analyser might be selected. This is a Porter-stemming analyser which removes the distinction between words that share the same stem. For example, the words ‘innovation’, ‘innovative’ and ‘innovator’ would all be reduced to a single stem (‘innov’ with the English stemmer), and only that term would be indexed. This approach is useful for generalizing searches, since a query for any one of these words matches documents containing any of the others. Nevertheless, in some circumstances preserving the distinction between the un-stemmed words might better serve the purpose of the resource repository.
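The difference between the two approaches can be illustrated with a small, self-contained Python sketch. It does not use Lucene: the inverted index, the helper names (tokenize, toy_stem, build_index, search) and the sample documents are all hypothetical, and the suffix-stripping function is only a crude stand-in for a real Porter stemmer. It is intended only to show how the choice of normalization at index time changes which documents a query can match.

```python
import re
from collections import defaultdict

# Toy suffix list: a stand-in for a real stemmer, NOT the Porter algorithm.
SUFFIXES = ("ation", "ative", "ator", "ate")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Look up a single-word query, normalized the same way as the index."""
    term = normalize(tokenize(query)[0])
    return sorted(index.get(term, set()))


# Hypothetical sample documents.
docs = {
    1: "A report on innovation in small firms",
    2: "Innovative teaching methods",
    3: "Profile of an innovator",
}

# Exact-word index: tokens are stored as they appear (lower-cased).
exact_index = build_index(docs, lambda t: t)
# Stemmed index: tokens are reduced by the toy stemmer before storage.
stemmed_index = build_index(docs, toy_stem)

exact_hits = search(exact_index, "innovator", lambda t: t)
stemmed_hits = search(stemmed_index, "innovator", toy_stem)

print(exact_hits)    # only the document containing the exact word
print(stemmed_hits)  # every document sharing the stem
```

In this sketch the exact-word index returns only the document that contains ‘innovator’, whereas the stemmed index returns all three. The same trade-off applies to classification rules: a rule written against the stemmed index cannot distinguish between the three words, while a rule written against the exact-word index must list every variant it wishes to match.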
Thus, the choice of analyser is best informed by the classification rules that will be written and the sorts of searches that will be appropriate to the end-user base. Once again, as alluded to in an earlier blog post, expertise is the only really useful arbiter of both the development of the system and the validation of its performance.