<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:ymaps="http://api.maps.yahoo.com/Maps/V2/AnnotatedMaps.xsd">

<channel>
	<title>austenconstable.com &#187; Search</title>
	<atom:link href="http://www.austenconstable.com/tag/search/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.austenconstable.com</link>
	<description>a year in the merde</description>
	<lastBuildDate>Thu, 15 Oct 2009 16:19:43 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Analysing email addresses with Hibernate Search &amp; Solr</title>
		<link>http://www.austenconstable.com/2009/10/15/analyzing-email-addresses-with-hibernate-search-solr/</link>
		<comments>http://www.austenconstable.com/2009/10/15/analyzing-email-addresses-with-hibernate-search-solr/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 16:02:25 +0000</pubDate>
		<dc:creator>Austen</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Hibernate]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.austenconstable.com/?p=229</guid>
		<description><![CDATA[At first glance WordDelimiterFilterFactory seems like the most appropriate solution to the problem. It splits words into subwords on intra-word delimiters. So: &#8220;email@someserver.com&#8221; -&#62; &#8220;email&#8221;, &#8220;someserver&#8221;, &#8220;com&#8221; This works well except for the fact that it splits on all &#8230; <a href="http://www.austenconstable.com/2009/10/15/analyzing-email-addresses-with-hibernate-search-solr/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>At first glance, WordDelimiterFilterFactory seems like the most appropriate way to index email addresses. It splits words into subwords on intra-word delimiters.</p>
<p>So:</p>
<ul>
<li>&#8220;email@someserver.com&#8221; -&gt; &#8220;email&#8221;, &#8220;someserver&#8221;, &#8220;com&#8221;</li>
</ul>
<p>This works well, except that it splits on <em>all</em> intra-word delimiters and, when combined with the StandardAnalyzer, also splits on letter-number transitions.</p>
<p>So:</p>
<ul>
<li>&#8220;email@some-server.com&#8221; -&gt; &#8220;email&#8221;, &#8220;some&#8221;, &#8220;server&#8221;, &#8220;com&#8221;</li>
<li>&#8220;email@server5.com&#8221; -&gt; &#8220;email&#8221;, &#8220;server&#8221;, &#8220;5&#8221;, &#8220;com&#8221;</li>
</ul>
<p>That is fine unless your users want to search for &#8220;server5&#8221; or &#8220;some-server&#8221; and the search query itself is not run through the same analyzer.</p>
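<p>To make the over-splitting concrete, here is a minimal plain-Java sketch that only approximates the behaviour described above: it breaks on every non-alphanumeric character and on every letter-number transition. The class and method names are hypothetical, and this is not the real WordDelimiterFilterFactory, which has further rules and options that are ignored here.</p>

```java
final class DelimiterSplitSketch {

    // Hypothetical helper: true when one character is a letter and the next is a digit, or the reverse.
    static boolean isLetterDigitBoundary(char previous, char current) {
        if (Character.isLetter(previous)) {
            return Character.isDigit(current);
        }
        if (Character.isDigit(previous)) {
            return Character.isLetter(current);
        }
        return false;
    }

    // Rough approximation only: split on every delimiter and on letter-number transitions.
    static String[] split(String input) {
        StringBuilder spaced = new StringBuilder();
        char previous = ' ';
        for (char current : input.toCharArray()) {
            if (!Character.isLetterOrDigit(current)) {
                spaced.append(' ');          // any delimiter ends the current token
            } else {
                if (isLetterDigitBoundary(previous, current)) {
                    spaced.append(' ');      // a letter-number transition also ends it
                }
                spaced.append(current);
            }
            previous = current;
        }
        String trimmed = spaced.toString().trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split(" +");
    }

    public static void main(String[] args) {
        // prints: email, some, server, com
        System.out.println(String.join(", ", split("email@some-server.com")));
        // prints: email, server, 5, com
        System.out.println(String.join(", ", split("email@server5.com")));
    }
}
```

<p>Neither &#8220;some-server&#8221; nor &#8220;server5&#8221; survives as a single token, which is why an unanalysed query for either of them finds nothing.</p>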
<p>So the strategy I&#8217;ve taken is as follows:</p>
<ol>
<li>Tokenize with PatternTokenizerFactory, splitting only on &#8220;.&#8221; and &#8220;@&#8221;</li>
<li>Convert the tokens to lower case using LowerCaseFilterFactory</li>
<li>Index the full email address, untokenized, in a separate field</li>
</ol>
<p>The addresses are now tokenized as:</p>
<ul>
<li>&#8220;email@some-server.com&#8221; -&gt; &#8220;email&#8221;, &#8220;some-server&#8221;, &#8220;com&#8221;</li>
<li>&#8220;email@server5.com&#8221; -&gt; &#8220;email&#8221;, &#8220;server5&#8221;, &#8220;com&#8221;</li>
</ul>
<p>Searches for &#8220;server5&#8221; and &#8220;some-server&#8221; now return matches.</p>
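<p>The first two steps of the strategy can be sketched with nothing but the JDK. This is an illustration of the tokenization only, not of how Solr or Hibernate Search run it internally; the class name is hypothetical, and the character class <code>[.@]</code> is assumed to be equivalent to splitting on &#8220;.&#8221; or &#8220;@&#8221;.</p>

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class EmailTokenSketch {

    // Split only on "." and "@", so hyphens and digits stay inside their token.
    private static final Pattern SPLIT = Pattern.compile("[.@]");

    static String[] tokenize(String email) {
        // Lower-casing the whole address first yields the same tokens
        // as lower-casing each token after the split.
        return SPLIT.split(email.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // prints: email, some-server, com
        System.out.println(String.join(", ", tokenize("Email@Some-Server.com")));
        // prints: email, server5, com
        System.out.println(String.join(", ", tokenize("email@server5.com")));
    }
}
```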
<p>There is naturally some room for improvement. For example, if a user searches for &#8220;server&#8221; and &#8220;5&#8221; as separate terms, they would reasonably expect anything matching &#8220;server5&#8221; to be returned. At the moment I&#8217;m handling this by allowing wildcard searches, so &#8220;server*&#8221; does the trick. It may need revisiting, but only time will tell&#8230;</p>
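<p>Why the wildcard helps can be shown with a small plain-Java sketch of prefix matching against the indexed tokens. The names are hypothetical and only a trailing &#8220;*&#8221; is handled; a real Lucene wildcard query supports more than this.</p>

```java
import java.util.Locale;

final class PrefixMatchSketch {

    // Returns true if any token equals the query, or starts with it when the query ends in "*".
    static boolean matches(String query, String[] tokens) {
        String term = query.toLowerCase(Locale.ROOT);
        boolean isPrefixQuery = term.endsWith("*");
        if (isPrefixQuery) {
            term = term.substring(0, term.length() - 1);   // drop the trailing "*"
        }
        for (String token : tokens) {
            if (isPrefixQuery ? token.startsWith(term) : token.equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = {"email", "server5", "com"};
        System.out.println(matches("server", tokens));    // prints: false
        System.out.println(matches("server*", tokens));   // prints: true
        System.out.println(matches("server5", tokens));   // prints: true
    }
}
```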
<p>In case you&#8217;re wondering how the analyzer and fields are put together, here&#8217;s a source snippet:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">@<span style="color: #003399;">Entity</span>
@Indexed
@Table<span style="color: #009900;">&#40;</span>name <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;user&quot;</span>, catalog <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;somedb&quot;</span><span style="color: #009900;">&#41;</span>
@AnalyzerDef<span style="color: #009900;">&#40;</span>
  name <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;email&quot;</span>,
  tokenizer <span style="color: #339933;">=</span> @TokenizerDef<span style="color: #009900;">&#40;</span>
    factory <span style="color: #339933;">=</span> PatternTokenizerFactory.<span style="color: #000000; font-weight: bold;">class</span>, params <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
      @Parameter<span style="color: #009900;">&#40;</span>name <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;pattern&quot;</span>, value <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\\</span>.|<span style="color: #000099; font-weight: bold;">\\</span>@&quot;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>,
  filters <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    @TokenFilterDef<span style="color: #009900;">&#40;</span>factory <span style="color: #339933;">=</span> LowerCaseFilterFactory.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> User <span style="color: #000000; font-weight: bold;">implements</span> java.<span style="color: #006633;">io</span>.<span style="color: #003399;">Serializable</span> <span style="color: #009900;">&#123;</span>
...
    @Column<span style="color: #009900;">&#40;</span>name <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;email&quot;</span>, nullable <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">false</span>, unique <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span>
    @Fields<span style="color: #009900;">&#40;</span> <span style="color: #009900;">&#123;</span>
      @<span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span>name <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;fullEmail&quot;</span>, index <span style="color: #339933;">=</span> Index.<span style="color: #006633;">UN_TOKENIZED</span>, store <span style="color: #339933;">=</span> Store.<span style="color: #006633;">YES</span><span style="color: #009900;">&#41;</span>,
      @<span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span>index <span style="color: #339933;">=</span> Index.<span style="color: #006633;">TOKENIZED</span>, analyzer <span style="color: #339933;">=</span> @Analyzer<span style="color: #009900;">&#40;</span>definition <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;email&quot;</span><span style="color: #009900;">&#41;</span>, store <span style="color: #339933;">=</span> Store.<span style="color: #006633;">YES</span><span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> getEmail<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">email</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
...
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Additional resources:</p>
<p><a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters">http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.austenconstable.com/2009/10/15/analyzing-email-addresses-with-hibernate-search-solr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

