Thursday, July 11, 2013

Configuring Solr

Solr Version used for this blog: 4.3.1

How to start Solr?

Open a command prompt in <Solr Installation Directory>\example and run:
java -jar start.jar
The default port is 8983.

How to change the default Solr port 8983?

The port is set in <solr installation directory>/example/etc/jetty.xml:


<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.bio.SocketConnector">
      <Call class="java.lang.System" name="setProperty">
        <Arg>log4j.configuration</Arg>
        <Arg>etc/log4j.properties</Arg>
      </Call>
      <Set name="host"><SystemProperty name="jetty.host" /></Set>
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <Set name="maxIdleTime">50000</Set>
      <Set name="lowResourceMaxIdleTime">1500</Set>
      <Set name="statsOn">false</Set>
    </New>
  </Arg>
</Call>


Change 8983 to any available port and restart Solr.
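Before editing jetty.xml it can help to confirm that the port you have in mind really is available. The following is a minimal sketch, not part of Solr: port_is_free is a hypothetical helper name, and it assumes that being able to bind the port on the loopback address means Jetty will be able to bind it too.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port.

    Hypothetical helper for illustration; it simply tries to bind the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Another process (for example a running Solr) already holds it.
            return False
        return True


if __name__ == "__main__":
    # 8983 is the Solr default; 8984 is just an example alternative.
    for candidate in (8983, 8984):
        print(candidate, "free" if port_is_free(candidate) else "in use")
```

If the default port shows as in use while Solr is running, that is expected; pick a candidate that reports free.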


How to access the Solr web interface?

http://<server name>:<port>/solr, e.g. http://localhost:8983/solr

Configuration for partial search


My documents contain keywords (e.g. Player), and I want a search with a partial entry such as Pla to return "Player". The question is how to configure a field type for this.

First try
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeywordRepeatFilter"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Error: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text_keyword": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'solr.KeywordRepeatFilter'

I am currently using Solr 4.3.1.


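I looked up KeywordRepeatFilter and found it in the package org.apache.lucene.analysis.miscellaneous, so I tried a small change: instead of solr.KeywordRepeatFilter I put org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter.

That hit another barrier:
Caused by: java.lang.ClassCastException: class org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter

A <filter> element in schema.xml expects a token filter factory rather than the filter class itself, which would explain both errors; solr.KeywordRepeatFilterFactory is the likely name to use here.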
 


Time was running out and I had to get partial search working somehow, so I introduced another field type:

<fieldType name="string_partial_search" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StandardFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" side="front" />
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" side="back" />
        </analyzer>
    </fieldType>


Now for the debugging. I had indexed country names together with their unemployment and inflation rates. I searched for "Sin", hoping it would return every country name containing "Sin" anywhere in the name.

The result was as follows:
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">28</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="q">Sin</str>
    <str name="_">1374122942599</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="40" start="0">

40 results were found. The debug output looked like this:
<lst name="debug">
  <str name="rawquerystring">Sin</str>
  <str name="querystring">Sin</str>
  <str name="parsedquery">(countryName:si countryName:in countryName:sin)/no_coord</str>
  <str name="parsedquery_toString">countryName:si countryName:in countryName:sin</str>

Because my minimum gram size is 2, the query was broken into the grams si, in and sin, and a document matches if it contains any one of them.
The top ten results included Tunisia, Russian Federation, Micronesia, Fed. Sts., Malaysia, French Polynesia, Indonesia, Sint Maarten (Dutch part) and Singapore.
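To see where si, in and sin come from, here is a rough simulation of the string_partial_search analyzer chain in Python. It is only a sketch under assumptions: the function names are mine, the StandardTokenizer is approximated by a plain whitespace split, and the real Lucene filters may differ in details such as token positions.

```python
def edge_ngrams(token, min_size=2, max_size=10, side="front"):
    """Return the edge n-grams of one token, shortest first."""
    grams = []
    for size in range(min_size, min(max_size, len(token)) + 1):
        grams.append(token[:size] if side == "front" else token[-size:])
    return grams


def analyze(text):
    """Approximate the chain: tokenize, lowercase, front grams, then back grams."""
    tokens = text.lower().split()  # stand-in for StandardTokenizer + LowerCaseFilter
    front = [g for t in tokens for g in edge_ngrams(t, side="front")]
    # The second filter runs on the output of the first one.
    return [g for t in front for g in edge_ngrams(t, side="back")]


if __name__ == "__main__":
    query_grams = analyze("Sin")
    print(query_grams)  # ['si', 'in', 'sin']
    for name in ("Tunisia", "Malaysia", "Singapore"):
        shared = sorted(set(analyze(name)) & set(query_grams))
        print(name, shared)
```

Under this simulation Tunisia shares only the gram si with the query, yet that single shared gram is enough for a match, which is consistent with the noisy result list above.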

I was not happy with that result.
Then I found what works best for the current situation: search with "Sin*". It returns anything that starts with Sin.

<lst name="debug">
  <str name="rawquerystring">Sin*</str>
  <str name="querystring">Sin*</str>
  <str name="parsedquery">countryName:sin*</str>
  <str name="parsedquery_toString">countryName:sin*</str>
  <lst name="explain">
    <str name="200">
1.0 = (MATCH) ConstantScore(countryName:sin*), product of:
  1.0 = boost
  1.0 = queryNorm
    </str>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">32</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="q">Sin*</str>
    <str name="_">1374124575932</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="4" start="0">

And if we want to find Sin anywhere in the word, we can search with *Sin*:

<lst name="explain">
  <str name="200">
1.0 = (MATCH) ConstantScore(countryName:*sin*), product of:
  1.0 = boost
  1.0 = queryNorm
  </str>
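The wildcard searches above can also be sent from a script instead of the admin page. The sketch below only builds the request URL. It assumes the default collection1 core from the 4.3.1 example and the standard /select handler, neither of which is shown in this post, and build_select_url is a hypothetical helper name. The parameters are the ones visible in the response headers above.

```python
from urllib.parse import urlencode


def build_select_url(query, host="localhost", port=8983, core="collection1"):
    """Build a Solr select URL for the given query string.

    The core name and the /select path are assumptions based on the
    stock example setup; adjust them to match your installation.
    """
    params = {
        "q": query,
        "wt": "xml",
        "indent": "true",
        "debugQuery": "true",
    }
    # urlencode escapes the wildcard character so it survives the HTTP request.
    return "http://%s:%d/solr/%s/select?%s" % (host, port, core, urlencode(params))


if __name__ == "__main__":
    print(build_select_url("Sin*"))    # prefix search
    print(build_select_url("*Sin*"))   # match anywhere in the word
```

Opening either URL in a browser, or fetching it with any HTTP client, should return the same kind of XML response as shown above, provided the core name matches.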

