Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Shawn Heisey
On 4/18/2014 6:15 PM, Candygram For Mongo wrote:
> We are getting Out Of Memory errors when we try to execute a full import
> using the Data Import Handler.  This error originally occurred on a
> production environment with a database containing 27 million records.  Heap
> memory was configured for 6GB and the server had 32GB of physical memory.
>  We have been able to replicate the error on a local system with 6 million
> records.  We set the memory heap size to 64MB to accelerate the error
> replication.  The indexing process has been failing in different scenarios.
>  We have 9 test cases documented.  In some of the test cases we increased
> the heap size to 128MB.  In our first test case we set heap memory to 512MB
> which also failed.

One characteristic of a JDBC connection is that unless you tell it
otherwise, it will try to retrieve the entire resultset into RAM before
any results are delivered to the application.  It's not Solr doing this,
it's JDBC.

In this case, there are 27 million rows in the resultset.  It's highly
unlikely that this much data (along with the rest of Solr's memory
requirements) will fit in 6GB of heap.

JDBC has a built-in way to deal with this.  It's called fetchSize.  By
using the batchSize parameter on your JdbcDataSource config, you can set
the JDBC fetchSize.  Set it to something small, between 100 and 1000,
and you'll probably get rid of the OOM problem.

http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource
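
Using the dataSource definition from your message, the change would look
something like this (untested sketch; connection details are placeholders):

<dataSource name="org_only"
            type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@{server name}:1521:{database name}"
            user="{username}"
            password="{password}"
            batchSize="500" />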

If you had been using MySQL, I would have recommended that you set
batchSize to -1.  This sets fetchSize to Integer.MIN_VALUE, which tells
the MySQL driver to stream results instead of trying to either batch
them or return everything.  I'm pretty sure that the Oracle driver
doesn't work this way -- you would have to modify the dataimport source
code to use their streaming method.

Thanks,
Shawn



Re: Indexing Big Data With or Without Solr

2014-04-18 Thread Aman Tandon
Vineet, please share your findings once you have SolrCloud set up.
Are you using Jetty or Tomcat?

On Saturday, April 19, 2014, Vineet Mishra  wrote:
> Thanks Furkan, I will definitely give it a try then.
>
> Thanks again!
>
>
>
>
> On Tue, Apr 15, 2014 at 7:53 PM, Furkan KAMACI wrote:
>
>> Hi Vineet;
>>
>> I've been using SolrCloud for this kind of Big Data and I think you
>> should consider using it. If you have any problems you can ask here.
>>
>> Thanks;
>> Furkan KAMACI
>>
>>
>> 2014-04-15 13:20 GMT+03:00 Vineet Mishra :
>>
>> > Hi All,
>> >
>> > I have worked with Solr 3.5 to implement real-time search on some 100GB
>> > of data. That worked fine but was a little slow on complex queries
>> > (multiple group/join queries).
>> > But now I want to index some real Big Data (around 4TB or even more).
>> > Can SolrCloud be a solution for it? If not, what could be the best
>> > possible solution in this case?
>> >
>> > *Stats for the previous implementation:*
>> > It was a master-slave architecture with multiple normal standalone
>> > instances of Solr 3.5. There were around 12 Solr instances running on
>> > different machines.
>> >
>> > *Things to consider for the next implementation:*
>> > Since all the data is sensor data, duplication and uniqueness are
>> > factors.
>> >
>> > *Really urgent, please treat this as a priority and suggest a set of
>> > feasible solutions.*
>> >
>> > Regards
>> >
>>
>

-- 
Sent from Gmail Mobile


Re: Boost Search results

2014-04-18 Thread Aman Tandon
I guess you can apply some de-boost to the url field.
Lakshmi, it would be easier to make suggestions if you also provided an
example of what you want to achieve.

On Saturday, April 19, 2014, A Laxmi  wrote:
> Markus, like I mentioned in my last email, I have got the qf with title,
> content and url. That doesn't help a whole lot. Could you please advise if
> there are any other parameters that I should consider for solr request
> handler config or the numbers I have got for title, content, url in qf
> parameter have to be modified?
>
> Thanks for your help..
>
>
> On Fri, Apr 18, 2014 at 4:08 PM, A Laxmi  wrote:
>
>> Hi Markus, Yes, you are right. I passed the qf from my front-end
framework
>> (PHP which uses SolrClient). This is how I got it set-up:
>>
>> $this->solr->set_param('defType','edismax');
>> $this->solr->set_param('qf','title^10 content^5 url^5');
>>
>> where you can see qf = title^10 content^5 url^5
>>
>>
>>
>>
>>
>>
>> On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma <
markus.jel...@openindex.io
>> > wrote:
>>
>>> Hi, replicating full features search engine behaviour is not going to
>>> work with nutch and solr out of the box. You are missing a thousand
>>> features such as proper main content extraction, deduplication,
>>> classification of content and hub or link pages, and much more. These
>>> things are possible to implement but you may want to start with having
you
>>> solr request handler better configured, to begin with, your qf parameter
>>> does not have nutchs default title and content field selected.
>>>
>>>
>>> A Laxmi wrote: Hi,
>>>
>>>
>>> When I started to compare the search results with the two options below,
>>> I see a lot of difference in the search results, especially the URLs
>>> that show up on the top (relevancy perspective).
>>>
>>> (1) Nutch 2.2.1 (with *Solr 4.0*)
>>> (2) Bing custom search set-up
>>>
>>> I wonder how should I tweak the boost parameters to get the best results
>>> on
>>> the top like how Bing, Google does.
>>>
>>> Please suggest why I see a difference and what parameters are best to
>>> configure in Solr to achieve what I see from Bing, or Google search
>>> relevancy.
>>>
>>> Here is what i got in solrconfig.xml:
>>>
>>> <str name="defType">edismax</str>
>>> <str name="qf">
>>>   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
>>> </str>
>>> <str name="q.alt">*:*</str>
>>> <str name="rows">10</str>
>>> <str name="fl">*,score</str>
>>>
>>>
>>> Thanks
>>>
>>
>>
>

-- 
Sent from Gmail Mobile


Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Candygram For Mongo
I have uploaded several files including the problem description with
graphics to this link on Google drive:

https://drive.google.com/folderview?id=0B7UpFqsS5lSjWEhxRE1NN2tMNTQ&usp=sharing

I shared it with this address "solr-user@lucene.apache.org" so I am hoping
it can be accessed by people in the group.


On Fri, Apr 18, 2014 at 5:15 PM, Candygram For Mongo <
candygram.for.mo...@gmail.com> wrote:

> I have lots of log files and other files to support this issue (sometimes
> referenced in the text below) but I am not sure the best way to submit.  I
> don't want to overwhelm and I am not sure if this email will accept graphs
> and charts.  Please provide direction and I will send them.
>
>
> *Issue Description*
>
>
>
> We are getting Out Of Memory errors when we try to execute a full import
> using the Data Import Handler.  This error originally occurred on a
> production environment with a database containing 27 million records.  Heap
> memory was configured for 6GB and the server had 32GB of physical memory.
>  We have been able to replicate the error on a local system with 6 million
> records.  We set the memory heap size to 64MB to accelerate the error
> replication.  The indexing process has been failing in different scenarios.
>  We have 9 test cases documented.  In some of the test cases we increased
> the heap size to 128MB.  In our first test case we set heap memory to 512MB
> which also failed.
>
>
>
>
>
> *Environment Values Used*
>
>
>
> *SOLR/Lucene version: *4.2.1*
>
> *JVM version:
>
> Java(TM) SE Runtime Environment (build 1.7.0_07-b11)
>
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>
> *Indexer startup command:
>
> set JVMARGS=-XX:MaxPermSize=364m -Xss256K -Xmx128m -Xms128m
>
> java %JVMARGS% ^
>
> -Dcom.sun.management.jmxremote.port=1092 ^
>
> -Dcom.sun.management.jmxremote.ssl=false ^
>
> -Dcom.sun.management.jmxremote.authenticate=false ^
>
> -jar start.jar
>
> *SOLR indexing HTTP parameters request:
>
> webapp=/solr path=/dataimport
> params={clean=false&command=full-import&wt=javabin&version=2}
>
>
>
> The information we use for the database retrieve using the Data Import
> Handler is as follows:
>
>
>
> <dataSource
>     name="org_only"
>     type="JdbcDataSource"
>     driver="oracle.jdbc.OracleDriver"
>     url="jdbc:oracle:thin:@{server name}:1521:{database name}"
>     user="{username}"
>     password="{password}"
>     readOnly="false"
> />
>
>
>
>
>
> *The Query (simple, single table)*
>
>
>
> *select*
>
>
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null')*
>
> *as SOLR_ID,*
>
>
>
> *'STU.ACCT_ADDRESS_ALL'*
>
> *as SOLR_CATEGORY,*
>
>
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as
> ADDRESSALLRID,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as
> ADDRESSALLADDRTYPECD,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as
> ADDRESSALLLONGITUDE,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as
> ADDRESSALLLATITUDE,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as
> ADDRESSALLADDRNAME,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as
> ADDRESSALLCITY,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as
> ADDRESSALLSTATE,*
>
> *NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as
> ADDRESSALLEMAILADDR *
>
>
>
> *from STU.ACCT_ADDRESS_ALL*
>
>
>
> You can see this information in the database.xml file.
>
>
>
> Our main solrconfig.xml file contains the following differences compared
> to a new downloaded solrconfig.xml file (the original content).
>
>
>
> <lib dir="..." regex="solr-dataimporthandler-.*\.jar" />
> <lib ... />
> <lib ... />
> <lib ... />
>
> <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
>
> <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.StandardDirectoryFactory" />
>
> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">database.xml</str>
>   </lst>
> </requestHandler>
>
>
>
>
>
> *Custom Libraries*
>
>
>
> The common.jar contains a customized TokenFiltersFactory implementation
> that we use for indexing.  They do some special treatment to the fields
> read from the database.  How those classes are used is described in the
> schema.xml file.  The webapp.jar file contains other related classes.
> The commons-pool-1.4.jar is an API from apache used for instances reuse.
>
>
>
> The logic used in the TokenFiltersFactory is contained in the following
> files:
>
>
>
> ConcatFilterFactory.java
>
> ConcatFilter.java
>
> MDFilterSchemaFactory.java
>
> MDFilter.java
>
> MDFilterPoolObjectFactory.java
>
> NullValueFilterFactory.java
>
> NullValueFilter.java
>
>
>
> How we use them is described in the schema.xml file.
>
>
>
> We ha

Re: is there any way to post images and attachments to this mailing list?

2014-04-18 Thread A Laxmi
Just upload them in Google Drive and share the link with this group.


On Fri, Apr 18, 2014 at 9:15 PM, Candygram For Mongo <
candygram.for.mo...@gmail.com> wrote:

>
>


is there any way to post images and attachments to this mailing list?

2014-04-18 Thread Candygram For Mongo



Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Candygram For Mongo
We consistently reproduce this problem on multiple systems configured with
6GB and 12GB of heap space.  To quickly reproduce many cases for
troubleshooting we reduced the heap space to 64, 128 and 512MB.  With 6 or
12GB configured it takes hours to see the error.


On Fri, Apr 18, 2014 at 5:54 PM, Walter Underwood wrote:

> I see heap size commands for 128 Meg and 512 Meg. That will certainly run
> out of memory. Why do you think you have 6G of heap with these settings?
>
> -Xmx128m -Xms128m
> -Xmx512m -Xms512m
>
> wunder
>
> On Apr 18, 2014, at 5:15 PM, Candygram For Mongo <
> candygram.for.mo...@gmail.com> wrote:
>
> > I have lots of log files and other files to support this issue (sometimes
> > referenced in the text below) but I am not sure the best way to submit.
>  I
> > don't want to overwhelm and I am not sure if this email will accept
> graphs
> > and charts.  Please provide direction and I will send them.
> >
> >
> > *Issue Description*
> >
> >
> >
> > We are getting Out Of Memory errors when we try to execute a full import
> > using the Data Import Handler.  This error originally occurred on a
> > production environment with a database containing 27 million records.
>  Heap
> > memory was configured for 6GB and the server had 32GB of physical memory.
> > We have been able to replicate the error on a local system with 6 million
> > records.  We set the memory heap size to 64MB to accelerate the error
> > replication.  The indexing process has been failing in different
> scenarios.
> > We have 9 test cases documented.  In some of the test cases we increased
> > the heap size to 128MB.  In our first test case we set heap memory to
> 512MB
> > which also failed.
> >
> >
> >
> >
> >
> > *Environment Values Used*
> >
> >
> >
> > *SOLR/Lucene version: *4.2.1*
> >
> > *JVM version:
> >
> > Java(TM) SE Runtime Environment (build 1.7.0_07-b11)
> >
> > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
> >
> > *Indexer startup command:
> >
> > set JVMARGS=-XX:MaxPermSize=364m -Xss256K -Xmx128m -Xms128m
> >
> > java %JVMARGS% ^
> >
> > -Dcom.sun.management.jmxremote.port=1092 ^
> >
> > -Dcom.sun.management.jmxremote.ssl=false ^
> >
> > -Dcom.sun.management.jmxremote.authenticate=false ^
> >
> > -jar start.jar
> >
> > *SOLR indexing HTTP parameters request:
> >
> > webapp=/solr path=/dataimport
> > params={clean=false&command=full-import&wt=javabin&version=2}
> >
> >
> >
> > The information we use for the database retrieve using the Data Import
> > Handler is as follows:
> >
> >
> >
> > <dataSource
> >     name="org_only"
> >     type="JdbcDataSource"
> >     driver="oracle.jdbc.OracleDriver"
> >     url="jdbc:oracle:thin:@{server name}:1521:{database name}"
> >     user="{username}"
> >     password="{password}"
> >     readOnly="false"
> > />
> >
> >
> >
> >
> >
> > *The Query (simple, single table)*
> >
> >
> >
> > *select*
> >
> >
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null')*
> >
> > *as SOLR_ID,*
> >
> >
> >
> > *'STU.ACCT_ADDRESS_ALL'*
> >
> > *as SOLR_CATEGORY,*
> >
> >
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as
> > ADDRESSALLRID,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as
> > ADDRESSALLADDRTYPECD,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as
> > ADDRESSALLLONGITUDE,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as
> > ADDRESSALLLATITUDE,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as
> > ADDRESSALLADDRNAME,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as
> > ADDRESSALLCITY,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as
> > ADDRESSALLSTATE,*
> >
> > *NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as
> > ADDRESSALLEMAILADDR *
> >
> >
> >
> > *from STU.ACCT_ADDRESS_ALL*
> >
> >
> >
> > You can see this information in the database.xml file.
> >
> >
> >
> > Our main solrconfig.xml file contains the following differences compared
> > to a new downloaded solrconfig.xml file (the original content).
> >
> > <lib dir="..." regex="solr-dataimporthandler-.*\.jar" />
> > <lib ... />
> > <lib ... />
> > <lib ... />
> >
> > <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
> >
> > <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.StandardDirectoryFactory" />
> >
> > <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
> >   <lst name="defaults">
> >     <str name="config">database.xml</str>
> >   </lst>
> > </requestHandler>
> >
> >
> >
> >
> >
> > *Custom Libraries*
> >
> >
> >
> > The common.jar contains a customized TokenFiltersFactory implementation
> > that we use for indexing.  They do some special treatment to the fields
> > read from the database.  How those classes are 

Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Walter Underwood
I see heap size commands for 128 Meg and 512 Meg. That will certainly run out 
of memory. Why do you think you have 6G of heap with these settings?

-Xmx128m -Xms128m
-Xmx512m -Xms512m
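
If a 6GB heap is actually intended, the startup line would have to say
something like this (sketch; other flags kept as posted):

set JVMARGS=-XX:MaxPermSize=364m -Xss256K -Xmx6g -Xms6g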

wunder

On Apr 18, 2014, at 5:15 PM, Candygram For Mongo 
 wrote:

> I have lots of log files and other files to support this issue (sometimes
> referenced in the text below) but I am not sure the best way to submit.  I
> don't want to overwhelm and I am not sure if this email will accept graphs
> and charts.  Please provide direction and I will send them.
> 
> 
> *Issue Description*
> 
> 
> 
> We are getting Out Of Memory errors when we try to execute a full import
> using the Data Import Handler.  This error originally occurred on a
> production environment with a database containing 27 million records.  Heap
> memory was configured for 6GB and the server had 32GB of physical memory.
> We have been able to replicate the error on a local system with 6 million
> records.  We set the memory heap size to 64MB to accelerate the error
> replication.  The indexing process has been failing in different scenarios.
> We have 9 test cases documented.  In some of the test cases we increased
> the heap size to 128MB.  In our first test case we set heap memory to 512MB
> which also failed.
> 
> 
> 
> 
> 
> *Environment Values Used*
> 
> 
> 
> *SOLR/Lucene version: *4.2.1*
> 
> *JVM version:
> 
> Java(TM) SE Runtime Environment (build 1.7.0_07-b11)
> 
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
> 
> *Indexer startup command:
> 
> set JVMARGS=-XX:MaxPermSize=364m -Xss256K -Xmx128m -Xms128m
>
> java %JVMARGS% ^
> 
> -Dcom.sun.management.jmxremote.port=1092 ^
> 
> -Dcom.sun.management.jmxremote.ssl=false ^
> 
> -Dcom.sun.management.jmxremote.authenticate=false ^
> 
> -jar start.jar
> 
> *SOLR indexing HTTP parameters request:
> 
> webapp=/solr path=/dataimport
> params={clean=false&command=full-import&wt=javabin&version=2}
> 
> 
> 
> The information we use for the database retrieve using the Data Import
> Handler is as follows:
> 
> 
> 
> <dataSource
>     name="org_only"
>     type="JdbcDataSource"
>     driver="oracle.jdbc.OracleDriver"
>     url="jdbc:oracle:thin:@{server name}:1521:{database name}"
>     user="{username}"
>     password="{password}"
>     readOnly="false"
> />
> 
> 
> 
> 
> 
> *The Query (simple, single table)*
> 
> 
> 
> *select*
> 
> 
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null')*
> 
> *as SOLR_ID,*
> 
> 
> 
> *'STU.ACCT_ADDRESS_ALL'*
> 
> *as SOLR_CATEGORY,*
> 
> 
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as
> ADDRESSALLRID,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as
> ADDRESSALLADDRTYPECD,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as
> ADDRESSALLLONGITUDE,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as
> ADDRESSALLLATITUDE,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as
> ADDRESSALLADDRNAME,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as
> ADDRESSALLCITY,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as
> ADDRESSALLSTATE,*
> 
> *NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as
> ADDRESSALLEMAILADDR *
> 
> 
> 
> *from STU.ACCT_ADDRESS_ALL*
> 
> 
> 
> You can see this information in the database.xml file.
> 
> 
> 
> Our main solrconfig.xml file contains the following differences compared to
> a new downloaded solrconfig.xml file (the original content).
>
> <lib dir="..." regex="solr-dataimporthandler-.*\.jar" />
> <lib ... />
> <lib ... />
> <lib ... />
>
> <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
>
> <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.StandardDirectoryFactory" />
>
> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">database.xml</str>
>   </lst>
> </requestHandler>
> 
> 
> 
> 
> *Custom Libraries*
> 
> 
> 
> The common.jar contains a customized TokenFiltersFactory implementation
> that we use for indexing.  They do some special treatment to the fields
> read from the database.  How those classes are used is described in the
> schema.xml file.  The webapp.jar file contains other related classes.  The
> commons-pool-1.4.jar is an API from apache used for instances reuse.
> 
> 
> 
> The logic used in the TokenFiltersFactory is contained in the following
> files:
> 
> 
> 
>
> ConcatFilterFactory.java
> 
>
> ConcatFilter.java
> 
>
> MDFilterSchemaFactory.java
> 
>
> MDFilter.java
> 
>
> MDFilterPoolObjectFactory.java
> 
>
> NullValueFilterFactory.java
> 
>
> NullValueFilter.java
> 
> 
> 
> How we use them is described in the schema.xml file.
> 
> 
> 
> We have been experimenting with t

need help from hard core solr experts - out of memory error

2014-04-18 Thread Candygram For Mongo
I have lots of log files and other files to support this issue (sometimes
referenced in the text below) but I am not sure the best way to submit.  I
don't want to overwhelm and I am not sure if this email will accept graphs
and charts.  Please provide direction and I will send them.


*Issue Description*



We are getting Out Of Memory errors when we try to execute a full import
using the Data Import Handler.  This error originally occurred on a
production environment with a database containing 27 million records.  Heap
memory was configured for 6GB and the server had 32GB of physical memory.
 We have been able to replicate the error on a local system with 6 million
records.  We set the memory heap size to 64MB to accelerate the error
replication.  The indexing process has been failing in different scenarios.
 We have 9 test cases documented.  In some of the test cases we increased
the heap size to 128MB.  In our first test case we set heap memory to 512MB
which also failed.





*Environment Values Used*



*SOLR/Lucene version:* 4.2.1

*JVM version:

Java(TM) SE Runtime Environment (build 1.7.0_07-b11)

Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

*Indexer startup command:

set JVMARGS=-XX:MaxPermSize=364m -Xss256K -Xmx128m -Xms128m

java %JVMARGS% ^

-Dcom.sun.management.jmxremote.port=1092 ^

-Dcom.sun.management.jmxremote.ssl=false ^

-Dcom.sun.management.jmxremote.authenticate=false ^

-jar start.jar

*SOLR indexing HTTP parameters request:

webapp=/solr path=/dataimport
params={clean=false&command=full-import&wt=javabin&version=2}



The information we use for the database retrieve using the Data Import
Handler is as follows:

<dataSource
    name="org_only"
    type="JdbcDataSource"
    driver="oracle.jdbc.OracleDriver"
    url="jdbc:oracle:thin:@{server name}:1521:{database name}"
    user="{username}"
    password="{password}"
    readOnly="false"
/>

*The Query (simple, single table)*



*select*



*NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null')*

*as SOLR_ID,*



*'STU.ACCT_ADDRESS_ALL'*

*as SOLR_CATEGORY,*



*NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as
ADDRESSALLRID,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as
ADDRESSALLADDRTYPECD,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as
ADDRESSALLLONGITUDE,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as
ADDRESSALLLATITUDE,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as
ADDRESSALLADDRNAME,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as
ADDRESSALLCITY,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as
ADDRESSALLSTATE,*

*NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as
ADDRESSALLEMAILADDR *



*from STU.ACCT_ADDRESS_ALL*



You can see this information in the database.xml file.



Our main solrconfig.xml file contains the following differences compared to
a new downloaded solrconfig.xml file (the original content).

<lib dir="..." regex="solr-dataimporthandler-.*\.jar" />
<lib ... />
<lib ... />
<lib ... />

<abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

<directoryFactory name="DirectoryFactory" class="org.apache.solr.core.StandardDirectoryFactory" />

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">database.xml</str>
  </lst>
</requestHandler>











*Custom Libraries*



The common.jar contains a customized TokenFiltersFactory implementation
that we use for indexing.  They apply some special treatment to the fields
read from the database.  How those classes are used is described in the
schema.xml file.  The webapp.jar file contains other related classes.  The
commons-pool-1.4.jar is an Apache library used for instance reuse.



The logic used in the TokenFiltersFactory is contained in the following
files:




ConcatFilterFactory.java


ConcatFilter.java


MDFilterSchemaFactory.java


MDFilter.java


MDFilterPoolObjectFactory.java


NullValueFilterFactory.java


NullValueFilter.java



How we use them is described in the schema.xml file.



We have been experimenting with the following configuration values:



maxIndexingThreads
ramBufferSizeMB
maxBufferedDocs
mergePolicy
    maxMergeAtOnce
    segmentsPerTier
    maxMergedSegmentMB
autoCommit
    maxDocs
    maxTime
autoSoftCommit
    maxTime
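
For reference, these settings live in the <indexConfig> and <updateHandler>
sections of solrconfig.xml; a sketch with purely illustrative values (not
the values from the failing tests):

<indexConfig>
  <maxIndexingThreads>4</maxIndexingThreads>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <double name="maxMergedSegmentMB">5000</double>
  </mergePolicy>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>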



Using numerous combinations of these values, the indexing fails.





*IMPORTANT NOTE*



When we disable all of the copyfield tags contained in the schema.xml file,
or all but relatively few, the indexing completes successfully (see Test
Case 1).





*TEST CASES*



All of the test cases have been analyzed with the Visual VM tool.  All SOLR
configuration files and indexer log content are in the test case
directories included in a zip file.  We have included the most relevant
screenshots.  Test Case 2 is the only one that includes the thread dump.





*Test Case 1 *



JVM arguments = -XX:MaxPermSize=364m -Xss256K -Xmx512m -Xms512m



Results:

Indexing status: Completed

Time taken:   1:8:32.519

Error detail:   NO ERROR.

Index data directory size =  995 MB





*Test Case 2*



JVM arguments = -XX:

Re: space between search terms

2014-04-18 Thread Jack Krupansky
The LucidWorks Search query parser does indeed support multi-word synonyms 
at query time.


I vaguely recall some Jira traffic on supporting multi-word synonyms at 
query time for some special cases, but a review of CHANGES.txt does not find 
any such changes that made it into a release, yet.


The simplest approach for now is to do the query-time synonym expansion in 
your app layer as a preprocessor.


-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Friday, April 18, 2014 7:38 PM
To: solr-user@lucene.apache.org
Subject: Re: space between search terms

Hi Jack,

I am planning to extract and publish such words for Turkish language. But I 
am not sure how to utilize them.


I wonder if there is a more flexible solution that will work query time 
only. That would not require reindexing every time a new item is added.


Ahmet


On Friday, April 18, 2014 1:47 PM, Jack Krupansky  
wrote:

Use an index-time synonym filter with a synonym entry:

indira nagar,indiranagar

But do not use that same filter at query time.

But, that may mess up some exact phrase queries, such as:

q="indiranagar xyz"

since the following term is actually positioned after the longest synonym.

To resolve that, use a sloppy phrase:

q="indiranagar xyz"~1

Or, set qs=1 for the edismax query parser.
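
A minimal schema.xml sketch of that index-time-only split (tokenizer choice
and file name are placeholders; synonyms.txt would contain a line like
"indira nagar,indiranagar"):

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>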

-- Jack Krupansky


-Original Message- 
From: kumar

Sent: Friday, April 18, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: space between search terms

Hi,

I have a field called "title". It has values like "indira nagar" as well
as "indiranagar".

If I type either of the keywords, it has to display both results.

Can anybody tell me how we can do this?


I am using the title field in the following way:

<fieldType ...>
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer ... />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1" />
    <filter ... />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer ... />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter ... />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_tf.txt"
            expand="true" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true" />
    <filter class="..." protected="protwords.txt" />
  </analyzer>
</fieldType>
--
View this message in context:
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: space between search terms

2014-04-18 Thread Erick Erickson
Ahmet:

Yeah, the index .vs. query time bit is a pain. Often what people will
do is take their best shot at index time, then accumulate omissions
and use that list for query time. Then whenever they can/need to
re-index, merge the query-time list into the index time list and start
over.

Not an ideal solution by any means, but one that people have made to work.

Best,
Erick

On Fri, Apr 18, 2014 at 4:38 PM, Ahmet Arslan  wrote:
> Hi Jack,
>
> I am planning to extract and publish such words for Turkish language. But I 
> am not sure how to utilize them.
>
> I wonder if there is a more flexible solution that will work query time only. 
> That would not require reindexing every time a new item is added.
>
> Ahmet
>
>
> On Friday, April 18, 2014 1:47 PM, Jack Krupansky  
> wrote:
> Use an index-time synonym filter with a synonym entry:
>
> indira nagar,indiranagar
>
> But do not use that same filter at query time.
>
> But, that may mess up some exact phrase queries, such as:
>
> q="indiranagar xyz"
>
> since the following term is actually positioned after the longest synonym.
>
> To resolve that, use a sloppy phrase:
>
> q="indiranagar xyz"~1
>
> Or, set qs=1 for the edismax query parser.
>
> -- Jack Krupansky
>
>
> -Original Message-
> From: kumar
> Sent: Friday, April 18, 2014 6:34 AM
> To: solr-user@lucene.apache.org
> Subject: space between search terms
>
> Hi,
>
> I have a field called "title". It has values like "indira nagar" as well
> as "indiranagar".
>
> If I type either of the keywords, it has to display both results.
>
> Can anybody tell me how we can do this?
>
>
> I am using the title field in the following way:
>
> <fieldType ...>
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt" />
>     <tokenizer ... />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
>             preserveOriginal="1" />
>     <filter ... />
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             enablePositionIncrements="true" />
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt" />
>     <tokenizer ... />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
>             preserveOriginal="1"/>
>     <filter ... />
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms_tf.txt"
>             expand="true" />
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             enablePositionIncrements="true" />
>     <filter class="..." protected="protwords.txt" />
>   </analyzer>
> </fieldType>
> 
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: space between search terms

2014-04-18 Thread Ahmet Arslan
Hi Jack,

I am planning to extract and publish such words for Turkish language. But I am 
not sure how to utilize them.

I wonder if there is a more flexible solution that will work query time only. 
That would not require reindexing every time a new item is added. 

Ahmet


On Friday, April 18, 2014 1:47 PM, Jack Krupansky  
wrote:
Use an index-time synonym filter with a synonym entry:

indira nagar,indiranagar

But do not use that same filter at query time.

But, that may mess up some exact phrase queries, such as:

q="indiranagar xyz"

since the following term is actually positioned after the longest synonym.

To resolve that, use a sloppy phrase:

q="indiranagar xyz"~1

Or, set qs=1 for the edismax query parser.

-- Jack Krupansky


-Original Message- 
From: kumar
Sent: Friday, April 18, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: space between search terms

Hi,

I have a field called "title". It has values like "indira nagar" as well
as "indiranagar".

If I type either of the keywords, it has to display both results.

Can anybody tell me how we can do this?


I am using the title field in the following way:

<fieldType ...>
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer ... />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1" />
    <filter ... />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer ... />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter ... />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_tf.txt"
            expand="true" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true" />
    <filter class="..." protected="protwords.txt" />
  </analyzer>
</fieldType>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can I reconstruct text from tokens?

2014-04-18 Thread Erick Erickson
Luke actually does this, or attempts to. The doc you assemble is lossy,
though:

- it doesn't have stop words
- all capitalization is lost
- original terms for synonyms are lost
- all punctuation is lost
- I don't think you can do this unless you store term information
- it's slow
- original words that are stemmed are lost
- anything you do with, say, ngrams will definitely be strange
- etc.

Basically, all the filters in the analysis chain may change what goes
into the index, that's their job. Each step may lose information.

FWIW,
Erick


On Fri, Apr 18, 2014 at 12:36 PM, Ramkumar R. Aiyengar
 wrote:
> Sorry, didn't think this through. You're right, still the same problem..
> On 16 Apr 2014 17:40, "Alexandre Rafalovitch"  wrote:
>
>> Why? I want stored=false, at which point multivalued field is just offset
>> values in the dictionary. Still have to reconstruct from offsets.
>>
>> Or am I missing something?
>>
>> Regards,
>>  Alex
>> On 16/04/2014 10:59 pm, "Ramkumar R. Aiyengar" 
>> wrote:
>>
>> > Logically if you tokenize and put the results in a multivalued field, you
>> > should be able to get all values in sequence?
>> > On 16 Apr 2014 16:51, "Alexandre Rafalovitch" 
>> wrote:
>> >
>> > > Hello,
>> > >
>> > > If I use very basic tokenizers, e.g. space based and no filters, can I
>> > > reconstruct the text from the tokenized form?
>> > >
>> > > So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
>> > >
>> > > I know we store enough information, but I don't know internal API
>> > > enough to know what I should be looking at for reconstruction
>> > > algorithm.
>> > >
>> > > Any hints?
>> > >
>> > > The XY problem is that I want to store large amount of very repeatable
>> > > text into Solr. I want the index to be as small as possible, so
>> > > thought if I just pre-tokenized, my dictionary will be quite small.
>> > > And I will be reconstructing some final form anyway.
>> > >
>> > > The other option is to just use compressed fields on stored field, but
>> > > I assume that does not take cross-document efficiencies into account.
>> > > And, it will be a read-only index after build, so I don't care about
>> > > updates messing things up.
>> > >
>> > > Regards,
>> > >Alex
>> > >
>> > > Personal website: http://www.outerthoughts.com/
>> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
>> > > proficiency
>> > >
>> >
>>


Re: Boost Search results

2014-04-18 Thread A Laxmi
Markus, like I mentioned in my last email, I have got the qf with title,
content and url. That doesn't help a whole lot. Could you please advise
whether there are any other parameters that I should consider for the Solr
request handler config, or whether the numbers I have for title, content,
and url in the qf parameter have to be modified?

Thanks for your help..


On Fri, Apr 18, 2014 at 4:08 PM, A Laxmi  wrote:

> Hi Markus, Yes, you are right. I passed the qf from my front-end framework
> (PHP which uses SolrClient). This is how I got it set-up:
>
> $this->solr->set_param('defType','edismax');
> $this->solr->set_param('qf','title^10 content^5 url^5');
>
> where you can see qf = title^10 content^5 url^5
>
>
>
>
>
>
> On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma  > wrote:
>
>> Hi, replicating full features search engine behaviour is not going to
>> work with nutch and solr out of the box. You are missing a thousand
>> features such as proper main content extraction, deduplication,
>> classification of content and hub or link pages, and much more. These
>> things are possible to implement but you may want to start with having you
>> solr request handler better configured, to begin with, your qf parameter
>> does not have nutchs default title and content field selected.
>>
>>
>> A Laxmi wrote: Hi,
>>
>>
>> When I started to compare the search results with the two options below,
>> I see a lot of difference in the search results, especially the URLs that
>> show up on the top (relevancy perspective).
>>
>> (1) Nutch 2.2.1 (with *Solr 4.0*)
>> (2) Bing custom search set-up
>>
>> I wonder how should I tweak the boost parameters to get the best results
>> on
>> the top like how Bing, Google does.
>>
>> Please suggest why I see a difference and what parameters are best to
>> configure in Solr to achieve what I see from Bing, or Google search
>> relevancy.
>>
>> Here is what i got in solrconfig.xml:
>>
>> <str name="defType">edismax</str>
>> <str name="qf">
>>   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
>> </str>
>> <str name="q.alt">*:*</str>
>> <str name="rows">10</str>
>> <str name="fl">*,score</str>
>>
>>
>> Thanks
>>
>
>


Re: Boost Search results

2014-04-18 Thread A Laxmi
Hi Markus, Yes, you are right. I passed the qf from my front-end framework
(PHP which uses SolrClient). This is how I got it set-up:

$this->solr->set_param('defType','edismax');
$this->solr->set_param('qf','title^10 content^5 url^5');

where you can see qf = title^10 content^5 url^5






On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma
wrote:

> Hi, replicating full features search engine behaviour is not going to work
> with nutch and solr out of the box. You are missing a thousand features
> such as proper main content extraction, deduplication, classification of
> content and hub or link pages, and much more. These things are possible to
> implement but you may want to start with having you solr request handler
> better configured, to begin with, your qf parameter does not have nutchs
> default title and content field selected.
>
>
> A Laxmi wrote: Hi,
>
>
> When I started to compare the search results with the two options below,
> I see a lot of difference in the search results, especially the URLs that
> show up on the top (relevancy perspective).
>
> (1) Nutch 2.2.1 (with *Solr 4.0*)
> (2) Bing custom search set-up
>
> I wonder how should I tweak the boost parameters to get the best results on
> the top like how Bing, Google does.
>
> Please suggest why I see a difference and what parameters are best to
> configure in Solr to achieve what I see from Bing, or Google search
> relevancy.
>
> Here is what i got in solrconfig.xml:
>
> <str name="defType">edismax</str>
> <str name="qf">
>   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
> </str>
> <str name="q.alt">*:*</str>
> <str name="rows">10</str>
> <str name="fl">*,score</str>
>
>
> Thanks
>


Re: Boost Search results

2014-04-18 Thread Markus Jelsma
Hi, replicating full-featured search engine behaviour is not going to work with
Nutch and Solr out of the box. You are missing a thousand features such as
proper main content extraction, deduplication, classification of content and
hub or link pages, and much more. These things are possible to implement, but
you may want to start by having your Solr request handler better configured;
to begin with, your qf parameter does not have Nutch's default title and
content fields selected.


A Laxmi wrote: Hi,


When I started to compare the search results with the two options below, I
see a lot of difference in the search results, especially the URLs that show
up on the top (relevancy perspective).

(1) Nutch 2.2.1 (with *Solr 4.0*)
(2) Bing custom search set-up

I wonder how should I tweak the boost parameters to get the best results on
the top like how Bing, Google does.

Please suggest why I see a difference and what parameters are best to
configure in Solr to achieve what I see from Bing, or Google search
relevancy.

Here is what i got in solrconfig.xml:

<str name="defType">edismax</str>
<str name="qf">
  text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>


Thanks


Re: Can I reconstruct text from tokens?

2014-04-18 Thread Ramkumar R. Aiyengar
Sorry, didn't think this through. You're right, still the same problem..
On 16 Apr 2014 17:40, "Alexandre Rafalovitch"  wrote:

> Why? I want stored=false, at which point multivalued field is just offset
> values in the dictionary. Still have to reconstruct from offsets.
>
> Or am I missing something?
>
> Regards,
>  Alex
> On 16/04/2014 10:59 pm, "Ramkumar R. Aiyengar" 
> wrote:
>
> > Logically if you tokenize and put the results in a multivalued field, you
> > should be able to get all values in sequence?
> > On 16 Apr 2014 16:51, "Alexandre Rafalovitch" 
> wrote:
> >
> > > Hello,
> > >
> > > If I use very basic tokenizers, e.g. space based and no filters, can I
> > > reconstruct the text from the tokenized form?
> > >
> > > So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
> > >
> > > I know we store enough information, but I don't know internal API
> > > enough to know what I should be looking at for reconstruction
> > > algorithm.
> > >
> > > Any hints?
> > >
> > > The XY problem is that I want to store large amount of very repeatable
> > > text into Solr. I want the index to be as small as possible, so
> > > thought if I just pre-tokenized, my dictionary will be quite small.
> > > And I will be reconstructing some final form anyway.
> > >
> > > The other option is to just use compressed fields on stored field, but
> > > I assume that does not take cross-document efficiencies into account.
> > > And, it will be a read-only index after build, so I don't care about
> > > updates messing things up.
> > >
> > > Regards,
> > >Alex
> > >
> > > Personal website: http://www.outerthoughts.com/
> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > > proficiency
> > >
> >
>


Boost Search results

2014-04-18 Thread A Laxmi
Hi,


When I started to compare the search results with the two options below, I
see a lot of difference in the search results, especially the URLs that show
up on the top (relevancy perspective).

(1) Nutch 2.2.1 (with *Solr 4.0*)
(2) Bing custom search set-up

I wonder how I should tweak the boost parameters to get the best results on
top, like Bing and Google do.

Please suggest why I see a difference and what parameters are best to
configure in Solr to achieve what I see from Bing, or Google search
relevancy.

Here is what i got in solrconfig.xml:

<str name="defType">edismax</str>
<str name="qf">
  text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>


Thanks


Re: Indexing Big Data With or Without Solr

2014-04-18 Thread Vineet Mishra
Thanks Furkan, I will definitely give it a try then.

Thanks again!




On Tue, Apr 15, 2014 at 7:53 PM, Furkan KAMACI wrote:

> Hi Vineet;
>
> I've been using SolrCloud for this kind of Big Data and I think you
> should consider using it. If you have any problems you can ask here.
>
> Thanks;
> Furkan KAMACI
>
>
> 2014-04-15 13:20 GMT+03:00 Vineet Mishra :
>
> > Hi All,
> >
> > I have worked with Solr 3.5 to implement real-time search on some 100GB
> > of data. That worked fine but was a little slow on complex queries
> > (multiple group/join queries).
> > But now I want to index some real Big Data (around 4TB or even more).
> > Can SolrCloud be a solution for it? If not, what could be the best
> > possible solution in this case?
> >
> > *Stats for the previous implementation:*
> > It was a master-slave architecture with multiple normal standalone
> > instances of Solr 3.5. There were around 12 Solr instances running on
> > different machines.
> >
> > *Things to consider for the next implementation:*
> > Since all the data is sensor data, duplication and uniqueness are
> > factors.
> >
> > *Really urgent, please treat this as a priority and suggest a set of
> > feasible solutions.*
> >
> > Regards
> >
>


Re: Can I reconstruct text from tokens?

2014-04-18 Thread Michael Sokolov
I believe you could use term vectors to retrieve all the terms in a 
document, with their offsets.  Retrieving them from the inverted index 
would be expensive since the index is term-oriented, not 
document-oriented.  Without term vectors, I think you essentially have to
scan the entire term dictionary looking for terms in your document, so that
will probably cost you more than it's worth.
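
A rough sketch of the term-vector route (Lucene 4.x API; the field must be
indexed with termVectors="true", and the names here are just placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

static void printDocTerms(IndexReader reader, int docId, String field) throws Exception {
  Terms tv = reader.getTermVector(docId, field); // null if no term vector was stored
  if (tv == null) return;
  TermsEnum te = tv.iterator(null);              // 4.x iterator takes a reuse argument
  BytesRef term;
  while ((term = te.next()) != null) {
    System.out.println(term.utf8ToString());     // each distinct term of this doc's field
  }
}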


-Mike

On 04/16/2014 11:50 AM, Alexandre Rafalovitch wrote:

Hello,

If I use very basic tokenizers, e.g. space based and no filters, can I
reconstruct the text from the tokenized form?

So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?

I know we store enough information, but I don't know internal API
enough to know what I should be looking at for reconstruction
algorithm.

Any hints?

The XY problem is that I want to store large amount of very repeatable
text into Solr. I want the index to be as small as possible, so
thought if I just pre-tokenized, my dictionary will be quite small.
And I will be reconstructing some final form anyway.

The other option is to just use compressed fields on stored field, but
I assume that does not take cross-document efficiencies into account.
And, it will be a read-only index after build, so I don't care about
updates messing things up.

Regards,
Alex

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency




Re: solr parallel update and total indexing Issue

2014-04-18 Thread Erick Erickson
try not setting softCommit=true, that's going to take the current
state of your index and make it visible. If your DIH process has
deleted all your records, then that's the "current state".

Personally I wouldn't try to mix-n-match like this, the results will
take forever to get right. If you absolutely must do something like
this, I'd use collection aliasing to rebuild my index in a different
collection then switch from the old to new one in a controlled
fashion.
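
With the collections API that would look something like the following (host
and names are placeholders):

http://host:port/solr/admin/collections?action=CREATEALIAS&name=myindex&collections=myindex_rebuilt

Queries keep hitting the "myindex" alias while you rebuild into
"myindex_rebuilt"; re-running CREATEALIAS then repoints the alias in one step.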

Best,
Erick

On Thu, Apr 17, 2014 at 11:37 PM, ~$alpha`  wrote:
> There is a big issue with Solr parallel update and total indexing.
>
> Total import syntax (working):
> dataimport?command=full-import&commit=true&optimize=true
>
> Update syntax (working):
> curl '.../solr/update?softCommit=true' -H 'Content-type:application/json' -d
> '[{"id":"1870719","column":{"set":11}}]'
>
>
> Issue: if both are run in parallel, a commit takes place in between.
>
> Example: I have 10k indexed documents in total. I fire a Solr request to
> update 1000 records, and in between I fire a total import (full indexer).
> What happens is that a commit takes place in between, i.e. until the total
> indexer finishes I only get the limited set of records (1000).
>
> How to solve this?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4131935.html
> Sent from the Solr - User mailing list archive at Nabble.com.


multi-field suggestions

2014-04-18 Thread Michael Sokolov
I've been working on getting AnalyzingInfixSuggester to make suggestions 
using tokens drawn from multiple fields.  I've done this by copying 
tokens from each of those fields into a destination field, and building 
suggestions using that destination field.  This allows me to use 
different analysis strategies for each of the fields, which I need, but 
it doesn't address a couple of remaining issues:


1. Some source fields are more important than others, and it would be 
good to be able to give their tokens greater weight somehow


2. The threshold is applied equally across all tokens, but for some 
fields we want to suggest singletons (threshold=0), while for others we 
want to use the threshold to exclude low-frequency terms.


I looked a little bit at how to extend the whole framework from Solr on 
down to handle multiple source fields intrinsically, rather than using 
the copying technique, and it looks like I could possibly manage 
something like this by extending DocumentDictionary and plugging in a 
different DictionaryFactory.  Does that sound like a good approach?  Is 
there some better way to approach this problem?


Thanks

-Mike

PS Sorry for the cross-post; I realized after I hit send this was 
probably a better question for solr-user than lucene...


Re: 'qt' parameter is not working in search call of SolrPhpClient

2014-04-18 Thread Erick Erickson
You're confusing a couple of things here. The /select_test handler can be
accessed by pointing your URL at it rather than using qt, i.e. the
destination you're going to will be
http://server:port/solr/collection/select_test
rather than
http://server:port/solr/collection/select

Best,
Erick

On Thu, Apr 17, 2014 at 11:31 PM, harshrossi  wrote:
> I am using SolrPhpClient for interacting with Solr via PHP.
>
> I am using a custom request handler ( /select_test ) with 'edismax' feature
> in Solr config file
>
> <requestHandler name="/select_test" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="wt">json</str>
>     <str name="defType">edismax</str>
>     <str name="qf">text name topic description</str>
>     <str name="df">text</str>
>     <str name="mm">100%</str>
>     <str name="q.alt">*:*</str>
>     <str name="rows">10</str>
>     <str name="fl">*,score</str>
>     <str name="pf">text name topic description</str>
>     <str name="...">text,name,topic,description</str>
>     <str name="...">3</str>
>   </lst>
> </requestHandler>
>
> I set the value for 'qt' parameter as '/select_test' in the $search_options
> array and pass it as parameter to the search function of the
> Apache_Solr_Service as below:
>
> $search_options = array(
> 'qt' => '/select_test',
>'fq' => 'topic:games',
>'sort' => 'name desc'
> );
>
>
>
> $result = $solr->search($query, 0, 10, $search_options);
>
> It does not call the request handler at all. The call goes to the default
> '/select' handler in solr config file.
>
> Just to confirm I put the custom request handler code in default handler and
> it worked.
>
> Why is this happening? Am I not setting it right?
>
> Please help!
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/qt-parameter-is-not-working-in-search-call-of-SolrPhpClient-tp4131934.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Siegfried Goeschl
Hi Alistair,

quick email before getting my plane - I worked with similar requirements in the 
past and tuning SOLR can be tricky

* are you hitting the same SOLR query handler (application versus manual 
checking)?
* turn on debugging for your application SOLR queries so you see what query is 
actually executed
* one thing I always do for prototyping is setting up the Solritas GUI using 
the same query handler as the application server
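
(On the debugging point: appending debugQuery=true to a request shows the
parsed query and scoring explanations, for example
http://host:port/solr/collection/select?q=schwarz+kleid&debugQuery=true,
where host, collection, and terms are placeholders.)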

Cheers,

Siegfried Goeschl


On 18 Apr 2014, at 06:06, Alistair  wrote:

> Hey Jack,
> 
> thanks for the reply. I added autoGeneratePhraseQueries="true" to the
> fieldType and now it's giving me even more results! I'm not sure if the
> debug of my query will be helpful but I'll paste it just in case someone
> might have an idea. This produces 113524 results, whereas if I manually
> enter the query as keyword:schwarz AND keyword:kleid I only get 20283
> results (which is the correct one). 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Filtering Solr Queries

2014-04-18 Thread Erick Erickson
Is this a manageable list? That is, not a zillion names? If so, it
seems like you could do this with synonyms. Assuming your string_ci
bit is a "string" type, you'd need to change that to something like
KeywordTokenizerFactory followed by filters, and you might want to add
something like LowercaseFilterFactory to the chain.
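
The synonyms file entry for the example in question would just be one line
per location, e.g. (illustrative):

rajajinagar,rajaji nagar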

Best,
Erick

On Thu, Apr 17, 2014 at 9:47 PM, kumar  wrote:
> Hi,
>
> I am indexing the data using title, city and location fields.
>
> but different cities have the same location names, written like "rajaji
> nagar" or "rajajinagar".
>
> When a user types
>
> "computers in rajaji nagar", it has to display results like "computers
> in rajajinagar" as well as "computers in rajaji nagar".
>
> I am using the following schema.
>
>
> <field name="..." type="..."
>        multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />
>
> <fieldType ...>
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter ... />
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])"
>             replacement=" " replace="all"/>
>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50"
>             minGramSize="2"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true" />
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter ... />
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])"
>             replacement=" " replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?"
>             replacement="$1" replace="all"/>
>     <filter class="solr.SynonymFilterFactory" ignoreCase="true"
>             synonyms="synonyms_fsw.txt" expand="true" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
>
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Filtering-Solr-Queries-tp4131924.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: cache warming questions

2014-04-18 Thread Kranti Parisa
cool, thanks.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Apr 17, 2014 at 11:37 PM, Erick Erickson wrote:

> No, the 5 most recently used in a query will be used to autowarm.
>
> If you have things you _know_ are going to be popular fqs, you could
> put them in newSearcher queries.
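
A minimal solrconfig.xml sketch of such a listener (the fq value is just an
illustration):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="fq">price:[0 TO 100]</str>
    </lst>
  </arr>
</listener>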
>
> Best,
> Erick
>
> On Thu, Apr 17, 2014 at 4:51 PM, Kranti Parisa 
> wrote:
> > Erik,
> >
> > I have a followup question on this topic.
> >
> > If we have used 10 unique FQs and when we configure filterCache=100 &
> > autoWarm=5, then which 5 out of the 10 will be repopulated in the case of
> > new searcher?
> >
> > I don't think there is a way to set the preference or there is?
> >
> >
> > Thanks,
> > Kranti K. Parisa
> > http://www.linkedin.com/in/krantiparisa
> >
> >
> >
> > On Thu, Apr 17, 2014 at 5:25 PM, Matt Kuiper 
> wrote:
> >
> >> Ok,  that makes sense.
> >>
> >> Thanks again,
> >> Matt
> >>
> >> Matt Kuiper - Software Engineer
> >> Intelligent Software Solutions
> >> p. 719.452.7721 | matt.kui...@issinc.com
> >> www.issinc.com | LinkedIn: intelligent-software-solutions
> >>
> >> -Original Message-
> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >> Sent: Thursday, April 17, 2014 9:26 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: cache warming questions
> >>
> >> Don't go overboard warming here, you often hit diminishing returns very
> >> quickly. For instance, if the size is 512 you might set your autowarm
> count
> >> to 16 and get the most bang for your buck. Beyond some (usually small)
> >> number, the additional work you put in to warming is wasted. This is
> >> especially true if your autocommit (soft, or hard with
> >> openSearcher=true) is short.
> >>
> >> So while you're correct in your sizing bit, practically it's rarely that
> >> complicated since the autowarm count is usually so much smaller than the
> >> size that there's no danger of swapping them out. YMMV of course.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Apr 16, 2014 at 10:33 AM, Matt Kuiper 
> >> wrote:
> >> > Thanks Erick, this is helpful information!
> >> >
> >> > So it sounds like, at minimum the cache size (at least for filterCache
> >> and queryResultCache) should be the sum of the autowarmCount for that
> cache
> >> and the number of queries defined for the newSearcher listener.
>  Otherwise
> >> some items in the caches will be evicted right away.
> >> >
> >> > Matt
> >> >
> >> > -Original Message-
> >> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> >> > Sent: Tuesday, April 15, 2014 5:21 PM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: cache warming questions
> >> >
> >> > bq: What does it mean that items will be regenerated or prepopulated
> >> from the current searcher's cache...
> >> >
> >> > You're right, the values aren't cached. They can't be since the
> internal
> >> Lucene document id is used to identify docs, and due to merging the
> >> internal ID may bear no relation to the old internal ID for a particular
> >> document.
> >> >
> >> > I find it useful to think of Solr's caches as a  map where the key is
> >> the "query" and the value is some representation of the found documents.
> >> The details of the value don't matter, so I'll skip them.
> >> >
> >> > What matters is the key. Consider the filter cache. You put something
> >> like &fq=price:[0 TO 100] on a URL. Solr then uses the fq  clause as the
> >> key to the filterCache.
> >> >
> >> > Here's the sneaky bit. When you specify an autowarm count of N for the
> >> filterCache, when a new searcher is opened the first N keys from the map
> >> are re-executed in the new searcher's context and the results put into
> the
> >> new searcher's filterCache.
> >> >
> >> > bq:  ...how does auto warming and explicit warming work together?
> >> >
> >> > They're orthogonal. IOW, the autowarming for each cache is executed as
> >> well as the newSearcher static warming queries. Use the static queries
> to
> >> do things like fill the sort caches etc.
> >> >
> >> > Incidentally, this bears on why there's a "firstSearcher" and
> >> "newSearcher". The newSearcher queries are run in addition to the cache
> >> autowarms. firstSearcher static queries are only run when a Solr server
> is
> >> started the first time, and there are no cache entries to autowarm. So
> the
> >> firstSearcher queries might be quite a bit more complex than newSearcher
> >> queries.
> >> >
> >> > HTH,
> >> > Erick
> >> >
> >> > On Tue, Apr 15, 2014 at 1:55 PM, Matt Kuiper 
> >> wrote:
> >> >> Hello,
> >> >>
> >> >> I have a few questions regarding how Solr caches are warmed.
> >> >>
> >> >> My understanding is that there are two ways to warm internal Solr
> >> >> caches (only one way for document cache and lucene FieldCache):
> >> >>
> >> >> Auto warming - occurs when there is a current searcher handling
> >> >> requests and a new searcher is being prepared.  "When a new searcher
> >> >> is opened, its caches may be prepopulated or "autowarmed" using data
> >> >> from caches in the old searcher."

QueryElevationComponent always reads config from zookeeper

2014-04-18 Thread ronak kirit
Hello,

I was looking into the "QueryElevationComponent".

As per the spec (http://wiki.apache.org/solr/QueryElevationComponent), if
the config is not found in ZooKeeper, it should be loaded from the data
directory. However, I see a bug: this doesn't seem to be working even in
the latest 4.7.2 release.

I have checked the latest code and found this:
Map<String, ElevationObj> getElevationMap(IndexReader reader, SolrCore core)
    throws Exception {
  synchronized (elevationCache) {
    Map<String, ElevationObj> map = elevationCache.get(null);
    if (map != null) return map;

    map = elevationCache.get(reader);
    if (map == null) {
      String f = initArgs.get(CONFIG_FILE);
      if (f == null) {
        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
            "QueryElevationComponent must specify argument: " + CONFIG_FILE);
      }
      log.info("Loading QueryElevation from data dir: " + f);

      Config cfg;

      ZkController zkController =
          core.getCoreDescriptor().getCoreContainer().getZkController();
      if (zkController != null) {
        cfg = new Config(core.getResourceLoader(), f, null, null);
      } else {
        InputStream is = VersionedFile.getLatestFile(core.getDataDir(), f);
        cfg = new Config(core.getResourceLoader(), f, new InputSource(is), null);
      }

      map = loadElevationMap(cfg);
      elevationCache.put(reader, map);
    }
    return map;
  }
}

As per this code, the config will never be loaded from the data directory
when ZooKeeper is in use: the zkController != null branch always reads the
file through the resource loader and never falls back to the data dir.

Can we fix this issue?
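
One possible direction (an untested sketch, not code from the Solr tree):
try the resource loader first, which resolves the file from ZooKeeper in
cloud mode, and fall back to the data directory when it is not found there:

    Config cfg;
    try {
      // In cloud mode the resource loader is a ZkSolrResourceLoader,
      // so this reads the file from ZooKeeper when it exists there.
      cfg = new Config(core.getResourceLoader(), f, null, null);
    } catch (Exception e) {
      // Not in ZooKeeper: fall back to the newest copy in the data dir.
      InputStream is = VersionedFile.getLatestFile(core.getDataDir(), f);
      cfg = new Config(core.getResourceLoader(), f, new InputSource(is), null);
    }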

Thanks,
Ronak


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair
Hey Jack,

thanks for the reply. I added autoGeneratePhraseQueries="true" to the
fieldType and now it's giving me even more results! I'm not sure if the
debug of my query will be helpful but I'll paste it just in case someone
might have an idea. This produces 113524 results, whereas if I manually
enter the query as keyword:schwarz AND keyword:kleid I only get 20283
results (which is the correct count).

[the pasted query debug output did not survive the mailing list archive]
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multi word search for elevator (QueryElevationComponent) not working

2014-04-18 Thread Niranjan
Hi Remi,

Thanks for your reply.

I tried setting the query_text to "apple ipod" and added the required
doc_id to elevate. I got the result, but I am still not able to get the
desired result for NLP queries such as "ipod nano generation 5" or "apple
ipod best music". Both queries contain "ipod", for which I want my desired
doc ids to be elevated.
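
For reference, the elevate.xml entry I am describing looks roughly like
this (the doc ids are placeholders):

    <elevate>
      <query text="apple ipod">
        <doc id="DOC_ID_1" />
        <doc id="DOC_ID_2" />
      </query>
    </elevate>

From what I understand, the component matches the entire (analyzed) query
string against the text attribute, which would explain why a longer query
such as "ipod nano generation 5" does not trigger the "apple ipod" entry.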

I also tried changing the queryFieldType in the QueryElevationComponent as:

First with this:
<str name="queryFieldType">string</str>

Second time:
<str name="queryFieldType">text_general</str>

But no success.

Please correct me if I have not made the change the way you meant.

Is there any other way in Solr to achieve this (promoted search)?
Please guide me.

Regards,
Niranjan






--
View this message in context: 
http://lucene.472066.n3.nabble.com/multi-word-search-for-elevator-QueryElevationComponent-not-working-tp4131016p4131971.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: space between search terms

2014-04-18 Thread Jack Krupansky

Use an index-time synonym filter with a synonym entry:

indira nagar,indiranagar

But do not use that same filter at query time.

But, that may mess up some exact phrase queries, such as:

q="indiranagar xyz"

since the following term is actually positioned after the longest synonym.

To resolve that, use a sloppy phrase:

q="indiranagar xyz"~1

Or, set qs=1 for the edismax query parser.
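
Put together, the field type might look something like this -- the
tokenizer and lowercase filter are assumptions, the key point being that
the synonym filter appears only in the index-time analyzer:

    <fieldType name="text_title" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- synonyms-index.txt contains: indira nagar,indiranagar -->
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms-index.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>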

-- Jack Krupansky

-Original Message- 
From: kumar

Sent: Friday, April 18, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: space between search terms

Hi,

I have a field called "title". It has values such as "indira nagar"
as well as "indiranagar".

If I type either of the keywords, it should return both results.

Can anybody help with how we can do this?


I am using the title field in the following way:

[the field type definition was stripped by the mailing list archive]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com. 



space between search terms

2014-04-18 Thread kumar
Hi,

I have a field called "title". It has values such as "indira nagar"
as well as "indiranagar".

If I type either of the keywords, it should return both results.

Can anybody help with how we can do this?


I am using the title field in the following way:

[the field type definition was stripped by the mailing list archive]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Jack Krupansky
Make sure your field type has the autoGeneratePhraseQueries="true" attribute 
(default is false). q.op only applies to explicit terms, not to terms which 
decompose into multiple terms. Confusing? Yes!
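
For illustration, a field type along these lines -- the analyzer chain is
an assumption built around the decompounding filter described in the quoted
message below, not the actual schema:

    <fieldType name="text_de" class="solr.TextField"
               autoGeneratePhraseQueries="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- splits e.g. "schwarzkleid" into "schwarz" + "kleid" -->
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                dictionary="compound-words-de.txt"
                minWordSize="5" minSubwordSize="3" maxSubwordSize="30"
                onlyLongestMatch="false"/>
      </analyzer>
    </fieldType>

With autoGeneratePhraseQueries="true", the multiple tokens produced from a
single query term are searched as a phrase instead of being OR'd together.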


-- Jack Krupansky

-Original Message- 
From: Alistair

Sent: Friday, April 18, 2014 6:11 AM
To: solr-user@lucene.apache.org
Subject: Having trouble with German compound words in Solr 4.7

Hello all,

I'm a fairly new Solr user and I need my search function to handle compound
words in German. I've searched through the archives and found that Solr
already has a Filter Factory made for such words called
DictionaryCompoundWordTokenFilterFactory. I've already built a list of words
that I want split, and it seems like the filter is working correctly in most
cases. The majority of our searches are clothing items, so let's say
"/schwarzkleid/" (black dress) becomes "/schwarz/" "/kleid/", which is what
I want to happen. However, it seems like the keyword search is done using an
*OR* operator. So I'm seeing items that are either black or are dresses but
I just want to see items that are both. I've also read that changing the
default operator in schema.xml or adding q.op as *AND* in the solrconfig.xml
will rectify this issue, but nothing has changed in my query results. It
still uses the *OR* operator.
I've tried using Extended dismax in my queries but I am using the Solr PHP
library and I don't think it supports adding Dismax filters to the queries
themselves (if I'm wrong, please correct me). By the way, I am using Zend
Framework 2.0 in the backend and am communicating with Solr through the Solr
PHP library.

Any suggestions on how to change the operator after my compound word queries
have been split?

Thanks!

Ali



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair
Hello all,

I'm a fairly new Solr user and I need my search function to handle compound
words in German. I've searched through the archives and found that Solr
already has a Filter Factory made for such words called
DictionaryCompoundWordTokenFilterFactory. I've already built a list of words
that I want split, and it seems like the filter is working correctly in most
cases. The majority of our searches are clothing items, so let's say
"/schwarzkleid/" (black dress) becomes "/schwarz/" "/kleid/", which is what
I want to happen. However, it seems like the keyword search is done using an
*OR* operator. So I'm seeing items that are either black or are dresses but
I just want to see items that are both. I've also read that changing the
default operator in schema.xml or adding q.op as *AND* in the solrconfig.xml
will rectify this issue, but nothing has changed in my query results. It
still uses the *OR* operator.
I've tried using Extended dismax in my queries but I am using the Solr PHP
library and I don't think it supports adding Dismax filters to the queries
themselves (if I'm wrong, please correct me). By the way, I am using Zend
Framework 2.0 in the backend and am communicating with Solr through the Solr
PHP library.

Any suggestions on how to change the operator after my compound word queries
have been split?

Thanks!

Ali



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Where to specify numShards when startup up a cloud setup

2014-04-18 Thread Liu Bo
Hi zzT

Putting numShards in core.properties also works.

I struggled a little bit while figuring out this "configuration approach".
Good to know I am not alone! ;-)


On 2 April 2014 18:06, zzT  wrote:

> It seems that I've figured out a "configuration approach" to this issue.
>
> I'm having the exact same issue and the only viable solutions found on the
> net till now are
> 1) Pass -DnumShards=x when starting up Solr server
> 2) Use the Collections API as indicated by Shawn.
>
> What I've noticed though - after making the call to /collections to create
> a collection - is that a new <core> entry is added inside solr.xml with
> the attribute "numShards".
>
> So, right now I'm configuring solr.xml with the numShards attribute inside
> my <core> nodes. This way I don't have to worry about the annoying stuff
> you've already mentioned, e.g. waiting for Solr to start up etc.
>
> Of course the same logic applies here: the numShards param is meaningful
> only the first time. Even if you change it at a later point, the # of
> shards stays the same.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Where-to-specify-numShards-when-startup-up-a-cloud-setup-tp4078473p4128566.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
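
For completeness, the legacy-style solr.xml zzT describes would look
roughly like this (names and values are illustrative):

    <cores adminPath="/admin/cores">
      <core name="collection1" instanceDir="collection1"
            collection="collection1" shard="shard1" numShards="2"/>
    </cores>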



-- 
All the best

Liu Bo


Re: Another japanese analysis problem

2014-04-18 Thread Shawn Heisey
On 4/18/2014 12:04 AM, Alexandre Rafalovitch wrote:
> Did you read through the CJK article series? Maybe there is something
> in there? 
> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
> 
> Sorry, no help on actual Japanese.

Almost everything I know about the Japanese language has been learned in
the last few weeks, working on this Solr config!

That blog series looks like really awesome information.  I will be
trying out some of what they've mentioned.  Thank you for pointing me in
that direction.  The author's index is a lot more complex than ours ...
I'm really hoping to avoid having a lot of copies of each field.  The
index is already relatively large.

I think I'll take my discussion about a possible bug in CJKBigramFilter
to the dev list.

Thanks,
Shawn