Re: Faceting Question

2012-11-15 Thread Alexey Serba
Seems like pivot faceting is what you are looking for (
http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting
)

Note: it currently does not work in distributed mode - see
https://issues.apache.org/jira/browse/SOLR-2894
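
For example (field names "source" and "date" are illustrative, and pivot facets are plain field
facets, so a coarse-grained date field works best), a single request along these lines should
return a per-source date distribution:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=source,date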

On Thu, Nov 15, 2012 at 7:46 AM, Jamie Johnson jej2...@gmail.com wrote:
 Sorry some more info. I have a field to store source and another for date.
  I currently use faceting to get a temporal distribution across all
 sources.  What is the best way to get a temporal distribution per source?
  Is the only thing I can do to execute 1 query for the list of sources and
 then another query for each source?

 On Wednesday, November 14, 2012, Jamie Johnson jej2...@gmail.com wrote:
 I've recently been asked to be able to display a temporal facet broken
 down by source, so source1 has the following temporal distribution, source
 2 has the following temporal distribution etc.  I was wondering what the
 best way to accomplish this is?  My current thoughts were that I'd need to
 execute a completely separate query for each, is this right?  Could field
 aliasing some how be used to execute this in a single request to solr?  Any
 thoughts would really be appreciated.


Re: Faceting Facets

2012-09-03 Thread Alexey Serba
http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting

On Mon, Sep 3, 2012 at 6:38 PM, Dotan Cohen dotanco...@gmail.com wrote:
 Is there any way to nest facet searches in Solr? Specifically, I have
 a User field and a DateTime field. I need to know how many Documents
 match each User for each one-hour period in the past 24 hours. That
 is, 16 Users * 24 time periods = 384 values to return.

 I could run 16 queries and facet on DateTime, or 24 queries and facet
 on User. However, if there is a way to facet the facets, then I would
 love to know. Thanks!

 --
 Dotan Cohen

 http://gibberish.co.il
 http://what-is-what.com


Re: Query Time problem on Big Index Solr 3.5

2012-08-31 Thread Alexey Serba
1. Use filter queries

 Here is an example query; is anything incorrect, or is there anything I can
 change?
 http://xxx:8893/solr/candidate/select/?q=+(IdCandidateStatus:2)+(IdCobranded:3)+(IdLocation1:12))+(LastLoginDate:[2011-08-26T00:00:00Z
 TO 2012-08-28T00:00:00Z])

What is the logic here? Are you AND-ing these boolean clauses? If yes,
then I would change queries to

http://xxx:8893/solr/candidate/select/?q=*:*&fq=IdCandidateStatus:2&fq=IdCobranded:3&fq=IdLocation1:12&fq=LastLoginDate:[2011-08-26T00:00:00Z
TO 2012-08-28T00:00:00Z]

I.e. move these clauses into fq (filter query) parameters:
* it should be faster, as it seems you don't need scoring here; sort by
id/date instead.
* fq-s are cached separately, thus increasing the cache hit rate.

2. Do not optimize your index

 I have a master and 6 slaves; they are synchronized every 10 minutes. 
 And the index always is optimized.
DO NOT optimize your index! (unless you re-create the whole index
completely every 10 mins). It basically defeats the purpose of replication
(after every optimize command the slaves download the whole index).


Re: Java class [B has no public instance field or method named split.

2012-08-31 Thread Alexey Serba
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5
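
In short, if I remember that FAQ entry correctly, you either CAST the column to CHAR in SQL or let
the JDBC data source convert column values, e.g. something like:

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="..." user="..." password="..." convertType="true"/>

so the attachments value reaches your script as a String and split() works.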

On Sat, Sep 1, 2012 at 2:17 AM, Cirelli, Stephen J.
stephen.j.cire...@saic.com wrote:
 Anyone know why I'm getting this exception? I'm following the example
 here  http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer
 but I get the below error. The field type in my schema.xml is string,
 text doesn't work either. Why would I get an error that there's no split
 method on a string?

 Caused by: sun.org.mozilla.javascript.internal.EvaluatorException: Java
 class [B has no public instance field or method named split.
 (Unknown source#52)

 Here's the JS

 function parseAttachments(row){
     var mainDelim = '(|)', subDelim = '-|-',
         attRow = [ // This must be in the order that it was concatenated in the query.
             { index:0, field:'attachmentFileName',      arr: new java.util.ArrayList()},
             { index:1, field:'attachmentSize',          arr: new java.util.ArrayList()},
             { index:2, field:'attachmentMIMEType',      arr: new java.util.ArrayList()},
             { index:3, field:'attachmentExtractedText', arr: new java.util.ArrayList()},
             { index:4, field:'attachmentLink',          arr: new java.util.ArrayList()}
         ];

     var allAttachments = row.get('attachments').split(mainDelim);
     for(var i=0, l=allAttachments.length; i<l; i++) {
         var attachment = allAttachments[i].split(subDelim);

         for(var j=0, jl=attRow.length; j<jl; j++){
             var itm = attachment[j],
                 arr = attRow[j].arr;
             arr.add(itm);
         }
     }
     for(var j=0, jl=attRow.length; j<jl; j++){
         var itm = attRow[j];
         row.put(itm.field, itm.arr);
     }
     row.remove('attachments');
     return row;
 }


Re: Sharing and performance testing question.

2012-08-29 Thread Alexey Serba
 Any tips on load testing Solr? Ideally we would like caching to not affect
 the results as much as possible.

1. Siege tool
This is probably the simplest option. You can generate urls.txt file
and pass it to the tool (see the example command after this list). You should
also capture server performance (CPU, memory, qps, etc) using tools like
New Relic, Zabbix, etc.

2. SolrMeter
http://code.google.com/p/solrmeter/

3. Solr benchmark module (not committed yet)
It lets you run complex benchmarks using different algorithms:
* https://issues.apache.org/jira/browse/SOLR-2646
* 
http://searchhub.org/dev/2011/07/11/benchmarking-the-new-solr-near-realtime-improvements/
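
Re the siege option (1) above, a minimal run might look like this (flags per the siege docs;
adjust concurrency and duration to your setup):

  siege -f urls.txt -c 20 -t 10M -i

where urls.txt contains one full Solr query URL per line and -i picks URLs from it at random.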


Re: Injest pauses

2012-08-29 Thread Alexey Serba
Hey Brad,

 This leads me to believe that a single merge thread is blocking indexing from
 occurring.
 When this happens our producers, which distribute their updates amongst all 
 the shards, pile up on this shard and wait.
Which version of Solr are you using? Have you tried the 4.0 beta?

* 
http://searchhub.org/dev/2011/04/09/solr-dev-diary-solr-and-near-real-time-search/
* https://issues.apache.org/jira/browse/SOLR-2565

Alexey


Re: LateBinding

2012-08-29 Thread Alexey Serba
http://searchhub.org/dev/2012/02/22/custom-security-filtering-in-solr/

See section about PostFilter.

On Wed, Aug 29, 2012 at 4:43 PM,  johannes.schwendin...@blum.com wrote:
 Hello,

 Has anyone ever implemented the security feature called late-binding?

 I am trying this but I am very new to solr and I would be very glad if I
 would get some hints to this.

 Regards,
 Johannes


Re: Injest pauses

2012-08-29 Thread Alexey Serba
Could you take jstack dump when it's happening and post it here?

 Interestingly it is not pausing during every commit so at least a portion of 
 the time the async commit code is working.  Trying to track down the case 
 where a wait would still be issued.
 
 -Original Message-
 From: Voth, Brad (GE Corporate) 
 Sent: Wednesday, August 29, 2012 12:32 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Injest pauses
 
 Thanks, I'll continue with my testing and tracking down the block.
 
 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
 Sent: Wednesday, August 29, 2012 12:28 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Injest pauses
 
 On Wed, Aug 29, 2012 at 11:58 AM, Voth, Brad (GE Corporate) 
 brad.v...@ge.com wrote:
 Anyone know the actual status of SOLR-2565, it looks to be marked as 
 resolved in 4.* but I am still seeing long pauses during commits using
 4.*
 
 SOLR-2565 is definitely committed - adds are no longer blocked by commits (at 
 least at the Solr level).
 
 -Yonik
 http://lucidworks.com


Re: Indexing and querying BLOBS stored in Mysql

2012-08-24 Thread Alexey Serba
I would recommend creating a simple data import handler config to test Tika
parsing of large BLOBs, i.e. remove unrelated entities, remove all
the configuration for delta imports, and keep just the entity that
retrieves blobs and the entity that parses binary content
(fieldReader/TikaEntityProcessor).

Some comments:
1. Maybe you are running a delta import and there are no new records in the database?
2. deltaQuery should only return id-s and not other columns/data,
because you don't use them in deltaImportQuery (see
dataimporter.delta.id )
3. Not all entities list HTMLStripTransformer in their transformer attribute,
but they use stripHTML in their fields. TemplateTransformer is not used at all.
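
A stripped-down test config along these lines might do (column/field names taken from your config
below; keep the fieldReader dataSource exactly as you have it defined; untested sketch):

  <dataConfig>
    <dataSource name="db" type="JdbcDataSource" driver="..." url="..." user="..." password="..."/>
    <!-- plus your existing fieldReader dataSource, unchanged -->
    <document>
      <entity name="bin" dataSource="db"
              query="select id, bin_con AS text from aitiologikes_ektheseis where type = 'bin'">
        <field column="id" name="ida"/>
        <entity dataSource="fieldReader" processor="TikaEntityProcessor"
                dataField="bin.text" format="text">
          <field column="text" name="contentbin"/>
        </entity>
      </entity>
    </document>
  </dataConfig>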

    <entity name="aitiologikes_ektheseis"
            dataSource="db"
            transformer="HTMLStripTransformer"
            query="select id, title, title AS grid_title, model, type, url,
                   last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag,
                   CONCAT(body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'"
            deltaImportQuery="select id, title, title AS grid_title, model, type, url,
                   last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag,
                   CONCAT(body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'
                   and id='${dataimporter.delta.id}'"
            deltaQuery="select id, title, title AS grid_title, model, type, url,
                   last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag,
                   CONCAT(body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'
                   and last_modified > '${dataimporter.last_index_time}'">
        <field column="id" name="ida" />
        <field column="solr_id" name="solr_id" />
        <field column="title" name="title" stripHTML="true" />
        <field column="grid_title" name="grid_title" stripHTML="true" />
        <field column="model" name="model" stripHTML="true" />
        <field column="type" name="type" stripHTML="true" />
        <field column="url" name="url" stripHTML="true" />
        <field column="last_modified" name="last_modified" stripHTML="true" />
        <field column="search_tag" name="search_tag" stripHTML="true" />
        <field column="content" name="content" stripHTML="true" />
    </entity>

    <entity name="aitiologikes_ektheseis_bin"
            query="select id, title, title AS grid_title, model, type, url,
                   last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
                   text from aitiologikes_ektheseis where type = 'bin'"
            deltaImportQuery="select id, title, title AS grid_title, model, type, url,
                   last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
                   text from aitiologikes_ektheseis where type = 'bin' and
                   id='${dataimporter.delta.id}'"
            deltaQuery="select id, title, title AS grid_title, model, type, url,
                   last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
                   text from aitiologikes_ektheseis where type = 'bin' and last_modified >
                   '${dataimporter.last_index_time}'"
            transformer="TemplateTransformer"
            dataSource="db">

        <field column="id" name="ida" />
        <field column="solr_id" name="solr_id" />
        <field column="title" name="title" stripHTML="true" />
        <field column="grid_title" name="grid_title" stripHTML="true" />
        <field column="model" name="model" stripHTML="true" />
        <field column="type" name="type" stripHTML="true" />
        <field column="url" name="url" stripHTML="true" />
        <field column="last_modified" name="last_modified" stripHTML="true" />
        <field column="search_tag" name="search_tag" stripHTML="true" />

        <entity dataSource="fieldReader"
                processor="TikaEntityProcessor"
                dataField="aitiologikes_ektheseis_bin.text" format="text">
            <field column="text" name="contentbin" stripHTML="true" />
        </entity>

    </entity>

 ...
 ...
 </document>

 </dataConfig>

 *A portion from schema.xml (the fieldTypes and field definitions):*

 <fieldType name="text_ktimatologio" class="solr.TextField"
            positionIncrementGap="100">

   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPossessiveFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
     <filter class="solr.GreekLowerCaseFilterFactory"/>
     <filter class="solr.GreekStemFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>

   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter 

Re: MySQL Exception: Communications link failure WITH DataImportHandler

2012-08-16 Thread Alexey Serba
My memory is vague, but I think I've seen something similar with older
versions of Solr.

Is it possible that you have a large database import and a big segment
merge happens in the middle, blocking the DIH indexing process (and the
reading of records from the database as well)? That would mean long
inactivity in the communication with the db server, and a timeout as a
result. If this is the case then you can either increase the timeout limit
on the db server (I don't remember the actual parameter) or upgrade Solr to
a newer version that doesn't have such long pauses (4.0 beta?).

On Thu, Aug 16, 2012 at 12:37 PM, Jienan Duan jnd...@gmail.com wrote:
 Hi all:
 I have resolved this problem by configuring a jndi datasource in tomcat.
 But I still want to find out why it throw an exception in DIH when I
 configure datasource in data-configure.xml but a jndi resource.

 Regards.

 2012/8/16 Jienan Duan jnd...@gmail.com

 Hi all:
 I'm using DataImportHandler load data from MySQL.
 It works fine on my develop machine and online environment.
 But I got an exception on test environment:

 Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
 Communications link failure


 The last packet sent successfully to the server was 0 milliseconds ago.
 The driver has not received any packets from the server.

 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

 at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)

 at
 com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)

 at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:343)

 at
 com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2132)

 ... 26 more

 Caused by: java.net.ConnectException: Connection timed out

 at java.net.PlainSocketImpl.socketConnect(Native Method)

 at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)

 at
 java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)

 at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)

 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

 at java.net.Socket.connect(Socket.java:529)

 at java.net.Socket.connect(Socket.java:478)

 at java.net.Socket.<init>(Socket.java:375)

 at java.net.Socket.<init>(Socket.java:218)

 at
 com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:253)

 at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:292)

 ... 27 more

 This make me confused,because the test env and online env almost
 same:Tomcat runs on a Linux Server with JDK6,MySql5 runs on another.
 Even I wrote a simple JDBC test class it works,a jsp file with JDBC code
 also works.Only DataImportHandler failed.
 I'm trying to read Solr source code and found that it seems Solr has it's
 own ClassLoader.I'm not sure if it goes wrong with Tomcat on some specific
 configuration.
 Dose anyone know how to fix this problem? Thank you very much.

 Best Regards.

 Jienan Duan

 --
 --
 不走弯路,就是捷径。
 http://www.jnan.org/




 --
 --
 不走弯路,就是捷径。
 http://www.jnan.org/


Re: Custom Geocoder with Solr and Autosuggest

2012-08-16 Thread Alexey Serba
 My first decision was to divide SOLR into two cores, since I am already
 using SOLR as my search server. One core would be for the main search of the
 site and one for the geocoding.
Correct. And you can even use that location index/collection to extract
locations from unstructured documents - i.e. if you don't have a
separate field with geographical names in your corpus (or the
location data is just not good enough compared to what can be mined
from the documents).

 My second decision is to store the name data in a normalised state, some
 examples are shown below:
 London, England
 England
 Swindon, Wiltshire, England
Yes, you can add postcodes/outcodes there also. And I would add an
additional field "type" = region/county/town/postcode/outcode.

 The third decision was to return “autosuggest” results, for example when the
 user types “Lond” I would like to suggest “London, England”. For this to
 work I think it makes sense to return up to 5 results via JSON based on
 relevancy and have these displayed under the search box.
Yeah, you might want to boost cities more than towns (I'm sure there
are plenty of ambiguous terms), use some kind of GeoIP service, and
additional scoring factors.

 My fourth decision is that when the user actually hits the “search” button
 on the location field, SOLR is again queries and returns the most relevant
 result, including the co-ordinates which are stored.
You can also have special logic to decide whether you want to use spatial
search or whether a simple textual match would be better. E.g. you have
England in your example. It doesn't sound practical to return
coordinates and use spatial search for that case, right?

HTH,
Alexey


Re: Solr Index linear growth - Performance degradation.

2012-08-14 Thread Alexey Serba
10K queries
How do you generate these queries? I.e. is this a single or multi
threaded application?

Can you provide full queries you send to Solr servers and solrconfig
request handler configuration? Do you use function queries, grouping,
faceting, etc?


On Tue, Aug 14, 2012 at 10:31 AM, feroz_kh feroz.kh2...@gmail.com wrote:
 Its 7,200,000 hits == number of documents found by all 10K queries.
 We have RHEL tikanga version.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Index-linear-growth-Performance-degradation-tp4000934p4001069.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Running out of memory

2012-08-12 Thread Alexey Serba
 It would be vastly preferable if Solr could just exit when it gets a memory
 error, because we have it running under daemontools, and that would cause
 an automatic restart.
-XX:OnOutOfMemoryError="<cmd args>; <cmd args>"
Run user-defined commands when an OutOfMemoryError is first thrown.

 Does Solr require the entire index to fit in memory at all times?
No.

But it's hard to say about your particular problem without additional
information. How often do you commit? Do you use faceting? Do you sort
by Solr fields and if yes what are those fields? And you should also
check caches.


Re: Is this too much time for full Data Import?

2012-08-08 Thread Alexey Serba
9m*15 - that's a lot of queries (400 QPS).

I would try to reduce the number of queries:

1. Rewrite your main (root) query to select all possible data
* use SQL joins instead of DIH nested entities
* select data from 1-N related tables (tags, authors, etc) in the main
query using the GROUP_CONCAT aggregate function (that's a MySQL-specific
function, but there are similar functions for other RDBMS-es) and then
split the concatenated data in a DIH transformer (see the sketch below).

2. Identify small tables in nested entities and cache them completely
in CachedSqlEntityProcessor.
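
For option 1, a rough sketch for a MySQL source (table/column names are hypothetical;
RegexTransformer's splitBy does the splitting):

  <entity name="doc" transformer="RegexTransformer"
          query="select d.id, d.title, GROUP_CONCAT(t.name SEPARATOR ',') AS tags
                 from doc d left join doc_tag dt on dt.doc_id = d.id
                 left join tag t on t.id = dt.tag_id
                 group by d.id">
    <field column="id" name="id"/>
    <field column="title" name="title"/>
    <field column="tags" splitBy="," name="tag"/>
  </entity>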



On Wed, Aug 8, 2012 at 10:35 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 Hello,

 Does your indexer utilize CPU/IO? - check it by iostat/vmstat.
 If it doesn't, take several thread dumps by jvisualvm sampler or jstack,
 try to understand what blocks your threads from progress.
 It might happen you need to speedup your SQL data consumption, to do this,
 you can enable threads in DIH (only in 3.6.1), move from N+1 SQL queries to
 select all/cache approach
 http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor and
 https://issues.apache.org/jira/browse/SOLR-2382

 Good luck

 On Wed, Aug 8, 2012 at 9:16 AM, Pranav Prakash pra...@gmail.com wrote:

 Folks,

 My full data import takes ~80hrs. It has around ~9m documents and ~15 SQL
 queries for each document. The database servers are different from Solr
 Servers. Each document has an update processor chain which (a) calculates
 signature of the document using SignatureUpdateProcessorFactory and (b)
 Finds out terms which have term frequency > 2; using a custom processor.
 The index size is ~ 480GiB

 I want to know if the amount of time taken is too large compared to the
 document count? How do I benchmark the stats and what are some of the ways
 I can improve this? I believe there are some optimizations that I could do
 at Update Processor Factory level as well. What would be a good way to get
 dirty on this?

 *Pranav Prakash*

 temet nosce




 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com


Re: Large RDBMS dataset

2011-12-29 Thread Alexey Serba
 The problem is that for each record in fd, Solr makes three distinct SELECT 
 on the other three tables. Of course, this is absolutely inefficient.

You can also try to use GROUP_CONCAT (it's a MySQL function, but maybe
there's something similar in MS SQL) to select all the nested 1-N
entities in a single result set as strings joined using some separator,
and then split them into multivalued fields in a post-processing phase
(using the RegexTransformer / TemplateTransformer or similar).


Re: a question on jmx solr exposure

2011-12-29 Thread Alexey Serba
Which Solr version do you use? Maybe it has something to do with
default collection?

I do see separate jmx domain for every collection, i.e.

solr/collection1
solr/collection2
solr/collection3
...

On Wed, Dec 21, 2011 at 1:56 PM, Dmitry Kan dmitry@gmail.com wrote:
 Hello list,

 This might be not the right place to ask the jmx specific questions, but I
 decided to try, as we are polling SOLR statistics through jmx.

 We currently have two solr cores with different schemas A and B being run
 under the same tomcat instance. Question is: which stat is jconsole going
 to see under solr/ ?

 From the numbers (e.g. numDocs of searcher), jconsole see the stats of A.
 Where do stats of B go? Or is firstly activated core will capture the jmx
 pipe and won't let B's stats to go through?

 --
 Regards,

 Dmitry Kan


Re: Decimal Mapping problem

2011-12-29 Thread Alexey Serba
Try to cast MySQL decimal data type to string, i.e.

CAST( IF(drt.discount IS NULL,'0',(drt.discount/100)) AS CHAR) as discount
(or CAST AS TEXT)

On Mon, Dec 19, 2011 at 1:24 PM, Niels Stevens ni...@kabisa.nl wrote:
 Hey everybody,

 I'm having an issue importing Decimal numbers from my Mysql DB to Solr.
 Is there anybody with some advise, I will start and try to explain my
 problem.

 According to my findings, I think the lack of a explicit mapping of a
 Decimal value in the schema.xml
 is causing some issues I'm experiencing.

 The decimal numbers I'm trying to import look like this :

 0.075000
 7.50
 2.25


 but after the import statement the results for the equivalent Solr field
 are returned as this:

 [B@1413d20
 [B@11c86ff
 [B@1e2fd0d


 The import statement for this particular field looks like:

  IF(drt.discount IS NULL,'0',(drt.discount/100)) ...


 Now I thought that using the Round functions from mysql to 3 numbers after
 the dot.
 In conjunction with a explicite mapping field in the schema.xml could solve
 this issue.
 Is there someone with some similar problems with decimal fields or anybody
 with an expert view on this?

 Thanks a lot in advance.

 Regards,

 Niels Stevens


Re: Solr 3.3: DIH configuration for Oracle

2011-08-17 Thread Alexey Serba
Why do you need to collect both primary keys T1_ID_RECORD and
T2_ID_RECORD in your delta query? Isn't the T2_ID_RECORD primary key value
enough to get all the data from both tables? (you have the table1-table2
relation as 1-N, right?)

On Thu, Aug 11, 2011 at 12:52 AM, Eugeny Balakhonov c0f...@gmail.com wrote:
 Hello, all!



 I want to create a good DIH configuration for my Oracle database with deltas
 support. Unfortunately I am not able to do it well as DIH has the strange
 restrictions.

 I want to explain a problem on a simple example. In a reality my database
 has very difficult structure.



 Initial conditions: Two tables with following easy structure:



 Table1

 -          ID_RECORD    (Primary key)

 -          DATA_FIELD1

 -          ..

 -          DATA_FIELD2

 -          LAST_CHANGE_TIME

 Table2

 -          ID_RECORD    (Primary key)

 -          PARENT_ID_RECORD (Foreign key to Table1.ID_RECORD)

 -          DATA_FIELD1

 -          ..

 -          DATA_FIELD2

 -          LAST_CHANGE_TIME



 In performance reasons it is necessary to do selection of the given tables
 by means of one request (via inner join).



 My db-data-config.xml file:



 <?xml version="1.0" encoding="UTF-8"?>

 <dataConfig>

     <dataSource jndiName="jdbc/DB1" type="JdbcDataSource" user="" password=""/>

     <document>

         <entity name="ent" pk="T1_ID_RECORD, T2_ID_RECORD"

             query="select * from TABLE1 t1 inner join TABLE2 t2 on
                    t1.ID_RECORD = t2.PARENT_ID_RECORD"

             deltaQuery="select t1.ID_RECORD T1_ID_RECORD, t1.ID_RECORD T2_ID_RECORD
                    from TABLE1 t1 inner join TABLE2 t2 on t1.ID_RECORD = t2.PARENT_ID_RECORD
                    where TABLE1.LAST_CHANGE_TIME > to_date('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')
                    or TABLE2.LAST_CHANGE_TIME > to_date('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"

             deltaImportQuery="select * from TABLE1 t1 inner join TABLE2 t2
                    on t1.ID_RECORD = t2.PARENT_ID_RECORD
                    where t1.ID_RECORD = ${dataimporter.delta.T1_ID_RECORD} and
                    t2.ID_RECORD = ${dataimporter.delta.T2_ID_RECORD}"

         />

     </document>

 </dataConfig>



 In result I have following error:



 java.lang.IllegalArgumentException: deltaQuery has no column to resolve to
 declared primary key pk='T1_ID_RECORD, T2_ID_RECORD'



 I have analyzed the source code of DIH. I found that in the DocBuilder class
 collectDelta() method works with value of entity attribute pk as with
 simple string. But in my case this is array with two values: T1_ID_RECORD,
 T2_ID_RECORD



 What do I do wrong?



 Thanks,

 Eugeny






Re: Weird issue with solr and jconsole/jmx

2011-06-24 Thread Alexey Serba
I just encountered the same bug - JMX-registered beans don't survive
Solr core reloads.

I believe the reason is that when you do a core reload
* when the new core is created, it overwrites/over-registers beans in the
registry (in the MBean server)
* when the new core is ready, in the core-register phase CoreContainer
closes the old core, which results in unregistering the JMX beans.

As a result there's only one bean left in the registry after the core reload:
id=org.apache.solr.search.SolrIndexSearcher,type=Searcher@33099cc
main. That is because this is the only new (dynamically named) bean
created by the new core and not un-registered in oldCore.close. I'll try
to reproduce this in a test and file a bug in Jira.


On Tue, Mar 16, 2010 at 4:25 AM, Andrew Greenburg agreenb...@gmail.com wrote:
 On Tue, Mar 9, 2010 at 7:44 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : I connected to one of my solr instances with Jconsole today and
 : noticed that most of the mbeans under the solr hierarchy are missing.
 : The only thing there was a Searcher, which I had no trouble seeing
 : attributes for, but the rest of the statistics beans were missing.
 : They all show up just fine on the stats.jsp page.
 :
 : In the past this always worked fine. I did have the core reload due to
 : config file changes this morning. Could that have caused this?

 possibly... reloading the core actually causes a whole new SolrCore
 object (with it's own registry of SOlrInfoMBeans) to be created and then
 swapped in place of hte previous core ... so perhaps you are still looking
 at the stats of the old core which is no longer in use (and hasn't been
 garbage collected because the JMX Manager still had a refrence to it for
 you? ... i'm guessing at this point)

 did disconnecting from jconsole and reconnecting show you the correct
 stats?

 Disconnecting and reconnecting didn't help. The queryCache and
 documentCache and some others started showing up after I did a commit
 and opened a new searcher, but the whole tree never did fill in.

 I'm guessing that the request handler stats stayed associated with the
 old, no longer visible core in JMX since new instances weren't created
 when the core reloaded. Does that make sense? The stats on the web
 stats page continued to be fresh.



Re: Solr and Tag Cloud

2011-06-19 Thread Alexey Serba
Consider you have a multivalued field _tag_ attached to every document in
your corpus. Then you can build a tag cloud for the whole data set or for a
specific query by retrieving facets on the field _tag_ for *:* or any
other query. You'll get a list of popular _tag_ values relevant to
that query with occurrence counts.

If you want to build a tag cloud from general analyzed text fields you
can still do that the same way, but note that you can hit
performance/memory problems if you have a significant data set and
huge text fields. You should probably use stop words to filter out popular
general terms.
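
For example (assuming the multivalued field is literally named "tag"):

  /select?q=*:*&rows=0&facet=true&facet.field=tag&facet.limit=50&facet.mincount=1

returns the 50 most frequent tag values with their counts for the whole corpus; replace q=*:* with
any other query to get the cloud for that query's result set.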

On Sat, Jun 18, 2011 at 8:12 AM, Jamie Johnson jej2...@gmail.com wrote:
 Does anyone have details of how to generate a tag cloud of popular terms
 across an entire data set and then also across a query?



Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-17 Thread Alexey Serba
 Do you mean that we  have current Index as it is and have a separate core
 which  has only the user-id ,product-id relation and at while querying ,do a
 join between the two cores based on the user-id.
Exactly. You can index the user-id, product-id relation either into the same
core or into a different core on the same Solr instance.

 This would involve us to Index/delete the product  as and when the user
 subscription for a product changes ,This would involve some amount of
 latency if the Indexing (we have a queue system for Indexing across the
 various instances) or deletion is delayed
Right, but I'm not sure if it's possible to achieve good performance
requiring zero latency.

 IF we want to go ahead with this solution ,We currently are using solr 1.3
 , so  is this functionality available as a patch for solr 1.3?
No. AFAIK it's in trunk only.

 Would it be
 possible to  do with a separate Index  instead of a core ,then I can create
 only one  Index common for all our instances and then use this instance to
 do the join.
No, I don't think that's possible with the join feature. I guess that
would require a network request per search request, and the number of mapped
ids could be huge, so it could affect performance significantly.

 You'll need to be a bit careful using joins, as the performance hit
 can be significant if you have lots of cross-referencing to do, which
 I believe you would given your scenario.
As far as I understand, the join query builds a bitset filter which can
be cached in the filterCache, etc. The only performance impact I can think
of is that the user-product relations table could be too big to fit into a
single instance.


Re: Complex situation

2011-06-16 Thread Alexey Serba
Am I right that you are only interested in results / facets for the
current season? If so, then you can index the start/end dates as
separate number fields and build your search filters like this:
fq=+start_date_month:[* TO 6] +start_date_day:[* TO 17]
+end_date_month:[* TO 6] +end_date_day:[16 TO *] where 6/16 is the
current month/day.

On Thu, Jun 16, 2011 at 5:20 PM, roySolr royrutten1...@gmail.com wrote:
 Hello,

 First i will try to explain the situation:

 I have some companies with openinghours. Some companies has multiple seasons
 with different openinghours. I wil show some example data :

 Companyid          Startdate(d-m)  Enddate(d-m)     Openinghours_end
 1                        01-01                01-04                 17:00
 1                        01-04                01-08                 18:00
 1                        01-08                31-12                 17:30

 2                        01-01                31-12                 20:00

 3                        01-01                01-06                 17:00
 3                        01-06                31-12                 18:00

 What i want is some facets on the left site of my page. They have to look
 like this:

 Closing today on:
 17:00(23)
 18:00(2)
 20:00(1)

 So i need to get the NOW to know which openinghours(seasons) i need in my
 facet results. How should my index look like?
 Can anybody helps me how i can save this data in the solr index?





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Complex-situation-tp3071936p3071936.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-16 Thread Alexey Serba
 So a search for a product once the user logs in and searches for only the
 products that he has access to Will translate to something like this . ,the
 product ids are obtained form the db  for a particular user and can run
 into  n  number.

 search term fq=product_id(100 10001  ..n number)

 but we are currently running into too many Boolean expansion error .We are
 not able to tie the user also into roles as each user is mainly any one who
 comes to site and purchases a product .

I'm wondering if the new trunk Solr join functionality can help here.

* http://wiki.apache.org/solr/Join

In theory you can index your products (product_id, ...) and the
user-product many-to-many relation (user_product_id, user_id) into
single/different cores and then do a join, like
q=search terms&fq={!join from=product_id to=user_product_id}user_id:10101

But I haven't tried that, so I'm just speculating.


Re: Strange behavior

2011-06-16 Thread Alexey Serba
Have you stopped Solr before manually copying the data? This way you
can be sure that index is the same and you didn't have any new docs on
the fly.

2011/6/14 Denis Kuzmenok forward...@ukr.net:
 What  should  i provide, OS is the same, environment is the same, solr
 is  completely  copied,  searches  work,  except that one, and that is
 strange..

 I think you will need to provide more information than this, no-one on this 
 list is omniscient AFAIK.

 François

 On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:

 Hi.

 I've  debugged search on test machine, after copying to production server
 the  entire  directory  (entire solr directory), i've noticed that one
 query  (SDR  S70EE  K)  does  match  on  test  server, and does not on
 production.
 How can that be?








Re: Updating only one indexed field for all documents quickly.

2011-06-16 Thread Alexey Serba
 with the integer field. If you just want to influence the
 score, then just plain external field fields should work for
 you.

 Is this an appropriate solution, give our use case?

Yes, check out ExternalFileField

* http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.4
* 
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
* http://www.slideshare.net/greggdonovan/solr-lucene-etsy-by-gregg-donovan/28
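
A minimal schema sketch (field/type names are illustrative; the values live in an
external_<fieldname> file in the index data directory and are reloaded when a new searcher opens):

  <fieldType name="extPopularity" class="solr.ExternalFileField"
             keyField="id" defVal="0" valType="pfloat"/>
  <field name="popularity" type="extPopularity"/>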


Re: URGENT HELP: Improving Solr indexing time

2011-06-13 Thread Alexey Serba
<str name="Total Requests made to DataSource">16276</str>
...
 so I am doing a delta import of around 500,000 rows at a
 time.

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
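
The gist of that approach (a sketch, with hypothetical table/column names) is a single entity whose
query itself selects only the changed rows:

  <entity name="item"
          query="select * from item
                 where '${dataimporter.request.clean}' != 'false'
                    or last_modified > '${dataimporter.last_index_time}'">
    ...
  </entity>

and then running deltas as .../dataimport?command=full-import&clean=false, so there is no separate
delta query per row.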


Re: Need query help

2011-06-06 Thread Alexey Serba
See Tagging and excluding Filters section

* 
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
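
For example (using the field names from your message):

  q=*:*&fq={!tag=br}brand_id:(100 OR 150)&fq={!tag=f}filters:(p1s100 OR p4s20)
      &facet=true&facet.field={!ex=br}brand_id&facet.field={!ex=f}filters

This returns the matching ids plus, in one request, brand_id counts computed as if the brand filter
were not applied (and likewise for filters).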

2011/6/6 Denis Kuzmenok forward...@ukr.net:
 For now i have a collection with:
 id (int)
 price (double) multivalue
 brand_id (int)
 filters (string) multivalue

 I  need  to  get available brand_id, filters, price values and list of
 id's   for   current   query.  For  example now i'm doing queries with
 facet.field=brand_id/filters/price:
 1) to get current id's list: (brand_id:100 OR brand_id:150) AND 
 (filters:p1s100 OR filters:p4s20)
 2) to get available filters on selected properties (same properties but
 another  values):  (brand_id:100 OR brand_id:150) AND (filters:p1s* OR
 filters:p4s*)
 3) to get available brand_id (if any are selected, if none - take from
 1st query results): (filters:p1s100 OR filters:p4s20)
 4) another request to get available prices if any are selected

 Is there any way to simplify this task?
 Data needed:
 1) Id's for selected filters, price, brand_id
 2) Available filters, price, brand_id from selected values
 3) Another values for selected properties (is any chosen)
 4) Another brand_id for selected brand_id
 5) Another price for selected price

 Will appreciate any help or thoughts!

 Cheers,
 Denis Kuzmenok




Re: Solr memory consumption

2011-06-02 Thread Alexey Serba
 Commits  are  divided  into  2  groups:
 - often but small (last changed
 info)
1) Make sure that it's not too often and that you don't have a commit
overlapping problem.
http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F

2) You may also try to limit cache sizes and check if it helps.

3) If that doesn't help then try to monitor your app using jconsole
* try to hit the garbage collector and see if it frees some memory
* browse the Solr JMX attributes and see if there are any hints regarding
Solr cache usage, etc.

4) Try to run jmap -heap / jmap -histo and see if there are any hints there.

5) If none of above helps then you probably need to examine your
memory usage using some kind of java profiler tool (like yourkit
profiler)


 Size: 4 databases about 1G (sum), 1 database (with n-gram) for 21G..
 I  don't  know any other way to search for product names except n-gram
 =\
Isn't a standard text field with solr.WordDelimiterFilterFactory
(generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="1" splitOnCaseChange="1") during
indexing good enough? Alternatively, you might want to limit the min and max
ngram sizes, just to reduce your index size.
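
Roughly something like this for the index-time analyzer (field type name is arbitrary; untested):

  <fieldType name="text_product" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>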


Re: Solr memory consumption

2011-06-01 Thread Alexey Serba
Hey Denis,

* How big is your index in terms of number of documents and index size?
* Is it production system where you have many search requests?
* Is there any pattern for OOM errors? I.e. right after you start your
Solr app, after some search activity or specific Solr queries, etc?
* What are 1) cache settings 2) facets and sort-by fields 3) commit
frequency and warmup queries?
etc

Generally you might want to connect to your jvm using jconsole tool
and monitor your heap usage (and other JVM/Solr numbers)

* http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
* http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX

HTH,
Alexey

2011/6/1 Denis Kuzmenok forward...@ukr.net:
 There  were  no  parameters  at  all,  and java hitted out of memory
 almost  every day, then i tried to add parameters but nothing changed.
 Xms/Xmx  -  did  not solve the problem too. Now i try the MaxPermSize,
 because it's the last thing i didn't try yet :(


 Wednesday, June 1, 2011, 9:00:56 PM, you wrote:

 Could be related to your crazy high MaxPermSize like Marcus said.

 I'm no JVM tuning expert either. Few people are, it's confusing. So if
 you don't understand it either, why are you trying to throw in very
 non-standard parameters you don't understand?  Just start with whatever
 the Solr example jetty has, and only change things if you have a reason
 to (that you understand).

 On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:
 Overall  memory on server is 24G, and 24G of swap, mostly all the time
 swap  is  free and is not used at all, that's why no free swap sound
 strange to me..







Re: DIH render html entities

2011-06-01 Thread Alexey Serba
Maybe HTMLStripTransformer is what you are looking for.

* http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer
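
For example (entity/column names are illustrative):

  <entity name="doc" transformer="HTMLStripTransformer" query="select id, body from doc">
    <field column="body" name="body" stripHTML="true"/>
  </entity>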

On Tue, May 31, 2011 at 5:35 PM, Erick Erickson erickerick...@gmail.com wrote:
 Convert them to what? Individual fields in your docs? Text?

 If the former, you might get some joy from the XpathEntityProcessor.
 If you want to just strip the markup and index all the content you
 might get some joy from the various *html* analyzers listed here:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 Best
 Erick

 On Fri, May 27, 2011 at 5:19 AM, anass talby anass.ta...@gmail.com wrote:
 Sorry my question was not clear.
 when I get data from database, some field contains some html special chars,
 and what i want to do is just convert them automatically.

 On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Fri, May 27, 2011 at 3:50 PM, anass talby anass.ta...@gmail.com
 wrote:
  Is there any way to render html entities in DIH for a specific field?
 [...]

 This does not make too much sense: What do you mean by
 rendering HTML entities. DIH just indexes, so where would
 it render HTML to, even if it could?

 Please take a look at http://wiki.apache.org/solr/UsingMailingLists

 Regards,
 Gora




 --
       Anass




Re: Better Spellcheck

2011-06-01 Thread Alexey Serba
 I've tried to use a spellcheck dictionary built from my own content, but my
 content ends up having a lot of misspelled words so the spellcheck ends up
 being less than effective.
You can try to use sp.dictionary.threshold parameter to solve this problem
* http://wiki.apache.org/solr/SpellCheckerRequestHandler#sp.dictionary.threshold

 It also misses phrases. When someone
 searches for Untied States I would hope the spellcheck would suggest
 United States but it just recognizes that untied is a valid word and
 doesn't suggest any thing.
So you are asking about an auto-suggest component and not spellcheck,
right? These are two different use cases.

If you want auto suggest and you have some search logs for your system
then you can probably use the following solution:
* 
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

If you don't have significant search logs history and want to populate
your auto suggest dictionary from index or some text file you should
check
* http://wiki.apache.org/solr/Suggester


Re: Documents update

2011-06-01 Thread Alexey Serba
 Will it be slow if there are 3-5 million key/value rows?
AFAIK it shouldn't affect search time significantly, as Solr caches it
in memory after you reload the Solr core / issue a commit.

But obviously you need more memory, and the commit/reload will take more time.


Re: Indexing 20M documents from MySQL with DIH

2011-05-05 Thread Alexey Serba
{quote}
...
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was
unexpectedly lost.
   at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
   at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
   ... 22 more
Apr 21, 2011 3:53:28 AM
org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure

The last packet successfully received from the server was 128
milliseconds ago.  The last packet sent successfully to the server was
25,273,484 milliseconds ago.
...
{quote}

It could probably be because of autocommit / segment merging. You
could try to disable autocommit / increase mergeFactor
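
In solrconfig.xml that would be something along these lines (values are only an example):

  <!-- comment out <autoCommit> in <updateHandler> for the duration of the bulk import -->
  <mergeFactor>25</mergeFactor>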

{quote}
I've used sphinx in the past, which uses multiple queries to pull out
a subset of records ranged based on PrimaryKey, does Solr offer
functionality similar to this? It seems that once a Solr index gets to
a certain size, the indexing of a batch takes longer than MySQL's
net_write_timeout, so it kills the connection.
{quote}

I was thinking about some hackish solution to paginate results:

<entity name="pages" query="SELECT id FROM generate_series( (SELECT count(*) from source_table) / 1000 ) ... ">
  <entity name="records" query="SELECT * from source_table LIMIT 1000 OFFSET ${pages.id}*1000">
  </entity>
</entity>

Or something along those lines (you'd need to calculate the offset in the
pages query).

But unfortunately MySQL does not provide a generate_series function
(it's a Postgres function, and there are similar solutions for Oracle and
MS SQL).


On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow eph...@gmail.com wrote:
 Thank you everyone for your help. I ended up getting the index to work
 using the exact same config file on a (substantially) larger instance.

 On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 {{{A custom indexer, so that's a fairly common practice? So when you are
 dealing with these large indexes, do you try not to fully rebuild them
 when you can? It's not a nightly thing, but something to do in case of
 a disaster? Is there a difference in the performance of an index that
 was built all at once vs. one that has had delta inserts and updates
 applied over a period of months?}}}

 Is it a common practice? Like all of this, it depends. It's certainly
 easier to let DIH do the work. Sometimes DIH doesn't have all the
 capabilities necessary. Or as Chris said, in the case where you already
 have a system built up and it's easier to just grab the output from
 that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
 are just more comfortable with their own code...

 Do you try not to fully rebuild. It depends on how painful a full rebuild
 is. Some people just like the simplicity of starting over every 
 day/week/month.
 But you *have* to be able to rebuild your index in case of disaster, and
 a periodic full rebuild certainly keeps that process up to date.

 Is there a difference...delta inserts...updates...applied over months. Not
 if you do an optimize. When a document is deleted (or updated), it's only
 marked as deleted. The associated data is still in the index. Optimize will
 reclaim that space and compact the segments, perhaps down to one.
 But there's no real operational difference between a newly-rebuilt index
 and one that's been optimized. If you don't delete/update, there's not
 much reason to optimize either

 I'll leave the DIH to others..

 Best
 Erick

 On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow eph...@gmail.com wrote:
 Thanks for the e-mail. I probably should have provided more details,
 but I was more interested in making sure I was approaching the problem
 correctly (using DIH, with one big SELECT statement for millions of
 rows) instead of solving this specific problem. Here's a partial
 stacktrace from this specific problem:

 ...
 Caused by: java.io.EOFException: Can not read response from server.
 Expected to read 4 bytes, read 0 bytes before connection was
 unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
        ... 22 more
 Apr 21, 2011 3:53:28 AM
 org.apache.solr.handler.dataimport.EntityProcessorBase getNext
 SEVERE: getNext() failed for query 'REDACTED'
 org.apache.solr.handler.dataimport.DataImportHandlerException:
 com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
 Communications link failure

 The last packet successfully received from the server was 128
 milliseconds ago.  The last packet sent successfully to the server was
 25,273,484 milliseconds ago.
 ...


 A custom indexer, so that's a fairly common practice? So when you are
 dealing with these large indexes, do you try not to fully rebuild them
 when you can? It's not a 

Re: Solr performance issue

2011-03-22 Thread Alexey Serba
 Btw, I am monitoring output via jconsole with 8gb of ram and it still goes
 to 8gb every 20 seconds or so,
 gc runs, falls down to 1gb.

Hmm, the JVM filling 8GB every 20 seconds - that sounds like a lot.

Do you return all results (ids) for your queries? Any tricky
faceting/sorting/function queries?


Re: Custom scoring for searhing geographic objects

2010-12-19 Thread Alexey Serba
Hi Pavel,

I had a similar problem several years ago - I had to find
geographical locations in textual descriptions, geocode these objects
to lat/long during the indexing process, and allow users to filter/sort
search results by specific geographical areas. The important issue was
that there were several types of geographical objects - street < town
< region < country. The idea was to geocode to the most narrow
geographical area possible. The relevance logic in this case could be
specified as "find the most narrow result that is uniquely identified by
your text or search query". So I came up with a custom algorithm that
was quite good in terms of performance and precision/recall. Here's
a simple description:
* You can intersect all text/search-query terms with a locations
dictionary to find only geo terms.
* Search your locations Lucene index and filter only street objects
(the most narrow areas). Due to the tf*idf formula you'll get the most
relevant results. Then you need to post-process the top N (3/5/10) results
and verify that they are matches indeed. I intersected the search terms with
each result's terms and made another Lucene search to verify that these terms
uniquely identify the match. If so, return the matching street.
If there's no match, proceed using the same algorithm with towns,
regions, countries.

HTH,
Alexey

On Wed, Dec 15, 2010 at 6:28 PM, Pavel Minchenkov char...@gmail.com wrote:
 Hi,
 Please give me advise how to create custom scoring. I need to result that
 documents were in order, depending on how popular each term in the document
 (popular = how many times it appears in the index) and length of the
 document (less terms - higher in search results).

 For example, index contains following data:

 ID    | SEARCH_FIELD
 --
 1     | Russia
 2     | Russia, Moscow
 3     | Russia, Volgograd
 4     | Russia, Ivanovo
 5     | Russia, Ivanovo, Altayskaya street 45
 6     | Russia, Moscow, Kremlin
 7     | Russia, Moscow, Altayskaya street
 8     | Russia, Moscow, Altayskaya street 15
 9     | Russia, Moscow, Altayskaya street 15/26


 And I should get next results:


 Query                     | Document result set
 --
 Russia                    | 1,2,4,3,6,7,8,9,5
 Moscow                  | 2,6,7,8,9
 Ivanovo                    | 4,5
 Altayskaya              | 7,8,9,5

 In fact --- it is a search for geographic objects (cities, streets, houses).
 At the same time can be given only part of the address, and the results
 should appear the most relevant results.

 Thanks.
 --
 Pavel Minchenkov



Re: Dataimport performance

2010-12-19 Thread Alexey Serba
 With subquery and with left join:   320k in 6 Min 30
It's 820 records per second. That's _really_ impressive considering the
fact that DIH performs a separate SQL query for every record in your
case.

 So there's one track entity with an artist sub-entity. My (admittedly
 rather limited) experience has been that sub-entities, where you have
 to run a separate query for every row in the parent entity, really
 slow down data import.
Sub-entities slow down data import indeed. You can try to avoid a
separate query for every row by using CachedSqlEntityProcessor. There
are a couple of options - 1) you can load all sub-entity data in memory
or 2) you can reduce the number of SQL queries by caching sub-entity
data per id. There's no silver bullet and each option has its own pros
and cons.
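
For example, the artists sub-entity from your config (quoted below) could be cached roughly like
this (syntax per the CachedSqlEntityProcessor wiki page; untested):

  <entity name="artists" processor="CachedSqlEntityProcessor"
          query="select ta.track_id, a.name as artist from artist a
                 left join track_artist ta on (ta.artist_id = a.id)"
          where="track_id=track.id">
    <field column="artist" name="artists_t"/>
  </entity>

i.e. one query loads all artist rows once and subsequent lookups are served from the in-memory
cache keyed by track_id.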

Also Ephraim proposed a really neat solution with GROUP_CONCAT, but
I'm not sure that all RDBMS-es support that.


2010/12/15 Robert Gründler rob...@dubture.com:
 i've benchmarked the import already with 500k records, one time without the 
 artists subquery, and one time without the join in the main query:


 Without subquery: 500k in 3 min 30 sec

 Without join and without subquery: 500k in 2 min 30.

 With subquery and with left join:   320k in 6 Min 30


 so the joins / subqueries are definitely a bottleneck.

 How exactly did you implement the custom data import?

 In our case, we need to de-normalize the relations of the sql data for the 
 index,
 so i fear i can't really get rid of the join / subquery.


 -robert





 On Dec 15, 2010, at 15:43 , Tim Heckman wrote:

 2010/12/15 Robert Gründler rob...@dubture.com:
 The data-config.xml looks like this (only 1 entity):

       <entity name="track" query="select t.id as id, t.title as title,
  l.title as label from track t left join label l on (l.id = t.label_id)
  where t.deleted = 0" transformer="TemplateTransformer">
         <field column="title" name="title_t" />
         <field column="label" name="label_t" />
         <field column="id" name="sf_meta_id" />
         <field column="metaclass" template="Track" name="sf_meta_class"/>
         <field column="metaid" template="${track.id}" name="sf_meta_id"/>
         <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>

         <entity name="artists" query="select a.name as artist from artist a
  left join track_artist ta on (ta.artist_id = a.id) where
  ta.track_id=${track.id}">
           <field column="artist" name="artists_t" />
         </entity>

       </entity>

 So there's one track entity with an artist sub-entity. My (admittedly
 rather limited) experience has been that sub-entities, where you have
 to run a separate query for every row in the parent entity, really
 slow down data import. For my own purposes, I wrote a custom data
 import using SolrJ to improve the performance (from 3 hours to 10
 minutes).

 Just as a test, how long does it take if you comment out the artists entity?




Re: my index has 500 million docs ,how to improve so lr search performance?

2010-12-14 Thread Alexey Serba
How much memory do you allocate for the JVMs? Considering you have 10 JVMs
per server (10*N) you might not have enough memory left for the OS file
system cache (you need to keep some memory free for that).

 all indexs size is about 100G
Is this per server or the whole size?


On Mon, Nov 15, 2010 at 8:35 AM, lu.rongbin lu.rong...@goodhope.net wrote:

 In addition,my index has only two store fields, id and price, and other
 fields are index. I increase the document and query cache. the ec2
 m2.4xLarge instance is 8 cores, 68G memery. all indexs size is about 100G.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/my-index-has-500-million-docs-how-to-improve-solr-search-performance-tp1902595p1902869.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Newbie: Indexing unrelated MySQL tables

2010-12-14 Thread Alexey Serba
 I figured I would create three entities and relevant
 schema.xml entries in this way:

 dataimport.xml:
 <entity name="Users" query="select id,firstname,lastname from user"></entity>
 <entity name="Artwork" query="select id,user,name,description from artwork"></entity>
 <entity name="Jobs" query="select id,company,position,location,description from jobs"></entity>
That's correct. You can list several entities under the document element.
You can index them separately using the entity parameter (i.e. add
entity=Users to your full-import HTTP request). Do not forget to add
clean=false so you won't delete previously indexed documents. Or you
can index all entities in one request (the default).
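
For example (assuming DIH is registered at /dataimport):

  http://localhost:8983/solr/dataimport?command=full-import&entity=Users&clean=false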

 schema.xml:
 <field name="id" type="int" indexed="true" stored="true" required="true"/>
 <field name="firstname" type="string" indexed="true" stored="true"/>
 <field name="lastname" type="string" indexed="true" stored="true"/>
 <field name="user" type="int" indexed="true" stored="true"/>
 <field name="name" type="string" indexed="true" stored="true"/>
 <field name="description" type="text" indexed="true" stored="false"/>
 <field name="company" type="string" indexed="true" stored="true"/>
 <field name="position" type="string" indexed="true" stored="true"/>
 <field name="location" type="string" indexed="true" stored="false"/>
Why do you use string type for textual fields (description, company,
name, firstname, lastname, etc)? Is it intentional to use these fields
in filtering/faceting?

You can also add a default searchable multivalued field (type=text)
and copyField instructions to copy all textual content into this
field ( http://wiki.apache.org/solr/SchemaXml#Copy_Fields ). Thus you
will be able to search the default field for terms from all fields
(firstname, lastname, name, description, company, position, location,
etc).
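
For example (the catch-all field name "text" is arbitrary):

  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <copyField source="firstname" dest="text"/>
  <copyField source="lastname" dest="text"/>
  <copyField source="name" dest="text"/>
  <copyField source="description" dest="text"/>
  <copyField source="company" dest="text"/>
  <copyField source="position" dest="text"/>
  <copyField source="location" dest="text"/>

and make "text" the default search field in schema.xml.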

You would probably want to add a field "type" with values user/artwork/job. You
will be able to facet/filter on that field and provide a better user search
experience.

 This obviously does not work as I want. I only get results from the users
 table, and I cannot get results from neither artwork nor jobs.
Are you sure that this is because the indexing isn't working? How do
you search for your data? What query parser (standard/dismax)/etc?

 I have
 found out that the possible solution is in putting field tags in the
 entity tag and somehow aliasing column names for Solr, but the logic
 behind this is completely alien to me and the blind tests I tried did not
 yield anything.
You don't need to list your fields explicitly in the fields declaration.

BTW, what database do you use? Oracle has an issue with upper-casing
column names that could be a problem.

 My logic says that the id field is getting replaced by the
 id field of other entities and indexes are being overwritten.
Are your ids unique across different objects? I.e. is there any job
with the same id as a user? If so, then you would probably want to prefix
your ids like:

entity name=Users query=select ('user_' || id) as
id,firstname,lastname from user/entity
entity name=Artwork query=select ('artwork_' || id) as
id,user,name,description from artwork/entity


 But if I
 aliased all id fields in all entities into something else, such as
 user_id and job_id, I couldn't figure what to put in the primaryKey
 configuration in schema.xml because I have three different id fields from
 three different tables that are all primary keyed in the database!
You can still create separate id fields if you need to search for
different objects by id and don't mess with prefixed ids. But it's not
required.

HTH,
Alexey


Re: Query performance very slow even after autowarming

2010-12-06 Thread Alexey Serba
* Do you use EdgeNGramFilter in the index analyzer only, or do you also use
it on the query side?

* What if you create an additional field first_letter (string) and put
the first character/characters (multivalued?) there in your external
processing code? Then during search you can filter all documents
that start with the letter a using an fq=first_letter:a filter query
(see the sketch after this list). Would that solve
your performance problems?

* It makes sense to specify what you are trying to achieve, so that
more people can probably help you with that.
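
A rough sketch of the first_letter idea, assuming you populate the field
yourself at index time (names are illustrative):

  schema.xml:
  <field name="first_letter" type="string" indexed="true" stored="false" multiValued="true"/>

  query:
  http://localhost:8983/solr/select?q=*:*&fq=first_letter:m&rows=10

Filter queries are cached in the filterCache, so repeated per-letter requests
should be cheap after the first hit.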

On Fri, Dec 3, 2010 at 10:47 AM, johnnyisrael johnnyi.john...@gmail.com wrote:

 Hi,

 I am using edgeNgramFilterfactory on SOLR 1.4.1 [filter
 class=solr.EdgeNGramFilterFactory maxGramSize=100 minGramSize=1 /]
 for my indexing.

 Each document will have about 5 fields in it and only one field is indexed
 with EdgeNGramFilterFactory.

 I have about 1.4 million documents in my index now and my index size is
 approx 296MB.

 I made the field that is indexed with EdgeNGramFilterFactory as default
 search field. All my query responses are very slow, some of them taking more
 than 10seconds to respond.

 All my query responses are very slow, Queries with single letters are still
 very slow.

 /select/?q=m

 So I tried query warming as follows.

 listener event=newSearcher class=solr.QuerySenderListener
      arr name=queries
        lststr name=qa/str/lst
        lststr name=qb/str/lst
        lststr name=qc/str/lst
        lststr name=qd/str/lst
        lststr name=qe/str/lst
        lststr name=qf/str/lst
        lststr name=qg/str/lst
        lststr name=qh/str/lst
        lststr name=qi/str/lst
        lststr name=qj/str/lst
        lststr name=qk/str/lst
        lststr name=ql/str/lst
        lststr name=qm/str/lst
        lststr name=qn/str/lst
        lststr name=qo/str/lst
        lststr name=qp/str/lst
        lststr name=qq/str/lst
        lststr name=qr/str/lst
        lststr name=qs/str/lst
        lststr name=qt/str/lst
        lststr name=qu/str/lst
        lststr name=qv/str/lst
        lststr name=qw/str/lst
        lststr name=qx/str/lst
        lststr name=qy/str/lst
        lststr name=qz/str/lst
      /arr
 /listener

 The same above is done for firstSearcher as well.

 My cache settings are as follows.

 filterCache
      class=solr.LRUCache
      size=16384
      initialSize=4096
 autowarmCount=4096/

 queryResultCache
      class=solr.LRUCache
      size=16384
      initialSize=4096
 autowarmCount=1024/

 documentCache
      class=solr.LRUCache
      size=16384
      initialSize=16384
 /

 Still after query warming, few single character search is taking up to 3
 seconds to respond.

 Am i doing anything wrong in my cache setting or autowarm setting or am i
 missing anything here?

 Thanks,

 Johnny
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Query-performance-very-slow-even-after-autowarming-tp2010384p2010384.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: dataimports response returns before done?

2010-12-06 Thread Alexey Serba
 After issueing a dataimport, I've noticed solr returns a response prior to 
 finishing the import. Is this correct?   Is there anyway i can make solr not 
 return until it finishes?
Yes, you can add synchronous=true to your request. But be aware that
it could take a long time and you may see an HTTP timeout exception.

 If not, how do I ping for the status whether it finished or not?
See command=status


On Fri, Dec 3, 2010 at 8:55 PM, Tri Nguyen tringuye...@yahoo.com wrote:
 Hi,

 After issueing a dataimport, I've noticed solr returns a response prior to 
 finishing the import. Is this correct?   Is there anyway i can make solr not 
 return until it finishes?

 If not, how do I ping for the status whether it finished or not?

 thanks,

 tri


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Alexey Serba
Hey Juan,

It seems that DataImportHandler is not the right tool for your scenario
and you'd be better off using the Solr XML update protocol.
* http://wiki.apache.org/solr/UpdateXmlMessages
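
A minimal sketch of such an update message (field names are illustrative);
POST it to /update and follow it with a <commit/>:

  <add>
    <doc>
      <field name="id">item_123</field>
      <field name="state">done</field>
    </doc>
  </add>

This way the GUI can refresh as soon as the update/commit round trip returns,
without waiting for DIH.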

You can still work around your outdated GUI view problem by calling
DIH synchronously, i.e. adding synchronous=true to your request. But it
won't solve the problem of two parallel requests from two users to a
single DIH request handler, because DIH doesn't support that, and if a
previous request is still running it bounces the second request.

HTH,
Alex



On Fri, Dec 3, 2010 at 10:33 PM, Juan Manuel Alvarez naici...@gmail.com wrote:
 Hello everyone! I would like to ask you a question about DIH.

 I am using a database and DIH to sync against Solr, and a GUI to
 display and operate on the items retrieved from Solr.
 When I change the state of an item through the GUI, the following happens:
 a. The item is updated in the DB.
 b. A delta-import command is fired to sync the DB with Solr.
 c. The GUI is refreshed by making a query to Solr.

 My problem comes between (b) and (c). The delta-import operation is
 executed in a new thread, so my call returns immediately, refreshing
 the GUI before the Solr index is updated causing the item state in the
 GUI to be outdated.

 I had two ideas so far:
 1. Querying the status of the DIH after the delta-import operation and
 do not return until it is idle. The problem I see with this is that
 if other users execute delta-imports, the status will be busy until
 all operations are finished.
 2. Use Zoie. The first problem is that configuring it is not as
 straightforward as it seems, so I don't want to spend more time trying
 it until I am sure that this will solve my issue. On the other hand, I
 think that I may suffer the same problem since the delta-import is
 still firing in another thread, so I can't be sure it will be called
 fast enough.

 Am I pointing on the right direction or is there another way to
 achieve my goal?

 Thanks in advance!
 Juan M.



Re: DIH - rdbms to index confusion

2010-12-06 Thread Alexey Serba
 I have a table that contains the data values I'm wanting to return when
 someone makes a search.  This table has, in addition to the data values, 3
 id's (FKs) pointing to the data/info that I'm wanting the users to be able
 to search on (while also returning the data values).

 The general rdbms query would be something like:
 select f.value, g.gar_name, c.cat_name from foo f, gar g, cat c, dub d
 where g.id=f.gar_id
 and c.id=f.cat_id
 and d.id=f.dub_id

You can put this general rdbms query as is into a single DIH entity - no
need to split it.
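
For example, a single-entity sketch wrapping your query (dataSource/document
wrappers omitted; the explicit field mappings are optional if the column names
match your schema):

  <entity name="foo"
          query="select f.value, g.gar_name, c.cat_name
                 from foo f, gar g, cat c, dub d
                 where g.id=f.gar_id and c.id=f.cat_id and d.id=f.dub_id">
    <field column="value"    name="value"/>
    <field column="gar_name" name="gar_name"/>
    <field column="cat_name" name="cat_name"/>
  </entity>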

You would probably want to split it only if your main table has a one-to-many
relation with other tables, so that you can't retrieve all the data with a
single result set row per Solr document.


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Alexey Serba
 When you say two parallel requests from two users to single DIH
 request handler, what do you mean by request handler?
I mean DIH.

 Are you
 refering to the HTTP request? Would that mean that if I make the
 request from different HTTP sessions it would work?
No.

It means that when two users simultaneously change two
objects in the UI, you get two HTTP requests to DIH to pull
changes from the db into the Solr index. If the second request comes while
the first is not yet fully processed, the second request will be
rejected. As a result your index would be outdated (w/o the latest
update) until the next update.


Re: using DIH with mets/alto file sets

2010-11-26 Thread Alexey Serba
 The idea is to create a full text index of the alto content, accompanied by 
 the author/title info from the mets file for purposes of results display.

- Then you need to list only alto files in your landscapes entity
(fileName=^ID.{3}-ALTO\d{3}.xml$ or something like that), because
you don't want to index every mets file as a separate solr document,
right?

- Also it seems you might want to try adding a regex transformer that
extracts the ID from the alto file name
   field column=metsId regex=ID(.{3})-ALTO\d{3}.xml
sourceColName=${landscapes.fileAbsolutePath} or fileAbsolutePath/

- And finally add nested entity to process mets file for every alto record

entity name=landscapes ...
  entity name=sample
entity name=metsProcessor
url=${landscapes.fileAbsolutePath}../ID${sample.metsId}-mets.xml
processor=XPathEntityProcessor forEach=/mets
transformer=TemplateTransformer,RegexTransformer,LogTransformer
and extract mets elements/attributes and index them as a separate fields.

P.S. I haven't tried a similar scenario, so I'm just speculating

On Fri, Nov 19, 2010 at 12:09 AM, Fred Gilmore fgilm...@mail.utexas.edu wrote:
 mets/alto is an xml standard for describing physical objects.  In this case,
 we're describing books.  The mets file holds the metadata (author, title,
 etc.), the alto file is the physical description (words on the page,
 formatting of the page).  So it's a one (mets) to many (alto) relationship.

 the directory structure:

 /our/collection/IDxxx/:

 IDxxx-mets.xml
 ALTO/

 /our/collection/IDxxx/ALTO/:

 IDxxx-ALTO001.xml
 IDxxx-ALTO002.xml

 ie. an xml file per scanned book page.

 Beyond the ID number as part of the file names, the mets file contains no
 reference to the alto children.  The alto children do contain a reference to
 the jpg page scan, which is labelled with the ID number as part of the name.

 The idea is to create a full text index of the alto content, accompanied by
 the author/title info from the mets file for purposes of results display.
  The first try with this is attempting a recursive FileDataSource approach.

 It was relatively easy to create a content field which holds the text of
 the page (each word is actually an attribute of a separate tag), but I'm
 having difficulty determining how I'm going to conditionally add the author
 and title data from the METS file to the rows created with the ALTO content
 field.  It'll involve regex'ing out the ID number associated with both the
 mets and alto filenames for starters, but even at that, I don't see how to
 keep it straight since it's not one mets=one alto and it's also not a static
 string for the entire index.

 thanks for any hints you can provide.

 Fred
 University of Texas at Austin
 ==
 data-config.xml thus far:

 dataConfig
 dataSource type=FileDataSource /
 document
 entity name=landscapes rootEntity=false
 processor=FileListEntityProcessor fileName=.xml$ recursive=true
 baseDir=/home/utlol/htdocs/lib-landscapes-new/publications/
 entity name=sample rootEntity=true
 stream=true
 pk=filename
 url=${landscapes.fileAbsolutePath}
 processor=XPathEntityProcessor
 forEach=/mets | /alto
 transformer=TemplateTransformer,RegexTransformer,LogTransformer
 logTemplate= processing ${landscapes.fileAbsolutePath}
 logLevel=info


 !-- use system filename for getting OCLC number --
 !-- we need it both for linking to results and for referencing the METS
 file --
 field column=fileAbsPath     template=${landscapes.fileAbsolutePath} /


 field column=title
 xpath=/mets/dmdSec/mdWrap/xmlData/mods/titleInfo/title /
 !--
 field column=author
  xpath=/mets/dmdSec/mdWrap/xmlData/mods/na...@id='MODSMD_PRINT_N1']/namepa...@type='given']
 /
 --
 field column=filename
 xpath=/alto/Description/sourceImageInformation/fileName /
 field column=content
 xpath=/alto/Layout/Page/PrintSpace/TextBlock/TextLine/String/@CONTENT /
 /entity
 /entity
 /document
 /dataConfig
 ==
 METS example:

 ?xml version=1.0 encoding=UTF-8?
 mets xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance;
 xmlns=http://www.loc.gov/METS/;
 xsi:schemaLocation=http://www.loc.gov/METS/
 http://schema.ccs-gmbh.com/docworks/version20/mets-docworks.xsd;
 xmlns:MODS=http://www.loc.gov/mods/v3; xmlns:mix=http://www.loc.gov/mix/;
 xmlns:xlink=http://www.w3.org/1999/xlink; TYPE=METAe_Monograph
 LABEL=ENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE- Kingsville
 Area
 metsHdr CREATEDATE=2010-05-06T11:21:18 LASTMODDATE=2010-05-06T11:21:18
 agent ROLE=CREATOR TYPE=OTHER OTHERTYPE=SOFTWARE
 nameCCS docWORKS/METAe Version 6.3-0/name
 notedocWORKS-ID: 1677/note
 /agent
 /metsHdr
 dmdSec ID=MODSMD_PRINT
 mdWrap MIMETYPE=text/xml MDTYPE=MODS LABEL=Bibliographic meta-data of
 the printed version
 xmlData
 MODS:mods
 MODS:titleInfo ID=MODSMD_PRINT_TI1 xml:lang=en
 MODS:titleENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE-
 Kingsville Area/MODS:title
 /MODS:titleInfo
 MODS:name ID=MODSMD_PRINT_N1 

Re: Basic Solr Configurations and best practice

2010-11-26 Thread Alexey Serba
 1-      How to combine data from DIH and content extracted from file system
 document into one document in the index?
http://wiki.apache.org/solr/TikaEntityProcessor
You can have one SQL entity that retrieves metadata from the database and
a nested entity that parses the binary file into additional fields
in the document.
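
A rough sketch of that layout, assuming the files live on disk and you run a
build that includes TikaEntityProcessor (paths and column names are made up):

  <dataSource name="db"  type="JdbcDataSource" driver="..." url="..." user="..." password="..."/>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="meta" dataSource="db"
            query="select id, title, file_path, permissions from docs">
      <entity name="content" dataSource="bin" processor="TikaEntityProcessor"
              url="${meta.file_path}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>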

 2-      Should I move the per-user permissions into a separate index? What
 technique to implement?
I would start with keeping permissions in the same index as the actual content.


On Tue, Nov 23, 2010 at 11:35 AM, Darx Oman darxo...@gmail.com wrote:
 Hi guys

 I'm kind of new to solr and I'm wondering how to configure solr to best
 fulfills my requirements.

 Requirements are as follow:

 I have 2 data sources: database and file system documents. Every document in
 the file system has related information stored in the database.  Both the
 file content and the related database fields must be indexed.  Along with
 the DB data is per-user permissions for every document.  I'm using DIH for
 the DB and Tika for the file System.  The documents contents nearly never
 change, while the DB data especially the permissions changes very
 frequently. Total number of documents roughly around 2M and each document is
 about 500KB.

 1-      How to combine data from DIH and content extracted from file system
 document into one document in the index?

 2-      Should I move the per-user permissions into a separate index? What
 technique to implement?



Re: DIH delta, deltaQuery

2010-11-26 Thread Alexey Serba
Are you sure that it's the deltaQuery that's taking a minute? It only
retrieves the ids of updated records, and then deltaImportQuery is executed
N times, once for each id. You might want to try the following
technique - http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
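
Roughly, that technique replaces deltaQuery/deltaImportQuery with one
parameterized query and a clean=false full-import, e.g. (a sketch, column
names taken from your config):

  <entity name="sessions"
          query="select * from sessions
                 where '${dataimporter.request.clean}' != 'false'
                    or modified &gt; '${dataimporter.last_index_time}'">
  ...
  </entity>

and then call:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false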

On Wed, Nov 24, 2010 at 3:06 PM, stockii st...@shopgate.com wrote:

 Hello.

 i wonder why this deltaQuery takes over a minute:

 deltaQuery=SELECT id FROM sessions
                WHERE created BETWEEN DATE_ADD( NOW(), INTERVAL - 1 HOUR ) AND 
 NOW()
                OR modified BETWEEN '${dataimporter.sessions 
 .last_index_time}' AND
 DATE_ADD( NOW(), INTERVAL - 1 HOUR  ) 

 the database have only 700 Entries and the compare with modified takes so
 long !!? when i remove the modified compare its fast.

 when i put this query in my mysql database the query need 0.0014 seconds
 ... wha is it so slow?
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DIH-delta-deltaQuery-tp1960246p1960246.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Searching with wrong keyboard layout or using translit

2010-10-31 Thread Alexey Serba
Another approach to this problem is to use a separate Solr core for
storing user queries for auto-complete functionality ( see
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
) and index not only the user_query field but also transliterated and
wrong-layout versions, then use the dismax query parser to search suggestions
in all those fields.

This solution is only viable if you have a huge log of user queries
( which I believe Google does ).

HTH,
Alex



2010/10/29 Alexander Kanarsky kanarsky2...@gmail.com:
 Pavel,

 it depends on size of your documents corpus, complexity and types of
 the queries you plan to use etc. I would recommend you to search for
 the discussions on synonyms expansion in Lucene (index time vs. query
 time tradeoffs etc.) since your problem is quite similar to that
 (think Moskva vs. Moskwa). Unless you have a small corpus, I would go
 with the second approach and expand the terms during the query time.
 However, the first approach might be useful, too: say, you may want to
 boost the score for the documents that naturally contain the word
 'Moskva', so such a documents will be at the top of the result list.
 Having both forms indexed will allow you to achieve this easily by
 utilizing Solr's dismax query (to boost the results from the field
 with the original terms):
 http://localhost:8983/solr/select/?q=MoskvadefType=dismaxqf=text^10.0+text_translit^0.1
 ('text' field has the original Cyrillic tokens, 'text_translit' is for
 transliterated ones)

 -Alexander


 2010/10/28 Pavel Minchenkov char...@gmail.com:
 Alexander,

 Thanks,
 Which variant has better performance?


 2010/10/28 Alexander Kanarsky kanarsky2...@gmail.com

 Pavel,

 I think there is no single way to implement this. Some ideas that
 might be helpful:

 1. Consider adding additional terms while indexing. This assumes
 conversion of Russian text to both translit and wrong keyboard
 forms and index converted terms along with original terms (i.e. your
 Analyzer/Filter should produce Moskva and Vjcrdf for term Москва). You
 may re-use the same field (if you plan for a simple term queries) or
 create a separate fields for the generated terms (better for phrase,
 proximity queries etc. since it keeps the original text positional
 info). Then the query could use any of these forms to fetch the
 document. If you use separate fields, you'll need to expand/create
 your query to search for them, of course.
 2. If you have to index just an original Russian text, you might
 generate all term forms while analyzing the query, then you could
 treat the converted terms as a synonyms and use the combination of
 TermQuery for all term forms or the MultiPhraseQuery for the phrases.
 For Solr in this case you probably will need to add a custom filter
 similar to SynonymFilter.

 Hope this helps,
 -Alexander

 On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov char...@gmail.com
 wrote:
  Hi,
 
  When I'm trying to search Google with wrong keyboard layout -- it
 corrects
  my query, example: http://www.google.ru/search?q=vjcrdf (I typed word
  Moscow in Russian but in English keyboard layout).
  http://www.google.ru/search?q=vjcrdfAlso, when I'm searching using
  translit, It does the same: http://www.google.ru/search?q=moskva
 
  What is the right way to implement this feature in Solr?
 
  --
  Pavel Minchenkov
 




 --
 Pavel Minchenkov




Re: problem on running fullimport

2010-10-24 Thread Alexey Serba
 Caused by: java.sql.SQLException: Illegal value for setFetchSize(). 

Try to add batchSize=-1 to your data source declaration

http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
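
I.e. something along these lines (other attributes kept as in your config):

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql:///xxx" user="xxx" password="xx" batchSize="-1"/>

With batchSize="-1" DIH calls setFetchSize(Integer.MIN_VALUE), which tells the
MySQL driver to stream rows instead of buffering the whole result set.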

On Fri, Oct 15, 2010 at 3:42 PM, swapnil dubey swapnil.du...@gmail.com wrote:
 Hi,

 I am using the full import option with the data-config file as mentioned
 below

 dataConfig
   dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
 url=jdbc:mysql:///xxx user=xxx password=xx  /
    document 
            entity name=yyy query=select studentName from test1
            field column=studentName name=studentName /
            /entity
    /document
 /dataConfig


 on running the full-import option I am getting the error mentioned below.I
 had already included the dataimport.properties file in my conf file.help me
 to get the issue resolved

 response
 -
 lst name=responseHeader
 int name=status0/int
 int name=QTime334/int
 /lst
 -
 lst name=initArgs
 -
 lst name=defaults
 str name=configdata-config.xml/str
 /lst
 /lst
 str name=commandfull-import/str
 str name=modedebug/str
 null name=documents/
 -
 lst name=verbose-output
 -
 lst name=entity:test1
 -
 lst name=document#1
 str name=queryselect studentName from test1/str
 -
 str name=EXCEPTION
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 execute query: select studentName from test1 Processing Document # 1
    at
 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
    at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
    at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
    at
 org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:184)
    at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
    at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
    at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
    at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
    at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
    at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
    at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
    at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
    at
 org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:203)
    at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 Caused by: java.sql.SQLException: Illegal value for setFetchSize().
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:989)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:984)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:929)
    at 

Re: DataImportHandler dynamic fields clarification

2010-10-13 Thread Alexey Serba
Harry, could you please file a JIRA for this and I'll address it in
a patch. I fixed a related issue (SOLR-2102) and I think it's pretty
similar.

 Interesting, I was under the impression that case does not matter.

 From http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config :
 It is possible to totally avoid the field entries in entities if the names
 of the fields are same (case does not matter) as those in Solr schema

Yeah, case does not matter only for explicit mappings of SQL columns to
Solr fields. The reason is that DIH populates the hash map used for
case-insensitive matching only for explicit mappings.

You can also work around the upper-case column names in Oracle using
the following SQL aliasing:
=
data-config.xml
entity name=item query=select column_1 as "column_1",
column_100 as "column_100" from wide_table
/entity

schema.xml
dynamicField name=column_*  type=string  indexed=true  stored=true
multiValued=true /
=

HTH,
Alexey


On Thu, Sep 30, 2010 at 9:10 PM, harrysmith harrysmith...@gmail.com wrote:


Two things, one are your DB column uppercase as this would effect the out.



 Interesting, I was under the impression that case does not matter.

 From http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config :
 It is possible to totally avoid the field entries in entities if the names
 of the fields are same (case does not matter) as those in Solr schema

 I confirmed that matching the schema.xml field case to the database table is
 needed for dynamic fields, and the wiki statement above is incorrect, or at
 the very least confusing, possibly a bug.

 My database is Oracle 10g and the column names have been created in all
 uppercase in the database.

 In Oracle:
 Table name: wide_table
 Column names: COLUMN_1 ... COLUMN_100 (yes, uppercase)

 Please see following scenarios and results I found:

 data-config.xml
 entity name=item query=select column_1,column_100 from wide_table
 field column=column_100 name=id/
 /entity

 schema.xml
 dynamicField name=column_*  type=string  indexed=true  stored=true
 multiValued=true /

 Result:
 Nothing Imported

 =

 data-config.xml
 entity name=item query=select COLUMN_1,COLUMN_100 from wide_table
 field column=column_100 name=id/
 /entity

 schema.xml
 dynamicField name=column_*  type=string  indexed=true  stored=true
 multiValued=true /

 Result:
 Note query column names changed to uppercase.
 Nothing Imported

 =


 data-config.xml
 entity name=item query=select column_1,column_100 from wide_table
 field column=COLUMN_100 name=id/
 /entity

 schema.xml
 dynamicField name=column_*  type=string  indexed=true  stored=true
 multiValued=true /

 Result:
 Note ONLY the field entry was changed to caps

 All records imported, with only COLUMN_100 id field.

 

 data-config.xml
 entity name=item query=select column_1,column_100 from wide_table
 field column=COLUMN_100 name=id/
 /entity

 schema.xml
 dynamicField name=COLUMN_*  type=string  indexed=true  stored=true
 multiValued=true /

 Result:
 Note BOTH the field entry was changed to caps in data-config.xml, and the
 dynamicField wildcard in schema.xml

 All records imported, with all fields specified. This is the behavior
 desired.

 =




















Second what does your db-data-config.xml look like



 The relevant data-config.xml is as follows:

 document name=
 entity name=item query=select COLUMN_1,COLUMN_100 from wide_table
  field column=COLUMN_100 name=id/
 /entity
 /document

 Ideally, I would rather have the query be 'select * from wide_table with
 the fields being dynamically matched by the column name from the
 dynamicField wildcard from the schema.xml.

 dynamicField name=COLUMN_*  type=string  indexed=true stored=true/


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DataImportHandler-dynamic-fields-clarification-tp1606159p1609578.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Delta Import with something other than Date

2010-09-10 Thread Alexey Serba
 Can you provide a sample of passing the parameter via URL? And how using it 
 would look in the data-config.xml
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
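
For example, a hypothetical sketch using a numeric watermark passed in the request:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&lastId=12345

  <entity name="item"
          query="select * from item where id &gt; ${dataimporter.request.lastId}">
  ...
  </entity>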


Re: Solr is indexing jdbc properties

2010-09-06 Thread Alexey Serba
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

Try to add convertType attribute to dataSource declaration, i.e.
 dataSource type=JdbcDataSource
  name=mssqlDatasource
  driver=net.sourceforge.jtds.jdbc.Driver
  url=jdbc:jtds:sqlserver://{db.host}:1433/{db};instance=SQLEXPRESS
  user={username}
  password={password}

  convertType=true
/

HTH,
Alex

On Mon, Sep 6, 2010 at 5:49 PM, savvas.andreas
savvas.andreas.moysi...@googlemail.com wrote:

 Hello,

 I am trying to index some data stored in an SQL Server database through DIH.
 My setup in data-config.xml is the following:

 dataConfig
  dataSource type=JdbcDataSource
                          name=mssqlDatasource
              driver=net.sourceforge.jtds.jdbc.Driver

 url=jdbc:jtds:sqlserver://{db.host}:1433/{db};instance=SQLEXPRESS
              user={username}
              password={password}/
  document
    entity name=id
                        dataSource=mssqlDatasource
            query=select id,
                        title
                        from WORK
                        field column=id name=id /
                        field column=title name=title /
    /entity
  /document
 /dataConfig

 However, when I run the indexer (invoking
 http://127.0.0.1:8983/solr/admin/dataimport.jsp?handler=/dataimport) I get
 all the rows in my index but with incorrect data indexed.

 More specifically, by examining the top 10 terms for the title field I get:

 term    frequency
 impl    1241371
 jdbc    1241371
 net     1241371
 sourceforg      1241371
 jtds    1241371
 clob    1241371
 netsourceforgejtdsjdbcclobimpl  1186981
 c       185070
 a       179901
 e       160759

 which is clearly wrong..Does anybody know why Solr is indexing the jdbc
 properties instead of the actual data?

 Any pointers would be much appreciated.

 Thank you very much.
 -- Savvas
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-is-indexing-jdbc-properties-tp1426473p1426473.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH: Rows fetch OK, Total Documents Failed??

2010-08-10 Thread Alexey Serba
Do you have any required fields or uniqueKey in your schema.xml? Do
you provide values for all these fields?

AFAIU you don't need commonField attribute for id and title fields. I
don't think that's your problem but anyway...


On Sat, Jul 31, 2010 at 11:29 AM,  scr...@asia.com wrote:

  Hi,

 I'm a bit lost with this, I'm trying to import a new XML via DIH, all rows are
 fetched but no documents are indexed? I don't find any log or error?

 Any ideas?

 Here is the STATUS:


 str name=commandstatus/str
 str name=statusidle/str
 str name=importResponse/
 lst name=statusMessages
 str name=Total Requests made to DataSource1/str
 str name=Total Rows Fetched7554/str
 str name=Total Documents Skipped0/str
 str name=Full Dump Started2010-07-31 10:14:33/str
 str name=Total Documents Processed0/str
 str name=Total Documents Failed7554/str
 str name=Time taken 0:0:4.720/str
 /lst


 My xml file looks like this:

 ?xml version=1.0 encoding=UTF-8?
 products
    product
        titleMoniteur VG1930wm 19 LCD Viewsonic/title
        
 urlhttp://x.com/abc?a(12073231)p(2822679)prod(89042332277)ttid(5)url(http%3A%2F%2Fwww.ffdsssd.com%2Fproductinformation%2F%7E66297%7E%2Fproduct.htm%26sender%3D2003)/url
        contentMoniteur VG1930wm 19  LCD Viewsonic VG1930WM/content
        price247.57/price
        categoryEcrans/category
    /product
 etc...

 and my dataconfig:

 dataConfig
        dataSource type=URLDataSource /
        document
                entity name=products
                        url=file:///home/john/Desktop/src.xml
                        processor=XPathEntityProcessor
                        forEach=/products/product
                        transformer=DateFormatTransformer

                        field column=id      xpath=/products/product/url  
  commonField=true /
                        field column=title   
 xpath=/products/product/title commonField=true /
                        field column=category  
 xpath=/products/product/category /
                        field column=content  
 xpath=/products/product/content /
                        field column=price      
 xpath=/products/product/price /

                /entity
        /document
 /dataConfig







Re: Implementing lookups while importing data

2010-08-10 Thread Alexey Serba
 We are currently doing this via a JOIN on the numeric
 field, between the main data table and the lookup table, but this
 dramatically slows down indexing.
I believe a SQL JOIN is the fastest and easiest way in your case (in
comparison with a nested entity, even one using CachedSqlEntityProcessor). You
probably don't have proper indexes in your database - check the SQL query
plan.


Re: DIH and multivariable fields problems

2010-08-10 Thread Alexey Serba
 Have others successfully imported dynamic multivalued fields in a
 child entity using the DataImportHandler via the child entity returning
 multiple records through a RDBMS?
Yes, it's working ok with static fields.

I didn't even know that it's possible to use variables in field names
( dynamic names ) in DIH configuration. This use case is quite
unusual.

 This is increasingly more looking like a bug. To recap, I am trying to use
 the DIH to import multivalued dynamic fields and using a variable to name
 that field.
I'm not an expert in the DIH source code, but it seems there's special
processing of dynamic fields that prevents handling the field type (and
the multiValued attribute). Specifically, there's a conditional jump
(continue) over the field type detection code in the case of a dynamic field
name ( see DataImporter:initEntity ). I guess the reason for such
behavior is that you can't determine the field type based on a dynamic field
name (${variable}_s) at that time (configuration parsing). I'm
wondering if it's possible to determine field types at runtime (when
the actual field name, e.g. title_s, is resolved).

I encountered a similar problem with implicit sql_column - solr_field
mapping using SqlEntityProcessor, i.e. when you select some columns
and do not explicitly list all of them as field entries in your
configuration. In this case field type detection doesn't work either.
I think that moving the type detection process to runtime would solve
that problem as well. Am I missing something obvious that prevents us
from doing field type detection at runtime?

Alex

On Tue, Aug 10, 2010 at 4:20 AM, harrysmith harrysmith...@gmail.com wrote:

 This is increasingly more looking like a bug. To recap, I am trying to use
 the DIH to import multivalued dynamic fields and using a variable to name
 that field.

 Upon further testing, the multivalued import works fine with a
 static/constant name, but only keeps the first record when naming the field
 dynamically. See below for relevant snips.

 From schema.xml :
 dynamicField name=*_s  type=string  indexed=true  stored=true
 multiValued=true /

 From data-config.xml :

 entity name=terms query=select distinct CORE_DESC_TERM from metadata
 where item_id=${item.DIVID_PK}
 entity name=metadata query=select * from metadata where
 item_id=${item.DIVID_PK} AND core_desc_term='${terms.CORE_DESC_TERM}' 
 field name=metadata_record_s column=TEXT_VALUE /
 /entity
 /entity

 
 Produces the following, note that there are 3 records that should be
 returned and are correctly done, with the field name being a constant.

 - result name=response numFound=1 start=0
 - doc
  str name=id9892962/str
 - arr name=metadata_record_s
  strrecord 1/str
  strrecord 2/str
  strrecord 3/str
  strPolygraph Newsletter Title/str
  /arr
 - arr name=title
  strPolygraph Newsletter Title/str
  /arr
  /doc
  /result

 ===

 Now, changing the field name to a variable..., note only the first record is
 retained for the 'Relation_s' field -- there should be 3 records.

 field name=metadata_record_s column=TEXT_VALUE /
 becomes
 field name=${terms.CORE_DESC_TERM}_s column=TEXT_VALUE /

 produces the following:
 - result name=response numFound=1 start=0
 - doc
 - arr name=Relation_s
  strrecord 1/str
  /arr
 - arr name=Title_s
  strPolygraph Newsletter Title/str
  /arr
  str name=id9892962/str
 - arr name=title
  strPolygraph Newsletter Title/str
  /arr
  /doc
  /result

 Only the first record is retained. There was also another post (which
 recieved no replies) in the archive that reported the same issue. The DIH
 debug logs do show 3 records correctly being returned, so somehow these are
 not getting added.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1065244.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: commit is taking very very long time

2010-07-23 Thread Alexey Serba
 I am not sure why some commits take very long time.
Hmm... Because it merges index segments... How large is your index?

 Also is there a way to reduce the time it takes?
You can disable the commit in the DIH call and use autoCommit instead. It's
kind of a hack because you postpone the commit operation and make it async.

Another option is to set optimize=false in the DIH call ( it's true by
default ). You can also try to increase the mergeFactor parameter, but it
would affect search performance.
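
For example (a sketch; the thresholds are arbitrary and the autoCommit block
goes inside updateHandler in solrconfig.xml):

  http://localhost:8983/solr/dataimport?command=full-import&commit=false&optimize=false

  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>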


Re: 2 solr dataImport requests on a single core at the same time

2010-07-23 Thread Alexey Serba
 having multiple Request Handlers will not degrade the performance
IMO you shouldn't worry unless you have hundreds of them


Re: Performance issues when querying on large documents

2010-07-23 Thread Alexey Serba
Do you use highlighting? ( http://wiki.apache.org/solr/HighlightingParameters )

Try to disable it and compare performance.
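
If highlighting is switched on in your request handler defaults, you can turn
it off per request for a quick comparison, e.g. (the filter is just a
placeholder for whatever you use to select the PDF extracts):

  http://localhost:8983/solr/select?q=word&fq=doc_type:pdf_extract&rows=20&hl=false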

On Fri, Jul 23, 2010 at 10:52 PM, ahammad ahmed.ham...@gmail.com wrote:

 Hello,

 I have an index with lots of different types of documents. One of those
 types basically contains extracts of PDF docs. Some of those PDFs can have
 1000+ pages, so there would be a lot of stuff to search through.

 I am experiencing really terrible performance when querying. My whole index
 has about 270k documents, but less than 1000 of those are the PDF extracts.
 The slow querying occurs when I search only on those PDF extracts (by
 specifying filters), and return 100 results. The 100 results definitely adds
 to the issue, but even cutting that down can be slow.

 Is there a way to improve querying with such large results? To give an idea,
 querying for a single word can take a little over a minute, which isn't
 really viable for an application that revolves around searching. For now, I
 have limited the results to 20, which makes the query execute in roughly
 10-15 seconds. However, I would like to have the option of returning 100
 results.

 Thanks a lot.


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Performance-issues-when-querying-on-large-documents-tp990590p990590.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: 2 solr dataImport requests on a single core at the same time

2010-07-22 Thread Alexey Serba
DataImportHandler does not support parallel execution of several
requests. You should either send your requests sequentially or
register several DIH handlers in solrconfig and use them in parallel.
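
A sketch of registering two handlers in solrconfig.xml (handler names and
config file names are illustrative):

  <requestHandler name="/dataimport-a" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config-a.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="/dataimport-b" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config-b.xml</str>
    </lst>
  </requestHandler>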


On Thu, Jul 22, 2010 at 11:20 AM, kishan mklpra...@gmail.com wrote:

 please help me
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/2-solr-dataImport-requests-on-a-single-core-at-the-same-time-tp978649p986351.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Adding new elements to index

2010-07-07 Thread Alexey Serba
1) Shouldn't you put your entity elements under document tag, i.e.
dataConfig
  dataSource ... /
  dataSource ... /

  document name=docs
entity ../entity
entity ../entity
  /document
/dataConfig

2) What happens if you try to run full-import with explicitly
specified entity GET parameter?
command=full-importentity=carrers
command=full-importentity=hidrants


On Wed, Jul 7, 2010 at 11:15 AM, Xavier Rodriguez xee...@gmail.com wrote:
 Thanks for the quick reply!

 In fact it was a typo, the 200 rows I got were from postgres. I tried to say
 that the full-import was omitting the 100 oracle rows.

 When I run the full import, I run it as a single job, using the url
 command=full-import. I've tried to clear the index both using the clean
 command and manually deleting it, but when I run the full-import, the number
 of indexed documents are the documents coming from postgres.

 To be sure that the id field is unique, i get the id by assigning a letter
 before the id value. When indexed, the id looks like s_123, and that's the
 id 123 for an entity identified as s. Other entities use different
 prefixes, but never s.

 I used DIH to index the data. My configuration is the folllowing:

 File db-data-config.xml

  dataSource
        type=JdbcDataSource
        name=ds_ora
        driver=oracle.jdbc.OracleDriver
        url=jdbc:oracle:thin:@xxx.xxx.xxx.xxx:1521:SID
        user=user
        password=password
    /

  dataSource
        type=JdbcDataSource
        name=ds_pg
        driver=org.postgresql.Driver
        url=jdbc:postgresql://xxx.xxx.xxx.yyy:5432/sid
        user=user
        password=password
    /

 entity name=carrers dataSource=ds_ora query=select 's_'||id as
 id_carrer,'a' as tooltip from imi_carrers
            field column=id_carrer name=identificador /
            field column=tooltip name=Nom /
 /entity


 entity name=hidrants dataSource=ds_pg query=select 'h_'||id as
 id_hidrant, parc as tooltip from hidrants
            field column=id_hidrant name=identificador /
            field column=tooltip name=Nom /
  /entity

 --

 In that configuration, all the fields coming from ds_pg are indexed, and the
 fields coming from ds_ora are not indexed. As I've said, the strange
 behaviour for me is that no error is logged in tomcat, the number of
 documents created is the number of rows returned by hidrants, while the
 number of rows returned is the sum of the rows from hidrants and
 carrers.

 Thanks in advance.

 Xavi.







 On 7 July 2010 02:46, Erick Erickson erickerick...@gmail.com wrote:

 first do you have a unique key defined in your schema.xml? If you
 do, some of those 300 rows could be replacing earlier rows.

 You say:  if I have 200
 rows indexed from postgres and 100 rows from Oracle, the full-import
 process
 only indexes 200 documents from oracle, although it shows clearly that the
 query retruned 300 rows.

 Which really looks like a typo, if you have 100 rows from Oracle how
 did you get 200 rows from Oracle?

 Are you perhaps doing this in two different jobs and deleting the
 first import before running the second?

 And if this is irrelevant, could you provide more details like how you're
 indexing things (I'm assuming DIH, but you don't state that anywhere).
 If it *is* DIH, providing that configuration would help.

 Best
 Erick

 On Tue, Jul 6, 2010 at 11:19 AM, Xavier Rodriguez xee...@gmail.com
 wrote:

  Hi,
 
  I have a SOLR installed on a Tomcat application server. This solr
 instance
  has some data indexed from a postgres database. Now I need to add some
  entities from an Oracle database. When I run the full-import command, the
  documents indexed are only documents from postgres. In fact, if I have
 200
  rows indexed from postgres and 100 rows from Oracle, the full-import
  process
  only indexes 200 documents from oracle, although it shows clearly that
 the
  query retruned 300 rows.
 
  I'm not doing a delta-import, simply a full import. I've tried to clean
 the
  index, reload the configuration, and manually remove
 dataimport.properties
  because it's the only metadata i found.  Is there any other file to check
  or
  modify just to get all 300 rows indexed?
 
  Of course, I tried to find one of that oracle fields, with no results.
 
  Thanks a lot,
 
  Xavier Rodriguez.
 




Re: Data Import Handler Rich Format Documents

2010-06-28 Thread Alexey Serba
 Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm using
 Solr Version: 1.4.0 and getting the following error:

 java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
 org.apache.solr.handler.dataimport.BinURLDataSource
It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1
release. You should use trunk / nightly builds.
https://issues.apache.org/jira/browse/SOLR-1583

 My data-config.xml looks like this:

 dataConfig
  dataSource type=JdbcDataSource
    driver=oracle.jdbc.driver.OracleDriver
    url=jdbc:oracle:thin:@whatever:12345:whatever
    user=me
    name=ds-db
    password=secret/

  dataSource type=BinURLDataSource
    name=ds-url/

  document
    entity name=my_database
     dataSource=ds-db
     query=select * from my_database where rownum lt;=2
      field column=CONTENT_ID                name=content_id/
      field column=CMS_TITLE                 name=cms_title/
      field column=FORM_TITLE                name=form_title/
      field column=FILE_SIZE                 name=file_size/
      field column=KEYWORDS                  name=keywords/
      field column=DESCRIPTION               name=description/
      field column=CONTENT_URL               name=content_url/
    /entity

    entity name=my_database_url
     dataSource=ds-url
     query=select CONTENT_URL from my_database where
 content_id='${my_database.CONTENT_ID}'
     entity processor=TikaEntityProcessor
      dataSource=ds-url
      format=text
      url=http://www.mysite.com/${my_database.content_url};
      field column=text/
     /entity
    /entity

  /document
 /dataConfig

 I added the entity name=my_database_url section to an existing (working)
 database entity to be able to have Tika index the content pointed to by the
 content_url.

 Is there anything obviously wrong with what I've tried so far?

I think you should move Tika entity into my_database entity and
simplify the whole configuration

entity name=my_database dataSource=ds-db query=select * from
my_database where rownum lt;=2
...
field column=CONTENT_URL   name=content_url/

entity processor=TikaEntityProcessor dataSource=ds-url
format=text url=http://www.mysite.com/${my_database.content_url};
field column=text/
/entity
/entity


Re: solr data config questions

2010-06-28 Thread Alexey Serba
Hi,

You can add an additional commentreplyjoin entity to the story entity, i.e.

entity name=story ...
...
entity name=commenttable ...
...
entity name=replytable ...
...
/entity
/entity

entity name=commentreplyjoin query=select concat(comment_id,
',', replier_id) as commentreply from commenttable left join
replytable on replytable.comment_id=commenttable.comment_id where
commenttable.story_id='${story.story_id}'
field name=commentreply column=commentreply /
/entity
/entity

Thus, you will have a multivalued field commentreply that contains a list
of related comment_id,reply_id pairs (or just comment_id if there are no
related replies for that comment). You can retrieve all values of
that field, process them on the client, and build the complex data structure.

HTH,
Alex

On Mon, Jun 28, 2010 at 8:19 PM, Peng, Wei wei.p...@xerox.com wrote:
 Hi All,



 I am a new user of Solr.

 We are now trying to enable searching on Digg dataset.

 It has story_id as the primary key and comment_id are the comment id
 which commented story_id, so story_id and comment_id is one-to-many
 relationship.

 These comment_ids can be replied by some repliers, so comment_id and
 repliers are one-to-many relationship.



 The problem is that within a single returned document the search results
 shows an array of comment_ids and an array of repliers without knowing
 which repliers replied which comment.

 For example: now we got comment_id:[c1,c,2...,cn],
 repliers:[r1,r2,r3rm]. Can we get something like
 comment_id:[c1,c,2...,cn], repliers:[{r1,r2},{},r3{rm-1,rm}] so that
 {r1,r2} is corresponding to c1?



 Our current data-config is attached:

 dataConfig

    dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
 autoreconnect=true netTimeoutForStreamingResults=1200
 url=jdbc:mysql://localhost/diggdataset batchSize=-1 user=root
 password= /

    document

            entity name=story pk=story_id query=select * from
 story

                  deltaImportQuery=select * from story where
 ID=='${dataimporter.delta.story_id}'

                  deltaQuery=select story_id from story where
 last_modified  '${dataimporter.last_index_time}'



            field column=link name=link /

            field column=title name=title /

            field column=description name=story_content /

            field column=digg name=positiveness /

            field column=comment name=spreading_number /

            field column=user_id name=author /

            field column=profile_view name=user_popularity /

            field column=topic name=topic /

            field column=timestamp name=timestamp /



            entity name=dugg_list  pk=story_id

                    query=select * from dugg_list where
 story_id='${story.story_id}'

                    deltaQuery=select SID from dugg_list where
 last_modified  '${dataimporter.last_index_time}'

                    parentDeltaQuery=select story_id from story where
 story_id=${dugg_list.story_id}

                  field name=viewer column=dugger /

            /entity



            entity name=commenttable  pk=comment_id

                    query=select * from commenttable where
 story_id='${story.story_id}'

                    deltaQuery=select SID from commenttable where
 last_modified  '${dataimporter.last_index_time}'

                    parentDeltaQuery=select story_id from story where
 story_id=${commenttable.story_id}

                  field name=comment_id column=comment_id /

                  field name=spreading_user column=replier /

                  field name=comment_positiveness column=up /

                  field name=comment_negativeness column=down /

                  field name=user_comment column=content /

                  field name=user_comment_timestamp
 column=timestamp /





            entity name=replytable

                    query=select * from replytable where
 comment_id='${commenttable.comment_id}'

                    deltaQuery=select SID from replytable where
 last_modified  '${dataimporter.last_index_time}'

                    parentDeltaQuery=select comment_id from
 commenttable where comment_id=${replytable.comment_id}

                  field name=replier_id column=replier_id /

                  field name=reply_content column=content /

                  field name=reply_positiveness column=up /

                  field name=reply_negativeness column=down /

                  field name=reply_timestamp column=timestamp /

            /entity



            /entity

            /entity

    /document

 /dataConfig



 Please help me on this.

 Many thanks



 Vivian










Re: DIH and denormalizing

2010-06-28 Thread Alexey Serba
 It seems that ${ncdat.feature} is not being set.
Try ${dataTable.feature} instead.


On Tue, Jun 29, 2010 at 1:22 AM, Shawn Heisey s...@elyograg.org wrote:
 I am trying to do some denormalizing with DIH from a MySQL source.  Here's
 part of my data-config.xml:

 entity name=dataTable pk=did
      query=SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE did
 gt; ${dataimporter.request.minDid} AND did lt;=
 ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards})
 IN (${dataimporter.request.modVal})
 entity name=ncdat_wt
        query=SELECT webtable as wt FROM ncdat_wt WHERE
 featurecode='${ncdat.feature}'
 /entity
 /entity

 The relationship between features in ncdat and webtable in ncdat_wt (via
 featurecode) will be many-many.  The wt field in schema.xml is set up as
 multivalued.

 It seems that ${ncdat.feature} is not being set.  I saw a query happening on
 the server and it was SELECT webtable as wt FROM ncdat_wt WHERE
 featurecode='' - that last part is an empty string with single quotes
 around it.  From what I can tell, there are no entries in ncdat where
 feature is blank.  I've tried this with both a 1.5-dev checked out months
 ago (which we are using in production) and a 3.1-dev checked out today.

 Am I doing something wrong?

 Thanks,
 Shawn




Re: dataimport.properties is not updated on delta-import

2010-06-25 Thread Alexey Serba
Please note that Oracle ( or the Oracle JDBC driver ) converts column
names to upper case even though you state them in lower case. If this
is the case then try to rewrite your query in the following form:
select id as id, name as name from table

On Thursday, June 24, 2010, warb w...@mail.com wrote:

 Hello again!

 Upon further investigation it seems that something is amiss with
 delta-import after all, the delta-import does not actually import anything
 (I thought it did when I ran it previously but I am not sure that was the
 case any longer.) It does complete successfully as seen from the front-end
 (dataimport?command=delta-import). Also in the logs it is stated the the
 import was successful (INFO: Delta Import completed successfully), but there
 are exception pertaining to some documents.

 The exception message is that the id field is missing
 (org.apache.solr.common.SolrException: Document [null] missing required
 field: id). Now, I have checked the column names in the table, the
 data-config.xml file and the schema.xml file and they all have the
 column/field names written in lowercase and are even named exactly the same.

 Do Solr rollback delta-imports if one or more of the documents failed?
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/dataimport-properties-is-not-updated-on-delta-import-tp916753p919609.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Data Import Handler Rich Format Documents

2010-06-21 Thread Alexey Serba
You are right. It seems TikaEntityProcessor is exactly the tool you
need in this case.

Alex

On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 : I think you can use existing ExtractingRequestHandler to do the job,
 : i.e. add child entity to your DIH metadata

 why would you do this instead of using the TikaEntityProcessor as i
 already suggested in my earlier mail?



 -Hoss




Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Alexey Serba
I think you can use the existing ExtractingRequestHandler to do the job,
i.e. add a child entity to your DIH metadata entity:

dataSource type=JdbcDataSource name=db ... /
dataSource type=URLDataSource name=solr /
entity name=metadata query=select id, title, url from metadata
dataSource=db
entity processor=PlainTextEntityProcessor name=content
url=http://localhost:8983/solr/update/extract?extractOnly=truewt=xmlindent=onstream.url=${metadata.url};
dataSource=solr
field column=plainText name=content/
/entity
/entity

That's not a working example, just the basic idea; you still need to
URI-escape the ${metadata.url} reference, probably using some transformer
(regexp, javascript?), extract the file content from the ExtractingRequestHandler
XML response using XPath, and probably do some HTML stripping.

HTH,
Alex

On Fri, Jun 18, 2010 at 4:51 PM, Tod listac...@gmail.com wrote:
 I have a database containing Metadata from a content management system.
  Part of that data includes a URL pointing to the actual published document
 which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.

 I'm already indexing the Metadata and that provides a lot of value.  The
 customer however would like that the content pointed to by the URL also be
 indexed for more discrete searching.

 This article at Lucid:

 http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS

 describes the process of coding a custom transformer.  A separate article
 I've read implies Nutch could be used to provide this functionality too.

 What would be the best and most efficient way to accomplish what I'm trying
 to do?  I have a feeling the Lucid article might be dated and there might
 ways to do this now without any coding and maybe without even needing to use
 Nutch.  I'm using the current release version of Solr.

 Thanks in advance.


 - Tod



Re: Solr DataConfig / DIH Question

2010-06-16 Thread Alexey Serba
 There is a 1-[0,1] relationship between Person and Address with address_id 
 being the nullable foreign key.

I think you should be good with a single query/entity then (no need for
nested entities):

<entity name="person" query="select person.id, person.name,
person.address_id, address.zipcode from person left join address on
address.id=person.address_id"/>

On Sunday, June 13, 2010, Holmes, Charles V. chol...@mitre.org wrote:
 I'm putting together an entity.  A simplified version of the database schema 
 is below.  There is a 1-[0,1] relationship between Person and Address with 
 address_id being the nullable foreign key.  If it makes any difference, I'm 
 using SQL Server 2005 on the backend.

 Person [id (pk), name, address_id (fk)]
 Address [id (pk), zipcode]

 My data config looks like the one below.  This naturally fails when the 
 address_id is null since the query ends up being select * from user.address 
 where id = .

 <entity name="person"
         query="select * from user.person">
   <entity name="address"
           query="select * from user.address where id = ${person.address_id}">
   </entity>
 </entity>

 I've worked around it by using a config like this one.  However, this makes 
 the queries quite complex for some of my larger joins.

 <entity name="person"
         query="select * from user.person">
   <entity name="address"
           query="select * from user.address where id = (select address_id
 from user.person where id = ${person.id})">
   </entity>
 </entity>

 Is there a cleaner / better way of handling these types of relationships?  
 I've also tried to specify a default in the Solr schema, but that seems to 
 only work after all the data is indexed which makes sense but surprised me 
 initially.  BTW, thanks for the great DIH tutorial on the wiki!

 Thanks!
 Charles



Re: multiValued using

2010-06-08 Thread Alexey Serba
Hi Alberto,

You can add a child entity which returns multiple records, i.e.

<entity name="root" query="select id, title from titles">
  <entity name="child" query="select value from multivalued where
  title_id='${root.id}'">
  </entity>
</entity>

HTH,
Alex

2010/6/7 Alberto García Sola alberto...@gmail.com:
 Hello, this is my first message to this list.

 I was wondering if it is possible to use multiValued when using MySQL (or
 any SQL-database engine) through DataImportHandler.

 I've tried using a query which return something like this:
 1 - title1 - multivalue1-1
 1 - title1 - multivalue1-2
 1 - title1 - multivalue1-3
 2 - title2 - multivalue2-1
 2 - title2 - multivalue2-2

 And using the first row as ID. But that only returns me the first occurrence
 rather than transforming them into multiValued fields.

 Is there a way to deal with multiValued in databases?

 NOTE: The way I work with multivalues is to use foreign keys and
 relate them in the query so that the query gives me the results the way I
 have shown.

 Regards,
 Alberto.



Re: Importing large datasets

2010-06-07 Thread Alexey Serba
What's the relation between the items and item_descriptions tables? I.e. is
there only one item_descriptions record for every id?

If it is 1-1 then you can merge all your data into a single database and use
the following query:

 <entity name="item"
   dataSource="single_datasource"
   query="select * from items inner join item_descriptions on
item_descriptions.id=items.id">
 </entity>

HTH,
Alex

On Thu, Jun 3, 2010 at 6:34 AM, Blargy zman...@hotmail.com wrote:


 Erik Hatcher-4 wrote:

 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.

       Erik

 On Jun 2, 2010, at 12:21 PM, Blargy wrote:



 As a data point, I routinely see clients index 5M items on normal
 hardware
 in approx. 1 hour (give or take 30 minutes).

 Also wanted to add that our main entity (item) consists of 5 sub-
 entities
 (ie, joins). 2 of those 5 are fairly small so I am using
 CachedSqlEntityProcessor for them but the other 3 (which includes
 item_description) are normal.

 All the entites minus the item_description connect to datasource1.
 They
 currently point to one physical machine although we do have a pool
 of 3 DB's
 that could be used if it helps. The other entity, item_description
 uses a
 datasource2 which has a pool of 2 DB's that could potentially be
 used. Not
 sure if that would help or not.

 I might as well add that the item description will have indexed, stored
 and term
 vectors set to true.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 I can't find any example of creating a massive sql query. Any out there?
 Will batching still work with this massive query?
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexer threading?

2010-04-27 Thread Alexey Serba
Hi Brian,

I was testing indexing performance on a high-CPU box recently and ran into
the same issue. I tried different indexing methods ( xml,
CSVRequestHandler, and SolrJ + BinaryRequestWriter with multiple
threads ). The last method is indeed the fastest. I believe the
multiple-threads approach gives you better performance if you have
complex text analysis. I had very simple analysis (WhitespaceTokenizer
only), and the performance boost from increasing threads was not very
impressive ( but still there ). I guess that in the case of simple text
analysis overall performance comes down to synchronization issues.

I tried to profile the application during the indexing phase for CPU times and
monitors, and it seems that most of the blocking is on the following
methods:
- DocumentsWriter.doBalanceRAM
- DocumentsWriter.getThreadState
- SolrIndexWriter.ensureOpen

I don't know the guts of Solr/Lucene in such detail, so I can't draw any
conclusions. Are there any configuration techniques to improve
indexing performance in a multiple-threads scenario?

Alex

On Mon, Apr 26, 2010 at 6:52 PM, Wawok, Brian brian.wa...@cmegroup.com wrote:
 Hi,

 I was wondering about how the multi-threading of the indexer works?  I am 
 using SolrJ to stream documents to a server. As I add more threads on the 
 client side, I slowly see both speed and CPU usage go up on the indexer side. 
 Once I hit about 4 threads, my indexer is at 100% cpu usage (of 1 CPU on a 
 4-way box), and will not do any more work. It is pretty fast, doing something 
 like 75k lines of text per second... but I would really like to use all 4 CPUs 
 on the indexer. Is this just a limitation of Solr, or is this a limitation of 
 using SolrJ and document streaming?


 Thanks,


 Brian
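
For reference, a minimal SolrJ sketch of one way to push documents from a client without managing the threads yourself: StreamingUpdateSolrServer (available in Solr 1.4) buffers documents and sends them over a configurable number of background threads. The URL, field names and counts below are illustrative assumptions only, not from the thread.

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexing {
    public static void main(String[] args) throws Exception {
        // queue up to 100 docs, drain them with 4 background threads
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text", "document body " + i);
            server.add(doc);   // returns quickly; the HTTP work happens in the background
        }
        server.commit();
    }
}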



Re: Short Question: Fills this entity multiValued Fields (DIH)?

2010-04-08 Thread Alexey Serba
 Have a look at these two lines:
 
 <entity name="feature" query="select description from feature where
 item_id='${item.ID}'">
                <field name="features" column="description" />
 

 If there is more than one description per item_ID, does the features field
 get multiple values if it is defined as multiValued="true"?
Correct.


Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-24 Thread Alexey Serba
You should add this component (suggest or spellcheck, depending on how
you name it) to a request handler, i.e. add

  <requestHandler name="/suggest"
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

And then you can hit the following url and get your suggestions

http://localhost:8983/solr/suggest/?spellcheck=true&spellcheck.dictionary=suggest&spellcheck.build=true&spellcheck.extendedResults=true&spellcheck.count=10&q=prefix

On Wed, Mar 24, 2010 at 8:09 PM, stocki st...@shopgate.com wrote:

 hey.

 i got it =)

 i checked out with lucene and the build from solr. with ant -verbose
 example.

 now, when i put this line into solrconfig:
 <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
 no exception occurs =) juhu

 but how does this component work ?? sorry for a new stupid question ^^


 stocki wrote:

 okay, thx

 so i checked out but i cannot get a build to work.

 i got 100 errors ...

 D:\cygwin\home\stock\trunk_\solr\common-build.xml:424: The following error
 occur
 red while executing this line:
 D:\cygwin\home\stock\trunk_\solr\common-build.xml:281: The following error
 occur
 red while executing this line:
 D:\cygwin\home\stock\trunk_\solr\contrib\clustering\build.xml:69: The
 following
 error occurred while executing this line:
 D:\cygwin\home\stock\trunk_\solr\build.xml:155: The following error
 occurred whi
 le executing this line:
 D:\cygwin\home\stock\trunk_\solr\common-build.xml:221: Compile failed; see
 the c
 ompiler error output for details.



 Lance Norskog-2 wrote:

 You need 'ant' to do builds.  At the top level, do:
 ant clean
 ant example

 These will build everything and set up the example/ directory. After
 that, run:
 ant test-core

 to run all of the unit tests and make sure that the build works. If
 the autosuggest patch has a test, this will check that the patch went
 in correctly.

 Lance

 On Tue, Mar 23, 2010 at 7:42 AM, stocki st...@shopgate.com wrote:

 okay,
 i do this..

 but one file was not updated correctly:
 Index: trunk/src/java/org/apache/solr/util/HighFrequencyDictionary.java
 (from the suggest.patch)

 i checked it out from eclipse, applied the patch, made a new solr.war ...
 is that the right way ??
 i thought that by making a war i didn't need to make a build.

 how do i make a build ?




 Alexey-34 wrote:

 Error loading class 'org.apache.solr.spelling.suggest.Suggester'
 Are you sure you applied the patch correctly?
 See http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

 Checkout Solr trunk source code (
 http://svn.apache.org/repos/asf/lucene/solr/trunk ), apply patch,
 verify that everything went smoothly, build solr and use built version
 for your tests.

 On Mon, Mar 22, 2010 at 9:42 PM, stocki st...@shopgate.com wrote:

 i patch an nightly build from solr.
 patch runs, classes are in the correct folder, but when i replace
 spellcheck
 with this spellchecl like in the comments, solr cannot find the
 classes
 =(

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">suggest</str>
       <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
       <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
       <str name="field">text</str>
       <str name="sourceLocation">american-english</str>
     </lst>
   </searchComponent>


 -- SCHWERWIEGEND: org.apache.solr.common.SolrException: Error loading
 class
 'org.ap
 ache.solr.spelling.suggest.Suggester'


 why is it so ??  i think no one has so many trouble to run a patch
 like
 me =( :D


 Andrzej Bialecki wrote:

 On 2010-03-19 13:03, stocki wrote:

 hello..

 i try to implement autosuggest component from these link:
 http://issues.apache.org/jira/browse/SOLR-1316

 but i have no idea how to do this !?? can anyone get me some tipps ?

 Please follow the instructions outlined in the JIRA issue, in the
 comment that shows fragments of XML config files.


 --
 Best regards,
 Andrzej Bialecki     
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




 --
 View this message in context:
 http://old.nabble.com/SOLR-1316-How-To-Implement-this-autosuggest-component-tp27950949p27990809.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context:
 http://old.nabble.com/SOLR-1316-How-To-Implement-this-patch-autoComplete-tp27950949p28001938.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 Lance Norskog
 goks...@gmail.com





 --
 View this message in context: 
 http://old.nabble.com/SOLR-1316-How-To-Implement-this-patch-autoComplete-tp27950949p28018196.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-23 Thread Alexey Serba
 Error loading class 'org.apache.solr.spelling.suggest.Suggester'
Are you sure you applied the patch correctly?
See http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

Checkout Solr trunk source code (
http://svn.apache.org/repos/asf/lucene/solr/trunk ), apply patch,
verify that everything went smoothly, build solr and use built version
for your tests.

On Mon, Mar 22, 2010 at 9:42 PM, stocki st...@shopgate.com wrote:

 i patched a nightly build from solr.
 the patch runs, the classes are in the correct folder, but when i replace the
 spellcheck config with the one from the comments, solr cannot find the classes =(

 <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
      <str name="field">text</str>
      <str name="sourceLocation">american-english</str>
    </lst>
  </searchComponent>


 -- SCHWERWIEGEND: org.apache.solr.common.SolrException: Error loading class
 'org.ap
 ache.solr.spelling.suggest.Suggester'


 why is it so ??  i think no one has so many trouble to run a patch like
 me =( :D


 Andrzej Bialecki wrote:

 On 2010-03-19 13:03, stocki wrote:

 hello..

 i try to implement autosuggest component from these link:
 http://issues.apache.org/jira/browse/SOLR-1316

 but i have no idea how to do this !?? can anyone get me some tipps ?

 Please follow the instructions outlined in the JIRA issue, in the
 comment that shows fragments of XML config files.


 --
 Best regards,
 Andrzej Bialecki     
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




 --
 View this message in context: 
 http://old.nabble.com/SOLR-1316-How-To-Implement-this-autosuggest-component-tp27950949p27990809.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Term Highlighting without store text in index

2010-03-18 Thread Alexey Serba
Hey Dominique,

See 
http://www.lucidimagination.com/search/document/5ea8054ed8348e6f/highlight_arbitrary_text#3799814845ebf002

Although it might be not good solution for huge texts, wildcard/phrase queries.
http://issues.apache.org/jira/browse/SOLR-1397

On Mon, Mar 15, 2010 at 4:09 PM, dbejean dominique.bej...@eolya.fr wrote:

 Hello,

 Just in order to be able to show term highlighting in my results list, I
 store all the indexed data in the Lucene index and so, it is very huge
 (108Gb). Is there any possibilities to do it in an other way ? Now or in the
 future, is it possible that Solr use a 3nd-party tool such as ehcache in
 order to store the content of the indexed documents outside of the Lucene
 index ?

 Thank you

 Dominique


 --
 View this message in context: 
 http://old.nabble.com/Term-Highlighting-without-store-text-in-index-tp27904022p27904022.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: implementing profanity detector

2010-02-11 Thread Alexey Serba
 - A TokenFilter would allow me to tap into the existing analysis pipeline so
 I get the tokens for free but I can't access the document.
https://issues.apache.org/jira/browse/SOLR-1536

On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham mper...@onespot.com wrote:
 We'd like to implement a profanity detector for documents during indexing.
  That is, given a file of profane words, we'd like to be able to mark a
 document as safe or not safe if it contains any of those words so that we
 can have something similar to google's safe search.

 I'm trying to figure out how best to implement this with Solr 1.4:

 - An UpdateRequestProcessor would allow me to dynamically populate a safe
 boolean field but requires me to pull out the content, tokenize it and run
 each token through my set of profanities, essentially running the analysis
 pipeline again.  That's a lot of overheard AFAIK.

 - A TokenFilter would allow me to tap into the existing analysis pipeline so
 I get the tokens for free but I can't access the document.

 Any suggestions on how to best implement this?

 Thanks in advance,
 mike
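
For illustration, a rough sketch of the UpdateRequestProcessor route discussed above. It re-tokenizes the field naively (a simple whitespace split rather than the real analysis chain), and the field names, word list and class names are made up for the example; depending on the Solr version, SolrQueryResponse lives in org.apache.solr.request or org.apache.solr.response.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ProfanityMarkerFactory extends UpdateRequestProcessorFactory {

    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new ProfanityMarker(next);
    }

    static class ProfanityMarker extends UpdateRequestProcessor {
        // in a real setup this set would be loaded from the profanity file
        private final Set<String> profanities =
            new HashSet<String>(Arrays.asList("badword1", "badword2"));

        ProfanityMarker(UpdateRequestProcessor next) {
            super(next);
        }

        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object content = doc.getFieldValue("content");   // hypothetical field name
            boolean safe = true;
            if (content != null) {
                // naive whitespace tokenization, duplicating part of the analysis work
                for (String token : content.toString().toLowerCase().split("\\s+")) {
                    if (profanities.contains(token)) { safe = false; break; }
                }
            }
            doc.setField("safe", safe);
            super.processAdd(cmd);   // let the rest of the chain (and the indexer) run
        }
    }
}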



DataImportHandler - case sensitivity of column names

2010-02-08 Thread Alexey Serba
I encountered the problem with Oracle converting column names to upper
case. As a result SolrInputDocument is created with field names in
upper case and Document [null] missing required field: id exception
is thrown ( although ID field is defined ).

I do not specify field elements explicitly.

I know that I can rewrite all my queries to select id as id, body
as body from document format, but is there any other workaround for
this? case insensitive option or something?

Here's my data-config:
<dataConfig>
  <dataSource convertType="true"
      driver="oracle.jdbc.driver.OracleDriver" password="oracle"
      url="jdbc:oracle:thin:@localhost:1521:xe" user="SYSTEM"/>
  <document name="items">
    <entity name="root" pk="id" preImportDeleteQuery="db:db1"
        query="select id, body from document"
        transformer="TemplateTransformer">
      <entity name="nested1" query="select category from
          document_category where doc_id='${root.id}'"/>
      <entity name="nested2" query="select tag from document_tag where
          doc_id='${root.id}'"/>
      <field column="db" template="db1"/>
    </entity>
  </document>
</dataConfig>

Alexey


Re: Indexing an oracle warehouse table

2010-02-03 Thread Alexey Serba
 What would be the right way to point out which field contains the term 
 searched for.
I would use highlighting for all of these fields and then post-process the
Solr response in order to check the highlighting tags. But I usually don't
have so many fields, and I don't know whether it's possible to configure Solr
to highlight dynamic fields using '*'.

On Wed, Feb 3, 2010 at 2:43 AM, caman aboxfortheotherst...@gmail.com wrote:

 Thanks all. I am on track.
 Another question:
 What would be the right way to point out which field contains the term
 searched for.
 e.g. If I search for SOLR and if the term exist in field788 for a document,
 how do I pinpoint that which field has the term.
 I copied all the fields in field called 'body' which makes searching easier
 but would be nice to show the field which has that exact term.

 thanks

 caman wrote:

 Hello all,

 hope someone can point me to right direction. I am trying to index an
 oracle warehouse table(TableA) with 850 columns. Out of the structure
 about 800 fields are CLOBs and are good candidate to enable full-text
 searching. Also have few columns which has relational link to other
 tables. I am clean on how to create a root entity and then pull data from
 other relational link as child entities.  Most columns in TableA are named
 as field1,field2...field800.
 Now my question is how to organize the schema efficiently:
 First option:
 if my query is 'select * from TableA', Do I  define field name=attr1
 column=FIELD1 / for each of those 800 columns?   Seems cumbersome. May
 be can write a script to generate XML instead of handwriting both in
 data-config.xml and schema.xml.
 OR
 Dont define any field name=attr1 column=FIELD1 / so that column in
 SOLR will be same as in the database table. But questions are 1)How do I
 define unique field in this scenario? 2) How to copy all the text fields
 to a common field for easy searching?

 Any helpful is appreciated. Please feel free to suggest any alternative
 way.

 Thanks







 --
 View this message in context: 
 http://old.nabble.com/Indexing-an-oracle-warehouse-table-tp27414263p27429352.html
 Sent from the Solr - User mailing list archive at Nabble.com.
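
As a sketch of the post-processing idea above (field and URL names are only placeholders), SolrJ exposes the highlighting section of the response as a map keyed by document id, so a non-empty entry tells you which listed fields actually produced snippets:

import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MatchingFieldLookup {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("body:SOLR");
        q.setHighlight(true);
        // list the fields you care about; whether wildcards expand to dynamic
        // fields here depends on the Solr version, so being explicit is safer
        q.setParam("hl.fl", "field1,field2,field788");

        QueryResponse rsp = solr.query(q);

        // doc id -> (field name -> snippets); only fields containing the term show up
        Map<String, Map<String, List<String>>> highlighting = rsp.getHighlighting();
        for (Map.Entry<String, Map<String, List<String>>> e : highlighting.entrySet()) {
            System.out.println(e.getKey() + " matched in " + e.getValue().keySet());
        }
    }
}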




DataImportHandler - convertType attribute

2010-02-02 Thread Alexey Serba
Hello,

I encountered blob indexing problem and found convertType solution in
FAQhttp://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

I was wondering why it is not enabled by default and found the
following comment
http://www.lucidimagination.com/search/document/169e6cc87dad5e67/dataimporthandler_and_blobs#169e6cc87dad5e67in
mailing list:

We used to attempt type conversion from the SQL type to the field's given
type. We
found that it was error prone and switched to using the ResultSet#getObject
for all columns (making the old behavior a configurable option –
convertType in JdbcDataSource).

Why it is error prone? Is it safe enough to enable convertType for all jdbc
data sources by default? What are the side effects?

Thanks in advance,
Alex


Re: Indexing a oracle warehouse table

2010-02-02 Thread Alexey Serba
 Don't define any <field name="attr1" column="FIELD1" /> so that the column in
 SOLR will be the same as in the database table.
Correct.
You can define a dynamic field <dynamicField name="field*" type="text"
indexed="true" stored="true"/> ( see
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields )

 1) How do I define a unique field in this scenario?
You can create a primary key in the database or generate it directly in
Solr ( see UUID techniques http://wiki.apache.org/solr/UniqueKey )

 2) How do I copy all the text fields to a common field for easy searching?
<copyField source="field*" dest="field"/> ( see
http://wiki.apache.org/solr/SchemaXml#Copy_Fields )


On Tue, Feb 2, 2010 at 4:22 AM, caman aboxfortheotherst...@gmail.com wrote:

 Hello all,

 hope someone can point me to right direction. I am trying to index an oracle
 warehouse table(TableA) with 850 columns. Out of the structure about 800
 fields are CLOBs and are good candidate to enable full-text searching. Also
 have few columns which has relational link to other tables. I am clean on
 how to create a root entity and then pull data from other relational link as
 child entities.  Most columns in TableA are named as
 field1,field2...field800.
 Now my question is how to organize the schema efficiently:
 First option:
 if my query is 'select * from TableA', do I define a <field name="attr1"
 column="FIELD1" /> for each of those 800 columns?   Seems cumbersome. Maybe I
 can write a script to generate the XML instead of handwriting both
 data-config.xml and schema.xml.
 OR
 Don't define any <field name="attr1" column="FIELD1" /> so that the column in
 SOLR will be the same as in the database table. But the questions are: 1) How do I
 define a unique field in this scenario? 2) How do I copy all the text fields to
 a common field for easy searching?

 Any helpful is appreciated. Please feel free to suggest any alternative way.

 Thanks





 --
 View this message in context: 
 http://old.nabble.com/Indexing-a-oracle-warehouse-table-tp27414263p27414263.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: DataImportHandler - synchronous execution

2010-01-13 Thread Alexey Serba
Hi,

I created Jira issue SOLR-1721 and attached simple patch ( no
documentation ) for this.

HIH,
Alex

2010/1/13 Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com:
 it can be added

 On Tue, Jan 12, 2010 at 10:18 PM, Alexey Serba ase...@gmail.com wrote:
 Hi,

 I found that there's no explicit option to run DataImportHandler in a
 synchronous mode. I need that option to run DIH from SolrJ (
 EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
 to DIH as a workaround for this, but I think it makes sense to add
 specific option for that. Any objections?

 Alex




 --
 -
 Noble Paul | Systems Architect| AOL | http://aol.com



DataImportHandler - synchronous execution

2010-01-12 Thread Alexey Serba
Hi,

I found that there's no explicit option to run DataImportHandler in a
synchronous mode. I need that option to run DIH from SolrJ (
EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
to DIH as a workaround for this, but I think it makes sense to add
specific option for that. Any objections?

Alex


Re: Adaptive search?

2009-12-18 Thread Alexey Serba
You can add click counts to your index as an additional field and boost
results based on that value.

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29

You can keep some kind of buffer for clicks and update the click-count
field for the documents in the index periodically.

If you don't want to update whole documents in the index then you
should probably look at ExternalFileField, or at Lucene's ParallelReader as
a custom Solr IndexReader, but this is complex low-level Lucene stuff
and requires some hacking.

Alex

On Thu, Dec 17, 2009 at 6:46 PM, Siddhant Goel siddhantg...@gmail.com wrote:
 Let's say we have a search engine (a simple front end - web app kind of a
 thing - responsible for querying Solr and then displaying the results in a
 human readable form) based on Solr. If a user searches for something, gets
 quite a few search results, and then clicks on one such result - is there
 any mechanism by which we can notify Solr to boost the score/relevance of
 that particular result in future searches? If not, then any pointers on how
 to go about doing that would be very helpful.

 Thanks,

 On Thu, Dec 17, 2009 at 7:50 PM, Paul Libbrecht p...@activemath.org wrote:

 What can it mean to adapt to user clicks ? Quite many things in my head.
 Do you have maybe a citation that inspires you here?

 paul


 Le 17-déc.-09 à 13:52, Siddhant Goel a écrit :


  Does Solr provide adaptive searching? Can it adapt to user clicks within
 the
 search results it provides? Or that has to be done externally?





 --
 - Siddhant
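
As a small illustration of the first suggestion (boosting by a stored click-count field), a dismax query with a boost function might look roughly like the following; the field names, boost values and URL are assumptions for the example, not from the thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClickBoostQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("laptop");
        q.setQueryType("dismax");
        q.setParam("qf", "title^2.0 body^1.0");
        q.setParam("bf", "clicks");   // additive boost by the stored click-count field

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}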



Re: preserve relational strucutre in solr?

2009-12-14 Thread Alexey Serba
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example

See the full import example; it has 1-n and n-n relationships.

On Mon, Dec 14, 2009 at 4:34 PM, Faire Mii faire@gmail.com wrote:

  I was able to import data through solr DIH.

 in my db i have 3 tables:

 threads: id
 tags: id
 thread_tag_map: thread_id, tag_id

 i want to import the many2many relationship (which thread has which tags) to 
 my solr index.

 how should the query look like.

 i have tried with following code without result:

 <entity name="thread_tags"
        query="select * from threads, tags, thread_tag_map where
 thread_tag_map.thread_id = threads.id AND thread_tag_map.tag_id = tags.id">
 </entity>

 Is this the right way to go?

 i thought that with this query each document will consist of a thread and all 
 the tags related to it, and i could do a query to get the specific thread by 
 tagname.


 thanks!


Re: Similar documents from multiple cores with different schemas

2009-11-09 Thread Alexey Serba
 Or maybe it's
 possible to tweak MoreLikeThis just to return the fields and terms that
 could be used for a search on the other core?
Exactly

See parameter mlt.interestingTerms in MoreLikeThisHandler
http://wiki.apache.org/solr/MoreLikeThisHandler

You can get the interesting terms and build a query (with N optional clauses
+ boosts) against the second core yourself.

HIH,
Alex


On Mon, Nov 9, 2009 at 6:25 PM, Chantal Ackermann
chantal.ackerm...@btelligent.de wrote:
 Hi all,

 my search for any postings answering the following question haven't produced
 any helpful hints so far. Maybe someone can point me into the right
 direction?

 Situation:
 I have two cores with slightly different schemas. Slightly means that some
 fields appear on both cores but there are some that are required in one core
 but optional in the other. Then there are fields that appear only in one
 core.
 (I don't want to put them in one index, right now, because of the fields
 that might be required for only one type but not the other. But it's
 certainly an option.)

 Question:
 Is there a way to get similar contents from core B when the input (seed) to
 the comparison is a document from core A?

 MoreLikeThis:
 I was searching for MoreLikeThis, multiple schemas etc. As these are cores
 with different schemas, the posts on distributed search/sharding in
 combination with MoreLikeThis are not helpful. But maybe there is some other
 functionality that I am not aware of? Some similarity search? Or maybe it's
 possible to tweak MoreLikeThis just to return the fields and terms that
 could be used for a search on the other core?

 Thanks for any input!
 Chantal
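
A rough sketch of that two-step approach with SolrJ (the core and field names are invented, and the exact shape of the interestingTerms section depends on mlt.interestingTerms=list vs details, so the parsing below may need adjusting):

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CrossCoreSimilar {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer coreA = new CommonsHttpSolrServer("http://localhost:8983/solr/coreA");
        CommonsHttpSolrServer coreB = new CommonsHttpSolrServer("http://localhost:8983/solr/coreB");

        // 1. ask core A's MoreLikeThisHandler only for the interesting terms of a seed doc
        SolrQuery mlt = new SolrQuery("id:12345");
        mlt.setQueryType("/mlt");
        mlt.setParam("mlt.fl", "title,description");
        mlt.setParam("mlt.interestingTerms", "list");
        QueryResponse rsp = coreA.query(mlt);

        // with mlt.interestingTerms=list the terms come back as a flat list of field:term entries
        List<?> terms = (List<?>) rsp.getResponse().get("interestingTerms");

        // 2. OR the terms together (optionally with boosts) and query the second core
        StringBuilder q = new StringBuilder();
        for (Object term : terms) {
            if (q.length() > 0) q.append(" OR ");
            q.append(term);
        }
        QueryResponse similar = coreB.query(new SolrQuery(q.toString()));
        System.out.println(similar.getResults().getNumFound());
    }
}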



Re: sanizing/filtering query string for security

2009-11-09 Thread Alexey Serba
I added some kind of pre- and post-processing around Solr requests for this, i.e.

If I find a fieldname specified in the query string in the form
fieldname:term, then I pass the query string to the standard request
handler, otherwise I use the DisMaxRequestHandler ( the DisMaxRequestHandler
doesn't break on the query, at least I haven't seen it yet ). If the standard
request handler throws an error ( invalid field, too many clauses, etc. )
then I pass the original query to the DisMax request handler.

Alex

On Mon, Nov 9, 2009 at 10:05 PM, michael8 mich...@saracatech.com wrote:

 Hi Julian,

 Saw your post on exactly the question I have.  I'm curious if you got any
 response directly, or figured out a way to do this by now that you could
 share?  I'm in the same situation trying to 'sanitize' the query string
 coming in before handing it to solr.  I do see that characters like :
 could break the query, but am curious if anyone has come up with a general
 solution as I think this must be a fairly common problem for any solr
 deployment to tackle.

 Thanks,
 Michael


 Julian Davchev wrote:

 Hi,
 Is there anything special that can be done for sanitizing user input
 before it is passed as a query to solr?
 Not allowing * and ? as the first char is the only thing I can think of right
 now. Is there anything else it should somehow handle?

 I am not able to find any relevant document.



 --
 View this message in context: 
 http://old.nabble.com/sanizing-filtering-query-string-for-security-tp21516844p26271891.html
 Sent from the Solr - User mailing list archive at Nabble.com.
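
A minimal client-side sketch of the fallback logic described above, assuming SolrJ (the regex and handler choice are simplifications, not the poster's actual code):

import java.util.regex.Pattern;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SafeQuery {
    // very rough test for an explicit "field:value" clause in the user input
    private static final Pattern FIELDED = Pattern.compile("\\w+:\\S+");

    public static QueryResponse search(CommonsHttpSolrServer solr, String userInput)
            throws SolrServerException {
        if (FIELDED.matcher(userInput).find()) {
            try {
                // looks like a fielded query: try the standard handler with full query syntax
                return solr.query(new SolrQuery(userInput));
            } catch (SolrServerException e) {
                // invalid field, syntax error, too many clauses, ... fall back to dismax
            }
        }
        SolrQuery q = new SolrQuery(userInput);
        q.setQueryType("dismax");   // dismax treats the input as plain keywords
        return solr.query(q);
    }
}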




Re: sanizing/filtering query string for security

2009-11-09 Thread Alexey Serba
 BTW, I have not used DisMax handler yet, but does it handle *:* properly?
See q.alt DisMax parameter
http://wiki.apache.org/solr/DisMaxRequestHandler#q.alt

You can specify q.alt=*:* and q as empty string to get all results.

 do you care if users issue this query
I allow users to issue an empty search and get all results with all
facets / etc. It's a nice navigation UI btw.

 Basically given my UI, I'm trying to *hide* the total count from users 
 searching for *everything*
If you don't specify q.alt parameter then Solr returns zero results
for empty search. *:* won't work either.

 though this syntax has helped me debug/monitor the state of my search doc 
 pool size.
see q.alt

Alex

On Tue, Nov 10, 2009 at 12:59 AM, michael8 mich...@saracatech.com wrote:

 Sounds like a nice approach you have  done.  BTW, I have not used DisMax
 handler yet, but does it handle *:* properly?  IOW, do you care if users
 issue this query, or does DisMax treat this query string differently than
 standard request handler?  Basically given my UI, I'm trying to *hide* the
 total count from users searching for *everything*, though this syntax has
 helped me debug/monitor the state of my search doc pool size.

 Thanks,
 Michael


 Alexey-34 wrote:

 I added some kind of pre and post processing of Solr results for this,
 i.e.

 If I find fieldname specified in query string in form of
 fieldname:term then I pass this query string to standard request
 handler, otherwise use DisMaxRequestHandler ( DisMaxRequestHandler
 doesn't break the query, at least I haven't seen yet ). If standard
 request handler throws error ( invalid field, too many clauses, etc )
 then I pass original query to DisMax request handler.

 Alex

 On Mon, Nov 9, 2009 at 10:05 PM, michael8 mich...@saracatech.com wrote:

 Hi Julian,

 Saw you post on exactly the question I have.  I'm curious if you got any
 response directly, or figured out a way to do this by now that you could
 share?  I'm in the same situation trying to 'sanitize' the query string
 coming in before handing it to solr.  I do see that characters like :
 could break the query, but am curious if anyone has come up with a
 general
 solution as I think this must be a fairly common problem for any solr
 deployment to tackle.

 Thanks,
 Michael


 Julian Davchev wrote:

 Hi,
 Is there anything special that can be done for sanitizing user input
 before passed as query to solr.
 Not allowing * and ? as first char is only thing I can thing of right
 now. Anything else it should somehow handle.

 I am not able to find any relevant document.



 --
 View this message in context:
 http://old.nabble.com/sanizing-filtering-query-string-for-security-tp21516844p26271891.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context: 
 http://old.nabble.com/sanizing-filtering-query-string-for-security-tp21516844p26274459.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: MoreLikeThis and filtering/restricting on target fields

2009-11-06 Thread Alexey Serba
Hi Cody,

 I have tried using MLT as a search component so that it has access to
 filter queries (via fq) but I cannot seem to get it to give me any
 data other than more of the same, that is, I can get a ton of Articles
 back but not other content types.
Filter query ( fq ) should work, for example add fq=type_s:BlogPost OR
type_s:Community

http://localhost:9007/solr/mlt?q=id:WikiArticle:948&mlt.fl=body_t&mlt.qf=body_t^1.0&fq=type_s:BlogPost
OR type_s:Community

Alex

On Fri, Nov 6, 2009 at 1:44 AM, Cody Caughlan tool...@gmail.com wrote:
 I am trying to use MoreLikeThis (both the component and handler,
 trying combinations) and I would like to give it an input document
 reference which has a source field to analyze and then get back
 other documents which have a given field that is used by MLT.

 My dataset is composed of documents like:

 # Doc 1
 id:Article:99
 type_s:Article
 body_t: the body of the article...

 # Doc 2
 id:Article:646
 types_s:Article
 body_t: another article...

 # Doc 3
 id:Community:44
 type_s:Community
 description_t: description of this community...

 # Doc 4
 id:Community:34874
 type_s:Community
 description_t: another description

 # Doc 5
 id:BlogPost:2384
 type_s:BlogPost
 body_t: contents of some blog post

 So I would like to say, given an article (e.g. id:Article:99 which
 has a field body_t that should be analyzed), 'give me related
 Communities, and you will want to search on description_t for your
 analysis.'

 When I run a basic query like:

 (using raw URL values for clarity, but they are encoded in reality)

 http://localhost:9007/solr/mlt?q=id:WikiArticle:948&mlt.fl=body_t

 then I get back a ton of other articles. Which is fine if my target
 type was Article.

 So how can I say search on field A for your analysis of the input
 document, but for related terms use field B, filtered by type_s?

 It seems that I can really only specify one field via mlt.fl

 I have tried using MLT as a search component so that it has access to
 filter queries (via fq) but I cannot seem to get it to give me any
 data other than more of the same, that is, I can get a ton of Articles
 back but not other content types.

 Am I just trying to do too much?

 Thanks
 /Cody
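
The same request expressed with SolrJ might look roughly like this, assuming the MoreLikeThisHandler is registered under /mlt and reachable via the qt parameter in the Solr version in use:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class MltAcrossTypes {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:9007/solr");

        SolrQuery q = new SolrQuery("id:WikiArticle\\:948");   // colon in the id value escaped
        q.setQueryType("/mlt");                                // MoreLikeThisHandler
        q.setParam("mlt.fl", "body_t");
        q.setParam("mlt.qf", "body_t^1.0");
        q.addFilterQuery("type_s:BlogPost OR type_s:Community"); // restrict the returned docs

        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id") + " / " + doc.getFieldValue("type_s"));
        }
    }
}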



Re: Dismax and Standard Queries together

2009-11-03 Thread Alexey Serba
Hi Ram,

You can add another field total ( catchall field ) and copy all other
fields into this field ( using copyField directive )
http://wiki.apache.org/solr/SchemaXml#Copy_Fields

and use this field in DisMax qf parameter, for example
qf=business_name^2.0 category_name^1.0 sub_category_name^1.0 total^0.0
and
mm=100%

Thus, it requires the occurrence of all search keywords in any field of
your document, but you can control the relevance of the returned results via
the boosts in the qf parameter.

HIH,
Alex

On Tue, Nov 3, 2009 at 12:02 AM, ram_sj rpachaiyap...@gmail.com wrote:

 Hi,

 I have three fields, business_name, category_name, sub_category_name in my
 solrconfig file.

 my query = pet clinic

 example sub_category_names: Veterinarians, Kennels, Veterinary Clinics
 Hospitals, Pet Grooming, Pet Stores, Clinics

 my ideal requirement is dismax searching on

 a. dismax over two or three fields
 b. followed by a Boolean match over any one of the fields being acceptable.

 I played around with the minimum match attribute, but it doesn't seem to be
 helpful; I guess dismax requires at least two fields.

 The nested queries take only one qf field, so it doesn't help much either.

 Any suggestions will be helpful.

 Thanks
 Ram
 --
 View this message in context: 
 http://old.nabble.com/Dismax-and-Standard-Queries-together-tp26157830p26157830.html
 Sent from the Solr - User mailing list archive at Nabble.com.
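
For completeness, a small SolrJ sketch of the suggested setup; the field names and boosts come from the thread, the rest (URL, class name) is illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CatchallDismaxQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("pet clinic");
        q.setQueryType("dismax");
        // strong boosts on the specific fields, zero boost on the catchall "total" field
        q.setParam("qf", "business_name^2.0 category_name^1.0 sub_category_name^1.0 total^0.0");
        q.setParam("mm", "100%");   // every keyword must match in some field

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}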




Re: adding and updating a lot of document to Solr, metadata extraction etc

2009-11-02 Thread Alexey Serba
Hi Eugene,

 - ability to iterate over all documents, returned in search, as Lucene does
  provide within a HitCollector instance. We would need to extract and
  aggregate various fields, stored in index, to group results and aggregate 
 them
  in some way.
 
 Also I did not find any way in the tutorial to access the search results with
 all fields to be processed by our application.

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Check out Faceted Search; you can probably achieve your goal by using
the Facet Component.

There's also Field Collapsing patch
http://wiki.apache.org/solr/FieldCollapsing


Alex


Re: Solr Cell on web-based files?

2009-11-02 Thread Alexey Serba
 e.g (doesn't work)
 curl http://localhost:8983/solr/update/extract?extractOnly=true
 --data-binary @http://myweb.com/mylocalfile.htm -H Content-type:text/html

 You might try remote streaming with Solr (see
 http://wiki.apache.org/solr/SolrConfigXml).

Yes, curl example

curl 
'http://localhost:8080/solr/main_index/extract/?extractOnly=true&indent=on&resource.name=lecture12&stream.url=http%3A//myweb.com/lecture12.ppt'

It works great for me.

Alex


Re: yellow pages navigation kind menu. howto take every 100th row from resultset

2009-10-05 Thread Alexey Serba
It seems that you need Faceted Search:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

On Fri, Oct 2, 2009 at 3:35 PM, Julian Davchev j...@drun.net wrote:
 Hi,

 Long story short:   how can I take every 100th row from solr resultset.
 What would syntax for this be.

 Long story:

 Currently I have lots of say documents(articles) indexed. They all have
 field title with corresponding value.

 atitle
 btitle
 .
 *title

 How do I build menu   so I can search of those?
 I cannot just hardcode  ABC  Dmeaning all starting
 with A all starting with B etc...cause there are unicode characters
 and english alphabet will just not cut it...

 So my idea is to make ranges like

 [atitle - mtitle][mtitle - ltitle] ...etc etc   (based on
 actual title names I got)


 Question is how do I figure out what those atitle-mtitle boundaries are (like
 getting every 100th record from a solr query)
 Two solutions I found:
 1. get all the stuff and do it server side (a huge load as it's thousands of
 records we are talking about)
 2. use solr sort and start and make N calls until the returned rows are fewer than
 100. But this will mean quite a load as well since there are lots of records.

 Any pointers?
 Thanks





Re: Keepwords Schema

2009-10-05 Thread Alexey Serba
Probably you want to use
- multivalued field 'authors'
<add>
  <doc>
    <field name="filename">login.php</field>
    <field name="authors">alex</field>
    <field name="authors">brian</field>
    ...
  </doc>
</add>
- return facets for this field
- you can filter unwanted authors either during the indexing process or by
post-processing the returned search results

On Fri, Oct 2, 2009 at 4:35 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Thu, Oct 1, 2009 at 7:37 PM, matrix_psj matrix_...@hotmail.com wrote:

 
 
  An example:
  My schema is about web files. Part of the syntax is a text field of
 authors
  that have worked on each file, e.g.
  <file>
    <filename>login.php</filename>
    <lastModDate>2009-01-01</lastModDate>
    <authors>alex, brian, carl carlington, dave alpha, eddie, dave beta</authors>
  </file>
 
  When I perform a search and get 20 web files back, I would like a facet
 of
  the individual authors, but only if there name appears in a
  public_authors.txt file.
 
  So if the public_authors.txt file contained:
  Anna,
  Bob,
  Carl Carlington,
  Dave Alpha,
  Elvis,
  Eddie,
 
  The facet returned would be:
  Carl Carlington
  Dave Alpha
  Eddie
 
 
 
  Not sure if that makes sense? If it does, could someone explain to me the
  schema fieldtype declarations that would bring back this sort of results.
 
 
 If I'm understanding you correctly - You want to facet on a field (with
 facet=true&facet.field=authors) but you want to show only certain
 whitelisted facet values in the response.

 If that is correct then, you can remove the authors which are not in the
 whitelist during indexing time. You can do this by adding
 KeepWordFilterFactory to your field type:

 <filter class="solr.KeepWordFilterFactory" words="author_whitelist.txt"
         ignoreCase="true" />

 --
 Regards,
 Shalin Shekhar Mangar.



Re: Disabling tf (term frequency) during indexing and/or scoring

2009-09-16 Thread Alexey Serba
Hi Aaron,

You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula ( see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html )

You need to

1) compile the following class and put it into Solr WEB-INF/classes
---
// note: "package" is a reserved word in Java, so the package is renamed here
package my.solr;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

    // disable length normalization: any non-empty field gets a norm of 1.0
    public float lengthNorm(String fieldName, int numTerms) {
        return numTerms > 0 ? 1.0f : 0.0f;
    }

    // disable term frequency: one or many occurrences score the same
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
---

2. Add <similarity class="my.solr.NoLengthNormAndTfSimilarity"/>
into your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca

HIH,
Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee ucbmc...@gmail.com wrote:
 Hello,

 Let me preface this by admitting that I'm still fairly new to Lucene and
 Solr, so I apologize if any of this sounds naive and I'm open to thinking
 about my problem differently.

 I'm currently responsible for a rather large dataset of business records
 that I'm trying to build a Lucene/Solr infrastructure around, to replace an
 in-house solution that we've been using for a few years. These records are
 sourced from multiple providers and there's often a fair bit of overlap in
 the business coverage. I have a set of fuzzy correlation libraries that I
 use to identify these documents and I ultimately create a super-record that
 includes metadata from each of the providers. Given the nature of things,
 these providers often have slight variations in wording or spelling in the
 overlapping fields (it's amazing how many ways people find to refer to the
 same business or address). I'd like to capture these variations, as they
 facilitate searching, but TF considerations are currently borking field
 scoring here.

 For example, taking business names into consideration, I have a Solr schema
 similar to:

  <field name="name_provider1" type="string" indexed="false" stored="false"
         multiValued="true"/>
  ...
  <field name="name_providerN" type="string" indexed="false" stored="false"
         multiValued="true"/>
  <field name="nameNorm" type="text" indexed="true" stored="false"
         multiValued="true" omitNorms="true"/>

  <copyField source="name_provider1" dest="nameNorm"/>
  ...
  <copyField source="name_providerN" dest="nameNorm"/>

 For any given business record, there may be 1..N business names present in
 the nameNorm field (some with naming variations, some identical). With TF
 enabled, however, I'm getting different match scores on this field simply
 based on how many providers contributed to the record, which is not
  meaningful to me. For example, a record containing <nameNorm>foo bar
  [positionIncrementGap] foo bar</nameNorm> is necessarily scoring higher
  than a record just containing <nameNorm>foo bar</nameNorm>.  Although I
 wouldn't mind TF data being considered within each discrete field value, I
 need to find a way to prevent score inflation based simply on the number of
 contributing providers.

 Looking at the mailing list archive and searching around, it sounds like the
 omitTf boolean in Lucene used to function somewhat in this manner, but has
 since taken on a broader interpretation (and name) that now also disables
 positional and payload data. Unfortunately, phrase support for fields like
 this is absolutely essential. So what's the best way to address a need like
 this? I guess I don't mind whether this is handled at index time or search
 time, but I'm not sure what I may need to override or if there's some
 existing provision I should take advantage of.

 Thank you for any help you may have.

 Best regards,
 Aaron



Re: do NOT want to stem plurals for a particular field, or words

2009-09-16 Thread Alexey Serba
  You can enable/disable stemming per field type in the schema.xml, by
 removing the stemming filters from the type definition.

 Basically, copy your prefered type, rename it to something like
 'text_nostem', remove the stemming filter from the type and use your
 'text_nostem' type for your field 'type' .
+ you can search in both fields, text_stemmed and text_exact, using the
DisMax handler and boost text_exact matches. Thus if you search for
'articles' you'll get all results with 'articles' and 'article', but
exact matches will be on top.


Re: query too long / has-many relation

2009-09-09 Thread Alexey Serba
 Is there a way to configure Solr to accept POST queries (instead of GET
 only?).
 Or: is there some other way to make Solr accept queries longer than 2,000
 characters? (Up to 10,000 would be nice)
Solr accepts POST queries by default. I switched to POST for exactly
the same reason. I use Solr 1.4 ( trunk version ) though.


 I have a Solr 1.3 index (served by Tomcat) of People, containing id, name,
 address, description etc. This works fine.
 Now I want to store and retrieve Events (time location, person), so each
 person has 0 or more events.
 As I understood it, there is no way to model a has-many relation in Solr (at
 least not between two structures with more than one property), so I decided
 to store the Events in a separate mysql table.
 An example of a query I would like to do is: give me all people that will
 have an Event on location x in the coming month, that have something in their
 description.
 I do this in two steps now: first I query the mysql table, then I build a
 solr query, with a big OR of all the ids.
 The problem is that this can generate long (too long) querystrings.
Another option would be to put all your event objects (time, location,
person_id, description) into the Solr index (normalization).
Then you can generate a Solr query "give me all events on location x
in the coming month that have something in their description" and ask Solr to
return facet values for the field person_id. Solr will return all
distinct values of the field person_id that match the query, with counts.
Then you can take the list of related person_ids and load all
persons from the MySQL database using a SQL IN () clause.
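
A sketch of that flow with SolrJ plus plain SQL string building (all field, URL and table names are assumptions for the example):

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EventPersonLookup {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("description:something");
        q.addFilterQuery("location:x");
        q.addFilterQuery("time:[NOW TO NOW+1MONTH]");
        q.setRows(0);                 // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("person_id");
        q.setFacetLimit(-1);          // all distinct person ids
        q.setFacetMinCount(1);

        QueryResponse rsp = solr.query(q);

        StringBuilder in = new StringBuilder();
        List<FacetField.Count> counts = rsp.getFacetField("person_id").getValues();
        for (FacetField.Count c : counts) {
            if (in.length() > 0) in.append(",");
            in.append(c.getName());
        }
        // load the matching persons from the relational database in one go
        String sql = "SELECT * FROM person WHERE id IN (" + in + ")";
        System.out.println(sql);
    }
}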


Re: query too long / has-many relation

2009-09-09 Thread Alexey Serba
 Is there a way to configure Solr to accept POST queries (instead of GET
 only?).
 Or: is there some other way to make Solr accept queries longer than 2,000
 characters? (Up to 10,000 would be nice)
 Solr accepts POST queries by default. I switched to POST for exactly
 the same reason. I use Solr 1.4 ( trunk version ) though.
Don't forget to increase maxBooleanClauses in solrconfig.xml
http://wiki.apache.org/solr/SolrConfigXml#head-69ecb985108d73a2f659f2387d916064a2cf63d1
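
To illustrate both points (a long query string sent as POST, with maxBooleanClauses raised high enough), a minimal SolrJ sketch; the id values and URL are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LongOrQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // build a query with thousands of OR-ed ids (needs maxBooleanClauses raised)
        StringBuilder q = new StringBuilder("id:(");
        for (int i = 0; i < 5000; i++) {
            if (i > 0) q.append(" OR ");
            q.append(i);
        }
        q.append(")");

        // send the query in the request body instead of the URL to avoid length limits
        QueryRequest req = new QueryRequest(new SolrQuery(q.toString()), SolrRequest.METHOD.POST);
        QueryResponse rsp = req.process(solr);
        System.out.println(rsp.getResults().getNumFound());
    }
}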


Re: query too long / has-many relation

2009-09-09 Thread Alexey Serba
 But apart from that everything works fine now (10,000 OR clauses take 10
 seconds).
Not fast.
I would recommend denormalizing your data, putting everything into the Solr
index and using Solr faceting
( http://wiki.apache.org/solr/SolrFacetingOverview ) to get the relevant
persons ( see my previous message )


Re: DisMax - fetching dynamic fields

2009-08-05 Thread Alexey Serba
My bad! Please disregard this post.

Alex

On Tue, Aug 4, 2009 at 9:21 PM, Alexey Serbaase...@gmail.com wrote:
 Solr 1.4 built from trunk revision 790594 ( 02 Jul 2009 )

 On Tue, Aug 4, 2009 at 9:19 PM, Alexey Serbaase...@gmail.com wrote:
 Hi everybody,

 I have a couple of dynamic fields in my schema, e.g. rating_* popularity_*

 The problem I have is that if I try to specify existing fields
 rating_1 popularity_1 in fl parameter - DisMax handler just
 ignores them whereas StandardRequestHandler works fine.

 Any clues what's wrong?

 Thanks in advance,
 Alex




DisMax - fetching dynamic fields

2009-08-04 Thread Alexey Serba
Hi everybody,

I have a couple of dynamic fields in my schema, e.g. rating_* popularity_*

The problem I have is that if I try to specify existing fields
rating_1 popularity_1 in fl parameter - DisMax handler just
ignores them whereas StandardRequestHandler works fine.

Any clues what's wrong?

Thanks in advance,
Alex


Re: DisMax - fetching dynamic fields

2009-08-04 Thread Alexey Serba
Solr 1.4 built from trunk revision 790594 ( 02 Jul 2009 )

On Tue, Aug 4, 2009 at 9:19 PM, Alexey Serbaase...@gmail.com wrote:
 Hi everybody,

 I have a couple of dynamic fields in my schema, e.g. rating_* popularity_*

 The problem I have is that if I try to specify existing fields
 rating_1 popularity_1 in fl parameter - DisMax handler just
 ignores them whereas StandardRequestHandler works fine.

 Any clues what's wrong?

 Thanks in advance,
 Alex