DIH caching URLDataSource/XPath entity (not root)

2016-12-19 Thread Chantal Ackermann
Hi there,

my index is created from XML files that are downloaded on the fly.
This also includes downloading a mapping file that is used to resolve IDs in 
the main file (root entity) and map them onto names.

The basic functionality works - the supplier_name is set for each document.
However, the mapping file is downloaded with every iteration of the root 
entity. In order to avoid this and only have it downloaded once and the mapping 
cached, I have set the cacheKey and cacheLookup properties but the file is 
still requested over and over again.

Has someone worked with multiple different XML files with mappings loaded via 
different DIH entities? I’d appreciate any samples or hints.
Or maybe someone is able to spot the error in the following configuration?

(The custom DataSource is a subclass of URLDataSource and handles Basic Auth as 
well as decompression.)
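
For reference, entity caching in a DIH data-config.xml is usually wired up along 
the following lines. This is only a rough sketch; all entity, field and URL names 
below are assumptions, not the original configuration:

<entity name="product" processor="XPathEntityProcessor"
        dataSource="myUrlDataSource"
        url="${dataimporter.request.importfile}"
        forEach="/products/product">
    <field column="supplier_id" xpath="/products/product/supplier_id"/>

    <!-- child entity: the mapping file is fetched once, cached in memory,
         and looked up per parent row via cacheKey/cacheLookup -->
    <entity name="supplier" processor="XPathEntityProcessor"
            dataSource="myUrlDataSource"
            url="http://example.com/supplier-mapping.xml"
            forEach="/suppliers/supplier"
            cacheImpl="SortedMapBackedCache"
            cacheKey="id"
            cacheLookup="product.supplier_id">
        <field column="id" xpath="/suppliers/supplier/id"/>
        <field column="supplier_name" xpath="/suppliers/supplier/name"/>
    </entity>
</entity>

Note that for a non-SQL entity the cacheKey/cacheLookup attributes alone may not 
enable caching unless a cacheImpl is also given, which could explain the file 
being requested on every iteration.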

Re: Nested JSON Facets (Subfacets)

2016-12-15 Thread Chantal Ackermann
> 
> Interesting I don't recall a bug like that being fixed.
> Anyway, glad it works for you now!
> -Yonik


Then it’s probably because it’s Christmas time! :-)

Re: Nested JSON Facets (Subfacets)

2016-12-15 Thread Chantal Ackermann
Hi Yonik,

after upgrading to Solr 6.3.0, the nested function works as expected! (Both 
with and without docValues.)

"facets":{
"count":3179500,
"all_pop":1.5901646171168616E8,
"shop_cat":{
  "buckets":[{
  "val":"Kontaktlinsen > Torische Linsen",
  "count":75168,
  "cat_sum":3752665.0497611803},


Thanks,
Chantal


> Am 15.12.2016 um 16:00 schrieb Chantal Ackermann <c.ackerm...@it-agenten.com>:
> 
> Hi Yonik,
> 
> are you certain that nesting a function works as documented on 
> http://yonik.com/solr-subfacets/?
> 
> top_authors:{ 
>type: terms,
>field: author,
>limit: 7,
>sort: "revenue desc",
>facet:{
>  revenue: "sum(sales)"
>}
>  }
> 
> 
> I’m getting the feeling that the function is never really executed because, 
> for my index, calling sum() with a non-number field (e.g. a multi-valued 
> string field) throws an error when *not nested* but does *not throw an error* 
> when nested:
> 
> json.facet={all_pop: "sum(gtin)"}
> 
>"error":{
> "trace":"java.lang.UnsupportedOperationException
>   at 
> org.apache.lucene.queries.function.FunctionValues.doubleVal(FunctionValues.java:47)
> 
> And the following does not throw an error but definitely should if the 
> function would be executed:
> 
>json.facet={all_pop:"sum(popularity)",shop_cat: {type:terms, 
> field:shop_cat, facet: {cat_pop:"sum(gtin)"}}}
> 
> returns:
> 
> "facets":{
>"count":2815500,
>"all_pop":1.4065865823321116E8,
>"shop_cat":{
>  "buckets":[{
>  "val":"Kontaktlinsen > Torische Linsen",
>  "count":75168,
>  "cat_pop":0.0},
>{
>  "val":"Damen-Mode/Inspirationen",
>  "count":47053,
>  "cat_pop":0.0},
> 
> For completeness: here is the field directive for „gtin“ with 
> „text_noleadzero“ based on „solr.TextField“:
> 
> <field name="gtin" type="text_noleadzero" … required="false" multiValued="true"/>
> 
> 
> Is this a bug or is the documentation a glimpse of the future? I will try 
> version 6.3.0, now. I was still on 6.1.0 for the above tests.
> (I have also tried with the „avg“ function, just to make sure, but same 
> there.)
> 
> Cheers,
> Chantal



Re: Nested JSON Facets (Subfacets)

2016-12-15 Thread Chantal Ackermann
Hi Yonik,

are you certain that nesting a function works as documented on 
http://yonik.com/solr-subfacets/?

top_authors:{ 
type: terms,
field: author,
limit: 7,
sort: "revenue desc",
facet:{
  revenue: "sum(sales)"
}
  }


I’m getting the feeling that the function is never really executed because, for 
my index, calling sum() with a non-number field (e.g. a multi-valued string 
field) throws an error when *not nested* but does *not throw an error* when 
nested:

json.facet={all_pop: "sum(gtin)"}

"error":{
"trace":“java.lang.UnsupportedOperationException
at 
org.apache.lucene.queries.function.FunctionValues.doubleVal(FunctionValues.java:47)

And the following does not throw an error but definitely should if the function 
would be executed:

json.facet={all_pop:"sum(popularity)",shop_cat: {type:terms, 
field:shop_cat, facet: {cat_pop:"sum(gtin)"}}}

returns:

"facets":{
"count":2815500,
"all_pop":1.4065865823321116E8,
"shop_cat":{
  "buckets":[{
  "val":"Kontaktlinsen > Torische Linsen",
  "count":75168,
  "cat_pop":0.0},
{
  "val":"Damen-Mode/Inspirationen",
  "count":47053,
  "cat_pop":0.0},

For completeness: here is the field directive for „gtin“ with „text_noleadzero“ 
based on „solr.TextField“:

 <field name="gtin" type="text_noleadzero" … required="false" multiValued="false" docValues="true"/>
> 
> I have also re-indexed (removed data/ and indexed from scratch). The 
> popularity field is populated with random values (as I don’t have the real 
> values from production) meaning that all documents have values > 0.
> 
> Here the statistics output:
> 
> "stats":{
>"stats_fields":{
>  "popularity":{
>"min":7.952374289743602E-5,
>"max":99.3896484375,
>"count":1687500,
>"missing":0,
>"sum":8.436878611434968E7,
>"sumOfSquares":5.626142812197906E9,
>"mean":49.9963176973924,
>"stddev":28.885623872869992},
> 
> And this is the relevant facet output from calling
> 
> /solr//query?
> json.facet={
> num_pop:{query: "popularity:[* TO *]"},
> all_pop: "sum(popularity)",
> shop_cat: {type:terms, field:shop_cat, facet: {cat_pop: 
> "sum(popularity)"}}}&q=*:*&rows=1&fl=popularity&wt=json
> 
> "facets":{
>"count":1687500,
>"all_pop":1.5893775613258794E8,
>"num_pop":{
>  "count":1687500},
>"shop_cat":{
>  "buckets":[{
>  "val":"Kontaktlinsen > Torische Linsen",
>  "count":75168,
>  "cat_pop":0.0},
>{
>  "val":"Neu",
>  "count":31772,
>  "cat_pop":0.0},
>{
>  "val":"Gesundheit & Schönheit > Gesundheitspflege",
>  "count":20281,
>  "cat_pop":0.0},
> [… more facets omitted]
> 
> 
> The /query handler is an edismax configuration, though I don’t think this 
> matters as long as the results include documents with popularity > 0 which is 
> the case as seen in the facet output (and sum() works in general for all of 
> the documents just not for the buckets as seen in „all_pop").
> 
> I will try to explicitly turn off the docValues and add stored=„true“ just to 
> try out more. If someone has any other suggestions that I should try - I 
> would be glad to hear them. If it is not possible to retrieve the sum in this 
> way I would have to fetch each sum separately which would be a huge 
> performance impact.
> 
> Thanks!
> Chantal
> 
> 
> 
> 
> 
>> Am 15.12.2016 um 10:16 schrieb CA :
>> 
>>> num_pop:{query:"popularity:[* TO *]"}
> 



Re: Nested JSON Facets (Subfacets)

2016-12-15 Thread Chantal Ackermann
Hi Yonik,


here is an update on what I’ve tried so far, unfortunately without any more 
luck.

The field directive is (should have included this when asking the question):

    /query?
json.facet={
num_pop:{query: "popularity:[* TO *]"},
all_pop: "sum(popularity)",
shop_cat: {type:terms, field:shop_cat, facet: {cat_pop: 
"sum(popularity)"}}}&q=*:*&rows=1&fl=popularity&wt=json

"facets":{
"count":1687500,
"all_pop":1.5893775613258794E8,
"num_pop":{
  "count":1687500},
"shop_cat":{
  "buckets":[{
  "val":"Kontaktlinsen > Torische Linsen",
  "count":75168,
  "cat_pop":0.0},
{
  "val":"Neu",
  "count":31772,
  "cat_pop":0.0},
{
  "val":"Gesundheit & Schönheit > Gesundheitspflege",
  "count":20281,
  "cat_pop":0.0},
[… more facets omitted]


The /query handler is an edismax configuration, though I don’t think this 
matters as long as the results include documents with popularity > 0 which is 
the case as seen in the facet output (and sum() works in general for all of the 
documents just not for the buckets as seen in „all_pop").

I will try to explicitly turn off the docValues and add stored=„true“ just to 
try out more. If someone has any other suggestions that I should try - I would 
be glad to hear them. If it is not possible to retrieve the sum in this way I 
would have to fetch each sum separately which would be a huge performance 
impact.

Thanks!
Chantal





> Am 15.12.2016 um 10:16 schrieb CA :
> 
>> num_pop:{query:"popularity:[* TO *]"}



Re: Blog Post: Integration Testing SOLR Index with Maven

2013-03-15 Thread Chantal Ackermann
Hi,

@Lance - thanks, it's a pleasure to give something back to the community. Even 
if it is comparatively small. :-)

@Paul - it's definitely not 15 min but rather 2 min. Actually, the testing part 
of this setup is very regular compared to other Maven projects. The copying of 
the WAR file and repackaging is not that time consuming. (This is still Maven - 
widely used and proven - it wouldn't be if it weren't practical, would it?)


Cheers,
Chantal

Blog Post: Integration Testing SOLR Index with Maven

2013-03-14 Thread Chantal Ackermann
Hi all,


this is not a question. I just wanted to announce that I've written a blog post 
on how to set up Maven for packaging and automatic testing of a SOLR index 
configuration.

http://blog.it-agenten.com/2013/03/integration-testing-your-solr-index-with-maven/

Feedback or comments appreciated!
And again, thanks for that great piece of software.

Chantal



Re: Blog Post: Integration Testing SOLR Index with Maven

2013-03-14 Thread Chantal Ackermann
Hi Paul,

I'm sorry I cannot provide you with any numbers. I also doubt it would be wise 
to post any as I think the speed depends highly on what you are doing in your 
integration tests.

Say you have several request handlers that you want to test (on different 
cores), and some more complex use cases like using output from one request 
handler as input to others. You would also import test data that would be 
representative enough to test these request handlers and use cases.

The requests themselves, of course, only take as long as SolrJ takes to run and 
SOLR takes to answer them.
In addition, there is the overhead of Maven starting up, running all the 
plugins, importing the data, executing the tests. Well, Maven is certainly not 
the fastest tool to start up and get going…

If you are asking because you want to run rather a lot of requests and test their 
output - JMeter might be preferable?

Hope that was not too vague an answer,
Chantal


Am 14.03.2013 um 09:51 schrieb Paul Libbrecht:

 Nice,
 
 Chantal can you indicate there or here what kind of speed for integration 
 tests you've reached with this, from a bare source to a successfully tested 
 application?
 (e.g. with 100 documents)
 
 thanks in advance
 
 Paul
 
 
 On 14 mars 2013, at 09:29, Chantal Ackermann wrote:
 
 Hi all,
 
 
 this is not a question. I just wanted to announce that I've written a blog 
 post on how to set up Maven for packaging and automatic testing of a SOLR 
 index configuration.
 
 http://blog.it-agenten.com/2013/03/integration-testing-your-solr-index-with-maven/
 
 Feedback or comments appreciated!
 And again, thanks for that great piece of software.
 
 Chantal
 
 



Re: Antwort: Re: Antwort: Re: Query during a query

2012-09-03 Thread Chantal Ackermann
Hi Johannes,

In production, SOLR is best used as a backend service to your actual web application:

Client (Browser) --- Web App --- Solr Server

Very much like a database. The processes are implemented in your Web App, and 
when they require results from Solr for whatever reason they simply query it.

Chantal




Am 03.09.2012 um 06:48 schrieb johannes.schwendin...@blum.com:

 The problem is, that I don't know how to do this. :P
 
 My sequence: the user enters his search words. This is sent to solr. There 
 I need to make another query first to get metadata from the index. with 
 this metadata I have to connect to an external source to get some 
 information about the user. With this information and the first search 
 words I query then the solr index to get the search result.
 
 I hope its clear now wheres my problem and what I want to do
 
 Regards,
 Johannes
 
 
 
 Von:
 Jack Krupansky j...@basetechnology.com
 An:
 solr-user@lucene.apache.org
 Datum:
 31.08.2012 15:03
 Betreff:
 Re: Antwort: Re: Query during a query
 
 
 
 So, just do another query before doing the main query. What's the problem? 
 
 Be more specific. Walk us through the sequence of processing that you 
 need.
 
 -- Jack Krupansky
 
 -Original Message- 
 From: johannes.schwendin...@blum.com
 Sent: Friday, August 31, 2012 1:52 AM
 To: solr-user@lucene.apache.org
 Subject: Antwort: Re: Query during a query
 
 Thanks for the answer, but I want to know how I can do a seperate query
 before the main query.
 And I only want this data in my programm. The user won't see it.
 I need the values from one field to get some information from an external
 source while the main query is executed.
 
 pravesh suyalprav...@yahoo.com schrieb am 31.08.2012 07:42:48:
 
 Von:
 
 pravesh suyalprav...@yahoo.com
 
 An:
 
 solr-user@lucene.apache.org
 
 Datum:
 
 31.08.2012 07:43
 
 Betreff:
 
 Re: Query during a query
 
 Did you checked SOLR Field Collapsing/Grouping.
 http://wiki.apache.org/solr/FieldCollapsing
 http://wiki.apache.org/solr/FieldCollapsing
 If this is what you are looking for.
 
 
 Thanx
 Pravesh
 
 
 
 --
 View this message in context: http://lucene.472066.n3.nabble.com/
 Query-during-a-query-tp4004624p4004631.html
 Sent from the Solr - User mailing list archive at Nabble.com. 
 
 



Re: Configure logging with Solr 4 on Tomcat 7

2012-08-27 Thread Chantal Ackermann

Drop the logging.properties file into the solr.war at WEB-INF/classes.

See here: 
http://lucidworks.lucidimagination.com/display/solr/Configuring+Logging
Section Tomcat Logging Settings
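
A minimal logging.properties along those lines could look like this (only a 
sketch; handler choice and log levels are assumptions, adjust as needed):

# java.util.logging configuration packaged into solr.war at WEB-INF/classes
handlers = java.util.logging.ConsoleHandler
.level = INFO
java.util.logging.ConsoleHandler.level = INFO
# raise or quieten individual packages as needed
org.apache.solr.level = INFO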

Cheers,
Chantal

Am 27.08.2012 um 16:43 schrieb Nicholas Ding:

 Hello,
 
 I've deployed Solr 4 on Tomcat 7, it is a multicore configuration,
 everything seems work fine, but I can't see any logs. How do I enable
 logging?
 
 Thanks
 Nicholas



Re: Solr Index problem

2012-08-24 Thread Chantal Ackermann

 Are you committing? You have to commit for them to be actually added….

If DIH says it did not add any documents (added 0 documents) committing won't 
help.

Likely, there is a problem with the mapping between DIH and the schema so that 
none of the fields make it into the index. We would need the DIH and the schema 
file, as Andy pointed out already.

Cheers,
Chantal



 

 -Original Message-
 From: ranmatrix S [mailto:ranmat...@gmail.com] 
 Sent: Thursday, August 23, 2012 5:46 PM
 To: solr-user@lucene.apache.org
 Subject: Solr Index problem
 
 Hi,
 
 I have setup Solr to index data from Oracle DB through DIH handler. However 
 through Solr admin I could see the DB connection is successfull, data 
 retrieved from DB to Solr but not added into index. The message is that 0 
 documents added even when I am able to see that 9 records are returned back.
 
 The schema and fields in db-data-config.xml are one and the same.
 
 Please suggest if anything I should look for.
 
 --
 Regards,
 Ran...



Re: Debugging DIH

2012-08-24 Thread Chantal Ackermann
 
 I don't see that you have anything in the DIH that tells what columns from 
 the query go into which fields in the index.  You need something like
 
 <field name="location" column="location" />
 <field name="amount" column="amount" />
 <field name="when" column="when" />
 

That is not completely true. If the columns have the same names as the fields, 
the mapping is redundant. Nevertheless, it might be the problem. What I've 
experienced with Oracle, at least, is that the columns would be returned in 
uppercase even if my alias would be in lowercase. You might force it by adding 
quotes, though. Or try adding

<field name="location" column="LOCATION" />
<field name="amount" column="AMOUNT" />
<field name="when" column="WHEN" />

You might check in your preferred SQL client how the column names are returned. 
It might be an indicator. (At least, in my case they would be uppercase in SQL 
Developer.)

Cheers,
Chantal

Re: Solr contribs build and jar-of-jars

2012-08-20 Thread Chantal Ackermann
Hi Lance,

does this do what you want?

http://maven.apache.org/plugins/maven-assembly-plugin/descriptor-refs.html#jar-with-dependencies

It's Maven, but that would be an advantage, I'd say… ;-)

Chantal

Am 05.08.2012 um 01:25 schrieb Lance Norskog:

 Has anybody tried packaging the contrib distribution jars in the
 jar-of-jars format? Or merging all included jars into one super-jar?
 
 The OpenNLP contrib has a Lucene analyze, 3 external jars, and Solr
 classes. Packaging this sucker is proving painful in the extreme. UIMA
 has the same problem. 'ant' has a task for generating the manifest
 class path for a jar-of-jars, and the technique actually works:
 
 http://ant.apache.org/manual/Tasks/manifestclasspath.html
 http://stackoverflow.com/questions/858766/generate-manifest-class-path-from-classpath-in-ant
 http://grokbase.com/t/ant/user/0213wdmn51/building-a-fileset-dynamically#20020103j47ufvwooklrovrjfdvirgohe4
 
 If this works completely, it seems like the right way to build the
 dist/ jars for the contribs.
 
 -- 
 Lance Norskog
 goks...@gmail.com



Re: How to update a core using SolrJ

2012-08-03 Thread Chantal Ackermann
Hi Roy,

the example URL is correct if your core is available under that name 
(configured in solr.xml) and has started without errors. I think I observed 
that it makes a different whether there is a trailing slash or not (but that 
was a while ago, so maybe that has changed).

If you can reach that URL via browser but SolrJ with exactly the same URL 
cannot, then
- maybe the SolrJ application is running in a different environment?
- there is authentication setup and you are authenticated via browser but SolrJ 
does not know of it
- ...?

Some log output would be definitely helpful.

Cheers,
Chantal


Am 02.08.2012 um 22:42 schrieb Benjamin, Roy:

 I'm using SolrJ and CommonsHttpSolrServer.
 
 Before moving to multi-core configuration I constructed CommonsHttpSolrServer 
 from "http://localhost:8080/solr", this worked fine.
 
 Now I have two cores.  I have tried constructing CommonsHttpSolrServer from 
 "http://localhost:8080/solr/core0" but this does
 not work.  The resource is not found when I try to add docs.
 
 How do I update Solr using SolrJ in a multi-core configuration?  What is the 
 correct form for the CommonsHttpSolrServer URL?
 
 Thanks!
 
 Roy



Re: matching with whole field

2012-08-02 Thread Chantal Ackermann
Hi Elisabeth,

try adding the same tokenizer chain for the query side as well, or simply remove the 
type="index" attribute from the analyzer element.

Your chain analyzes the input to the indexer, removing diacritics and 
lowercasing. With your current setup, the input to the search is not analyzed 
in the same way, so inputs that are not lowercased or that contain diacritics will not match.

You might want to use the analysis frontend in the Admin UI to see how input to 
the indexer and the searcher is transformed and matched.
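
To make that concrete: a corrected version of the fieldType quoted below, with a 
single analyzer so the same chain runs at index and at query time, might look 
roughly like this (filters taken over unchanged from your definition):

<fieldType name="ONLY_EXACT_MATCH_FIELD" class="solr.TextField"
           omitNorms="true" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="1" max="100"/>
    </analyzer>
</fieldType>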

Cheers,
Chantal

Am 02.08.2012 um 09:56 schrieb elisabeth benoit:

 Hello,
 
 I am using Solr 3.4.
 
 I'm trying to define a type that it is possible to match with only if
 request contains exactly the same words.
 
 Let's say I have two different values for ONLY_EXACT_MATCH_FIELD
 
 ONLY_EXACT_MATCH_FIELD: salon de coiffure
 ONLY_EXACT_MATCH_FIELD: salon de coiffure pour femmes
 
 I would like to match only with the first one when requesting Solr with
 fq=ONLY_EXACT_MATCH_FIELD:(salon de coiffure)
 
 As far has I understood, the solution is to do not tokenize on white
 spaces, and use instead solr.KeywordTokenizerFactory
 
 
 My actual type is defined as followed in schema.xml
 
    <fieldType name="ONLY_EXACT_MATCH_FIELD" class="solr.TextField"
               omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="1" max="100" />
      </analyzer>
    </fieldType>
 
 But matching with fields with more than one word doesn't work. Does someone
 have a clue what I am doing wrong?
 
 Thanks,
 Elisabeth



Re: Solr upgrade from 1.4 to 3.6

2012-08-01 Thread Chantal Ackermann
Hi Kalyan,

that is because SolrJ uses javabin as the format, which has class version numbers 
in the serialized objects that do not match. Set the format to XML (wt 
parameter) and it will work (maybe JSON would work as well).

Chantal
 

Am 31.07.2012 um 20:50 schrieb Manepalli, Kalyan:

 Hi all,
We are trying to upgrade our solr instance from 1.4 to 3.6. We 
 use SolrJ API to fetch the data from index. We see that SolrJ 3.6 version is 
 not compatible with index generated with 1.4.
 Is this known issue and is there a workaround for this.
 
 Thanks,
 Kalyan Manepalli
 



Re: Rebuild index after database change

2012-07-31 Thread Chantal Ackermann
Hi Rodrigo,

the data will only show in SOLR if the index is built after the data has been 
committed to the database it reads the data from.
If the data does not show up in the index there could be several reasons why 
that is:

a) different database
b) permissions prevent the data from being visible (I think this is unlikely as it 
does not seem from your description that it is restricted to certain 
tables/columns that are not seen at all)
c) the data inserts and updates have not been committed when the data is being 
requested by SolrJ
d) SolrJ requests the data before the new data has been inserted/updated and 
committed (well c) and d) are quite similar but in essence could be different)

It might be difficult to start SolrJ with a cron job if the database is updated 
at irregular times. Better might be to trigger the indexer (the SolrJ job that 
updates the SOLR index) either:

- from the database: you would have to check how the indexer is started and 
this script you would have to call from the database via trigger or callback or 
similar. Depends obviously on the possibilities your db offers and might also 
be not the best if there are several db instances and no defined master.
- via polling: a frequently running cron job that checks whether new data has been 
imported, and if so starts the indexer.

Hope this was of some help. If not you might have to provide more details on 
how the indexer is started, at the moment.

Cheers,
Chantal


Am 31.07.2012 um 04:18 schrieb Rodrigo P. Bregalanti:

 Hello,
 
 I am working on a Data warehouse project, which is making huge modifications
 at the database level directly. It is working fine, and everything is going
 on, but there is a third party application reading data from one of this
 databases. This application is using Solrj (with embedded server) and it is
 resulting in a big issue: the new data inserted directly into the database
 is not being showed by this application.
 
 I have researched a lot around that, but didn't find any way to make this
 new data available to this particular third party application.
 
 Is that something possible to do? Have someone faced this kind of issue
 before?
 
 Please let me know if I should put some additional detail.
 
 Thanks in advance.
 
 Best regards.  
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Rebuild-index-after-database-change-tp3998257.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Rebuild index after database change

2012-07-31 Thread Chantal Ackermann
Hi Rodrigo,

as I understand it, you know where the index is located.

If you can also find out where the configuration files are - if they are simply 
accessible on the FS - you could start a regular SOLR server that simply uses 
that config directory as SOLR_HOME (it will automatically use the correct data 
dir, or you can make sure it does by providing that data directory when you 
start the server).

To find the config files you would be looking for these files, ideally in the 
following directory structure:

single index:
  -conf/schema.xml
  -conf/solrconfig.xml

multiple indexes (cores):
  -solr.xml
  -core1/
 -conf/schema.xml
 -conf/solrconfig.xml
  -more core dirs …

There is no dedicated Wiki page describing the structure but this one might 
help:
http://wiki.apache.org/solr/CoreAdmin
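
For the multi-core case, a minimal solr.xml in the legacy format of that time 
might look like the following sketch (core names are placeholders):

<solr persistent="true">
    <cores adminPath="/admin/cores">
        <core name="core1" instanceDir="core1"/>
        <core name="core2" instanceDir="core2"/>
    </cores>
</solr>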

Also the examples in the sources reflect mostly the conventional structure 
except that the directory collection1 would be omitted:
https://builds.apache.org/job/Solr-trunk/ws/checkout/solr/example/solr/collection1/conf/
SOLR_HOME in this case is solr/ which contains conf/

If they are structured in that way you could simply start a regular 
solr.war as described in the SOLR documentation (e.g. via Jetty: 
http://wiki.apache.org/solr/SolrJetty), pointing it to that directory as 
SOLR_HOME. You would then have a running SOLR Server that you could update the 
way you suggested in your previous response.
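
With the Jetty that ships in the Solr example directory this boils down to 
something like the following (paths are placeholders):

cd apache-solr-3.x/example
java -Dsolr.solr.home=/path/to/that/config/directory -jar start.jar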

When in doubt about the configuration, include the content of the dataDir 
element in solrconfig.xml in the next posting. Also, describe the structure of 
the config directory and which files it includes. It might help. Or describe 
how the application initializes the embedded server.

If the files are not located in the FS but somewhere inside the application 
where you cannot easily reference them when starting Jetty or if it is 
configured programmatically, you would have to create your own configuration 
directory by copying/creating them in the expected structure. The dataDir 
entry in solrconfig.xml has to point to the same data directory as the embedded 
solr server uses.

Hu, I hope I was clear enough. Please ask if not.

Cheers,
Chantal

Am 31.07.2012 um 11:23 schrieb Rodrigo P. Bregalanti:

 Hi Chantal, thanks for replying.
 It is very helpfull, and I think I am in the right path.
 
 As the database is not changed during the night, my idea is to add a cron
 job to re-index that at this time. The main problem is there is no separate
 service indexing the data. The applicaton is using Java+Grails and a Grails
 plugin for Solr, which integrates to the grails domain classes. When a
 domain class is saved through the application, this plugins add/remove that
 from the index automatically.
 
 After some research I noticed that if there was a separate Solr server, I
 could post a call using HTTP to the Solr server, but the application is
 using an embedded server.
 
 I have found in the file system where the solr data and indexes are being
 saved, but I do not know if Solrj has some utility class that could be
 called from command line using those files, in a cron job schedule. Do you
 know if it is possible?
 
 Best regards.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Rebuild-index-after-database-change-tp3998257p3998311.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Skip first word

2012-07-27 Thread Chantal Ackermann
Hi Simone,

no I meant that you populate the two fields with the same input - best done via 
copyField directive.

The first field will contain ngrams of size 1 and 2. The other field will 
contain ngrams of size 3 and longer (you might want to set a decent maxsize 
there).

The query for the autocomplete list uses the first field when the input (typed 
in by the user) is one or two characters long. Your example was: D, G, or 
then Do or Ga. The result would search only on the single-token field that 
contains for the input Dolce & Gabbana only the ngrams D and Do. So, only 
the input D or Do would result in a hit on Dolce & Gabbana.
Once the user has typed in the third letter: Dol or Gab, you query the 
second, more tokenized field which would contain for Dolce & Gabbana the 
ngrams Dol Dolc Dolce Gab Gabb Gabba etc.
Both inputs Gab and Dol would then return Dolce & Gabbana.

1. First field type:

<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="2" 
side="front"/>

2. Second field type:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- maybe add WordDelimiter etc. -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" 
side="front"/>

3. field declarations:

<field name="short_prefix" type="short_ngram" … />
<field name="long_prefix" type="long_ngram" … />

<copyField source="short_prefix" dest="long_prefix" />


Chantal

Am 27.07.2012 um 11:05 schrieb Finotti Simone:

 Hi Chantal,
 
 if I understand correctly, this implies that I have to populate different 
 fields according to their lenght. Since I'm not aware of any logical 
 condition you can apply to copyField directive, it means that this logic has 
 to be implementend by the process that populates the Solr core. Is this 
 assumption correct?
 
 That's kind of bad, because I'd like to have this kind of rules in the Solr 
 configuration. Of course, if that's the only way... :)
 
 Thank you 
 
 
 Inizio: Chantal Ackermann [c.ackerm...@it-agenten.com]
 Inviato: giovedì 26 luglio 2012 18.32
 Fine: solr-user@lucene.apache.org
 Oggetto: Re: Skip first word
 
 Hi,
 
 use two fields:
 1. KeywordTokenizer (= single token) with ngram minsize=1 and maxsize=2 for 
 inputs of length < 3,
 2. the other one tokenized as appropriate with minsize=3 and longer for all 
 longer inputs
 
 
 Cheers,
 Chantal
 
 
 Am 26.07.2012 um 09:05 schrieb Finotti Simone:
 
 Hi Ahmet,
 business asked me to apply EdgeNGram with minGramSize=1 on the first term 
 and with minGramSize=3 on the latter terms.
 
 We are developing a search suggestion mechanism, the idea is that if the 
 user types D, the engine should suggest Dolce & Gabbana, but if we type 
 G, it should suggest other brands. Only if users type Gab it should 
 suggest Dolce & Gabbana.
 
 Thanks
 S
 
 Inizio: Ahmet Arslan [iori...@yahoo.com]
 Inviato: mercoledì 25 luglio 2012 18.10
 Fine: solr-user@lucene.apache.org
 Oggetto: Re: Skip first word
 
 is there a tokenizer and/or a combination of filter to
 remove the first term from a field?
 
 For example:
 The quick brown fox
 
 should be tokenized as:
 quick
 brown
 fox
 
 There is no such filter that i know of. Though, you can implement one with 
 modifying source code of LengthFilterFactory or StopFilterFactory. They both 
 remove tokens. Out of curiosity, what is the use case for this?
 
 
 
 
 
 
 
 
 



Re: Skip first word

2012-07-27 Thread Chantal Ackermann
Your're welcome :-)
C


Re: querying using filter query and lots of possible values

2012-07-26 Thread Chantal Ackermann
Hi Daniel,

index the id into a field of type tint or tlong and use a range query 
(http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

fq=id:[200 TO 2000]

If you want to exclude certain ids it might be wiser to simply add an exclusion 
query in addition to the range query instead of listing all the single values. 
You will run into problems with too long request urls. If you cannot avoid long 
urls you might want to increase maxBooleanClauses (see 
http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).
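
For example, a range combined with a handful of exclusions (the ids are made up) 
can be expressed as a single filter query:

fq=+source_id:[200 TO 2000] -source_id:(250 251 399)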

Cheers,
Chantal

Am 26.07.2012 um 18:01 schrieb Daniel Brügge:

 Hi,
 
 i am facing the following issue:
 
 I have couple of million documents, which have a field called source_id.
 My problem is, that I want to retrieve all the documents which have a
 source_id
 in a specific range of values. This range can be pretty big, so for example
 a
 list of 200 to 2000 source ids.
 
 I was thinking that a filter query can be used like fq=source_id:(1 2 3 4 5
 6 .)
 but this reminds me of SQLs WHERE IN (...) which was always bit slow for a
 huge
 number of values.
 
 Another solution that came into my mind was to assigned all the documents I
 want to
 retrieve a new kind of filter id. So all the documents which i want to
 analyse
 get a new id. But i need to update all the millions of documents for this
 and assign
 them a new id. This could take some time.
 
 Do you can think of a nicer way to solve this issue?
 
 Regards  greetings
 
 Daniel



Re: Skip first word

2012-07-26 Thread Chantal Ackermann
Hi,

use two fields:
1. KeywordTokenizer (= single token) with ngram minsize=1 and maxsize=2 for 
inputs of length < 3,
2. the other one tokenized as appropriate with minsize=3 and longer for all 
longer inputs


Cheers,
Chantal


Am 26.07.2012 um 09:05 schrieb Finotti Simone:

 Hi Ahmet,
 business asked me to apply EdgeNGram with minGramSize=1 on the first term and 
 with minGramSize=3 on the latter terms.
 
 We are developing a search suggestion mechanism, the idea is that if the user 
 types D, the engine should suggest Dolce  Gabbana, but if we type G, 
 it should suggest other brands. Only if users type Gab it should suggest 
 Dolce  Gabbana.
 
 Thanks
 S
 
 Inizio: Ahmet Arslan [iori...@yahoo.com]
 Inviato: mercoledì 25 luglio 2012 18.10
 Fine: solr-user@lucene.apache.org
 Oggetto: Re: Skip first word
 
 is there a tokenizer and/or a combination of filter to
 remove the first term from a field?
 
 For example:
 The quick brown fox
 
 should be tokenized as:
 quick
 brown
 fox
 
 There is no such filter that i know of. Though, you can implement one with 
 modifying source code of LengthFilterFactory or StopFilterFactory. They both 
 remove tokens. Out of curiosity, what is the use case for this?
 
 
 
 



Re: querying using filter query and lots of possible values

2012-07-26 Thread Chantal Ackermann
Hi Daniel,

depending on how you decide on the list of ids, in the first place, you could 
also create a new index (core) and populate it with DIH which would select only 
documents from your main index (core) in this range of ids. When updating you 
could try a delta import.

Of course, this is only worth the effort if that core would exist for some time 
- but you've written that the subset of ids is constant for a longer time.
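
Since Solr 3.6 the DIH can also read from another Solr index via the 
SolrEntityProcessor, so a data-config.xml for such a subset core might look 
roughly like this (URL, field name and query are assumptions):

<dataConfig>
    <document>
        <entity name="subset" processor="SolrEntityProcessor"
                url="http://localhost:8983/solr/main"
                query="source_id:[200 TO 2000]"
                rows="500" fl="*"/>
    </document>
</dataConfig>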

Just another idea on top ;-)
Chantal

Re: SOLR 4.0-ALPHA : DIH : Indexed and Committed Successfully but Index is empty

2012-07-25 Thread Chantal Ackermann
Hi Hoss,

 Did you perhaps forget to include RunUpdateProcessorFactory at the end?

What is that? ;-)
I had copied the config from http://wiki.apache.org/solr/UpdateRequestProcessor 
but removed the lines I thought I did not need. :-(

I've changed my configuration, and this is now WORKING (4.0-ALPHA):

<updateRequestProcessorChain name="emptyFieldChain">
    <processor class="solr.RemoveBlankFieldUpdateProcessorFactory" />
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="update.chain">emptyFieldChain</str>
        <str name="config">data-config.xml</str>
        <str name="clean">true</str>
        <str name="commit">true</str>
        <str name="optimize">true</str>
    </lst>
</requestHandler>

LogUpdateProcessorFactory and RunUpdateProcessorFactory were missing. Once 
those are in place, DIH does commit and optimize and the documents are visible 
immediately to the WebGUI and the Searchers.

It would be nice to have those two factories mentioned as required (or when you 
would want them in the config) in the example solrconfig.xml and the wiki:
https://builds.apache.org/job/Solr-trunk/ws/checkout/solr/example/solr/collection1/conf/solrconfig.xml
and http://wiki.apache.org/solr/UpdateRequestProcessor

Javadoc of RunUpdateProcessorFactory doesn't tell me so much. I can have a look 
at the source and try to write something wise on the wiki about it. :-D

Thanks Hoss!
Chantal


I'm going to update the other thread with how to handle empty number fields 
with either 3.6.1 or 4.0
http://lucene.472066.n3.nabble.com/NumberFormatException-while-indexing-TextField-with-LengthFilter-and-then-copying-to-tfloat-td3996250.html
 



Re: Invalid or unreadable WAR file : .../solr.war when starting solr 3.6.1 app on Tomcat 7 (upgrade to 7.0.29/upstream)?

2012-07-25 Thread Chantal Ackermann
Hi,

I haven't been following from the beginning but am still curious: is the war 
file on a shared fs?

See also:
http://www.mail-archive.com/users@tomcat.apache.org/msg79555.html
http://stackoverflow.com/questions/5493931/java-lang-illegalargumentexception-invalid-or-unreadable-war-file-error-in-op

If you have installed Tomcat via a package manager you might want to install 
it directly by simply unpacking the apache-tomcat-{version}.tar.gz and copying the 
solr.war file into the webapps/ subdirectory.
What the answers in the stackoverflow thread suggest is packaging something 
into the solr.war. You could add the logging.properties file (JULI config) 
under WEB-INF/classes/ - I would recommend that anyway. (I never had problems 
with a clean solr.war in Tomcat (5,6,7), though.)

Chantal


Am 24.07.2012 um 19:50 schrieb Chris Hostetter:

 
 : I removed distro pacakged Tomcat from the eqaation,
   ...
 : replacing it with an upstream instance
   ...
 : Repeating the process, at attempt to 'start' the /solr webapp, there's
 : no change.  I still get
   ...
 : java.lang.IllegalArgumentException: Invalid or unreadable WAR
 : file : /srv/solr_home/solr.war
 
 Are you sure you didn't accidently corrupt the war file in some way?
 
 what is the md5 or sha1sum of the war file you have?
 does jar tf solr.war give you any errors?
 
   ..
 
 I just used the following steps (basically the same as yours just 
 different paths) and got solr running in tomcat 7.0.29 with no 
 problems 
 
 hossman@frisbee:/var/tmp$ ls -al
 total 110188
 drwxrwxrwt  2 rootroot 4096 Jul 24 10:37 .
 drwxr-xr-x 13 rootroot 4096 Jul 18 09:34 ..
 -rw-rw-r--  1 hossman hossman 105132366 Jul 24 10:29 
 apache-solr-4.0.0-ALPHA.tgz
 -rw-rw-r--  1 hossman hossman   7679160 Jul  3 04:25 
 apache-tomcat-7.0.29.tar.gz
 -rw-rw-r--  1 hossman hossman   183 Jul 24 10:29 solr-context-file.xml
 hossman@frisbee:/var/tmp$ tar -xzf apache-solr-4.0.0-ALPHA.tgz 
 hossman@frisbee:/var/tmp$ tar -xzf apache-tomcat-7.0.29.tar.gz 
 hossman@frisbee:/var/tmp$ cp -r apache-solr-4.0.0-ALPHA/example/solr solr-home
 hossman@frisbee:/var/tmp$ cp 
 apache-solr-4.0.0-ALPHA/dist/apache-solr-4.0.0-ALPHA.war solr.war
 hossman@frisbee:/var/tmp$ sha1sum solr.war 
 51c9e4bf6f022ea3873ee315eb08a96687e71964  solr.war
 hossman@frisbee:/var/tmp$ md5sum solr.war 
 a154197657bb5cb9cee13fb5cfca931b  solr.war
 hossman@frisbee:/var/tmp$ cat solr-context-file.xml 
 <Context docBase="/var/tmp/solr.war" debug="0" crossContext="true">
   <Environment name="solr/home" type="java.lang.String"
 value="/var/tmp/solr-home" override="true" />
 </Context>
 hossman@frisbee:/var/tmp$ mkdir -p 
 apache-tomcat-7.0.29/conf/Catalina/localhost/
 hossman@frisbee:/var/tmp$ cp solr-context-file.xml 
 apache-tomcat-7.0.29/conf/Catalina/localhost/solr.xml
 hossman@frisbee:/var/tmp$ ./apache-tomcat-7.0.29/bin/catalina.sh start
 Using CATALINA_BASE:   /var/tmp/apache-tomcat-7.0.29
 Using CATALINA_HOME:   /var/tmp/apache-tomcat-7.0.29
 Using CATALINA_TMPDIR: /var/tmp/apache-tomcat-7.0.29/temp
 Using JRE_HOME:/usr/lib/jvm/java-6-openjdk-amd64/
 Using CLASSPATH:   
 /var/tmp/apache-tomcat-7.0.29/bin/bootstrap.jar:/var/tmp/apache-tomcat-7.0.29/bin/tomcat-juli.jar
 hossman@frisbee:/var/tmp$ 
 
 
 ...and now solr is up and running on http://localhost:8080/solr/ and there 
 are no errors in the logs.
 
 
 
 
 -Hoss



Re: NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

2012-07-25 Thread Chantal Ackermann
Here are the working solutions for:


3.6.1 (or lower probably)


via ScriptTransformer in data-config.xml:

function prepareData(row) {
    var cols = new java.util.ArrayList();
    cols.add("spent_hours");
    cols.add("estimated_hours");
    cols.add("story_points");
    cols.add("pos");
    for (var i = 0; i < cols.size(); i++) {
        var no = row.get(cols.get(i));
        if (no != null && no.trim().length() == 0) {
            row.remove(cols.get(i));
        }
    }
    return row;
}

In the XPathEntityProcessor, add the ScriptTransformer:
 transformer=script:prepareData,…

XPATHs:

<field column="spent_hours" xpath="/issues/issue/spent_hours" />
<field column="estimated_hours" xpath="/issues/issue/estimated_hours" />
<field column="story_points" xpath="/issues/issue/story_points" />
<field column="pos" xpath="/issues/issue/position" />

All of these fields are of type tfloat, required=false. They will only get a 
value if it is not empty or null.



4.0-ALPHA
**

No ScriptTransformer required, XPATH as above, same field type, 
required=false.

In the dataimporthandler configuration section in solrconfig.xml specify:

<updateRequestProcessorChain name="emptyFieldChain">
    <processor class="solr.RemoveBlankFieldUpdateProcessorFactory" />
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="update.chain">emptyFieldChain</str>
        <str name="config">data-config.xml</str>
        <str name="clean">true</str>
        <str name="commit">true</str>
        <str name="optimize">true</str>
    </lst>
</requestHandler>



Re: Autocomplete terms from the middle of name/description of a Doc

2012-07-25 Thread Chantal Ackermann
Hi Ugo,

You can use facet.prefix on a tokenized field instead of a String field.

Example:
<field name="product" type="string" … />
<field name="product_tokens" type="text_split" … /> <!-- use e.g. 
WhitespaceTokenizer or WordDelimiter and others, see example schema.xml that 
comes with SOLR -->

facet.prefix on product will only return hits that match the start of the 
single token stored in that field.
As product_tokens contains the value of product tokenized in a fashion that 
suites you, it can contain multiple tokens. facet.prefix on product_tokens 
will return hits that match *any* of these tokens - which is what you want.
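
A request along those lines might look like this (field names as above, the 
prefix being whatever the user has typed so far):

/solr/select?q=*:*&rows=0&facet=true&facet.field=product_tokens
    &facet.prefix=espresso&facet.mincount=1&facet.limit=10&wt=json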

Chantal

Am 25.07.2012 um 15:29 schrieb Ugo Matrangolo:

 Hi,
 
 I'm working on making our autocomplete engine a bit more smart.
 
 The actual impl is a basic facet based autocompletion as described in the
 'SOLR 3 Enterprise Search' book: we use all the typed tokens except the
 last one to build a facet.prefix query on an autocomplete facet field we
 built at index time.
 
 This allows us to have something like:
 
 'espress' -- '*espress*o machine', '*espress*o maker', etc
 
 We want something like:
 
 'espress' - '*espress*o machine', '*espress*o maker', 'kMix *espress*o
 maker'
 
 Note that the last suggested term could be not obtained by quering on the
 facet prefix as we do now. What we need is a way to find the 'espress'
 string in the middle of the name/description of our products.
 
 Any suggestions ?
 
 Cheers,
 Ugo



Re: Autocomplete terms from the middle of name/description of a Doc

2012-07-25 Thread Chantal Ackermann

 Suppose I have a product with a title='kMix Espresso maker'. If I tokenize
 this and put the result in product_tokens I should get
 '[kMix][Espresso][maker]'.
 
 If now I try to search with facet.field='product_tokens' and
 facet.prefix='espresso' I should get only 'espresso' while I want 'kMix
 Espresso maker'.

Yes, you are probably right. I did use this approach at some point. Your remark 
has made me check my code again.
I was using n_gram in the end.

(facet.prefix on tokenized fields might work in certain circumstances where you 
can get the actual value from the string field (or its facet) in parallel.)

This is the jquery autocomplete plugin instantiation:

$(function() {
        $("#qterm").autocomplete({
                minLength: 1,
                source: function(request, response) {
                        jQuery.ajax({
                                url: "/solr/select",
                                dataType: "json",
                                data: {
                                        q: "title_ngrams:\"" + request.term + "\"",
                                        rows: 0,
                                        facet: true,
                                        "facet.field": "title",
                                        "facet.mincount": 1,
                                        "facet.sort": "index",
                                        "facet.limit": 10,
                                        fq: "end_date:[NOW TO *]",
                                        wt: "json"
                                },
                                success: function(data) {
                                        /* var result = jQuery.map(data.facet_counts.facet_fields.title, function(item, index) {
                                                if (index % 2) return null;
                                                else return {
                                                        //label: item,
                                                        value: item
                                                }
                                        }); */
                                        var result = [];
                                        var facets = data.facet_counts.facet_fields.title;
                                        var j = 0;
                                        for (var i = 0; i < facets.length; i = i + 2) {
                                                result[j] = facets[i];
                                                j = j + 1;
                                        }
                                        response(result);
                                }
                        });
                }
        });
});

And here the fieldtype ngram for title_ngram. title is a string type field.

<!-- NGram configuration for searching for wordparts without the use of wildcards.
     This is for suggesting search terms e.g. sourcing an autocomplete widget. -->
<fieldType name="ngram" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LengthFilterFactory" min="1" max="500" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.ISOLatin1AccentFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
                splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
                generateNumberParts="1" catenateAll="1" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.ISOLatin1AccentFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
                splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
                generateNumberParts="1" catenateAll="0" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>

Hope this one gets you going…
Chantal

Re: NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

2012-07-24 Thread Chantal Ackermann
Hi Hoss,

thank you for the quick response and the explanations!

 My suggestion would be to modify the XPath expression you are using to 
 pull data out of your original XML files and ignore <estimated_hours/>
 

I don't think this is possible. That would include text() in the XPath which is 
not handled by the XPathRecordReader. I've checked in the code, as well, and 
the JavaDoc does not list this possibility. I've tried those patterns:

/issues/issue/estimated_hours[text()]
/issues/issue/estimated_hours/text()

No value at all will be added for that field for any of the documents 
(including those that do have a value in the XML).

 Alternatively: there are some new UpdateProcessors available in 4.0 that 
 let you easily prune field values based on various criteria (update 
 porcessors happen well before copyField)...
 
 http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html

Thanks for pointing me to it. I've switched to 4.0.0-ALPHA (hoping, the ALPHA 
doesn't show itself too often ;-) ).

For anyone interested, my DataImportHandler Setup in solrconfig.xml now reads:

<updateRequestProcessorChain name="emptyFieldChain">
    <processor class="solr.RemoveBlankFieldUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="update.chain">emptyFieldChain</str>
        <str name="config">data-config.xml</str>
        <str name="clean">true</str>
        <str name="commit">true</str>
        <str name="optimize">true</str>
    </lst>
</requestHandler>

Works as expected!

And kudos to those working on the admin frontend, as well! The new admin is 
indeed slick!



 But i can certainly understand the confusion, i've opened SOLR-3657 to try 
 and improve on this.  Ideally the error message should make it clear that 
 the value from source field was copied to dest field which then 
 encountered error
 

Thank you! Good Exception messages are certainly helpful!

Chantal



SOLR 4.0-ALPHA : DIH : Indexed and Committed Successfully but Index is empty

2012-07-24 Thread Chantal Ackermann
Hi there,

sorry for the length - it is mostly (really) log output. The basic issue is 
reflected in the subject: DIH runs fine, but even with an extra optimize on top 
(which should not be necessary given my DIH config) the index remains empty.

(I have changed from 3.6.1 to 4.0-ALPHA because of Hoss' answer to my question 
NumberFormatException while indexing TextField with LengthFilter (on this 
same list). I had an index setup with 4.0-ALPHA today, I could verify that 
Hoss' suggestion works. But now, I seem not to be able to get that index filled 
yet another time.
SOLR runs inside Jetty which is started via mvn jetty:run-war. SOLR_HOME is 
set to a subdirectory of maven's target dir. I have been using this setup 
successfully with SOLR 3.* for some time, now. While configuring the index, I 
often do a mvn clean; mvn jetty:run-war so SOLR_HOME including the index is 
completely removed and recreated from scratch.)


After running a full import of DIH on core issues using:
http://localhost:9090/solr/issues/dataimport?command=full-importimportfile=/absolute/path/to/issues.xml

I get the response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="update.chain">emptyFieldChain</str>
      <str name="config">data-config.xml</str>
      <str name="clean">true</str>
      <str name="commit">true</str>
      <str name="optimize">true</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">294</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-07-24 15:46:27</str>
    <str name="">
      Indexing completed. Added/Updated: 294 documents. Deleted 0 documents.
    </str>
    <str name="Committed">2012-07-24 15:46:28</str>
    <str name="Optimized">2012-07-24 15:46:28</str>
    <str name="Total Documents Processed">294</str>
    <str name="Time taken">0:0:0.605</str>
  </lst>
  <str name="WARNING">
    This response format is experimental. It is likely to change in the future.
  </str>
</response>

Meaning that everything went fine including commit and optimize and the index 
should now contain 294 documents. Well, it doesn't.
Trying to get it working again, I have now replaced large parts of my 
solrconfig.xml with the new parts taken from the current 4.0-ALPHA 
(https://builds.apache.org/job/Solr-trunk/ws/checkout/) but this doesn't change 
a thing. The schema version is set to 1.5.



When starting the server it outputs:

24.07.2012 16:00:16 org.apache.solr.core.SolrCore init
INFO: [issues] Opening new SolrCore at target/classes/core_issues/, 
dataDir=target/classes/core_issues/data/
…
24.07.2012 16:00:16 org.apache.solr.core.SolrCore getNewIndexDir
WARNUNG: New index directory detected: old=null 
new=target/classes/core_issues/data/index/
24.07.2012 16:00:16 org.apache.solr.core.SolrCore initIndex
WARNUNG: [issues] Solr index directory 'target/classes/core_issues/data/index' 
doesn't exist. Creating new index...
24.07.2012 16:00:16 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=1

commit{dir=/path/to/maven-project/target/classes/core_issues/data/index,segFN=segments_1,generation=1,filenames=[segments_1]
24.07.2012 16:00:16 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1
…
24.07.2012 16:00:16 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@920ab60 main
24.07.2012 16:00:16 org.apache.solr.core.SolrCore registerSearcher
INFO: [issues] Registered new searcher Searcher@920ab60 
main{StandardDirectoryReader(segments_1:1)}
24.07.2012 16:00:16 org.apache.solr.update.CommitTracker init
INFO: Hard AutoCommit: if uncommited for 15000ms; 
24.07.2012 16:00:16 org.apache.solr.update.CommitTracker init
INFO: Soft AutoCommit: disabled
24.07.2012 16:00:16 org.apache.solr.handler.dataimport.DataImportHandler 
processConfiguration
INFO: Processing configuration from solrconfig.xml: 
{update.chain=emptyFieldChain,config=data-config.xml,clean=true,commit=true,optimize=true}
24.07.2012 16:00:16 org.apache.solr.handler.dataimport.DataImporter 
loadDataConfig
INFO: Data Configuration loaded successfully
24.07.2012 16:00:16 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@920ab60 
main{StandardDirectoryReader(segments_1:1)}
24.07.2012 16:00:16 org.apache.solr.core.CoreContainer register
INFO: registering core: issues



When running the DIH full import, the log output is:

24.07.2012 16:00:31 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
24.07.2012 16:00:31 org.apache.solr.core.SolrCore execute
INFO: [issues] webapp=/solr path=/dataimport 
params={command=full-importimportfile=/path/to/maven-project/src/test/resources/issues.xml}
 status=0 QTime=4 
24.07.2012 16:00:31 org.apache.solr.handler.dataimport.SimplePropertiesWriter 
readIndexerProperties
WARNUNG: Unable to read: dataimport.properties
24.07.2012 16:00:32 org.apache.solr.handler.dataimport.DocBuilder 

NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

2012-07-20 Thread Chantal Ackermann
Hi all,

I'm trying to index float values that are not required, input is an XML file. I 
have problems avoiding the NFE.
I'm using SOLR 3.6.



Index input:
- XML using DataImportHandler with XPathProcessor

Data:
Optional, Float, CDATA like: <estimated_hours>2.0</estimated_hours> or 
<estimated_hours/>

Original Problem:
Empty values would cause a NumberFormatException when being loaded directly 
into a tfloat type field.

Processing chain (to avoid NFE):
via XPath loaded into a field of type text with a trim and length filter, then 
via copyField directive into the tfloat type field

data-config.xml:
<field column="s_estimated_hours" xpath="/issues/issue/estimated_hours" />

schema.xml:
<types>...
    <fieldtype name="text_not_empty" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.TrimFilterFactory" />
            <filter class="solr.LengthFilterFactory" min="1" max="20" />
        </analyzer>
    </fieldtype>
</types>

<fields>...
    <field name="estimated_hours" type="tfloat" indexed="true" stored="true" required="false" />
    <field name="s_estimated_hours" type="text_not_empty" indexed="false" stored="false" />
</fields>

<copyField source="s_estimated_hours" dest="estimated_hours" />

Problem:
Well, yet another NFE. But this time reported on the text field 
s_estimated_hours:

WARNUNG: Error creating document : SolrInputDocument[{id=id(1.0)={2930}, 
s_estimated_hours=s_estimated_hours(1.0)={}}]
org.apache.solr.common.SolrException: ERROR: [doc=2930] Error adding field 
's_estimated_hours'=''
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at 
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:66)
at 
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:723)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.NumberFormatException: empty String
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:992)
at java.lang.Float.parseFloat(Float.java:422)
at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
at org.apache.solr.schema.FieldType.createFields(FieldType.java:289)
at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:107)
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:312)
... 11 more


It is like it would copy the empty value - which must not make it through the 
LengthFilter of s_estimated_hours - to the tfloat field estimated_hours 
anyway. How can I avoid this? Or is there any other way to make the indexer 
ignore the empty values when creating the tfloat fields? If it could at least 
create the document and enter the other values… (onError=continue is not 
helping as this is only a Warning (I've tried))


BTW: I did try with the XPath that should only select those nodes with text: 
/issues/issue/estimated_hours[text()]
The result was that no values would make it into the tfloat fields while all 
documents would be indexed without warnings or errors. (I discarded this option 
thinking that the xpath was not correctly evaluated.)


Thank you for any suggestions!
Chantal

Re: Velocity substring issue

2012-03-28 Thread Chantal Ackermann

Hi Henri,

you have not provided very much information, so, here comes a guess:

try ${bdte1} instead of $bdte1 - maybe Velocity resolves $bdte and
concatenates the 1 instead of trying the longer name as a variable first.

Chantal


On Wed, 2012-03-28 at 12:04 +0200, henri.gour...@laposte.net wrote:
 The following code fails on the $bdte1 substring. Both $bdte and $bdte1
 appear to be identical!
 
 triggers the following error message:
 
 The problem persiste with various values of the indices.
 
 Am I missing something?
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Velocity-substring-issue-tp3864088p3864088.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problem witch adding classpath

2012-03-16 Thread Chantal Ackermann

Hi,

I put all those jars into SOLR_HOME/lib. I do not specify them in
solrconfig.xml explicitly, and they are all found all right.

Would that be an option for you?

Chantal


On Thu, 2012-03-15 at 17:43 +0100, ViruS wrote:
 Hello,
 
 I just now try to switch from 3.4.0 to 3.5.0 ... i make new instance and
 when I try use same config for adding libaries i have error.
 SEVERE: java.lang.NoClassDefFoundError:
 org/apache/lucene/analysis/TokenStream
 This error only show when i use polish stempel.
 In config i have set (solr/vrs/conf/solrconfig.xml):
   <lib path="../../../dist/lucene-stempel-3.5.0.jar" />
   <lib path="../../../dist/apache-solr-analysis-extras-3.5.0.jar" />
 
 
 When I start Solr is adding path:
 INFO: Adding specified lib dirs to ClassLoader
 2012-03-15 17:35:51 org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/home/virus/appl/apache-solr-3.5.0/dist/lucene-stempel-3.5.0.jar' to
 classloader
 2012-03-15 17:35:51 org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/home/virus/appl/apache-solr-3.5.0/dist/apache-solr-analysis-extras-3.5.0.jar'
 to classloader
 
 Same problem I have witch Velocity
 In config (solr/ac/conf/solrconfig.xml:
 <lib dir="../../../contrib/velocity/lib" />
 ...
 <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"
 enable="true"/>
 
 When I satrt have this error:
 SEVERE: org.apache.solr.common.SolrException: Error Instantiating
 QueryResponseWriter, solr.VelocityResponseWriter is not a
 org.apache.solr.response.QueryResponseWriter
 INFO: Adding specified lib dirs to ClassLoader
 2012-03-15 17:40:17 org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/home/virus/appl/apache-solr-3.5.0/contrib/velocity/lib/velocity-tools-2.0.jar'
 to classloader
 
 
 
 Full start log here: http://piotrsikora.pl/solr.log
 
 
 Thanks in advanced!
 



Re: 400 Error adding field 'tags'='[a,b,c]'

2012-03-15 Thread Chantal Ackermann
Hi Alp,

if you have not changed how SOLR logs in general, you should find the
log output in the regular server logfile. For Tomcat you can find this
in TOMCAT_HOME/catalina.out (or search for that name).

If there is a problem with your schema, SOLR should be complaining about
it during application/server start up. It would definitely print
something if there is a field declared in your schema but cannot be
initialized for some reason.

I don't think that the names of the fields themselves are the problem. I
never had an issue with the field name 'name'.

Cheers,
Chantal


On Wed, 2012-03-14 at 02:53 +0100, jlark wrote:
 Interestingly I'm getting this on other fields now.
 
 I have the field <field name="name" type="text_general" indexed="true"
 stored="true" />
 
 which is copied to text: <copyField source="name" dest="text"/>
 
 and my text field is simply <field name="text" type="text_general"
 indexed="true" stored="true" />
 
 I'm feedin my test document
 
 {url : TestDoc2, title : another test, ptag:[a,b],name:foo
 bar},
 
 and when I try to feed I get.
 
 HTTP request sent, awaiting response... 400 ERROR: [doc=TestDoc2] Error
 adding field 'name'='foo bar'
 
 If I remove the field from the document though it works fine.
 I'm wondering if there is a set of reserved names that I'm using at this
 point.
 
 Just wish there was a way to get more helpful error messages.
 
 Thanks for the help.
 Alp
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/400-Error-adding-field-tags-a-b-c-tp3823853p3824126.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: sun-java6 alternatives for Solr 3.5

2012-02-28 Thread Chantal Ackermann
You can download Oracle's Java (which was Sun's) from Oracle directly.
You will have to create an account with them. You can use the same
account for reading the java forum and downloading other software like
their famous DB.

Simply download. JDK6 is still a binary as were all Sun packages before.
Do a chmod +x and run it. You have to accept the license, and then it
unpacks itself in that same directory - no root privileges required.

As of JDK 7 you can download tar.gz packages.

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Actually, you're better off downloading and installing it yourself
because you can have several different versions in parallel and the
automatic updates do not override your installed version. That comes in
handy if you are a Java developer, at least...

Cheers,
Chantal


On Mon, 2012-02-27 at 21:38 +0100, Demian Katz wrote:
 For what it's worth, I run Solr 3.5 on Ubuntu using the OpenJDK packages and 
 I haven't run into any problems.  I do realize that sometimes the Sun JDK has 
 features that are missing from other Java implementations, but so far it 
 hasn't affected my use of Solr.
 
 - Demian
 
  -Original Message-
  From: ku3ia [mailto:dem...@gmail.com]
  Sent: Monday, February 27, 2012 2:25 PM
  To: solr-user@lucene.apache.org
  Subject: sun-java6 alternatives for Solr 3.5
  
  Hi all!
  I had installed an Ubuntu 10.04 LTS. I had added a 'partner' repository to
  my sources list and updated it, but I can't find a package sun-java6-*:
  root@ubuntu:~# apt-cache search java6
  default-jdk - Standard Java or Java compatible Development Kit
  default-jre - Standard Java or Java compatible Runtime
  default-jre-headless - Standard Java or Java compatible Runtime (headless)
  openjdk-6-jdk - OpenJDK Development Kit (JDK)
  openjdk-6-jre - OpenJDK Java runtime, using Hotspot JIT
  openjdk-6-jre-headless - OpenJDK Java runtime, using Hotspot JIT (headless)
  
  Than I had goggled and found an article:
  https://lists.ubuntu.com/archives/ubuntu-security-announce/2011-
  December/001528.html
  
  I'm using Solr 3.5 and Apache Tomcat 6.0.32.
  Please advice me what I must do in this situation, because I always used
  sun-java6-* packages for Tomcat and Solr and it worked fine
  Thanks!
  
  --
  View this message in context: http://lucene.472066.n3.nabble.com/sun-java6-
  alternatives-for-Solr-3-5-tp3781792p3781792.html
  Sent from the Solr - User mailing list archive at Nabble.com.



Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread Chantal Ackermann
Hi Ritesh,

you could add another field that contains the size of the list in the
AREFS field. This way you'd simply sort by that field in descending
order.

Should you update AREFS dynamically, you'd have to update the field with
the size, as well, of course.
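
Just to illustrate with made-up names: an int field such as

<field name="arefs_count" type="int" indexed="true" stored="true" />

filled with the size of the AREFS list at indexing time, and then a query like

q=AT:metal&sort=arefs_count desc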

Chantal

On Thu, 2012-02-23 at 11:27 +0100, rks_lucene wrote:
 Hi,
 
 I have a journal article citation schema like this:
 {  AT - article_title
AID - article_id (Unique id)
AREFS - article_references_list (List of article id's referred/cited in
 this article. Multi-valued)
AA - Article Abstract
---
other_article_stuff
...
 }
 
 So for example, in order to search for all those articles that refer(cite)
 article id 51643, I simply need to search for AREFS:51643 and it will give
 me the list of articles that have 51643 listed in AREFS.
 
 Now, I want to be able to search in the text of articles and sort the
 results by most referred articles. How can I do this ?
 
 Say if my search query is q=AT:metal and it gives me 1700 results. How can I
 sort 1700 results by those that have received maximum number of citations by
 others.
 
 I have been researching function queries to solve this but have been unable
 to do so.
 
 Thanks in advance.
 Ritesh
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769315.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread Chantal Ackermann
Sorry to have misunderstood.
It seems the new Relevance Functions in Solr 4.0 might help - unless you
need to use an official release.

http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions



On Thu, 2012-02-23 at 13:04 +0100, rks_lucene wrote:
 Dear Chantal,
 
 Thanks for your reply, but thats not what I was asking.
 
 Let me explain. The size of the list in AREFS would give me how many records
 are *referred by* an article and NOT how many records *refer to* an article.
 
 Say if an article id - 51463 has been published in 2002 and refers to 10
 articles dating from 1990-2002. Then the count of AREFS would be 10 which is
 static once the journal has been published.
 
 However if the same article is being *referred to* by 20 articles published
 from 2003-2012 then I am talking about this 20 count. This count is dynamic
 and as we keep adding records to the index, there are more articles that
 will refer to article 51463 it in their AREFS field in the future.
 /(Obviously when we are adding article 51463 to the index we have no clue
 who will be referring to it in the future, so we can have another field in
 it for this, nor can be update 51463 everytime someone refers to it)/
 
 So today, if I want to know who all are referring to 51463, by actually
 searching for this id in the AREFS field. The query is as simple as
 q=AREFS:51463 and it will given the list of articles from 2003 to 2012 and
 the result count would be 20.
 
 So back to the question, say if my search query is q=AT:metal and it gives
 me 1700 results. How can I 
 sort 1700 results by those that have received maximum number of citations
 (till date) by others. (i.e., that have maximum number of results if I
 individually search their ids in the AREFS field).
 
 Hope this makes it clear. I feel this is a sort/boost by function query
 candidate. But I am not able to figure it out.
 
 Thanks
 Ritesh  
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769475.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Chantal Ackermann
If your script turns out too complex to maintain, and you are developing
in Java, anyway, you could extend EntityProcessor and handle the data in
a custom way. I've done that to transform a datamart like data structure
back into a row based one.

Basically you override the method that gets the data in a Map and
transform it into a different Map which contains the fields as
understood by your schema.

Chantal


On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
 Hi Baranee,
 
 Some time ago I played with
 http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was a
 pretty good stuff.
 
 Regards
 
 
 On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan 
 baraneethara...@hp.comwrote:
 
  To avoid that we don't want to mention the column names in the field tag ,
  but want to write a query to map all the fields in the table with solr
  fileds even if we don't know, how many columns are there in the table.  I
  need a kind of loop which runs through all the query results and map that
  with solr fileds.
 
 
 
 



Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Chantal Ackermann
Make sure your Tomcat instances are started each with a max heap size
that adds up to something a lot lower than the complete RAM of your
system.
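
For example (the numbers are only placeholders, adjust them to your machine),
in Tomcat's bin/setenv.sh:

JAVA_OPTS="$JAVA_OPTS -Xms512m -Xmx1024m"

If two Tomcat instances run on an 8GB box, their -Xmx values should add up to
well below 8GB so that the OS and its file cache keep some room.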

Frequent Garbage collection means that your applications request more
RAM but your Java VM has no more resources, so it requires the Garbage
Collector to free memory so that the requested new objects can be
created. It's not indicating a memory leak unless you are running a
custom EntityProcessor in DIH that runs into an infinite loop and
creates huge amounts of schema fields. ;-)

Also - if you are doing hot deploys on Tomcat, you will have to restart
the Tomcat instance on a regular basis as hot deploys DO leak memory
after a while. (You might be seeing class undeploy messages in
catalina.out and later on OutOfMemory error messages.)

If this is not of any help you will probably have to provide a bit more
information on your Tomcat and SOLR configuration setup.

Chantal


On Thu, 2012-02-16 at 16:22 +0100, Matthias Käppler wrote:
 Hey everyone,
 
 we're running into some operational problems with our SOLR production
 setup here and were wondering if anyone else is affected or has even
 solved these problems before. We're running a vanilla SOLR 3.4.0 in
 several Tomcat 6 instances, so nothing out of the ordinary, but after
 a day or so of operation we see increased response times from SOLR, up
 to 3 times increases on average. During this time we see increased CPU
 load due to heavy garbage collection in the JVM, which bogs down the
 the whole system, so throughput decreases, naturally. When restarting
 the slaves, everything goes back to normal, but that's more like a
 brute force solution.
 
 The thing is, we don't know what's causing this and we don't have that
 much experience with Java stacks since we're for most parts a Rails
 company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
 seeing this, or can you think of a reason for this? Most of our
 queries to SOLR involve the DismaxHandler and the spatial search query
 components. We don't use any custom request handlers so far.
 
 Thanks in advance,
 -Matthias
 



Re: MoreLikeThis Question

2012-02-15 Thread Chantal Ackermann
Hi,

you would not want to include the unique ID and similar stuff, though?
No idea whether it would impact the number of hits but it would most
probably influence the scoring if nothing else.

E.g. if you compare by certain fields, I would expect that a score of
1.0 indicates a match on all of those fields (haven't tested that
explicitly, though). If the unique ID is included you could never reach
that score.

Just my 2 cents...

Chantal


On Wed, 2012-02-15 at 07:27 +0100, Jamie Johnson wrote:
 Is there anyway with MLT to say get similar based on all fields or is
 it always a requirement to specify the fields?



Re: Solr as an part of api to unburden databases

2012-02-15 Thread Chantal Ackermann
  
  does anyone of the maillinglist users use solr as an API to avoid database
  queries? [...]
 
 Like in a... cache?
 
 Why not use a cache then? (memcached, for example, but there are more).
 

Good point. A cache only uses lookup by one kind of cache key while SOLR
provides lookup by ... well... any search configuration that your index
setup (mainly the schema) supports.

If the database queries always do a find by unique id, then use a
cache. Otherwise using SOLR is a valid option.


Chantal



Re: Error Indexing in solr 3.5

2012-02-15 Thread Chantal Ackermann
Hi,

I've got these errors when my client used a different SolrJ version from
the SOLR server it connected to:

SERVER 3.5  responding --- CLIENT some other version

You haven't provided any information on your client, though.

Chantal

On Wed, 2012-02-15 at 13:09 +0100, mechravi25 wrote:
 Hi,
 
 When I tried to index in solr 3.5 i got the following exception
 
 org.apache.solr.client.solrj.SolrServerException: Error executing query
   at
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at com.quartz.test.FullImport.callIndex(FullImport.java:80)
   at
 com.quartz.test.GetObjectTypes.checkObjectTypeProp(GetObjectTypes.java:245)
   at com.quartz.test.GetObjectTypes.execute(GetObjectTypes.java:640)
   at com.quartz.test.QuartzSchedMain.main(QuartzSchedMain.java:55)
 Caused by: java.lang.RuntimeException: Invalid version or the data in not in
 'javabin' format
   at 
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
   at
 org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
   at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
   at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
   at
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
 
 
 
 I placed the latest solrj 3.5 jar in the example/solr/lib directory and then
 re-started the same but still I am getting the above mentioned exception. 
 
 Please let me know if I am missing anything.
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Error-Indexing-in-solr-3-5-tp3746735p3746735.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet on TrieDateField field without including date

2012-02-15 Thread Chantal Ackermann
I've done something like that by calculating the hours during indexing
time (in the script part of the DIH config using java.util.Calendar
which gives you all those field values without effort). I've also
extracted information on which weekday it is (using the integer
constants of Calendar).
If you need this only for one timezone it is straight forward but if the
queries come from different time zones you'll have to shift
appropriately.

I found that pre-calculating has the advantage that you end up with very
simple data: simple integers. And it makes it quite easy to build more
complex queries on that. For example I have created a grid (build from
facets) where the columns are the weekdays and the rows are the hours of
day. The facets are created using a field containing the combination of
weekday and hour of day.
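
As an illustration only (the date column and the target field names are
invented), a ScriptTransformer in data-config.xml can do the calculation per
row:

<script><![CDATA[
  function addTimeFields(row) {
    var cal = java.util.Calendar.getInstance();
    cal.setTime(row.get('purchase_date'));  // a java.util.Date coming from the DB
    row.put('hour_of_day', cal.get(java.util.Calendar.HOUR_OF_DAY));
    row.put('weekday', cal.get(java.util.Calendar.DAY_OF_WEEK));
    return row;
  }
]]></script>

and the entity gets transformer="script:addTimeFields".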


Chantal



On Wed, 2012-02-15 at 15:49 +0100, Yonik Seeley wrote:
 On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote:
  I think it would if I indexed the time information separately.  Which
  was my original thought, but I was hoping to store this in one field
  instead of 2.  So my idea was I'd store the time portion as as a
  number (an int might suffice from 0 to 24 since I only need this to
  have that level of granularity) then do range queries over that.  I
  couldn't think of a way to do this using the date field though because
  it would give me bins broken up by hours in a particular day,
  something like
 
  2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
  2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
  2012-01-01-02:00:00 - 2012-01-01-03:00:00 5
 
  But what I really want is just the time portion across all days
 
  00:00:00 - 01:00:00 10
  01:00:00 - 02:00:00 20
  02:00:00 - 03:00:00 5
 
  I would then use the date field to limit the time range in which the
  facet was operating.  Does that make sense?  Is there a more efficient
  way of doing this?
 
 Hmm, no there's no way to do this.
 Even if you were to write a custom faceting component, it seems like
 it would still be very expensive to derive the hour of the day from ms
 for every doc.
 
 -Yonik
 lucidimagination.com
 
 
 
 
  On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
  yo...@lucidimagination.com wrote:
  On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com wrote:
  I would like to be able to facet based on the time of
  day items are purchased across a date span.  I was hoping that I could
  do a query of something like date:[NOW-1WEEK TO NOW] and then specify
  I wanted facet broken into hourly bins.  Is this possible?  Do I
 
  Will range faceting do everything you need?
  http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
 
  -Yonik
  lucidimagination.com



Re: Stemming and accents (HunspellStemFilterFactory)

2012-02-14 Thread Chantal Ackermann
Hi Bráulio,

I don't know about HunspellStemFilterFactory specifically, but concerning
accents:

There are several accent filters that will remove accents from your
tokens. If the Hunspell filter factory requires the accents, then simply
add the accent filters after Hunspell in your index and query filter
chains.

You would then have Hunspell produce the tokens as result of the
stemming and only afterwards the accents would be removed (your example:
'forum' instead of 'fórum'). Do the same on the query side in case
someone inputs accents.

Accent filters are:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
(lowercases, as well!)
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

and others on that page.
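
A rough sketch of such a chain (the dictionary file names are placeholders and
the attribute names are from memory, so double-check them):

<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="pt_PT.dic" affix="pt_PT.aff" ignoreCase="true"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

used for both index and query time, so the accents only matter for the
stemming step.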

Chantal


On Tue, 2012-02-14 at 14:48 +0100, Bráulio Bhavamitra wrote:
 Hello all,
 
 I'm evaluating the HunspellStemFilterFactory I found it works with a
 pt_PT dictionary.
 
 For example, if I search for 'fóruns' it stems it to 'fórum' and then find
 'fórum' references.
 
 But if I search for 'foruns' (without accent),
 then HunspellStemFilterFactory cannot stem
 word, as it does' not exist in its dictionary.
 
 It there any way to make HunspellStemFilterFactory work without accents
 differences?
 
 best,
 bráulio



Re: indexing with DIH (and with problems)

2012-02-10 Thread Chantal Ackermann


On Thu, 2012-02-09 at 23:45 +0100, alessio crisantemi wrote:
 hi all,
 I would index on solr my pdf files wich includeds on my directory c:\myfile\
 
 so, I add on my solr/conf directory the file data-config.xml like the
 following:
 
 
 <dataConfig>
 <dataSource type="BinFileDataSource" />
 <document>
 <entity name="f" dataSource="null" rootEntity="false"

Why do you set rootEntity=false on the root entity?
This looks odd to me - but I can be wrong, of course.

If DIH shows this:

<str name="Total Requests made to DataSource">0</str>


DIH hasn't even retrieved any data from your data source. Check that the
call you have configured really returns any documents.


Chantal




 processor="FileListEntityProcessor"
 baseDir="c:\myfile\" fileName="*.pdf"
 recursive="true">
 <entity name="tika-test" processor="TikaEntityProcessor"
 url="${f.fileAbsolutePath}" format="text">
 <field column="author" name="author" meta="true"/>
 <field column="title" name="title" meta="true"/>
 <field column="content_type" name="content_type" meta="true"/>
 </entity>
 </entity>
 </document>
 </dataConfig>
 
 before, I add this part into solr-config.xml:
 
 
 <requestHandler name="/dataimport"
 class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
   <str name="config">c:\solr\conf\data-config.xml</str>
 </lst>
 </requestHandler>
 
 
 but this is the result:
 
 
 <str name="command">delta-import</str>
 <str name="status">idle</str>
 <str name="importResponse" />
 <lst name="statusMessages">
   <str name="Time Elapsed">0:0:2.512</str>
   <str name="Total Requests made to DataSource">0</str>
   <str name="Total Rows Fetched">0</str>
   <str name="Total Documents Processed">0</str>
   <str name="Total Documents Skipped">0</str>
   <str name="Full Dump Started">2012-02-09 23:37:07</str>
   <str name="">Indexing failed. Rolled back all changes.</str>
   <str name="Rolledback">2012-02-09 23:37:07</str>
 </lst>
 <str name="WARNING">This response format is experimental. It is
 likely to change in the future.</str>
 </response>
 
 suggestions?
 thanks
 alessio



Re: can solr automatically search for different punctuation of a word

2012-02-01 Thread Chantal Ackermann
Hi Alex,

the dependency tag is used in the Maven project file (pom.xml). If you
are not using Maven to build your project then simply skip that part.

The important thing is that the ICU jar (lucene-icu) and the analysis
extra jar (solr-analysis-extra) are in your classpath.

See also Erick's answer in respond to your question. The folder for
additional jar files in solr is:

${SOLR_HOME}/lib/

Cheers,
Chantal

On Tue, 2012-01-31 at 04:38 +0100, alx...@aim.com wrote:
 Hi Chantal,
 
 In the readme file at  solr/contrib/analysis-extras/README.txt it says to add 
 the ICU library (in lib/)
 
 Do I need also add dependecy... and where?
 
 Thanks.
 Alex.
 
  
 
 
 
 -Original Message-
 From: Chantal Ackermann chantal.ackerm...@btelligent.de
 To: solr-user solr-user@lucene.apache.org
 Sent: Fri, Jan 13, 2012 1:52 am
 Subject: Re: can solr automatically search for different punctuation of a word
 
 
 Hi Alex,
 
 
 
 for me, ICUFoldingFilterFactory works very good. It does lowercasing and
 
 removes diacritica (this is how umlauts and accenting of letters is
 
 called - punctuation means comma, points etc.). It will work for any any
 
 language, not only German. And it will also handle apostrophs as in
 
 C'est bien.
 
 
 
 ICU requires additional libraries in the classpath. For an in-built solr
 
 solution have a look at ASCIIFoldingFilterFactory.
 
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
 
 
 
 
 
 
 
 Example configuration:
 
  <fieldType name="text_sort" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.ICUFoldingFilterFactory" />
    </analyzer>
  </fieldType>
  
  And dependencies (example for Maven) in addition to solr-core:
  
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-icu</artifactId>
    <version>${solr.version}</version>
    <scope>runtime</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-analysis-extras</artifactId>
    <version>${solr.version}</version>
    <scope>runtime</scope>
  </dependency>
 
 
 
 Cheers,
 
 Chantal
 
 
 
 On Fri, 2012-01-13 at 00:09 +0100, alx...@aim.com wrote:
 
  Hello,
 
  
 
  I would like to know if solr has a functionality to automatically search 
  for a 
 
 different punctuation of a word. 
 
  For example if I if a user searches for a word Uber, and stemmer is german 
 
 lang, then solr looks for both Uber and  Über,  like in synonyms.
 
  
 
  Is it possible to give a file with a list of possible substitutions of 
  letters 
 
 to solr and have it search for all possible punctuations?
 
  
 
  
 
  Thanks.
 
  Alex.
 
 
 
 
  



Re: Parameter for database host in DIH?

2012-01-23 Thread Chantal Ackermann
Hi wunder,

for us, it works with internal dots when specifying the properties in
$SOLR_HOME/[core]/conf/solrcore.properties:

like this:
db.url=xxx
db.user=yyy
db.passwd=zzz

$SOLR_HOME/[core]/conf/data-config.xml:

<dataSource type="JdbcDataSource"
driver="oracle.jdbc.driver.OracleDriver" url="${db.url}"
user="${db.user}" password="${db.passwd}" batchSize="1000" />



Cheers,
Chantal

On Sat, 2012-01-21 at 01:01 +0100, Walter Underwood wrote:
 Weird. I can make it work with a request parameter and 
 $dataimporter.request.dbhost:
 
 http://localhost:8983/solr/textbooks/dataimport?command=full-importdbhost=mydbhost
 
 Or I can make it work with a Java system property with no dots.
 
 But when I use a Java system property with internal dots, it doesn't work.
 
 wunder
 
 On Jan 20, 2012, at 3:53 PM, Walter Underwood wrote:
 
  On Jan 20, 2012, at 3:34 PM, Shawn Heisey wrote:
  
  On 1/20/2012 3:48 PM, Walter Underwood wrote:
  Is there a way to parameterize the JDBC URL in the data import handler?  
  I tried this, but it did not insert the value of the property. I'm 
  running Solr 3.3.0.
  
   dataSource driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://${com.chegg.dbhost}/product
  
  Here's what I've got in mine.  I pass in dbHost and dbSchema parameters 
  (along with a bunch of others that get used in the entity SQL statements) 
  when starting DIH.
  
  url=jdbc:mysql://${dataimporter.request.dbHost}:3306/${dataimporter.request.dbSchema}?zeroDateTimeBehavior=convertToNull
  
  
  Are those Java system properties? I didn't get a substitution when I ran: 
  java -Dcom.chegg.dbhost=mydbhost
  
  The resulting JDBC URL was jdbc:mysql:///product, so it replaced the 
  variable with empty string. Odd.
  
  wunder
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 



Re: Validating solr user query

2012-01-23 Thread Chantal Ackermann
Hi Dipti,

just to make sure: are you aware of

http://wiki.apache.org/solr/DisMaxQParserPlugin

This will handle the user input in a very conventional and user friendly
way. You just have to specify on which fields you want it to search.
With the 'mm' parameter you have a powerful option to specify how much
of the search query has to match (more flexible than defining a default
operator).
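
For example (the field names are invented), a request along these lines:

/select?defType=dismax&qf=title^2.0 description&mm=75%&q=some user input

is tolerant of special characters in the user input and still requires 75% of
the query terms to match.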

Cheers,
Chantal

On Fri, 2012-01-20 at 23:52 +0100, Dipti Srivastava wrote:
 Hi All,
 I ma using HTTP/JSON to search my documents in Solr. Now the client provides 
 the query on which the search is based.
 What is a good way to validate the query string provided by the user.
 
 On the other hand, if I want the user to build this query using some Solr api 
 instead of preparing a lucene query string which API can I use for this?
 I looked into
 SolrQuery in SolrJ but it does not appear to have a way to specify the more 
 complex queries with the boolean operators and operators such as ~,+,- etc.
 
 Basically, I am trying to avoid running into bad query strings built by the 
 caller.
 
 Thanks!
 Dipti
 
 
 This message is private and confidential. If you have received it in error, 
 please notify the sender and remove it from your system.
 



Re: can solr automatically search for different punctuation of a word

2012-01-13 Thread Chantal Ackermann
Hi Alex,

for me, ICUFoldingFilterFactory works very well. It does lowercasing and
removes diacritics (this is what umlauts and accents on letters are
called - punctuation means commas, periods etc.). It will work for any
language, not only German. And it will also handle apostrophes as in
"C'est bien".

ICU requires additional libraries in the classpath. For a built-in Solr
solution have a look at ASCIIFoldingFilterFactory.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory



Example configuration:
<fieldType name="text_sort" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.ICUFoldingFilterFactory" />
  </analyzer>
</fieldType>

And dependencies (example for Maven) in addition to solr-core:
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-icu</artifactId>
  <version>${solr.version}</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-analysis-extras</artifactId>
  <version>${solr.version}</version>
  <scope>runtime</scope>
</dependency>

Cheers,
Chantal

On Fri, 2012-01-13 at 00:09 +0100, alx...@aim.com wrote:
 Hello,
 
 I would like to know if solr has a functionality to automatically search for 
 a different punctuation of a word. 
 For example if I if a user searches for a word Uber, and stemmer is german 
 lang, then solr looks for both Uber and  Über,  like in synonyms.
 
 Is it possible to give a file with a list of possible substitutions of 
 letters to solr and have it search for all possible punctuations?
 
 
 Thanks.
 Alex.



Re: Solr, SQL Server's LIKE

2012-01-02 Thread Chantal Ackermann

Thanks, Erick! That sounds great. I really do have to upgrade.

Chantal


On Sun, 2012-01-01 at 16:42 +0100, Erick Erickson wrote:
 Chantal:
 
 bq: The problem with the wildcard searches is that the input is not
 analyzed.
 
 As of 3.6/4.0, this is no longer entirely true. Some analysis is
 performed for wildcard searches by default and you can
 specify most anything you want if you really need to see:
 https://issues.apache.org/jira/browse/SOLR-2438
 and
 http://wiki.apache.org/solr/MultitermQueryAnalysis
 
 Best
 Erick




RE: Solr, SQL Server's LIKE

2011-12-30 Thread Chantal Ackermann

The problem with the wildcard searches is that the input is not
analyzed. For English, this might not be such a problem (except if you
expect case insensitive search). But then again, you don't get that with
LIKE, either. Ngrams bring that and more.

What I think is often forgotten when comparing 'like' and Solr search
is:
Solr's analyzers allow not only for case insensitive search but also for
other analysis such as removing diacritics, and this is also applied when
sorting (you have to create a separate index in the DB, as well, if you
want that).

Say you have the following names:
'Van Hinden'
'van Hinden'
'Música'
'Musil'

like 'mu%' - no hits
like 'Mu%' - 1 hit
like 'van%' - 1 hit
like 'hin%' - no hits

with Solr whitespace or standard tokenizer, ngrams and a diacritics and
lowercase filter (no wildcard search):
'mu'/'Mu' - 2 hits sorted ignoring case and diacritics
'van' - 2 hits
'hin' - 2 hits


(This is written down from experience. I haven't checked those examples
explicitly.)
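
For completeness, a sketch of such a field type (the ngram parameters are
picked arbitrarily and not tested):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

The edge ngrams on the index side are what make the 'mu'/'van'/'hin' prefix
matches above work without wildcards.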

Cheers,
Chantal



On Fri, 2011-12-30 at 02:00 +0100, Chris Hostetter wrote:
 : Thanks. I know I'll be able to utilize some of Solr's free text 
 : searching capabilities in other search types in this project. The 
 : product manager wants this particular search to exactly mimic LIKE%.
   ...
 : Ex: If I search Albatross I want Albert to be excluded completely, 
 : rather than having a low score.
 
 please be specific about the types of queries you want. ie: we need more 
 then one example of the type of input you want to provide, the type of 
 matches you want to see for that input, and the type of matches you want 
 to get back.
 
 in your first message you said you need to match company titles pretty 
 exactly but then seem to contradict yourself by saying the SQL's LIKE 
 command fit's the bill -- even though the SQL LIKE command exists 
 specificly for in-exact matches on field values.
 
 Based on your one example above of Albatross, you don't need anything 
 special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
 just search for Albatross and it will match Albatross but not 
 Albert.  if you want Albatross to match Albatross Road use some 
 basic tokenization.
 
 If all you really care about is prefix searching (which seems suggested by 
 your LIKE% comment above, which i'm guessing is shorthand for something 
 similar to LIKE 'ABC%'), so that queries like abc and abcd both 
 match abcdef and abcd but neither of them match abcd 
 then just use prefix queries (ie: abcd*) -- they should be plenty 
 efficient for your purposes.  you only need to worry about ngrams when you 
 want to efficiently match in the middle of a string. (ie: TITLE LIKE 
 %ABC%)
 
 
 -Hoss



Re: Update schema.xml using solrj APIs

2011-12-22 Thread Chantal Ackermann

Hi Ahmed,

if you have a multi-core setup, you could change the file
programmatically (e.g. via an XML parser), copy the new file over the
existing one (programmatically, of course), and then reload the core.

I haven't reloaded the core programmatically, yet, but that should be
doable via SolrJ. Or - if you are not using Java, then call the specific
core admin URL in your programme.

You will have to re-index after changing the schema.xml.
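
A minimal sketch of the reload step with SolrJ (core name and URL are
placeholders; written from memory, so verify against your SolrJ version):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadCore {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests go against the Solr root URL, not a core URL
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");
        CoreAdminRequest.reloadCore("myCore", admin);
    }
}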

Chantal


On Thu, 2011-12-22 at 04:34 +0100, Otis Gospodnetic wrote:
 Ahmed,
 
 At this point in time - no.  You need to edit it manually and restart Solr to 
 see the changed.
 This will change in the future.
 
 Otis
 
 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html
 
 
 
 
  From: Ahmed Abdeen Hamed ahmed.elma...@gmail.com
 To: solr-user@lucene.apache.org 
 Sent: Wednesday, December 21, 2011 4:12 PM
 Subject: Update schema.xml using solrj APIs
  
 Hello friend,
 
 I am new to Solrj and I am wondering if there is a away you can update the
 schema.xml file via the APIs.
 
 I would appreciate any help.
 
 Thanks very much,
 -Ahmed
 
 
 



Re: Exception using SolrJ

2011-12-21 Thread Chantal Ackermann
Hi Shawn,

maybe the requests that fail have a certain pattern - for example that
they are longer than all the others.

Chantal



Re: full-data import suddenly stopped working. Total Rows Fetched remains 0

2011-12-20 Thread Chantal Ackermann
DIH does not simply fail. Without more information, it's hard to do more than
guess.
As you're using MS SQL Server, maybe you ran into this?

http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx

Would be a problem caused by certain java versions.

Have you turned the DEBUG level on for DIH and Solr in general?

Chantal



On Mon, 2011-12-19 at 18:55 +0100, PeterKerk wrote:
 Hi Chantal,
 
 I reduced my data-config.xml to a bare minimum:
 <dataConfig>
 <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
 url="jdbc:sqlserver://localhost:1433;databaseName=tt" user="sa"
 password="dfgjLJSFSD" />
 <document name="weddinglocations">
 <entity name="location" query="select * from locations WHERE
 isapproved='true'">
 <field name="id" column="ID" />
 <field name="title" column="TITLE" />
 </entity>
 </document>
 </dataConfig>
 
 I ran reload-config succesfully, but still the same behavior occurs.
 Oh and the query select * from locations WHERE isapproved='true' returns a
 lot of results when ran directly against my DB
 
 What else can it be?
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/full-data-import-suddenly-stopped-working-Total-Rows-Fetched-remains-0-tp3599004p3599087.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Exception using SolrJ

2011-12-20 Thread Chantal Ackermann
Hi Shawn,

the exception indicates that the connection was lost. I'm sure you
figured that out for yourself.

Questions:
- is that specific server instance really running? That is, can you
reach it via browser?
- If yes: how is your connection pool configured and how do you
initialize it? More specifically: from what I know, CommonsHttp is
already multi-threaded, so your initializing code should not be using
multiple threads to access it. Not completely sure about that in
combination with SolrJ, though. I just had that issue when using
CommonsHttp directly in the wrong way.

I am using SolrJ with CommonsHttp pool for a some time now, and it all
works very reliably. I've encountered those Connection reset exceptions
also but they were always caused by the server not being reachable.
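
For reference, this is roughly how I set it up - a sketch from memory, not a
definitive recipe (the URL and the connection limits are placeholders):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrClientFactory {
    public static SolrServer create(String url) throws Exception {
        // one shared, thread-safe connection manager; all threads reuse the same SolrServer
        MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(20);
        mgr.getParams().setMaxTotalConnections(100);
        return new CommonsHttpSolrServer(url, new HttpClient(mgr));
    }
}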


Chantal



From your pastebin:

Caused by: org.apache.solr.client.solrj.SolrServerException:
java.net.SocketException: Connection reset
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)



On Tue, 2011-12-20 at 01:11 +0100, Shawn Heisey wrote:
 On 12/16/2011 12:44 AM, Shawn Heisey wrote:
  I am seeing exceptions from some code I have written using SolrJ.I 
  have placed it into a pastebin:
 
 
  http://pastebin.com/XnB83Jay
 
 No reply in three days, does nobody have any ideas for me?
 
 Thanks,
 Shawn
 



Re: multiple temporary indexes

2011-12-20 Thread Chantal Ackermann

You could also create a single index and use a field user to filter
results for only a single user. This would also allow for statistics
over the complete base.
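
For example (field name and value invented), every request for one user would
simply carry a filter query:

q=<the user's search terms>&fq=user_id:12345

and for statistics over the complete base you just drop the fq parameter.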

Chantal



On Tue, 2011-12-20 at 12:43 +0100, graham wrote:
 Hi,
 
 I'm a complete newbie and currently at the stage of wondering whether
 Solr might be suitable for what I want.
 
 I need to take search results collected by another system in response to
 user requests and allow each user to view their set of results in
 different ways: sorting into different order, filtering by facets, etc.
 
 I am wondering whether it would be practical to do this by creating a
 Solr index for each result set on the fly. Two particular questions are:
 
 1. Is it even practical to do this in real time? Assuming that each set
 of results contains low hundreds of elements (each a bibliographic
 record), and that the users' patience is not unlimited.
 
 2. What would be the best way to manage a separate index for each query,
 given that the main constraint is time, and that the number of indexes
 needed simultaneously is not known in advance? Create a separate core
 for each query, or use a single index with a query id as one of the
 keys, or..?
 
 Thanks for any advice (or pointers to existing systems which work like
 this)
 
 Graham



Re: full-data import suddenly stopped working. Total Rows Fetched remains 0

2011-12-20 Thread Chantal Ackermann
Never would have thought that MS could help me earn such honours...
;D

On Tue, 2011-12-20 at 12:57 +0100, PeterKerk wrote:
 Chantal...you are the queen! :p
 That was it, I downgraded to 6.27 and now it works again...thank god!
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/full-data-import-suddenly-stopped-working-Total-Rows-Fetched-remains-0-tp3599004p3601013.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: full-data import suddenly stopped working. Total Rows Fetched remains 0

2011-12-19 Thread Chantal Ackermann
Hi Peter,

the most probable cause is that your database query returns no results.
Have you run the query that DIH is using directly on your database?

In the output you can see that DIH has fetched 0 rows from the DB. Maybe
your query contains a restriction that suddenly had this effect - like a
restriction on a modification time or similar.


Cheers,
Chantal


On Mon, 2011-12-19 at 18:21 +0100, PeterKerk wrote:
 <str name="Total Rows Fetched">0</str>
 <str name="Total Documents Processed">0</str>
 <str name="Total Documents Skipped">0</str>



Re: Solr Best Practice Configuration

2011-12-09 Thread Chantal Ackermann
Hi Ben,

what I understand from your post is:

Advertiser (1) - (*) Advert
(one-to-many where there can be 50,000 per single Advertiser)

Your index entity is based on Advert which means that there can be
50,000 documents in the index that need to be changed if a field of an
Advertiser is updated in the database.

I am using multi-core setups with differently structured indexes for
these needs. This means that some more complex lookups require queries
on several cores. This has not been a problem, so far. Our indexes,
however, hold rather little data (ranging from a few hundred thousand
entries to some millions, with rather a lot of fields containing short
texts) and are highly dynamic (rebuilt several times a day, full rebuilds,
no incremental updates).

Moving the Advertiser data out of the Advert index means:
(1) on updates of the Advertiser fields you don't need to change the
Advert index
(2) the Advert index might be a bit smaller (if that matters)
(3) the statistics on the Advertiser data will be in relation to the
Advertisers and not in relation to the Adverts, while the statistics
on the Adverts won't contain any Advertiser data anymore.

(This list might not be complete.)

What does (3) imply?
You will not be able to facet or sort or group on Adverts using any of
the Advertiser fields (as they reside in a different index core).


If you need faceting or similar, then consider first testing the
performance of a massive update or a rebuild of your index before starting
to change to multiple cores. Maybe the performance is better than you
fear and no change is required.


Cheers,
Chantal


On Fri, 2011-12-09 at 10:46 +0100, BenMccarthy wrote:
 Good Morning.
 
 I have now been through the various Solr tutorials and read the SOLR 3
 Enterprise server book.  Im not at the point of figuring out if Solr can
 help us with a scaling problem.  Im looking for advice on the following
 scenario any pointers or references will be great:
 
 I have two sets of distinct data:
 
 Advert
 Advertiser
 
 An Advertiser has many Adverts in the db looking like
 
 Advert {
 id
 field a
 field b
 advertiser_id
 }
 
 Advertiser {
 id
 field c
 field d
 lat
 long
 }
 
 So ive followed some docs and ive created a DIH which pulls all this into
 one SOLR index.  Which is great.  The problem im looking at is that we have
 a massive churn on Advertiser updates and with the one index i dont think it
 will scale (Correct me if im wrong).
 
 Would it be possible to have two seperate cores each with its own index and
 then when issuing queries the results are returned as they are in a single
 core setup.
 
 Im basically looking for some pointers telling me if im going in the right
 direction.  I dont want to have to update 5 adverts when a advertiser
 simply updated field c.  This is a problem we have with our current search.
 
 Thanks
 Ben
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Best-Practice-Configuration-tp3572492p3572492.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr - http error 404 when requesting solrconfig.xml or schema.xml

2011-11-29 Thread Chantal Ackermann
Hi Torsten,

some more information would help us to help you:
- does calling /apps/solrslave/admin/ return the Admin Homepage?
- what is the path to your SOLR_HOME
- where in the filesystem are solrconfig.xml and schema.xml (even if
this sounds redundant, maybe they are just misplaced)
- their read permissions (whether the server can access them)
- where the server is looking for them (the value of the JNDI SOLR_HOME,
the output of the logfile telling you which locations is actually being
used as SOLR_HOME, and whether this is where you want it to be)

Cheers,
Chantal

On Tue, 2011-11-29 at 10:50 +0100, Torsten Krah wrote:
 Hi,
 
 got some interesting problem and don't know how to debug further.
 I am using an external solr home configured via jndi.
 Deployed my war file (context is /apps/solrslave/) and if want to look
 at the schema:
 
 /apps/solrslave/admin/file/?contentType=text/xml;charset=utf-8&file=schema.xml
 
 the response is 404.
 
 It doesn't matter if i am using Jetty 7.x, 8.x or Tomcat 6.0.33, 404 is
 the answer.
 
 Anyone an idea where to look for?
 
 regards
 
 Torsten



Re: DIH Strange Problem

2011-11-23 Thread Chantal Ackermann
Hi Yavar,

my experience with similar problems was that there was something wrong
with the database connection or the database.

Chantal


On Wed, 2011-11-23 at 11:57 +0100, Husain, Yavar wrote:
 I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
 data. Indexing and all was working perfectly fine. However today when I 
 started full indexing again, Solr halts/stucks at the line Creating a 
 connection for entity. There are no further messages after that. I 
 can see that DIH is busy and on the DIH console I can see A command is still 
 running, I can also see total rows fetched = 0 and total request made to 
 datasource = 1 and time is increasing however it is not doing anything. This 
 is the exact configuration that worked for me. I am not really able to 
 understand the problem here. Also in the index directory where I am storing 
 the index there are just 3 files: 2 segment files + 1  lucene*-write.lock 
 file.
 ...
 data-config.xml:
 
 <dataSource type="JdbcDataSource"
 driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
 url="jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders"
 user="testUser" password="password"/>
 <document>
 .
 .
 
 Logs:
 
 INFO: Server startup in 2016 ms
 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
 doFullImport
 INFO: Starting Full Import
 Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
 QTime=11
 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter 
 readIndexerProperties
 INFO: Read dataimport.properties
 Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
 INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
 Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=1

 commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
 Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
 INFO: newest commit = 1322041133719
 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 
 call
 INFO: Creating a connection for entity SampleText with URL: 
 jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders



Re: XSLT caching mechanism

2011-11-14 Thread Chantal Ackermann
In solrconfig.xml, change the xsltCacheLifetimeSeconds property of the
XSLTResponseWriter to the desired value (this example 6000secs):

<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
  <int name="xsltCacheLifetimeSeconds">6000</int>
</queryResponseWriter>



On Mon, 2011-11-14 at 15:31 +0100, vrpar...@gmail.com wrote:
 Hello All,
 
 i am using xslt to transform solr xml response, when made search;getting
 below warning
 
 WARNING [org.apache.solr.util.xslt.TransformerProvider] The
 TransformerProvider's simplistic XSLT caching mechanism is not appropriate
 for high load scenarios, unless a single XSLT transform is used and
 xsltCacheLifetimeSeconds is set to a sufficiently high value.
 
 how can i apply effective xslt caching for solr ?
 
 
 
 Thanks,
 Vishal Parekh
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/XSLT-caching-mechanism-tp3506979p3506979.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Xsl for query output

2011-10-14 Thread Chantal Ackermann

Hi Jeremy,


The xsl files go into the subdirectory /xslt/ (you have to create that)
in the /conf/ directory of the core that should return the transformed
results.

So, if you have a core /myCore/ that you want to return transformed
results you need to put the example.xsl into:

$SOLR_HOME/myCore/conf/xslt/example.xsl

and in $SOLR_HOME/myCore/conf/solrconfig.xml you add (change the cache
value to whatever appropriate):

<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
   <int name="xsltCacheLifetimeSeconds">6000</int>
</queryResponseWriter>

Call this in a query:

http://mysolrserver/solr/myCore/select?q=id:<id>&wt=xslt&tr=example.xsl


Chantal


On Fri, 2011-10-14 at 07:22 +0200, Jeremy Cunningham wrote:
 Thanks for the response but I have seen this page and I had a few
 questions.  
 
 1.  Since I am using tomcat, I had to move the example directory into the
 tomcat directory structure.  In the multicore, there is no example.xsl.
 Where do I 
 need to put it? Also, how do I send docs for indexing when running solr
 under tomcat?  
 
 Thanks,
 Jeremy
 
 On 10/13/11 3:46 PM, Lance Norskog goks...@gmail.com wrote:
 
 http://wiki.apache.org/solr/XsltResponseWriter
 
 This is for the single-core example. It is easiest to just go to
 solr/example, run java -jar start.jar, and hit the URL in the above wiki
 page. Then poke around in solr/example/solr/conf/xslt. There is no
 solrconfig.xml change needed.
 
 It is generally easiest to use the solr/example 'java -jar start.jar'
 example to test out features. It is easy to break configuration linkages.
 
 Lance
 
 On Thu, Oct 13, 2011 at 12:42 PM, Jeremy Cunningham 
 jeremy.cunningham.h...@statefarm.com wrote:
 
  I am new to solr and not a web developer.  I am a data warehouse guy
 trying
  to use solr for the first time.  I am familiar with xsl but I can't
 figure
  out how to get the example.xsl to be applied to my xml results.  I am
  running tomcat and have solr working.  I copied over the solr mulitiple
 core
  example to the conf directory on my tomcat server. I also added the war
 file
  and the search is fine.  I can't seem to figure out what I need to add
 to
  the solrcofig.xml or where ever so that the example.xsl is used.
 Basically
  can someone tell me where to put the xsl and where to configure its
 usage?
 
  Thanks
 
 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com
 



Re: Interesting DIH challenge

2011-10-10 Thread Chantal Ackermann
Hi there,

I have been using cores to build up new cores (for various
reasons). (I am not using SOLR as data storage; the cores are re-indexed
frequently.)

This solution works for releases 1.4 and 3 as it does not use the
SolrEntityProcessor.

To load data from another SOLR core and populate part of the new
document I use:

(1) in the target data-config.xml:
<entity name="content" dataSource="sourceCore"
    url="solr/gmaContent/select?q=contentid:${targetDoc.ID}&amp;wt=xslt&amp;tr=response-to-update.xsl"
    processor="my.custom.handler.dataimport.CachingXPathEntityProcessor"
    cacheKey="${targetDoc.ID}" useSolrAddSchema="true">
</entity>

(2) sourceCore's solrconfig.xml needs an entry (uncomment) for the xslt
response writer:

  <!-- XSLT response writer transforms the XML output by any xslt file found
       in Solr's conf/xslt directory.  Changes to xslt files are checked for
       every xsltCacheLifetimeSeconds.
  -->
  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">6000</int>
  </queryResponseWriter>


(3) response-to-update.xsl (this goes into
$SOLR_HOME/sourceCore/conf/xslt/):


<?xml version='1.0' encoding='UTF-8'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:output method="xml" media-type="text/xml;charset=utf-8"
    indent="yes" encoding="UTF-8" omit-xml-declaration="no" />

<xsl:template match='/'>
  <add>
    <xsl:apply-templates select="/response/result/doc" />
  </add>
</xsl:template>

<xsl:template match="doc">
  <doc>
    <xsl:choose>
      <xsl:when test="doc/*[name()='arr']">
        <xsl:apply-templates select="//arr" />
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="child::node()" />
      </xsl:otherwise>
    </xsl:choose>
  </doc>
</xsl:template>

<xsl:template match="//arr">
  <xsl:for-each select="child::node()">
    <xsl:element name="field">
      <xsl:attribute name="name"><xsl:value-of
          select="../@name"/></xsl:attribute>
      <xsl:value-of select="." />
    </xsl:element>
  </xsl:for-each>
</xsl:template>

<xsl:template match="child::node()">
  <xsl:element name="field">
    <xsl:attribute name="name"><xsl:value-of
        select="@name"/></xsl:attribute>
    <xsl:value-of select="." />
  </xsl:element>
</xsl:template>

</xsl:stylesheet>

Cheers,
Chantal




On Mon, 2011-10-10 at 06:26 +0200, Gora Mohanty wrote:
 On Mon, Oct 10, 2011 at 6:30 AM, Pulkit Singhal pulkitsing...@gmail.com 
 wrote:
  @Gora Thank You!
 
  I know that Solr accepts xml with Solr specific elements that are commands
  that only it understands ... such as add/, commit/ etc.
 
  Question: Is there some way to ask Solr to dump out whatever it has in its
  index already ... as a Solr xml document?
 
 As far as I know, there is no way to do that out of the box. One would get
 the contents of each record with a normal Solr query, massage that into
 a Solr XML document, and use that to rebuild the index. Have not tried
 this, but it should be possible to get the desired output format with the
 XsltResponseWriter: http://wiki.apache.org/solr/XsltResponseWriter .
 
 All in all, it seems easier to me to just reindex from the base source, unless
 that is not possible for some reason.
 
  Plan: I intend to message that xml dump (add the field + value that I need
  in every doc's xml element) and then I should be able to push this dump back
  to Solr to get data indexed again, I hope.
 
 Yes, that should be the general idea.
 
 Regards,
 Gora



Property undefined in Schema Browser (Solr Admin)

2011-08-24 Thread Chantal Ackermann
Hi all,

the Schema Browser in the SOLR Admin shows me the following information:



Field: title

Field Type: string

Properties: Indexed, Stored, Multivalued, Omit Norms, undefined, Sort
Missing Last

Schema: Indexed, Stored, Multivalued, Omit Norms, undefined, Sort
Missing Last

Index: Indexed, Stored, Omit Norms


I was wondering where this "undefined" property comes from. I had a look
at:
http://wiki.apache.org/solr/LukeRequestHandler
and the schema.jsp
but to no avail so far.

Could someone give me a hint? I'm just wondering whether I am missing
some problem with my field declaration which is:

<field name="title" type="string" indexed="true" stored="true"
required="true" multiValued="true"/>

Thanks a lot!
Chantal



Re: Property undefined in Schema Browser (Solr Admin)

2011-08-24 Thread Chantal Ackermann
Hi Stefan,

thanks for your time!

There is a capital F which is not listed as key? But this is also the
case in your example so probably I'm confusing something.

Anyway, the respective output of: /admin/luke?fl=title
is:

<lst name="title">
  <str name="type">string</str>
  <str name="schema">I-SM---OF---l</str>
  <str name="index">I-SO</str>
  <int name="docs">16697</int>
  <int name="distinct">8476</int>
  <lst name="topTerms">
  ...
  </lst>
  <lst name="histogram">
  ...
  </lst>
</lst>
</lst>
<lst name="info">
<lst name="key">
  <str name="I">Indexed</str>
  <str name="T">Tokenized</str>
  <str name="S">Stored</str>
  <str name="M">Multivalued</str>
  <str name="V">TermVector Stored</str>
  <str name="o">Store Offset With TermVector</str>
  <str name="p">Store Position With TermVector</str>
  <str name="O">Omit Norms</str>
  <str name="L">Lazy</str>
  <str name="B">Binary</str>
  <str name="f">Sort Missing First</str>
  <str name="l">Sort Missing Last</str>
</lst>


Cheers,
Chantal


On Wed, 2011-08-24 at 11:44 +0200, Stefan Matheis wrote:
 Hi Chantal,
 
 how does your luke-output look like?
 
 What the Schema-Browser does is, it takes the schema-  index-element:
  str name=schemaI-SOF---l/str
  str name=indexI-SO/str
 
 and does a lookup for every mentioned character in the key-hash:
  lst name=key
  str name=IIndexed/str
  str name=TTokenized/str
  str name=SStored/str
  str name=MMultivalued/str
  str name=VTermVector Stored/str
  str name=oStore Offset With TermVector/str
  str name=pStore Position With TermVector/str
  str name=OOmit Norms/str
  str name=LLazy/str
  str name=BBinary/str
  str name=fSort Missing First/str
  str name=lSort Missing Last/str
  /lst
 
 so i guess there is something in your output, that could not be mapped
 :/ i just checked this with the example schema .. so there may be
 cases which are not correct.
 
 Regards
 Stefan
 
 On Wed, Aug 24, 2011 at 10:48 AM, Chantal Ackermann
 chantal.ackerm...@btelligent.de wrote:
  Hi all,
 
  the Schema Browser in the SOLR Admin shows me the following information:
 
 
  
  Field: title
 
  Field Type: string
 
  Properties: Indexed, Stored, Multivalued, Omit Norms, undefined, Sort
  Missing Last
 
  Schema: Indexed, Stored, Multivalued, Omit Norms, undefined, Sort
  Missing Last
 
  Index: Indexed, Stored, Omit Norms
  
 
  I was wandering where this undefined property comes from. I had a look
  at:
  http://wiki.apache.org/solr/LukeRequestHandler
  and the schema.jsp
  but to no avail so far.
 
  Could someone give me a hint? I'm just wondering whether I am missing
  some problem with my field declaration which is:
 
  field name=title type=string indexed=true stored=true
  required=true multiValued=true/
 
  Thanks a lot!
  Chantal
 
 



Re: Property undefined in Schema Browser (Solr Admin)

2011-08-24 Thread Chantal Ackermann
Hi Stefan,

I'm using Firefox 3.6.20 and Chromium 12.0.742.112 (90304) Ubuntu 10.10.

The undefined appears with both of them.


Chantal



On Wed, 2011-08-24 at 14:09 +0200, Stefan Matheis wrote:
 Hi Chantal,
 
 On Wed, Aug 24, 2011 at 1:43 PM, Chantal Ackermann
 chantal.ackerm...@btelligent.de wrote:
  There is a capital F which is not listed as key? But this is also the
  case in your example so probably I'm confusing something.
 
 There's a quick hack in place, which tries: the character, the
 lowercase character  the uppercase character - so there should be a
 least one correlation.
 
 But i'll add an additional check to the code, that 'undefined'-values
 will be skip for the list.
 
 Just to check that, which Browser are you using? The UI was developed
 using Firefox4  Chrome12+ and is not fully tested on others browsers
 :/
 
 Regards
 Stefan



Re: How to copy and extract information from a multi-line text before the tokenizer

2011-08-23 Thread Chantal Ackermann

Hi Michael,

have you considered the DataImportHandler?
You could use the the LineEntityProcessor to create fields per line and
then copyField to collect everything for the AllData field.

http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor

Chantal
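
A rough, untested sketch of that idea with DIH (paths and field names are assumptions; AllData would have to be multivalued, or be filled via copyField as described):

<dataConfig>
  <dataSource type="FileDataSource" name="fds"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/path/to/texts" fileName=".*\.txt" rootEntity="true">
      <entity name="line" processor="LineEntityProcessor" dataSource="fds"
              url="${f.fileAbsolutePath}" transformer="RegexTransformer">
        <!-- every raw line is collected into AllData -->
        <field column="rawLine" name="AllData"/>
        <!-- only lines starting with "Author: " produce a value here -->
        <field column="OnlyAuthor" sourceColName="rawLine" regex="^Author: (.*)$"/>
      </entity>
    </entity>
  </document>
</dataConfig>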



On Tue, 2011-08-23 at 12:28 +0200, Michael Kliewe wrote:
 Hello all,
 
 I have a custom schema which has a few fields, and I would like to create a 
 new field in the schema that only has one special line of another field 
 indexed. Lets use this example:
 
 field AllData (TextField) has for example this data:
 Title: exampleTitle of the book
 Author: Example Author
 Date: 01.01.1980
 
 Each line is separated by a line break.
 I now need a new field named OnlyAuthor which only has the Author information 
 in it, so I can search and facet for specific Author information. I added 
 this to my schema:
 
 fieldType name=authorField class=solr.TextField
   analyzer type=index
 charFilter class=solr.PatternReplaceCharFilterFactory 
 pattern=^.*\nAuthor: (.*?)\n.*$ replacement=$1 replace=all /
 tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.TrimFilterFactory/
   /analyzer
   analyzer type=query
 charFilter class=solr.PatternReplaceCharFilterFactory 
 pattern=^.*\nAuthor: (.*?)\n.*$ replacement=$1 replace=all /
 tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.TrimFilterFactory/
   /analyzer
 /fieldType
 
 field name=OnlyAuthor type=authorField indexed=true stored=true /
 
 copyField source=AllData dest=OnlyAuthor/
 
 
 But this is not working, the new AuthorOnly field contains all data, because 
 the regex didn't match. But I need Example Author in that field (I think) 
 to be able to search and facet only author information.
 
 I don't know where the problem is, perhaps someone of you can give me a hint, 
 or a totally different method to achieve my goal to extract a single line 
 from this multi-line-text.
 
 Kind regards and thanks for any help
 Michael
 
 



Re: Store complete XML record (DIH XPathEntityProcessor)

2011-08-01 Thread Chantal Ackermann
Hi g,

ok, I understand your problem, now. (Sorry for answering that late.)

I don't think PlainTextEntityProcessor can help you. It does not take a
regex. LineEntityProcessor does but your record elements probably do not
come on their own line each and you wouldn't want to depend on that,
anyway.

I guess you would be best off writing your own entity processor - maybe
by extending XPath EP if that gives you some advantage. You can of
course also implement your own importer using SolrJ and your favourite
XML parser framework - or any other programming language.

If you are looking for a config-only solution - i'm not sure that there
is one. Someone else might be able to comment on that?

Cheers,
Chantal


On Thu, 2011-07-28 at 19:17 +0200, solruser@9913 wrote:
 Thanks Chantal
 I am ok with the second call and I already tried using that.  Unfortunatly
 It reads the whole file into a field.  My file is as below example
 xml  
   record 
   ... 
   /record
   
   record 
   ... 
   /record
  
record 
   ... 
   /record
 
 /xml
 
 Now the XPATH does the 'for each /record' part.  For each record I also need
 to store the raw log in there.  If I use the  PlainTextEntityProcessor then
 it gives me the whole file (from xml .. /xml ) and not each of the
 record /record
 
 Am I using the PlainTextEntityProcessor wrong?
 
 THanks
 g
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3207203.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Store complete XML record (DIH XPathEntityProcessor)

2011-07-28 Thread Chantal Ackermann

Hi g,

have a look at the PlainTextEntityProcessor:
http://wiki.apache.org/solr/DataImportHandler#PlainTextEntityProcessor

you will have to call the URL twice that way, but I don't think you can
get the complete document (the root element with all structure) via
xpath - so the XPathEntityProcessor cannot help you.

If calling the URL twice slows your indexer down in unacceptable ways
you can always subclass XPathEntityProcessor (knowing Java is helpful,
though...). There surely is a way to make it return what you need. Or
maybe an entity processor that caches the content and uses XPath EP and
PlainText EP to accomplish your needs (not sure whether the API allows
for that).



Cheers,
Chantal
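
Roughly what "calling the URL twice" could look like, reusing the names from the original post below (untested sketch; note that plainText really is the whole file, which is the limitation discussed in this thread, and the file is re-read once per record):

<entity name="x" rootEntity="true" dataSource="logfilereader"
        processor="XPathEntityProcessor"
        url="${logfile.fileAbsolutePath}" forEach="/xml/myrecord">
  <field column="mycol1" xpath="/xml/myrecord/@something"/>
  <entity name="raw" dataSource="logfilereader"
          processor="PlainTextEntityProcessor"
          url="${logfile.fileAbsolutePath}">
    <!-- the implicit plainText column holds the raw file content -->
    <field column="plainText" name="fullxmlrecord"/>
  </entity>
</entity>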



On Thu, 2011-07-28 at 05:53 +0200, solruser@9913 wrote:
 I am trying to use DIH to import an XML based file with multiple XML records
 in it.  Each record corresponds to one document in Lucene.  I am using the
 DIH FileListEntityProcessor (to get file list) followed by the
 XPathEntityProcessor to create the entities.  
 
 It works perfectly and I am able to map XML elements to fields . however
 I also need to store the entire XML record as separate 'full text' field. 
 Is there any way the XPathEntityProcessor provides a variable like 'rawLine'
 or 'plainText' that I can map to a field.  
 
 I tried to use the Plain Text processor after this  - but that does not
 recognize the XML boundaries and just gives the whole XML file.
 
 
entity name=x rootEntity=truedataSource=logfilereader
processor=XPathEntityProcessor
url=${logfile.fileAbsolutePath}  stream=false
 forEach=/xml/myrecord
transformer=   
  field column=mycol1   
 xpath=/xml/myrecord/@something
 /
  
 and so on ...
 This works perfectly.  However I also need something like ...
 
   field column=fullxmlrecord name=plainText  /
 
 Any help is much appreciated. I am a newbie and may be missing something
 obvious here
 
 -g
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3205524.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: [POLL] How do you (like to) do logging with Solr

2011-05-16 Thread Chantal Ackermann

 Please tick one of the options below with an [X]:
 
 [ ]  I always use the JDK logging as bundled in solr.war, that's perfect
 [X]  I sometimes use log4j or another framework and am happy with 
 re-packaging solr.war

actually : not so happy because our operations team has to repackage it.
But there is no option for
 [X] add the logger configuration to the server's classpath, no
repackaging!

 [ ]  Give me solr.war WITHOUT an slf4j logger binding, so I can choose at 
 deploy time
 [ ]  Let me choose whether to bundle a binding or not at build time, using an 
 ANT option
 [ ]  What's wrong with the solr/example Jetty? I never run Solr elsewhere!
 [ ]  What? Solr can do logging? How cool!




Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository

2011-03-16 Thread Chantal Ackermann
Hi all,

does anyone have a successful setup (=pom.xml) that specifies the
Hudson snapshot repository :

https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts
(or that for trunk)

and entries for any solr snapshot artifacts which are then found by
Maven in this repository?

I have specified the repository in my pom.xml as :
<repositories>
  <repository>
    <id>solr-snapshot-3.x</id>
    <url>https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

And the dependencies:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>3.2-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-dataimporthandler</artifactId>
  <version>3.2-SNAPSHOT</version>
</dependency>


Maven's output is (for solr-core):

Downloading:
http://192.168.2.40:8081/nexus/content/groups/public/org/apache/solr/solr-core/3.2-SNAPSHOT/solr-core-3.2-SNAPSHOT.jar
[INFO] Unable to find resource
'org.apache.solr:solr-core:jar:3.2-SNAPSHOT' in repository
solr-snapshot-3.x
(https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts)


I'm also experimenting with specifying the exact name of the jar, but with
no success so far; it also seems wrong as the name will be constantly
changing.
Also, searching hasn't returned anything helpful, so far.

I'd really appreciate if someone could point me into the right
direction!
Thanks!
Chantal




DIH : modify document in sibling entity of root entity

2011-03-10 Thread Chantal Ackermann
Dear all,

in DIH, is it possible to have two sibling entities where:

- the first one is the root entity that creates the documents by
iterating over a table that has one row per document.
- the second one is executed after the completion of the first entity
iteration, and it provides more data that is added to the newly created
documents.


I've set up such a dih configuration, and the second entity is executed,
but no data is written into the index apart from the data extracted by
the root entity  (=no document is modified?).

Documents are identified by the unique key 'id' which is defined by
pk=id on both entities.

Is this supposed to work at all? I haven't found anything so far on the
net but I could have used the wrong keywords for searching, of course.

As answer to the maybe obvious question why I'm not using a subentity:
I thought that this solution might be faster because it iterates over
the second data source instead of hitting it with a query per each
document.

Anyway, the main reason I tried this is because I want to know whether
it works. I'm still not sure whether it should work but I'm doing
something wrong...


Thanks!
Chantal



Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Chantal Ackermann
Hi Stefan,

thanks for your time!

No, the second entity is not reusing values from the previous one. It
just provides more fields to it, and, of course the unique identifier -
which in case of the second entity is not unique:

<document name="contributor">
  <entity name="contributor" pk="id" rootEntity="true"
          query="select  CONTRIBUTOR_ID as id,
                         CONTRIBUTOR_NAME as name,
                         EXT_ID as extid
                 from    DIM_CONTRIBUTOR">
  </entity>
  <entity name="appearance" pk="id" rootEntity="false"
          transformer="RegexTransformer"
          query="select  CONTENTID as contentid,
                         SUBVALUE
                 from    CONTENT_VALUE
                 where   ID_ATTRIBUTE=170">
    <field column="ignore" sourceColName="SUBVALUE"
           groupNames="id,type,pos,character"
           regex="(\d+);(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
  </entity>
</document>


and here are the fields:

<field name="id" type="slong" indexed="true" stored="true"
       required="true" />
<field name="name" type="string" indexed="true" stored="true"
       required="true" termVectors="true" />
<field name="contentid" type="slong" indexed="true" stored="true"
       multiValued="true" />
<field name="character" type="string" indexed="true" stored="true"
       multiValued="true" termVectors="true" />
<field name="type" type="sint" indexed="true" stored="true"
       multiValued="true" />

(For the sake of simplicity I've removed some fields that would be
created using copyfield instructions and transformers.)

I'm currently trying to run this using a subentity using the SQL
restriction SUBVALUE like '${contributor.id};%' but this takes ages...

The other one finished in under a minute (and it did actually process
the second entity, I think, it just didn't modify the index). The
current one runs for about 30min, and has only processed 22,000
documents out of more than 390,000. (Of course, there is probably no
index on that column)


Thanks for any suggestions!
Chantal




On Thu, 2011-03-10 at 17:13 +0100, Stefan Matheis wrote:
 Hi Chantal,
 
 i'm not sure if i understood you correctly (if at all)? Two entities,
 not arranged as sub-entitiy, but using values from the previous
 entity? Could you paste your dataimport  the relevant part of the
 logging-output?
 
 Regards
 Stefan
 
 On Thu, Mar 10, 2011 at 4:12 PM, Chantal Ackermann
 chantal.ackerm...@btelligent.de wrote:
  Dear all,
 
  in DIH, is it possible to have two sibling entities where:
 
  - the first one is the root entity that creates the documents by
  iterating over a table that has one row per document.
  - the second one is executed after the completion of the first entity
  iteration, and it provides more data that is added to the newly created
  documents.
 
 
  I've set up such a dih configuration, and the second entity is executed,
  but no data is written into the index apart from the data extracted by
  the root entity  (=no document is modified?).
 
  Documents are identified by the unique key 'id' which is defined by
  pk=id on both entities.
 
  Is this supposed to work at all? I haven't found anything so far on the
  net but I could have used the wrong keywords for searching, of course.
 
  As answer to the maybe obvious question why I'm not using a subentity:
  I thought that this solution might be faster because it iterates over
  the second data source instead of hitting it with a query per each
  document.
 
  Anyway, the main reason I tried this is because I want to know whether
  it works. I'm still not sure whether it should work but I'm doing
  something wrong...
 
 
  Thanks!
  Chantal
 
 



Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Chantal Ackermann
Hi Gora,

thanks for making me read this part of the documentation again!
This processor probably cannot do what I need out of the box but I will
try to extend it to allow specifying a regular expression in its where
attribute.

Thanks!
Chantal
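
For reference, the stock pattern Gora points to looks roughly like this (hypothetical table and column names): the sub-entity's query is executed only once, and rows are joined in memory on the cache key instead of issuing one query per parent row.

<entity name="details" processor="CachedSqlEntityProcessor"
        query="select ITEM_ID, DESCRIPTION from ITEM_DETAILS"
        cacheKey="ITEM_ID" cacheLookup="item.id">
  <field column="DESCRIPTION" name="description"/>
</entity>

It joins on an exact key, which is exactly what does not fit the regex-style match on SUBVALUE above - hence the plan to extend it.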

On Thu, 2011-03-10 at 17:39 +0100, Gora Mohanty wrote:
 On Thu, Mar 10, 2011 at 8:42 PM, Chantal Ackermann
 chantal.ackerm...@btelligent.de wrote:
 [...]
  Is this supposed to work at all? I haven't found anything so far on the
  net but I could have used the wrong keywords for searching, of course.
 
  As answer to the maybe obvious question why I'm not using a subentity:
  I thought that this solution might be faster because it iterates over
  the second data source instead of hitting it with a query per each
  document.
 [...]
 
 I think that what you are after can be handled by Solr's
 CachedSqlEntityProcessor:
 http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
 
 Two major caveats here:
 * I am not 100% sure that I have understood your requirements.
 * The documentation for CachedSqlEntityProcessor needs to be improved.
   Will see if I can test it, and come up with a better example. As I have
   not actually used this, it could be that I have misunderstood its purpose.
 
 Regards,
 Gora



Re: solrj http client 4

2010-12-08 Thread Chantal Ackermann
SOLR-2020 addresses upgrading to HttpComponents (from HttpClient). I
have had no time to work more on it, yet, though. I also don't have that
much experience with the new version, so any help is much appreciated.

Cheers,
Chantal

On Tue, 2010-12-07 at 18:35 +0100, Yonik Seeley wrote:
 On Tue, Dec 7, 2010 at 12:32 PM, Stevo Slavić ssla...@gmail.com wrote:
  Hello solr users and developers,
 
  Are there any plans to upgraded http client dependency in solrj from 3.x to
  4.x?
 
 I'd certainly be for moving to 4.x (and I think everyone else would too).
 The issue is that it's not a drop-in replacement, so someone needs to
 do the work.
 
 -Yonik
 http://www.lucidimagination.com
 
  Found this https://issues.apache.org/jira/browse/SOLR-861 ticket -
  judging by comments in it upgrade might help fix the issue. I have a project
  in jar hell, getting different versions of http client as transitive
  dependency...
 
  Regards,
  Stevo.





Re: XML to solr

2010-11-15 Thread Chantal Ackermann
Hi Jörg,

you could use the DataImportHandler's XPathEntityProcessor. There you
can specify for each Solr field the XPath at which its value is stored
in the original file (your first example snippet).

The value of the field FILE_ITEMS_DATEINAME, for example, would have the
XPath //field[@name='DATEINAME'].
(http://zvon.org/xxl/XPathTutorial/General_ger/examples.html has a very
simple and good reference for xpath patterns.)

Have a look at the DataImportHandler wiki page on how to call the
XPathEntityProcessor.

Cheers,
Chantal
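
A rough sketch against the original file shown below (untested; the data source and file path are made up, and XPathEntityProcessor only supports a subset of XPath, so the expressions may need adjusting):

<entity name="doc" processor="XPathEntityProcessor"
        dataSource="fileReader" url="/path/to/original.xml"
        forEach="/add/doc">
  <field column="FILE_ITEMS_MD5SUM"
         xpath="/add/doc/SECTION/field[@name='MD5SUM']"/>
  <field column="FILE_ITEMS_DATEINAME"
         xpath="/add/doc/SECTION/field[@name='DATEINAME']"/>
  <field column="ERP_ERP_FILE_CONTENT_VORGANGSART"
         xpath="/add/doc/SECTION/SECTION/field[@name='VORGANGSART']"/>
</entity>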

On Mon, 2010-11-15 at 09:22 +0100, Jörg Agatz wrote:
 hi Users.
 
 I have a Question,
 
 i have a lot of XML to indexing, at the Moment i have two XML files, one
 original, and one for solr a (Search_xml)
 
 for example:
 
 add
 doc
 SECTION type=FILE_ITEMS
 field name=MD5SUM6483030ed18d8b7a58a701c8bb638d20/field
 field name=DATEINAME0012_20101105111938206.pdf/field
 field name=FILE_TYPEPDM/field
 /SECTION
 SECTION type=ERP
 SECTION type=ERP_FILE_ITEMS
 field name=IDxx/field
 /SECTION
 SECTION type=ERP_FILE_CONTENT
 field name=VORGANGSARTEK-Anfrage/field
 /SECTION
 /SECTION
 /doc
 /add
 
 
 
 Search_xml :
 
 
 
 
 add
 doc
 field
 name=FILE_ITEMS_MD5SUM6483030ed18d8b7a58a701c8bb638d20/field
 field
 name=FILE_ITEMS_DATEINAME0012_20101105111938206.pdf/field
 field name=FILE_ITEMS_FILE_TYPEPDM/field
 field name=ERP_ERP_FILE_ITEMS_IDxx/field
 field name=ERP_ERP_FILE_CONTENT_VORGANSARTEK-Anfrage/field
 /doc
 /add
 
 My Question is now, (how) can i indexing the Original XML? without move the
 XML to a special search XML?




Output Search Result in ADD-XML-Format

2010-11-10 Thread Chantal Ackermann
Dear all,

my use case is:

Creating an index using DIH where the sub-entity is querying another
SOLR index for more fields.
As there is a very convenient attribute useSolrAddSchema that would
spare me to list all the fields I want to add from the other index, I'm
looking for a way to get the search results in the ADD format directly.

Before starting on the XSLT file that would transform the regular SOLR
result into an SOLR update xml, I just wanted to ask whether there
already exists a solution for this. Maybe I missed some request handler
that already returns the result in update format?

Thanks!
Chantal



RE: Output Search Result in ADD-XML-Format

2010-11-10 Thread Chantal Ackermann
Thank you, James. I was looking for something like that (and I remember
having stumbled over it, in the past, now that you mention it).

I've created an xslt file that transforms the regular result to an
update xml document. Seeing that the SolrEntityProcessor is still in
development, I will stick to the XSLT solution while we are still using
1.4 but I will add a note that with the new release we should try this
SolrEntityProcessor.

(Reading through the JIRA issue I'm not sure whether I can simply get
all fields from the other index and dump them into the index which is
being built. With the XSLT + useSolrAddSchema solution this works just
fine without the need to list all the fields. I should try that before
the next solr release to be able to give some feedback.)

Thanks!
Chantal
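
For reference, the SolrEntityProcessor from SOLR-1499 is typically configured along these lines once it is available (host, core name and the parent entity's id field are placeholders; whether all fields can simply be copied over still needs to be verified, as noted above):

<entity name="other" processor="SolrEntityProcessor"
        url="http://localhost:8983/solr/other-core"
        query="id:${main.id}"/>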


On Wed, 2010-11-10 at 15:13 +0100, Dyer, James wrote:
 I'm not sure, but SOLR-1499 might have what you want.
 
 https://issues.apache.org/jira/browse/SOLR-1499
 
 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311
 
 
 -Original Message-
 From: Chantal Ackermann [mailto:chantal.ackerm...@btelligent.de] 
 Sent: Wednesday, November 10, 2010 5:59 AM
 To: solr-user@lucene.apache.org
 Subject: Output Search Result in ADD-XML-Format
 
 Dear all,
 
 my use case is:
 
 Creating an index using DIH where the sub-entity is querying another
 SOLR index for more fields.
 As there is a very convenient attribute useSolrAddSchema that would
 spare me to list all the fields I want to add from the other index, I'm
 looking for a way to get the search results in the ADD format directly.
 
 Before starting on the XSLT file that would transform the regular SOLR
 result into an SOLR update xml, I just wanted to ask whether there
 already exists a solution for this. Maybe I missed some request handler
 that already returns the result in update format?
 
 Thanks!
 Chantal
 





Re: Missing facet values for zero counts

2010-09-29 Thread Chantal Ackermann
Hi Allistair,


On Wed, 2010-09-29 at 15:37 +0200, Allistair Crossley wrote:
 Hello list,
 
 I am implementing a directory using Solr. The user is able to search with a 
 free-text query or 2 filters (provided as pick-lists) for country. A 
 directory entry only has one country.
 
 I am using Solr facets for country and I use the facet counts generated 
 initially by a *:* search to generate my pick-list.
 
 This is working fairly well but there are a couple of issues I am facing.
 
 Specifically the countries pick-list does not contain ALL possible countries. 
 It only contains those that have been indexed against a document. 
 
 I have looked at facet.missing but I cannot see how this will work - if no 
 documents have a country of Sweden, then how would Solr know to generate a 
 missing total of zero for Sweden - it's never heard of it.
 
 I feel I am missing something - is there a way by which you tell Solr all 
 possible countries rather than relying on counts generated from the index? 
 

I don't think you are missing anything. Instead, you've described it
very well: how should SOLR know of something that never made it into the
index?

Why not just state in the interface that for all missing countries (and
deduce that from the facets and the list retrieved from the database),
there are no hits. You can list those countries separately (or even add
them to the facets after processing solr's result).

If you do want to have them in the index, you'd have to add them by
adding empty documents. But you might get into trouble with required
fields etc. And you will change the statistics of the fields.


Chantal





Re: Autocomplete: match words anywhere in the token

2010-09-23 Thread Chantal Ackermann
On Wed, 2010-09-22 at 20:14 +0200, Arunkumar Ayyavu wrote:
 Thanks for the responses. Now, I included the EdgeNGramFilter. But, I get
 the following results when I search for canon pixma.
 Canon PIXMA MP500 All-In-One Photo Printer
 Canon PowerShot SD500
 
 As you can guess, I'm not expecting the 2nd result entry. Though I
 understand why I'm getting the 2nd entry, I don't know how to ask Solr to
 exlcude it (I could fitler it in my application though). :-( Looks like I
 should study more of Solr's capabilites to get the solution.
 

This has not so much to do with autosuggest, anymore?
You put those quotes in to denote the search input, not to say that the
search input was a phrase, I suppose. Searching for the phrase (quoted),
only the first line should have been found.

If you want to have returned hits that include most of the searched
terms, and in case of only two input terms both: you can configure such
sophisticated rules with the 
http://wiki.apache.org/solr/DisMaxQParserPlugin
Have a look at the mm parameter (Minimum Should Match)

Chantal
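
For the two-term example above, a dismax request along these lines should keep the PowerShot hit out (the qf field name is made up):

q=canon pixma&defType=dismax&qf=name&mm=100%

With mm=100% every search term has to match; a softer rule such as mm=2<75% only kicks in once more terms are entered.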



Re: Restrict possible results based on relational information

2010-09-20 Thread Chantal Ackermann
hi Stefan

 users can send privates messages, the selection of recipients is done via
 auto-complete. therefore we need to restrict the possible results based on
 the users confirmed contacts - but i have absolutely no idea how to do that
 :/ Add all confirmed contacts to the index, and use it like a type of
 relation? pass the list of confirmed contacts together with the query?

This does not sound like a search query because:
1. you know the user
2. you know his/her list of confirmed contacts

If both statements are true, the list of confirmed contacts should be
accessible via JSON-URL call so that you can load it into a autocomplete
dropdown.
SOLR needs not be involved in this case (but you can of course store the
list of confirmed contacts in a multivalued field per user if you need
it for other searches or facetting).

Cheers,
Chantal



RE: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Chantal Ackermann
Hi Andre,

changing the entity in your index from donor to gift changes of course
the scope of your search results. I found it helpful to re-think such
a change from the other side (the result side).
If the users of your search application look for individual gifts, in
the end, then changing the index to gift is for the better.

If they are searching for donors, then I would rethink the change but
not discard it completely: you can still get the list of distinct donors
by facetting over donors. You can show the users that list of donors
(the facets), and they can chose from it and get all information on that
donor (restricted to the original query, of course). The information
would include the actual search result of a list of gifts that passed
the query.

Cheers,
Chantal
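
A sketch of that facet-over-donors idea with the fields from the mail below (assuming the gift-level index has a donorId field):

q=name:Jones&fq=giftDate:[NOW/MONTH-1MONTH TO NOW/MONTH]&fq=giftAmount:[0 TO 100]&rows=0&facet=true&facet.field=donorId&facet.mincount=1

The facet counts then give the distinct donors matching the filters, together with the number of qualifying gifts per donor.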

On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford wrote:
 Thanks for the response Erick.
 
 I did actually try exactly what you suggested. I flipped the index over so 
 that a gift is the document. This solution certainly solves the previous 
 problem, but introduces a new issue where the search results show duplicate 
 donors. If a donor gave 12 times in a year, and we offer full years as facet 
 ranges, my understanding is that you'd see that donor 12 times in the search 
 results, once for each gift document. Obviously I could do some client side 
 filtering to list only distinct donors, but I was hoping to avoid that.
 
 If I've simply stumbled into the basic tradeoffs of denormalization, I can 
 live with client side de-duplication, but if you have any further suggestions 
 I'm all eyes.
 
 As for sizing, we have some huge charities as clients. However, right now I'm 
 testing on a copy of prod data from a smaller client with ~350,000 donors and 
 ~8,000,000 gift records. So, when I flipped the index around as you 
 suggested, it went from 350,000 documents to 8,000,000 documents. No issues 
 with performance at all.
 
 Thanks again,
 Andre
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, September 15, 2010 3:09 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Simple Filter Query (fq) Use Case Question
 
 One strategy is to denormalize all the way. That is, each
 Solr document is Gift Amount and Gift Date would not be multiValued.
 You'd create a different document for each gift, so you'd have multiple
 documents with the same Id, Name, and Address. Be careful, though,
 if you've defined Id as a UniqueKey, you'd only have one record/donor. You
 can handle this easily enough by making a composite key of Id+Gift Date
 (assuming no donor made more than one gift on exactly the same date).
 
 I know this goes completely against all the reflexes you've built up with
 working with DBs, but...
 
 Can you give us a clue how many donations we're talking about here?
 You'd have to be working with a really big nonprofit to get enough documents
 to have to start worrying about making your index smaller.
 
 HTH
 Erick
 
 On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford abickf...@softrek.comwrote:
 
  I'm working on creating a solr index search for a charitable organization.
  The solr index stores documents of donors. Each donor document has the
  following four fields:
 
  Id
  Name
  Address
  Gift Amount (multiValued)
  Gift Date (multiValued)
 
  In our relational database, there is a one-to-many relationship between the
  DONOR table and the GIFT table. One donor can of course give many gifts over
  time. Consequently, I created the Gift Amount and Gift Date fields to be
  mutiValued.
 
  Now, consider the following query filtered for gifts last month between $0
  and $100:
 
  q=name:Jones
  fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
  fq=giftAmount:[0 TO 100]
 
  The results show me donors who donated ANY amount in the past month and
  donors who had EVER in the past given a gift between $0 and $100. I was
  hoping to only see donors who had given a gift between $0 and $100 in the
  past month exclusively. I believe the problem is that I neglected to
  consider that for two multiValued fields, while the values might align
  index wise, there is really no other association between the two fields,
  so the filter query intersection isn't really behaving as I expected.
 
  I think this is a fundamental question of one-to-many denormalization, but
  obviously I'm not yet experienced enough with Lucene/Solr to find a
  solution. As to why not just keep using a relational database, it's because
  I'm trying to provide a faceting solution to drill down to donors. The
  aforementioned fq parameters would come from faceting. Oh, that and Oracle
  Text indexes are a PITA. :-)
 
  Thanks for any help you can provide.
 
  André Bickford
  Software Engineering Team Leader
  SofTrek Corporation
  30 Bryant Woods North  Amherst, NY 14228
  716.691.2800 x154  800.442.9211  Fax: 716.691.2828
  abickf...@softrek.com  www.softrek.com
 
 
 
 




Re: Boosting specific field value

2010-09-16 Thread Chantal Ackermann
Hi Ravi,

with dismax, use the parameter q.alt which expects standard lucene
syntax (instead of q). If q.alt is present in the query, q is not
required. Add the parameter qt=dismax.

Chantal
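
Applied to the query from this thread, the request could look roughly like this (line breaks added for readability; assumes the dismax handler from the example solrconfig, otherwise use defType=dismax):

qt=dismax
&q.alt=primarysection:(Politics* OR Nation*)
&fq=contenttype:(Blog OR "Photo Gallery")
&fq=pubdatetime:[NOW-3MONTHS TO NOW]
&bq=source:("BBC" "Associated Press")^10

The bq clause only adds to the score of matching documents, so documents with other sources are still returned, just ranked lower.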

On Thu, 2010-09-16 at 06:22 +0200, Ravi Kiran wrote:
 Hello Mr.Rochkind,
I am using StandardRequestHandler so I presume I
 cannot use bq param right ?? Is there a way we can mix dismax and
 standardhandler i.e use lucene syntax for query and use dismax style for bq
 using localparams/nested queries? I remember seeing your post related to
 localparams and nested queries and got thoroughly confused
 
 On Wed, Sep 15, 2010 at 10:28 PM, Jonathan Rochkind rochk...@jhu.eduwrote:
 
  Maybe you are looking for the 'bq' (boost query) parameter in dismax?
 
  http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
  
  From: Ravi Kiran [ravi.bhas...@gmail.com]
  Sent: Wednesday, September 15, 2010 10:02 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Boosting specific field value
 
  Erick,
  I afraid you misinterpreted my issueif I query like you said
  i.e q=source(bbc OR associated press)^10  I will ONLY get documents with
  source BBC or Associated Press...what I am asking is - if my query query
  does not deal with source at all but uses some other field...since the
  field
  source will be in the result , is there a way to still boost such a
  document
 
  To re-iterate, If my query is as follows
 
  q=primarysection:(Politics* OR Nation*)fq=contenttype:(Blog OR Photo
  Gallery) pubdatetime:[NOW-3MONTHS TO NOW]
 
  and say the resulting docs have source field, is there any way I can
  boost
  the resulting doc/docs that have either BBC/Associated Press as the value
  in
  source field to be on top
 
  Can a filter query (fq) have a boost ? if yes, then probably I could
  rewrite
  the query as follows in a round about way
 
  q=primarysection:(Politics* OR Nation*)fq=contenttype:(Blog OR Photo
  Gallery) pubdatetime:[NOW-3MONTHS TO NOW] (source:(BBC OR Associated
  Press)^10 OR -source:(BBC OR Associated Press)^5)
 
  Theoretically, I have to write source in the fq 2 times as I need docs that
  have source values too just that they will have a lower boost
 
  Thanks,
 
  Ravi Kiran Bhaskar
 
  On Wed, Sep 15, 2010 at 1:34 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   This seems like a simple query-time boost, although I may not be
   understanding
   your problem well. That is, q=source(bbc OR associated press)^10
  
   As for boosting more recent documents, see:
  
  
  http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
  
   HTH
   Erick
  
  
   On Wed, Sep 15, 2010 at 12:44 PM, Ravi Kiran ravi.bhas...@gmail.com
   wrote:
  
Hello,
   I am currently querying solr for a *primarysection* which will
return documents like - *q=primarysection:(Politics* OR
Nation*)fq=contenttype:(Blog OR Photo Gallery)
   pubdatetime:[NOW-3MONTHS
TO NOW]*. Each document has several fields of which I am most
  interested
in
single valued field called *source* ...I want to boost documents
  which
contain *source* value say Associated Press OR BBC and also by
   newer
documents. The returned documents may have several other source values
other
than BBC or Associated Press. since I specifically don't query on
   these
source values I am not sure how I can boost them, Iam using *
StandardRequestHandler*
   
  
 





Re: SolrJ and Multi Core Set up

2010-09-03 Thread Chantal Ackermann
Hi Shaun,

you create the SolrServer using multicore by just adding the core to the
URL. You don't need to add anything with SolrQuery.

URL url = new URL(new URL(solrBaseUrl), coreName);
CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

Concerning the default core thing - I wouldn't know about that.


Cheers,
Chantal

On Fri, 2010-09-03 at 12:03 +0200, Shaun Campbell wrote:
 I'm writing a client using SolrJ and was wondering how to handle a multi
 core installation.  We want to use the facility to rebuild the index on one
 of the cores at a scheduled time and then use the SWAP facility to switch
 the live core to the newly rebuilt core.  I think I can do the SWAP with
 CoreAdminRequest.setAction() with a suitable parameter.
 
 First of all, does Solr have some concept of a default core? If I have core0
 as my live core and core1 which I rebuild, then after the swap I expect
 core0 to now contain my rebuilt index and core1 to contain the old live core
 data.  My application should then need to keep referring to core0 as normal
 with no change.  Does I have to refer to core0 programmatically? I've
 currently got working client code to index and to query my Solr data but I
 was wondering whether or how I set the core when I move to multi core?
 There's examples showing it set as part of the URL so my guess it's done by
 using something like setParam on SolrQuery.
 
 Has anyone got any advice or examples of using SolrJ in a multi core
 installation?
 
 Regards
 Shaun





Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Chantal Ackermann
Hi Ahmed,

fields that are empty do not impact the index. It's different from a
database.
I have text fields for different languages and per document there is
always only one of the languages set (the text fields for the other
languages are empty/not set). It works all very well and fast.

I wonder more about what you describe as unrelated data - why would
you want to put unrelated data into a single index? If you want to
search on all the data and return mixed results there surely must be
some kind of relation between the documents?

Chantal

On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
 I understand (and its straightforward) when you want to create a index for
 something simple like Products.
 
 But how do you go about creating a Solr index when you have data coming from
 10-15 database tables, and the tables have unrelated data?
 
 The issue is then you would have many 'columns' in your index, and they will
 be NULL for much of the data since you are trying to shove 15 db tables into
 a single Solr/Lucense index.
 
 
 This must be a common problem, what are the potential solutions?





Re: Implementing lookups while importing data

2010-07-29 Thread Chantal Ackermann
Hi Gora,

your suggestion is good.

Two thoughts:
1. if both of the tables you are joining are in the same database under
the same user you might want to check why the join is so slow. Maybe you
just need to add an index on a column that is used in your WHERE
clauses. Joins should not be slow.

2. if the tables are in different databases and you are joining them via
DIH I tend to agree that this can get too slow (I think the connections
might not get pooled and the jdbc driver adds too much overhead -
ATTENTION ASSUMPTION).
If it's not a possibility for you to create a temporary table that
aggregates the required data before indexing, then your proposal is
indeed a good solution.
Another way I can think off right now, that would only reduce your
coding effort and change it to a configuration task:
In your indexing procedure do:
a) create a temporary solr core on your solr server (see the page on
core admin in the wiki)
b) index this tmp core with the text data
c) index your main core with the data by joining it to the already
existing solr index in the tmp core (this is fast, I can assure you, use
URLDataSource with XPathEntityProcessor if you are on 1.4)
d) delete the tmp core (well, or keep it for next time)

Chantal
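
Step (c) as a rough, untested sketch (host, core, and field names are made up; note the & has to be escaped inside the XML attribute):

<dataSource name="tmpSolr" type="URLDataSource" encoding="UTF-8"/>

<entity name="lookup" processor="XPathEntityProcessor" dataSource="tmpSolr"
        url="http://localhost:8983/solr/tmpcore/select?q=code:${main.code}&amp;rows=1"
        forEach="/response/result/doc">
  <field column="label" xpath="/response/result/doc/str[@name='label']"/>
</entity>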


On Thu, 2010-07-29 at 11:51 +0200, Gora Mohanty wrote:
 Hi,
 
 We have a database that has numeric values for some columns, which
 correspond to text values in drop-downs on a website. We need to
 index both the numeric and text equivalents into Solr, and can do
 that via a lookup on a different table from the one holding the
 main data. We are currently doing this via a JOIN on the numeric
 field, between the main data table and the lookup table, but this
 dramatically slows down indexing.
 
 We could try using the CachedSqlEntity processor, but there are
 some issues in doing that, as the data import handler is quite
 complicated.
 
 As the lookups need to be done only once, I was planning the
 following:
 (a) Do the lookups in a custom data source that extends
 JDBCDataSource, and store them in arrays.
 (b) Implement a custom transformer that uses the array data
 to convert numeric values read from the database to text.
 Comments on this approach, or suggestions for simpler ones would be
 much appreciated.
 
 Regards,
 Gora





Re: Excluding large tokens from indexing

2010-07-29 Thread Chantal Ackermann
This is probably what you want?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
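
For example, a field type along these lines drops every token longer than 50 characters at index and query time:

<fieldType name="text_maxlen" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- tokens shorter than 1 or longer than 50 characters are removed -->
    <filter class="solr.LengthFilterFactory" min="1" max="50"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>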



On Thu, 2010-07-29 at 15:44 +0200, Paul Dlug wrote:
 Is there a filter available that will remove large tokens from the
 token stream? Ideally something configurable to a character limit? I
 have a noisy data set that has some large tokens (in this case more
 than 50 characters) that I'd like to just strip. They're unlikely to
 ever match a user query and will just take up space since there are a
 large number of them that are not distinct.
 
 
 --Paul





Re: Indexing Problem: Where's my data?

2010-07-28 Thread Chantal Ackermann
make sure to set stored=true on every field you expect to be returned
in your results for later display.

Chantal




Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-28 Thread Chantal Ackermann
Hi Lance!

On Wed, 2010-07-28 at 02:31 +0200, Lance Norskog wrote:
 Should this go into the trunk, or does it only solve problems unique
 to your use case?

The solution is generic but is an extension of XPathEntityProcessor
because I didn't want to touch the solr.war. This way I can deploy the
extension into SOLR_HOME/lib.
The problem that it solves is not one with XPathEntityProcessor but more
general. What it does:

It adds an attribute to the entity that I called skipIfEmpty which
takes the variable (it could even take more variables seperated by
whitespace).
On entityProcessor.init() which is called for sub-entities per row of
root entity (:= before every new request to the data source), the value
of the attribute is resolved and if it is null or empty (after
trimming), the entity is not further processed.
This attribute is only allowed on sub-entities.

It would probably be nicer to put that somewhere higher up in the class
hierarchy so that all entity processors could make use of it.
But I don't know how common the use case is - all examples I found where
more or less joins on primary keys.

Cheers,
Chantal

Here comes the code==

import static
org.apache.solr.handler.dataimport.DataImportHandlerException.SEVERE;

import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.XPathEntityProcessor;

public class OptionalXPathEntityProcessor extends XPathEntityProcessor {
private Logger log =
Logger.getLogger(OptionalXPathEntityProcessor.class.getName());
private static final String SKIP_IF_EMPTY = "skipIfEmpty";
private boolean skip = false;

@Override
protected void firstInit(Context context) {
if (context.isRootEntity()) {
throw new DataImportHandlerException(SEVERE,
"OptionalXPathEntityProcessor not allowed for root entities.");
}
super.firstInit(context);
}

@Override
public void init(Context context) {
String value = 
context.getResolvedEntityAttribute(SKIP_IF_EMPTY);
if (value == null || value.trim().isEmpty()) {
skip = true;
} else {
super.init(context);
skip = false;
}
}

@Override
public Map<String, Object> nextRow() {
if (skip) return null;
return super.nextRow();
}
}
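
A usage sketch for the entity declaration, assuming the class is deployed to SOLR_HOME/lib as described above (URL and xpath are placeholders):

<entity name="ssc_entry" dataSource="ssc"
        processor="OptionalXPathEntityProcessor"
        skipIfEmpty="${prog.vip}" onError="continue"
        url="http://localhost:8983/solr/ssc/select?q=vip:(${prog.vip})"
        forEach="/response/result/doc">
  <field column="vip_ssc" xpath="/response/result/doc/str[@name='value']"/>
</entity>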




Re: SolrJ Response + JSON

2010-07-28 Thread Chantal Ackermann
You could use org.apache.solr.handler.JsonLoader.
That one uses org.apache.noggit.JSONParser internally.
I've used the JacksonParser with Spring.

http://json.org/ lists parsers for different programming languages.

Cheers,
Chantal

On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote:
 Hello , 
 
 Second try to send a mail to the mailing list... 
 
 I need to translate SolrJ's response into JSON-response.
 I can not query Solr directly, because I need to do some math with the
 responsed data, before I show the results to the client.
 
 Any experiences how to translate SolrJ's response into JSON without writing
 your own JSON Writer?
 
 Thank you. 
 - Mitch




Re: SolrJ Response + JSON

2010-07-28 Thread Chantal Ackermann
Hi Mitch

On Wed, 2010-07-28 at 16:38 +0200, MitchK wrote:
 Thank you, Chantal.
 
 I have looked at this one: http://www.json.org/java/index.html
 
 This seems to be an easy-to-understand-implementation.
 
 However, I am wondering how to determine whether a SolrDocument's field 
 is multiValued or not.
 The JSONResponseWriter of Solr looks at the schema-configuration. 
 However, the client shouldn't do that.
 How did you solved that problem?

I didn't. I'm not recreating JSON from the SolrJ results.

I would try to use the same classes that SolrJ uses, actually. (Writing
that without having a further look at the code.) I would avoid
recreating existing code as much as possible.
About multivalued fields: you need instanceof checks, I guess. The field
only contains a list if there really are multiple values. (That's what
works for my ScriptTransformer.)

Are you sure that you cannot change the SOLR results at query time
according to your needs? Maybe you should ask for that, first (ask for X
instead of Y...).

Cheers,
Chantal


 
 Thanks for sharing ideas.
 
 - Mitch
 
 
 Am 28.07.2010 15:35, schrieb Chantal Ackermann:
  You could use org.apache.solr.handler.JsonLoader.
  That one uses org.apache.noggit.JSONParser internally.
  I've used the JacksonParser with Spring.
 
  http://json.org/ lists parsers for different programming languages.
 
  Cheers,
  Chantal
 
  On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote:
 
  Hello ,
 
  Second try to send a mail to the mailing list...
 
  I need to translate SolrJ's response into JSON-response.
  I can not query Solr directly, because I need to do some math with the
  responsed data, before I show the results to the client.
 
  Any experiences how to translate SolrJ's response into JSON without writing
  your own JSON Writer?
 
  Thank you.
  - Mitch
   
 
 
 





Re: Design questions/Schema Help

2010-07-27 Thread Chantal Ackermann
Hi,

IMHO you can do this with date range queries and (date) facets.
The DateMathParser will allow you to normalize dates to minutes/hours/days.
If you hit a limit there, then just add a field with an integer for
either minute/hour/day. This way you'll lose the month information - which
is sometimes what you want.

You probably want the document entity to be a query with fields:
query
user (id? if you have that)
sessionid
date

the most popular query within a date range is the query that was logged
most times? Do a search on the date range:
q=date:[start TO end]
with facet on the query which gives you the count similar to group by 
count aggregation functionality in an RDBMS. You can do multiple facets
at the same time but be careful what you are querying for - it will
impact the facet count. You can use functions to change the base of each
facet.

http://wiki.apache.org/solr/SimpleFacetParameters

Cheers,
Chantal
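
As a concrete sketch for "top 10 queries of the last 7 days" with the fields listed above (hits would have to be indexed as well for the zero-results variant):

q=*:*&fq=date:[NOW/DAY-7DAYS TO NOW]&rows=0&facet=true&facet.field=query&facet.limit=10&facet.mincount=1

Adding fq=hits:0 narrows the same facet down to the queries that returned nothing.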

On Tue, 2010-07-27 at 01:43 +0200, Mark wrote:
 We are thinking about using Cassandra to store our search logs. Can 
 someone point me in the right direction/lend some guidance on design? I 
 am new to Cassandra and I am having trouble wrapping my head around some 
 of these new concepts. My brain keeps wanting to go back to a RDBMS design.
 
 We will be storing the user query, # of hits returned and their session 
 id. We would like to be able to answer the following questions.
 
 - What is the n most popular queries and their counts within the last x 
 (mins/hours/days/etc). Basically the most popular searches within a 
 given time range.
 - What is the most popular query within the last x where hits = 0. Same 
 as above but with an extra where clause
 - For session id x give me all their other queries
 - What are all the session ids that searched for 'foos'
 
 We accomplish the above functionality w/ MySQL using 2 tables. One for 
 the raw search log information and the other to keep the 
 aggregate/running counts of queries.
 
 Would this sort of ad-hoc querying be better implemented using Hadoop + 
 Hive? If so, should I be storing all this information in Cassandra then 
 using Hadoop to retrieve it?
 
 Thanks for your suggestions





Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch,

thanks for that suggestion. I wasn't aware of that. I've already added a
temporary field in my ScriptTransformer that does basically the same.

However, with this approach indexing time went up from 20min to more
than 5 hours.

The new approach is to query the solr index for that other database that
I've already setup. This is only a bit slower than the original query
(20min). (I'm using URLDataSource to be 1.4.1 conform.)

As with the db entity before, for every document a request is sent to
the solr core even if it is useless because the input variable is empty.
It seems that once an entity processor kicks in you cannot avoid the
initial request to its data source?

Thanks,
Chantal

On Mon, 2010-07-26 at 16:22 +0200, MitchK wrote:
 Hi Chantal,
 
 did you tried to write a  http://wiki.apache.org/solr/DIHCustomFunctions
 custom DIH Function ?
 If not, I think this will be a solution.
 Just check, whether ${prog.vip} is an empty string or null.
 If so, you need to replace it with a value that never can response anything.
 
 So the vip-field will always be empty for such queries. 
 Maybe that helps?
 
 Hopefully, the variable resolver is able to resolve something like
 ${dih.functions.getReplacementIfNeeded(prog.vip).
 
 Kind regards,
 - Mitch
 
 
 
 Chantal Ackermann wrote:
  
  Hi,
  
  my use case is the following:
  
  In a sub-entity I request rows from a database for an input list of
  strings:
  entity name=prog ...
  field name=vip ... /* multivalued, not required */
  entity name=ssc_entry dataSource=ssc onError=continue
  query=select SSC_VALUE from SSC_VALUE
  where SSC_ATTRIBUTE_ID=1
and SSC_VALUE in (${prog.vip})
  field column=SSC_VALUE name=vip_ssc /
  /entity
  /entity
  
  The root entity is prog and it has an optional multivalued field
  called vip. When the list of vip values is empty, the SQL for the
  sub-entity above throws an SQLException. (Working with Oracle which does
  not allow an empty expression in the in-clause.)
  
  Two things:
  (A) best would be not to run the query whenever ${prog.vip} is null or
  empty.
  (B) From the documentation, it is not clear that onError is only checked
  in the transformer runs but not checked when the SQL for the entity
  throws an exception. (Trunk version JdbcDataSource lines 250pp).
  
  IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
  (A) is not easily fixable, making (B) work would be helpful.)
  
  Looking through the code, I've realized that the replacement of the
  variables is done in a very generic way. I've not yet seen an
  appropriate way to check on those variables in order to stop the
  processing of the entity if the variable is empty.
  Is there a way to do this? Or maybe there is a completely different way
  to get my use case working. Any help most appreciated!
  
  Thanks,
  Chantal
  
  
  




Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch,


 New idea:
 Create a method which returns the query-string:
 
 returnString(theVIP)
 {
if ( theVIP != null || theVIP != )
{
return a query-string to find the vip
}
else
{
return SELECT 1 // you need to modify this, so that it
 matches your field-definition
}
 }
 
 The main-idea is to perform a blazing fast query, instead of a complex
 IN-clause-query.
 Does this sounds like a solution???

I was using the in-clause because it's multivalued input that results in
multivalued output (not necessarily, but most probably - it's either
empty or multiple values).
I don't understand how I can make your solution work with multivalued
input/output?

  The new approach is to query the solr index for that other database that 
  I've already setup. This is only a bit slower than the original query 
  (20min). (I'm using URLDataSource to be 1.4.1 conform.) 
  
 Unfortunately I can not follow you. 
 You are querying a solr-index for a database?

Yes, because I've already put one up (second core) and used SolrJ to get
what I want later on, but it would be better to compute the relation
between the two indexes at index time instead of at query time. (If it
would have worked with the db entity the second index wouldn't have been
required, anymore.)
But now that it works well with the url entity I'm fine with maintaining
that second index. It's not that much effort.
I've subclassed URLDataSource to add a check whether the list of input
values is empty and only proceed when this is not the case. I realized
that I have to throw an exception and add the onError attribute to the
entity to make that work.

Thanks!
Chantal



Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch,

thanks for the code. Currently, I've got a different solution running
but it's always good to have examples.

  If realized 
  that I have to throw an exception and add the onError attribute to the 
  entity to make that work. 
  
 I am curious:
 Can you show how to make a method throwing an exception that is accepted by
 the onError-attribute?

the catch clause looks for Exception so it's actually easy. :-D

Anyway, I've found a cleaner way. It is better to subclass the
XPathEntityProcessor and put it in a state that prevents it from calling
initQuery which triggers the dataSource.getData() call.
I have overridden the initContext() method setting a go/no go flag that
I am using in the overridden nextRow() to find out whether to delegate
to the superclass or not.

This way I can also avoid the code that fills the tmp field with an
empty value if there is no value to query on.

Cheers,
Chantal



Re: help with a schema design problem

2010-07-26 Thread Chantal Ackermann
Hi,

I haven't read everything thoroughly but have you considered creating
fields for each of your (I think what you call) party value?

So that you can query like client:Pramod.
You would then be able to facet on client and supplier.

Cheers,
Chantal
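
That is, something like this in schema.xml (multivalued in case a document can have several clients or suppliers), which then allows queries such as client:Pramod AND supplier:Raj:

<field name="client"   type="string" indexed="true" stored="true" multiValued="true"/>
<field name="supplier" type="string" indexed="true" stored="true" multiValued="true"/>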



On Fri, 2010-07-23 at 23:23 +0200, Geert-Jan Brits wrote:
 Multiple rows in the OPs example are combined to form 1 solr-document (e.g:
 row 1 and 2 both have documentid=1)
 Because of this combine, it would match p_value from row1 with p_type from
 row2 (or vice versa)
 
 
 2010/7/23 Nagelberg, Kallin knagelb...@globeandmail.com
 
 When i search
 p_value:Pramod AND p_type:Supplier

 it would give me result as document 1. Which is incorrect, since in
 document
 1 Pramod is a Client and not a Supplier.
 
  Would it? I would expect it to give you nothing.
 
  -Kal
 
 
 
  -Original Message-
  From: Geert-Jan Brits [mailto:gbr...@gmail.com]
  Sent: Friday, July 23, 2010 5:05 PM
  To: solr-user@lucene.apache.org
  Subject: Re: help with a schema design problem
 
   Is there any way in solr to say p_value[someIndex]=pramod
  And p_type[someIndex]=client.
  No, I'm 99% sure there is not.
 
   One way would be to define a single field in the schema as p_value_type =
  client pramod i.e. combine the value from both the field and store it in
  a
  single field.
  yep, for the use-case you mentioned that would definitely work. Multivalued
  of course, so it can contain Supplier Raj as well.
 
 
  2010/7/23 Pramod Goyal pramod.go...@gmail.com
 
  In my case the document id is the unique key( each row is not a unique
   document ) . So a single document has multiple Party Value and Party
  Type.
   Hence i need to define both Party value and Party type as mutli-valued.
  Is
   there any way in solr to say p_value[someIndex]=pramod And
   p_type[someIndex]=client.
  Is there any other way i can design my schema ? I have some solutions
   but none seems to be a good solution. One way would be to define a single
   field in the schema as p_value_type = client pramod i.e. combine the
   value
   from both the field and store it in a single field.
  
  
   On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com
   wrote:
  
With the usecase you specified it should work to just index each Row
  as
you described in your initial post to be a seperate document.
This way p_value and p_type all get singlevalued and you get a correct
combination of p_value and p_type.
   
However, this may not go so well with other use-cases you have in mind,
e.g.: requiring that no multiple results are returned with the same
document
id.
   
   
   
2010/7/23 Pramod Goyal pramod.go...@gmail.com
   
 I want to do that. But if I understand correctly, in solr it would store
 the field like this:

 p_value: Pramod  Raj
 p_type:  Client Supplier

 When i search
 p_value:Pramod AND p_type:Supplier

 it would give me result as document 1. Which is incorrect, since in
 document 1 Pramod is a Client and not a Supplier.




 On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin 
 knagelb...@globeandmail.com wrote:

  I think you just want something like:
 
  p_value:Pramod AND p_type:Supplier
 
  no?
  -Kallin Nagelberg
 
  -Original Message-
  From: Pramod Goyal [mailto:pramod.go...@gmail.com]
  Sent: Friday, July 23, 2010 2:17 PM
  To: solr-user@lucene.apache.org
  Subject: help with a schema design problem
 
  Hi,
 
  Lets say i have a table with 3 columns: document id, Party Value and
  Party Type.
  In this table i have 3 rows. 1st row: Document id: 1, Party Value: Pramod,
  Party Type: Client. 2nd row: Document id: 1, Party Value: Raj, Party Type:
  Supplier. 3rd row: Document id: 2, Party Value: Pramod, Party Type:
  Supplier.
  Now in this table, if i use SQL it is easy for me to find all documents
  with Party Value as Pramod and Party Type as Client.

  I need to design a solr schema so that i can do the same in Solr. If i
  create 2 fields in the solr schema, Party value and Party type, both of
  them multi-valued, and try to query +Pramod +Supplier, then solr will
  return me the first document, even though in the first document Pramod is
  a client and not a supplier.
  Thanks,
  Pramod Goyal
 

   
  
 




DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-26 Thread Chantal Ackermann
Hi,

my use case is the following:

In a sub-entity I request rows from a database for an input list of
strings:
<entity name="prog" ...>
    <field name="vip" ... />  <!-- multivalued, not required -->
    <entity name="ssc_entry" dataSource="ssc" onError="continue"
            query="select SSC_VALUE from SSC_VALUE
                   where SSC_ATTRIBUTE_ID=1
                     and SSC_VALUE in (${prog.vip})">
        <field column="SSC_VALUE" name="vip_ssc" />
    </entity>
</entity>

The root entity is prog and it has an optional multivalued field
called vip. When the list of vip values is empty, the SQL for the
sub-entity above throws an SQLException. (Working with Oracle, which does
not allow an empty expression in the in-clause.)

Two things:
(A) best would be not to run the query whenever ${prog.vip} is null or
empty.
(B) From the documentation, it is not clear that onError is only checked
during the transformer runs, but not when the SQL for the entity
throws an exception. (Trunk version JdbcDataSource lines 250pp).

IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
(A) is not easily fixable, making (B) work would be helpful.)

Looking through the code, I've realized that the replacement of the
variables is done in a very generic way. I've not yet seen an
appropriate way to check on those variables in order to stop the
processing of the entity if the variable is empty.
Is there a way to do this? Or maybe there is a completely different way
to get my use case working. Any help most appreciated!
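
To make (A) concrete: what I am after is behaviour equivalent to the sketch
below, only without having to subclass (untested, and the hard-coded
"prog.vip" is obviously just for the sake of the example):

import java.util.List;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class SkipOnEmptyListEntityProcessor extends SqlEntityProcessor {

    private boolean skip;

    @Override
    public void init(Context context) {
        super.init(context);
        // resolve the variable the query depends on before any SQL is run
        Object vip = context.getVariableResolver().resolve("prog.vip");
        skip = (vip == null
                || (vip instanceof List && ((List<?>) vip).isEmpty())
                || vip.toString().trim().length() == 0);
    }

    @Override
    public Map<String, Object> nextRow() {
        // no rows -> the query is never sent to the database
        return skip ? null : super.nextRow();
    }
}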

Thanks,
Chantal



Re: Problem with parsing date

2010-07-26 Thread Chantal Ackermann
On Mon, 2010-07-26 at 14:46 +0200, Rafal Bluszcz Zawadzki wrote:
 EEE, d MMM  HH:mm:ss z

Not sure, but you might want to try an uppercase 'Z' for the timezone
(or, alternatively, surround it with single quotes). The rest of your
pattern looks fine. If you still run into problems, try different
variations, like putting the comma in quotes etc.
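
If in doubt, you can test the pattern outside of DIH, since (as far as I
know) the DateFormatTransformer relies on SimpleDateFormat semantics. A
quick sketch (the sample value and the yyyy part are just my guess at what
your data looks like):

import java.text.SimpleDateFormat;
import java.util.Locale;

public class DatePatternCheck {

    public static void main(String[] args) throws Exception {
        // made-up sample; replace with a real value from your data
        String input = "Mon, 26 Jul 2010 14:46:00 +0200";

        // uppercase 'Z' parses an RFC-822 style numeric timezone like +0200,
        // lowercase 'z' expects names like CET or GMT
        SimpleDateFormat fmt =
                new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.ENGLISH);

        System.out.println(fmt.parse(input));
    }
}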

Cheers,
Chantal




Re: Solr on iPad?

2010-07-23 Thread Chantal Ackermann
Hi,

unfortunately for iPad developers, it seems that it is not possible to
use the Spotlight engine through the SDK:

http://stackoverflow.com/questions/3133678/spotlight-search-in-the-application

Chantal

On Fri, 2010-07-23 at 10:16 +0200, Mark Allan wrote:
 Hi Stephan,
 
 On the iPad, as with the iPhone, I'm afraid you're stuck with using  
 SQLite if you want any form of database in your app.
 
 I suppose if you wanted to get really ambitious and had a lot of time  
 on your hands you could use Xcode to try and compile one of the
 open-source C-based DBs/Indexers, but as with most things in OS X and iOS
 development, if you're bending over yourself trying to implement  
 something, you're probably doing it wrongly!  Also, I wouldn't put it  
 past the AppStore guardians to reject your app purely on the basis of  
 having used something other than SQLite!
 
 Apple's cocoa-dev mailing list is very active if you have problems,  
 but do your homework before asking questions or you'll get short shrift.
   http://lists.apple.com/cocoa-dev
 
 Mark
 
 On 22 Jul 2010, at 6:12 pm, Stephan Schwab wrote:
 
  Dear Solr community,
 
  does anyone know whether it may be possible or has already been done  
  to
  bring Solr to the Apple iPad so that applications may use a local  
  search
  engine?
 
  Greetings,
  Stephan
 




