Parallel SQL / calcite adapter

2015-11-19 Thread Kai Gülzau

We are currently evaluating Calcite as a SQL facade over different data sources:

-  JDBC

-  REST

-  Solr

-  ...

I didn't find a "native" Calcite adapter for Solr 
(http://calcite.apache.org/docs/adapter.html).

Is it a good idea to use the Parallel SQL feature (over JDBC) to connect 
Calcite (or Apache Drill) to Solr?
Any suggestions?
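For reference, a rough sketch of that idea: Solr's Parallel SQL ships a thin JDBC driver, and Calcite can wrap any JDBC source via a model file. The zkHost/port and collection name below are placeholders, and this has not been tested against a real cluster:

```json
{
  "version": "1.0",
  "defaultSchema": "SOLR",
  "schemas": [
    {
      "name": "SOLR",
      "type": "jdbc",
      "jdbcDriver": "org.apache.solr.client.solrj.io.sql.DriverImpl",
      "jdbcUrl": "jdbc:solr://zkhost:9983?collection=collection1"
    }
  ]
}
```

Calcite would then see the Solr collection as an ordinary JDBC schema, with the caveat that only the SQL subset supported by Parallel SQL can be pushed down.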


Thanks,

Kai Gülzau


StandardTokenizer vs. hyphens

2013-05-17 Thread Kai Gülzau
Is there a StandardTokenizer implementation which does not break words on 
hyphens?

I think it would be more flexible to retain hyphens and use a 
WordDelimiterFactory to split these tokens.


StandardTokenizer today:
doc1: email -> email
doc2: e-mail -> e|mail
doc3: e mail -> e|mail

query1: email -> doc1
query2: e-mail -> doc2,doc3
query3: e mail -> doc2,doc3


StandardTokenizer which keeps hyphens + WDF:
doc1: email -> email
doc2: e-mail -> e-mail|email|e|mail
doc3: e mail -> e|mail

query1: email -> doc1,doc2
query2: e-mail -> doc1,doc2,doc3
query3: e mail -> doc2,doc3


Any suggestions on how to configure or code the second behavior?
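One way to approximate the second behavior without a custom tokenizer — a sketch, assuming a WhitespaceTokenizer plus WordDelimiterFilter is acceptable (it keeps hyphenated tokens whole so WDF can split them, at the cost of also keeping other punctuation that StandardTokenizer would strip):

```xml
<fieldType name="text_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keeps "e-mail" as one token; WDF then emits e-mail|email|e|mail -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Trailing punctuation would still need extra handling, so this is not a full replacement for StandardTokenizer.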

Regards,

Kai Gülzau


Keyword aware Tokenizer?

2013-05-17 Thread Kai Gülzau
Does anybody know of a tokenizer which can be configured with (multiple) 
regular expressions to mark some of the input text as keyword
and behave like StandardTokenizer (or UAX29URLEmailTokenizer) otherwise?

Input:
Does my order 4711.0815!-somecode_and.other(stuff) arrive on friday?

Tokens:
does|my|order|4711.0815!-somecode_and.other(stuff)|arrive|on|friday


Any pointers? Or hints on how to code it?
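I am not aware of a stock tokenizer that protects regex-marked keywords. A rough compromise sketch is to split on whitespace only via PatternTokenizerFactory, so punctuation-heavy codes survive as single tokens, and then strip trailing sentence punctuation from ordinary words; the patterns below are assumptions, not a drop-in solution:

```xml
<fieldType name="text_keepcodes" class="solr.TextField">
  <analyzer>
    <!-- group="-1": the pattern is a delimiter, so tokens are whitespace-free runs -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+" group="-1"/>
    <!-- strip trailing sentence punctuation ("friday?" -> "friday") -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[?!.,]+$" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This keeps "4711.0815!-somecode_and.other(stuff)" intact but does not otherwise behave like StandardTokenizer, so a real solution probably needs a custom tokenizer or char filter.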

Regards,

Kai Gülzau






RE: Term Frequencies for Query Result

2013-02-15 Thread Kai Gülzau
> i *think* you are saying that you want the sum of term frequencies for all 
> terms in all matching documents -- but i'm not sure, because i don't see 
> how TermVectorComponent is helping you unless you are iterating over every 
> doc in the result set (ie: deep paging) to get the TermVectors for every 
> doc ... it would help if you could explain what you mean by "counting all 
> frequencies manually"

You are good at guessing :-)
By "counting all frequencies manually" I mean collecting the term
frequencies for each term while iterating over all documents.


>> I am looking for a way to get the top terms for a query result.
> you have to elaborate on exactly what you mean ... how are you defining 
> "top terms for a query result" ?  Are you talking about the most common 
> terms in the entire result set of documents that match your query?

My goal is to show the most relevant keywords for some documents of the index.
So "top terms for a query result" should be "top nouns for a filtered query".

With faceting, "top" means "sorted by the count of docs containing the term".

If I could get the sum of the term frequencies, my hope is to be able
to distinguish between overly common terms and more relevant terms.
Something like a score for a term based on a filtered query.
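That intuition can be sketched in plain Python (the data layout is an assumption: a list of per-document term-frequency maps, as TermVectorComponent would deliver them for the result set):

```python
import math
from collections import Counter

def top_terms(doc_term_freqs, num_docs_total, k=3):
    """Rank terms for a result set: sum the tf over all matching docs,
    then damp terms that occur in (nearly) every document, TF-IDF style."""
    tf_sum = Counter()  # total occurrences of each term across the result set
    df = Counter()      # number of matching docs containing each term
    for tfs in doc_term_freqs:
        for term, tf in tfs.items():
            tf_sum[term] += tf
            df[term] += 1
    # terms present in every doc score 0 (log(N/N) == 0); rarer terms rise
    scores = {t: tf_sum[t] * math.log(num_docs_total / df[t]) for t in tf_sum}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A stop-word-ish term like "der" that appears in every matching document scores zero, while a rarer but frequent term floats to the top.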


regards,

Kai Gülzau


RE: which analyzer is used for facet.query?

2013-02-15 Thread Kai Gülzau
OK, "problem" solved...

In my tests I only reloaded the core "master" and queried the core "slave",
so the config changes on "slave" were not in place :-\

Sorry guys!

Kai


RE: How to make this work with SOLR ( LUCENE-2899 : Add OpenNLP Analysis capabilities as a module)

2013-02-15 Thread Kai Gülzau
> I tried patching my SOLR 4.1 source , as well as a freshly downloaded
> SOLR trunk, to no avail. I guess I just need some tips on how and what
> to patch. I tried to patch the base directory as well as the lucene
> directory. If there's something I need to hack in the  patch, do let
> me know.

Try applying the patch to trunk within Eclipse.
There you can see each file diff and adjust it manually while patching.

I just ignored most of the javadoc and some other (non-functional) diffs and
was able to produce jars which run (for my tests) in Solr 4.1.


regards,

Kai


copy Field / postprocess Fields after analyze / dynamic analyzer config

2013-02-08 Thread Kai Gülzau
Is there a way to postprocess a field after analysis?

By postprocessing I mean renaming, moving, or appending fields.


Some more information:

My schema.xml contains several language suffixed fields (nouns_de, ...).
Each of these is analyzed in a language-dependent way:

[schema.xml snippet lost in the archive]


When I do a faceted search I have to include every field_lang combination since 
I do not know the language at query time:

http://localhost:8983/solr/master/select?q=*:*&rows=0&facet=true&facet.field=nouns_de&facet.field=nouns_en&facet.field=nouns_fr&facet.field=nouns_nl
 ...

So I have to merge all terms in my own business logic :-(


Any idea / pointer on how to rename fields after analysis?

This post says it's not possible with the current API:
http://lucene.472066.n3.nabble.com/copyField-after-analyzer-td3900337.html


Another approach would be to allow analyzer configuration depending on another 
field value (language).


regards,

Kai Gülzau



RE: which analyzer is used for facet.query?

2013-02-08 Thread Kai Gülzau
> So it seems that facet.query is using the analyzer of type index.
> Is it a bug or is there another analyzer type for the facet query?

Nobody?
Should I file a bug?

Kai

-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Tuesday, February 05, 2013 2:31 PM
To: solr-user@lucene.apache.org
Subject: which analyzer is used for facet.query?

Hi all,

which analyzer is used for the facet.query?


This is my schema.xml:

[schema.xml snippet lost in the archive]



When doing a faceting search like:

http://localhost:8983/solr/slave/select?q=*:*&fq=type:7&rows=0&wt=json&indent=true&facet=true&facet.query=albody_de:Klaus

The UIMA whitespace tokenizer logs some infos:
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: "Whitespace 
tokenizer starts processing"
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: "Whitespace 
tokenizer finished processing"


So it seems that facet.query is using the analyzer of type index.
Is it a bug or is there another analyzer type for the facet query?

Regards,

Kai Gülzau





RE: Indexing nouns only with UIMA works - performance issue?

2013-02-05 Thread Kai Gülzau
So with https://issues.apache.org/jira/browse/LUCENE-4749 it's possible to set 
the ModelFile?

[config snippet lost in the archive]

Thanks,

Kai 


-Original Message-
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com] 
Sent: Monday, February 04, 2013 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing nouns only with UIMA works - performance issue?

see an example at
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117where
the 'ngramsize' parameter is set, that's defined in
AggregateSentenceAE.xml descriptor and is then set with the given actual
value.
HTH,

Tommaso
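If I read that right, additional attributes on the UIMA tokenizer factory are forwarded to the AE descriptor's configuration parameters, so a ModelFile override might look roughly like this (class and attribute names are assumptions based on the linked test schema, not verified against the 4.x API):

```xml
<tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
           descriptorPath="/uima/AggregateSentenceAE.xml"
           tokenType="org.apache.uima.TokenAnnotation"
           featurePath="posTag"
           ModelFile="file:german/TuebaModel.dat"/>
```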


which analyzer is used for facet.query?

2013-02-05 Thread Kai Gülzau
Hi all,

which analyzer is used for the facet.query?


This is my schema.xml:

[schema.xml snippet lost in the archive]



When doing a faceting search like:

http://localhost:8983/solr/slave/select?q=*:*&fq=type:7&rows=0&wt=json&indent=true&facet=true&facet.query=albody_de:Klaus

The UIMA whitespace tokenizer logs some infos:
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: "Whitespace 
tokenizer starts processing"
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: "Whitespace 
tokenizer finished processing"


So it seems that facet.query is using the analyzer of type index.
Is it a bug or is there another analyzer type for the facet query?

Regards,

Kai Gülzau





RE: Indexing nouns only - UIMA vs. OpenNLP

2013-02-01 Thread Kai Gülzau
Hi Lance,

> About removing non-nouns: the OpenNLP patch includes two simple 
> TokenFilters for manipulating terms with payloads. The 
> FilterPayloadFilter lets you keep or remove terms with given payloads.

yes, I used this already in the schema.xml
> <filter class="solr.FilterPayloadsFilterFactory"
>         payloadList="NN,NNS,NNP,NNPS,FM" keepPayloads="true"/>

Works fine :-)
But as Robert Muir stated in LUCENE-4345 I also think using types (and storing 
these optionally as payloads)
would be a better approach.

> http://code.google.com/p/universal-pos-tags/
Thanks for the pointer, I used it to improve my English (Brown) whitelist for 
UIMA :-)

Regards,

Kai Gülzau


Indexing nouns only with UIMA works - performance issue?

2013-02-01 Thread Kai Gülzau
I now use the "stupid" way to get the German corpus into UIMA: copy + paste :-)

I modified Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus:
...
<nameValuePair>
  <name>ModelFile</name>
  <value>
    <string>file:german/TuebaModel.dat</string>
  </value>
</nameValuePair>
...
and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml


Next step is to replace every occurrence of "HmmTagger" in
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
with "HmmTaggerDE" and save it as
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml

This can be used in your schema.xml:

[schema.xml snippet lost in the archive]

There should be a way to accomplish this via config though.



Last open issue: Performance!

First run via Admin GUI analyze index value "Klaus geht in das Haus und sieht 
eine Maus." / query: "": ~ 5 seconds
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"

Second run via Admin GUI analyze "Klaus geht in das Haus und sieht eine Maus." 
/ query: "": ~ 4 seconds
Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"

Initialized 3 times?
I think some of the components are not reused while analyzing.

Is this a known issue?


Regards,

Kai Gülzau



-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted 
token types :-)

[schema.xml snippet lost in the archive]

Open issue -> how do I set the ModelFile for the Tagger to 
"german/TuebaModel.dat"?


Kai Gülzau



RE: Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Kai Gülzau
UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted 
token types :-)

[schema.xml snippet lost in the archive]

Open issue -> how do I set the ModelFile for the Tagger to 
"german/TuebaModel.dat"?



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is 
now working with Solr 4.1 :-)

[schema.xml snippet lost in the archive]



Any hints on which lib is more accurate on noun tagging?
Any performance or memory issues (some OOMs here while testing with 1GB via the 
Analyzer Admin GUI)?


Regards,

Kai Gülzau




-Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of German and English texts
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example).


First try was to use UIMA with the HMMTagger:

<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters"/>
      <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>albody</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">org.apache.uima.SentenceAnnotation</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">albody2</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the solr contrib/langid field mapping?
- How do I remove non-nouns in the annotated field?


Second try is to use OpenNLP and to apply the patch 
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau



Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Kai Gülzau
Hi,

I am stuck trying to index only the nouns of German and English texts
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example).


First try was to use UIMA with the HMMTagger:

<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters"/>
      <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>albody</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">org.apache.uima.SentenceAnnotation</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">albody2</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the solr contrib/langid field mapping?
- How do I remove non-nouns in the annotated field?


Second try is to use OpenNLP and to apply the patch 
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau



Term Frequencies for Query Result

2013-01-30 Thread Kai Gülzau
Hi,

I am looking for a way to get the top terms for a query result.

Faceting does not work since counts are measured as the number of documents 
containing a term and not as the overall count of a term in all found documents:

http://localhost:8983/solr/master/select?q=type%3A7&rows=1&wt=json&indent=true&facet=true&facet.query=type%3A7&facet.field=albody&facet.method=fc

  "facet_counts":{
"facet_queries":{
  "type:7":156},
"facet_fields":{
  "albody":[
"der",73,
"in",68,
"betreff",63,
...


Using http://wiki.apache.org/solr/TermVectorComponent and counting all 
frequencies manually seems to be the only solution for now:

http://localhost:8983/solr/tvrh/?q=type:7&tv.fl=albody&f.albody.tv.tf=true&wt=json&indent=true


"termVectors":[
  "uniqueKeyFieldName","ukey",
  "798_7_0",[
    "uniqueKey","798_7_0",
    "albody",[
      "der",["tf",5],
      "die",["tf",7],
      ...



Does anyone know a better and more efficient solution?


Regards,

Kai Gülzau



RE: How to update one field without losing the others?

2012-06-18 Thread Kai Gülzau
I'm currently playing around with a branch_4x version 
(https://builds.apache.org/job/Solr-4.x/5/) but I don't get field updates to 
work.

A simple GET test request
http://localhost:8983/solr/master/update/json?stream.body={"add":{"doc":{"ukey":"08154711","type":"1","nbody":{"set":"mycontent"}}}}

results in
{
  "ukey":"08154711",
  "type":"1",
  "nbody":"{set=mycontent}"}]
}

All fields are stored.
ukey is the unique key :-)
type is a required field.
nbody is a solr.TextField.


Is there any (wiki/readme) pointer on how to test and use this feature correctly?
What are the restrictions?
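For reference, a hedged sketch of the atomic-update request shape that eventually shipped with Solr 4.0: a JSON array posted to /update, with the usual prerequisites being that all fields are stored (so unchanged values can be reconstructed) and that the updateLog is enabled in solrconfig.xml:

```json
[
  {
    "ukey": "08154711",
    "nbody": { "set": "mycontent" }
  }
]
```

When atomic updates are not active (e.g. on an older build or without the update log), the `{"set": ...}` map is indexed literally, which matches the `"{set=mycontent}"` result shown above.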

Regards,

Kai Gülzau

 
-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Saturday, June 16, 2012 4:47 PM
To: solr-user@lucene.apache.org
Subject: Re: How to update one field without losing the others?

Atomic update is a very new feature coming in 4.0 (i.e. grab a recent
nightly build to try it out).

It's not documented yet, but here's the JIRA issue:
https://issues.apache.org/jira/browse/SOLR-139?focusedCommentId=13269007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13269007

-Yonik
http://lucidimagination.com


mailto: scheme aware tokenizer

2012-03-16 Thread Kai Gülzau
Is there any analyzer out there which handles the mailto: scheme?

UAX29URLEmailTokenizer seems to split at the wrong place:

mailto:t...@example.org ->
mailto:test
example.org

As a workaround I use

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="mailto:" 
            replacement="mailto: "/>

Regards,

Kai Gülzau

novomind AG
__

Bramfelder Straße 121 • 22305 Hamburg

phone +49 (0)40 808071138 • fax +49 (0)40 808071-100
email kguel...@novomind.com • http://www.novomind.com

Vorstand : Peter Samuelsen (Vors.) • Stefan Grieben • Thomas Köhler
Aufsichtsratsvorsitzender: Werner Preuschhof
Gesellschaftssitz: Hamburg • HR B93508 Amtsgericht Hamburg


RE: DIH Strange Problem

2011-11-28 Thread Kai Gülzau
Do you use Java 6 update 29? There is a known issue with the latest mssql 
driver:

http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx

"In addition, there are known connection failure issues with Java 6 update 29, 
and the developer preview (non production) versions of Java 6 update 30 and 
Java 6 update 30 build 12.  We are in contact with Java on these issues and we 
will update this blog once we have more information."

Should work with update 28.

Kai

-Original Message-
From: Husain, Yavar [mailto:yhus...@firstam.com] 
Sent: Monday, November 28, 2011 1:02 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

I figured out the solution, and Microsoft, not Solr, is the problem here :)

I downloaded and built the latest Solr (3.4) from source and finally hit the 
following line of code in Solr (where I put my debug statements):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager was not returning. Here was the error!! The driver 
we were using was the Microsoft Type 4 JDBC driver for SQL Server. I downloaded 
another driver, the jTDS JDBC driver, and installed it. Problem fixed!

So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory where you had put the
   Microsoft JDBC driver.
3. In data-config.xml use this attribute: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml set the url like this:
   url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.
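Putting steps 3 and 4 together, the DIH dataSource element would look roughly like this (a sketch; the database name, user, and password are placeholders):

```xml
<dataSource type="JdbcDataSource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
            user="sa" password="secret"/>
```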

-Original Message-
From: Husain, Yavar
Sent: Thursday, November 24, 2011 12:38 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

Hi

Thanks for your replies.

I carried out these 2 steps (it did not solve my problem):

1. I tried setting responseBuffering to adaptive. Did not work.
2. To check the database connection I wrote a simple Java program that connects 
to the database and fetches some results with the same driver I use for Solr. It 
worked, so it does not seem to be a problem with the connection.

Now I am stuck where the Tomcat log says "Creating a connection for entity ." 
and does nothing. After this log we usually get the "getConnection() 
took x milliseconds" message, but I don't get that; I can just see the time 
increasing with no records getting fetched.

Original Problem listed again:


I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
data. Indexing and all was working perfectly fine. However, today when I started 
full indexing again, Solr halts/gets stuck at the line "Creating a connection for 
entity." There are no further messages after that. I can see that DIH 
is busy, and on the DIH console I can see "A command is still running"; I can 
also see total rows fetched = 0 and total requests made to datasource = 1, and the 
time keeps increasing although it is not doing anything. This is the exact 
configuration that worked for me before. I am not really able to understand the 
problem here. Also, in the index directory where I am storing the index there 
are just 3 files: 2 segment files + 1 lucene*-write.lock file. 
...
data-config.xml:

[data-config.xml snippet lost in the archive]

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
QTime=11 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX Nov 23, 2011 4:11:27 PM 
org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: 
jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, November 23, 2011 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Strange Problem

On 11/23/2011 5:21 AM, Chantal Ackermann wrote:
> Hi Yavar,
>
> my experience with similar problems was that there was something wrong
> with the database connection or the database.
>
> Chantal

It's also possible tha

DIH -> how to collect added/error unique keys?

2011-11-09 Thread Kai Gülzau
Hi *,

I am using DataImportHandler to do imports on an INDEX_QUEUE table (UKEY | 
ACTION), using a custom Transformer which adds fields from various sources 
depending on the UKEY.

Indexing works fine this way.

But now I want to delete the rows from INDEX_QUEUE which were successfully 
updated.

-> Is there a good "API way" to do this?

Right now I'm using a custom UpdateRequestProcessor which collects the UKEYs and 
calls a method on a singleton with access to the DB. It works, but I hate these 
global singletons... :-(

public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.getSolrInputDocument();
  try {
super.processAdd(cmd);
addOK(doc);
  } catch (IOException e) {
addError(doc);
throw e;
  } catch (RuntimeException e) {
addError(doc);
throw e;
  }
}

Any other suggestions?

Regards,

Kai Gülzau



RE: Jetty logging

2011-11-03 Thread Kai Gülzau
Hi,

remove slf4j-jdk14-1.6.1.jar from the war and repack it with slf4j-log4j12.jar 
and log4j-1.2.14.jar instead.

->http://wiki.apache.org/solr/SolrLogging

Regards,

Kai Gülzau

-Original Message-
From: darul [mailto:daru...@gmail.com] 
Sent: Thursday, November 03, 2011 11:26 AM
To: solr-user@lucene.apache.org
Subject: Jetty logging

Hello everybody,

I cannot find a solution for configuring Jetty with slf4j and a 
log4j.properties file.

In  I have put :

- log4j-1.2.14.jar
- slf4j-api-1.3.1.jar

in  directory:
- log4j.properties



In the end, nothing happens when running Jetty.

Do you have any ideas ?

Thanks,

Julien





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Jetty-logging-tp3476715p3476715.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: document update / nested documents / document join

2011-10-17 Thread Kai Gülzau
I just found another feature/ticket to be able to update fields:
https://issues.apache.org/jira/browse/SOLR-2753
https://issues.apache.org/jira/browse/LUCENE-1231

-> CSF Column Stride Fields

This should work well with simple fields like category/date/...!?

So I have 2 options:
1.)
Introduce rather complex logic on the client side to form the right join query 
(or do the join manually), which should, as you stated, work even with complex 
queries.

2.)
Or do it the straightforward way: combine all docs into one and WAIT for one of 
the various "update field/doc" features to be realized.


I think I'll give 1.) a try and wait for 2.) if I get into trouble.


Regards,

Kai Gülzau
  

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Monday, October 17, 2011 1:22 PM
To: solr-user@lucene.apache.org
Subject: Re: document update / nested documents / document join

Hi,

First, I'm not sure you know, but the join isn't like a join in a database; it's 
more like
   select * from (set of documents that match query)
   where exists (set of documents that match join query)

I have some complex (multiple join fq) queries in one call and that is fine, so 
I think this query may work also.
Otherwise you could try something like:
q=*:*&fq={!join+from=out_ticketid+to=ticketid}(category:bugfixes+OR+out_category:bugfixes)&fq={!join+from=out_ticketid+to=ticketid}(body:answer+OR+out_body:answer)

My wish would also be that this were backported to 3.x. But if not, we'll 
probably go live on 4.x.

Thijs


On 17-10-2011 11:46, Kai Gülzau wrote:
> Nobody?
>
> SOLR-139 seems to be the most popular issue but I don’t think this will be 
> resolved in near future (this year). Right?
>
> So I will try SOLR-2272 as a workaround, split up my documents in "static" 
> and " frequently updated"
> and join them at query time.
>
> What is the exact join query to do a query like "category:bugfixes AND 
> body:answer"
>matching "category:bugfixes" in doc1 and
>matching "body:answer" in doc3
>with just returning "doc 1"??
>
> I adopted the fieldnames of
> doc 3:
> type: out
> out_ticketid: 1001
> out_body: this is my answer
> out_category: other
>
> q={!join+from=out_ticketid+to=ticketid}(category:bugfixes+OR+out_categ
> ory:bugfixes)+AND+(body:answer+OR+out_body:answer)
>
>
> Writing this, I doubt this syntax is even possible!?
> Additionally I'm not sure if trunk with SOLR-2272 is "production ready".
>
> The only way to do what I want in a released 3.x version is to do several 
> searches and joining the results manually.
> e.g.
> q=category:bugfixes ->  doc1 ->  ticketid: 1001 q=body:answers ->  
> doc3 ->  ticket:1001
> ->  result ticketid:1001
>
> This way I would lose benefits like faceted search etc. :-\
>
> Any suggestions?
>
>
> Regards,
>
> Kai Gülzau
>
> -Original Message-
> From: Kai Gülzau [mailto:kguel...@novomind.com]
> Sent: Thursday, October 13, 2011 4:52 PM
> To: solr-user@lucene.apache.org
> Subject: document update / nested documents / document join
>
> Hi *,
>
> i am a bit confused about what is the best way to achieve my requirements.
>
> We have a mail ticket system. A ticket is created when a mail is received by 
> the system:
>
> doc 1:
> uid: 1001_in
> ticketid: 1001
> type: in
> body: I have a problem
> category: bugfixes
> date: 201110131955
>
> This incoming document is static. While the ticket is in progress there is 
> another document representing the current/last state of the ticket. Some 
> fields of this document are updated frequently:
>
> doc 2:
> uid: 1001_out
> ticketid: 1001
> type: out
> body:
> category: bugfixes
> date: 201110132015
>
> a bit later (doc 2 is deleted/updated):
> doc 3:
> uid: 1001_out
> ticketid: 1001
> type: out
> body: this is my answer
> category: other
> date: 201110140915
>
> I would like to do a boolean search spanning multiple documents like 
> "category:bugfixes AND body:answer".
>
> I think it's the same what was proposed by:
> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-sup
> port-in-lucene
>
> So I dig into the deeps of Lucene and Solr tickets and now i am stuck 
> choosing the "right" way:
>
> https://issues.apache.org/jira/browse/LUCENE-2454 Nested Document 
> query support
> https://issues.apache.org/jira/browse/LUCENE-3171 
> BlockJoinQuery/Collector
> https://issues.apache.org/jira/browse/LUCENE-1879 Parallel incremental 
> indexing
> https://issues.apache.org/jira/browse/SOLR-139 Support 
> updateable/modifiable documents
> https://issues.apache.org

RE: document update / nested documents / document join

2011-10-17 Thread Kai Gülzau
Nobody?

SOLR-139 seems to be the most popular issue, but I don't think this will be 
resolved in the near future (this year). Right?

So I will try SOLR-2272 as a workaround: split up my documents into "static" 
and "frequently updated" and join them at query time.

What is the exact join query to do a query like "category:bugfixes AND 
body:answer"
  matching "category:bugfixes" in doc1 and
  matching "body:answer" in doc3
  with just returning "doc 1"??

I adopted the fieldnames of
doc 3:
type: out
out_ticketid: 1001
out_body: this is my answer
out_category: other

q={!join+from=out_ticketid+to=ticketid}(category:bugfixes+OR+out_category:bugfixes)+AND+(body:answer+OR+out_body:answer)


Writing this, I doubt this syntax is even possible!?
Additionally I'm not sure if trunk with SOLR-2272 is "production ready".

The only way to do what I want in a released 3.x version is to do several 
searches and join the results manually, e.g.:
q=category:bugfixes -> doc1 -> ticketid: 1001
q=body:answers -> doc3 -> ticketid: 1001
-> result ticketid: 1001

This way I would lose benefits like faceted search etc. :-\
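For completeness, that manual client-side join can be sketched in a few lines (stub dicts stand in for real Solr responses; field names follow the doc 1 / doc 3 examples in this thread):

```python
def manual_join(results_a, results_b, key="ticketid"):
    """Client-side 'join': keep docs from results_a whose key also occurs
    in results_b, emulating a cross-document AND on the shared ticketid."""
    keys_b = {doc[key] for doc in results_b}
    return [doc for doc in results_a if doc[key] in keys_b]

# stub result sets, as two separate Solr queries would return them
in_docs = [{"uid": "1001_in", "ticketid": 1001, "category": "bugfixes"}]
out_docs = [{"uid": "1001_out", "ticketid": 1001, "body": "this is my answer"}]
```

The drawback stated above remains: facet counts and other per-query statistics would have to be recomputed over the intersected set.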

Any suggestions?


Regards,

Kai Gülzau

-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, October 13, 2011 4:52 PM
To: solr-user@lucene.apache.org
Subject: document update / nested documents / document join

Hi *,

I am a bit confused about what is the best way to achieve my requirements.

We have a mail ticket system. A ticket is created when a mail is received by 
the system:

doc 1:
uid: 1001_in
ticketid: 1001
type: in
body: I have a problem
category: bugfixes
date: 201110131955

This incoming document is static. While the ticket is in progress there is 
another document representing the current/last state of the ticket. Some fields 
of this document are updated frequently:

doc 2:
uid: 1001_out
ticketid: 1001
type: out
body:
category: bugfixes
date: 201110132015

a bit later (doc 2 is deleted/updated):
doc 3:
uid: 1001_out
ticketid: 1001
type: out
body: this is my answer
category: other
date: 201110140915

I would like to do a boolean search spanning multiple documents like 
"category:bugfixes AND body:answer".

I think it's the same as what was proposed in:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

So I dug into the depths of the Lucene and Solr tickets, and now I am stuck 
choosing the "right" way:

https://issues.apache.org/jira/browse/LUCENE-2454 Nested Document query support
https://issues.apache.org/jira/browse/LUCENE-3171 BlockJoinQuery/Collector
https://issues.apache.org/jira/browse/LUCENE-1879 Parallel incremental indexing
https://issues.apache.org/jira/browse/SOLR-139 Support updateable/modifiable 
documents
https://issues.apache.org/jira/browse/SOLR-2272 Join


If it is easily possible to update one field in a document i would just merge 
the two logical documents into one representing the whole ticket. But i can't 
see this is already possible.

SOLR-2272 seems to be the best solution by now but feels like workaround.
" I can't update a document field so i split it up in static and dynamic 
content and join both at query time."

SOLR-2272 is committed to trunk/solr 4.
Are there any planned release dates for solr 4 or a possible backport for 
SOLR-2272 in 3.x?


I would appreciate any suggestions.

Regards,

Kai Gülzau







document update / nested documents / document join

2011-10-13 Thread Kai Gülzau
Hi *,

I am a bit confused about the best way to achieve my requirements.

We have a mail ticket system. A ticket is created when a mail is received by 
the system:

doc 1:
uid: 1001_in
ticketid: 1001
type: in
body: I have a problem
category: bugfixes
date: 201110131955

This incoming document is static. While the ticket is in progress there is 
another document representing
the current/last state of the ticket. Some fields of this document are updated 
frequently:

doc 2:
uid: 1001_out
ticketid: 1001
type: out
body:
category: bugfixes
date: 201110132015

a bit later (doc 2 is deleted/updated):
doc 3:
uid: 1001_out
ticketid: 1001
type: out
body: this is my answer
category: other
date: 201110140915

I would like to do a boolean search spanning multiple documents like 
"category:bugfixes AND body:answer".

I think it's the same what was proposed by:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

So I dug into the depths of the Lucene and Solr tickets and now I am stuck 
choosing the "right" way:

https://issues.apache.org/jira/browse/LUCENE-2454 Nested Document query support
https://issues.apache.org/jira/browse/LUCENE-3171 BlockJoinQuery/Collector
https://issues.apache.org/jira/browse/LUCENE-1879 Parallel incremental indexing
https://issues.apache.org/jira/browse/SOLR-139 Support updateable/modifiable 
documents
https://issues.apache.org/jira/browse/SOLR-2272 Join


If it were easily possible to update one field in a document, I would just 
merge the two logical documents into one representing the whole ticket. But I 
can't see that this is possible yet.

SOLR-2272 seems to be the best solution for now but feels like a workaround: 
"I can't update a document field, so I split it up into static and dynamic 
content and join both at query time."

SOLR-2272 is committed to trunk/solr 4.
Are there any planned release dates for solr 4 or a possible backport for 
SOLR-2272 in 3.x?


I would appreciate any suggestions.

Regards,

Kai Gülzau







RE: Multiple indexes

2011-06-17 Thread Kai Gülzau
> > (for example if you need separate TFs for each document type).
> 
> I wonder if in this precise case it wouldn't be pertinent to 
> have a single index with the various document types each 
> having each their own fields set. Isn't TF calculated field by field ?

Oh, you are right :)
So I will start testing with one "mixed type" index and
perhaps use an IndexReaderFactory afterwards for comparison.

Thanks,

Kai Gülzau

RE: Multiple indexes

2011-06-16 Thread Kai Gülzau
Are there any plans to support a kind of federated search
in a future Solr version?

I think there are reasons to use separate indexes for each document type
but do combined searches on these indexes
(for example if you need separate TFs for each document type).

I am aware of http://wiki.apache.org/solr/DistributedSearch
and a workaround to do federated search with sharding
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
but this seems to be too much network and maintenance overhead.
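
For reference, the sharding workaround mentioned above boils down to a single distributed request; host and core names here are hypothetical:

```text
http://localhost:8983/solr/core0/select?q=body:answer&shards=localhost:8983/solr/core0,localhost:8983/solr/core1
```

Note that distributed search expects a compatible schema and globally unique keys across the shards, and each extra shard adds exactly the HTTP round trip this thread is trying to avoid.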

Perhaps it is worth a try to use an IndexReaderFactory which
returns a Lucene MultiReader!?
Is the IndexReaderFactory still experimental?
https://issues.apache.org/jira/browse/SOLR-1366
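
A custom factory along those lines might look roughly like this. It is only a sketch against the Lucene/Solr 3.x APIs and is untested; the path to the second index is a hypothetical placeholder, and reopen and error handling are omitted:

```java
// Sketch only: assumes the experimental Solr 3.x IndexReaderFactory
// extension point (SOLR-1366) and Lucene 3.x APIs.
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.solr.core.IndexReaderFactory;

public class MultiReaderFactory extends IndexReaderFactory {

  @Override
  public IndexReader newReader(Directory indexDir, boolean readOnly)
      throws IOException {
    // Reader over the core's own index, as the standard factory opens it.
    IndexReader main = IndexReader.open(indexDir, readOnly);

    // Hypothetical second index maintained outside this core.
    Directory otherDir = FSDirectory.open(new File("/path/to/other/index"));
    IndexReader other = IndexReader.open(otherDir, readOnly);

    // MultiReader exposes both indexes to Solr as one logical index.
    return new MultiReader(main, other);
  }
}
```

It would be registered via the indexReaderFactory element in solrconfig.xml; whether Solr's caching and commit handling behave correctly with a reader spanning a foreign index is exactly the open question of SOLR-1366.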


Regards,

Kai Gülzau

 

> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
> Sent: Wednesday, June 15, 2011 8:43 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple indexes
> 
> Next, however, I predict you're going to ask how you do a 'join' or 
> otherwise query across both these cores at once though. You can't do 
> that in Solr.
> 
> On 6/15/2011 1:00 PM, Frank Wesemann wrote:
> > You'll configure multiple cores:
> > http://wiki.apache.org/solr/CoreAdmin
> >> Hi.
> >>
> >> How to have multiple indexes in SOLR, with different fields and
> >> different types of data?
> >>
> >> Thank you very much!
> >> Bye.
> >
> >
> 

RE: Is there anything like MultiSearcher?

2011-06-15 Thread Kai Gülzau
Hi Roman,

do you have solved your problem and how?

Regards,

Kai Gülzau

 

> -Original Message-
> From: Roman Chyla [mailto:roman.ch...@gmail.com] 
> Sent: Saturday, February 05, 2011 4:50 PM
> To: solr-user@lucene.apache.org
> Subject: Is there anything like MultiSearcher?
> 
> Dear Solr experts,
> 
> Could you recommend some strategies or perhaps tell me if I approach
> my problem from a wrong side? I was hoping to use MultiSearcher to
> search across multiple indexes in Solr, but there is no such a thing
> and MultiSearcher was removed according to this post:
> http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html
> 
> I thought I had two use cases:
> 
> 1. maintenance - I wanted to build two separate indexes, one for
> fulltext and one for metadata (the docs have the unique ids) -
> indexing them separately would make things much simpler
> 2. ability to switch indexes at search time (ie. for testing purposes
> - one fulltext index could be built by Solr standard mechanism, the
> other by a rather different process - independent instance of lucene)
> 
> I think the recommended approach is to use the Distributed search - I
> found a nice solution here:
> http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
> - however it seems to me, that data are sent over HTTP (5M from one
> core, and 5M from the other core being merged by the 3rd solr core?)
> and I would like to do it only for local indexes and without the
> network overhead.
> 
> Could you please shed some light on whether there already exists an optimal
> solution to my use cases? And if not, whether I could just try to
> build a new SolrQuerySearcher that is extending lucene MultiSearcher
> instead of IndexSearch - or you think there are some deeply rooted
> problems there and the MultiSearcher cannot work inside Solr?
> 
> Thank you,
> 
>   Roman
> 
>