Re: Checking Optimal Values for BM25

2016-12-15 Thread Sascha Szott

Hi Furkan,

in order to change the BM25 parameter values k1 and b, the following XML 
snippet needs to be added in your schema.xml configuration file:



<similarity class="solr.BM25SimilarityFactory">
  <float name="k1">1.3</float>
  <float name="b">0.7</float>
</similarity>

It is even possible to specify the SimilarityFactory on individual index 
fields. See [1] for more details.


Best
Sascha

[1] https://wiki.apache.org/solr/SchemaXml#Similarity


Am 15.12.2016 um 14:58 schrieb Furkan KAMACI:

Hi,

Solr's default similarity is BM25 now. Its parameters are defined as

k1=1.2, b=0.75

by default. However, is there any way to check the effect of using
different coefficients in the BM25 calculation in order to find the optimal values?

Kind Regards,
Furkan KAMACI
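One way to get a feeling for the coefficients without reindexing is to evaluate BM25's term-frequency normalization directly, since it is a simple closed-form expression. A small sketch (mine, not from the thread; the parameter values swept are arbitrary):

```python
def bm25_tf_norm(tf, k1, b, field_len, avg_field_len):
    """BM25 term-frequency normalization:
    tf*(k1+1) / (tf + k1*(1 - b + b*fieldLen/avgFieldLen))."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * field_len / avg_field_len))

# Sweep a few parameter combinations for a term occurring once
# in a field of length 12, with an average field length of 9.
for k1 in (0.9, 1.2, 1.5):
    for b in (0.4, 0.75):
        print(f"k1={k1} b={b} tfNorm={bm25_tf_norm(1.0, k1, b, 12.0, 9.0):.4f}")
```

With tf=1 and fieldLen equal to avgFieldLen the normalization is exactly 1.0, which makes it easy to see how each parameter shifts scores for longer or shorter fields.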



Re: field length within BM25 score calculation in Solr 6.3

2016-12-15 Thread Sascha Szott

Hi,

bumping my question after 10 days. Any clarification is appreciated.

Best
Sascha



Hi folks,

my Solr index consists of one document with a single-valued field "title" of type 
"text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field 
type text_general uses a StandardTokenizer, which should result in 9 tokens. The corresponding 
length of field title in the given document is 9.

The field type is defined as follows:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of 
the document for the given query is 0.272.

But why does Solr 6.3 state that the length of field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
   0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
 0.2876821 = idf(docFreq=1, docCount=1)
 0.94664377 = tfNorm, computed from:
   1.0 = termFreq=1.0
   1.2 = parameter k1
   0.75 = parameter b
   9.0 = avgFieldLength
   10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
   0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
 0.18232156 = idf(docFreq=2, docCount=2)
 0.7757405 = tfNorm, computed from:
   1.0 = termFreq=1.0
   1.2 = parameter k1
   0.75 = parameter b
   6.0 = avgFieldLength
   10.24 = fieldLength

The value of fieldLength does not match 8.

Is there some "magic" applied to the value of the field length that goes beyond the 
standard BM25 scoring formula?

If so, what is the idea behind this modification? If not, is this a Lucene / 
Solr bug?

Best regards,
Sascha
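For what it's worth, 10.24 is exactly what falls out of Lucene's lossy one-byte norm encoding: at index time Lucene 6 stores 1/sqrt(fieldLength) as an 8-bit SmallFloat with 3 mantissa bits, and BM25Similarity's explain reports the decoded value 1/decoded². A sketch of that scheme (my reconstruction of SmallFloat.floatToByte315, offered as an explanation, not as the thread's answer):

```python
import math
import struct

def float_to_byte315(f):
    """8-bit lossy float: 3 mantissa bits, zero exponent 15 (Lucene SmallFloat style)."""
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> (24 - 3)
    fzero = (63 - 15) << 3
    if smallfloat <= fzero:
        return 0 if bits <= 0 else 1
    if smallfloat >= fzero + 0x100:
        return 255
    return smallfloat - fzero

def byte315_to_float(b):
    """Inverse of float_to_byte315 (lossy: many inputs share one byte)."""
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << (24 - 3)
    bits += (63 - 15) << 24
    return struct.unpack('>f', struct.pack('>i', bits))[0]

def decoded_field_length(num_terms):
    """fieldLength as the explain output would report it after the encode/decode round trip."""
    norm_byte = float_to_byte315(1.0 / math.sqrt(num_terms))
    inv = byte315_to_float(norm_byte)
    return 1.0 / (inv * inv)

# Lengths 8 and 9 collapse to the same encoded byte, hence both explain as 10.24.
print(decoded_field_length(8), decoded_field_length(9))
```

So the value is not extra "magic" in the scoring formula itself, but precision lost by compressing the length norm into a single byte per document.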







--
Sascha Szott :: KOBV/ZIB :: +49 30 84185-457


field length within BM25 score calculation in Solr 6.3

2016-12-04 Thread Sascha Szott
Hi folks,

my Solr index consists of one document with a single-valued field "title" of 
type "text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 
8 9. The field type text_general uses a StandardTokenizer, which should result 
in 9 tokens. The corresponding length of field title in the given document is 9.

The field type is defined as follows:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of 
the document for the given query is 0.272. 

But why does Solr 6.3 state that the length of field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
  0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
0.2876821 = idf(docFreq=1, docCount=1)
0.94664377 = tfNorm, computed from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  0.75 = parameter b
  9.0 = avgFieldLength
  10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
  0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
0.18232156 = idf(docFreq=2, docCount=2)
0.7757405 = tfNorm, computed from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  0.75 = parameter b
  6.0 = avgFieldLength
  10.24 = fieldLength

The value of fieldLength does not match 8.

Is there some "magic" applied to the value of the field length that goes beyond the 
standard BM25 scoring formula?

If so, what is the idea behind this modification? If not, is this a Lucene / 
Solr bug?

Best regards,
Sascha






Re: Problem of facet on 170M documents

2013-11-02 Thread Sascha SZOTT
Hi Ming,

which Solr version are you using? In case you use one of the latest
versions (4.5 or above), try the new parameter facet.threads with a
reasonable value (4 to 8 gave me a massive performance speedup when
working with large facets, i.e., nTerms > 10^7).

-Sascha
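For reference, such a request could be assembled like this (host and collection name are hypothetical; a sketch, not from the thread):

```python
from urllib.parse import urlencode

# Build a faceting request that enables parallel facet counting.
params = {
    "q": "*:*",
    "fq": "source:Video",
    "facet": "true",
    "facet.field": "url",
    "facet.limit": 500,
    "facet.threads": 8,   # parallel facet execution (Solr 4.5+)
    "rows": 0,            # only the facet counts are needed, not documents
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

Setting rows=0 also helps here, since the expensive part of the request is the facet computation, not document retrieval.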


Mingfeng Yang wrote:
 I have an index with 170M documents, and two of the fields for each
 doc are source and url.  And I want to know the top 500 most
 frequent urls from the Video source.
 
 So I did a facet with
 fq=source:Video&facet=true&facet.field=url&facet.limit=500, and
 the matching documents are about 9 million.
 
 The solr cluster is hosted on two ec2 instances each with 4 cpu, and
 32G memory. 16G is allocated tfor java heap.  4 master shards on one
 machine, and 4 replica on another machine. Connected together via
 zookeeper.
 
 Whenever I issue the query above, the response just takes too long
 and the client gets timed out. Sometimes, when the end user is
 impatient, he/she may wait a few seconds for the results, then
 kill the connection and issue the same query again and
 again.  The server then has to deal with multiple such heavy
 queries simultaneously and becomes so busy that we get a "no server
 hosting shard" error, probably due to lost communication between the Solr
 node and ZooKeeper.
 
 Is there any way to deal with such problem?
 
 Thanks, Ming
 


intersection of filter queries with raw query parser

2013-05-31 Thread Sascha Szott

Hi folks,

is it possible to use the raw query parser with a disjunctive filter 
query? Say, I have a field 'foo' and two values 'v1' and 'v2' (the field 
values are free text and can contain any character). What I want is to 
retrieve all documents satisfying fq=foo:(v1 OR v2). In case only one 
value (v1) is given, the query fq={!raw f=foo}v1 works as expected. But 
how can I formulate the filter query (with the raw query parser) in case 
two values are provided?


The same question was posted on Stackoverflow 
(http://stackoverflow.com/questions/5637675/solr-query-with-raw-data-and-union-multiple-facet-values) 
two years ago. But there was only the advice to give up using the raw 
query parser which is not what I want to do.


Thanks in advance,
Sascha
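One workaround (my suggestion, not from the thread) keeps the no-analysis behavior by switching from the raw parser to the closely related term query parser and combining the clauses through the default lucene parser's _query_ hook, with each free-text value passed indirectly as a request parameter so it needs no escaping:

```python
from urllib.parse import urlencode

def disjunctive_term_fq(field, values):
    """Build an fq that ORs exact (unanalyzed) term matches: one {!term}
    clause per value, each value referenced via $param indirection."""
    clauses = []
    refs = {}
    for i, v in enumerate(values):
        ref = f"fqv{i}"
        clauses.append(f'_query_:"{{!term f={field} v=${ref}}}"')
        refs[ref] = v
    return " OR ".join(clauses), refs

fq, refs = disjunctive_term_fq("foo", ["v1", "some free text"])
params = {"q": "*:*", "fq": fq, **refs}
print(urlencode(params))
```

Note that {!term} matches the indexed term exactly, like {!raw}, so this only behaves identically for fields whose index analysis leaves values unchanged.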


Re: Does SolrCloud support distributed IDFs?

2012-10-22 Thread Sascha SZOTT

Hi Mark,

Mark Miller wrote:

Still waiting on that issue. I think Andrzej should just update it to
trunk and commit - it's optional and defaults to off. Go vote :)
Sounds like the problem is already solved and the remaining work 
consists of code integration? Can somebody estimate how much work that 
would be?


-Sascha


Does SolrCloud support distributed IDFs?

2012-10-21 Thread Sascha Szott
Hi folks,

a known limitation of the old distributed search feature is the lack of 
distributed/global IDFs (#SOLR-1632). Does SolrCloud bring some improvements in 
this direction?

Best regards,
Sascha


Re: Prefix query is not analysed?

2012-07-02 Thread Sascha Szott
Hi,

wildcard and fuzzy queries are not analyzed.

-Sascha



Alok Bhandari alokomprakashbhand...@gmail.com schrieb:

Hello ,

I am pushing Chuck Follett'.?.? in solr and when I query for this field
with query string field:Follett'.* I am getting 0 results.

field type declared is

<fieldType name="text_email" class="solr.TextField" stored="true"
indexed="true" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"
maxTokenLength="255"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

and parser we are using is EdisMax .

Is it the case that for a prefix query the text analysis is not done, so I am
getting 0 results, or is there something fundamentally wrong with my
data/schema?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Prefix-query-is-not-analysed-tp3992435.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Prefix query is not analysed?

2012-07-02 Thread Sascha Szott
Hi,

I suppose you are using Solr 3.6. Then take a look at

http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

-Sascha
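For Solr 3.6+, the approach described in that blog post boils down to declaring an explicit multiterm analyzer chain in the field type, which controls what is applied to wildcard, prefix, and fuzzy terms. A sketch (the filter choices here are illustrative, not from the thread):

```xml
<fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- applied to wildcard/prefix/fuzzy terms instead of the full index analyzer -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Without an explicit multiterm analyzer, Solr 3.6+ assembles one automatically from the "multiterm-aware" components of the query analyzer, which is why lowercasing is applied while tokenization is not.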



Alok Bhandari alokomprakashbhand...@gmail.com schrieb:

Thanks for reply.

If I check the debug query through solr-admin I can see that the lower case
filter is applied and 

"rawquerystring":"em_to_name:Follett'.*",
"querystring":"em_to_name:Follett'.*",
"parsedquery":"+em_to_name:follett'.*",
"parsedquery_toString":"+em_to_name:follett'.*",
"explain":{},
"QParser":"ExtendedDismaxQParser",


I can see this query. So is it the case that only tokenization is not done
for wildcard queries, but the other specified filters are applied?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Prefix-query-is-not-analysed-tp3992435p3992450.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexing documents in Apache Solr using php-curl library

2012-07-02 Thread Sascha SZOTT
Hi,

perhaps it's better to use a PHP Solr client library. I used

   https://code.google.com/p/solr-php-client/

in a project of mine and it worked just fine.

-Sascha

Asif wrote:
 I am indexing the file using the PHP curl library. I am stuck here with the code:

 echo "Stored in: " . "upload/" . $_FILES["file"]["name"];
 $result = move_uploaded_file($_FILES["file"]["tmp_name"], "upload/" .
 $_FILES["file"]["name"]);
 if ($result == 1) echo "<p>Upload done.</p>";
 $options = getopt("f:");
 $infile = $options['f'];

 $url = "http://localhost:8983/solr/update/";
 $filename = "upload/" . $_FILES["file"]["name"];
 $handle = fopen($filename, "rb");
 $contents = fread($handle, filesize($filename));
 fclose($handle);
 echo $url;
 $post_string = file_get_contents("upload/" .
 $_FILES["file"]["name"]);
 echo $contents;
 $header = array("Content-type:text/xml; charset=utf-8");

 $ch = curl_init();

 curl_setopt($ch, CURLOPT_URL, $url);
 curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 curl_setopt($ch, CURLOPT_POST, 1);
 curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
 curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
 curl_setopt($ch, CURLINFO_HEADER_OUT, 1);

 $data = curl_exec($ch);

 if (curl_errno($ch)) {
    print "curl_error: " . curl_error($ch);
 } else {
    curl_close($ch);
    print "curl exited okay\n";
    echo "Data returned...\n";
    echo "\n";
    echo $data;
    echo "\n";
 }

 Nothing is shown as a result. Moreover, there is nothing in the event
 log of Apache Solr. Please help me with the code.
 



Re: how to retrieve a doc from its docID ?

2012-06-30 Thread Sascha Szott
Hi,

did you include the fl parameter in the Solr query URL? If that's the case, make 
sure that the field name 'text' is mentioned there. You should also make sure 
that the field definition (in schema.xml) for 'text' says stored="true", 
otherwise the field will not be returned.

-Sascha
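A request along those lines could look like this (host and query are hypothetical; a sketch, not from the thread):

```python
from urllib.parse import urlencode

# Ask Solr to return the stored 'text' field (plus id and score) for matches.
params = {
    "q": "text:solar",
    "fl": "id,text,score",
    "wt": "xml",
}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)
```

Without fl, Solr returns its default field list, which is why only id and score show up in the response below.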



Giovanni Gherdovich g.gherdov...@gmail.com schrieb:

Hi all,

when querying my solr instance, the answers I get
are the document IDs of my docs. Here is how one of my docs
looks like:

-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- --
<add>
<doc>
<field name="text">hello solar!</field>
<field name="id">123</field>
</doc>
</add>
-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- --

here is the response if I query for solar :

-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- --
<response>
<lst name="responseHeader"/>
<result name="response" numFound="1" start="0" maxScore="1.0">
<doc><float name="score">1.0</float>
<str name="id">123</str></doc>
</result>
</response>
-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- --

That is, Solr gives me the doc ID. How do I retrieve the doc's field "text"
given its id?

cheers,
Giovanni



Re: querying thru solritas gives me zero results

2012-06-30 Thread Sascha Szott
Hi,

Solritas uses the dismax query parser. The dismax config parameter 'qf' 
specifies the index fields to be searched in. Make sure that 'name' is your 
default search field.

-Sascha




Giovanni Gherdovich g.gherdov...@gmail.com schrieb:

Hi all,

this morning I was very proud of myself since I managed
to set up solritas ( http://wiki.apache.org/solr/VelocityResponseWriter )
for the solr instance on my server (ubuntu natty).

This joy lasted only half a minute, since the only query
that gets more than zero results with solritas is the catchall *:*

for example:
http://my.server.com:8080/solr/select/?q=foobar has thousands of results,
http://my.server.com:8080/solr/itas?q=foobar has none

Here the standard and velocity request handlers from my solrconfig.xml;

-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
</requestHandler>
-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8

-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8
<queryResponseWriter name="velocity"
class="org.apache.solr.request.VelocityResponseWriter"/>
<requestHandler name="/itas" class="solr.SearchHandler">
<lst name="defaults">
<str name="wt">velocity</str>
<str name="v.template">browse</str>
<str name="title">Solr cookbook example</str>
<str name="defType">dismax</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
<str name="qf">name</str>
</lst>
</requestHandler>
-- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8 -- -- 8

any hint on how I can debug that?

cheers,
Giovanni



Re: Searching for digits with strings

2012-06-27 Thread Sascha Szott
Hi,

as far as I know, Solr does not provide such a feature out of the box. If you 
cannot make any assumptions about the numbers, choose an appropriate library 
that can transform between numerical and non-numerical representations and 
populate the search field with both versions at index time.

-Sascha
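A minimal index-time sketch of that idea, using a tiny hand-rolled mapping for the digits 0-9 (a real setup would use a proper number-to-words library; all names here are illustrative, not from the thread):

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
WORD_DIGITS = {w: d for d, w in DIGIT_WORDS.items()}

def expand_tokens(text):
    """Emit each token plus its alternate representation, so that a
    search for 'two' also matches documents containing '2' and vice versa."""
    out = []
    for tok in text.lower().split():
        out.append(tok)
        if tok in DIGIT_WORDS:
            out.append(DIGIT_WORDS[tok])
        elif tok in WORD_DIGITS:
            out.append(WORD_DIGITS[tok])
    return out

print(expand_tokens("chapter 2 two"))
```

The same mapping could equally be wired into Solr as a synonym file, which is the limited-range approach Upayavira suggests below.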

Alireza Salimi alireza.sal...@gmail.com schrieb:

Hi,

Well that's the only solution I got so far and it would work for most of
the cases,
but l thought there might be some better solutions.

Thanks

On Wed, Jun 27, 2012 at 5:49 PM, Upayavira u...@odoko.co.uk wrote:

 How many numbers? 0-9? Or every number under the sun?

 You could achieve a limited number by using synonyms: '0' is a synonym for
 'nought' and 'zero', etc.

 Upayavira

 On Wed, Jun 27, 2012, at 05:22 PM, Alireza Salimi wrote:
  Hi,
 
  I was wondering if there's a built-in solution in Solr so that you can
  search for documents with digits by their string representations,
  i.e. a search for 'two' would match fields which have a '2' token and
  vice versa.
 
  Thanks





Re: getting started

2011-06-16 Thread Sascha SZOTT

Hi Mari,

it depends ...

* How many records are stored in your MySQL databases?
* How often will updates occur?
* How many db records / index documents are changed per update?

I would suggest starting with a single Solr core first. That way, you can 
concentrate on the basics and do not need to deal with more advanced 
things like sharding. In case you encounter performance issues later on, 
you can switch to a multi-core setup.


-Sascha

Mari Masuda wrote:

Hello,

I am new to Solr and am in the beginning planning stage of a large project and 
could use some advice so as not to make a huge design blunder that I will 
regret down the road.

Currently I have about 10 MySQL databases that store information about 
different archival collections.  For example, we have data and metadata about a 
political poster collection, a television program, documents and photographs of 
and about a famous author, etc.  My job is to work with the staff archivists to 
come up with a standard metadata template so the 10 databases can be 
consolidated into one.

Currently the info in these databases is accessed through 10 different sets of 
PHP pages that were written a long time ago for PHP 4.  My plan is to write a 
new Java application that will handle both public display of the info as well 
as an administrative interface so that staff members can add or edit the 
records.

I have decided to use Solr as the search mechanism for this project.  Because the info in each of 
our 10 collections is slightly different (e.g., a record about a poster does not contain duration 
information, but a record about a TV show does) I was thinking it would be good to separate each 
collection's index into a separate Solr core so that commits coming from one collection do not bog 
down the other unrelated collections.  One reservation I have is that eventually we would like to 
be able to type in "Iraq" and find records across all of the collections at once instead 
of having to search each collection separately.  Although I don't know anything about it at this 
stage, I did Google "sharding" after reading someone's recent post on this list and it 
sounds like that may be a potential answer to my question.  Does anyone have any advice on how I 
should initially set up Solr for my situation?  I am slowly making my way through the wiki and 
RTFMing, but I wanted to see what the experts have to say because at this point I don't really 
know where to start.
start.


Thank you very much,
Mari


Re: Solr coding

2011-03-23 Thread Sascha Szott

Hi,

depending on your needs, take a look at Apache ManifoldCF. It adds 
document-level security on top of Solr.


-Sascha

On 23.03.2011 14:20, satya swaroop wrote:

Hi All,
   As for my project requirement I need to keep privacy for search of
files, so I need to modify the code of Solr.

For example, if there are 5 users and each user indexes some files as
   user1 -  java1, c1,sap1
   user2 -  java2, c2,sap2
   user3 -  java3, c3,sap3
   user4 -  java4, c4,sap4
   user5 -  java5, c5,sap5

and if user2 searches for the keyword java then it should display
only the file java2 and not the other files

so in order to keep this filtering inside Solr itself, may I know where to
modify the code... I will access a database to check the user's indexed files
and then filter the result... I don't have any cores... I indexed all files
in a single index...

Regards,
satya
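If modifying Solr itself can be avoided: a common pattern (my suggestion, not from the thread) is to index an owner field with each file and have the search frontend append a filter query for the current user, so the restriction is added server-side and never comes from user-controlled input:

```python
from urllib.parse import urlencode

def user_search_params(query, user):
    """Restrict results to documents indexed by 'user' via a filter query.
    Assumes each document was indexed with a hypothetical 'owner' field."""
    return {
        "q": query,
        "fq": f"owner:{user}",  # appended by the application, not by the end user
    }

params = user_search_params("java", "user2")
print(urlencode(params))
```

Filter queries are also cached independently of the main query, so the per-user restriction is cheap across repeated searches.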



Re: Search failing for matched text in large field

2011-03-23 Thread Sascha Szott

Hi Paul,

did you increase the value of the maxFieldLength parameter in your 
solrconfig.xml?


-Sascha
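For reference, in Solr 1.4 this parameter lives in the index settings of solrconfig.xml; the value below is just an illustrative maximum, not a recommendation:

```xml
<!-- maximum number of tokens indexed per field; tokens beyond this limit are silently dropped -->
<maxFieldLength>2147483647</maxFieldLength>
```

The default is low enough (10000) that phrases deep inside a large field are simply never indexed, which matches the symptom Paul describes below.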

On 23.03.2011 17:05, Paul wrote:

I'm using solr 1.4.1.

I have a document that has a pretty big field. If I search for a
phrase that occurs near the start of that field, it works fine. If I
search for a phrase that appears even a little ways into the field, it
doesn't find it. Is there some limit to how far into a field solr will
search?

Here's the way I'm doing the search. All I'm changing is the text I'm
searching on to make it succeed or fail:

http://localhost:8983/solr/my_core/select/?q=%22search+phrase%22hl=onhl.fl=text

Or, if it is not related to how large the document is, what else could
it possibly be related to? Could there be some character in that field
that is stopping the search?


Re: Search failing for matched text in large field

2011-03-23 Thread Sascha Szott

On 23.03.2011 18:52, Paul wrote:

I increased maxFieldLength and reindexed a small number of documents.
That worked -- I got the correct results. In 3 minutes!

Did you mark the field in question as stored = false?

-Sascha



I assume that if I reindex all my documents that all searches will
become even slower. Is there any way to get all the results in a way
that is quick enough that my user won't get bored waiting? Is there
some optimization of this coming in solr 3.0?

On Wed, Mar 23, 2011 at 12:15 PM, Sascha Szottsz...@zib.de  wrote:

Hi Paul,

did you increase the value of the maxFieldLength parameter in your
solrconfig.xml?

-Sascha

On 23.03.2011 17:05, Paul wrote:


I'm using solr 1.4.1.

I have a document that has a pretty big field. If I search for a
phrase that occurs near the start of that field, it works fine. If I
search for a phrase that appears even a little ways into the field, it
doesn't find it. Is there some limit to how far into a field solr will
search?

Here's the way I'm doing the search. All I'm changing is the text I'm
searching on to make it succeed or fail:


http://localhost:8983/solr/my_core/select/?q=%22search+phrase%22hl=onhl.fl=text

Or, if it is not related to how large the document is, what else could
it possibly be related to? Could there be some character in that field
that is stopping the search?




Re: Index MS office

2011-02-02 Thread Sascha Szott

Hi,

have a look at Solr's ExtractingRequestHandler:

http://wiki.apache.org/solr/ExtractingRequestHandler

-Sascha

On 02.02.2011 16:49, Thumuluri, Sai wrote:

Good Morning,

  I am planning to get started on indexing MS Office documents using Apache Solr -
can someone please direct me where I should start?

Thanks,
Sai Thumuluri


Re: Malformed XML with exotic characters

2011-02-01 Thread Sascha Szott

Hi folks,

I've made the same observation when working with Solr's 
ExtractingRequestHandler on the command line (no browser interaction).


When issuing the following curl command

curl 
'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&wt=xml&resource.name=foo.pdf' 
--data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' > foo.xml


Solr's XML response writer returns malformed xml, e.g., xmllint gives me:

foo.xml:21: parser error : Char 0xD835 out of allowed range
foo.xml:21: parser error : PCDATA invalid Char value 55349

I'm not totally sure if this is a Tika/PDFBox issue. However, I would 
expect in every case that the XML output produced by Solr is well-formed, 
even if the libraries used under the hood return garbage.



-Sascha

p.s. I can provide the pdf file in question, if anybody would like to 
see it in action.
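The parser errors point at a lone UTF-16 surrogate: 0xD835 is a high surrogate (the lead unit of the pairs used for characters like the mathematical alphanumeric symbols common in PDFs), and the surrogate block is excluded from XML 1.0 entirely. A quick check of both facts (my aside, not from the thread):

```python
def is_valid_xml10_char(cp):
    """Code points allowed by the XML 1.0 'Char' production;
    note the gap D800-DFFF, the UTF-16 surrogate block."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

cp = 0xD835
print(hex(cp),
      "high surrogate:", 0xD800 <= cp <= 0xDBFF,
      "valid XML char:", is_valid_xml10_char(cp))
```

So if the extracted text splits a surrogate pair (e.g. by truncation), any XML serialization of it is necessarily malformed, whereas a JSON writer may pass the lone surrogate through unnoticed.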



On 01.02.2011 16:43, Markus Jelsma wrote:

There is an issue with the XML response writer. It cannot cope with some very
exotic characters or possibly the right-to-left writing systems. The issue can
be reproduced by indexing the content of the home page of wikipedia as it
contains a lot of exotic matter. The problem does not affect the JSON response
writer.

The problem is, i am unsure whether this is a bug in Solr or that perhaps
Firefox itself trips over.


Here's the output of the JSONResponeWriter for a query returning the home
page:
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "fl":"url,content",
      "indent":"true",
      "wt":"json",
      "q":"*:*",
      "rows":"1"}},
  "response":{"numFound":6744,"start":0,"docs":[
    {
      "url":"http://www.wikipedia.org/",
      "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles 
日
本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie libre
1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей Italiano
L’enciclopedia libera 768 000+ voci Português A enciclopédia livre 669 000+
artigos Polski Wolna encyklopedia 769 000+ haseł Nederlands De vrije
encyclopedie 668 000+ artikelen Search  • Suchen  • Rechercher  • Szukaj  •
Ricerca  • 検索  • Buscar  • Busca  • Zoeken  • Поиск  • Sök  • 搜尋  • Cerca  •
Søk  • Haku  • Пошук  • Hledání  • Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara
• Cari  • Søg  • بحث  • Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو
• חיפוש  • Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky
Dansk Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk (bokmål)
Polski Português Română Русский Slovenčina Slovenščina Српски / Srpski Suomi
Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文   100 000+   العربية
• Български  • Català  • Česky  • Dansk  • Deutsch  • English  • Español  •
Esperanto  • فارسی  • Français  • 한국어  • Bahasa Indonesia  • Italiano  • עברית
• Lietuvių  • Magyar  • Bahasa Melayu  • Nederlands  • 日本語  • Norsk (bokmål)
• Polski  • Português  • Русский  • Română  • Slovenčina  • Slovenščina  •
Српски / Srpski  • Suomi  • Svenska  • Türkçe  • Українська  • Tiếng Việt  •
Volapük  • Winaray  • 中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  •
Asturianu  • Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • 
Беларуская
( Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  • 
Brezhoneg  • Чăваш
• Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  • Gaeilge  • Galego  •
ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  • Íslenska  • Basa Jawa  • 
ಕನ್ನಡ  •
ქართული  • Kurdî / كوردی  • Latina  • Latviešu  • Lëtzebuergesch  • Lumbaart
• Македонски  • മലയാളം  • मराठी  • नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • 
Nnapulitano
• Occitan  • Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی
پنجابی  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  •
Srpskohrvatski / Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ்
• తెలుగు  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa
Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru  •
Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  • Башҡорт  • भोजपुरी  • Bikol
Central  • Boarisch  • བོད་ཡིག  • Chavacano de Zamboanga  • Corsu  • Deitsch  •
ދިވެހި  • Diné Bizaad  • Eald Englisc  • Emigliàn–Rumagnòl  • Эрзянь  • 
Estremeñu
• Fiji Hindi  • Føroyskt  • Furlan  • Gaelg  • Gàidhlig  • 贛語  • گیلکی  • Hak-
kâ-fa / 客家話  • Хальмг  • ʻŌlelo Hawaiʻi  • Hornjoserbsce  • Ilokano  •
Interlingua  • Interlingue  • Ирон Æвзаг  • Kapampangan  • Kaszëbsczi  •
Kernewek  • ភាសាខ្មែរ  • Kinyarwanda  • Коми  • Кыргызча  • Ladino / לאדינו  •
Ligure  • Limburgs  • Lingála  • lojban  • Malagasy  • Malti  • 文言  • Māori  •
مصرى  • مازِرونی / Mäzeruni  • Монгол  • မြန်မာဘာသာ  • Nāhuatlahtōlli  •
Nedersaksisch  • Nouormand  • Novial  • Нохчийн  • Олык Марий  • O‘zbek  • पाऴि
• Pangasinán  • ਪੰਜਾਬੀ / پنجابی  • Papiamentu  • پښتو  • Picard  • 

Re: Malformed XML with exotic characters

2011-02-01 Thread Sascha Szott

Hi Markus,

in my case the JSON response writer returns valid JSON. The same holds 
for the PHP response writer.


-Sascha

On 01.02.2011 18:44, Markus Jelsma wrote:

You can exclude the input's involvement by checking if other response writers
do work. For me, the JSONResponseWriter works perfectly with the same returned
data in some AJAX environment.

On Tuesday 01 February 2011 18:29:06 Sascha Szott wrote:

Hi folks,

I've made the same observation when working with Solr's
ExtractingRequestHandler on the command line (no browser interaction).

When issuing the following curl command

curl
'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&wt=xml&resource.name=foo.pdf'
--data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' > foo.xml

Solr's XML response writer returns malformed xml, e.g., xmllint gives me:

foo.xml:21: parser error : Char 0xD835 out of allowed range
foo.xml:21: parser error : PCDATA invalid Char value 55349

I'm not totally sure, if this is an Tika/PDFBox issue. However, I would
expect in every case that the XML output produced by Solr is well-formed
even if the libraries used under the hood return garbage.


-Sascha

p.s. I can provide the pdf file in question, if anybody would like to
see it in action.

On 01.02.2011 16:43, Markus Jelsma wrote:

There is an issue with the XML response writer. It cannot cope with some
very exotic characters or possibly the right-to-left writing systems.
The issue can be reproduced by indexing the content of the home page of
wikipedia as it contains a lot of exotic matter. The problem does not
affect the JSON response writer.

The problem is, i am unsure whether this is a bug in Solr or that perhaps
Firefox itself trips over.


Here's the output of the JSONResponeWriter for a query returning the home
page:
{

   responseHeader:{

status:0,
QTime:1,
params:{

fl:url,content,
indent:true,
wt:json,
q:*:*,
rows:1}},

   response:{numFound:6744,start:0,docs:[

{

 url:http://www.wikipedia.org/;,
 content:Wikipedia English The Free Encyclopedia 3 543 000+ articles
 日

本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie
libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей
Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia
livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł
Nederlands De vrije encyclopedie 668 000+ artikelen Search  • Suchen  •
Rechercher  • Szukaj  • Ricerca  • 検索  • Buscar  • Busca  • Zoeken  •
Поиск  • Sök  • 搜尋  • Cerca  • Søk  • Haku  • Пошук  • Hledání  •
Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara • Cari  • Søg  • بحث  •
Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو • חיפוש  •
Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky Dansk
Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk
(bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски /
Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文
100 000+   العربية • Български  • Català  • Česky  • Dansk  • Deutsch  •
English  • Español  • Esperanto  • فارسی  • Français  • 한국어  • Bahasa
Indonesia  • Italiano  • עברית • Lietuvių  • Magyar  • Bahasa Melayu  •
Nederlands  • 日本語  • Norsk (bokmål) • Polski  • Português  • Русский  •
Română  • Slovenčina  • Slovenščina  • Српски / Srpski  • Suomi  •
Svenska  • Türkçe  • Українська  • Tiếng Việt  • Volapük  • Winaray  •
中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  • Asturianu  •
Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • Беларуская (
Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  •
Brezhoneg  • Чăваш • Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  •
Gaeilge  • Galego  • ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  •
Íslenska  • Basa Jawa  • ಕನ್ನಡ  • ქართული  • Kurdî / كوردی  • Latina  •
Latviešu  • Lëtzebuergesch  • Lumbaart • Македонски  • മലയാളം  • मराठी
• नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • Nnapulitano • Occitan  •
Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی پنجابی
  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  • Srpskohrvatski
/ Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ் • తెలుగు
  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa
Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru
  • Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  • Башҡорт  • भोजपुरी
• Bikol Central  • Boarisch  • བོད་ཡིག  • Chavacano de Zamboanga  •
Corsu  • Deitsch  • ދިވެހި  • Diné Bizaad  • Eald Englisc  •
Emigliàn–Rumagnòl  • Эрзянь  • Estremeñu • Fiji Hindi  • Føroyskt  •
Furlan  • Gaelg  • Gàidhlig  • 贛語  • گیلکی  • Hak- kâ-fa / 客家話  • Хальмг
  • ʻŌlelo Hawaiʻi  • Hornjoserbsce  • Ilokano

missing type check when working with pint field type

2011-01-18 Thread Sascha Szott

Hi folks,

I've noticed an unexpected behavior while working with the various 
built-in integer field types (int, tint, pint). It seems as if the first 
two are subject to type checking, while the latter is not.


I'll give you an example based on the example schema that is shipped 
with Solr. When trying to index the document


<doc>
  <field name="id">1</field>
  <field name="foo_i">invalid_value</field>
  <field name="foo_ti">1</field>
  <field name="foo_pi">1</field>
</doc>

Solr responds with a NumberFormatException (the same holds when setting 
the value of foo_ti to invalid_value):


java.lang.NumberFormatException: For input string: invalid_value

Surprisingly, an attempt to index the document

<doc>
  <field name="id">1</field>
  <field name="foo_i">1</field>
  <field name="foo_ti">1</field>
  <field name="foo_pi">invalid_value</field>
</doc>

is successful. In the end, sorting on foo_pi leads to an exception, 
e.g., http://localhost:8983/solr/select?q=*:*&sort=foo_pi desc


raises an HTTP 500 error:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
	at java.lang.String.charAt(String.java:686)
	at org.apache.lucene.search.FieldCache$7.parseInt(FieldCache.java:234)
	at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:457)
	at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
	at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
	at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:447)
	at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
	at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
	at org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(FieldComparator.java:332)
	at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:94)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:249)
	at org.apache.lucene.search.Searcher.search(Searcher.java:171)
	at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
	at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
	at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
	at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)

[...]


Is this a bug or did I miss something?

-Sascha


Re: missing type check when working with pint field type

2011-01-18 Thread Sascha Szott

Hi Erick,

I see the point. But what is pint (plong, pfloat, pdouble) actually 
intended for (sorting is not possible, no type checking is performed)? 
It seems to me as if it is something very similar to the string type 
(both store and index the value verbatim).


-Sascha

On 18.01.2011 14:38, Erick Erickson wrote:

I suspect you missed this comment in the schema file:
***
Plain numeric field types that store and index the text
   value verbatim (and hence don't support range queries, since the
   lexicographic ordering isn't equal to the numeric ordering)
***

So what's happening is that the field is being indexed as a text type and, I
suspect, being tokenized. The error you're getting comes from trying to sort
against a tokenized field, which is undefined. At least that's my story and
I'm sticking to it...

Best
Erick
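Erick's point about lexicographic versus numeric ordering is easy to demonstrate; a quick sketch (illustrative values only, not Solr code):

```python
# A plain (pint-style) field indexes the raw text, so comparisons are
# lexicographic; a trie-based (tint-style) field orders numerically.
values = ["2", "10", "9", "100"]

lexicographic = sorted(values)        # string comparison, "1..." < "2" < "9"
numeric = sorted(values, key=int)     # what a proper numeric field gives you

print(lexicographic)  # ['10', '100', '2', '9']
print(numeric)        # ['2', '9', '10', '100']
```

This is why range queries and sorts on the plain types give surprising results even when every value happens to parse as a number.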

On Tue, Jan 18, 2011 at 8:10 AM, Sascha Szott <sz...@zib.de> wrote:


Hi folks,

I've noticed an unexpected behavior while working with the various built-in
integer field types (int, tint, pint). It seems as the first two ones are
subject to type checking, while the latter one is not.

I'll give you an example based on the example schema that is shipped out
with Solr. When trying to index the document

<doc>
  <field name="id">1</field>
  <field name="foo_i">invalid_value</field>
  <field name="foo_ti">1</field>
  <field name="foo_pi">1</field>
</doc>

Solr responds with a NumberFormatException (the same holds when setting the
value of foo_ti to invalid_value):

java.lang.NumberFormatException: For input string: invalid_value

Surprisingly, an attempt to index the document

<doc>
  <field name="id">1</field>
  <field name="foo_i">1</field>
  <field name="foo_ti">1</field>
  <field name="foo_pi">invalid_value</field>
</doc>

is successful. In the end, sorting on foo_pi leads to an exception, e.g.,
http://localhost:8983/solr/select?q=*:*&sort=foo_pi desc

raises an HTTP 500 error:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
[...]


Is this a bug or did I miss something?

-Sascha





--
Sascha Szott :: KOBV/ZIB :: sz...@zib.de :: +49 30 84185-457


Re: post search using solrj

2010-12-30 Thread Sascha SZOTT

Hi Don,

you could give the HTTP method to be used as a second argument to the 
QueryRequest constructor:


[http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/request/QueryRequest.html#QueryRequest(org.apache.solr.common.params.SolrParams,%20org.apache.solr.client.solrj.SolrRequest.METHOD)]

-Sascha
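In SolrJ the fix is passing METHOD.POST as that second constructor argument; the reason it helps can be sketched language-agnostically (parameter names here are hypothetical): with POST the parameters travel in the request body instead of the URL, so the container's URI length limit no longer applies.

```python
from urllib.parse import urlencode

# A query with very many parameters -- as a GET URL this can exceed the
# servlet container's URI length limit; as a POST body it cannot.
params = [("q", "*:*")] + [("fq", f"id:{i}") for i in range(1000)]

body = urlencode(params)  # this string becomes the POST body
print(len(body))          # several thousand characters
```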


Don Hill wrote:

Hi. I am using solrj and it has been working fine. I now have a requirement
to add more parameters, so many that I get a max URI exceeded error. Is
there any way, using SolrQuery, to do an HTTP POST so I don't have these issues?

don



DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-11 Thread Sascha Szott

Hi folks,

why does FileListEntityProcessor ignore onError="continue" and abort 
indexing if a directory or a file does not exist?


I'm using both XPathEntityProcessor and FileListEntityProcessor with 
onError set to continue. In case a directory or file is not present an 
Exception is thrown and indexing is stopped immediately.


Below you can find a stack trace that is generated in case the directory 
/home/doe/foo does not exist:


SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
	at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)


How should I configure both processors so that missing directories and 
files are ignored and the indexing process does not stop immediately?


Best,
Sascha


Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-11 Thread Sascha Szott

Sorry, there was a mistake in the stack trace. The correct one is:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo is not a directory Processing Document # 3
	at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)



-Sascha

On 11.08.2010 15:18, Sascha Szott wrote:

Hi folks,

why does FileListEntityProcessor ignores onError=continue and abort
indexing if a directory or a file does not exist?

I'm using both XPathEntityProcessor and FileListEntityProcessor with
onError set to continue. In case a directory or file is not present an
Exception is thrown and indexing is stopped immediately.

Below you can find a stack trace that is generated in case the directory
/home/doe/foo does not exist:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
[...]


How should I configure both processors so that missing directories and
files are ignored and the indexing process does not stop immediately?

Best,
Sascha


Re: problem with formulating a negative query

2010-07-06 Thread Sascha Szott

Hi,

Chris Hostetter wrote:

AND, OR, and NOT are just syntactic-sugar for modifying
the MUST, MUST_NOT, and SHOULD.  The default op of OR only affects the
first clause of your query (R) because it doesn't have any modifiers --

Thanks for pointing that out!

-Sascha


the second clause has that NOT modifier so your query is effectivley...

topic:R -topic:[* TO *]

...which by definition can't match anything.

-Hoss
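Hoss's explanation can be mimicked with plain set arithmetic (a toy model, not Lucene internals): a MUST_NOT clause only removes documents from what the positive clauses matched, so a standalone negative clause has nothing to subtract from.

```python
docs = {1: {"topic": "R"}, 2: {"topic": "S"}, 3: {}}  # doc 3 has no topic field

has_topic = {d for d, f in docs.items() if "topic" in f}           # topic:[* TO *]
topic_r   = {d for d, f in docs.items() if f.get("topic") == "R"}  # topic:R

# topic:R -topic:[* TO *] : the negation subtracts from topic:R's matches,
# and every topic:R match obviously has a topic field
no_hits = topic_r - has_topic
print(no_hits)  # set() -- can't match anything

# topic:R (+*:* -topic:[* TO *]) : *:* gives the negation something
# to subtract from, so "no topic field" docs survive
all_docs = set(docs)
hits = topic_r | (all_docs - has_topic)
print(hits)  # {1, 3}
```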



Re: problem with formulating a negative query

2010-06-30 Thread Sascha Szott

Hi Erick,

thanks for your explanations. But why are all docs being *removed* from 
the set of all docs that contain R in their topic field? This would 
correspond to a boolean AND and would stand in conflict with the clause 
q.op=OR. This seems a bit strange to me.


Furthermore, Smiley & Pugh state in their Solr 1.4 book on pg. 102 that 
adding a subexpression containing the negative query (-[* TO *]) and 
the match-all-docs clause (*:*) is only a workaround. Why is this 
workaround necessary at all?


Best,
Sascha

Erick Erickson wrote:

This may help:
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Boolean%20operators

But the clause you specified translates roughly as find all the
documents that contain R, then remove any of them that match
* TO *. * TO * contains all the documents with R, so everything
you just matched is removed from your results.

HTH
Erick

On Tue, Jun 29, 2010 at 12:40 PM, Sascha Szott <sz...@zib.de> wrote:


Hi Ahmet,

it works, thanks a lot!

Truth be told, I have no idea what the problem is with
defType=lucene&q.op=OR&df=topic&q=R NOT [* TO *]

-Sascha


Ahmet Arslan wrote:


I have a (multi-valued) field topic in my index which does

not need to exist in every document. Now, I'm struggling
with formulating a query that returns all documents that
either have no topic field at all *or* whose topic field
value is R.



Does this work?
defType=lucene&q.op=OR&q=topic:R (+*:* -topic:[* TO *])




Re: Is there a way to delete multiple documents using wildcard?

2010-06-30 Thread Sascha Szott

Hi,

you can delete all docs that match a certain query:

<delete><query>uid:6-HOST*</query></delete>

-Sascha

bbarani wrote:


Hi,

I am trying to delete a group of documents using wildcard. Something like

update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><doc><field name="uid">6-HOST*</field></doc></delete>'

I want to delete all documents which contain a uid starting with 6-HOST,
but this query doesn't seem to work. Am I doing anything wrong?

Thanks,
BB


Re: Is there a way to delete multiple documents using wildcard?

2010-06-30 Thread Sascha Szott

Hi,

does /select?q=uid:6-HOST* return any documents?

-Sascha

bbarani wrote:


Hi,

Thanks a lot for your reply..

I tried the below query

update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>uid:6-HOST*</query></delete>'

But even now none of the documents are getting deleted. Am I forming the
URL wrong?

Thanks,
BB


Re: Is there a way to delete multiple documents using wildcard?

2010-06-30 Thread Sascha Szott

Hi,

take a look inside Solr's log file. Are there any error messages with 
respect to the update request?


Furthermore, you could try the following two commands instead:

curl "http://host:port/solr/update" --form-string 
'stream.body=<delete><query>uid:6-HOST*</query></delete>'

curl "http://host:port/solr/update" --form-string 'stream.body=<commit/>'

-Sascha
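The same delete-by-query payload can of course be built programmatically; this sketch only assembles the XML body and the stream.body parameter (it does not contact a server, and host/core details are left out):

```python
from urllib.parse import quote

def delete_by_query_payload(query):
    """Build the XML body Solr expects for a delete-by-query update."""
    return f"<delete><query>{query}</query></delete>"

payload = delete_by_query_payload("uid:6-HOST*")
print(payload)  # <delete><query>uid:6-HOST*</query></delete>

# URL-encoded form, as it would appear in a stream.body parameter:
print("stream.body=" + quote(payload))
```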

bbarani wrote:


Yeah, I am getting the results when I use /select handler.

I tried the below query..

/select?q=uid:6-HOST*

Got: <result name="response" numFound="52920" start="0">

Thanks
BB


problem with formulating a negative query

2010-06-29 Thread Sascha Szott

Hi folks,

I have a (multi-valued) field topic in my index which does not need to 
exist in every document. Now, I'm struggling with formulating a query 
that returns all documents that either have no topic field at all *or* 
whose topic field value is R.


Unfortunately, the query

/select?q={!lucene q.op=OR df=topic}(R NOT [* TO *])

does not return any docs even though there are documents in my index 
that fulfil the specified condition as you can deduce from the queries 
listed below:


/select?q=topic:R  returns  0 docs

/select?q=-topic:[* TO *]  returns  0 docs

Appending the query with debugQuery=true returns:
<str name="rawquerystring">{!lucene q.op=OR df=topic}(R NOT [* TO *])</str>
<str name="querystring">{!lucene q.op=OR df=topic}(R NOT [* TO *])</str>
<str name="parsedquery">topic:R -topic:[* TO *]</str>
<str name="parsedquery_toString">topic:R -topic:[* TO *]</str>

Does anybody have a clue of what is wrong here?

Thanks in advance,
Sascha


Re: Specifiying multiple mlt.fl fields

2010-06-19 Thread Sascha Szott

Hi Darren,

try mlt.fl=field1 field2

Best,
Sascha

Darren Govoni wrote:

Hi,
   I read the wiki and tried about a dozen variations such as:

...mlt.fl=field1&mlt.fl=field2

and

...mlt.fl=field1,field2...

to specify more than one MLT field and it won't take. What's the trick?
Also, how to do it with SolrJ?

Nothing I try works. Solr 4.0 nightly build.

Any tips, very appreciated!

Darren







Re: federated / meta search

2010-06-18 Thread Sascha Szott

Hi Joe & Markus,

sounds good! Maybe I should add a note to the Wiki page on 
federated search [1].


Thanks,
Sascha

[1] http://wiki.apache.org/solr/FederatedSearch

Joe Calderon wrote:

yes, you can use distributed search across shards with different
schemas as long as the query only references overlapping fields, i
usually test adding new fields or tokenizers on one shard and deploy
only after i verified its working properly

On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsma <markus.jel...@buyways.nl> wrote:

Hi,



Check out Solr sharding [1] capabilities. I never tested it with different 
schema's but if each node is queried with fields that it supports, it should 
return useful results.



[1]: http://wiki.apache.org/solr/DistributedSearch



Cheers.

-Original message-
From: Sascha Szott <sz...@zib.de>
Sent: Thu 17-06-2010 19:44
To: solr-user@lucene.apache.org;
Subject: federated / meta search

Hi folks,

if I'm seeing it right Solr currently does not provide any support for
federated / meta searching. Therefore, I'd like to know if anyone has
already put efforts into this direction? Moreover, is federated / meta
search considered a scenario Solr should be able to deal with at all or
is it (far) beyond the scope of Solr?

To be more precise, I'll give you a short explanation of my
requirements. Assume, there are a couple of Solr instances running at
different places. The documents stored within those instances are all
from the same domain (bibliographic records), but it can not be ensured
that the schema definitions conform to 100%. But lets say, there are at
least some index fields that are present in all instances (fields with
the same name and type definition). Now, I'd like to perform a search on
all instances at the same time (with the restriction that the query
contains only those fields that overlap among the different schemas) and
combine the results in a reasonable way by utilizing the score
information associated with each hit. Please note, that due to legal
issues it is not feasible to build a single index that integrates the
documents of all Solr instances under consideration.

Thanks in advance,
Sascha






federated / meta search

2010-06-17 Thread Sascha Szott

Hi folks,

if I'm seeing it right Solr currently does not provide any support for 
federated / meta searching. Therefore, I'd like to know if anyone has 
already put efforts into this direction? Moreover, is federated / meta 
search considered a scenario Solr should be able to deal with at all or 
is it (far) beyond the scope of Solr?


To be more precise, I'll give you a short explanation of my 
requirements. Assume, there are a couple of Solr instances running at 
different places. The documents stored within those instances are all 
from the same domain (bibliographic records), but it can not be ensured 
that the schema definitions conform to 100%. But lets say, there are at 
least some index fields that are present in all instances (fields with 
the same name and type definition). Now, I'd like to perform a search on 
all instances at the same time (with the restriction that the query 
contains only those fields that overlap among the different schemas) and 
combine the results in a reasonable way by utilizing the score 
information associated with each hit. Please note, that due to legal 
issues it is not feasible to build a single index that integrates the 
documents of all Solr instances under consideration.


Thanks in advance,
Sascha
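A naive score-based merge of per-instance result lists, as described above, can be sketched as follows (toy data; note that raw scores from independent indexes are not directly comparable, which is one of the genuinely hard parts of federated search):

```python
import heapq

# Hypothetical per-instance results: (score, doc_id) pairs, each list
# already sorted by descending score. A real merger would normalize
# scores across instances before combining them.
instance_a = [(2.4, "a1"), (1.1, "a2")]
instance_b = [(3.0, "b1"), (0.9, "b2")]

ranked = [doc for _, doc in
          heapq.merge(instance_a, instance_b,
                      key=lambda p: p[0], reverse=True)]
print(ranked)  # ['b1', 'a1', 'a2', 'b2']
```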



Re: strange results with query and hyphened words

2010-05-31 Thread Sascha Szott

Hi Markus,


the default-config for index is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

and for query:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>

That's not true. The default configuration for query-time processing is:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnCaseChange="1"/>

By using this setting, a search for profi-auskunft will match 
profiauskunft.


It's important to note that WordDelimiterFilterFactory's catenate* 
parameters should only be used in the index-time analysis stack. 
Otherwise the strange behaviour you mentioned (a search for 
profi-auskunft is translated into profi followed by (auskunft or 
profiauskunft)) will occur.


Best,
Sascha
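A toy model of what the catenateWords option does (a sketch, not the actual filter implementation): at index time the catenated form is emitted alongside the word parts, so either query form can match.

```python
def word_delimiter(token, catenate_words):
    """Toy sketch: split on hyphens; optionally also emit the catenated form."""
    parts = token.split("-")
    terms = set(parts)
    if catenate_words and len(parts) > 1:
        terms.add("".join(parts))  # e.g. "profiauskunft"
    return terms

index_terms = word_delimiter("profi-auskunft", catenate_words=True)
print(index_terms)  # {'profi', 'auskunft', 'profiauskunft'}

# A query term "profiauskunft" matches because the catenated form was
# indexed; with catenateWords off at index time it would not.
print("profiauskunft" in index_terms)  # True
```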


-----Original Message-----
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Sunday, 30 May 2010 19:01
To: solr-user@lucene.apache.org
Subject: Re: strange results with query and hyphened words

Hi Markus,

I was facing the same problem a few days ago and found an
explanation in
the mail archive that clarifies my question regarding the usage of
Solr's WordDelimiterFilterFactory:

http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with hyphen doesn't match.

my search term is profi-auskunft. in WordDelimiterFilterFactory i have
catenateWords, so my understanding is that profi-auskunft would search
for profiauskunft. when i use the analyse panel in solr admin i see that
profi-auskunft matches a term profiauskunft.

the analyse will show

Query Analyzer
WhitespaceTokenizerFactory
profi-auskunft
SynonymFilterFactory
profi-auskunft
StopFilterFactory
profi-auskunft

WordDelimiterFilterFactory

term position      1       2
term text          profi   auskunft
                           profiauskunft
term type          word    word
                           word
source start,end   0,5     6,14
                           0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why is auskunft and profiauskunft in one column. how do they get
searched?

when i search profiauskunft i have 230 hits, when i now search for
profi-auskunft i do get less hits. when i call the search with
debugQuery=on i see

body:profi (auskunft profiauskunft)

what does this query mean? profi and auskunft or profiauskunft?




<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- sg324: words separated by "-" and further whitespace are merged. -->
    <filter class="solr.HiphenatedWordsFilterFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms_de.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="de/stopwords_de.txt"
            enablePositionIncrements="true"/>
    <!-- sg324 -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" protected="de/protwords_de.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="de/synonyms_de.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="de/stopwords_de.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" protected="de/protwords_de.txt"/>
  </analyzer>
</fieldType>









Re: strange results with query and hyphened words

2010-05-31 Thread Sascha Szott
Sorry Markus, I mixed up the index and query field in analysis.jsp. In 
fact, I meant that a search for profiauskunft matches profi-auskunft.


I'm not sure whether the case you are dealing with (a search for 
profi-auskunft should match profiauskunft) is appropriately addressed by 
the WordDelimiterFilter. What about using the PatternReplaceCharFilter 
at query time to eliminate all intra-word hyphens?


-Sascha

Sascha Szott wrote:

Hi Markus,


the default-config for index is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

and for query:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>

That's not true. The default configuration for query-time processing is:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>

By using this setting, a search for profi-auskunft will match
profiauskunft.

It's important to note, that WordDelimiterFilterFactory's catenate*
parameters should only be used in the index-time analysis stack.
Otherwise the strange behaviour (search for profi-auskunft is translated
into profi followed by (auskunft or profiauskunft) you mentioned will
occur.

Best,
Sascha


-----Original Message-----
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Sunday, 30 May 2010 19:01
To: solr-user@lucene.apache.org
Subject: Re: strange results with query and hyphened words

Hi Markus,

I was facing the same problem a few days ago and found an
explanation in
the mail archive that clarifies my question regarding the usage of
Solr's WordDelimiterFilterFactory:

http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with hyphen doesn't match.

my search term is prof-auskunft. in

WordDelimiterFilterFactory i have

catenateWords, so my understanding is that profi-auskunft

would search

for profiauskunft. when i use the analyse panel in solr

admi i see that

profi-auskunft matches a term profiauskunft.

the analyse will show

Query Analyzer
WhitespaceTokenizerFactory
profi-auskunft
SynonymFilterFactory
profi-auskunft
StopFilterFactory
profi-auskunft

WordDelimiterFilterFactory

term position 1 2
term text profi auskunft
profiauskunft
term type word word
word
source start,end 0,5 6,14
0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why is auskunft and profiauskunft in one column. how do they get
searched?

when i search profiauskunft i have 230 hits, when i now search for
profi-auskunft i do get less hits. when i call the search with
debugQuery=on i see

body:profi (auskunft profiauskunft)

what does this query mean? profi and auskunft or profiauskunft?




<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  [...]
</fieldType>









Re: strange results with query and hyphened words

2010-05-30 Thread Sascha Szott

Hi Markus,

I was facing the same problem a few days ago and found an explanation in 
the mail archive that clarifies my question regarding the usage of 
Solr's WordDelimiterFilterFactory:


http://markmail.org/message/qoby6kneedtwd42h

Best,
Sascha

markus.rietz...@rzf.fin-nrw.de wrote:

i am wondering why a search term with hyphen doesn't match.

my search term is prof-auskunft. in WordDelimiterFilterFactory i have
catenateWords, so my understanding is that profi-auskunft would search
for profiauskunft. when i use the analyse panel in solr admin i see that
profi-auskunft matches a term profiauskunft.

the analyse will show

Query Analyzer
WhitespaceTokenizerFactory
profi-auskunft
SynonymFilterFactory
profi-auskunft
StopFilterFactory
profi-auskunft

WordDelimiterFilterFactory

term position   1   2
term text   profi   auskunft
profiauskunft
term type   wordword
word
source start,end0,5 6,14
0,15

LowerCaseFilterFactory
SnowballPorterFilterFactory

why is auskunft and profiauskunft in one column. how do they get
searched?

when i search profiauskunft i have 230 hits, when i now search for
profi-auskunft i do get less hits. when i call the search with
debugQuery=on i see

body:profi (auskunft profiauskunft)

what does this query mean? profi and auskunft or profiauskunft?




<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  [...]
</fieldType>






Re: sort by field length

2010-05-26 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

Ah, I may have misunderstood, I somehow got it in my mind
you were talking about the length of each term (as in string length).

But if you're looking at the field length as the count of terms, that's
another question, sorry for the confusion...

I have to ask, though, why you want to sort this way? The relevance
calculations already factor in both term frequency and field length. What's
the use-case for sorting by field length given the above?
It's not a real world use-case -- I just want to get a better 
understanding of the data I'm indexing (therefore, performance is 
negligible). In my current use case, you can think of the field length 
as an indicator of data quality (i.e., the longer the field content, the 
worse the quality is). Being able to sort the field data in order of 
decreasing length would allow me to investigate exceptional data items 
that are not appropriately handled by my curation process.


Best,
Sascha



Best
Erick

On Tue, May 25, 2010 at 3:40 AM, Sascha Szott <sz...@zib.de> wrote:


Hi Erick,


Erick Erickson wrote:


Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.


Good point, thank you for the clarification. I thought that Lucene
internally stores the field length (e.g., in order to compute the relevance)
and getting this information at query time requires only a simple lookup.

-Sascha




But you could consider payloads for storing the length, although
that would still be redundant...

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott <sz...@zib.de> wrote:

  Hi folks,


is it possible to sort by field length without having to (redundantly)
save
the length information in a separate index field? At first, I thought to
accomplish this using a function query, but I couldn't find an
appropriate
one.

Thanks in advance,
Sascha











Re: Faceted search not working?

2010-05-25 Thread Sascha Szott

Hi Birger,

Birger Lie wrote:

I don't think the boolean fields are mapped to on and off :)

You can use "true" and "on" interchangeably.

-Sascha




-birger

-Original Message-
From: Ilya Sterin [mailto:ster...@gmail.com]
Sent: 24. mai 2010 23:11
To: solr-user@lucene.apache.org
Subject: Faceted search not working?

I'm trying to perform a faceted search without any luck.  Result set doesn't 
return any facet information...

http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title

I'm getting the result set, but no facet information is present. Is there 
something else that needs to happen to turn faceting on?

I'm using latest Solr 1.4 release.  Data is indexed from the database using 
dataimporter.

Thanks.

Ilya Sterin




Re: sort by field length

2010-05-25 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.
Good point, thank you for the clarification. I thought that Lucene 
internally stores the field length (e.g., in order to compute the 
relevance) and getting this information at query time requires only a 
simple lookup.


-Sascha



But you could consider payloads for storing the length, although
that would still be redundant...

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott <sz...@zib.de> wrote:


Hi folks,

is it possible to sort by field length without having to (redundantly) save
the length information in a separate index field? At first, I thought to
accomplish this using a function query, but I couldn't find an appropriate
one.

Thanks in advance,
Sascha






Re: Highlighting is not happening

2010-05-25 Thread Sascha Szott

Hi,

to accomplish that, use the highlighting parameters hl.simple.pre and 
hl.simple.post.
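For instance, a minimal sketch of how these two parameters might be set in a request handler's defaults (the bold tags are just one possible choice of markup):

```xml
<lst name="defaults">
  <str name="hl">true</str>
  <!-- wrap each match in <b>...</b>; the values must be XML-escaped -->
  <str name="hl.simple.pre">&lt;b&gt;</str>
  <str name="hl.simple.post">&lt;/b&gt;</str>
</lst>
```

The same parameters can also be passed per request, e.g. as hl.simple.pre/hl.simple.post URL parameters.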


By the way, there are plenty of other parameters that affect 
highlighting. Take a look at:


http://wiki.apache.org/solr/HighlightingParameters

-Sascha

Doddamani, Prakash wrote:

Hey,

I thought the highlights would appear within the fields of the documents
returned from SolrJ.
But it returns a separate highlighting list below instead; sorry for the confusion.

I was wondering whether there is a way for the returned fields themselves
to contain bold characters.

E.g., if one searched for "query":

<doc>
  <str name="one">returned response which contains <b>query</b> should be bold</str>
</doc>


Regards
Prakash

-Original Message-
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Monday, May 24, 2010 10:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Hi Prakash,

can you provide

1. the definition of the relevant field
2. your query
3. the definition of the relevant request handler 4. a field value that
is stored in your index and should be highlighted

-Sascha

Doddamani, Prakash wrote:

Thanks Sascha,

The type of the fields for which I am searching is text, and I
am using solr.TextField


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Regards
Prakash


-Original Message-
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Monday, May 24, 2010 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Hi Prakash,

more importantly, check the field type and its associated analyzer. In



case you use a non-tokenized type (e.g., string), highlighting will
not appear if only a partial field match exists (only exact matches,
i.e. the query coincides with the field value, will be highlighted).
If that's not your intent, you should at least define a tokenizer for



the field type.

Best,
Sascha

Doddamani, Prakash wrote:

Hey Daren,
Yes the fields for which I am searching are stored and indexed, also
they are returned from the query, Also it is not coming, if the
entire



search keyword is part of the field.

Thanks
Prakash

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Monday, May 24, 2010 9:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Check that the field you are highlighting on is stored. It won't
work otherwise.


Now, this also means that the field is returned from the query. For
large text fields to be highlighted only, this means the entire text
is returned for each result.


There is a pending feature to address this, that allows you to tell
Solr to NOT return a specific field (to avoid unnecessary transfer of
large text fields in this scenario).

Darren


Hi



I am using dismax request handler, I wanted to highlight the search
field,

So added

<str name="hl">true</str>

I was expecting like if I search for keyword Akon resultant docs
wherever the Akon is available is bold.



But I am not seeing them getting bold, could some one tell me the
real



path where I should tune

If I pass explicitly the hl=true does not work



I have added the request handler



<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

Re: Faceted search not working?

2010-05-25 Thread Sascha Szott

Hi,

please note that the FacetComponent is one of the six search components 
that are automatically associated with solr.SearchHandler (this holds 
also for the QueryComponent).


Another note: by using name="components", all default components will be
replaced by the components you explicitly mention (i.e.,
QueryComponent and FacetComponent in your example). To avoid this, use
name="last-components" instead.
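To illustrate, a sketch of appending a component while keeping the defaults active (the component name myComponent is hypothetical):

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- the default components (query, facet, ...) stay registered;
       myComponent runs after them -->
  <arr name="last-components">
    <str>myComponent</str>
  </arr>
</requestHandler>
```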


-Sascha

Jean-Sebastien Vachon wrote:

Is the FacetComponent loaded at all?

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <arr name="components">
    <str>query</str>
    <str>facet</str>
  </arr>
</requestHandler>


On 2010-05-25, at 3:32 AM, Sascha Szott wrote:


Hi Birger,

Birger Lie wrote:

I don't think the boolean fields are mapped to on and off :)

You can use "true" and "on" interchangeably.

-Sascha




-birger

-Original Message-
From: Ilya Sterin [mailto:ster...@gmail.com]
Sent: 24. mai 2010 23:11
To: solr-user@lucene.apache.org
Subject: Faceted search not working?

I'm trying to perform a faceted search without any luck.  Result set doesn't 
return any facet information...

http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title

I'm getting the result set, but no facet information is present. Is there 
something else that needs to happen to turn faceting on?

I'm using latest Solr 1.4 release.  Data is indexed from the database using 
dataimporter.

Thanks.

Ilya Sterin









sort by field length

2010-05-24 Thread Sascha Szott

Hi folks,

is it possible to sort by field length without having to (redundantly) 
save the length information in a separate index field? At first, I 
thought to accomplish this using a function query, but I couldn't find 
an appropriate one.


Thanks in advance,
Sascha



Re: Highlighting is not happening

2010-05-24 Thread Sascha Szott

Hi Prakash,

more importantly, check the field type and its associated analyzer. In 
case you use a non-tokenized type (e.g., string), highlighting will 
not appear if only a partial field match exists (only exact matches, 
i.e. the query coincides with the field value, will be highlighted). If 
that's not your intent, you should at least define a tokenizer for the 
field type.


Best,
Sascha

Doddamani, Prakash wrote:

Hey Daren,
Yes the fields for which I am searching are stored and indexed, also
they are returned from the query,
Also it is not coming, if the entire search keyword is part of the
field.

Thanks
Prakash

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Monday, May 24, 2010 9:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Check that the field you are highlighting on is stored. It won't work
otherwise.


Now, this also means that the field is returned from the query. For
large text fields to be highlighted only, this means the entire text is
returned for each result.


There is a pending feature to address this, that allows you to tell Solr
to NOT return a specific field (to avoid unnecessary transfer of large
text fields in this scenario).

Darren


Hi



I am using dismax request handler, I wanted to highlight the search
field,

So added

<str name="hl">true</str>

I was expecting like if I search for keyword Akon resultant docs
wherever the Akon is available is bold.



But I am not seeing them getting bold, could some one tell me the real



path where I should tune

If I pass explicitly the hl=true does not work



I have added the request handler



<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name^20.0 coming^5 playing^4 keywords^0.1
    </str>
    <str name="bf">
      rord(isclassic)^0.5 ord(listeners)^0.3
    </str>
    <str name="fl">
      name, coming, playing, keywords, score
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <!-- example highlighter config, enable per-query with hl=true -->
    <str name="hl">true</str>
    <!--<str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>-->
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">0</str>
    <!-- instructs Solr to return the field itself if no query terms are found -->
    <!--<str name="f.name.hl.alternateField">name</str>-->
    <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
  </lst>
</requestHandler>

regards
prakash







Re: Highlighting is not happening

2010-05-24 Thread Sascha Szott

Hi Prakash,

can you provide

1. the definition of the relevant field
2. your query
3. the definition of the relevant request handler
4. a field value that is stored in your index and should be highlighted

-Sascha

Doddamani, Prakash wrote:

Thanks Sascha,

The type of the fields for which I am searching is text, and I am
using solr.TextField


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Regards
Prakash


-Original Message-
From: Sascha Szott [mailto:sz...@zib.de]
Sent: Monday, May 24, 2010 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Hi Prakash,

more importantly, check the field type and its associated analyzer. In
case you use a non-tokenized type (e.g., string), highlighting will
not appear if only a partial field match exists (only exact matches,
i.e. the query coincides with the field value, will be highlighted). If
that's not your intent, you should at least define a tokenizer for the
field type.

Best,
Sascha

Doddamani, Prakash wrote:

Hey Daren,
Yes the fields for which I am searching are stored and indexed, also
they are returned from the query, Also it is not coming, if the entire



search keyword is part of the field.

Thanks
Prakash

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Monday, May 24, 2010 9:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting is not happening

Check that the field you are highlighting on is stored. It won't
work otherwise.


Now, this also means that the field is returned from the query. For
large text fields to be highlighted only, this means the entire text
is returned for each result.


There is a pending feature to address this, that allows you to tell
Solr to NOT return a specific field (to avoid unnecessary transfer of
large text fields in this scenario).

Darren


Hi



I am using dismax request handler, I wanted to highlight the search
field,

So added

<str name="hl">true</str>

I was expecting like if I search for keyword Akon resultant docs
wherever the Akon is available is bold.



But I am not seeing them getting bold, could some one tell me the
real



path where I should tune

If I pass explicitly the hl=true does not work



I have added the request handler



<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name^20.0 coming^5 playing^4 keywords^0.1
    </str>
    <str name="bf">
      rord(isclassic)^0.5 ord(listeners)^0.3
    </str>
    <str name="fl">
      name, coming, playing, keywords, score
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <!-- example highlighter config, enable per-query with hl=true -->
    <str name="hl">true</str>
    <!--<str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>-->
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">0</str>
    <!-- instructs Solr to return the field itself if no query terms are found -->
    <!--<str name="f.name.hl.alternateField">name</str>-->
    <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
  </lst>
</requestHandler>

Re: Faceted search not working?

2010-05-24 Thread Sascha Szott

Hi Ilya,

Ilya Sterin wrote:

I'm trying to perform a faceted search without any luck.  Result set
doesn't return any facet information...

http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title

I'm getting the result set, but no facet information is present. Is there
something else that needs to happen to turn faceting on?

No.

What does http://localhost:8080/solr/select/?q=title:*&fl=title&wt=xml 
return?


-Sascha



Wildcard queries

2010-05-21 Thread Sascha Szott

Hi folks,

what's the idea behind the fact that no text analysis (e.g. lowercasing) 
is performed on wildcarded search terms?


In my context this behaviour seems to be counter-intuitive (I guess 
that's the case in the majority of applications) and my application 
needs to lowercase any input term before sending the HTTP request to my 
Solr server.


Would it be easy to disable this behaviour in Solr (1.5)? I would like 
to see a config parameter (per field type) that allows to disable this 
odd behaviour if needed. To ensure backward compatibility the odd 
behaviour should remain the default.


Am I missing any drawbacks?

Best,
Sascha



Re: Wildcard queries

2010-05-21 Thread Sascha Szott

Hi Robert,

thanks, you're absolutely right. I should better refine my initial 
question to: What's the idea behind the fact that no *lowercasing* is 
performed on wildcarded search terms if the field in question contains a 
LowercaseFilter in its associated field type definition?


-Sascha

Robert Muir wrote:

we can use stemming as an example:

let's say your query is c?ns?st?nt?y

how will this match 'consistently', which the Porter stemmer
transforms to 'consistent'?
furthermore, note that I replaced the vowels with ?'s here. The Porter
stemmer doesn't just rip stuff off the end, but attempts to guess
syllables as part of the process, so it cannot possibly work.

the only way it would work in this situation would be if you formed
permutations of all the possible words this wildcard would match, and
then did analysis on each form, and searched on all stems.

but, this is impossible, since the * operator allows an infinite language.

On Fri, May 21, 2010 at 10:11 AM, Sascha Szott <sz...@zib.de> wrote:

Hi folks,

what's the idea behind the fact that no text analysis (e.g. lowercasing) is
performed on wildcarded search terms?

In my context this behaviour seems to be counter-intuitive (I guess that's
the case in the majority of applications) and my application needs to
lowercase any input term before sending the HTTP request to my Solr server.

Would it be easy to disable this behaviour in Solr (1.5)? I would like to
see a config parameter (per field type) that allows to disable this odd
behaviour if needed. To ensure backward compatibility the odd behaviour
should remain the default.

Am I missing any drawbacks?

Best,
Sascha






Re: Autosuggest

2010-05-15 Thread Sascha Szott

Hi,

maybe you would like to have a look at solr.ShingleFilterFactory [1] to 
expand your autosuggest to more than one term.


-Sascha

[1] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
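A minimal analyzer sketch using shingles (the type name and shingle size are illustrative choices, not taken from the thread):

```xml
<fieldType name="text_autosuggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit word n-grams of up to three terms, e.g. "new york city",
         in addition to the single terms themselves -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Suggestions built on such a field can then cover multi-word phrases instead of single terms only.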


Blargy wrote:


Thanks for your help and especially your analyzer.. probably saved me a
full-import or two  :)





Re: How to tell which field matched?

2010-05-15 Thread Sascha Szott

Hi,

I'm not sure if debugQuery=on is a feasible solution in a production 
environment, as generating such extra information requires a reasonable 
amount of computation.


-Sascha

Jon Baer wrote:

Does the standard debug component (?debugQuery=on) give you what you need?

http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_does_id:archangel_come_before_id:hawkgirl_when_querying_for_.22wings.22

- Jon

On May 14, 2010, at 4:03 PM, Tim Garton wrote:


All,
 I've searched around for help with something we are trying to do
and haven't come across much.  We are running solr 1.4.  Here is a
summary of the issue we are facing:

A simplified example of our schema is something like this:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true" required="true"/>
<field name="date_posted" type="tdate" indexed="true" stored="true"/>
<field name="supplement_title" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="supplement_pdf_url" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="supplement_pdf_text" type="text" indexed="true" stored="true" multiValued="true"/>

When someone does a search we search across the title,
supplement_title, and supplement_pdf_text fields.  When we get our
results, we would like to be able to tell which field the search
matched and if it's a multiValued field, which of the multiple values
matched.  This is so that we can display results similar to:

Example Title
Example Supplement Title
Example Supplement Title 2 (your search matched this document)
Example Supplement Title 3

Example Title 2
Example Supplement Title 4
Example Supplement Title 5
Example Supplement Title 6 (your search matched this document)

etc.

How would you recommend doing this?  Is there some way to get solr to
tell us which field matched, including multiValued fields?  As a
workaround we have been using highlighting to tell which field
matched, but it doesn't get us what we want for multiValued fields and
there is a significant cost to enabling the highlighting.  Should we
design our schema in some other fashion to achieve these results?
Thanks.

-Tim






Re: Solr Schema Question

2010-04-17 Thread Sascha Szott

Hi Serdar,

take a look at Solr's DataImportHandler:

http://wiki.apache.org/solr/DataImportHandler
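A rough data-config.xml sketch of your scenario, combining the MySQL metadata with the text files via a nested entity. Table and column names are assumptions based on your description; PlainTextEntityProcessor reads each file's content into an implicit column named plainText:

```xml
<dataConfig>
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="..." password="..."/>
  <dataSource name="fs" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="meta" dataSource="db"
            query="SELECT id, title, description, file_path FROM documents">
      <!-- for each row, read the referenced txt file from disk -->
      <entity name="txt" dataSource="fs"
              processor="PlainTextEntityProcessor"
              url="${meta.file_path}">
        <field column="plainText" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

This way the metadata and the file content end up in one Solr document per row, and no Java coding is required.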

Best,
Sascha

Serdar Sahin wrote:

Hi,

I am rather new to Solr and have a question.

We have around 200.000 txt files which are placed into the file cloud.
The file path is something similar to this:

file/97/8f/840/fa4-1.txt
file/a6/9d/ab0/ca2-2.txt etc.

and we also store the metadata (like title, description, tags etc)
about these files in the mysql server. So, what I want to do is to
index title, description, tags and other data from mysql, and also get
the txt file from file server, and link them as one record for
searching, but I could not figure out how to automate this process.
I can give the path from the sql query like, Select id, title,
description, file_path, and then solr can use this path to retrieve
txt file, but I don't know whether it is possible or not.

What is the best way to index these files with their tag title and
description without coding in Java (Perl is ok). These txt files are
large, between 100 KB and 10 MB, so storing them in the database is only
a last resort.

Thanks,

Serdar




Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Sascha Szott

Hi Yonik,

Yonik Seeley wrote:

Stephen, were you running stock Solr 1.4, or did you apply any of the
SolrJ patches?
I'm trying to figure out if anyone still has any problems, or if this
was fixed with SOLR-1711:
I'm using the latest trunk version (rev. 934846) and constantly running 
into the same problem. I'm using StreamingUpdateSolrServer with 3 threads 
and a queue size of 20 (not really knowing if this configuration is 
optimal). My multi-threaded application indexes 200k data items 
(bibliographic metadata in Dublin Core format) and constantly hangs 
after running for some time.


Below you can find the thread dump of one of my index threads (after the 
app hangs all dumps are the same)


thread 19 prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on 
condition [0x42d05000]

   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  0x7fe8cdcb7598 (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
	at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
	at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
	at 
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
	at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)

at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
	at 
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
	at 
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
	at 
de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)

at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
at de.kobv.ked.rss.RssThread.run(RssThread.java:58)



and of the three SUSS threads:

pool-1-thread-3 prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in 
Object.wait() [0x409ac000]

   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
	- waiting on 0x7fe8cdcb6f10 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
	- locked 0x7fe8cdcb6f10 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
	at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
	at 
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

pool-1-thread-2 prio=10 tid=0x7fe8c7afa000 nid=0x277f in 
Object.wait() [0x40209000]

   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
	- waiting on 0x7fe8cdcb6f10 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
	- locked 0x7fe8cdcb6f10 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
	at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
	at 
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

pool-1-thread-1 prio=10 tid=0x7fe8c79f2800 nid=0x277e in 
Object.wait() [0x42e06000]

   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
	- waiting on 0x7fe8cdcb6f10 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
	at 

Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Sascha Szott

Hi Yonik,

thanks for your fast reply.

Yonik Seeley wrote:

Thanks for the report Sascha.
So after the hang, it never recovers?  Some amount of hanging could be
visible if there was a commit on the Solr server or something else to
cause the solr requests to block for a while... but it should return
to normal on its own...
In my case the whole application hangs and never recovers (CPU 
utilization goes down to near 0%). Interestingly, the problem 
reproducibly occurs only if SUSS is created with *more than 2* threads.



Looking at the stack trace, it looks like threads are blocked waiting
to get an http connection.
I forgot to mention that my index app has exclusive access to the Solr 
instance. Therefore, concurrent searches against the same Solr instance 
while indexing are excluded.



I'm traveling all next week, but I'll open a JIRA issue for this now.

Thank you!


Anything that would help us reproduce this is much appreciated.

Are there any other who have experienced the same problem?

-Sascha



On Fri, Apr 16, 2010 at 8:57 AM, Sascha Szott <sz...@zib.de> wrote:

Hi Yonik,

Yonik Seeley wrote:


Stephen, were you running stock Solr 1.4, or did you apply any of the
SolrJ patches?
I'm trying to figure out if anyone still has any problems, or if this
was fixed with SOLR-1711:


I'm using the latest trunk version (rev. 934846) and constantly running into
the same problem. I'm using StreamingUpdateSolrServer with 3 threads and a
queue size of 20 (not really knowing if this configuration is optimal). My
multi-threaded application indexes 200k data items (bibliographic metadata
in Dublin Core format) and constantly hangs after running for some time.

Below you can find the thread dump of one of my index threads (after the app
hangs all dumps are the same)

thread 19 prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on condition
[0x42d05000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for0x7fe8cdcb7598  (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
at
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
at
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
at
de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
at
de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)
at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
at de.kobv.ked.rss.RssThread.run(RssThread.java:58)



and of the three SUSS threads:

pool-1-thread-3 prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in Object.wait()
[0x409ac000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x7fe8cdcb6f10> (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked <0x7fe8cdcb6f10> (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

pool-1-thread-2 prio=10 tid=0x7fe8c7afa000 nid=0x277f in Object.wait()
[0x40209000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x7fe8cdcb6f10> (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked <0x7fe8cdcb6f10> (a

Re: Deploying Solr 1.3 in JBoss 5

2010-02-05 Thread Sascha Szott

Hi Luca,

could you add a note to the Wiki page [1]. Thanks!

-Sascha

[1] http://wiki.apache.org/solr/SolrJBoss

Luca Molteni wrote:

By the way, I finally solved it.

To deploy Solr 1.3 in JBoss 5, you simply have to remove

xercesImpl-2.8.1.jar
xml-apis-1.3.03.jar

from the WEB-INF/lib folder of solr.war.

Solr will then use the libraries provided by JBoss 5.

Thank you again.

L.M.



On 3 February 2010 10:38, Luca Molteni <voloth...@gmail.com> wrote:

Apparently, that worked! I'd never realized that the order of the
elements in XML is significant, nice to see.

As always, problems lead to other problems, so now I'm facing a
Xerces ClassCastException with JDK 6.

org.jboss.xb.binding.JBossXBRuntimeException: Failed to create a new SAX parser
at 
org.jboss.xb.binding.UnmarshallerFactory$UnmarshallerFactoryImpl.newUnmarshaller(UnmarshallerFactory.java:100)
at 
org.jboss.web.tomcat.service.deployers.JBossContextConfig.processContextConfig(JBossContextConfig.java:549)
at 
org.jboss.web.tomcat.service.deployers.JBossContextConfig.init(JBossContextConfig.java:536)
at 
org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:279)
at 
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at 
org.apache.catalina.core.StandardContext.init(StandardContext.java:5436)
at 
org.apache.catalina.core.StandardContext.start(StandardContext.java:4148)
at 
org.jboss.web.tomcat.service.deployers.TomcatDeployment.performDeployInternal(TomcatDeployment.java:310)
at 
org.jboss.web.tomcat.service.deployers.TomcatDeployment.performDeploy(TomcatDeployment.java:142)
at 
org.jboss.web.deployers.AbstractWarDeployment.start(AbstractWarDeployment.java:461)
at org.jboss.web.deployers.WebModule.startModule(WebModule.java:118)
at org.jboss.web.deployers.WebModule.start(WebModule.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:157)
at org.jboss.mx.server.Invocation.dispatch(Invocation.java:96)
at org.jboss.mx.server.Invocation.invoke(Invocation.java:88)
at 
org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:264)
at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:668)
at 
org.jboss.system.microcontainer.ServiceProxy.invoke(ServiceProxy.java:206)
at $Proxy38.start(Unknown Source)
at 
org.jboss.system.microcontainer.StartStopLifecycleAction.installAction(StartStopLifecycleAction.java:42)
at 
org.jboss.system.microcontainer.StartStopLifecycleAction.installAction(StartStopLifecycleAction.java:37)
at 
org.jboss.dependency.plugins.action.SimpleControllerContextAction.simpleInstallAction(SimpleControllerContextAction.java:62)
at 
org.jboss.dependency.plugins.action.AccessControllerContextAction.install(AccessControllerContextAction.java:71)
at 
org.jboss.dependency.plugins.AbstractControllerContextActions.install(AbstractControllerContextActions.java:51)
at 
org.jboss.dependency.plugins.AbstractControllerContext.install(AbstractControllerContext.java:348)
at 
org.jboss.system.microcontainer.ServiceControllerContext.install(ServiceControllerContext.java:297)
at 
org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:1633)
at 
org.jboss.dependency.plugins.AbstractController.incrementState(AbstractController.java:935)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:1083)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:985)
at 
org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:823)
at 
org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:553)
at 
org.jboss.system.ServiceController.doChange(ServiceController.java:688)
at org.jboss.system.ServiceController.start(ServiceController.java:460)
at 
org.jboss.system.deployers.ServiceDeployer.start(ServiceDeployer.java:163)
at 
org.jboss.system.deployers.ServiceDeployer.deploy(ServiceDeployer.java:99)
at 
org.jboss.system.deployers.ServiceDeployer.deploy(ServiceDeployer.java:46)
at 
org.jboss.deployers.spi.deployer.helpers.AbstractSimpleRealDeployer.internalDeploy(AbstractSimpleRealDeployer.java:62)
at 
org.jboss.deployers.spi.deployer.helpers.AbstractRealDeployer.deploy(AbstractRealDeployer.java:50)
at 
org.jboss.deployers.plugins.deployers.DeployerWrapper.deploy(DeployerWrapper.java:171)

Re: (default) maximum chars per field

2010-02-05 Thread Sascha Szott

markus.rietz...@rzf.fin-nrw.de wrote:

ok,
i was looking for all types of max but somehow didn't see the
maxFieldLength.
this is a global parameter, right? can this be defined on a field basis?
It's a global parameter counting the maximum number of tokens(!) - not 
the number of characters or bytes - per field. If a field's content 
exceeds that number, the remaining tokens are truncated without any notice.
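For reference, this limit lives in solrconfig.xml (Solr 1.x); a minimal sketch, with an illustrative value:

```xml
<!-- solrconfig.xml: global cap on the number of tokens indexed per field;
     tokens beyond this count are silently dropped at index time -->
<indexDefaults>
  <maxFieldLength>100000</maxFieldLength>
</indexDefaults>
```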


-Sascha



global would be enough at the moment.

thank you


-Ursprüngliche Nachricht-
Von: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Gesendet: Freitag, 5. Februar 2010 11:35
An: solr-user@lucene.apache.org
Betreff: Re: (default) maximum chars per field

On Fri, Feb 5, 2010 at 3:56 PM,
markus.rietz...@rzf.fin-nrw.de  wrote:


hi,
what is the default maximum char size per field? i found a maxChars
parameter for copyField but i don't think that this is what i am
looking for.

We have indexed some documents via Tika/SolrCell. Only the beginning of
these documents can be searched. Where can I define the maximum size of
a document/field that will be indexed? At the moment we do the updates
via XML upload. Is there a max size for this XML? In solrconfig.xml I have
found multipartUploadLimitInKB=2048000, which means 2 GB would be the
max size to post. That would be enough...



Increase maxFieldLength in your solrconfig.xml. The default is 10,000 tokens.

--
Regards,
Shalin Shekhar Mangar.





Re: java.lang.NullPointerException with MySQL DataImportHandler

2010-02-02 Thread Sascha Szott

Hi,

can you post

* the output of MySQL's describe command for all tables/views referenced 
in your DIH configuration

* the DIH configuration file (i.e., data-config.xml)
* the schema definition (i.e., schema.xml)

-Sascha

Jean-Michel Philippon-Nadeau wrote:

Hi,

It is my first install of Solr. The setup has been pretty
straightforward and yet, the performance is very impressive.

I am running into an issue with my MySQL DataImportHandler. I've
followed the quick-start in order to write the necessary config and so
far everything seemed to work.

However, I am missing some fields in my index. I've switched all fields
to stored=true temporarily in my schema to troubleshoot the issue. I
only have 3 fields listed in search results while I should have 8.

Could this be caused by ampersands or illegal entities in my database?
How can I see if DIH is importing correctly all my rows into the index?

Follows is the warning I have in my catalina.log.

Thank you very much,

Jean-Michel

===

Feb 2, 2010 12:21:07 AM org.apache.solr.handler.dataimport.SolrWriter
upload
WARNING: Error creating document :
SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce
&amp; Gabbana D&amp;G Neckties designer Tie for men 543},
productID=productID(1.0)={220213}}]
java.lang.NullPointerException
 at
org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
 at org.apache.lucene.document.Field.<init>(Field.java:341)
 at org.apache.lucene.document.Field.<init>(Field.java:305)
 at
org.apache.solr.schema.FieldType.createField(FieldType.java:210)
 at
org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
 at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
 at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
 at
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75)
 at org.apache.solr.handler.dataimport.DataImportHandler
$1.upload(DataImportHandler.java:292)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:392)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at org.apache.solr.handler.dataimport.DataImporter
$1.run(DataImporter.java:370)





Re: Deploying Solr 1.3 in JBoss 5

2010-02-02 Thread Sascha Szott

Hi,

I'm not sure if that's a Solr issue. However, what happens if you set 
env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}?


-Sascha

Am 02.02.2010 15:20, schrieb Luca Molteni:

Hello list,

I'm having some problems deploying Solr to JBoss 5.

The problem is with environment variables:

Following this page of the wiki:  http://wiki.apache.org/solr/SolrJBoss

I've added to the web.xml of WEB-INF of solr

   <env-entry>
     <env-entry-name>solr/home</env-entry-name>
     <env-entry-type>java.lang.String</env-entry-type>
     <env-entry-value>${solr.home.myhome}</env-entry-value>
   </env-entry>

Since I'm using lots of instances of solr in the same container.

This variable should be expanded by jboss itself in a path using
properties-services.xml:

 <attribute name="Properties">
   solr.home.myhome=C:/mypath/solr
 </attribute>

Unfortunately, during deployment of the solr application, it gives me
this error:

Caused by: org.jboss.xb.binding.JBossXBException: Failed to parse
source: The content of element type env-entry must match
(description?,env-entry-name,env-entry-value?,env-entry-type). @
vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at 
org.jboss.xb.binding.parser.sax.SaxJBossXBParser.parse(SaxJBossXBParser.java:203)

... 33 more
Caused by: org.xml.sax.SAXException: The content of element type
env-entry must match
(description?,env-entry-name,env-entry-value?,env-entry-type). @
vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at 
org.jboss.xb.binding.parser.sax.SaxJBossXBParser$MetaDataErrorHandler.error(SaxJBossXBParser.java:426)


Notice that the same .war and properties-services.xml works flawlessly
in JBoss 4.2.3

Any ideas?

Thank you very much.

L.M.


--
Sascha Szott
Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV)
c/o Konrad-Zuse-Zentrum fuer Informationstechnik Berlin (ZIB)
Takustr. 7, D-14195 Berlin
Zimmer 4357
Telefon: (030) 841 85 - 457
Telefax: (030) 841 85 - 269
E-Mail: sz...@zib.de
WWW: http://www.kobv.de



Re: java.lang.NullPointerException with MySQL DataImportHandler

2010-02-02 Thread Sascha Szott

Hi,

since some of the fields used in your DIH configuration aren't mandatory 
(e.g., keywords and tags are defined as nullable in your db table 
schema), add a default value to all optional fields in your schema 
configuration (e.g., default=""). Note that Solr does not understand 
the db-related concept of null values.
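A minimal schema.xml sketch along these lines (field names taken from the thread; the field type is an assumption):

```xml
<!-- optional DB columns get an explicit empty default so a NULL value
     coming from DIH does not abort creation of the index document -->
<field name="keywords" type="text" indexed="true" stored="true" default=""/>
<field name="tags"     type="text" indexed="true" stored="true" default=""/>
```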


Solr's log output

SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce
&amp; Gabbana D&amp;G Neckties designer Tie for men 543},
productID=productID(1.0)={220213}}]

indicates that there aren't any tags or descriptions stored for the item 
with productId 220213. Since no default value is specified, Solr raises 
an error when creating the index document.


-Sascha

Jean-Michel Philippon-Nadeau wrote:

Hi,

Thanks for the reply.

On Tue, 2010-02-02 at 16:57 +0100, Sascha Szott wrote:

* the output of MySQL's describe command for all tables/views referenced
in your DIH configuration


mysql> describe products;
+----------------+------------------+------+-----+---------+----------------+
| Field          | Type             | Null | Key | Default | Extra          |
+----------------+------------------+------+-----+---------+----------------+
| productID      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| skuCode        | varchar(320)     | YES  | MUL | NULL    |                |
| upcCode        | varchar(320)     | YES  | MUL | NULL    |                |
| name           | varchar(320)     | NO   |     | NULL    |                |
| description    | text             | NO   |     | NULL    |                |
| keywords       | text             | YES  |     | NULL    |                |
| disqusThreadID | varchar(50)      | NO   |     | NULL    |                |
| tags           | text             | YES  |     | NULL    |                |
| createdOn      | int(10) unsigned | NO   |     | NULL    |                |
| lastUpdated    | int(10) unsigned | NO   |     | NULL    |                |
| imageURL       | varchar(320)     | YES  |     | NULL    |                |
| inStock        | tinyint(1)       | YES  | MUL | 1       |                |
| active         | tinyint(1)       | YES  |     | 1       |                |
+----------------+------------------+------+-----+---------+----------------+
13 rows in set (0.00 sec)

mysql> describe product_soldby_vendor;
+-----------------+------------------+------+-----+---------+-------+
| Field           | Type             | Null | Key | Default | Extra |
+-----------------+------------------+------+-----+---------+-------+
| productID       | int(10) unsigned | NO   | MUL | NULL    |       |
| productVendorID | int(10) unsigned | NO   | MUL | NULL    |       |
| price           | double           | NO   |     | NULL    |       |
| currency        | varchar(5)       | NO   |     | NULL    |       |
| buyURL          | varchar(320)     | NO   |     | NULL    |       |
+-----------------+------------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

mysql> describe products_vendors_subcategories;
+----------------------------+------------------+------+-----+---------+----------------+
| Field                      | Type             | Null | Key | Default | Extra          |
+----------------------------+------------------+------+-----+---------+----------------+
| productVendorSubcategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| productVendorCategoryID    | int(10) unsigned | NO   |     | NULL    |                |
| labelEnglish               | varchar(320)     | NO   |     | NULL    |                |
| labelFrench                | varchar(320)     | NO   |     | NULL    |                |
+----------------------------+------------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)

mysql> describe products_vendors_categories;
+-------------------------+------------------+------+-----+---------+----------------+
| Field                   | Type             | Null | Key | Default | Extra          |
+-------------------------+------------------+------+-----+---------+----------------+
| productVendorCategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| labelEnglish            | varchar(320)     | NO   |     | NULL    |                |
| labelFrench             | varchar(320)     | NO   |     | NULL    |                |
+-------------------------+------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> describe product_vendor_in_subcategory;
+-------------------+------------------+------+-----+---------+-------+
| Field             | Type             | Null | Key | Default | Extra |
+-------------------+------------------+------+-----+---------+-------+
| productVendorID   | int(10) unsigned | NO   | MUL | NULL    |       |
| productCategoryID | int(10) unsigned | NO   | MUL | NULL    |       |
+-------------------+------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> describe products_vendors_countries;
++--+--+-+-++
| Field  | Type | Null | Key | Default |
Extra

Re: Deploying Solr 1.3 in JBoss 5

2010-02-02 Thread Sascha Szott

Luca Molteni wrote:

Actually, if I hard-code the value, it gives me the same error... interesting.

According to the error message:

The content of element type env-entry must match
(description?,env-entry-name,env-entry-value?,env-entry-type)

Maybe it helps to change the order of elements within <env-entry> 
(<env-entry-value> before <env-entry-type>)?
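For reference, a sketch of the entry with its children in the order the quoted content model demands (name, then value, then type), using the values from this thread:

```xml
<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>${solr.home.myhome}</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
```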


-Sascha




On 2 February 2010 17:14, Sascha Szott <sz...@zib.de> wrote:

Hi,

I'm not sure if that's a Solr issue. However, what happens if you set
env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}?

-Sascha

Am 02.02.2010 15:20, schrieb Luca Molteni:


Hello list,

I'm having some problems deploying Solr to JBoss 5.

The problem is with environment variables:

Following this page of the wiki:  http://wiki.apache.org/solr/SolrJBoss

I've added to the web.xml of WEB-INF of solr

   <env-entry>
     <env-entry-name>solr/home</env-entry-name>
     <env-entry-type>java.lang.String</env-entry-type>
     <env-entry-value>${solr.home.myhome}</env-entry-value>
   </env-entry>

Since I'm using lots of instances of solr in the same container.

This variable should be expanded by jboss itself in a path using
properties-services.xml:

 <attribute name="Properties">
   solr.home.myhome=C:/mypath/solr
 </attribute>

Unfortunately, during deployment of the solr application, it gives me
this error:

Caused by: org.jboss.xb.binding.JBossXBException: Failed to parse
source: The content of element type env-entry must match
(description?,env-entry-name,env-entry-value?,env-entry-type). @

vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at
org.jboss.xb.binding.parser.sax.SaxJBossXBParser.parse(SaxJBossXBParser.java:203)

... 33 more
Caused by: org.xml.sax.SAXException: The content of element type
env-entry must match
(description?,env-entry-name,env-entry-value?,env-entry-type). @

vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14]
at
org.jboss.xb.binding.parser.sax.SaxJBossXBParser$MetaDataErrorHandler.error(SaxJBossXBParser.java:426)


Notice that the same .war and properties-services.xml works flawlessly
in JBoss 4.2.3

Any ideas?

Thank you very much.

L.M.




Re: How to display Highlight with VelocityResponseWriter?

2010-01-13 Thread Sascha Szott
Hi Qiuyan,

 Thanks a lot. It works now. When i added the line
 #set($hl = $response.highlighting)
 i got the highlighting. But I wonder if there's any documentation that
 describes the usage of that. I mean, I didn't know the names of those
 methods; I just managed to guess them.
Solritas (aka VelocityResponseWriter) binds a number of objects into a
so-called VelocityContext (consult [1] for a complete list). You can think
of it as a map that allows you to access objects by symbolic names, e.g., an
instance of QueryResponse is stored under the name response (that's why you
write $response in your template).

Since $response is an instance of QueryResponse you can call all methods
on it the API [2] provides. Furthermore, Velocity incorporates a
JavaBean-like introspection mechanism that lets you write
$response.highlighting instead of $response.getHighlighting() (only a bit
of syntactic sugar).

-Sascha

[1] http://wiki.apache.org/solr/VelocityResponseWriter#line-93
[2]
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html

 Quoting Sascha Szott <sz...@zib.de>:

 Qiuyan,

 with highlighting can also be displayed in the web gui. I've added <bool
 name="hl">true</bool> into the standard requestHandler and it already
 works, i.e., without Velocity. But the same line doesn't take effect in
 /itas. Should I configure anything else? Thanks in advance.
 First of all, just a few notes on the /itas request handler in your
 solrconfig.xml:

 1. The entry

 <arr name="components">
   <str>highlight</str>
 </arr>

 is obsolete, since the highlighting component is a default search
 component [1].

 2. Note that since you didn't specify a value for hl.fl highlighting
 will only affect the fields listed inside of qf.

 3. Why did you override the default value of hl.fragmenter? In most
 cases the default fragmenting algorithm (gap) works fine - and maybe
 in yours as well?


 To make sure all your hl related settings are correct, can you post
 an xml output (change the wt parameter to xml) for a search with
 highlighted results.

 And finally, can you post the vtl code snippet that should produce
 the highlighted output.

 -Sascha

 [1] http://wiki.apache.org/solr/SearchComponent














Re: How to display Highlight with VelocityResponseWriter?

2010-01-11 Thread Sascha Szott

Qiuyan,


with highlighting can also be displayed in the web gui. I've added <bool
name="hl">true</bool> into the standard requestHandler and it already
works, i.e., without Velocity. But the same line doesn't take effect in
/itas. Should I configure anything else? Thanks in advance.
First of all, just a few notes on the /itas request handler in your 
solrconfig.xml:


1. The entry

<arr name="components">
  <str>highlight</str>
</arr>

is obsolete, since the highlighting component is a default search 
component [1].


2. Note that since you didn't specify a value for hl.fl highlighting 
will only affect the fields listed inside of qf.


3. Why did you override the default value of hl.fragmenter? In most 
cases the default fragmenting algorithm (gap) works fine - and maybe in 
yours as well?



To make sure all your hl related settings are correct, can you post an 
xml output (change the wt parameter to xml) for a search with 
highlighted results.


And finally, can you post the vtl code snippet that should produce the 
highlighted output.


-Sascha

[1] http://wiki.apache.org/solr/SearchComponent








Re: solrJ and spell check queries

2010-01-03 Thread Sascha Szott

Hi,

Jay Fisher wrote:

I'm trying to find a way to formulate the following query in solrJ. This is
the only way I can get the desired result but I can't figure out how to get
solrJ to generate the same query string. It always generates a url that
starts with select and I need it to start with spell. If there is an
alternative url string that will work please let me know.

http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true

In case you hook SpellCheckComponent directly into the standard request 
handler, i.e., /select,


http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true

should work.
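In that setup, the hook-in looks roughly like this in solrconfig.xml (handler and component names as in the stock Solr 1.4 example config; a sketch, not your exact configuration):

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <!-- run the spellchecker after the built-in search components -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```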

-Sascha




Re: how to do a Parent/Child Mapping using entities

2009-12-30 Thread Sascha Szott

Hi,


Thanks Sascha for your post. I find it interesting, but in my case I don't
want to use an additional field; I want to be able, with the same schema,
to do a simple query like q=res_url:"some url" as well as a query like
the other one.
You could easily write your own query parser (QParserPlugin, in Solr's 
terminology) that internally translates queries like


 q = res_url:<url> AND res_rank:<rank>

into

 q = res_ranked_url:"<rank> <url>"

thus hiding the res_ranked_url field from the user/client.

I'm not sure, but maybe it's possible to utilize the order of values 
within the multi-valued field res_url directly in the newly created 
parser. This seems like the cleanest solution to me.
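A minimal sketch of the rewrite such a parser could perform, reduced to pure string manipulation (the class and method names are hypothetical; a real QParserPlugin would parse the incoming query and build Lucene queries rather than strings):

```java
// Hypothetical illustration of the rewrite a custom QParserPlugin could
// apply: collapse a url clause and a rank clause into a single phrase
// query against the combined res_ranked_url field.
public class RankedUrlQueryRewriter {

    // res_url:<url> AND res_rank:<rank>  ->  res_ranked_url:"<rank> <url>"
    static String rewrite(String url, String rank) {
        return "res_ranked_url:\"" + rank + " " + url + "\"";
    }

    public static void main(String[] args) {
        // prints: res_ranked_url:"1 url1"
        System.out.println(rewrite("url1", "1"));
    }
}
```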


-Sascha


in other word; is there any solution to make two or more multivalued fields
in the same document linked with each other, e.g:
in this result:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">1</str>
    <str name="keyword">Key1</str>
    <arr name="res_url">
      <str>url1</str>
      <str>url2</str>
      <str>url3</str>
      <str>url4</str>
    </arr>
    <arr name="res_rank">
      <str>1</str>
      <str>2</str>
      <str>3</str>
      <str>4</str>
    </arr>
  </doc>
</result>

i would like to make Solr understand that for this document, value "url1" of
the res_url field is linked to value "1" of the res_rank field, and all of them
are linked to the common field keyword.
I think that I should use a custom field analyzer or something like that,
but I don't know what to do.

but thanks for all; and any supplied help will be lovable.


Sascha Szott wrote:


Hi,

you could create an additional index field res_ranked_url that contains
the concatenated value of an url and its corresponding rank, e.g.,

res_rank + " " + res_url

Then, q=res_ranked_url:"1 url1" retrieves all documents with url1 as the
first url.

A drawback of this workaround is that you have to use a phrase query
thus preventing wildcard searches for urls.

-Sascha



Hello everybody, I would like to know how to create an index supporting a
parent/child mapping and then query the child to get the results.
In other words, imagine that we have a database containing 2
tables: Keyword[id(int), value(string)] and Result[id(int), res_url(text),
res_text(text), res_date(date), res_rank(int)].
For indexing, I used the DataImportHandler to import data and it works
well, and my query response seems good (q=*:*) (imagine that we have only
these two keywords and their results):

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">1</str>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <str>1</str>
        <str>2</str>
        <str>3</str>
        <str>4</str>
      </arr>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="keyword">Key2</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url5</str>
        <str>url8</str>
        <str>url7</str>
      </arr>
      <arr name="res_rank">
        <str>1</str>
        <str>2</str>
        <str>3</str>
        <str>4</str>
      </arr>
    </doc>
  </result>
</response>

But the problem is when I type a query like this: q=res_url:url2 AND
res_rank:1, meaning that I want to search for the keywords for which
the url (url2) is ranked at the first position; I get a result like
this:

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">res_url:url2 AND res_rank:1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <str>1</str>
        <str>2</str>
        <str>3</str>
        <str>4</str>
      </arr>
    </doc>
  </result>
</response>

But this is not true; because the url present in the 1st position in the
results of the keyword key1 is url1 and not url2.
So what I want to say is: is there any solution to make the values of
the multivalued fields linked?
so in our case we can see that the previous result says that:
   - url1 is present in 1st position of key1 results
   - url2 is present in 2nd position of key1 results
   - url3 is present in 3rd position of key1 results
   - url4 is present in 4th position of key1 results

and i would like that solr consider this when executing queries.

Any helps please; and thanks for all :)







Re: how to do a Parent/Child Mapping using entities

2009-12-29 Thread Sascha Szott

Hi,

you could create an additional index field res_ranked_url that contains 
the concatenated value of an url and its corresponding rank, e.g.,


res_rank + " " + res_url

Then, q=res_ranked_url:"1 url1" retrieves all documents with url1 as the 
first url.


A drawback of this workaround is that you have to use a phrase query 
thus preventing wildcard searches for urls.


-Sascha



Hello everybody, I would like to know how to create an index supporting a
parent/child mapping and then query the child to get the results.
In other words, imagine that we have a database containing 2
tables: Keyword[id(int), value(string)] and Result[id(int), res_url(text),
res_text(text), res_date(date), res_rank(int)].
For indexing, I used the DataImportHandler to import data and it works well,
and my query response seems good (q=*:*) (imagine that we have only these
two keywords and their results):

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">1</str>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <str>1</str>
        <str>2</str>
        <str>3</str>
        <str>4</str>
      </arr>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="keyword">Key2</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url5</str>
        <str>url8</str>
        <str>url7</str>
      </arr>
      <arr name="res_rank">
        <str>1</str>
        <str>2</str>
        <str>3</str>
        <str>4</str>
      </arr>
    </doc>
  </result>
</response>

But the problem is when I type a query like this: q=res_url:url2 AND
res_rank:1, meaning that I want to search for the keywords for which
the url (url2) is ranked at the first position; I get a result like this:

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">res_url:url2 AND res_rank:1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="keyword">Key1</str>
      <arr name="res_url">
        <str>url1</str>
        <str>url2</str>
        <str>url3</str>
        <str>url4</str>
      </arr>
      <arr name="res_rank">
        <str>1</str>
        <str>2</str>
        <str>3</str>
        <str>4</str>
      </arr>
    </doc>
  </result>
</response>

But this is not true; because the url present in the 1st position in the
results of the keyword key1 is url1 and not url2.
So what I want to say is: is there any solution to make the values of the
multivalued fields linked?
so in our case we can see that the previous result says that:
  - url1 is present in 1st position of key1 results
  - url2 is present in 2nd position of key1 results
  - url3 is present in 3rd position of key1 results
  - url4 is present in 4th position of key1 results

and i would like that solr consider this when executing queries.

Any helps please; and thanks for all :)




Re: Optimize not having any effect on my index

2009-12-18 Thread Sascha Szott

Hi Aleksander,

Aleksander Stensby wrote:

So i tried with curl:
curl http://server:8983/solr/update --data-binary '<optimize/>' -H
'Content-type:text/xml; charset=utf-8'

No difference here either... Am I doing anything wrong? Do i need to issue a
commit after the optimize?
Did you restart the Solr server instance after the optimize operation 
was completed?


BTW: You could initiate the optimization operation by POSTing 
optimize=true directly, i.e.,


curl http://server:8983/solr/update --form-string optimize=true


-Sascha



Re: Exception from Spellchecker

2009-12-15 Thread Sascha Szott

Hi Rafael,

Rafael Pappert wrote:

I try to enable the spellchecker in my 1.4.0 solr (running with tomcat 6 on 
debian).
But I always get the following exception, when I try to open 
http://localhost:8080/spell?:


The spellcheck=true parameter is missing in your request. Try

http://localhost:8080/spell?q=<your query>&spellcheck=true

-Sascha



RE: search on tomcat server

2009-12-07 Thread Sascha Szott
Hi Jill,

just to make sure your index contains at least one document, what is the
output of

http://localhost:8080/solr/select?q=*:*&debugQuery=true&echoParams=all

Best,
Sascha

Jill Han wrote:
 In fact, I just followed the instructions titled "Tomcat On Windows".
 Here are the updates on my computer
 1. -Dsolr.solr.home=C:\solr\example
 2. change dataDir to <dataDir>C:\solr\example\data</dataDir> in
 solrconfig.xml at C:\solr\example\conf
 3. created solr.xml at C:\Tomcat 5.5\conf\Catalina\localhost
 <?xml version="1.0" encoding="utf-8"?>
 <Context docBase="c:/solr/example/apache-solr-1.3.0.war" debug="0"
 crossContext="true">
   <Environment name="solr/home" type="java.lang.String"
 value="c:/solr/example" override="true"/>
 </Context>

 I restarted Tomcat, went to http://localhost:8080/solr/admin/
 Entered video in Query String field, and got
 /**
 <?xml version="1.0" encoding="UTF-8" ?>
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
     <lst name="params">
       <str name="rows">10</str>
       <str name="start">0</str>
       <str name="indent">on</str>
       <str name="q">video</str>
       <str name="version">2.2</str>
     </lst>
   </lst>
   <result name="response" numFound="0" start="0" />
 </response>
 */
 My questions are
 1. is the setting correct?
 2. where does Solr start to search for words entered in the Query String field?
 3. how can I make the result page like a general search result page, such as
 showing "not found", or, if found, a URL, instead of returning XML?


 Thanks a lot for your helps,

 Jill

 -Original Message-
 From: William Pierce [mailto:evalsi...@hotmail.com]
 Sent: Friday, December 04, 2009 12:56 PM
 To: solr-user@lucene.apache.org
 Subject: Re: search on tomcat server

 Have you gone through the solr tomcat wiki?

 http://wiki.apache.org/solr/SolrTomcat

 I found this very helpful when I did our solr installation on tomcat.

 - Bill

 --
 From: Jill Han jill@alverno.edu
 Sent: Friday, December 04, 2009 8:54 AM
 To: solr-user@lucene.apache.org
 Subject: RE: search on tomcat server

 I went through all the links on
 http://wiki.apache.org/solr/#Search_and_Indexing
 And still have no clue as how to proceed.
 1. do I have to do some implementation in order to get solr to search
 doc.
 on tomcat server?
 2. if I have files, such as .doc, docx, .pdf, .jsp, .html, etc under
 window xp, c:/tomcat/webapps/test1, /webapps/test2,
   What should I do to make solr search those directories
 3. since I am using tomcat, instead of jetty, is there any demo that
 shows
 the solr searching features, and real searching result?

 Thanks,
 Jill


 -Original Message-
 From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
 Sent: Monday, November 30, 2009 10:40 AM
 To: solr-user@lucene.apache.org
 Subject: Re: search on tomcat server

 On Mon, Nov 30, 2009 at 9:55 PM, Jill Han jill@alverno.edu wrote:

 I got solr running on the tomcat server,
 http://localhost:8080/solr/admin/

 After I enter a search word, such as, solr, then hit Search button, it
 will go to

 http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on

  and display

   <?xml version="1.0" encoding="UTF-8" ?>
   <response>
     <lst name="responseHeader">
       <int name="status">0</int>
       <int name="QTime">0</int>
       <lst name="params">
         <str name="rows">10</str>
         <str name="start">0</str>
         <str name="indent">on</str>
         <str name="q">solr</str>
         <str name="version">2.2</str>
       </lst>
     </lst>
     <result name="response" numFound="0" start="0" />
   </response>

  My question is what is the next step to search files on tomcat
 server?



 Looks like you have not added any documents to Solr. See the Indexing
 Documents section at http://wiki.apache.org/solr/#Search_and_Indexing

 --
 Regards,
 Shalin Shekhar Mangar.





How to instruct MoreLikeThisHandler to sort results

2009-12-03 Thread Sascha Szott

Hi Folks,

is there any way to instruct MoreLikeThisHandler to sort results? I was 
wondering that MLTHandler recognizes faceting parameters among others, 
but it ignores the sort parameter.


Best,
Sascha



Re: Hierarchical xml

2009-12-02 Thread Sascha Szott

Pooja,

have a look at Solr's DataImportHandler. XPathEntityProcessor [1] should 
suit your needs.


Best,
Sascha

[1] http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor

Pooja Verlani schrieb:

Hi,
I want to index an xml like following:

<officer>
  <name>John</name>
  <dob>1979-29-17T28:14:48Z</dob>
  <collegeGroup>
    <college>
      <name>ABC College</name>
      <year>1998</year>
    </college>
    <college>
      <name>PQRS College</name>
      <year>2001</year>
    </college>
    <college>
      <name>XYZ College</name>
      <year>2003</year>
    </college>
  </collegeGroup>
</officer>

I am not able to judge what the schema should look like.
Also, if I flatten such an XML and make college name and year multivalued,
like this:
<college_name>ABC College, PQRS College, XYZ College</college_name>
<college_year>1998,2001,2003</college_year>

In such a scenario I can't make a correspondence between ABC College and
year 1998.

In case someone has an efficient way out, do share.
Thanks in anticipation.

Regards,
Pooja
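To make the XPathEntityProcessor suggestion concrete, here is an untested data-config.xml sketch that emits one Solr document per college element, which preserves the name/year correspondence the flattened variant loses. The file path and column names are invented for illustration:

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- one row (and thus one Solr document) per college element -->
    <entity name="college"
            processor="XPathEntityProcessor"
            url="/path/to/officer.xml"
            forEach="/officer/collegeGroup/college">
      <!-- commonField="true" repeats per-file values on every row -->
      <field column="officer_name" xpath="/officer/name" commonField="true" />
      <field column="dob" xpath="/officer/dob" commonField="true" />
      <field column="college_name" xpath="/officer/collegeGroup/college/name" />
      <field column="college_year" xpath="/officer/collegeGroup/college/year" />
    </entity>
  </document>
</dataConfig>
```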





Re: Indexing file content with custom field

2009-12-02 Thread Sascha Szott

Piero,

it sounds you're looking for an integration of Solr Cell and Solr's DIH 
facility -- a feature that isn't implemented yet (but the issue is 
already addressed in Solr-1358).


As a workaround, you could store the extracted contents in plain text 
files (either by using Solr Cell or Apache Tika directly, which is under 
the hood of Solr Cell). Afterwards, you could use DIH's 
XPathEntityProcessor (to read the metadata in your XML files) in 
conjunction with DIH's PlainTextEntityProcessor (to read the previously 
created text files).


Another workaround would be to pass the metadata content as literal 
parameters along with the /update/extract request, as described in [1]. 
This would require you to write a small program that constructs and 
sends appropriate POST requests by parsing your XML metadata files.


Best,
Sascha

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
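The literals approach boils down to sending one /update/extract request per document, with each metadata value passed as a literal.* parameter alongside the uploaded file. A small sketch of building such a request URL; the field names and base URL are placeholders, and the actual multipart POST of the PDF is omitted:

```python
from urllib.parse import urlencode

def build_extract_url(solr_base, doc_id, metadata):
    """Build an /update/extract URL that passes each metadata field
    as a literal.* parameter alongside the uploaded file."""
    params = [("literal.id", doc_id)]
    for name, value in metadata.items():
        params.append(("literal." + name, value))
    return solr_base + "/update/extract?" + urlencode(params)

url = build_extract_url("http://localhost:8983/solr",
                        "doc-42", {"myfield-1": "foo"})
print(url)
```

The resulting URL would then be the target of a POST carrying the binary document.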

Rodolico Piero wrote:

Hi,

I need to index the contents of a file (doc, pdf, ecc) and a set of
custom metadata specified in the XML like a standard request to Solr.
From the documentation I can extract the contents of a file with the
request /update/extract (tika) and index metadata with a second
request /update by passing the XML. How do I do it all in a single
request? (without using curl but using http java lib or solrj). For
example (although I know that is not correct):

<add>
  <doc>
    <field name="id"></field>
    <field name="myfield-1"></field>
    <field name="myfield-n"></field>
    <field name="content">content of the extracted file (text)</field>
  </doc>
</add>

So I search it or by using metadata or full text on the content.
Sorry for my English ...

Thanks a lot.

 


Piero

 








[Solved] Re: VelocityResponseWriter/Solritas character encoding issue

2009-11-27 Thread Sascha Szott

Hi Erik,

I've finally solved the problem. Unfortunately, the parameter 
v.contentType was not described in the Solr wiki (I've fixed that now). 
The point is, you must specify (in your solrconfig.xml)


   <str name="v.contentType">text/xml;charset=UTF-8</str>

in order to receive correctly UTF-8 encoded HTML. That's it!

Best,
Sascha

Erik Hatcher schrieb:

Sascha,

Can you give me a test document that causes an issue?  (maybe send me a 
Solr XML document in private e-mail).   I'll see what I can do once I 
can see the issue first hand.


Erik


On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote:


Hi,

I've played around with Solr's VelocityResponseWriter (which is indeed 
a very useful feature for rapid prototyping). I've realized that 
Velocity uses ISO-8859-1 as default character encoding. I've changed 
this setting to UTF-8 in my velocity.properties file (inside the conf 
directory), i.e.,


  input.encoding=UTF-8
  output.encoding=UTF-8

and checked that the settings were successfully loaded.

Within the main Velocity template, browse.vm, the character encoding 
is set to UTF-8 as well, i.e.,


  <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

After starting Solr (which is deployed in a Tomcat 6 server on a 
Ubuntu machine), I ran into some character encoding problems.


Due to the change of input.encoding to UTF-8, no problems occur when 
non-ASCII characters are presend in the query string, e.g. german 
umlauts. But unfortunately, something is wrong with the encoding of 
characters in the html page that is generated by 
VelocityResponseWriter. The non-ASCII characters aren't displayed 
properly (for example, FF prints a black diamond with a white question 
mark). If I manually set the encoding to ISO-8859-1, the non-ASCII 
characters are displayed correctly. Does anybody have a clue?


Thanks in advance,
Sascha






VelocityResponseWriter/Solritas character encoding issue

2009-11-18 Thread Sascha Szott

Hi,

I've played around with Solr's VelocityResponseWriter (which is indeed a 
very useful feature for rapid prototyping). I've realized that Velocity 
uses ISO-8859-1 as default character encoding. I've changed this setting 
to UTF-8 in my velocity.properties file (inside the conf directory), i.e.,


   input.encoding=UTF-8
   output.encoding=UTF-8

and checked that the settings were successfully loaded.

Within the main Velocity template, browse.vm, the character encoding is 
set to UTF-8 as well, i.e.,


   <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

After starting Solr (which is deployed in a Tomcat 6 server on a Ubuntu 
machine), I ran into some character encoding problems.


Due to the change of input.encoding to UTF-8, no problems occur when 
non-ASCII characters are presend in the query string, e.g. german 
umlauts. But unfortunately, something is wrong with the encoding of 
characters in the html page that is generated by VelocityResponseWriter. 
The non-ASCII characters aren't displayed properly (for example, FF 
prints a black diamond with a white question mark). If I manually set 
the encoding to ISO-8859-1, the non-ASCII characters are displayed 
correctly. Does anybody have a clue?


Thanks in advance,
Sascha









Re: VelocityResponseWriter/Solritas character encoding issue

2009-11-18 Thread Sascha Szott

Hi Erik,

Erik Hatcher wrote:
Can you give me a test document that causes an issue?  (maybe send me a 
Solr XML document in private e-mail).   I'll see what I can do once I 
can see the issue first hand.
Thank you! Just try the utf8-example.xml file in the exampledocs 
directory. After having indexed the document, the output of the script 
test_utf8.sh suggests to me that everything works correctly:


 Solr server is up.
 HTTP GET is accepting UTF-8
 HTTP POST is accepting UTF-8
 HTTP POST does not default to UTF-8
 HTTP GET is accepting UTF-8 beyond the basic multilingual plane
 HTTP POST is accepting UTF-8 beyond the basic multilingual plane
 HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual

If I'm using the standard QueryResponseWriter and the query q=umlauts, 
the responding xml page contains properly printed non-ASCII characters. 
The same query against the VelocityResponseWriter returns a lot of 
Unicode replacement characters (u+FFFD) instead.


-Sascha



On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote:


Hi,

I've played around with Solr's VelocityResponseWriter (which is indeed 
a very useful feature for rapid prototyping). I've realized that 
Velocity uses ISO-8859-1 as default character encoding. I've changed 
this setting to UTF-8 in my velocity.properties file (inside the conf 
directory), i.e.,


  input.encoding=UTF-8
  output.encoding=UTF-8

and checked that the settings were successfully loaded.

Within the main Velocity template, browse.vm, the character encoding 
is set to UTF-8 as well, i.e.,


  <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

After starting Solr (which is deployed in a Tomcat 6 server on a 
Ubuntu machine), I ran into some character encoding problems.


Due to the change of input.encoding to UTF-8, no problems occur when 
non-ASCII characters are present in the query string, e.g. German 
umlauts. But unfortunately, something is wrong with the encoding of 
characters in the html page that is generated by 
VelocityResponseWriter. The non-ASCII characters aren't displayed 
properly (for example, FF prints a black diamond with a white question 
mark). If I manually set the encoding to ISO-8859-1, the non-ASCII 
characters are displayed correctly. Does anybody have a clue?


Thanks in advance,
Sascha











Re: Indexing multiple documents in Solr/SolrCell

2009-11-17 Thread Sascha Szott

Kerwin,

Kerwin wrote:

Our approach is similar to what you have mentioned in the jira issue except
that we have all metadata in the xml and not in the database. I am therefore
using a custom XmlUpdateRequestHandler to parse the XML and then calling
Tika from within the XML Loader to parse the content. Until now this seems
to work.
When and in which Solr version do you expect the jira issue to be
addressed?
That's a good question. Since I'm not a Solr committer, I cannot give 
any estimate on when it will be released (hopefully in Solr 1.5).


-Sascha


On Mon, Nov 16, 2009 at 5:02 PM, Sascha Szott sz...@zib.de wrote:


Hi,

the problem you've described -- an integration of DataImportHandler (to
traverse the XML file and get the document urls) and Solr Cell (to extract
content afterwards) -- is already addressed in issue SOLR-1358 (
https://issues.apache.org/jira/browse/SOLR-1358).

Best,
Sascha


Kerwin wrote:


Hi,

I am new to this forum and would like to know if the function described
below has been developed or exists in Solr. If it does not exist, is it a
good idea, and can I contribute?

We need to index multiple documents with different formats. So we use Solr
with Tika (Solr Cell).

Question:
Can you index both metadata and content for multiple documents iteratively
in Solr?
For example I have an XML with metadata and links to the documents'
content. There are many documents in this XML and I would like to index
them
all without firing multiple URLs.

Example of XML
<add>
<doc>
<field name="id">34122</field>
<field name="author">Michael</field>
<field name="size">3MB</field>
<field name="URL">URL of the document</field>
</doc>
</add>
<doc2>.</doc2>...<docN>

I need to index all these documents by sending this XML in a single
URL. The
collection of documents to be indexed could be on a file system.

I have altered the Solr code to be able to do this but is there an already
existing feature?






Re: Indexing multiple documents in Solr/SolrCell

2009-11-16 Thread Sascha Szott

Hi,

the problem you've described -- an integration of DataImportHandler (to 
traverse the XML file and get the document urls) and Solr Cell (to 
extract content afterwards) -- is already addressed in issue SOLR-1358 
(https://issues.apache.org/jira/browse/SOLR-1358).


Best,
Sascha

Kerwin wrote:

Hi,

I am new to this forum and would like to know if the function described
below has been developed or exists in Solr. If it does not exist, is it a
good idea, and can I contribute?

We need to index multiple documents with different formats. So we use Solr
with Tika (Solr Cell).

Question:
Can you index both metadata and content for multiple documents iteratively
in Solr?
For example I have an XML with metadata and links to the documents'
content. There are many documents in this XML and I would like to index them
all without firing multiple URLs.

Example of XML
<add>
<doc>
<field name="id">34122</field>
<field name="author">Michael</field>
<field name="size">3MB</field>
<field name="URL">URL of the document</field>
</doc>
</add>
<doc2>.</doc2>...<docN>

I need to index all these documents by sending this XML in a single URL. The
collection of documents to be indexed could be on a file system.

I have altered the Solr code to be able to do this but is there an already
existing feature?





Re: [DIH] blocking import operation

2009-11-12 Thread Sascha Szott
Noble Paul wrote:
 Yes, open an issue. This is a trivial change
I've opened JIRA issue SOLR-1554.

-Sascha


 On Thu, Nov 12, 2009 at 5:08 AM, Sascha Szott sz...@zib.de wrote:
 Noble,

 Noble Paul wrote:
 DIH imports are really long running. There is a good chance that the
 connection times out or breaks in between.
 Yes, you're right, I missed that point (in my case imports take no
 longer
 than a minute).

 how about a callback?
 Thanks for the hint. There was a discussion on adding a callback url to
 DIH a month ago, but it seems that no issue was raised. So, up to now
 its
 only possible to implement an appropriate Solr EventListener. Should we
 open an issue for supporting callback urls?

 Best,
 Sascha


 On Tue, Nov 10, 2009 at 12:12 AM, Sascha Szott sz...@zib.de wrote:
 Hi all,

 currently, DIH's import operation(s) only works asynchronously.
 Therefore,
 after submitting an import request, DIH returns immediately, while the
 import process (in case a large amount of data needs to be indexed)
 continues asynchronously behind the scenes.

 So, what is the recommended way to check if the import process has
 already
 finished? Or still better, is there any method / workaround that will
 block
 the import operation's caller until the operation has finished?

 In my application, the DIH receives some URL parameters which are used
 for
 determining the database name that is used within data-config.xml,
 e.g.

  http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

 Since only one DIH, /dataimport, is defined, but several database
 needs
 to
 be indexed, it is required to issue this command several times, e.g.

  http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

 ... wait until /dataimport?command=status says Indexing completed
 (but
 without using a loop that checks it again and again) ...

  http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false


 A suitable solution, at least IMHO, would be to have an additional DIH
 parameter which determines whether the import call is blocking or
 non-blocking (the default). As far as I see, this could be accomplished
 since
 Solr can execute more than one import operation at a time (it starts a
 new
 thread for each). Perhaps, my question is somehow related to the
 discussion
 [1] on ParallelDataImportHandler.

 Best,
 Sascha

 [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee





 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com




Re: [DIH] concurrent requests to DIH

2009-11-12 Thread Sascha Szott
Hi Avlesh,

Avlesh Singh wrote:

 1. Is it considered as good practice to set up several DIH request
 handlers, one for each possible parameter value?

 Nothing wrong with this. My assumption is that you want to do this to
 speed
 up indexing. Each DIH instance would block all others, once a Lucene
 commit
 for the former is performed.
Thanks for this clarification.

 2. In case the range of parameter values is broad, it's not convenient to
 define separate request handlers for each value. But this entails a
 limitation (as far as I see): It is not possible to fire several request
 to the same DIH handler (with different parameter values) at the same
 time.

 Nope.

 I had done a similar exercise in my quest to write a
 ParallelDataImportHandler. This thread might be of interest to you -
 http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler.
 Though there is a ticket in JIRA, I haven't been able to contribute this
 back. If you think this is what you need, lemme know.
Actually, I've already read this thread. In my opinion, both support for
batch processing and multi-threading are important extensions of DIH's
current capabilities, though issue SOLR-1352 mainly targets the latter. Is
your PDIH implementation able to deal with batch processing right now?

Best,
Sascha

 On Thu, Nov 12, 2009 at 6:35 AM, Sascha Szott sz...@zib.de wrote:

 Hi all,

 I'm using the DIH in a parameterized way by passing request parameters
 that are used inside of my data-config. All imports end up in the same
 index.

 1. Is it considered as good practice to set up several DIH request
 handlers, one for each possible parameter value?

 2. In case the range of parameter values is broad, it's not convenient
 to
 define separate request handlers for each value. But this entails a
 limitation (as far as I see): It is not possible to fire several request
 to the same DIH handler (with different parameter values) at the same
 time. However, in case several request handlers would be used (as in
 1.),
 concurrent requests (to the different handlers) are possible. So, how to
 overcome this limitation?

 Best,
 Sascha





Re: [DIH] blocking import operation

2009-11-11 Thread Sascha Szott
Noble,

Noble Paul wrote:
 DIH imports are really long running. There is a good chance that the
 connection times out or breaks in between.
Yes, you're right, I missed that point (in my case imports take no longer
than a minute).

 how about a callback?
Thanks for the hint. There was a discussion on adding a callback url to
DIH a month ago, but it seems that no issue was raised. So, up to now its
only possible to implement an appropriate Solr EventListener. Should we
open an issue for supporting callback urls?

Best,
Sascha


 On Tue, Nov 10, 2009 at 12:12 AM, Sascha Szott sz...@zib.de wrote:
 Hi all,

 currently, DIH's import operation(s) only works asynchronously.
 Therefore,
 after submitting an import request, DIH returns immediately, while the
 import process (in case a large amount of data needs to be indexed)
 continues asynchronously behind the scenes.

 So, what is the recommended way to check if the import process has
 already
 finished? Or still better, is there any method / workaround that will
 block
 the import operation's caller until the operation has finished?

 In my application, the DIH receives some URL parameters which are used
 for
 determining the database name that is used within data-config.xml, e.g.

  http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

 Since only one DIH, /dataimport, is defined, but several database needs
 to
 be indexed, it is required to issue this command several times, e.g.

  http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

 ... wait until /dataimport?command=status says Indexing completed (but
 without using a loop that checks it again and again) ...

  http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false


 A suitable solution, at least IMHO, would be to have an additional DIH
  parameter which determines whether the import call is blocking or
  non-blocking (the default). As far as I see, this could be accomplished
 since
 Solr can execute more than one import operation at a time (it starts a
 new
 thread for each). Perhaps, my question is somehow related to the
 discussion
 [1] on ParallelDataImportHandler.

 Best,
 Sascha

 [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee



[DIH] concurrent requests to DIH

2009-11-11 Thread Sascha Szott
Hi all,

I'm using the DIH in a parameterized way by passing request parameters
that are used inside of my data-config. All imports end up in the same
index.

1. Is it considered as good practice to set up several DIH request
handlers, one for each possible parameter value?

2. In case the range of parameter values is broad, it's not convenient to
define separate request handlers for each value. But this entails a
limitation (as far as I see): It is not possible to fire several request
to the same DIH handler (with different parameter values) at the same
time. However, in case several request handlers would be used (as in 1.),
concurrent requests (to the different handlers) are possible. So, how to
overcome this limitation?

Best,
Sascha
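As a sketch of option 1, separate handlers can be declared in solrconfig.xml, each with its own defaults. The handler names and the dbname parameter are illustrative, assuming the data-config reads ${dataimporter.request.dbname}:

```xml
<requestHandler name="/dataimport-foo"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="dbname">foo</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-bar"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="dbname">bar</str>
  </lst>
</requestHandler>
```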


[DIH] SqlEntityProcessor does not recognize onError attribute

2009-11-09 Thread Sascha Szott

Hi all,

as stated in the Solr-WIKI, Solr 1.4 allows it to specify an onError 
attribute for *each* entity listed in the data config file (it is 
considered as one of the default attributes).


Unfortunately, the SqlEntityProcessor does not recognize the attribute's 
value -- i.e., in case an SQL exception is thrown somewhere inside the 
constructor of ResultSetIterator (which is an inner class of 
JdbcDataSource), Solr's import exits immediately, even though onError is 
set to continue or skip.


Why are database related exceptions (e.g., table does not exists, or an 
error in query syntax occurs) not being covered by the onError 
attribute? In my opinion, use cases exist that will profit from such an 
exception handling inside of Solr (for example, in cases where the 
existence of certain database tables or views is not predictable).


Should I raise a JIRA issue about this?

-Sascha
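For context, this is where the attribute in question would sit in data-config.xml (table and column names are invented); the complaint is that SQL-level failures bypass it:

```xml
<entity name="t"
        query="select id, title from t"
        onError="continue">
  <field column="id" name="id" />
  <field column="title" name="title" />
</entity>
```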




Re: [DIH] SqlEntityProcessor does not recognize onError attribute

2009-11-09 Thread Sascha Szott

Hi,

Noble Paul നോബിള്‍ नोब्ळ् wrote:

On Mon, Nov 9, 2009 at 4:24 PM, Sascha Szott sz...@zib.de wrote:

Hi all,

as stated in the Solr-WIKI, Solr 1.4 allows it to specify an onError
attribute for *each* entity listed in the data config file (it is considered
as one of the default attributes).

Unfortunately, the SqlEntityProcessor does not recognize the attribute's
value -- i.e., in case an SQL exception is thrown somewhere inside the
constructor of ResultSetIterator (which is an inner class of
JdbcDataSource), Solr's import exits immediately, even though onError is set
to continue or skip.

Why are database related exceptions (e.g., table does not exists, or an
error in query syntax occurs) not being covered by the onError attribute? In
my opinion, use cases exist that will profit from such an exception handling
inside of Solr (for example, in cases where the existence of certain
database tables or views is not predictable).

We thought DB errors are not to be ignored because errors such as
table does not exist can be really serious.
In principle, I agree with you, though I would consider it as a 
programmer's responsibility to be aware of it (in case he/she sets 
onError to skip or continue).



Should I raise a JIRA issue about this?

 Raise an issue; it can be fixed.

I've created issue SOLR-1549.

Best,
Sascha



[DIH] blocking import operation

2009-11-09 Thread Sascha Szott

Hi all,

currently, DIH's import operation(s) only works asynchronously. 
Therefore, after submitting an import request, DIH returns immediately, 
while the import process (in case a large amount of data needs to be 
indexed) continues asynchronously behind the scenes.


So, what is the recommended way to check if the import process has 
already finished? Or still better, is there any method / workaround that 
will block the import operation's caller until the operation has finished?


In my application, the DIH receives some URL parameters which are used 
for determining the database name that is used within data-config.xml, e.g.


http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

Since only one DIH, /dataimport, is defined, but several database needs 
to be indexed, it is required to issue this command several times, e.g.


http://localhost:8983/solr/dataimport?command=full-import&dbname=foo

... wait until /dataimport?command=status says Indexing completed (but 
without using a loop that checks it again and again) ...


http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false


A suitable solution, at least IMHO, would be to have an additional DIH 
parameter which determines whether the import call is blocking or 
non-blocking (the default). As far as I see, this could be accomplished 
since Solr can execute more than one import operation at a time (it 
starts a new thread for each). Perhaps, my question is somehow related 
to the discussion [1] on ParallelDataImportHandler.


Best,
Sascha

[1] http://www.lucidimagination.com/search/document/a9b26ade46466ee
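Until a callback or blocking mode exists, the usual workaround is a polling loop against the status command. A sketch, with the HTTP fetch injected so the loop can run without a live Solr; treat the "busy" status string as an assumption about DIH's response format:

```python
import time

def wait_for_import(fetch_status, poll_interval=2.0, timeout=600.0):
    """Poll DIH's status command until the import is no longer busy.

    fetch_status is a callable returning the raw response body of
    /dataimport?command=status; it is injected so the loop can be
    tested without a running Solr instance.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = fetch_status()
        # DIH reports <str name="status">busy</str> while an import runs
        if "busy" not in body:
            return body
        time.sleep(poll_interval)
    raise TimeoutError("DIH import did not finish within %s seconds" % timeout)

# simulated status responses: busy twice, then idle
responses = iter(['<str name="status">busy</str>',
                  '<str name="status">busy</str>',
                  '<str name="status">idle</str>'])
print(wait_for_import(lambda: next(responses), poll_interval=0.0))
```

In real use, fetch_status would perform the HTTP GET against the /dataimport handler.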



Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-03 Thread Sascha Szott

Hi Khai,

a few weeks ago, I was facing the same problem.

In my case, this workaround helped (assuming, you're using Solr 1.3): 
For each row, extract the content from the corresponding pdf file using 
a parser library of your choice (I suggest Apache PDFBox or Apache Tika 
in case you need to process other file types as well), put it between


<foo><![CDATA[

and

]]></foo>

and store it in a text file. To keep the relationship between a file and 
its corresponding database row, use the primary key as the file name.
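The wrapping step could be scripted like this; the extraction call itself is left out (any parser library such as PDFBox or Tika would do), and only the CDATA wrapping and the primary-key file naming are shown. Note that raw text may itself contain the CDATA terminator "]]>", which has to be split:

```python
def wrap_as_xml(text):
    """Wrap extracted plain text so XPathEntityProcessor can read it.
    A literal ']]>' inside a CDATA section must be split across two
    sections, hence the replace."""
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<foo><![CDATA[" + safe + "]]></foo>"

def write_for_row(primary_key, text, out_dir="."):
    # file name is the primary key, so DIH can find the file
    # via ${dbRow.primaryKey}.xml in the entity's url attribute
    path = "%s/%s.xml" % (out_dir, primary_key)
    with open(path, "w", encoding="utf-8") as f:
        f.write(wrap_as_xml(text))
    return path

print(wrap_as_xml("extracted pdf text"))
```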


Within data-config.xml use the XPathEntityProcessor as follows (replace 
dbRow and primaryKey respectively):


<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo" />
</entity>


And, by the way, in Solr 1.4 you do not have to put your content between 
xml tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.


Best,
Sascha

Khai Doan schrieb:

Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully use DataImportHandler to import this data into Apache Solr.
However, one of the column store the location of PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan





Building documents using content residing both in database tables and text files

2009-08-11 Thread Sascha Szott

Hello,

is it possible (and if it is, how can I accomplish it) to configure DIH 
to build up index documents by using content that resides in different 
data sources?


Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the 
primary key of T) and TITLE. Furthermore, each record in T is assigned a 
directory containing text files that were generated out of pdf documents 
by using Tika. A directory name is built by using the ID of the record 
in T associated to that directory, e.g. all text files associated to a 
record with id = 101 are stored in directory 101.


Is there a way to configure DIH such that it uses ID, TITLE and the 
content of all related text files when building a document (the 
documents should have three fields: id, title, and text)?


Furthermore, as you may have noticed, a second question arises 
naturally: Will there be any integration of Solr Cell and DIH in an 
upcoming release, so that it would be possible to directly use the pdf 
documents instead of the extracted text files that were generated 
outside of Solr?


Best,
Sascha



Re: Building documents using content residing both in database tables and text files

2009-08-11 Thread Sascha Szott

Hi Noble,

Noble Paul wrote:

isn't it possible to do this by having two datasources (one Jdbc and
another File) and two entities. The outer entity can read from a DB
and the inner entity can read from a file.

Yes, it is. Here's my db-data-config.xml file:

<!-- definition of data sources -->
<dataSource name="ds.database"
            driver="..."
            url="..."
            user="..."
            password="..." />
<dataSource name="ds.filesystem"
            type="FileDataSource" />

<!-- building the document using both db and file content
     (files are stored in /tmp/recordId)
-->
<document name="doc">
  <entity name="t" query="select * from t" dataSource="ds.database">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <entity name="dir"
            processor="FileListEntityProcessor"
            baseDir="/tmp/${id}"
            fileName=".*"
            dataSource="null"
            rootEntity="false">
      <entity name="file"
              dataSource="ds.filesystem"
              processor="XPathEntityProcessor"
              forEach="/root"
              url="${dir.fileAbsolutePath}"
              stream="false">
        <field column="text" xpath="/root" />
      </entity>
    </entity>
  </entity>
</document>


Only one additional adjustment has to be made: Since I'm using Solr 1.3 
and it comes without PlainTextEntityProcessor, I have to transform my 
plain text files into XML files by surrounding the content with a root 
element. That's all!



On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szottsz...@zib.de wrote:

Hello,

is it possible (and if it is, how can I accomplish it) to configure DIH to
build up index documents by using content that resides in different data
sources?

Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the primary
key of T) and TITLE. Furthermore, each record in T is assigned a directory
containing text files that were generated out of pdf documents by using
Tika. A directory name is built by using the ID of the record in T
associated to that directory, e.g. all text files associated to a record
with id = 101 are stored in directory 101.

Is there a way to configure DIH such that it uses ID, TITLE and the content
of all related text files when building a document (the documents should
have three fields: id, title, and text)?

Furthermore, as you may have noticed, a second question arises naturally:
Will there be any integration of Solr Cell and DIH in an upcoming release,
so that it would be possible to directly use the pdf documents instead of
the extracted text files that were generated outside of Solr?


This is something I wish to see. But there has been no user request
yet. You can raise an issue and it can be looked upon

I've raised issue SOLR-1358.

Best,
Sascha