Re: ngramfilter minGramSize problem
it works well. Now, why does the search only find something when the field name is added to a query containing stopwords?

  cug                 - 9 hits
  mit cug             - 0 hits
  plain_text:mit cug  - 9 hits

Why is this so? Could it be a problem that stopwords aren't removed from the query because not all fields that are searched have the stopword filter?

On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI <furkankam...@gmail.com> wrote:

Correction: My patch is at SOLR-5152

On 7 Apr 2014 at 01:05, Andreas Owen <ao...@swissonline.ch> wrote:

i thought i could use <filter class="solr.LengthFilterFactory" min="1" max="2"/> to index and search words that are only 1 or 2 chars long. It seems to work, but I have to test it some more.

On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen <ao...@swissonline.ch> wrote:
[...]

-- Using Opera's mail client: http://www.opera.com/mail/
ngramfilter minGramSize problem
I have a fieldtype that uses the ngram filter while indexing. Is there a setting that can force the ngram filter to also index words smaller than minGramSize? Mine is set to 3, and the search won't find words that are only 1 or 2 chars long. I would prefer not to set minGramSize=1 because the results would be too diverse.

Fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
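Not part of the original mail, but the behaviour is easy to reproduce outside Solr: an n-gram filter with minGramSize=3 simply emits nothing for tokens shorter than 3 characters, so short words never reach the index. A minimal Python sketch (the function name is mine, not Solr's):

```python
def ngrams(token, min_size=3, max_size=50):
    """Emit all n-grams of a token, like solr.NGramFilterFactory would."""
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]

# A 3-char word produces a gram and is findable...
print(ngrams("cug"))   # ['cug']
# ...but 1-2 char tokens produce no grams at all, hence 0 hits.
print(ngrams("yh"))    # []
```

This is why lowering minGramSize is the only knob on the filter itself; anything shorter than the minimum is silently dropped.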
Re: ngramfilter minGramSize problem
i thought i could use

<filter class="solr.LengthFilterFactory" min="1" max="2"/>

to index and search words that are only 1 or 2 chars long. It seems to work, but I have to test it some more.

On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen <ao...@swissonline.ch> wrote:
[...]

-- Using Opera's mail client: http://www.opera.com/mail/
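To illustrate the workaround above (my sketch, not the actual Solr filter classes): the idea is that 1-2 char tokens are kept as-is while longer tokens are n-grammed, so both end up as index terms. Note that in Solr itself a single LengthFilter removes everything outside its range, so this would need two filter branches or a copyField into a second field:

```python
def ngrams(token, min_size=3, max_size=50):
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]

def index_terms(tokens):
    """Short tokens (1-2 chars) pass through; longer ones are n-grammed."""
    terms = []
    for t in tokens:
        if 1 <= len(t) <= 2:          # what LengthFilterFactory min=1 max=2 keeps
            terms.append(t)
        terms.extend(ngrams(t))       # what NGramFilterFactory minGramSize=3 emits
    return terms

print(index_terms(["yh", "cug"]))     # ['yh', 'cug'] - both now searchable
```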
dih data-config.xml onImportEnd event
I would like to call a URL after the import is finished, using the onImportEnd event of the <document> element (<document onImportEnd="...">). How can I do this?
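A sketch of what I'd expect to work, with a hypothetical listener class name: DIH's <document> element accepts onImportStart/onImportEnd attributes naming a class that implements org.apache.solr.handler.dataimport.EventListener, and that listener can fire the HTTP call:

```xml
<!-- data-config.xml: com.example.NotifyUrlListener is a hypothetical class name -->
<document onImportEnd="com.example.NotifyUrlListener">
  <!-- entities as before -->
</document>
```

The class would implement EventListener and perform the HTTP request in its onEvent(Context) method; the jar has to be on Solr's classpath.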
facet doesn't display all possibilities after selecting one
When I select a facet value in thema_f, all the other values in that group disappear, while the other facet fields keep their original counts. It seems like it should work. Maybe the underscore is the wrong character for the separator?

Example documents in index:

<doc>
  <arr name="thema_f"><str>1_Produkte</str></arr>
  <str name="id">dms:381</str>
</doc>
<doc>
  <arr name="thema_f"><str>1_Beratung</str><str>1_Beratung_Beratungsportal PK</str></arr>
  <str name="id">dms:2679</str>
</doc>
<doc>
  <arr name="thema_f"><str>1_Beratung</str><str>1_Beratung_Beratungsportal PK</str></arr>
  <str name="id">dms:190</str>
</doc>

solrconfig.xml:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 productsegment^5 productgroup^5 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.missing">false</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=productsegment_f}productsegment_f</str>
    <str name="f.productsegment_f.facet.sort">index</str>
    <str name="facet.field">{!ex=productgroup_f}productgroup_f</str>
    <str name="f.productgroup_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.field">{!ex=kundensegment_aktive_beratung}kundensegment_aktive_beratung</str>
    <str name="f.kundensegment_aktive_beratung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>

schema.xml:

<fieldType name="text_thema" class="solr.TextField" positionIncrementGap="100">
  <!-- <analyzer><tokenizer class="solr.PatternTokenizerFactory" pattern="_"/></analyzer> -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
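One guess from outside (the actual fq being sent isn't shown in this mail): with facet.field={!ex=thema_f}thema_f, the exclusion only takes effect if the filter query applied when the user clicks a facet value carries the matching tag, e.g.:

```text
fq={!tag=thema_f}thema_f:"1_Beratung"
```

Without {!tag=thema_f} on the fq, thema_f's own counts are computed with the filter applied, which would make the sibling values disappear exactly as described, while the other facet fields (whose ex names reference different tags) are unaffected.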
Re: dih data-config.xml onImportEnd event
Sorry, the previous conversation was started with a wrong email address.

On Thu, 27 Mar 2014 14:06:57 +0100, Stefan Matheis <matheis.ste...@gmail.com> wrote:

I would suggest you read the replies to your last mail (containing the very same question) first?

-Stefan

On Thursday, March 27, 2014 at 1:56 PM, Andreas Owen wrote:

I would like to call a URL after the import is finished, using the onImportEnd event of the <document> element. How can I do this?

-- Using Opera's mail client: http://www.opera.com/mail/
wrong query results with wdf and ngtf
Is there a way to tell NGramFilterFactory while indexing that numbers shall never be tokenized? Then the query should be able to find numbers. Or do I have to change the ngram min for numbers (not alpha) to 1, if that is possible? So to speak, index the whole number as one token instead of all possible grams. Solr's analysis page shows that only WDF has no underscore in its tokens; the rest keep it. Can I tell the query to search numbers differently with NGTF, WT, LCF or whatever?

I also tried

<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>

with at-under-alpha.txt:

@ => ALPHA
_ => ALPHA

I have gotten nearly everything to work. There are two queries where I don't get back what I want:

avaloq frage 1 - only returns if I set minGramSize=1 while indexing
yh_cug - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Solrconfig:

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
  </lst>
</requestHandler>
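The yh_cug mismatch described above can be shown with a toy model of the two analyzer chains (my simplification, not the real Lucene filters): the index side runs WDF and splits on the underscore, while a query side without that split does not, so the query term never matches any indexed term:

```python
import re

def index_analyze(text):
    """Toy index chain: lowercase, then WDF-style split on '_'."""
    parts = re.split(r"_", text.lower())
    return parts + ["".join(parts)]   # catenateWords=1 also emits the joined form

def query_analyze(text):
    """Toy query chain where the underscore survives."""
    return [text.lower()]

indexed = set(index_analyze("yh_cug"))   # {'yh', 'cug', 'yhcug'}
queried = set(query_analyze("yh_cug"))   # {'yh_cug'}
print(indexed & queried)                 # empty set - no overlap, hence 0 hits
```

The fix is always to make both sides agree on what the underscore means, one way or the other.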
search for single-char numbers when ngram min is 3
Is there a way to tell NGramFilterFactory while indexing that numbers shall never be tokenized? Then the query should be able to find numbers. Or do I have to change the ngram min for numbers to 1, if that is possible? So to speak, index the whole number as one token instead of all possible grams. Or can I tell the query to search numbers differently with WT, LCF or whatever? I attached a doc with screenshots from the Solr analyzer.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 13 March 2014 13:44
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

I have gotten nearly everything to work. There are two queries where I don't get back what I want:

avaloq frage 1 - only returns if I set minGramSize=1 while indexing
yh_cug - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype: [...]

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Hi Jack, do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest: [...]

-----Original Message-----
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org
Date: 12/03/2014 13:25
Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help.
underscore in query error
If I use the underscore in the query, I don't get any results. If I remove the underscore, it finds the docs containing the underscore. Can I tell Solr to search through the NGTF instead of the WDF, or is there a better solution? Query: yh_cug. I attached a doc with the analyzer output.
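For reference, the types-file route tried elsewhere in this thread, sketched as a config fragment (file name as used in the thread): telling WordDelimiterFilterFactory to treat _ and @ as ordinary letters, in both the index and the query analyzer, keeps yh_cug intact on both sides:

```xml
<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
```

with at-under-alpha.txt containing:

```text
@ => ALPHA
_ => ALPHA
```

The key point is that the same types file has to appear in both analyzers; if only one side keeps the underscore, the terms can never match.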
RE: use local param in solrconfig fq for access-control
I have given up on this idea and made a wrapper which adds an fq with the user roles to each request.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Tuesday, 11 March 2014 23:32
To: solr-user@lucene.apache.org
Subject: use local param in solrconfig fq for access-control

I would like to use $r and $org for access control. It has to allow the fq's from my facets to work as well. I'm not sure if I'm doing it right, or if I should add it to a qf or the q itself. debugQuery returns a parsed fq string, and in it $r and $org are printed instead of their values. How do I get them to be interpreted? The local params are listed in the response, so they should be valid.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>
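The wrapper approach mentioned above might look roughly like this (hypothetical names; the real wrapper obviously lives in whatever fronts Solr): every outgoing request gets an extra fq built from the authenticated user's roles and organisations, so the access control never depends on Solr-side parameter substitution:

```python
def solr_params(q, roles, orgs):
    """Build Solr request params, appending an access-control fq
    from the user's roles/organisations (hypothetical field names)."""
    acl_fq = "(+organisations:({orgs}) +roles:({roles}))".format(
        orgs=" OR ".join(orgs), roles=" OR ".join(roles))
    return {"q": q, "fq": [acl_fq], "wt": "json"}

params = solr_params("cug", roles=["sales", "hr"], orgs=["zh"])
print(params["fq"])   # ['(+organisations:(zh) +roles:(sales OR hr))']
```

Facet filter queries can then be appended to the same fq list without interfering with the ACL clause.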
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards
I have gotten nearly everything to work. There are two queries where I don't get back what I want:

avaloq frage 1 - only returns if I set minGramSize=1 while indexing
yh_cug - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype: [...]

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Hi Jack, do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest: [...]

-----Original Message-----
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org
Date: 12/03/2014 13:25
Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help. Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.

-- Jack Krupansky

-----Original Message-----
From: Andreas Owen
Sent: Wednesday, March 12, 2014 6:20 AM
To: solr-user@lucene.apache.org
Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards

I now have the following: [...]
RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
I now have the following: analyzer type=query tokenizer class=solr.WhiteSpaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory types=at-under-alpha.txt/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_de.txt format=snowball enablePositionIncrements=true/ !-- remove common words -- filter class=solr.GermanNormalizationFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German/ /analyzer The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results? Output: lst name=debug str name=rawquerystringyh_cug/str str name=querystringyh_cug/str str name=parsedquery(+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord/str str name=parsedquery_toString+(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0/str lst name=explain/ arr name=expandedSynonyms stryh_cug/str /arr lst name=reasonForNotExpandingSynonyms str name=nameDidntFindAnySynonyms/str str name=explanationNo synonyms found for this query. 
Check your synonyms file./str /lst lst name=mainQueryParser str name=QParserExtendedDismaxQParser/str null name=altquerystring/ arr name=boost_queries str(expiration:[NOW TO *] OR (*:* -expiration:*))^6/str /arr arr name=parsed_boost_queries str(expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0/str /arr arr name=boostfuncs strdiv(clicks,max(displays,1))^8/str /arr /lst lst name=synonymQueryParser str name=QParserExtendedDismaxQParser/str null name=altquerystring/ arr name=boostfuncs strdiv(clicks,max(displays,1))^8/str /arr /lst lst name=timing -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Dienstag, 11. März 2014 14:25 To: solr-user@lucene.apache.org Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 uppwards The usual use of an ngram filter is at index time and not at query time. What exactly are you trying to achieve by using ngram filtering at query time as well as index time? Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer - the later removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF. Which query parser are you using? What fields are being queried? Please post the parsed query string from the debug output - it will show the precise generated query. I think what you are seeing is that the ngram filter is generating tokens like h_cugtest and then the WDF is removing the underscore and then h gets generated as a separate token. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Tuesday, March 11, 2014 5:09 AM To: solr-user@lucene.apache.org Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards I got it roght the first time and here is my requesthandler. 
The field plain_text is searched correctly and has the same fieldtype as title (text_de).

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- in this example, we will only use synonyms at query time
  <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
  -->
  <!-- Case insensitive stop word removal. enablePositionIncrements=true ensures that a 'gap' is left to allow for accurate phrase queries. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

-Original Message- From: Jack Krupansky j...@basetechnology.com To: solr-user@lucene.apache.org Date: 12/03/2014 13:25 Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 uppwards

You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help. Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.
-- Jack Krupansky
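Jack's point about index/query asymmetry for the WordDelimiterFilter can be sketched as follows. This is an illustrative example, not the poster's actual schema: at index time WDF both splits and catenates (so "yh_cug" can index "yh", "cug" and "yhcug"), while at query time it only splits, so every query-side variant has a chance to match something that was indexed.

```xml
<!-- Hypothetical sketch: index side generates more terms than query side -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- split on delimiters AND catenate the parts back together -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- only split; no catenation, so fewer query-side variants -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```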
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
Hi Jack, do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output, but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>

-Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Wednesday, 12 March 2014 14:44 To: solr-user@lucene.apache.org Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
searches for single char tokens instead of from 3 uppwards
I have a field with the following type:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>

Shouldn't this make tokens from 3 to 15 in length, and not from 1? Here is a query report for 2 of the results:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">125</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="fl">title,roles,organisations,id</str>
    <str name="indent">true</str>
    <str name="q">yh_cugtest</str>
    <str name="_">1394522589347</str>
    <str name="wt">xml</str>
    <str name="fq">organisations:* roles:*</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0"> ..
str name=dms:2681 1.6365329 = (MATCH) sum of: 1.6346203 = (MATCH) max of: 0.14759353 = (MATCH) product of: 0.28596246 = (MATCH) sum of: 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.0119499 = (MATCH) weight(plain_text:ugt in 0) [DefaultSimilarity], result of: 0.0119499 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.031227252 = queryWeight, product of: 4.8982444 = idf(docFreq=18, maxDocs=937) 0.0063751927 = queryNorm 0.38267535 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.8982444 = idf(docFreq=18, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhc in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:hcu in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, 
maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:cugt in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhcu in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of:
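The single-character grams in the explain output above fit Jack's diagnosis: with NGramFilterFactory running before WordDelimiterFilterFactory, a gram such as "h_c" is later split by the WDF into one-character tokens like "h". A hedged sketch of the usual fix is to run the WDF before the n-gram filter, so grams are built from already-split word parts and respect minGramSize (sketch only, not the poster's verified config):

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- split yh_cugtest into "yh" / "cugtest" first -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" splitOnCaseChange="1"/>
  <!-- then build grams of length 3..15 from the word parts -->
  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
```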
Re: SOLVED searches for single char tokens instead of from 3 uppwards
Sorry, I looked at the wrong fieldtype.

-Original Message- From: Andreas Owen a...@conx.ch To: solr-user@lucene.apache.org Date: 11/03/2014 08:45 Subject: searches for single char tokens instead of from 3 uppwards
RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards
I got it right the first time, and here is my requesthandler. The field plain_text is searched correctly and has the same fieldtype as title (text_de).

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- lst name=invariants -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
query with local params
This works great, but I would like to use local params r and org instead of the hard-coded values:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72))</str>

I would like:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:($org) +roles:($r))</str>

Shouldn't the numbers be in the output below (parsed_filter_queries), and not $r and $org? I use this in my requesthandler and need it to be added as fq or query params without being able to be overridden; has anybody any ideas? Oh, and I use facets, so the fq has to be combinable. Debug query:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">109</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="r">267</str>
    <str name="q">yh_cug</str>
    <str name="_">1394533792473</str>
    <str name="wt">xml</str>
  </lst>
</lst>
...
<arr name="filter_queries">
  <str>{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</arr>
<arr name="parsed_filter_queries">
  <str>(MatchAllDocsQuery(*:*) -organisations:[ TO *] -roles:[ TO *]) (+organisations:$org +roles:$r) (-organisations:[ TO *] +roles:$r) (+organisations:$org -roles:[ TO *])</str>
</arr>
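One general Solr detail is relevant here: parameter dereferencing with $name is resolved only where local parameters are parsed (for example as a v=$name value) and in function queries, not in the middle of a plain query string, which is why the literal $org and $r survive into parsed_filter_queries. A hedged sketch of a common workaround is to move the whole clause into a dereferenced parameter; the parameter name aclq is illustrative, not from the thread:

```xml
<lst name="invariants">
  <!-- the fq is just a dereference; the actual query text lives in $aclq -->
  <str name="fq">{!lucene q.op=OR v=$aclq}</str>
</lst>
<lst name="defaults">
  <!-- note: $org / $r would still not be substituted inside this string;
       the per-user values have to arrive as a request parameter named aclq -->
  <str name="aclq">(*:* -organisations:[* TO *] -roles:[* TO *])</str>
</lst>
```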
use local params in query
Shouldn't the numbers be in the output below (parsed_filter_queries), and not $r and $org? This works great, but I would like to use local params r and org instead of the hard-coded values:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72))</str>

I would like:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:($org) +roles:($r))</str>

I use this in my requesthandler under invariants because I need it to be added to the query without being able to be overridden. Oh, and I use facets, so the fq has to be combinable. This should work, or am I understanding it wrong? Debug query:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">109</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="r">267</str>
    <str name="q">yh_cug</str>
    <str name="_">1394533792473</str>
    <str name="wt">xml</str>
  </lst>
</lst>
...
<arr name="filter_queries">
  <str>{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</arr>
<arr name="parsed_filter_queries">
  <str>(MatchAllDocsQuery(*:*) -organisations:[ TO *] -roles:[ TO *]) (+organisations:$org +roles:$r) (-organisations:[ TO *] +roles:$r) (+organisations:$org -roles:[ TO *])</str>
</arr>
use local param in solrconfig fq for access-control
I would like to use $r and $org for access control. It has to allow the fq's from my facets to work as well. I'm not sure if I'm doing it right, or if I should add it to a qf or to q itself. The debugQuery output returns a parsed fq string, and in it $r and $org are printed instead of their values. How do I get them to be interpreted? The local params are listed in the response, so they should be valid.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>
maxClauseCount is set to 1024
does this maxClauseCount go over each field individually or all put together? is it the date fields? when i execute a query i get this error: lst name=responseHeader int name=status500/int int name=QTime93/int lst name=params str name=indenttrue/str str name=qEin PDFchen als Dokument roles:*/str str name=_1394436617394/str str name=wtxml/str /lst /lst result name=response numFound=499 start=0 maxScore=0.40899447 doc . float name=score0.10604319/float /doc /result lst name=facet_counts lst name=facet_queries/ lst name=facet_fields lst name=inhaltstyp_s int name=Agenda390/int int name=Formular2/int int name=Formulare27/int int name=Für Dokumente only1/int int name=Für Websiten only1/int int name=Hilfsmittel3/int int name=Information3/int int name=Präsentation1/int int name=Regelung8/int int name=Schulung10/int int name=Schulung_ONL1/int int name=Test14/int int name=Weisung37/int int name=test1/int /lst lst name=doctype int name=doc1/int int name=docx4/int int name=htm8/int int name=pdf44/int int name=pptx4/int int name=vsd1/int int name=xlsx6/int /lst lst name=thema_f int name=1_57/int int name=1_Anleitungen11/int int name=1_Anleitungen_Ausbildung [Anleitungen]11/int int name=1_Ausbildung3/int int name=1_Ausbildung_Weiterbildung3/int int name=1_Beratung4/int int name=1_Beratung_Beratungsportal FK1/int int name=1_Beratung_Beratungsportal PK2/int int name=1_Beratung_Beratungsprozess1/int int name=1_Handlungsempfehlung2/int int name=1_Handlungsempfehlung_a2/int int name=1_Marktbearbeitung2/int int name=1_Marktbearbeitung_Events2/int int name=1_Produkte29/int int name=1_Weisungen1/int int name=1_Weisungen_Workplace [Weisungen]1/int /lst lst name=author_s int name=17/int int name=Aeschlimann Monika (MAE)1/int int name=Ancora Carlo (CAA)1/int int name=Bannwart Markus (MBA)4/int int name=Basse Detlev (DBS)1/int int name=Beerli Dominik (DBI)3/int int name=Bollinger Beat (BBO)5/int int name=Brunner Elisabeth (EBN)1/int int name=Brüschweiler Otto (OBR)5/int int name=Buric 
Aleksandra (ABC)1/int int name=Bächtold Eliane (EBA)2/int int name=Chieco Daniela (DCH)1/int int name=D'Adamo-Gähler Karin (KDA)1/int int name=Dannecker Dietmar (DDA)1/int int name=De Biasio Claudio (CDB)35/int int name=Donatsch Roman (RDO)1/int int name=Eberhart Livia (LET)2/int int name=Etter Alice (AET)26/int int name=Fankhauser Hausi (HFA)2/int int name=Frei Beat (BFI)1/int int name=Frick Patrick (PFR)2/int int name=Grasset André (AGT)3/int int name=Grava Reto (RGV)1/int int name=Gunterswiler Walter (WGU)1/int int name=Gürkan Simon (SGN)1/int int name=Heimbeck Markus (MHI)27/int int name=Helbling Andreas (AHG)3/int int name=Held Hans-Jörg (HHE)1/int int name=Helg Christoph (CHL)1/int int name=Hofer Astrid (AHO)3/int int name=Huber Kalevi (KHU)1/int int name=Huber Paul (PHU)1/int int name=Häberli Peter (PHI)3/int int name=Häfliger Gabriela (GHA)6/int int name=Hümbeli Isabelle (IHE)3/int int name=Isler Myriam (MIS)1/int int name=Jäger Andreas (AJA)2/int int name=Kasper Markus (MKP)2/int int name=Keller Reto (RKE)2/int int name=Knecht Urs (UKN)2/int int name=Kutter Benedikt (BKU)2/int int name=Kälin-Klay Sonja (SKY)28/int int name=Lutz René (RLU)4/int int name=Matanovic Jacques (JMT)2/int int name=Monti Mirko (MMO)1/int int name=Märki Susanne (SMA)16/int int name=Olimpio Marco (MOL)46/int int name=Pfister Nicole (NPF)1/int int name=Pozzi Anthony (ANP)5/int int name=Reinhard Martin (MRE)11/int int name=Reutlinger Graf Caroline (CRE)58/int int name=Roth Rolf (ROR)1/int int name=Rutz Mirco (MRT)2/int int name=Salvisberg Adrian (ASA)29/int int name=Sassano Marianna (MSN)2/int int name=Schaffhauser Carmen (CSR)2/int int name=Schoop Hans-Jörg (HSP)1/int int name=Schrieder Bernadette (BSD)1/int int name=Seeholzer Carola (CSZ)1/int int name=Storniolo Patrizia (PSO)9/int int name=Tanner-Ott Sara (STN)4/int int name=Tobler Tamara (TTO)75/int int name=Trefzer-Hug Cornelia (CTF)2/int int name=Uhlmann Heinz (HUH)2/int int name=Vettori Renato (RVE)1/int int name=Vogel Heinrich 
(HVO)2/int int name=Weibel Stephanie (SWL)2/int int name=Weinzerl Rudolf (RWE)1/int int name=Wellauer Pascal (PWL)4/int int name=Wild Ursula (UWD)1/int int name=Wuffli Markus (MWU)1/int int name=Wüthrich
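For the maxClauseCount question itself: the limit is configured in solrconfig.xml and is checked per generated BooleanQuery, so a single wildcard, prefix or range expansion on one field (such as roles:*) can exceed it on its own. If raising it is acceptable, the sketch below shows where the setting lives; 2048 is just an example value:

```xml
<!-- in solrconfig.xml, inside the <query> section -->
<query>
  <maxBooleanClauses>2048</maxBooleanClauses>
</query>
```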
set fq operator independently
I want to use the following in fq, and I need to set the operator to OR. My q.op is AND, but I need OR in fq. I have read about putting OR between multiple fq's, but that is not what I need. Can I set the operator for a single fq?

(-organisations:[ TO *] -roles:[ TO *]) (+organisations:(150 42) +roles:(174 72))

The statement should find all docs without organisations and roles, or those that have at least one roles and one organisations entry. These fields are multivalued.
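A local-params prefix can override the default operator for a single fq without touching the global q.op. A minimal sketch, reusing the field values from the message above:

```xml
<str name="fq">{!q.op=OR}(-organisations:[ TO *] -roles:[ TO *]) (+organisations:(150 42) +roles:(174 72))</str>
```

Alternatively, the same effect can be had without local params by writing OR explicitly between the two parenthesized groups.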
Re[2]: query parameters
OK, I like the logic; you can do much more with it. I think this should do it for me:

(-organisations:[ TO *] -roles:[ TO *]) (+organisations:(150 42) +roles:(174 72))

I want to use this in fq, and I need to set the operator to OR. My q.op is AND, but I need OR in fq. I have read about putting OR between multiple fq's, but that is not what I need. Can I set the operator for fq? The statement should find all docs without organisations and roles, or those that have at least one roles and one organisations entry. These fields are multivalued.

-Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Date: 19/02/2014 04:09 Subject: Re: query parameters

The Solr/Lucene query language is NOT strictly boolean; see Chris's excellent blog here: http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/ Best, Erick

On Tue, Feb 18, 2014 at 11:54 AM, Andreas Owen a...@conx.ch wrote: I tried it in the Solr admin query and it showed me all the docs without a value in organisations and roles. It didn't matter if I used a base term; isn't that given through the q parameter?

-Original Message- From: Raymond Wiker [mailto:rwi...@gmail.com] Sent: Tuesday, 18 February 2014 13:19 To: solr-user@lucene.apache.org Subject: Re: query parameters

That could be because the second condition does not do what you think it does... have you tried running the second condition separately? You may have to add a base term to the second condition, like what you have for the bq parameter in your config file; i.e., something like (*:* -organisations:[ TO *] -roles:[ TO *])

On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen a...@conx.ch wrote: It seems that fq doesn't accept OR, because (organisations:(150 OR 41) AND roles:(174)) OR (-organisations:[ TO *] AND -roles:[ TO *]) only returns docs that match the first condition. It doesn't return any docs with the empty fields organisations and roles.

-Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Monday, 17 February 2014 05:08 To: solr-user@lucene.apache.org Subject: query parameters

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to force the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rules 2 and 3

Snippet of what I've got (I haven't checked whether there is an "in" operator like in SQL for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
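Raymond's "base term" advice in the quoted exchange comes from how Lucene evaluates purely negative clauses: a parenthesized group containing only negations matches nothing on its own, because there is no positive set of documents to subtract from (Solr special-cases a pure-negative query only at the top level). Prefixing *:* supplies that set. A sketch using the fields from this thread:

```text
(-organisations:[ TO *] -roles:[ TO *])        matches nothing as a subclause:
                                               there is nothing to subtract from
(*:* -organisations:[ TO *] -roles:[ TO *])    matches all docs where both
                                               fields are empty
```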
RE: query parameters
It seems that fq doesn't accept OR, because

(organisations:(150 OR 41) AND roles:(174)) OR (-organisations:[* TO *] AND -roles:[* TO *])

only returns docs that match the first condition. It doesn't return any docs with empty organisations and roles fields.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 17 February 2014 05:08
To: solr-user@lucene.apache.org
Subject: query parameters

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to enforce the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an IN operator, like in SQL, for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
RE: query parameters
I tried it in the Solr admin query page and it showed me all the docs without a value in organisations and roles. It didn't matter whether I used a base term -- isn't that given through the q parameter?

-----Original Message-----
From: Raymond Wiker [mailto:rwi...@gmail.com]
Sent: Tuesday, 18 February 2014 13:19
To: solr-user@lucene.apache.org
Subject: Re: query parameters

That could be because the second condition does not do what you think it does... have you tried running the second condition separately? You may have to add a base term to the second condition, like what you have for the bq parameter in your config file, i.e. something like

(*:* -organisations:[* TO *] -roles:[* TO *])

On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen a...@conx.ch wrote:

It seems that fq doesn't accept OR, because

(organisations:(150 OR 41) AND roles:(174)) OR (-organisations:[* TO *] AND -roles:[* TO *])

only returns docs that match the first condition. It doesn't return any docs with empty organisations and roles fields.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 17 February 2014 05:08
To: solr-user@lucene.apache.org
Subject: query parameters

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to enforce the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an IN operator, like in SQL, for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
query parameters
In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to enforce the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an IN operator, like in SQL, for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
admin gui right side not loading
I'm using Solr 4.3.1 and have installed it on a Windows 2008 server. Solr itself is working, for example import and search. But the right side of the admin GUI isn't loading, and I get a JavaScript error for several d3 objects. The last error is:

Load timeout for modules: lib/order!lib/jquery.autogrow lib/order!lib/jquery.cookie lib/order!lib/jquery.form lib/order!lib/jquery.jstree lib/order!lib/jquery.sammy lib/order!lib/jquery.timeago lib/order!lib/jquery.blockUI lib/order!lib/highlight lib/order!lib/linker lib/order!lib/ZeroClipboard lib/order!lib/d3 lib/order!lib/chosen lib/order!scripts/app lib/order!scripts/analysis lib/order!scripts/cloud lib/order!scripts/cores lib/order!scripts/dataimport lib/order!scripts/dashboard lib/order!scripts/file lib/order!scripts/index lib/order!scripts/java-properties lib/order!scripts/logging lib/order!scripts/ping lib/order!scripts/plugins lib/order!scripts/query lib/order!scripts/replication lib/order!scripts/schema-browser lib/order!scripts/threads lib/jquery.autogrow lib/jquery.cookie lib/jquery.form lib/jquery.jstree lib/jquery.sammy lib/jquery.timeago lib/jquery.blockUI lib/highlight lib/linker lib/ZeroClipboard lib/d3 lib/chosen scripts/app scripts/analysis scripts/cloud scripts/cores scripts/dataimport scripts/dashboard scripts/file scripts/index scripts/java-properties scripts/logging scripts/ping scripts/plugins scripts/query scripts/replication scripts/schema-browser scripts/threads
http://requirejs.org/docs/errors.html#timeout

There are no apparent errors in the log file, and the exact same conf is working on another server. What can I do?
RE: json update moves doc to end
of: 4.349904 = idf(docFreq=29, maxDocs=855) 0.0070840283 = queryNorm 0.1359345 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.349904 = idf(docFreq=29, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.006139375 = (MATCH) weight(plain_text:berich in 0) [DefaultSimilarity], result of: 0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.037305873 = queryWeight, product of: 5.266195 = idf(docFreq=11, maxDocs=855) 0.0070840283 = queryNorm 0.16456859 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.266195 = idf(docFreq=11, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.0059541636 = (MATCH) weight(plain_text:ericht in 0) [DefaultSimilarity], result of: 0.0059541636 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.036738846 = queryWeight, product of: 5.186152 = idf(docFreq=12, maxDocs=855) 0.0070840283 = queryNorm 0.16206725 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.186152 = idf(docFreq=12, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.006139375 = (MATCH) weight(plain_text:bericht in 0) [DefaultSimilarity], result of: 0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.037305873 = queryWeight, product of: 5.266195 = idf(docFreq=11, maxDocs=855) 0.0070840283 = queryNorm 0.16456859 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.266195 = idf(docFreq=11, maxDocs=855) 0.03125 = fieldNorm(doc=0) 7.054 = (MATCH) weight(editorschoice:bericht^200.0 in 0) [DefaultSimilarity], result of: 7.054 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.749 = queryWeight, product of: 200.0 = boost 7.0579543 = idf(docFreq=1, maxDocs=855) 7.0840283E-4 = queryNorm 7.0579543 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 7.0579543 = idf(docFreq=1, maxDocs=855) 1.0 = fieldNorm(doc=0) 0.0021252085 = (MATCH) product of: 0.004250417 = (MATCH) sum of: 0.004250417 = (MATCH) sum of: 
0.004250417 = (MATCH) MatchAllDocsQuery, product of: 0.004250417 = queryNorm 0.5 = coord(1/2) -Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of: -Infinity = log(int(clicks)=0) 8.0 = boost 7.0840283E-4 = queryNorm /str -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Dienstag, 3. Dezember 2013 20:30 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end Try adding debug=all and you'll see exactly how docs are scored. Also, it'll show you exactly how your query is parsed. Paste that if it's confused, it'll help figure out what's going wrong. On Tue, Dec 3, 2013 at 1:37 PM, Andreas Owen a...@conx.ch wrote: So isn't it sorted automaticly by relevance (boost value)? If not do should i set it in solrconfig? -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Dienstag, 3. Dezember 2013 19:07 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end What order, the order if you supply no explicit sort at all? Solr does not make any guarantees about what order documents will come back in if you do not ask for a sort. In general in Solr/lucene, the only way to update a document is to re-add it as a new document, so that's probably what's going on behind the scenes, and it probably effects the 'default' sort order -- which Solr makes no agreement about anyway, you probably shouldn't even count on it being consistent at all. If you want a consistent sort order, maybe add a field with a timestamp, and ask for results sorted by the timestamp field? And then make sure not to change the timestamp when you do an update that you don't want to change the order? Apologies if I've misunderstood the situation. On 12/3/13 1:00 PM, Andreas Owen wrote: When I search for agenda I get a lot of hits. Now if I update the 2. Result by json-update the doc is moved to the end of the index when I search for it again. 
The field I change is editorschoice and it never contains the search term agenda so I don't see why it changes the order. Why does it? Part of Solrconfig requesthandler I use: requestHandler name=/select2 class=solr.SearchHandler lst name=defaults str name=echoParamsexplicit/str int name=rows10/int str name=defTypesynonym_edismax/str str name
RE: json update moves doc to end
I changed my boost function log(clickrate)^8 to div(clicks,displays)^8 and it works now. I get the following output from debug:

0.0022668892 = (MATCH) FunctionQuery(div(const(2),const(5))), product of: 0.4 = div(const(2),const(5)) 8.0 = boost 7.0840283E-4 = queryNorm

Am I understanding this right, that 0.4 and 8.0 result in 7.084? I'm having trouble understanding how much I boosted it. As I use NGramFilterFactory I get a lot of hits because of the tokens. Can I make the boost higher if the whole search term is found and not just part of it?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 4 December 2013 15:07
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

Well, both have a score of -Infinity. So they're equal and the tiebreaker is the internal Lucene doc ID. Now this is not helpful since the question now is where -Infinity comes from; this looks suspicious:

-Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of: -Infinity = log(int(clicks)=0)

Not much help I know, but
Erick

On Wed, Dec 4, 2013 at 7:24 AM, Andreas Owen a...@conx.ch wrote:

Hi Erick

Here are the last two results from a search, and I am not understanding why the last one with the boost editorschoice^200 isn't at the top. By the way, can I also give a substantial boost to results that contain the whole search request and not just 3 or 4 letters (tokens)?
str name=dms:1003 -Infinity = (MATCH) sum of: 0.013719446 = (MATCH) max of: 0.013719446 = (MATCH) sum of: 2.090396E-4 = (MATCH) weight(plain_text:ber in 841) [DefaultSimilarity], result of: 2.090396E-4 = score(doc=841,freq=8.0 = termFreq=8.0 ), product of: 0.009452709 = queryWeight, product of: 1.3343692 = idf(docFreq=611, maxDocs=855) 0.0070840283 = queryNorm 0.022114253 = fieldWeight in 841, product of: 2.828427 = tf(freq=8.0), with freq of: 8.0 = termFreq=8.0 1.3343692 = idf(docFreq=611, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 0.0012402858 = (MATCH) weight(plain_text:eri in 841) [DefaultSimilarity], result of: 0.0012402858 = score(doc=841,freq=9.0 = termFreq=9.0 ), product of: 0.022357063 = queryWeight, product of: 3.1559815 = idf(docFreq=98, maxDocs=855) 0.0070840283 = queryNorm 0.05547624 = fieldWeight in 841, product of: 3.0 = tf(freq=9.0), with freq of: 9.0 = termFreq=9.0 3.1559815 = idf(docFreq=98, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 5.0511415E-4 = (MATCH) weight(plain_text:ric in 841) [DefaultSimilarity], result of: 5.0511415E-4 = score(doc=841,freq=1.0 = termFreq=1.0 ), product of: 0.024712078 = queryWeight, product of: 3.4884217 = idf(docFreq=70, maxDocs=855) 0.0070840283 = queryNorm 0.020439971 = fieldWeight in 841, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 3.4884217 = idf(docFreq=70, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 8.721528E-4 = (MATCH) weight(plain_text:ich in 841) [DefaultSimilarity], result of: 8.721528E-4 = score(doc=841,freq=12.0 = termFreq=12.0 ), product of: 0.017446788 = queryWeight, product of: 2.4628344 = idf(docFreq=197, maxDocs=855) 0.0070840283 = queryNorm 0.049989305 = fieldWeight in 841, product of: 3.4641016 = tf(freq=12.0), with freq of: 12.0 = termFreq=12.0 2.4628344 = idf(docFreq=197, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 7.725705E-4 = (MATCH) weight(plain_text:cht in 841) [DefaultSimilarity], result of: 7.725705E-4 = score(doc=841,freq=4.0 = termFreq=4.0 ), product of: 
0.021610687 = queryWeight, product of: 3.050621 = idf(docFreq=109, maxDocs=855) 0.0070840283 = queryNorm 0.035749465 = fieldWeight in 841, product of: 2.0 = tf(freq=4.0), with freq of: 4.0 = termFreq=4.0 3.050621 = idf(docFreq=109, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 0.0010287998 = (MATCH) weight(plain_text:beri in 841) [DefaultSimilarity], result of: 0.0010287998 = score(doc=841,freq=1.0 = termFreq=1.0 ), product of: 0.035267927 = queryWeight, product of: 4.978513 = idf(docFreq=15, maxDocs=855) 0.0070840283 = queryNorm 0.029170973 = fieldWeight in 841, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.978513 = idf(docFreq=15, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 0.0010556461 = (MATCH) weight(plain_text:eric in 841) [DefaultSimilarity
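The FunctionQuery line in the debug output above can be checked by hand: the score is the product of the function value, the boost, and the queryNorm (not "0.4 and 8.0 result in 7.084" -- the 7.0840283E-4 is the queryNorm factor). A quick sanity check:

```python
# Reproduce the explain line:
#   0.0022668892 = FunctionQuery(div(const(2),const(5))), product of:
#     0.4 = div(const(2),const(5)), 8.0 = boost, 7.0840283E-4 = queryNorm
value = 2 / 5              # div(clicks, displays) with clicks=2, displays=5
boost = 8.0                # the ^8 on the bf
query_norm = 7.0840283e-4  # taken verbatim from the debug output
score = value * boost * query_norm
print(score)  # agrees with 0.0022668892 up to float32 rounding
```

So the ^8 boost multiplies the raw function value; it does not produce the 7.08e-4 term.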
json update moves doc to end
When I search for agenda I get a lot of hits. Now if I update the 2nd result via JSON update, the doc is moved to the end of the index when I search for it again. The field I change is editorschoice and it never contains the search term agenda, so I don't see why it changes the order. Why does it?

Part of the solrconfig requestHandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">log(clicks)^8</str> <!-- tested -->
    <!-- todo: number of links (count urlparse in links query) / frequency of search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
RE: json update moves doc to end
So isn't it sorted automatically by relevance (boost value)? If not, should I set it in solrconfig?

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Tuesday, 3 December 2013 19:07
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

What order -- the order if you supply no explicit sort at all? Solr does not make any guarantees about what order documents will come back in if you do not ask for a sort. In general in Solr/Lucene, the only way to update a document is to re-add it as a new document, so that's probably what's going on behind the scenes, and it probably affects the 'default' sort order -- which Solr makes no agreement about anyway; you probably shouldn't count on it being consistent at all. If you want a consistent sort order, maybe add a field with a timestamp, and ask for results sorted by the timestamp field? And then make sure not to change the timestamp when you do an update that you don't want to change the order? Apologies if I've misunderstood the situation.

On 12/3/13 1:00 PM, Andreas Owen wrote:

When I search for agenda I get a lot of hits. Now if I update the 2nd result via JSON update, the doc is moved to the end of the index when I search for it again. The field I change is editorschoice and it never contains the search term agenda, so I don't see why it changes the order. Why does it?
Part of the solrconfig requestHandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">log(clicks)^8</str> <!-- tested -->
    <!-- todo: number of links (count urlparse in links query) / frequency of search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
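Following the advice above, a deterministic order can be pinned with an explicit sort on a timestamp field. A hedged sketch (last_modified is taken from the handler's qf and is assumed to be a date field; the exact sort spec is a suggestion, not the poster's config):

```xml
<!-- inside <lst name="defaults"> of the /select2 handler: -->
<!-- ties in score fall back to the timestamp instead of the internal docid -->
<str name="sort">score desc, last_modified desc</str>
```

With this in place, re-adding a document via an update no longer moves it among equally scored results, because the tiebreaker is a stable field rather than the Lucene-internal document ID.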
search with wildcard
I am querying test in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like Supertestplan, it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around it? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
RE: search with wildcard
I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because the word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 21 November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying test in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like Supertestplan, it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around it? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
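One common answer to the question above is not a different tokenizer but an extra, lightly boosted field analyzed with an NGramFilterFactory at index time, so that substrings like "test" inside "Supertestplan" match without wildcards. A sketch under that assumption (field and type names are hypothetical, not the poster's schema):

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every 3..50-character substring of each token -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <!-- query side stays un-grammed so the query term matches one gram -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="plain_text_ngram" type="text_ngram" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_ngram"/>
```

Adding e.g. plain_text_ngram^1 to qf then gives substring matches a minimal weight next to the ^10..^200 boosts of the existing fields, keeping the ranking largely intact.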
RE: date range tree
I solved it by adding a loop for years and one for quarters, in which I count the month facets.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 11 November 2013 17:52
To: solr-user@lucene.apache.org
Subject: RE: date range tree

Has someone at least got an idea how I could do a year/month date tree? The Solr wiki mentions that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY should create 4 buckets, but it doesn't work.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 7 November 2013 18:23
To: solr-user@lucene.apache.org
Subject: date range tree

I would like to make a facet on a date field with the following tree:

2013
  4. Quartal: December, November, October
  3. Quartal: September, August, July
  2. Quartal: June, May, April
  1. Quartal: March, February, January
2012
  (same as above)

So far I have this in solrconfig.xml:

<str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str>
<str name="facet.date.gap">+1MONTH</str>
<str name="facet.date.end">NOW/MONTH</str>
<str name="facet.date.start">NOW/MONTH-36MONTHS</str>
<str name="facet.date.other">after</str>

Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
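The year/quarter loops the author describes can be sketched like this (Python; the month-bucket input shape -- ISO date string to count -- is an assumption about how the client reads Solr's facet_dates response):

```python
from collections import defaultdict

def rollup(month_counts):
    """Aggregate month-bucket facet counts (ISO date string -> count)
    into per-year and per-quarter totals for a year/quarter/month tree."""
    years = defaultdict(int)
    quarters = defaultdict(int)
    for date, count in month_counts.items():
        year, month = date[:4], int(date[5:7])
        quarter = (month - 1) // 3 + 1        # calendar quarter 1..4
        years[year] += count
        quarters[(year, quarter)] += count
    return dict(years), dict(quarters)

counts = {"2013-01-01T00:00:00Z": 3, "2013-02-01T00:00:00Z": 1,
          "2013-04-01T00:00:00Z": 5, "2012-12-01T00:00:00Z": 2}
years, quarters = rollup(counts)
```

Since the quarter and year totals are pure sums of the month buckets, a single facet.date query with a +1MONTH gap is enough; no second query is needed.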
RE: date range tree
Has someone at least got an idea how I could do a year/month date tree? The Solr wiki mentions that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY should create 4 buckets, but it doesn't work.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 7 November 2013 18:23
To: solr-user@lucene.apache.org
Subject: date range tree

I would like to make a facet on a date field with the following tree:

2013
  4. Quartal: December, November, October
  3. Quartal: September, August, July
  2. Quartal: June, May, April
  1. Quartal: March, February, January
2012
  (same as above)

So far I have this in solrconfig.xml:

<str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str>
<str name="facet.date.gap">+1MONTH</str>
<str name="facet.date.end">NOW/MONTH</str>
<str name="facet.date.start">NOW/MONTH-36MONTHS</str>
<str name="facet.date.other">after</str>

Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
count links pointing to id
I have a multivalued field with links pointing to the ids of Solr documents. I would like to calculate how many links point to each document and put that number into the field links2me. How can I do this? I would prefer to do it with a query and the updater, so Solr can do it internally if possible.
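Solr (of this era) has no built-in way to compute such an inbound-link count server-side, so it is usually done client-side: read every document's multivalued links field, count references per target id, and send atomic updates. A hypothetical sketch (field names from the post; the payload uses Solr's atomic-update "set" syntax, and the fetch/submit steps are left out):

```python
from collections import Counter

def links2me_updates(docs):
    """docs: list of {'id': ..., 'links': [target ids]} read from Solr.
    Returns atomic-update payloads setting links2me on every doc."""
    # count how often each id appears as a link target anywhere
    inbound = Counter(t for d in docs for t in d.get("links", []))
    return [{"id": d["id"], "links2me": {"set": inbound.get(d["id"], 0)}}
            for d in docs]

docs = [{"id": "a", "links": ["b", "c"]},
        {"id": "b", "links": ["c"]},
        {"id": "c", "links": []}]
updates = links2me_updates(docs)
```

The resulting list would be POSTed to the update handler; note that atomic updates require the other fields to be stored so Solr can re-index the full document.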
date range tree
I would like to make a facet on a date field with the following tree:

2013
  4. Quartal: December, November, October
  3. Quartal: September, August, July
  2. Quartal: June, May, April
  1. Quartal: March, February, January
2012
  (same as above)

So far I have this in solrconfig.xml:

<str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str>
<str name="facet.date.gap">+1MONTH</str>
<str name="facet.date.end">NOW/MONTH</str>
<str name="facet.date.start">NOW/MONTH-36MONTHS</str>
<str name="facet.date.other">after</str>

Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
I'm already using URLDataSource.

On 30 Sep 2013, at 5:41 PM, P Williams wrote:

Hi Andreas,

When using XPathEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor) your DataSource must be of type DataSource<Reader>. You shouldn't be using BinURLDataSource; it's giving you the cast exception. Use URLDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/URLDataSource.html) or FileDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/FileDataSource.html) instead. I don't think you need to specify namespaces, at least you didn't used to. The other thing that I've noticed is that the anywhere xpath expression // doesn't always work in DIH. You might have to be more specific.

Cheers,
Tricia

On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen a...@conx.ch wrote:

How dumb can you get? Obviously quite dumb... I would have to analyze the HTML pages with a nested instance like this:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" forEach="/docs/doc" dataSource="main">
  <entity name="htm" processor="XPathEntityProcessor" url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
    <field column="text" xpath="//content"/>
    <field column="h_2" xpath="//body"/>
    <field column="text_nohtml" xpath="//text"/>
    <field column="h_1" xpath="//h:h1"/>
  </entity>
</entity>

But I'm pretty sure the forEach is wrong, as well as the xpath expressions. At the moment I'm getting the following error:

Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader

On 28 Sep 2013, at 1:39 AM, Andreas Owen wrote:

OK, I see what you're getting at, but why doesn't the following work:

<field xpath="//h:h1" column="h_1"/>
<field column="text" xpath="/xhtml:html/xhtml:body"/>

I removed the Tika processor. What am I missing? I haven't found anything in the wiki.

On 28 Sep 2013, at 12:28 AM, P Williams wrote:

I spent some more time thinking about this. Do you really need to use the TikaEntityProcessor? It doesn't offer anything new to the document you are building that couldn't be accomplished by the XPathEntityProcessor alone, from what I can tell. I also tried to get the Advanced Parsing example (http://wiki.apache.org/solr/TikaEntityProcessor) to work, without success. There are some obvious typos (<document> instead of </document>) and an odd order to the pieces (<dataSources> is enclosed by <document>). It also looks like FieldStreamDataSource (http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html) is the one that is meant to work in this context. If Koji is still around maybe he could offer some help? Otherwise this bit of erroneous instruction should probably be removed from the wiki.
Cheers, Tricia $ svn diff Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java === --- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990) +++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy) @@ -99,13 +99,13 @@ runFullImport(getConfigHTML(identity)); assertQ(req(*:*), testsHTMLIdentity); } - + private String getConfigHTML(String htmlMapper) { return dataConfig + dataSource type='BinFileDataSource'/ + document + -entity name='Tika' format='xml' processor='TikaEntityProcessor' + +entity name='Tika' format='html' processor='TikaEntityProcessor' + url=' + getFile(dihextras/structured.html).getAbsolutePath() + ' + ((htmlMapper == null) ? : ( htmlMapper=' + htmlMapper + ')) + + field column='text'/ + @@ -114,4 +114,36 @@ /dataConfig; } + private String[] testsHTMLH1 = { + //*[@numFound='1'] + , //str[@name='h1'][contains(.,'H1 Header')] + }; + + @Test + public void testTikaHTMLMapperSubEntity() throws Exception { +runFullImport(getConfigSubEntity(identity)); +assertQ(req(*:*), testsHTMLH1); + } + + private String getConfigSubEntity(String htmlMapper) { +return +dataConfig + +dataSource type='BinFileDataSource' name='bin'/ + +dataSource type='FieldStreamDataSource' name='fld'/ + +document + +entity name
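Putting Tricia's advice together, a hedged sketch of how the nested DIH config might look with URLDataSource instead of BinURLDataSource (URLs, entity names, and columns are the poster's; whether the simplified forEach and xpath expressions actually match the fetched HTML is untested):

```xml
<dataConfig>
  <!-- both sources deliver a Reader, which XPathEntityProcessor requires -->
  <dataSource type="URLDataSource" name="main"/>
  <dataSource type="URLDataSource" name="dataUrl"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" dataSource="main"
            url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
            forEach="/docs/doc">
      <entity name="htm" processor="XPathEntityProcessor" dataSource="dataUrl"
              url="${rec.urlParse}" forEach="/html">
        <!-- per the advice above, prefer specific paths over // -->
        <field column="h_1" xpath="/html/body/h1"/>
        <field column="text" xpath="/html/body"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The BinURLDataSource-to-Reader ClassCastException in the stack trace disappears once the inner entity's dataSource is Reader-based; the remaining work is getting the xpath expressions right for the actual page structure.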
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
how dum can you get. obviously quite dum... i would have to analyze the html-pages with a nested instance like this: entity name=rec processor=XPathEntityProcessor url=file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml forEach=/docs/doc dataSource=main entity name=htm processor=XPathEntityProcessor url=${rec.urlParse} forEach=/xhtml:html dataSource=dataUrl field column=text xpath=//content / field column=h_2 xpath=//body / field column=text_nohtml xpath=//text / field column=h_1 xpath=//h:h1 / /entity /entity but i'm pretty sure the foreach is wrong and the xpath expressions. in the moment i getting the following error: Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote: ok i see what your getting at but why doesn't the following work: field xpath=//h:h1 column=h_1 / field column=text xpath=/xhtml:html/xhtml:body / i removed the tiki-processor. what am i missing, i haven't found anything in the wiki? On 28. Sep 2013, at 12:28 AM, P Williams wrote: I spent some more time thinking about this. Do you really need to use the TikaEntityProcessor? It doesn't offer anything new to the document you are building that couldn't be accomplished by the XPathEntityProcessor alone from what I can tell. I also tried to get the Advanced Parsinghttp://wiki.apache.org/solr/TikaEntityProcessorexample to work without success. There are some obvious typos (document instead of /document) and an odd order to the pieces (dataSources is enclosed by document). It also looks like FieldStreamDataSourcehttp://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.htmlis the one that is meant to work in this context. If Koji is still around maybe he could offer some help? 
Otherwise this bit of erroneous instruction should probably be removed from the wiki.

Cheers,
Tricia

$ svn diff
Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
===
--- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990)
+++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy)
@@ -99,13 +99,13 @@
     runFullImport(getConfigHTML("identity"));
     assertQ(req("*:*"), testsHTMLIdentity);
   }
-
+
   private String getConfigHTML(String htmlMapper) {
     return dataConfig +
       "<dataSource type='BinFileDataSource'/>" +
       "<document>" +
-      "<entity name='Tika' format='xml' processor='TikaEntityProcessor' " +
+      "<entity name='Tika' format='html' processor='TikaEntityProcessor' " +
       "url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' " +
       ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper + "'")) + ">" +
       "<field column='text'/>" +
@@ -114,4 +114,36 @@
     "</dataConfig>";
   }
+
+  private String[] testsHTMLH1 = {
+      "//*[@numFound='1']",
+      "//str[@name='h1'][contains(.,'H1 Header')]"
+  };
+
+  @Test
+  public void testTikaHTMLMapperSubEntity() throws Exception {
+    runFullImport(getConfigSubEntity("identity"));
+    assertQ(req("*:*"), testsHTMLH1);
+  }
+
+  private String getConfigSubEntity(String htmlMapper) {
+    return dataConfig +
+      "<dataSource type='BinFileDataSource' name='bin'/>" +
+      "<dataSource type='FieldStreamDataSource' name='fld'/>" +
+      "<document>" +
+      "<entity name='tika' processor='TikaEntityProcessor' url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' dataSource='bin' format='html' rootEntity='false'>" +
+      "<!-- Do appropriate mapping here; meta='true' means it is a metadata field -->" +
+      "<field column='Author' meta='true' name='author'/>" +
+      "<field column='title' meta='true' name='title'/>" +
+      "<!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately -->" +
+      "<field name='text' column='text'/>" +
+      "<entity name='detail' type='XPathEntityProcessor' forEach='/html' dataSource='fld' dataField='tika.text' rootEntity='true'>" +
+      "<field xpath='//div' column='foo'/>" +
+      "<field xpath='//h1' column='h1'/>" +
+      "</entity>" +
+      "</entity>" +
+      "</document>" +
+      "</dataConfig>";
+  }
+}
Index: solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport
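Outside DIH, the mapping the nested entity is after (pulling each <h1> of a fetched HTML page into a field) can be sketched with Python's stdlib. This is a hypothetical stand-in for illustration, not DIH code:

```python
# Sketch: collect the text of every <h1> in an HTML page (stdlib only).
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []  # one entry per <h1> element

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings[-1] += data

p = H1Extractor()
p.feed("<html><body><h1>H1 Header</h1><p>some text</p></body></html>")
print(p.headings)  # ['H1 Header']
```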
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
thanks, but the first suggestion is already implemented and the second didn't work. i have also tried htmlMapper=identity but nothing worked. i also tried this, but the html was stripped in both fields:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="HTMLStripTransformer">
  <field column="text" name="text" stripHTML="false" />
  <field column="text" name="text_nohtml" stripHTML="true" />
</entity>

but in the end i think it's best to cut tika out because i'm not getting any benefits from it. i would just need to get this to work:

<field xpath="//h:h1" column="h_1" />
<field column="text" xpath="/xhtml:html/xhtml:body" />

the fields are empty and i'm not getting any errors in the logs.

On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote:

This is a rather complicated example to chew through, but try the following two things:

*) dataField="${tika.text}" => dataField="text" (or, less likely, dataField="tika.text"). You might be trying to read the content of the field rather than passing a reference to the field, which seems to be what is expected. This might explain the exception.

*) It may help to be aware of https://issues.apache.org/jira/browse/SOLR-4530 . There is a new htmlMapper="identity" flag on Tika entries to ensure more of the HTML structure passes through. By default, Tika strips out most of the HTML tags.

Regards, Alex.

On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen a...@conx.ch wrote:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
  <field column="text"/>
  <entity name="detail" type="XPathEntityProcessor" forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true" onError="skip">
    <field xpath="//h1" column="h_1" />
  </entity>
</entity>

Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
i removed the FieldReaderDataSource and dataSource=fld but it didn't help. i get the following for each document:

DataImportHandlerException: Exception in invoking url null Processing Document # 9
NullPointerException

On 26. Sep 2013, at 8:39 PM, P Williams wrote:

Hi,

I haven't tried this myself, but maybe try leaving out the FieldReaderDataSource entirely. From my quick searching, it looks like it's tied to SQL. Did you try copying the Advanced Parsing example at http://wiki.apache.org/solr/TikaEntityProcessor exactly? What happens when you leave out the FieldReaderDataSource?

Cheers, Tricia

On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages, but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor "detail" creates the error; the rest works fine. what am i doing wrong?
error:

ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
	at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:365)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
(TestRuleAssertionsRequired.java:43)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
	at java.lang.Thread.run(Thread.java:722)

On Fri, Sep 27, 2013 at 3:55 AM, Andreas Owen a...@conx.ch wrote:

i removed the FieldReaderDataSource and dataSource=fld but it didn't help. i get the following for each document: DataImportHandlerException: Exception in invoking url null Processing Document # 9 NullPointerException

On 26. Sep 2013, at 8:39 PM, P Williams wrote:

Hi,

I haven't tried this myself, but maybe try leaving out the FieldReaderDataSource entirely. From my quick searching, it looks like it's tied to SQL. Did you try copying the Advanced Parsing example at http://wiki.apache.org/solr/TikaEntityProcessor exactly? What happens when you leave out the FieldReaderDataSource?

Cheers, Tricia

On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages, but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor "detail" creates the error; the rest works fine. what am i doing wrong?
error: ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null' java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator [...]
XPathEntityProcessor nested in TikaEntityProcessor query null exception
i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages, but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor "detail" creates the error; the rest works fine. what am i doing wrong?

error:

ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
	at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:365)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Unknown Source)
ERROR - 2013-09-26 12:08:49.022; org.apache.solr.common.SolrException; Exception in entity : detail:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at
dih HTMLStripTransformer
why does stripHTML=false have no effect in dih? the html is stripped in text and in text_nohtml when i display the index with select?q=*. i'm trying to get one field without html and one with it, so i can also index the links on the page.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" forEach="/docs/doc" dataSource="main"> <!-- transformer="script:GenerateId" -->
  <field column="title" xpath="//title" />
  <field column="id" xpath="//id" />
  <field column="file" xpath="//file" />
  <field column="url" xpath="//url" />
  <field column="urlParse" xpath="//urlParse" />
  <field column="last_modified" xpath="//last_modified" />
  <field column="Author" xpath="//author" />
  <entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="HTMLStripTransformer">
    <field column="text" name="text" stripHTML="false" />
    <field column="text" name="text_nohtml" stripHTML="true" />
    <!--
    transformer="RegexTransformer"
    <field column="text_html_b" regex="(?s)^.*&lt;div.*id=.*&gt;(.*)&lt;/div&gt;.*$" replaceWith="$1" sourceColName="text" />
    <field column="text_html_b" regex="(?s)^.*&lt;!-body-&gt;(.*)&lt;!-/body-&gt;.*$" replaceWith="$1" sourceColName="text" />
    -->
  </entity>
</entity>
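For comparison, the effect stripHTML="true" should have can be approximated outside Solr. The strip_html helper below is a simplified, hypothetical regex stand-in, not the actual HTMLStripTransformer (which also handles comments, scripts, and entities):

```python
# Simplified stand-in for HTMLStripTransformer: one field keeps the markup
# (so links survive), the other has the tags removed.
import re

def strip_html(html):
    # naive tag removal; the real HTMLStrip handles far more cases
    return re.sub(r"<[^>]+>", "", html)

raw = '<body><a href="/page">link</a> some text</body>'
text = raw                     # stripHTML="false": markup preserved
text_nohtml = strip_html(raw)  # stripHTML="true": tags removed
print(text_nohtml)  # link some text
```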
Re: dih delete doc per $deleteDocById
sorry, it works like this; i had a typo in my conf :-(

On 17. Sep 2013, at 2:44 PM, Andreas Owen wrote:

i would like to know how to get it to work and delete documents per xml and dih.

On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote:

What is your question?

On Tue, Sep 17, 2013 at 12:17 AM, andreas owen a.o...@gmx.net wrote:

i am using dih and want to delete indexed documents by an xml-file with ids. i have seen $deleteDocById used in <entity query="...">.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" forEach="/docs/doc" dataSource="main">
  <field column="$deleteDocById" xpath="//id" />
</entity>

xml-file:

<docs>
  <doc>
    <id>2345</id>
  </doc>
</docs>

-- Regards, Shalin Shekhar Mangar.
Re: dih delete doc per $deleteDocById
i would like to know how to get it to work and delete documents per xml and dih.

On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote:

What is your question?

On Tue, Sep 17, 2013 at 12:17 AM, andreas owen a.o...@gmx.net wrote:

i am using dih and want to delete indexed documents by an xml-file with ids. i have seen $deleteDocById used in <entity query="...">.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" forEach="/docs/doc" dataSource="main">
  <field column="$deleteDocById" xpath="//id" />
</entity>

xml-file:

<docs>
  <doc>
    <id>2345</id>
  </doc>
</docs>

-- Regards, Shalin Shekhar Mangar.
dih delete doc per $deleteDocById
i am using dih and want to delete indexed documents by an xml-file with ids. i have seen $deleteDocById used in <entity query="...">.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" forEach="/docs/doc" dataSource="main">
  <field column="$deleteDocById" xpath="//id" />
</entity>

xml-file:

<docs>
  <doc>
    <id>2345</id>
  </doc>
</docs>
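The delete file the entity reads can also be generated programmatically; a minimal sketch with a hypothetical helper, stdlib only:

```python
# Build the <docs><doc><id>...</id></doc></docs> delete file for the DIH
# entity above from a list of document ids.
import xml.etree.ElementTree as ET

def build_delete_file(ids):
    docs = ET.Element("docs")
    for doc_id in ids:
        doc = ET.SubElement(docs, "doc")
        ET.SubElement(doc, "id").text = str(doc_id)
    return ET.tostring(docs, encoding="unicode")

print(build_delete_file([2345]))  # <docs><doc><id>2345</id></doc></docs>
```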
Re: charset encoding
no jetty, and yes for tomcat i've seen a couple of answers

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: charset encoding
could it have something to do with the meta encoding tag being iso-8859-1 while the http header says utf-8, so firefox interprets the page as utf-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

no jetty, and yes for tomcat i've seen a couple of answers

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: charset encoding
it was the http-header; as soon as i forced an iso-8859-1 header it worked

On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote:

could it have something to do with the meta encoding tag being iso-8859-1 while the http header says utf-8, so firefox interprets the page as utf-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

no jetty, and yes for tomcat i've seen a couple of answers

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
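The mismatch resolved here (bytes in one encoding, HTTP header declaring another) can be reproduced in Python; this only illustrates the byte-level effect in both directions, and which direction applies depends on the bytes actually served:

```python
# ISO-8859-1 vs UTF-8 mismatch, both directions, for a char like 'ä'.
text = "ä"

# UTF-8 bytes misread as ISO-8859-1 -> the classic two-character mojibake
mojibake = text.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # Ã¤

# ISO-8859-1 bytes misread as UTF-8 -> not even decodable
try:
    text.encode("iso-8859-1").decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True  # 0xE4 alone is not a valid UTF-8 sequence
```

Forcing the header to match the real encoding of the bytes, as described above, removes the mismatch.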
Re: charfilter doesn't do anything
perfect, i tried it before but always at the tail of the expression, with no effect. thanks a lot. one last question: do you know how to keep the html comments from being filtered out before the transformer has done its work?

On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote:

Okay, I can repro the problem. Yes, it appears that the pattern replace char filter does not default to multiline mode for pattern matching, so <body> on one line and </body> on another line cannot be matched. Now, whether that is by design or a bug or an option for enhancement is a matter for some committer to comment on. But the good news is that you can in fact set multiline mode in your pattern by starting it with (?s), which means that dot accepts line break characters as well. So, here are my revised field types:

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_html_body_strip" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The first type accepts everything within <body>, including nested HTML formatting, while the latter strips nested HTML formatting as well. The tokenizer will in fact strip out whitespace, but that happens after all character filters have completed.
-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Tuesday, September 10, 2013 7:07 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

ok, i am getting there now, but if there are newlines involved the regex stops as soon as it reaches a \r\n, even if i try [\t\r\n.]* in the regex. i have to get rid of the newlines. why isn't WhitespaceTokenizerFactory the right element for this?

On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

Use XML then, although you will need to escape the XML special characters as I did in the pattern. The point is simply: quickly and simply try to find the simple test scenario that illustrates the problem.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i tried, but that isn't working either; it wants a data-stream. i'll have to check how to post json instead of xml

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines, not just a string with the body-tag. it doesn't work with proper html files, even though i took all the new lines out.

html-file:

<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:

text_html: ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only the text between <body> and </body>. Is that what you wanted? Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json
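The (?s) behaviour discussed in this thread matches Python's re.DOTALL, so the fix can be checked outside Solr; Python is used here only to illustrate the regex semantics, not the char filter itself:

```python
# Without DOTALL, "." stops at newlines, so a <body> spanning several lines
# never matches; prefixing the pattern with (?s) makes "." match newlines too.
import re

page = "<html>\nnav-content\n<body>nur das will ich sehen</body>\nfooter-content\n</html>"
pattern = r"^.*<body>(.*)</body>.*$"

no_match = re.match(pattern, page)    # None: "." will not cross \n
m = re.match("(?s)" + pattern, page)  # (?s) is equivalent to re.DOTALL
print(m.group(1))  # nur das will ich sehen
```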
charset encoding
i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: charfilter doesn't do anything
ok, i am getting there now, but if there are newlines involved the regex stops as soon as it reaches a \r\n, even if i try [\t\r\n.]* in the regex. i have to get rid of the newlines. why isn't WhitespaceTokenizerFactory the right element for this?

On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

Use XML then, although you will need to escape the XML special characters as I did in the pattern. The point is simply: quickly and simply try to find the simple test scenario that illustrates the problem.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i tried, but that isn't working either; it wants a data-stream. i'll have to check how to post json instead of xml

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines, not just a string with the body-tag. it doesn't work with proper html files, even though i took all the new lines out.
html-file: htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html solr update debug output: text_html: [html\r\n\r\nmeta name=\Content-Encoding\ content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das will ich sehenfooter-content/body/html] On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4: field name=body type=text_html_body indexed=true stored=true / fieldType name=text_html_body class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType That char filter retains only text between body and /body. Is that what you wanted? Indexing this data: curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d ' [{id:doc-1,body:abc bodyA test./body def}]' And querying with these commands: curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json; Shows all data curl http://localhost:8983/solr/select/?q=body:testindent=truewt=json; shows the body text curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json; shows nothing (outside of body) curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json; shows nothing (outside of body) curl http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json; Shows nothing, HTML tag stripped In your original query, you didn't show us what your default field, df parameter, was. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Sunday, September 08, 2013 5:21 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything yes but that filter html and not the specific tag i want. On 7. 
Sep 2013, at 7:51 PM, Erick Erickson wrote: Hmmm, have you looked at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Not quite the body, perhaps, but might it help? On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote: ok i have html pages with html.!--body--content i want!--/body--./html. i want to extract (index, store) only that between the body-comments. i thought regexTransformer would be the best because xpath doesn't work in tika and i cant nest a xpathEntetyProcessor to use xpath. what i have also found out is that the htmlparser from tika cuts my body-comments out and tries to make well formed html, which i would like to switch off. On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and string is on the same line in my tika entity. but when the string is multilined it isn't working even though i tried ?s to set the flag dotall. entity
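The "regex stops at \r\n" behavior above is the default dot-doesn't-match-newline rule, and putting "." inside a character class doesn't change it (there the dot is a literal). A minimal sketch of the difference, written in Python only because its re engine accepts the same inline-flag syntax as the Java regex engine Solr uses; the sample string is made up to resemble the debug output above:

```python
import re

# A page with CRLF line breaks, shaped like the debug output in this thread.
html = ("<html>\r\n<title></title>\r\n"
        "<body>nav-content\r\nnur das will ich sehen\r\n</body>"
        "footer-content</html>")

# By default "." does not match newline characters, so the pattern can never
# bridge the \r\n to reach </body> and the match fails.
assert re.search(r"<body>(.*)</body>", html) is None

# The inline (?s) flag (DOTALL) makes "." match newlines as well.
m = re.search(r"(?s)<body>(.*)</body>", html)
assert "nur das will ich sehen" in m.group(1)
```

With (?s) there is no need to strip the newlines out of the input first.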
Re: charfilter doesn't do anything
I index html pages with a lot of lines, not just a string with the body tag. It doesn't work with proper html files, even though I took all the new lines out.

html-file: <html>nav-content<body> nur das will ich sehen </body>footer-content</html>

solr update debug output: text_html: [<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehen footer-content</body></html>]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true"/>

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only the text between <body> and </body>. Is that what you wanted? Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json" -- shows all data
curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json" -- shows the body text
curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json" -- shows nothing (outside of body)
curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json" -- shows nothing (outside of body)
curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json" -- shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, the df parameter, was. -- Jack Krupansky
Re: charfilter doesn't do anything
I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool. -- Jack Krupansky
Re: charfilter doesn't do anything
I've downloaded curl and tried it in the command prompt and PowerShell on my Win 2008 R2 server. That's why I used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml.

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: Did you in fact try my suggested example? If not, please do so. -- Jack Krupansky
Re: charfilter doesn't do anything
Yes, but that filters html in general, not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote: Hmmm, have you looked at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Not quite the body, perhaps, but might it help?
Re: charfilter doesn't do anything
The input string is a normal html page with the word Zahlungsverkehr in it, and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote: And show us an input string and a query that fail. -- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Thursday, September 05, 2013 2:41 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote: i would like to filter / replace a word during indexing but it doesn't do anything and i don't get an error.

I don't know about your second question; I don't know if that will be possible, but I'll leave that to someone who's more expert than I. As for the first question, here's what I have. Did you reindex? That will be required. http://wiki.apache.org/solr/HowToReindex Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just Zahlungsverkehr? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow. Note that both the pattern and the replacement are case sensitive; this is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk. Use the analysis tab in the UI on your core to see what Solr does to your field text. Thanks, Shawn
Re: charfilter doesn't do anything
I've managed to get it working if I use the RegexTransformer and the string is on the same line in my tika entity. But when the string is multiline it isn't working, even though I tried ?s to set the dotall flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text"/>
</entity>

Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text"/>

In javascript this works, but maybe only because I used a small string.

On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote: Is there any chance that you changed your schema since you indexed the data? If so, re-index the data. If a * query finds nothing, that implies that the default field is empty. Are you sure the df parameter is set to the field containing your data? Show us your request handler definition and a sample of your actual Solr input (Solr XML or JSON?) so that we can see which fields are being populated. -- Jack Krupansky
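The ((.|\n|\r)+) workaround and an inline (?s) flag match the same text, which is worth seeing side by side; in Java's regex engine (the one the DIH RegexTransformer uses) the repeated alternation recurses once per matched character and can throw StackOverflowError on long documents, which is the likely cause of the overflow above. A sketch, written in Python only because its re engine accepts the same pattern syntax; the sample string is made up:

```python
import re

text = "<body>line one\r\nline two</body>"

# The workaround pattern from the entity above: an alternation repeated
# once per character so that newlines can be consumed.
with_alternation = re.search(r"<body>((?:.|\n|\r)+)</body>", text).group(1)

# Same result with the inline DOTALL flag, with no per-character alternation.
with_dotall = re.search(r"(?s)<body>(.+)</body>", text).group(1)

assert with_alternation == with_dotall == "line one\r\nline two"
```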
Re: charfilter doesn't do anything
ok, I have html pages like <html>...<!--body-->content i want<!--/body-->...</html>. I want to extract (index, store) only what is between the body comments. I thought RegexTransformer would be best, because xpath doesn't work in tika and I can't nest an XPathEntityProcessor to use xpath. What I have also found out is that the htmlparser from tika cuts my body comments out and tries to make well-formed html, which I would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and the string is on the same line in my tika entity. but when the string is multiline it isn't working even though i tried ?s to set the dotall flag.

Sounds like we've got an XY problem here. http://people.apache.org/~hossman/#xyproblem How about you tell us *exactly* what you'd actually like to have happen, and then we can find a solution for you? It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Something that I already said: by using the KeywordTokenizer, you won't be able to search for individual words in your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor. Thanks, Shawn
charfilter doesn't do anything
I would like to filter / replace a word during indexing, but it doesn't do anything and I don't get an error. In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question is: where can I say that the expression is multiline? In javascript I can use /m at the end of the pattern.
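On the second question: the Java regex engine behind PatternReplaceCharFilterFactory has no trailing /m; flags go inline at the start of the pattern, and JavaScript's /m ((?m), MULTILINE) is a different thing from matching across lines ((?s), DOTALL). A small illustration, written in Python only because its re engine accepts the same inline-flag syntax; the sample text is made up:

```python
import re

text = "erste Zeile Zahlungsverkehr\nzweite Zeile"

# JavaScript's /m maps to the inline (?m) flag: it only changes where
# ^ and $ match (at every line boundary instead of string boundaries).
assert re.search(r"^zweite", text) is None
assert re.search(r"(?m)^zweite", text) is not None

# What "match across lines" usually needs is (?s), the DOTALL flag,
# which makes "." also match newline characters.
assert re.search(r"Zahlungsverkehr.*zweite", text) is None
assert re.search(r"(?s)Zahlungsverkehr.*zweite", text) is not None
```

So a char filter pattern meant to span line breaks would start with (?s) rather than (?m).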
Re: dataimporter tika doesn't extract certain div
So could I just nest it in an XPathEntityProcessor to filter the html, or is there something like xpath for tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}" forEach="/div[@id='content']" dataSource="main">
  <entity name="tika" processor="TikaEntityProcessor" url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
    <field column="text"/>
  </entity>
</entity>

But now I don't know how to pass the text to tika; what do I put in url and dataSource?

On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote: I don't know much about Tika, but in the example data-config.xml that you posted, the xpath attribute on the field "text" won't work, because the xpath attribute is used only by an XPathEntityProcessor. -- Regards, Shalin Shekhar Mangar.
Re: dataimporter tika doesn't extract certain div
Or could I use a filter in schema.xml, where I define a fieldtype and use some filter that understands xpath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote: No, that wouldn't work. It seems that you probably need a custom Transformer to extract the right div content. I do not know if TikaEntityProcessor supports such a thing. -- Regards, Shalin Shekhar Mangar.
dataimporter tika doesn't extract certain div
I want tika to only index the content in <div id="content">...</div> for the field text. Unfortunately it's indexing the whole page. Can't xpath do this?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
        <field column="text" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>
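Outside of DIH, the content div can also be pulled out before the document ever reaches Tika. A rough sketch of that idea using Python's stdlib html.parser; the class name and sample markup are made up for illustration:

```python
from html.parser import HTMLParser

class DivContentExtractor(HTMLParser):
    """Collect the text inside <div id="content">...</div>, nested tags included."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # div-nesting depth while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # a div nested inside the target
            elif dict(attrs).get("id") == "content":
                self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

html = ("<html><body><div id='nav'>menu</div>"
        "<div id='content'>nur das <b>will</b> ich sehen</div>"
        "<div id='footer'>x</div></body></html>")
parser = DivContentExtractor()
parser.feed(html)
text = "".join(parser.chunks).strip()
```

The extracted text could then be posted to Solr as a plain field, sidestepping both Tika's HTML rewriting and the xpath limitation.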
Re: dataimporter tika fields empty
OK, but I'm not doing any path extraction, at least I don't think so. htmlMapper="identity" isn't preserving the HTML. It is reading the content of the pages, but it's only putting it into text_test and not into text; the copyField isn't working. data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity">
        <field column="text" name="text_test"/>
        <copyField source="text_test" dest="text"/>
        <!-- <field column="text_test" xpath="//div[@id='content']"/> -->
      </entity>
    </entity>
  </document>
</dataConfig>

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

Ah. That's because the Tika processor does not support path extraction. You need to nest one more level. Regards, Alex

On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote:

I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor? data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb
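A side note on the config above: in stock Solr, copyField is a schema.xml directive, not a DIH data-config element, so a <copyField/> placed inside a Tika entity is not applied by the importer. A hedged sketch of the usual split (field and type names are taken from the thread; treat the exact attributes as assumptions, not a verified fix):

```xml
<!-- data-config.xml: map Tika's output into text_test only -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        onError="skip" htmlMapper="identity">
  <field column="text" name="text_test"/>
</entity>

<!-- schema.xml: let Solr copy text_test into text at index time -->
<copyField source="text_test" dest="text"/>
```

With the copy rule in schema.xml, anything the importer writes to text_test lands in text as well, which is what the copyField inside the entity appears to be trying to do.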
Re: dataimporter tika fields empty
I changed the following line (xpath):

<field column="text" xpath="//div[@id='content']" name="text_test"/>

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

Ah. That's because the Tika processor does not support path extraction. You need to nest one more level. Regards, Alex

On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote:

I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
dataimporter tika fields empty
I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
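The goal in the message above, keeping only what is inside <div id="content">, can be sanity-checked outside Solr. Below is a minimal, hypothetical Python sketch (not part of the thread) that mimics what the //div[@id='content'] XPath is expected to select from an HTML page; it is only an offline illustration, not how the TikaEntityProcessor works internally.

```python
# Sketch: collect the text inside <div id="content">, including nested divs,
# using only the standard library (HTML is rarely well-formed enough for an
# XML parser, hence html.parser instead of xml.etree).
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Accumulates character data while the parser is inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div-nesting depth inside the target div (0 = outside)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1            # nested div inside the target
            elif dict(attrs).get("id") == "content":
                self.depth = 1             # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_content(html: str) -> str:
    """Return the concatenated text of <div id="content">, stripped."""
    parser = ContentDivExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

For example, extract_content('<div id="x">a</div><div id="content">Hello <b>world</b></div>') keeps only the text of the content div and ignores the sibling div.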
Re: dataimporter tika fields empty
I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
Re: dataimporter tika fields empty
I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
Re: dataimporter, custom fields and parsing error
I have tried post.jar and it works when I set the literal.id in solrconfig.xml. I can't pass the id with post.jar (-Dparams=literal.id=abc) because I get an error: could not find or load main class .id=abc.

On 20. Jul 2013, at 7:05 PM, Andreas Owen wrote:

path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such output.

On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:

Are the path and text fields set to stored in the schema.xml?

On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen a...@conx.ch wrote:

They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

Dumb question: they are in your schema? Spelled right, in the right section, using types that are also defined? Can you populate them by hand with a CSV file and post.jar?

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>

--
Regards, Shalin Shekhar Mangar.
Re: dataimporter, custom fields and parsing error
They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

Dumb question: they are in your schema? Spelled right, in the right section, using types that are also defined? Can you populate them by hand with a CSV file and post.jar?

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>
Re: dataimporter, custom fields and parsing error
path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such output.

On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:

Are the path and text fields set to stored in the schema.xml?

On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen a...@conx.ch wrote:

They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

Dumb question: they are in your schema? Spelled right, in the right section, using types that are also defined? Can you populate them by hand with a CSV file and post.jar?

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>

--
Regards, Shalin Shekhar Mangar.
dataimporter, custom fields and parsing error
I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>
Re: solr autodetectparser tikaconfig dataimporter error
I have now changed some things and the import runs without error. In schema.xml I haven't got the field text but contentsExact. Unfortunately the text (from the file) isn't indexed, even though I mapped it to the proper field. What am I doing wrong?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="f" processor="FileListEntityProcessor" baseDir="C:\web\development\tkb\internet\public" fileName="${rec.id}" dataSource="data" onError="skip">
        <entity name="tika" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}">
          <field column="text" name="contentsExact"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

I noticed that when I move the field Author into the tika entity it isn't indexed. Can this have something to do with why the text from the file isn't indexed? Do I have to do something special about the entity levels in document?

PS: how do I import tstamp? It's a static value.

On 14. Jul 2013, at 10:30 PM, Jack Krupansky wrote:

"Caused by: java.lang.NoSuchMethodError:" That means you have some out-of-date jars, or some newer jars mixed in with the old ones. -- Jack Krupansky

-Original Message- From: Andreas Owen Sent: Sunday, July 14, 2013 3:07 PM To: solr-user@lucene.apache.org Subject: Re: solr autodetectparser tikaconfig dataimporter error

Hi, is there no one with an idea what this error is, or who can give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse? Thanks for any help.

On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

I am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index (txt, cfm, pdf), all give the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
  ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
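On the "how do I import tstamp, it's a static value?" question above: one way the DataImportHandler can emit a constant field is the TemplateTransformer, which fills a column from a template string that may be a literal. A hedged config sketch (attribute values are assumptions based on the thread, not a tested fix):

```xml
<!-- declare the transformer on the entity, then give the field a literal template -->
<entity name="rec" processor="XPathEntityProcessor"
        url="docImport.xml" forEach="/albums/album" dataSource="main"
        transformer="TemplateTransformer">
  <field column="tstamp" template="2013-07-05T14:59:46.889Z"/>
</entity>
```

Because the template contains no ${...} placeholders, every row (and therefore every document) gets the same static value in tstamp.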
Re: solr autodetectparser tikaconfig dataimporter error
Hi, is there no one with an idea what this error is, or who can give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse? Thanks for any help.

On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

I am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index (txt, cfm, pdf), all give the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
  ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
  ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:

<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="contents" xpath="//description"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The libs are included and declared in the logs. I have also tried tika-app 1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
solr autodetectparser tikaconfig dataimporter error
I am using Solr 3.5, tika-app-1.4 and TagSoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index (txt, cfm, pdf), it's always the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
	at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
	... 6 more
Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
	(same stack trace as above, caused by the same NoSuchMethodError in TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122))
Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:

<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="contents" xpath="//description" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>

The libs are included and declared in the logs; I have also tried tika-app 1.0 and TagSoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
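A `NoSuchMethodError` at runtime (rather than at compile time) usually means the class that got loaded comes from a different jar version than the one the calling code was compiled against — here, the Tika jars Solr actually loads do not provide `AutoDetectParser.setConfig(TikaConfig)`. As a first diagnostic step, a minimal sketch (the install path below is hypothetical; adjust it to your Solr home) is to list every Tika jar the install can see and check for two versions shadowing each other:

```shell
# Hypothetical install root; adjust to your Solr home directory.
SOLR_ROOT=/opt/solr

# List every Tika jar on disk under the install. Two different
# tika-core/tika-parsers versions, or a tika-app jar sitting next to
# them, shadowing each other is the usual cause of a NoSuchMethodError
# like the one in the stack trace above.
find "$SOLR_ROOT" -name 'tika-*.jar' 2>/dev/null | sort
```

If more than one version shows up, removing the stale jars so that only one consistent Tika version remains on the classpath is the typical fix for this class of error.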