Re: ngramfilter minGramSize problem
it works well. Now, why does the search only find something when the field name is added to a query containing stopwords?

  cug                 - 9 hits
  mit cug             - 0 hits
  plain_text:mit cug  - 9 hits

Why is this so? Could it be a problem that stopwords aren't removed from the query because not all fields that are searched have the stopword filter?

On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI <furkankam...@gmail.com> wrote:

Correction: My patch is at SOLR-5152

On 7 Apr 2014 at 01:05, Andreas Owen <ao...@swissonline.ch> wrote:

i thought i could use <filter class="solr.LengthFilterFactory" min="1" max="2"/> to index and search words that are only 1 or 2 chars long. It seems to work, but I have to test it some more.

On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen <ao...@swissonline.ch> wrote:
[...]

-- Using Opera's mail client: http://www.opera.com/mail/
ngramfilter minGramSize problem
I have a fieldtype that uses the ngram filter while indexing. Is there a setting that can force the ngram filter to also index words smaller than minGramSize? Mine is set to 3, and the search won't find words that are only 1 or 2 chars long. I would prefer not to set minGramSize=1 because the results would be too diverse.

Fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
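Not part of the original mail, but the behaviour is easy to reproduce outside Solr: an n-gram filter with minGramSize=3 simply emits nothing for tokens shorter than 3 characters, so short words never reach the index. A minimal Python sketch (the function name is mine, not Solr's):

```python
def ngrams(token, min_size=3, max_size=50):
    """Emit all n-grams of a token, like solr.NGramFilterFactory would."""
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]

# A 3-char word produces a gram and is findable...
print(ngrams("cug"))   # ['cug']
# ...but 1-2 char tokens produce no grams at all, hence 0 hits.
print(ngrams("yh"))    # []
```

This is why lowering minGramSize is the only knob on the filter itself; anything shorter than the minimum is silently dropped.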
Re: ngramfilter minGramSize problem
i thought i could use

<filter class="solr.LengthFilterFactory" min="1" max="2"/>

to index and search words that are only 1 or 2 chars long. It seems to work, but I have to test it some more.

On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen <ao...@swissonline.ch> wrote:
[...]

-- Using Opera's mail client: http://www.opera.com/mail/
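To illustrate the workaround above (my sketch, not the actual Solr filter classes): the idea is that 1-2 char tokens are kept as-is while longer tokens are n-grammed, so both end up as index terms. Note that in Solr itself a single LengthFilter removes everything outside its range, so this would need two filter branches or a copyField into a second field:

```python
def ngrams(token, min_size=3, max_size=50):
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]

def index_terms(tokens):
    """Short tokens (1-2 chars) pass through; longer ones are n-grammed."""
    terms = []
    for t in tokens:
        if 1 <= len(t) <= 2:          # what LengthFilterFactory min=1 max=2 keeps
            terms.append(t)
        terms.extend(ngrams(t))       # what NGramFilterFactory minGramSize=3 emits
    return terms

print(index_terms(["yh", "cug"]))     # ['yh', 'cug'] - both now searchable
```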
dih data-config.xml onImportEnd event
I would like to call a URL after the import is finished, using the onImportEnd event of the <document> element (<document onImportEnd="...">). How can I do this?
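A sketch of what I'd expect to work, with a hypothetical listener class name: DIH's <document> element accepts onImportStart/onImportEnd attributes naming a class that implements org.apache.solr.handler.dataimport.EventListener, and that listener can fire the HTTP call:

```xml
<!-- data-config.xml: com.example.NotifyUrlListener is a hypothetical class name -->
<document onImportEnd="com.example.NotifyUrlListener">
  <!-- entities as before -->
</document>
```

The class would implement EventListener and perform the HTTP request in its onEvent(Context) method; the jar has to be on Solr's classpath.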
facet doesn't display all possibilities after selecting one
When I select a facet value in thema_f, all the other values in that group disappear, while the other facet fields keep their original counts. It seems like it should work. Maybe the underscore is the wrong character for the separator?

Example documents in index:

<doc>
  <arr name="thema_f"><str>1_Produkte</str></arr>
  <str name="id">dms:381</str>
</doc>
<doc>
  <arr name="thema_f"><str>1_Beratung</str><str>1_Beratung_Beratungsportal PK</str></arr>
  <str name="id">dms:2679</str>
</doc>
<doc>
  <arr name="thema_f"><str>1_Beratung</str><str>1_Beratung_Beratungsportal PK</str></arr>
  <str name="id">dms:190</str>
</doc>

solrconfig.xml:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 productsegment^5 productgroup^5 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.missing">false</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=productsegment_f}productsegment_f</str>
    <str name="f.productsegment_f.facet.sort">index</str>
    <str name="facet.field">{!ex=productgroup_f}productgroup_f</str>
    <str name="f.productgroup_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.field">{!ex=kundensegment_aktive_beratung}kundensegment_aktive_beratung</str>
    <str name="f.kundensegment_aktive_beratung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>

schema.xml:

<fieldType name="text_thema" class="solr.TextField" positionIncrementGap="100">
  <!-- <analyzer><tokenizer class="solr.PatternTokenizerFactory" pattern="_"/></analyzer> -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
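One guess from outside (the actual fq being sent isn't shown in this mail): with facet.field={!ex=thema_f}thema_f, the exclusion only takes effect if the filter query applied when the user clicks a facet value carries the matching tag, e.g.:

```text
fq={!tag=thema_f}thema_f:"1_Beratung"
```

Without {!tag=thema_f} on the fq, thema_f's own counts are computed with the filter applied, which would make the sibling values disappear exactly as described, while the other facet fields (whose ex names reference different tags) are unaffected.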
Re: dih data-config.xml onImportEnd event
Sorry, the previous conversation was started with a wrong email address.

On Thu, 27 Mar 2014 14:06:57 +0100, Stefan Matheis <matheis.ste...@gmail.com> wrote:

I would suggest you read the replies to your last mail (containing the very same question) first?

-Stefan

On Thursday, March 27, 2014 at 1:56 PM, Andreas Owen wrote:

I would like to call a URL after the import is finished, using the onImportEnd event of the <document> element. How can I do this?

-- Using Opera's mail client: http://www.opera.com/mail/
wrong query results with wdf and ngtf
Is there a way to tell NGramFilterFactory while indexing that numbers shall never be tokenized? Then the query should be able to find numbers. Or do I have to change the ngram min for numbers (not alpha) to 1, if that is possible? So to speak, index the whole number as one token instead of all possible grams. Solr's analysis page shows that only WDF has no underscore in its tokens; the rest keep it. Can I tell the query to search numbers differently with NGTF, WT, LCF or whatever?

I also tried

<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>

with at-under-alpha.txt:

@ => ALPHA
_ => ALPHA

I have gotten nearly everything to work. There are two queries where I don't get back what I want:

avaloq frage 1 - only returns if I set minGramSize=1 while indexing
yh_cug - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Solrconfig:

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
  </lst>
</requestHandler>
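The yh_cug mismatch described above can be shown with a toy model of the two analyzer chains (my simplification, not the real Lucene filters): the index side runs WDF and splits on the underscore, while a query side without that split does not, so the query term never matches any indexed term:

```python
import re

def index_analyze(text):
    """Toy index chain: lowercase, then WDF-style split on '_'."""
    parts = re.split(r"_", text.lower())
    return parts + ["".join(parts)]   # catenateWords=1 also emits the joined form

def query_analyze(text):
    """Toy query chain where the underscore survives."""
    return [text.lower()]

indexed = set(index_analyze("yh_cug"))   # {'yh', 'cug', 'yhcug'}
queried = set(query_analyze("yh_cug"))   # {'yh_cug'}
print(indexed & queried)                 # empty set - no overlap, hence 0 hits
```

The fix is always to make both sides agree on what the underscore means, one way or the other.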
search for single-char numbers when ngram min is 3
Is there a way to tell NGramFilterFactory while indexing that numbers shall never be tokenized? Then the query should be able to find numbers. Or do I have to change the ngram min for numbers to 1, if that is possible? So to speak, index the whole number as one token instead of all possible grams. Or can I tell the query to search numbers differently with WT, LCF or whatever? I attached a doc with screenshots from the Solr analyzer.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 13 March 2014 13:44
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

I have gotten nearly everything to work. There are two queries where I don't get back what I want:

avaloq frage 1 - only returns if I set minGramSize=1 while indexing
yh_cug - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype: [...]

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Hi Jack, do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest: [...]

-----Original Message-----
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org
Date: 12/03/2014 13:25
Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help.
underscore in query error
If I use the underscore in the query, I don't get any results. If I remove the underscore, it finds the docs containing the underscore. Can I tell Solr to search through the NGTF instead of the WDF, or is there a better solution? Query: yh_cug. I attached a doc with the analyzer output.
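For reference, the types-file route tried elsewhere in this thread, sketched as a config fragment (file name as used in the thread): telling WordDelimiterFilterFactory to treat _ and @ as ordinary letters, in both the index and the query analyzer, keeps yh_cug intact on both sides:

```xml
<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
```

with at-under-alpha.txt containing:

```text
@ => ALPHA
_ => ALPHA
```

The key point is that the same types file has to appear in both analyzers; if only one side keeps the underscore, the terms can never match.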
RE: use local param in solrconfig fq for access-control
I have given up on this idea and made a wrapper which adds an fq with the user roles to each request.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Tuesday, 11 March 2014 23:32
To: solr-user@lucene.apache.org
Subject: use local param in solrconfig fq for access-control

I would like to use $r and $org for access control. It has to allow the fq's from my facets to work as well. I'm not sure if I'm doing it right, or if I should add it to a qf or the q itself. debugQuery returns a parsed fq string, and in it $r and $org are printed instead of their values. How do I get them to be interpreted? The local params are listed in the response, so they should be valid.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>
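The wrapper approach mentioned above might look roughly like this (hypothetical names; the real wrapper obviously lives in whatever fronts Solr): every outgoing request gets an extra fq built from the authenticated user's roles and organisations, so the access control never depends on Solr-side parameter substitution:

```python
def solr_params(q, roles, orgs):
    """Build Solr request params, appending an access-control fq
    from the user's roles/organisations (hypothetical field names)."""
    acl_fq = "(+organisations:({orgs}) +roles:({roles}))".format(
        orgs=" OR ".join(orgs), roles=" OR ".join(roles))
    return {"q": q, "fq": [acl_fq], "wt": "json"}

params = solr_params("cug", roles=["sales", "hr"], orgs=["zh"])
print(params["fq"])   # ['(+organisations:(zh) +roles:(sales OR hr))']
```

Facet filter queries can then be appended to the same fq list without interfering with the ACL clause.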
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards
I have gotten nearly everything to work. There are two queries where I don't get back what I want:

avaloq frage 1 - only returns if I set minGramSize=1 while indexing
yh_cug - the query parser doesn't remove the _ but the indexer does (WDF), so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype: [...]

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Hi Jack, do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest: [...]

-----Original Message-----
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org
Date: 12/03/2014 13:25
Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help. Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.

-- Jack Krupansky

-----Original Message-----
From: Andreas Owen
Sent: Wednesday, March 12, 2014 6:20 AM
To: solr-user@lucene.apache.org
Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards

I now have the following: [...]
RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
I now have the following: analyzer type=query tokenizer class=solr.WhiteSpaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory types=at-under-alpha.txt/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_de.txt format=snowball enablePositionIncrements=true/ !-- remove common words -- filter class=solr.GermanNormalizationFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German/ /analyzer The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results? Output: lst name=debug str name=rawquerystringyh_cug/str str name=querystringyh_cug/str str name=parsedquery(+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord/str str name=parsedquery_toString+(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0/str lst name=explain/ arr name=expandedSynonyms stryh_cug/str /arr lst name=reasonForNotExpandingSynonyms str name=nameDidntFindAnySynonyms/str str name=explanationNo synonyms found for this query. 
Check your synonyms file./str /lst lst name=mainQueryParser str name=QParserExtendedDismaxQParser/str null name=altquerystring/ arr name=boost_queries str(expiration:[NOW TO *] OR (*:* -expiration:*))^6/str /arr arr name=parsed_boost_queries str(expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0/str /arr arr name=boostfuncs strdiv(clicks,max(displays,1))^8/str /arr /lst lst name=synonymQueryParser str name=QParserExtendedDismaxQParser/str null name=altquerystring/ arr name=boostfuncs strdiv(clicks,max(displays,1))^8/str /arr /lst lst name=timing -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Dienstag, 11. März 2014 14:25 To: solr-user@lucene.apache.org Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 uppwards The usual use of an ngram filter is at index time and not at query time. What exactly are you trying to achieve by using ngram filtering at query time as well as index time? Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer - the later removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF. Which query parser are you using? What fields are being queried? Please post the parsed query string from the debug output - it will show the precise generated query. I think what you are seeing is that the ngram filter is generating tokens like h_cugtest and then the WDF is removing the underscore and then h gets generated as a separate token. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Tuesday, March 11, 2014 5:09 AM To: solr-user@lucene.apache.org Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards I got it roght the first time and here is my requesthandler. 
The field plain_text is searched correctly and has the same fieldtype as title (text_de).

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
Yes, that is exactly what happened in the analyzer. The term I searched for was listed on both sides (index and query). Here's the rest:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- in this example, we will only use synonyms at query time
  <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
  -->
  <!-- Case insensitive stop word removal. enablePositionIncrements=true ensures that a 'gap' is left to allow for accurate phrase queries. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

-Original Message- From: Jack Krupansky j...@basetechnology.com To: solr-user@lucene.apache.org Date: 12/03/2014 13:25 Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 uppwards

You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help. Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.
-- Jack Krupansky
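Jack's point about index/query asymmetry for the WordDelimiterFilter can be sketched as follows. This is an illustrative example, not the poster's actual schema: at index time WDF both splits and catenates (so "yh_cug" can index "yh", "cug" and "yhcug"), while at query time it only splits, so every query-side variant has a chance to match something that was indexed.

```xml
<!-- Hypothetical sketch: index side generates more terms than query side -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- split on delimiters AND catenate the parts back together -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- only split; no catenation, so fewer query-side variants -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```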
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
Hi Jack, do you know how I can use local parameters in my solrconfig? The params are visible in the debugQuery output, but Solr doesn't parse them.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>

-Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Wednesday, 12 March 2014 14:44 To: solr-user@lucene.apache.org Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards
searches for single char tokens instead of from 3 uppwards
I have a field with the following type:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>

Shouldn't this make tokens from 3 to 15 in length, and not from 1? Here is a query report for 2 of the results:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">125</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="fl">title,roles,organisations,id</str>
    <str name="indent">true</str>
    <str name="q">yh_cugtest</str>
    <str name="_">1394522589347</str>
    <str name="wt">xml</str>
    <str name="fq">organisations:* roles:*</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0"> ..
str name=dms:2681 1.6365329 = (MATCH) sum of: 1.6346203 = (MATCH) max of: 0.14759353 = (MATCH) product of: 0.28596246 = (MATCH) sum of: 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.0119499 = (MATCH) weight(plain_text:ugt in 0) [DefaultSimilarity], result of: 0.0119499 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.031227252 = queryWeight, product of: 4.8982444 = idf(docFreq=18, maxDocs=937) 0.0063751927 = queryNorm 0.38267535 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.8982444 = idf(docFreq=18, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhc in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:hcu in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, 
maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:cugt in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhcu in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of:
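The single-character grams in the explain output above fit Jack's diagnosis: with NGramFilterFactory running before WordDelimiterFilterFactory, a gram such as "h_c" is later split by the WDF into one-character tokens like "h". A hedged sketch of the usual fix is to run the WDF before the n-gram filter, so grams are built from already-split word parts and respect minGramSize (sketch only, not the poster's verified config):

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- split yh_cugtest into "yh" / "cugtest" first -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" splitOnCaseChange="1"/>
  <!-- then build grams of length 3..15 from the word parts -->
  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
```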
Re: SOLVED searches for single char tokens instead of from 3 uppwards
Sorry, I looked at the wrong fieldtype.

-Original Message- From: Andreas Owen a...@conx.ch To: solr-user@lucene.apache.org Date: 11/03/2014 08:45 Subject: searches for single char tokens instead of from 3 uppwards
RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards
I got it right the first time, and here is my requesthandler. The field plain_text is searched correctly and has the same fieldtype as title (text_de).

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">synonyms.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.fragSize">200</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- lst name=invariants -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp_s}inhaltstyp_s</str>
    <str name="f.inhaltstyp_s.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung_s}veranstaltung_s</str>
    <str name="f.veranstaltung_s.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
query with local params
This works great, but I would like to use local params r and org instead of the hard-coded values:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72))</str>

I would like:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:($org) +roles:($r))</str>

Shouldn't the numbers be in the output below (parsed_filter_queries), and not $r and $org? I use this in my requesthandler and need it to be added as fq or query params without being able to be overridden; has anybody any ideas? Oh, and I use facets, so the fq has to be combinable. Debug query:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">109</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="r">267</str>
    <str name="q">yh_cug</str>
    <str name="_">1394533792473</str>
    <str name="wt">xml</str>
  </lst>
</lst>
...
<arr name="filter_queries">
  <str>{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</arr>
<arr name="parsed_filter_queries">
  <str>(MatchAllDocsQuery(*:*) -organisations:[ TO *] -roles:[ TO *]) (+organisations:$org +roles:$r) (-organisations:[ TO *] +roles:$r) (+organisations:$org -roles:[ TO *])</str>
</arr>
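One general Solr detail is relevant here: parameter dereferencing with $name is resolved only where local parameters are parsed (for example as a v=$name value) and in function queries, not in the middle of a plain query string, which is why the literal $org and $r survive into parsed_filter_queries. A hedged sketch of a common workaround is to move the whole clause into a dereferenced parameter; the parameter name aclq is illustrative, not from the thread:

```xml
<lst name="invariants">
  <!-- the fq is just a dereference; the actual query text lives in $aclq -->
  <str name="fq">{!lucene q.op=OR v=$aclq}</str>
</lst>
<lst name="defaults">
  <!-- note: $org / $r would still not be substituted inside this string;
       the per-user values have to arrive as a request parameter named aclq -->
  <str name="aclq">(*:* -organisations:[* TO *] -roles:[* TO *])</str>
</lst>
```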
use local params in query
Shouldn't the numbers be in the output below (parsed_filter_queries), and not $r and $org? This works great, but I would like to use local params r and org instead of the hard-coded values:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72))</str>

I would like:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:($org) +roles:($r))</str>

I use this in my requesthandler under invariants because I need it to be added to the query without being able to be overridden. Oh, and I use facets, so the fq has to be combinable. This should work, or am I understanding it wrong? Debug query:

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">109</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="r">267</str>
    <str name="q">yh_cug</str>
    <str name="_">1394533792473</str>
    <str name="wt">xml</str>
  </lst>
</lst>
...
<arr name="filter_queries">
  <str>{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</arr>
<arr name="parsed_filter_queries">
  <str>(MatchAllDocsQuery(*:*) -organisations:[ TO *] -roles:[ TO *]) (+organisations:$org +roles:$r) (-organisations:[ TO *] +roles:$r) (+organisations:$org -roles:[ TO *])</str>
</arr>
use local param in solrconfig fq for access-control
I would like to use $r and $org for access control. It has to allow the fq's from my facets to work as well. I'm not sure if I'm doing it right, or if I should add it to a qf or to q itself. The debugQuery output returns a parsed fq string, and in it $r and $org are printed instead of their values. How do I get them to be interpreted? The local params are listed in the response, so they should be valid.

<lst name="invariants">
  <str name="fq">{!q.op=OR} (*:* -organisations:[ TO *] -roles:[ TO *]) (+organisations:($org) +roles:($r)) (-organisations:[ TO *] +roles:($r)) (+organisations:($org) -roles:[ TO *])</str>
</lst>
maxClauseCount is set to 1024
does this maxClauseCount go over each field individually or all put together? is it the date fields? when i execute a query i get this error: lst name=responseHeader int name=status500/int int name=QTime93/int lst name=params str name=indenttrue/str str name=qEin PDFchen als Dokument roles:*/str str name=_1394436617394/str str name=wtxml/str /lst /lst result name=response numFound=499 start=0 maxScore=0.40899447 doc . float name=score0.10604319/float /doc /result lst name=facet_counts lst name=facet_queries/ lst name=facet_fields lst name=inhaltstyp_s int name=Agenda390/int int name=Formular2/int int name=Formulare27/int int name=Für Dokumente only1/int int name=Für Websiten only1/int int name=Hilfsmittel3/int int name=Information3/int int name=Präsentation1/int int name=Regelung8/int int name=Schulung10/int int name=Schulung_ONL1/int int name=Test14/int int name=Weisung37/int int name=test1/int /lst lst name=doctype int name=doc1/int int name=docx4/int int name=htm8/int int name=pdf44/int int name=pptx4/int int name=vsd1/int int name=xlsx6/int /lst lst name=thema_f int name=1_57/int int name=1_Anleitungen11/int int name=1_Anleitungen_Ausbildung [Anleitungen]11/int int name=1_Ausbildung3/int int name=1_Ausbildung_Weiterbildung3/int int name=1_Beratung4/int int name=1_Beratung_Beratungsportal FK1/int int name=1_Beratung_Beratungsportal PK2/int int name=1_Beratung_Beratungsprozess1/int int name=1_Handlungsempfehlung2/int int name=1_Handlungsempfehlung_a2/int int name=1_Marktbearbeitung2/int int name=1_Marktbearbeitung_Events2/int int name=1_Produkte29/int int name=1_Weisungen1/int int name=1_Weisungen_Workplace [Weisungen]1/int /lst lst name=author_s int name=17/int int name=Aeschlimann Monika (MAE)1/int int name=Ancora Carlo (CAA)1/int int name=Bannwart Markus (MBA)4/int int name=Basse Detlev (DBS)1/int int name=Beerli Dominik (DBI)3/int int name=Bollinger Beat (BBO)5/int int name=Brunner Elisabeth (EBN)1/int int name=Brüschweiler Otto (OBR)5/int int name=Buric 
Aleksandra (ABC)1/int int name=Bächtold Eliane (EBA)2/int int name=Chieco Daniela (DCH)1/int int name=D'Adamo-Gähler Karin (KDA)1/int int name=Dannecker Dietmar (DDA)1/int int name=De Biasio Claudio (CDB)35/int int name=Donatsch Roman (RDO)1/int int name=Eberhart Livia (LET)2/int int name=Etter Alice (AET)26/int int name=Fankhauser Hausi (HFA)2/int int name=Frei Beat (BFI)1/int int name=Frick Patrick (PFR)2/int int name=Grasset André (AGT)3/int int name=Grava Reto (RGV)1/int int name=Gunterswiler Walter (WGU)1/int int name=Gürkan Simon (SGN)1/int int name=Heimbeck Markus (MHI)27/int int name=Helbling Andreas (AHG)3/int int name=Held Hans-Jörg (HHE)1/int int name=Helg Christoph (CHL)1/int int name=Hofer Astrid (AHO)3/int int name=Huber Kalevi (KHU)1/int int name=Huber Paul (PHU)1/int int name=Häberli Peter (PHI)3/int int name=Häfliger Gabriela (GHA)6/int int name=Hümbeli Isabelle (IHE)3/int int name=Isler Myriam (MIS)1/int int name=Jäger Andreas (AJA)2/int int name=Kasper Markus (MKP)2/int int name=Keller Reto (RKE)2/int int name=Knecht Urs (UKN)2/int int name=Kutter Benedikt (BKU)2/int int name=Kälin-Klay Sonja (SKY)28/int int name=Lutz René (RLU)4/int int name=Matanovic Jacques (JMT)2/int int name=Monti Mirko (MMO)1/int int name=Märki Susanne (SMA)16/int int name=Olimpio Marco (MOL)46/int int name=Pfister Nicole (NPF)1/int int name=Pozzi Anthony (ANP)5/int int name=Reinhard Martin (MRE)11/int int name=Reutlinger Graf Caroline (CRE)58/int int name=Roth Rolf (ROR)1/int int name=Rutz Mirco (MRT)2/int int name=Salvisberg Adrian (ASA)29/int int name=Sassano Marianna (MSN)2/int int name=Schaffhauser Carmen (CSR)2/int int name=Schoop Hans-Jörg (HSP)1/int int name=Schrieder Bernadette (BSD)1/int int name=Seeholzer Carola (CSZ)1/int int name=Storniolo Patrizia (PSO)9/int int name=Tanner-Ott Sara (STN)4/int int name=Tobler Tamara (TTO)75/int int name=Trefzer-Hug Cornelia (CTF)2/int int name=Uhlmann Heinz (HUH)2/int int name=Vettori Renato (RVE)1/int int name=Vogel Heinrich 
(HVO)2/int int name=Weibel Stephanie (SWL)2/int int name=Weinzerl Rudolf (RWE)1/int int name=Wellauer Pascal (PWL)4/int int name=Wild Ursula (UWD)1/int int name=Wuffli Markus (MWU)1/int int name=Wüthrich
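For the maxClauseCount question itself: the limit is configured in solrconfig.xml and is checked per generated BooleanQuery, so a single wildcard, prefix or range expansion on one field (such as roles:*) can exceed it on its own. If raising it is acceptable, the sketch below shows where the setting lives; 2048 is just an example value:

```xml
<!-- in solrconfig.xml, inside the <query> section -->
<query>
  <maxBooleanClauses>2048</maxBooleanClauses>
</query>
```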
set fq operator independently
I want to use the following in fq, and I need to set the operator to OR. My q.op is AND, but I need OR in fq. I have read about putting OR between multiple fq's, but that is not what I need. Can I set the operator for a single fq?

(-organisations:[ TO *] -roles:[ TO *]) (+organisations:(150 42) +roles:(174 72))

The statement should find all docs without organisations and roles, or those that have at least one roles and one organisations entry. These fields are multivalued.
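A local-params prefix can override the default operator for a single fq without touching the global q.op. A minimal sketch, reusing the field values from the message above:

```xml
<str name="fq">{!q.op=OR}(-organisations:[ TO *] -roles:[ TO *]) (+organisations:(150 42) +roles:(174 72))</str>
```

Alternatively, the same effect can be had without local params by writing OR explicitly between the two parenthesized groups.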
Re[2]: query parameters
OK, I like the logic; you can do much more with it. I think this should do it for me:

(-organisations:[ TO *] -roles:[ TO *]) (+organisations:(150 42) +roles:(174 72))

I want to use this in fq, and I need to set the operator to OR. My q.op is AND, but I need OR in fq. I have read about putting OR between multiple fq's, but that is not what I need. Can I set the operator for fq? The statement should find all docs without organisations and roles, or those that have at least one roles and one organisations entry. These fields are multivalued.

-Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Date: 19/02/2014 04:09 Subject: Re: query parameters

The Solr/Lucene query language is NOT strictly boolean; see Chris's excellent blog here: http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/ Best, Erick

On Tue, Feb 18, 2014 at 11:54 AM, Andreas Owen a...@conx.ch wrote: I tried it in the Solr admin query and it showed me all the docs without a value in organisations and roles. It didn't matter if I used a base term; isn't that given through the q parameter?

-Original Message- From: Raymond Wiker [mailto:rwi...@gmail.com] Sent: Tuesday, 18 February 2014 13:19 To: solr-user@lucene.apache.org Subject: Re: query parameters

That could be because the second condition does not do what you think it does... have you tried running the second condition separately? You may have to add a base term to the second condition, like what you have for the bq parameter in your config file; i.e., something like (*:* -organisations:[ TO *] -roles:[ TO *])

On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen a...@conx.ch wrote: It seems that fq doesn't accept OR, because (organisations:(150 OR 41) AND roles:(174)) OR (-organisations:[ TO *] AND -roles:[ TO *]) only returns docs that match the first condition. It doesn't return any docs with the empty fields organisations and roles.

-Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Monday, 17 February 2014 05:08 To: solr-user@lucene.apache.org Subject: query parameters

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to force the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rules 2 and 3

Snippet of what I've got (I haven't checked whether there is an "in" operator like in SQL for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
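Raymond's "base term" advice in the quoted exchange comes from how Lucene evaluates purely negative clauses: a parenthesized group containing only negations matches nothing on its own, because there is no positive set of documents to subtract from (Solr special-cases a pure-negative query only at the top level). Prefixing *:* supplies that set. A sketch using the fields from this thread:

```text
(-organisations:[ TO *] -roles:[ TO *])        matches nothing as a subclause:
                                               there is nothing to subtract from
(*:* -organisations:[ TO *] -roles:[ TO *])    matches all docs where both
                                               fields are empty
```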
RE: query parameters
It seems that fq doesn't accept OR, because

(organisations:(150 OR 41) AND roles:(174)) OR (-organisations:[* TO *] AND -roles:[* TO *])

only returns docs that match the first condition. It doesn't return any docs with empty organisations and roles fields.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 17 February 2014 05:08
To: solr-user@lucene.apache.org
Subject: query parameters

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to enforce the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an IN operator, like in SQL, for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
RE: query parameters
I tried it in the Solr admin query page and it showed me all the docs without a value in organisations and roles. It didn't matter whether I used a base term -- isn't that given through the q parameter?

-----Original Message-----
From: Raymond Wiker [mailto:rwi...@gmail.com]
Sent: Tuesday, 18 February 2014 13:19
To: solr-user@lucene.apache.org
Subject: Re: query parameters

That could be because the second condition does not do what you think it does... have you tried running the second condition separately? You may have to add a base term to the second condition, like what you have for the bq parameter in your config file, i.e. something like

(*:* -organisations:[* TO *] -roles:[* TO *])

On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen a...@conx.ch wrote:

It seems that fq doesn't accept OR, because

(organisations:(150 OR 41) AND roles:(174)) OR (-organisations:[* TO *] AND -roles:[* TO *])

only returns docs that match the first condition. It doesn't return any docs with empty organisations and roles fields.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 17 February 2014 05:08
To: solr-user@lucene.apache.org
Subject: query parameters

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to enforce the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an IN operator, like in SQL, for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
query parameters
In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to enforce the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an IN operator, like in SQL, for the list value):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
  <str name="bf">div(clicks,max(displays,1))^8</str> <!-- tested -->
admin gui right side not loading
I'm using Solr 4.3.1 and have installed it on a Windows 2008 server. Solr itself is working, for example import and search. But the right side of the admin GUI isn't loading, and I get a JavaScript error for several d3 objects. The last error is:

Load timeout for modules: lib/order!lib/jquery.autogrow lib/order!lib/jquery.cookie lib/order!lib/jquery.form lib/order!lib/jquery.jstree lib/order!lib/jquery.sammy lib/order!lib/jquery.timeago lib/order!lib/jquery.blockUI lib/order!lib/highlight lib/order!lib/linker lib/order!lib/ZeroClipboard lib/order!lib/d3 lib/order!lib/chosen lib/order!scripts/app lib/order!scripts/analysis lib/order!scripts/cloud lib/order!scripts/cores lib/order!scripts/dataimport lib/order!scripts/dashboard lib/order!scripts/file lib/order!scripts/index lib/order!scripts/java-properties lib/order!scripts/logging lib/order!scripts/ping lib/order!scripts/plugins lib/order!scripts/query lib/order!scripts/replication lib/order!scripts/schema-browser lib/order!scripts/threads lib/jquery.autogrow lib/jquery.cookie lib/jquery.form lib/jquery.jstree lib/jquery.sammy lib/jquery.timeago lib/jquery.blockUI lib/highlight lib/linker lib/ZeroClipboard lib/d3 lib/chosen scripts/app scripts/analysis scripts/cloud scripts/cores scripts/dataimport scripts/dashboard scripts/file scripts/index scripts/java-properties scripts/logging scripts/ping scripts/plugins scripts/query scripts/replication scripts/schema-browser scripts/threads
http://requirejs.org/docs/errors.html#timeout

There are no apparent errors in the log file, and the exact same conf is working on another server. What can I do?
RE: json update moves doc to end
of: 4.349904 = idf(docFreq=29, maxDocs=855) 0.0070840283 = queryNorm 0.1359345 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.349904 = idf(docFreq=29, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.006139375 = (MATCH) weight(plain_text:berich in 0) [DefaultSimilarity], result of: 0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.037305873 = queryWeight, product of: 5.266195 = idf(docFreq=11, maxDocs=855) 0.0070840283 = queryNorm 0.16456859 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.266195 = idf(docFreq=11, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.0059541636 = (MATCH) weight(plain_text:ericht in 0) [DefaultSimilarity], result of: 0.0059541636 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.036738846 = queryWeight, product of: 5.186152 = idf(docFreq=12, maxDocs=855) 0.0070840283 = queryNorm 0.16206725 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.186152 = idf(docFreq=12, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.006139375 = (MATCH) weight(plain_text:bericht in 0) [DefaultSimilarity], result of: 0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.037305873 = queryWeight, product of: 5.266195 = idf(docFreq=11, maxDocs=855) 0.0070840283 = queryNorm 0.16456859 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.266195 = idf(docFreq=11, maxDocs=855) 0.03125 = fieldNorm(doc=0) 7.054 = (MATCH) weight(editorschoice:bericht^200.0 in 0) [DefaultSimilarity], result of: 7.054 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.749 = queryWeight, product of: 200.0 = boost 7.0579543 = idf(docFreq=1, maxDocs=855) 7.0840283E-4 = queryNorm 7.0579543 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 7.0579543 = idf(docFreq=1, maxDocs=855) 1.0 = fieldNorm(doc=0) 0.0021252085 = (MATCH) product of: 0.004250417 = (MATCH) sum of: 0.004250417 = (MATCH) sum of: 
0.004250417 = (MATCH) MatchAllDocsQuery, product of: 0.004250417 = queryNorm 0.5 = coord(1/2) -Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of: -Infinity = log(int(clicks)=0) 8.0 = boost 7.0840283E-4 = queryNorm /str -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Dienstag, 3. Dezember 2013 20:30 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end Try adding debug=all and you'll see exactly how docs are scored. Also, it'll show you exactly how your query is parsed. Paste that if it's confused, it'll help figure out what's going wrong. On Tue, Dec 3, 2013 at 1:37 PM, Andreas Owen a...@conx.ch wrote: So isn't it sorted automaticly by relevance (boost value)? If not do should i set it in solrconfig? -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Dienstag, 3. Dezember 2013 19:07 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end What order, the order if you supply no explicit sort at all? Solr does not make any guarantees about what order documents will come back in if you do not ask for a sort. In general in Solr/lucene, the only way to update a document is to re-add it as a new document, so that's probably what's going on behind the scenes, and it probably effects the 'default' sort order -- which Solr makes no agreement about anyway, you probably shouldn't even count on it being consistent at all. If you want a consistent sort order, maybe add a field with a timestamp, and ask for results sorted by the timestamp field? And then make sure not to change the timestamp when you do an update that you don't want to change the order? Apologies if I've misunderstood the situation. On 12/3/13 1:00 PM, Andreas Owen wrote: When I search for agenda I get a lot of hits. Now if I update the 2. Result by json-update the doc is moved to the end of the index when I search for it again. 
The field I change is editorschoice and it never contains the search term agenda so I don't see why it changes the order. Why does it? Part of Solrconfig requesthandler I use: requestHandler name=/select2 class=solr.SearchHandler lst name=defaults str name=echoParamsexplicit/str int name=rows10/int str name=defTypesynonym_edismax/str str name
RE: json update moves doc to end
I changed my boost function log(clickrate)^8 to div(clicks,displays)^8 and it works now. I get the following output from debug:

0.0022668892 = (MATCH) FunctionQuery(div(const(2),const(5))), product of: 0.4 = div(const(2),const(5)) 8.0 = boost 7.0840283E-4 = queryNorm

Am I understanding this right, that 0.4 and 8.0 result in 7.084? I'm having trouble understanding how much I boosted it. As I use NGramFilterFactory I get a lot of hits because of the tokens. Can I make the boost higher if the whole search term is found and not just part of it?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 4 December 2013 15:07
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

Well, both have a score of -Infinity. So they're equal and the tiebreaker is the internal Lucene doc ID. Now this is not helpful since the question now is where -Infinity comes from; this looks suspicious:

-Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of: -Infinity = log(int(clicks)=0)

Not much help I know, but
Erick

On Wed, Dec 4, 2013 at 7:24 AM, Andreas Owen a...@conx.ch wrote:

Hi Erick

Here are the last two results from a search, and I am not understanding why the last one with the boost editorschoice^200 isn't at the top. By the way, can I also give a substantial boost to results that contain the whole search request and not just 3 or 4 letters (tokens)?
str name=dms:1003 -Infinity = (MATCH) sum of: 0.013719446 = (MATCH) max of: 0.013719446 = (MATCH) sum of: 2.090396E-4 = (MATCH) weight(plain_text:ber in 841) [DefaultSimilarity], result of: 2.090396E-4 = score(doc=841,freq=8.0 = termFreq=8.0 ), product of: 0.009452709 = queryWeight, product of: 1.3343692 = idf(docFreq=611, maxDocs=855) 0.0070840283 = queryNorm 0.022114253 = fieldWeight in 841, product of: 2.828427 = tf(freq=8.0), with freq of: 8.0 = termFreq=8.0 1.3343692 = idf(docFreq=611, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 0.0012402858 = (MATCH) weight(plain_text:eri in 841) [DefaultSimilarity], result of: 0.0012402858 = score(doc=841,freq=9.0 = termFreq=9.0 ), product of: 0.022357063 = queryWeight, product of: 3.1559815 = idf(docFreq=98, maxDocs=855) 0.0070840283 = queryNorm 0.05547624 = fieldWeight in 841, product of: 3.0 = tf(freq=9.0), with freq of: 9.0 = termFreq=9.0 3.1559815 = idf(docFreq=98, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 5.0511415E-4 = (MATCH) weight(plain_text:ric in 841) [DefaultSimilarity], result of: 5.0511415E-4 = score(doc=841,freq=1.0 = termFreq=1.0 ), product of: 0.024712078 = queryWeight, product of: 3.4884217 = idf(docFreq=70, maxDocs=855) 0.0070840283 = queryNorm 0.020439971 = fieldWeight in 841, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 3.4884217 = idf(docFreq=70, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 8.721528E-4 = (MATCH) weight(plain_text:ich in 841) [DefaultSimilarity], result of: 8.721528E-4 = score(doc=841,freq=12.0 = termFreq=12.0 ), product of: 0.017446788 = queryWeight, product of: 2.4628344 = idf(docFreq=197, maxDocs=855) 0.0070840283 = queryNorm 0.049989305 = fieldWeight in 841, product of: 3.4641016 = tf(freq=12.0), with freq of: 12.0 = termFreq=12.0 2.4628344 = idf(docFreq=197, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 7.725705E-4 = (MATCH) weight(plain_text:cht in 841) [DefaultSimilarity], result of: 7.725705E-4 = score(doc=841,freq=4.0 = termFreq=4.0 ), product of: 
0.021610687 = queryWeight, product of: 3.050621 = idf(docFreq=109, maxDocs=855) 0.0070840283 = queryNorm 0.035749465 = fieldWeight in 841, product of: 2.0 = tf(freq=4.0), with freq of: 4.0 = termFreq=4.0 3.050621 = idf(docFreq=109, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 0.0010287998 = (MATCH) weight(plain_text:beri in 841) [DefaultSimilarity], result of: 0.0010287998 = score(doc=841,freq=1.0 = termFreq=1.0 ), product of: 0.035267927 = queryWeight, product of: 4.978513 = idf(docFreq=15, maxDocs=855) 0.0070840283 = queryNorm 0.029170973 = fieldWeight in 841, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.978513 = idf(docFreq=15, maxDocs=855) 0.005859375 = fieldNorm(doc=841) 0.0010556461 = (MATCH) weight(plain_text:eric in 841) [DefaultSimilarity
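The FunctionQuery line in the debug output above can be checked by hand: the score is the product of the function value, the boost, and the queryNorm (not "0.4 and 8.0 result in 7.084" -- the 7.0840283E-4 is the queryNorm factor). A quick sanity check:

```python
# Reproduce the explain line:
#   0.0022668892 = FunctionQuery(div(const(2),const(5))), product of:
#     0.4 = div(const(2),const(5)), 8.0 = boost, 7.0840283E-4 = queryNorm
value = 2 / 5              # div(clicks, displays) with clicks=2, displays=5
boost = 8.0                # the ^8 on the bf
query_norm = 7.0840283e-4  # taken verbatim from the debug output
score = value * boost * query_norm
print(score)  # agrees with 0.0022668892 up to float32 rounding
```

So the ^8 boost multiplies the raw function value; it does not produce the 7.08e-4 term.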
json update moves doc to end
When I search for agenda I get a lot of hits. Now if I update the 2nd result via JSON update, the doc is moved to the end of the index when I search for it again. The field I change is editorschoice and it never contains the search term agenda, so I don't see why it changes the order. Why does it?

Part of the solrconfig requestHandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">log(clicks)^8</str> <!-- tested -->
    <!-- todo: number of links (count urlparse in links query) / frequency of search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
RE: json update moves doc to end
So isn't it sorted automatically by relevance (boost value)? If not, should I set it in solrconfig?

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Tuesday, 3 December 2013 19:07
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

What order -- the order if you supply no explicit sort at all? Solr does not make any guarantees about what order documents will come back in if you do not ask for a sort. In general in Solr/Lucene, the only way to update a document is to re-add it as a new document, so that's probably what's going on behind the scenes, and it probably affects the 'default' sort order -- which Solr makes no agreement about anyway; you probably shouldn't count on it being consistent at all. If you want a consistent sort order, maybe add a field with a timestamp, and ask for results sorted by the timestamp field? And then make sure not to change the timestamp when you do an update that you don't want to change the order? Apologies if I've misunderstood the situation.

On 12/3/13 1:00 PM, Andreas Owen wrote:

When I search for agenda I get a lot of hits. Now if I update the 2nd result via JSON update, the doc is moved to the end of the index when I search for it again. The field I change is editorschoice and it never contains the search term agenda, so I don't see why it changes the order. Why does it?
Part of the solrconfig requestHandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">log(clicks)^8</str> <!-- tested -->
    <!-- todo: number of links (count urlparse in links query) / frequency of search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
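Following the advice above, a deterministic order can be pinned with an explicit sort on a timestamp field. A hedged sketch (last_modified is taken from the handler's qf and is assumed to be a date field; the exact sort spec is a suggestion, not the poster's config):

```xml
<!-- inside <lst name="defaults"> of the /select2 handler: -->
<!-- ties in score fall back to the timestamp instead of the internal docid -->
<str name="sort">score desc, last_modified desc</str>
```

With this in place, re-adding a document via an update no longer moves it among equally scored results, because the tiebreaker is a stable field rather than the Lucene-internal document ID.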
search with wildcard
I am querying test in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like Supertestplan, it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around it? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
RE: search with wildcard
I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because the word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 21 November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying test in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like Supertestplan, it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around it? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
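One common answer to the question above is not a different tokenizer but an extra, lightly boosted field analyzed with an NGramFilterFactory at index time, so that substrings like "test" inside "Supertestplan" match without wildcards. A sketch under that assumption (field and type names are hypothetical, not the poster's schema):

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every 3..50-character substring of each token -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <!-- query side stays un-grammed so the query term matches one gram -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="plain_text_ngram" type="text_ngram" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_ngram"/>
```

Adding e.g. plain_text_ngram^1 to qf then gives substring matches a minimal weight next to the ^10..^200 boosts of the existing fields, keeping the ranking largely intact.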
RE: date range tree
I solved it by adding a loop for years and one for quarters, in which I count the month facets.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 11 November 2013 17:52
To: solr-user@lucene.apache.org
Subject: RE: date range tree

Has someone at least got an idea how I could do a year/month date tree? The Solr wiki mentions that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY should create 4 buckets, but it doesn't work.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 7 November 2013 18:23
To: solr-user@lucene.apache.org
Subject: date range tree

I would like to make a facet on a date field with the following tree:

2013
  4. Quartal: December, November, October
  3. Quartal: September, August, July
  2. Quartal: June, May, April
  1. Quartal: March, February, January
2012
  (same as above)

So far I have this in solrconfig.xml:

<str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str>
<str name="facet.date.gap">+1MONTH</str>
<str name="facet.date.end">NOW/MONTH</str>
<str name="facet.date.start">NOW/MONTH-36MONTHS</str>
<str name="facet.date.other">after</str>

Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
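The year/quarter loops the author describes can be sketched like this (Python; the month-bucket input shape -- ISO date string to count -- is an assumption about how the client reads Solr's facet_dates response):

```python
from collections import defaultdict

def rollup(month_counts):
    """Aggregate month-bucket facet counts (ISO date string -> count)
    into per-year and per-quarter totals for a year/quarter/month tree."""
    years = defaultdict(int)
    quarters = defaultdict(int)
    for date, count in month_counts.items():
        year, month = date[:4], int(date[5:7])
        quarter = (month - 1) // 3 + 1        # calendar quarter 1..4
        years[year] += count
        quarters[(year, quarter)] += count
    return dict(years), dict(quarters)

counts = {"2013-01-01T00:00:00Z": 3, "2013-02-01T00:00:00Z": 1,
          "2013-04-01T00:00:00Z": 5, "2012-12-01T00:00:00Z": 2}
years, quarters = rollup(counts)
```

Since the quarter and year totals are pure sums of the month buckets, a single facet.date query with a +1MONTH gap is enough; no second query is needed.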
RE: date range tree
Has someone at least got an idea how I could do a year/month date tree? The Solr wiki mentions that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY should create 4 buckets, but it doesn't work.

-----Original Message-----
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 7 November 2013 18:23
To: solr-user@lucene.apache.org
Subject: date range tree

I would like to make a facet on a date field with the following tree:

2013
  4. Quartal: December, November, October
  3. Quartal: September, August, July
  2. Quartal: June, May, April
  1. Quartal: March, February, January
2012
  (same as above)

So far I have this in solrconfig.xml:

<str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str>
<str name="facet.date.gap">+1MONTH</str>
<str name="facet.date.end">NOW/MONTH</str>
<str name="facet.date.start">NOW/MONTH-36MONTHS</str>
<str name="facet.date.other">after</str>

Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
count links pointing to id
I have a multivalued field with links pointing to the ids of Solr documents. I would like to calculate how many links point to each document and put that number into the field links2me. How can I do this? I would prefer to do it with a query and the updater, so Solr can do it internally if possible.
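Solr (of this era) has no built-in way to compute such an inbound-link count server-side, so it is usually done client-side: read every document's multivalued links field, count references per target id, and send atomic updates. A hypothetical sketch (field names from the post; the payload uses Solr's atomic-update "set" syntax, and the fetch/submit steps are left out):

```python
from collections import Counter

def links2me_updates(docs):
    """docs: list of {'id': ..., 'links': [target ids]} read from Solr.
    Returns atomic-update payloads setting links2me on every doc."""
    # count how often each id appears as a link target anywhere
    inbound = Counter(t for d in docs for t in d.get("links", []))
    return [{"id": d["id"], "links2me": {"set": inbound.get(d["id"], 0)}}
            for d in docs]

docs = [{"id": "a", "links": ["b", "c"]},
        {"id": "b", "links": ["c"]},
        {"id": "c", "links": []}]
updates = links2me_updates(docs)
```

The resulting list would be POSTed to the update handler; note that atomic updates require the other fields to be stored so Solr can re-index the full document.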
date range tree
I would like to make a facet on a date field with the following tree:

2013
  4. Quartal: December, November, October
  3. Quartal: September, August, July
  2. Quartal: June, May, April
  1. Quartal: March, February, January
2012
  (same as above)

So far I have this in solrconfig.xml:

<str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str>
<str name="facet.date.gap">+1MONTH</str>
<str name="facet.date.end">NOW/MONTH</str>
<str name="facet.date.start">NOW/MONTH-36MONTHS</str>
<str name="facet.date.other">after</str>

Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
I'm already using URLDataSource.

On 30 Sep 2013, at 5:41 PM, P Williams wrote:

Hi Andreas,

When using XPathEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor) your DataSource must be of type DataSource<Reader>. You shouldn't be using BinURLDataSource; it's giving you the cast exception. Use URLDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/URLDataSource.html) or FileDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/FileDataSource.html) instead. I don't think you need to specify namespaces, at least you didn't used to. The other thing that I've noticed is that the anywhere xpath expression // doesn't always work in DIH. You might have to be more specific.

Cheers,
Tricia

On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen a...@conx.ch wrote:

How dumb can you get? Obviously quite dumb... I would have to analyze the HTML pages with a nested instance like this:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" forEach="/docs/doc" dataSource="main">
  <entity name="htm" processor="XPathEntityProcessor" url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
    <field column="text" xpath="//content"/>
    <field column="h_2" xpath="//body"/>
    <field column="text_nohtml" xpath="//text"/>
    <field column="h_1" xpath="//h:h1"/>
  </entity>
</entity>

But I'm pretty sure the forEach is wrong, as well as the xpath expressions. At the moment I'm getting the following error:

Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader

On 28 Sep 2013, at 1:39 AM, Andreas Owen wrote:

OK, I see what you're getting at, but why doesn't the following work:

<field xpath="//h:h1" column="h_1"/>
<field column="text" xpath="/xhtml:html/xhtml:body"/>

I removed the Tika processor. What am I missing? I haven't found anything in the wiki.

On 28 Sep 2013, at 12:28 AM, P Williams wrote:

I spent some more time thinking about this. Do you really need to use the TikaEntityProcessor? It doesn't offer anything new to the document you are building that couldn't be accomplished by the XPathEntityProcessor alone, from what I can tell. I also tried to get the Advanced Parsing example (http://wiki.apache.org/solr/TikaEntityProcessor) to work, without success. There are some obvious typos (<document> instead of </document>) and an odd order to the pieces (<dataSources> is enclosed by <document>). It also looks like FieldStreamDataSource (http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html) is the one that is meant to work in this context. If Koji is still around maybe he could offer some help? Otherwise this bit of erroneous instruction should probably be removed from the wiki.
Cheers, Tricia $ svn diff Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java === --- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990) +++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy) @@ -99,13 +99,13 @@ runFullImport(getConfigHTML(identity)); assertQ(req(*:*), testsHTMLIdentity); } - + private String getConfigHTML(String htmlMapper) { return dataConfig + dataSource type='BinFileDataSource'/ + document + -entity name='Tika' format='xml' processor='TikaEntityProcessor' + +entity name='Tika' format='html' processor='TikaEntityProcessor' + url=' + getFile(dihextras/structured.html).getAbsolutePath() + ' + ((htmlMapper == null) ? : ( htmlMapper=' + htmlMapper + ')) + + field column='text'/ + @@ -114,4 +114,36 @@ /dataConfig; } + private String[] testsHTMLH1 = { + //*[@numFound='1'] + , //str[@name='h1'][contains(.,'H1 Header')] + }; + + @Test + public void testTikaHTMLMapperSubEntity() throws Exception { +runFullImport(getConfigSubEntity(identity)); +assertQ(req(*:*), testsHTMLH1); + } + + private String getConfigSubEntity(String htmlMapper) { +return +dataConfig + +dataSource type='BinFileDataSource' name='bin'/ + +dataSource type='FieldStreamDataSource' name='fld'/ + +document + +entity name
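Putting Tricia's advice together, a hedged sketch of how the nested DIH config might look with URLDataSource instead of BinURLDataSource (URLs, entity names, and columns are the poster's; whether the simplified forEach and xpath expressions actually match the fetched HTML is untested):

```xml
<dataConfig>
  <!-- both sources deliver a Reader, which XPathEntityProcessor requires -->
  <dataSource type="URLDataSource" name="main"/>
  <dataSource type="URLDataSource" name="dataUrl"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" dataSource="main"
            url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
            forEach="/docs/doc">
      <entity name="htm" processor="XPathEntityProcessor" dataSource="dataUrl"
              url="${rec.urlParse}" forEach="/html">
        <!-- per the advice above, prefer specific paths over // -->
        <field column="h_1" xpath="/html/body/h1"/>
        <field column="text" xpath="/html/body"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The BinURLDataSource-to-Reader ClassCastException in the stack trace disappears once the inner entity's dataSource is Reader-based; the remaining work is getting the xpath expressions right for the actual page structure.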
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
how dum can you get. obviously quite dum... i would have to analyze the html-pages with a nested instance like this: entity name=rec processor=XPathEntityProcessor url=file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml forEach=/docs/doc dataSource=main entity name=htm processor=XPathEntityProcessor url=${rec.urlParse} forEach=/xhtml:html dataSource=dataUrl field column=text xpath=//content / field column=h_2 xpath=//body / field column=text_nohtml xpath=//text / field column=h_1 xpath=//h:h1 / /entity /entity but i'm pretty sure the foreach is wrong and the xpath expressions. in the moment i getting the following error: Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote: ok i see what your getting at but why doesn't the following work: field xpath=//h:h1 column=h_1 / field column=text xpath=/xhtml:html/xhtml:body / i removed the tiki-processor. what am i missing, i haven't found anything in the wiki? On 28. Sep 2013, at 12:28 AM, P Williams wrote: I spent some more time thinking about this. Do you really need to use the TikaEntityProcessor? It doesn't offer anything new to the document you are building that couldn't be accomplished by the XPathEntityProcessor alone from what I can tell. I also tried to get the Advanced Parsinghttp://wiki.apache.org/solr/TikaEntityProcessorexample to work without success. There are some obvious typos (document instead of /document) and an odd order to the pieces (dataSources is enclosed by document). It also looks like FieldStreamDataSourcehttp://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.htmlis the one that is meant to work in this context. If Koji is still around maybe he could offer some help? 
Otherwise this bit of erroneous instruction should probably be removed from the wiki.

Cheers,
Tricia

$ svn diff
Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
===
--- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990)
+++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy)
@@ -99,13 +99,13 @@
     runFullImport(getConfigHTML("identity"));
     assertQ(req("*:*"), testsHTMLIdentity);
   }
-
+
   private String getConfigHTML(String htmlMapper) {
     return dataConfig +
       "<dataSource type='BinFileDataSource'/>" +
       "<document>" +
-      "<entity name='Tika' format='xml' processor='TikaEntityProcessor' " +
+      "<entity name='Tika' format='html' processor='TikaEntityProcessor' " +
       "url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' " +
       ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper + "'")) + ">" +
       "<field column='text'/>" +
@@ -114,4 +114,36 @@
     "</dataConfig>";
   }
+
+  private String[] testsHTMLH1 = {
+      "//*[@numFound='1']",
+      "//str[@name='h1'][contains(.,'H1 Header')]"
+  };
+
+  @Test
+  public void testTikaHTMLMapperSubEntity() throws Exception {
+    runFullImport(getConfigSubEntity("identity"));
+    assertQ(req("*:*"), testsHTMLH1);
+  }
+
+  private String getConfigSubEntity(String htmlMapper) {
+    return dataConfig +
+      "<dataSource type='BinFileDataSource' name='bin'/>" +
+      "<dataSource type='FieldStreamDataSource' name='fld'/>" +
+      "<document>" +
+      "<entity name='tika' processor='TikaEntityProcessor' url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' dataSource='bin' format='html' rootEntity='false'>" +
+      "<!-- Do appropriate mapping here; meta='true' means it is a metadata field -->" +
+      "<field column='Author' meta='true' name='author'/>" +
+      "<field column='title' meta='true' name='title'/>" +
+      "<!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately -->" +
+      "<field name='text' column='text'/>" +
+      "<entity name='detail' type='XPathEntityProcessor' forEach='/html' dataSource='fld' dataField='tika.text' rootEntity='true'>" +
+      "<field xpath='//div' column='foo'/>" +
+      "<field xpath='//h1' column='h1'/>" +
+      "</entity>" +
+      "</entity>" +
+      "</document>" +
+      "</dataConfig>";
+  }
+}
Index: solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport
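Outside DIH, the mapping the nested entity is after (pulling each <h1> of a fetched HTML page into a field) can be sketched with Python's stdlib. This is a hypothetical stand-in for illustration, not DIH code:

```python
# Sketch: collect the text of every <h1> in an HTML page (stdlib only).
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []  # one entry per <h1> element

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings[-1] += data

p = H1Extractor()
p.feed("<html><body><h1>H1 Header</h1><p>some text</p></body></html>")
print(p.headings)  # ['H1 Header']
```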
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
thanks, but the first suggestion is already implemented and the second didn't work. i have also tried htmlMapper=identity but nothing worked. i also tried this, but the html was stripped in both fields:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="HTMLStripTransformer">
  <field column="text" name="text" stripHTML="false" />
  <field column="text" name="text_nohtml" stripHTML="true" />
</entity>

but in the end i think it's best to cut tika out because i'm not getting any benefits from it. i would just need to get this to work:

<field xpath="//h:h1" column="h_1" />
<field column="text" xpath="/xhtml:html/xhtml:body" />

the fields are empty and i'm not getting any errors in the logs.

On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote:

This is a rather complicated example to chew through, but try the following two things:

*) dataField="${tika.text}" => dataField="text" (or, less likely, dataField="tika.text"). You might be trying to read the content of the field rather than passing a reference to the field, which seems to be what is expected. This might explain the exception.

*) It may help to be aware of https://issues.apache.org/jira/browse/SOLR-4530 . There is a new htmlMapper="identity" flag on Tika entries to ensure more of the HTML structure passes through. By default, Tika strips out most of the HTML tags.

Regards, Alex.

On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen a...@conx.ch wrote:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
  <field column="text"/>
  <entity name="detail" type="XPathEntityProcessor" forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true" onError="skip">
    <field xpath="//h1" column="h_1" />
  </entity>
</entity>

Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
i removed the FieldReaderDataSource and dataSource=fld but it didn't help. i get the following for each document:

DataImportHandlerException: Exception in invoking url null Processing Document # 9
NullPointerException

On 26. Sep 2013, at 8:39 PM, P Williams wrote:

Hi,

I haven't tried this myself, but maybe try leaving out the FieldReaderDataSource entirely. From my quick searching, it looks like it's tied to SQL. Did you try copying the Advanced Parsing example at http://wiki.apache.org/solr/TikaEntityProcessor exactly? What happens when you leave out the FieldReaderDataSource?

Cheers, Tricia

On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages, but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor "detail" creates the error; the rest works fine. what am i doing wrong?
error:

ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
	at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:365)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
(TestRuleAssertionsRequired.java:43)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
	at java.lang.Thread.run(Thread.java:722)

On Fri, Sep 27, 2013 at 3:55 AM, Andreas Owen a...@conx.ch wrote:

i removed the FieldReaderDataSource and dataSource=fld but it didn't help. i get the following for each document: DataImportHandlerException: Exception in invoking url null Processing Document # 9 NullPointerException

On 26. Sep 2013, at 8:39 PM, P Williams wrote:

Hi,

I haven't tried this myself, but maybe try leaving out the FieldReaderDataSource entirely. From my quick searching, it looks like it's tied to SQL. Did you try copying the Advanced Parsing example at http://wiki.apache.org/solr/TikaEntityProcessor exactly? What happens when you leave out the FieldReaderDataSource?

Cheers, Tricia

On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages, but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor "detail" creates the error; the rest works fine. what am i doing wrong?
error: ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null' java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator [...]
XPathEntityProcessor nested in TikaEntityProcessor query null exception
i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages, but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor "detail" creates the error; the rest works fine. what am i doing wrong?

error:

ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
	at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:365)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Unknown Source)
ERROR - 2013-09-26 12:08:49.022; org.apache.solr.common.SolrException; Exception in entity : detail:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at
dih HTMLStripTransformer
why does stripHTML=false have no effect in dih? the html is stripped in text and in text_nohtml when i display the index with select?q=*. i'm trying to get one field without html and one with it, so i can also index the links on the page.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" forEach="/docs/doc" dataSource="main"> <!-- transformer="script:GenerateId" -->
  <field column="title" xpath="//title" />
  <field column="id" xpath="//id" />
  <field column="file" xpath="//file" />
  <field column="url" xpath="//url" />
  <field column="urlParse" xpath="//urlParse" />
  <field column="last_modified" xpath="//last_modified" />
  <field column="Author" xpath="//author" />
  <entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="HTMLStripTransformer">
    <field column="text" name="text" stripHTML="false" />
    <field column="text" name="text_nohtml" stripHTML="true" />
    <!--
    transformer="RegexTransformer"
    <field column="text_html_b" regex="(?s)^.*&lt;div.*id=.*&gt;(.*)&lt;/div&gt;.*$" replaceWith="$1" sourceColName="text" />
    <field column="text_html_b" regex="(?s)^.*&lt;!-body-&gt;(.*)&lt;!-/body-&gt;.*$" replaceWith="$1" sourceColName="text" />
    -->
  </entity>
</entity>
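For comparison, the effect stripHTML="true" should have can be approximated outside Solr. The strip_html helper below is a simplified, hypothetical regex stand-in, not the actual HTMLStripTransformer (which also handles comments, scripts, and entities):

```python
# Simplified stand-in for HTMLStripTransformer: one field keeps the markup
# (so links survive), the other has the tags removed.
import re

def strip_html(html):
    # naive tag removal; the real HTMLStrip handles far more cases
    return re.sub(r"<[^>]+>", "", html)

raw = '<body><a href="/page">link</a> some text</body>'
text = raw                     # stripHTML="false": markup preserved
text_nohtml = strip_html(raw)  # stripHTML="true": tags removed
print(text_nohtml)  # link some text
```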
Re: dih delete doc per $deleteDocById
sorry, it works like this; i had a typo in my conf :-(

On 17. Sep 2013, at 2:44 PM, Andreas Owen wrote:

i would like to know how to get it to work and delete documents per xml and dih.

On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote:

What is your question?

On Tue, Sep 17, 2013 at 12:17 AM, andreas owen a.o...@gmx.net wrote:

i am using dih and want to delete indexed documents by an xml-file with ids. i have seen $deleteDocById used in <entity query="...">.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" forEach="/docs/doc" dataSource="main">
  <field column="$deleteDocById" xpath="//id" />
</entity>

xml-file:

<docs>
  <doc>
    <id>2345</id>
  </doc>
</docs>

-- Regards, Shalin Shekhar Mangar.
Re: dih delete doc per $deleteDocById
i would like to know how to get it to work and delete documents per xml and dih.

On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote:

What is your question?

On Tue, Sep 17, 2013 at 12:17 AM, andreas owen a.o...@gmx.net wrote:

i am using dih and want to delete indexed documents by an xml-file with ids. i have seen $deleteDocById used in <entity query="...">.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" forEach="/docs/doc" dataSource="main">
  <field column="$deleteDocById" xpath="//id" />
</entity>

xml-file:

<docs>
  <doc>
    <id>2345</id>
  </doc>
</docs>

-- Regards, Shalin Shekhar Mangar.
dih delete doc per $deleteDocById
i am using dih and want to delete indexed documents by an xml-file with ids. i have seen $deleteDocById used in <entity query="...">.

data-config.xml:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" forEach="/docs/doc" dataSource="main">
  <field column="$deleteDocById" xpath="//id" />
</entity>

xml-file:

<docs>
  <doc>
    <id>2345</id>
  </doc>
</docs>
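The delete file the entity reads can also be generated programmatically; a minimal sketch with a hypothetical helper, stdlib only:

```python
# Build the <docs><doc><id>...</id></doc></docs> delete file for the DIH
# entity above from a list of document ids.
import xml.etree.ElementTree as ET

def build_delete_file(ids):
    docs = ET.Element("docs")
    for doc_id in ids:
        doc = ET.SubElement(docs, "doc")
        ET.SubElement(doc, "id").text = str(doc_id)
    return ET.tostring(docs, encoding="unicode")

print(build_delete_file([2345]))  # <docs><doc><id>2345</id></doc></docs>
```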
Re: charset encoding
no jetty, and yes for tomcat i've seen a couple of answers

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: charset encoding
could it have something to do with the meta encoding tag being iso-8859-1 while the http header says utf-8, so firefox interprets the page as utf-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

no jetty, and yes for tomcat i've seen a couple of answers

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: charset encoding
it was the http-header; as soon as i forced an iso-8859-1 header it worked

On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote:

could it have something to do with the meta encoding tag being iso-8859-1 while the http header says utf-8, so firefox interprets the page as utf-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

no jetty, and yes for tomcat i've seen a couple of answers

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
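The mismatch resolved here (bytes in one encoding, HTTP header declaring another) can be reproduced in Python; this only illustrates the byte-level effect in both directions, and which direction applies depends on the bytes actually served:

```python
# ISO-8859-1 vs UTF-8 mismatch, both directions, for a char like 'ä'.
text = "ä"

# UTF-8 bytes misread as ISO-8859-1 -> the classic two-character mojibake
mojibake = text.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # Ã¤

# ISO-8859-1 bytes misread as UTF-8 -> not even decodable
try:
    text.encode("iso-8859-1").decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True  # 0xE4 alone is not a valid UTF-8 sequence
```

Forcing the header to match the real encoding of the bytes, as described above, removes the mismatch.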
Re: charfilter doesn't do anything
perfect, i tried it before but always at the tail of the expression, with no effect. thanks a lot. one last question: do you know how to keep the html comments from being filtered out before the transformer has done its work?

On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote:

Okay, I can repro the problem. Yes, it appears that the pattern replace char filter does not default to multiline mode for pattern matching, so <body> on one line and </body> on another line cannot be matched. Now, whether that is by design or a bug or an option for enhancement is a matter for some committer to comment on. But the good news is that you can in fact set multiline mode in your pattern by starting it with (?s), which means that dot accepts line break characters as well. So, here are my revised field types:

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_html_body_strip" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The first type accepts everything within <body>, including nested HTML formatting, while the latter strips nested HTML formatting as well. The tokenizer will in fact strip out whitespace, but that happens after all character filters have completed.
-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Tuesday, September 10, 2013 7:07 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

ok, i am getting there now, but if there are newlines involved the regex stops as soon as it reaches a \r\n, even if i try [\t\r\n.]* in the regex. i have to get rid of the newlines. why isn't WhitespaceTokenizerFactory the right element for this?

On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

Use XML then, although you will need to escape the XML special characters as I did in the pattern. The point is simply: quickly and simply try to find the simple test scenario that illustrates the problem.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i tried, but that isn't working either; it wants a data-stream. i'll have to check how to post json instead of xml

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines, not just a string with the body-tag. it doesn't work with proper html files, even though i took all the new lines out.

html-file:

<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:

text_html: ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only the text between <body> and </body>. Is that what you wanted? Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json
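The (?s) behaviour discussed in this thread matches Python's re.DOTALL, so the fix can be checked outside Solr; Python is used here only to illustrate the regex semantics, not the char filter itself:

```python
# Without DOTALL, "." stops at newlines, so a <body> spanning several lines
# never matches; prefixing the pattern with (?s) makes "." match newlines too.
import re

page = "<html>\nnav-content\n<body>nur das will ich sehen</body>\nfooter-content\n</html>"
pattern = r"^.*<body>(.*)</body>.*$"

no_match = re.match(pattern, page)    # None: "." will not cross \n
m = re.match("(?s)" + pattern, page)  # (?s) is equivalent to re.DOTALL
print(m.group(1))  # nur das will ich sehen
```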
charset encoding
i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded, and the meta tag content-encoding says so as well. the server http-header says it's utf-8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with the 1/4 or the flag in them. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: charfilter doesn't do anything
ok, i am getting there now, but if there are newlines involved the regex stops as soon as it reaches a \r\n, even if i try [\t\r\n.]* in the regex. i have to get rid of the newlines. why isn't WhitespaceTokenizerFactory the right element for this?

On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

Use XML then, although you will need to escape the XML special characters as I did in the pattern. The point is simply: quickly and simply try to find the simple test scenario that illustrates the problem.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i tried, but that isn't working either; it wants a data-stream. i'll have to check how to post json instead of xml

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines, not just a string with the body-tag. it doesn't work with proper html files, even though i took all the new lines out.
html-file: htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html solr update debug output: text_html: [html\r\n\r\nmeta name=\Content-Encoding\ content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das will ich sehenfooter-content/body/html] On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4: field name=body type=text_html_body indexed=true stored=true / fieldType name=text_html_body class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType That char filter retains only text between body and /body. Is that what you wanted? Indexing this data: curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d ' [{id:doc-1,body:abc bodyA test./body def}]' And querying with these commands: curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json; Shows all data curl http://localhost:8983/solr/select/?q=body:testindent=truewt=json; shows the body text curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json; shows nothing (outside of body) curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json; shows nothing (outside of body) curl http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json; Shows nothing, HTML tag stripped In your original query, you didn't show us what your default field, df parameter, was. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Sunday, September 08, 2013 5:21 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything yes but that filter html and not the specific tag i want. On 7. 
Sep 2013, at 7:51 PM, Erick Erickson wrote: Hmmm, have you looked at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Not quite the body, perhaps, but might it help? On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote: ok i have html pages with html.!--body--content i want!--/body--./html. i want to extract (index, store) only that between the body-comments. i thought regexTransformer would be the best because xpath doesn't work in tika and i cant nest a xpathEntetyProcessor to use xpath. what i have also found out is that the htmlparser from tika cuts my body-comments out and tries to make well formed html, which i would like to switch off. On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and string is on the same line in my tika entity. but when the string is multilined it isn't working even though i tried ?s to set the flag dotall. entity
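The "regex stops at \r\n" behavior above is the default dot-doesn't-match-newline rule, and putting "." inside a character class doesn't change it (there the dot is a literal). A minimal sketch of the difference, written in Python only because its re engine accepts the same inline-flag syntax as the Java regex engine Solr uses; the sample string is made up to resemble the debug output above:

```python
import re

# A page with CRLF line breaks, shaped like the debug output in this thread.
html = ("<html>\r\n<title></title>\r\n"
        "<body>nav-content\r\nnur das will ich sehen\r\n</body>"
        "footer-content</html>")

# By default "." does not match newline characters, so the pattern can never
# bridge the \r\n to reach </body> and the match fails.
assert re.search(r"<body>(.*)</body>", html) is None

# The inline (?s) flag (DOTALL) makes "." match newlines as well.
m = re.search(r"(?s)<body>(.*)</body>", html)
assert "nur das will ich sehen" in m.group(1)
```

With (?s) there is no need to strip the newlines out of the input first.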
Re: charfilter doesn't do anything
I index html pages with a lot of lines, not just a string with the body tag. It doesn't work with proper html files, even though I took all the new lines out.

html-file: <html>nav-content<body> nur das will ich sehen </body>footer-content</html>

solr update debug output: text_html: [<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehen footer-content</body></html>]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true"/>

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only the text between <body> and </body>. Is that what you wanted? Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json" -- shows all data
curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json" -- shows the body text
curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json" -- shows nothing (outside of body)
curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json" -- shows nothing (outside of body)
curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json" -- shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, the df parameter, was. -- Jack Krupansky
Re: charfilter doesn't do anything
I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool. -- Jack Krupansky
Re: charfilter doesn't do anything
I've downloaded curl and tried it in the command prompt and PowerShell on my Win 2008 R2 server. That's why I used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml.

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: Did you in fact try my suggested example? If not, please do so. -- Jack Krupansky
Re: charfilter doesn't do anything
Yes, but that filters html in general, not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote: Hmmm, have you looked at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Not quite the body, perhaps, but might it help?
Re: charfilter doesn't do anything
The input string is a normal html page with the word Zahlungsverkehr in it, and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote: And show us an input string and a query that fail. -- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Thursday, September 05, 2013 2:41 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote: i would like to filter / replace a word during indexing but it doesn't do anything and i don't get an error.

I don't know about your second question; I don't know if that will be possible, but I'll leave that to someone who's more expert than I. As for the first question, here's what I have. Did you reindex? That will be required. http://wiki.apache.org/solr/HowToReindex Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just Zahlungsverkehr? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow. Note that both the pattern and the replacement are case sensitive; this is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk. Use the analysis tab in the UI on your core to see what Solr does to your field text. Thanks, Shawn
Re: charfilter doesn't do anything
I've managed to get it working if I use the RegexTransformer and the string is on the same line in my tika entity. But when the string is multiline it isn't working, even though I tried ?s to set the dotall flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text"/>
</entity>

Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text"/>

In javascript this works, but maybe only because I used a small string.

On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote: Is there any chance that you changed your schema since you indexed the data? If so, re-index the data. If a * query finds nothing, that implies that the default field is empty. Are you sure the df parameter is set to the field containing your data? Show us your request handler definition and a sample of your actual Solr input (Solr XML or JSON?) so that we can see which fields are being populated. -- Jack Krupansky
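The ((.|\n|\r)+) workaround and an inline (?s) flag match the same text, which is worth seeing side by side; in Java's regex engine (the one the DIH RegexTransformer uses) the repeated alternation recurses once per matched character and can throw StackOverflowError on long documents, which is the likely cause of the overflow above. A sketch, written in Python only because its re engine accepts the same pattern syntax; the sample string is made up:

```python
import re

text = "<body>line one\r\nline two</body>"

# The workaround pattern from the entity above: an alternation repeated
# once per character so that newlines can be consumed.
with_alternation = re.search(r"<body>((?:.|\n|\r)+)</body>", text).group(1)

# Same result with the inline DOTALL flag, with no per-character alternation.
with_dotall = re.search(r"(?s)<body>(.+)</body>", text).group(1)

assert with_alternation == with_dotall == "line one\r\nline two"
```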
Re: charfilter doesn't do anything
ok, I have html pages like <html>...<!--body-->content i want<!--/body-->...</html>. I want to extract (index, store) only what is between the body comments. I thought RegexTransformer would be best, because xpath doesn't work in tika and I can't nest an XPathEntityProcessor to use xpath. What I have also found out is that the htmlparser from tika cuts my body comments out and tries to make well-formed html, which I would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and the string is on the same line in my tika entity. but when the string is multiline it isn't working even though i tried ?s to set the dotall flag.

Sounds like we've got an XY problem here. http://people.apache.org/~hossman/#xyproblem How about you tell us *exactly* what you'd actually like to have happen, and then we can find a solution for you? It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Something that I already said: by using the KeywordTokenizer, you won't be able to search for individual words in your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor. Thanks, Shawn
charfilter doesn't do anything
I would like to filter / replace a word during indexing, but it doesn't do anything and I don't get an error. In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question is: where can I say that the expression is multiline? In javascript I can use /m at the end of the pattern.
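On the second question: the Java regex engine behind PatternReplaceCharFilterFactory has no trailing /m; flags go inline at the start of the pattern, and JavaScript's /m ((?m), MULTILINE) is a different thing from matching across lines ((?s), DOTALL). A small illustration, written in Python only because its re engine accepts the same inline-flag syntax; the sample text is made up:

```python
import re

text = "erste Zeile Zahlungsverkehr\nzweite Zeile"

# JavaScript's /m maps to the inline (?m) flag: it only changes where
# ^ and $ match (at every line boundary instead of string boundaries).
assert re.search(r"^zweite", text) is None
assert re.search(r"(?m)^zweite", text) is not None

# What "match across lines" usually needs is (?s), the DOTALL flag,
# which makes "." also match newline characters.
assert re.search(r"Zahlungsverkehr.*zweite", text) is None
assert re.search(r"(?s)Zahlungsverkehr.*zweite", text) is not None
```

So a char filter pattern meant to span line breaks would start with (?s) rather than (?m).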
Re: dataimporter tika doesn't extract certain div
So could I just nest it in an XPathEntityProcessor to filter the html, or is there something like xpath for tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}" forEach="/div[@id='content']" dataSource="main">
  <entity name="tika" processor="TikaEntityProcessor" url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
    <field column="text"/>
  </entity>
</entity>

But now I don't know how to pass the text to tika; what do I put in url and dataSource?

On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote: I don't know much about Tika, but in the example data-config.xml that you posted, the xpath attribute on the field "text" won't work, because the xpath attribute is used only by an XPathEntityProcessor. -- Regards, Shalin Shekhar Mangar.
Re: dataimporter tika doesn't extract certain div
Or could I use a filter in schema.xml, where I define a fieldtype and use some filter that understands xpath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote: No, that wouldn't work. It seems that you probably need a custom Transformer to extract the right div content. I do not know if TikaEntityProcessor supports such a thing. -- Regards, Shalin Shekhar Mangar.
dataimporter tika doesn't extract certain div
I want tika to only index the content in <div id="content">...</div> for the field text. Unfortunately it's indexing the whole page. Can't xpath do this?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
        <field column="text" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>
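Outside of DIH, the content div can also be pulled out before the document ever reaches Tika. A rough sketch of that idea using Python's stdlib html.parser; the class name and sample markup are made up for illustration:

```python
from html.parser import HTMLParser

class DivContentExtractor(HTMLParser):
    """Collect the text inside <div id="content">...</div>, nested tags included."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # div-nesting depth while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # a div nested inside the target
            elif dict(attrs).get("id") == "content":
                self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

html = ("<html><body><div id='nav'>menu</div>"
        "<div id='content'>nur das <b>will</b> ich sehen</div>"
        "<div id='footer'>x</div></body></html>")
parser = DivContentExtractor()
parser.feed(html)
text = "".join(parser.chunks).strip()
```

The extracted text could then be posted to Solr as a plain field, sidestepping both Tika's HTML rewriting and the xpath limitation.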
Re: dataimporter tika fields empty
OK, but I'm not doing any path extraction, at least I don't think so. htmlMapper="identity" isn't preserving the HTML. It is reading the content of the pages, but it's only putting it into text_test and not into text; the copyField isn't working. data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity">
        <field column="text" name="text_test"/>
        <copyField source="text_test" dest="text"/>
        <!-- <field column="text_test" xpath="//div[@id='content']"/> -->
      </entity>
    </entity>
  </document>
</dataConfig>

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

Ah. That's because the Tika processor does not support path extraction. You need to nest one more level. Regards, Alex

On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote:

I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor? data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb
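A side note on the config above: in stock Solr, copyField is a schema.xml directive, not a DIH data-config element, so a <copyField/> placed inside a Tika entity is not applied by the importer. A hedged sketch of the usual split (field and type names are taken from the thread; treat the exact attributes as assumptions, not a verified fix):

```xml
<!-- data-config.xml: map Tika's output into text_test only -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        onError="skip" htmlMapper="identity">
  <field column="text" name="text_test"/>
</entity>

<!-- schema.xml: let Solr copy text_test into text at index time -->
<copyField source="text_test" dest="text"/>
```

With the copy rule in schema.xml, anything the importer writes to text_test lands in text as well, which is what the copyField inside the entity appears to be trying to do.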
Re: dataimporter tika fields empty
I changed the following line (xpath):

<field column="text" xpath="//div[@id='content']" name="text_test"/>

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

Ah. That's because the Tika processor does not support path extraction. You need to nest one more level. Regards, Alex

On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote:

I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
dataimporter tika fields empty
I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
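The goal in the message above, keeping only what is inside <div id="content">, can be sanity-checked outside Solr. Below is a minimal, hypothetical Python sketch (not part of the thread) that mimics what the //div[@id='content'] XPath is expected to select from an HTML page; it is only an offline illustration, not how the TikaEntityProcessor works internally.

```python
# Sketch: collect the text inside <div id="content">, including nested divs,
# using only the standard library (HTML is rarely well-formed enough for an
# XML parser, hence html.parser instead of xml.etree).
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Accumulates character data while the parser is inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div-nesting depth inside the target div (0 = outside)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1            # nested div inside the target
            elif dict(attrs).get("id") == "content":
                self.depth = 1             # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_content(html: str) -> str:
    """Return the concatenated text of <div id="content">, stripped."""
    parser = ContentDivExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

For example, extract_content('<div id="x">a</div><div id="content">Hello <b>world</b></div>') keeps only the text of the content div and ignores the sibling div.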
Re: dataimporter tika fields empty
I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
Re: dataimporter tika fields empty
I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika entity; only the standard text (content) field is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without it and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
Re: dataimporter, custom fields and parsing error
I have tried post.jar and it works when I set the literal.id in solrconfig.xml. I can't pass the id with post.jar (-Dparams=literal.id=abc) because I get an error: could not find or load main class .id=abc.

On 20. Jul 2013, at 7:05 PM, Andreas Owen wrote:

path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such output.

On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:

Are the path and text fields set to stored in the schema.xml?

On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen a...@conx.ch wrote:

They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

Dumb question: they are in your schema? Spelled right, in the right section, using types that are also defined? Can you populate them by hand with a CSV file and post.jar?

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>

--
Regards, Shalin Shekhar Mangar.
Re: dataimporter, custom fields and parsing error
They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

Dumb question: they are in your schema? Spelled right, in the right section, using types that are also defined? Can you populate them by hand with a CSV file and post.jar?

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>
Re: dataimporter, custom fields and parsing error
path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such output.

On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:

Are the path and text fields set to stored in the schema.xml?

On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen a...@conx.ch wrote:

They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

Dumb question: they are in your schema? Spelled right, in the right section, using types that are also defined? Can you populate them by hand with a CSV file and post.jar?

Regards, Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>

--
Regards, Shalin Shekhar Mangar.
dataimporter, custom fields and parsing error
I'm using solr 4.3, which I just downloaded today, and am using only the jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (the file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>
Re: solr autodetectparser tikaconfig dataimporter error
I have now changed some things and the import runs without error. In schema.xml I haven't got the field text but contentsExact. Unfortunately the text (from the file) isn't indexed, even though I mapped it to the proper field. What am I doing wrong?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="f" processor="FileListEntityProcessor" baseDir="C:\web\development\tkb\internet\public" fileName="${rec.id}" dataSource="data" onError="skip">
        <entity name="tika" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}">
          <field column="text" name="contentsExact"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

I noticed that when I move the field Author into the tika entity it isn't indexed. Can this have something to do with why the text from the file isn't indexed? Do I have to do something special about the entity levels in document?

PS: how do I import tstamp? It's a static value.

On 14. Jul 2013, at 10:30 PM, Jack Krupansky wrote:

"Caused by: java.lang.NoSuchMethodError:" That means you have some out-of-date jars, or some newer jars mixed in with the old ones. -- Jack Krupansky

-Original Message- From: Andreas Owen Sent: Sunday, July 14, 2013 3:07 PM To: solr-user@lucene.apache.org Subject: Re: solr autodetectparser tikaconfig dataimporter error

Hi, is there no one with an idea what this error is, or who can give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse? Thanks for any help.

On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

I am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index (txt, cfm, pdf), all give the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
  ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
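On the "how do I import tstamp, it's a static value?" question above: one way the DataImportHandler can emit a constant field is the TemplateTransformer, which fills a column from a template string that may be a literal. A hedged config sketch (attribute values are assumptions based on the thread, not a tested fix):

```xml
<!-- declare the transformer on the entity, then give the field a literal template -->
<entity name="rec" processor="XPathEntityProcessor"
        url="docImport.xml" forEach="/albums/album" dataSource="main"
        transformer="TemplateTransformer">
  <field column="tstamp" template="2013-07-05T14:59:46.889Z"/>
</entity>
```

Because the template contains no ${...} placeholders, every row (and therefore every document) gets the same static value in tstamp.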
Re: solr autodetectparser tikaconfig dataimporter error
Hi, is there no one with an idea what this error is, or who can give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse? Thanks for any help.

On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

I am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index (txt, cfm, pdf), all give the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
  ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
  at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
  ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:

<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="contents" xpath="//description"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The libs are included and declared in the logs. I have also tried tika-app 1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
solr autodetectparser tikaconfig dataimporter error
I am using Solr 3.5, tika-app-1.4 and TagSoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index (txt, cfm, pdf), it's always the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
	at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
	... 6 more
Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
	(same stack trace as above, caused by the same NoSuchMethodError in TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122))
Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:

<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="contents" xpath="//description" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>

The libs are included and declared in the logs; I have also tried tika-app 1.0 and TagSoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
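A `NoSuchMethodError` at runtime (rather than at compile time) usually means the class that got loaded comes from a different jar version than the one the calling code was compiled against — here, the Tika jars Solr actually loads do not provide `AutoDetectParser.setConfig(TikaConfig)`. As a first diagnostic step, a minimal sketch (the install path below is hypothetical; adjust it to your Solr home) is to list every Tika jar the install can see and check for two versions shadowing each other:

```shell
# Hypothetical install root; adjust to your Solr home directory.
SOLR_ROOT=/opt/solr

# List every Tika jar on disk under the install. Two different
# tika-core/tika-parsers versions, or a tika-app jar sitting next to
# them, shadowing each other is the usual cause of a NoSuchMethodError
# like the one in the stack trace above.
find "$SOLR_ROOT" -name 'tika-*.jar' 2>/dev/null | sort
```

If more than one version shows up, removing the stale jars so that only one consistent Tika version remains on the classpath is the typical fix for this class of error.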