Code for getting distinct facet counts across shards (Distributed Process).

2011-06-09 Thread rajini maski
 In Solr 1.4.1, for getting the distinct facet term count across shards, the piece of
code added to the distributed process is as follows:





Class: FacetComponent.java

Function: finishStage(ResponseBuilder rb)



  for (DistribFieldFacet dff : fi.facets.values()) {

    // ... just after this existing block:
    else { // TODO: log error or throw exception?
      counts = dff.getLexSorted();
    }

    // new code starts here
    int namedistint = 0;
    namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
        FacetParams.FACET_NAMEDISTINCT, 0);

    if (namedistint == 0)
      facet_fields.add(dff.getKey(), fieldCounts);

    if (namedistint == 1)
      facet_fields.add("numFacetTerms", counts.length);

    if (namedistint == 2) {
      NamedList resCount = new NamedList();
      resCount.add("numFacetTerms", counts.length);
      resCount.add("counts", fieldCounts);
      facet_fields.add(dff.getKey(), resCount);
    }




Is this flow correct? I have worked with a few test cases and it has worked
fine, but I want to know if there are any bugs that can creep in here. (My
concern is that this piece of code should not affect the rest of the logic.)




*Code flow with comments for reference:*


Function: finishStage(ResponseBuilder rb)

  // in this for loop:
  for (DistribFieldFacet dff : fi.facets.values()) {

    // ... just after this existing block:
    else { // TODO: log error or throw exception?
      counts = dff.getLexSorted();
    }

    int namedistint = 0;  // default

    // get the value of facet.numFacetTerms from the input query
    namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
        FacetParams.FACET_NAMEDISTINCT, 0);

    // based on whether facet.numFacetTerms is 0, 1 or 2:

    // get only the facet field counts
    if (namedistint == 0) {
      facet_fields.add(dff.getKey(), fieldCounts);
    }

    // get only the distinct facet term count
    if (namedistint == 1) {
      facet_fields.add("numFacetTerms", counts.length);
    }

    // get the facet field counts and the distinct term count
    if (namedistint == 2) {
      NamedList resCount = new NamedList();
      resCount.add("numFacetTerms", counts.length);
      resCount.add("counts", fieldCounts);
      facet_fields.add(dff.getKey(), resCount);
    }





Regards,

Rajani





On Fri, May 27, 2011 at 1:14 PM, rajini maski rajinima...@gmail.com wrote:

  No such issues. Successfully integrated with 1.4.1, and it works across
 a single index.

 For the f.2.facet.numFacetTerms=1 parameter it will give the distinct count
 result.

 For the f.2.facet.numFacetTerms=2 parameter it will give the counts as well as
 the results for facets.

 But this is working only across a single index, not the distributed process. The
 conditions you have added in SimpleFacets.java (the if namedistinct == 0, 1 and 2
 conditions): should they be added to the distributed process
 function to enable it to work across shards?

 Rajani



 On Fri, May 27, 2011 at 12:33 PM, Bill Bell billnb...@gmail.com wrote:

 I am pretty sure it does not yet support distributed shards..

 But the patch was written for 4.0... So there might be issues with running
 it on 1.4.1.

 On 5/26/11 11:08 PM, rajini maski rajinima...@gmail.com wrote:

  The patch SOLR-2242 for getting the count of distinct facet terms
 doesn't work for distributedProcess
 (https://issues.apache.org/jira/browse/SOLR-2242)

 The error log says:

  HTTP ERROR 500
 Problem accessing /solr/select. Reason:

 For input string: "numFacetTerms"

 java.lang.NumberFormatException: For input string: "numFacetTerms"
 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
 at java.lang.Long.parseLong(Long.java:403)
 at java.lang.Long.parseLong(Long.java:461)
 at org.apache.solr.schema.TrieField.readableToIndexed(TrieField.java:331)
 at org.apache.solr.schema.TrieField.toInternal(TrieField.java:344)
 at org.apache.solr.handler.component.FacetComponent$DistribFieldFacet.add(FacetComponent.java:619)
 at org.apache.solr.handler.component.FacetComponent.countFacets(FacetComponent.java:265)
 at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:235)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at

Re: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan


--- On Thu, 6/9/11, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote:

 From: Bryan Loofbourrow bloofbour...@knowledgemosaic.com
 Subject: Displaying highlights in formatted HTML document
 To: solr-user@lucene.apache.org
 Date: Thursday, June 9, 2011, 2:14 AM
 Here is my use case:

 I have a large number of HTML documents, sizes in the 0.5K-50M range, most
 around, say, 10M.

 I want to be able to present the user with the formatted HTML document, with
 the hits tagged, so that he may iterate through them, and see them in the
 context of the document, with the document looking as it would be presented
 by a browser; that is, fully formatted, with its tables and italics and font
 sizes and all.

 This is something that the user would explicitly request from within a set
 of search results, not something I'd expect to have returned from an initial
 search - the initial search merely returns the snippets around the hits. But
 if the user wants to dive into one of the returned results and see them in
 context, I need to be able to go get that.

 We are currently solving this problem by using an entirely separate search
 engine (dtSearch), which performs the tagging of the hits in the HTML just
 fine. But the solution is unsatisfactory because there are Solr searches
 that dtSearch's capabilities cannot reasonably match.

 Can anyone suggest a good way to use Solr/Lucene for this instead? I'm
 thinking a separate core for this purpose might make sense, so as not to
 burden the primary search core with the full contents of the document. But
 after that, I'm stuck. How can I get Solr to express the highlighting in the
 context of the formatted HTML document?

 If Solr does not do this currently, and anyone can suggest ways to add the
 feature, any tips on how this might best be incorporated into the
 implementation would be welcome.

I am doing the same thing (solr trunk) using the following field type:

<fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

In your separate core - which is queried when the user wants to dive into
one of the returned results - feed your HTML files into this field.

You may want to increase max analyzed chars too:
<int name="hl.maxAnalyzedChars">147483647</int>


wrong index version of solr3.2?

2011-06-09 Thread Bernd Fehling


After switching to solr 3.2 and building a new index from scratch I ran 
check_index which reports:
Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1]

Why do I get FORMAT_3_1 and Lucene 3.1, anything wrong with my index?

from my schema.xml:
<schema name="my_solr320_schema" version="1.3">

from my solrconfig.xml:
<luceneMatchVersion>LUCENE_32</luceneMatchVersion>

Regards,
Bernd


Re: Multiple Values not getting Indexed

2011-06-09 Thread Stefan Matheis
Pawan,

just separating multiple values by comma does not make them
multi-value in solr-speak. But if you're already using DIH, you may
try the http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
to 'splitBy' the field and get the expected field-values
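
For example, a data-config.xml entity along these lines (a rough sketch; the
query is invented, and the column names follow Pawan's example fields) would
split the comma-separated values into multiple field values:

<entity name="item" transformer="RegexTransformer"
        query="SELECT id, city_type, city_desc FROM items">
  <!-- splitBy turns each comma-separated string into several field values -->
  <field column="city_type" splitBy=","/>
  <field column="city_desc" splitBy=","/>
</entity>

The matching schema fields would also need multiValued="true" for Solr to
accept more than one value per document.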

Regards
Stefan

On Thu, Jun 9, 2011 at 6:14 AM, Pawan Darira pawan.dar...@gmail.com wrote:
 Hi

 I am trying to index 2 fields with multiple values. BUT, it is only putting
 1 value for each, ignoring the rest of the values after the comma (,). I am
 fetching the query through DIH. It works fine if I have only 1 value in each
 of the 2 fields.

 E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
      Field2 - Chandigarh,Gurgaon,New Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others

 *Schema.xml*

 <field name="city_type" type="text" indexed="true" stored="true"/>
 <field name="city_desc" type="text" indexed="true" stored="true"/>

 p.s. I tried multiValued="true" but of no help.

 --
 Thanks,
 Pawan Darira



Re: Code for getting distinct facet counts across shards (Distributed Process).

2011-06-09 Thread Bill Bell
I have coded and tested this and it appears to work.

Are you having any problems?

On 6/9/11 12:35 AM, rajini maski rajinima...@gmail.com wrote:


Re: Multiple Values not getting Indexed

2011-06-09 Thread Bill Bell
Is there a way to splitBy and trim the field after splitting?

I know I can do it with Javascript in DIH, but how about using the regex
parser?

On 6/9/11 1:18 AM, Stefan Matheis matheis.ste...@googlemail.com wrote:






Re: Multiple Values not getting Indexed

2011-06-09 Thread Bill Bell
You have to take the input and split it by something like "," to get it into
an array, and repost that back to
Solr...

I believe others have suggested that?

On 6/8/11 10:14 PM, Pawan Darira pawan.dar...@gmail.com wrote:





Solr monitoring: Newrelic

2011-06-09 Thread roySolr
Hello,

I found this tool to monitor Solr queries, cache etc.:

http://newrelic.com/

I have some problems with the installation of it. I get the following
errors:

Could not locate a Tomcat, Jetty or JBoss instance in /var/www/sites/royr
Try re-running the install command from AppServerRootDirectory/newrelic.
If that doesn't work, locate and edit the start script manually.
Generated New Relic configuration file
/var/www/sites/royr/newrelic/newrelic.yml
* Install incomplete

Does anybody have experience with Newrelic in combination with Solr?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042889.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr monitoring: Newrelic

2011-06-09 Thread Sujatha Arun
You need to install the New Relic folder under the Tomcat folder, in case the
app server is Tomcat.

Then from the command line, you need to run the install command given on
the New Relic site from your newrelic folder.

Once this is done, restart the app server and you should be able to see a log
file created under the newrelic folder, if all went well.

Regards
Sujatha
On Thu, Jun 9, 2011 at 1:27 PM, roySolr royrutten1...@gmail.com wrote:




Re: Solr monitoring: Newrelic

2011-06-09 Thread roySolr

I use Jetty; it's standard in the Solr package. Where can I find
the Jetty folder?

Then I can start this command:
java -jar newrelic.jar install

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042981.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Displaying highlights in formatted HTML document

2011-06-09 Thread lboutros
Hi Bryan,

how do you index your html files? I mean, do you create fields for different
parts of your document (for different stop word lists, stemming, etc)?
With DIH or solrj or something else?

iorixxx, could you please explain your solution a bit more, because I don't
see how it could give exact highlighting, given the different analysis
applied to each field.

I developed a new highlighter module this week which transfers the field
highlighting to the original document (xml in my case); I use payloads to
store offsets and lengths of fields in the index. This way, I use the right
analyzers to do the highlighting correctly and then replace the different
field parts in the document with the highlighted parts. It is not finished
yet, but I already have some good results.
This is a client request too. Let me know if iorixxx's solution is not
enough for your particular use case.

Ludovic.



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Displaying-highlights-in-formatted-HTML-document-tp3041909p3042983.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr monitoring: Newrelic

2011-06-09 Thread Sujatha Arun
There is no Jetty folder in the standard package, but the Jetty war file
is under the example/lib folder, so this is where you need to put the newrelic
folder, I guess.

Regards
Sujatha

On Thu, Jun 9, 2011 at 2:03 PM, roySolr royrutten1...@gmail.com wrote:





Re: Solr monitoring: Newrelic

2011-06-09 Thread roySolr
Yes, that's the problem. There is no Jetty folder.
I have tried the example/lib directory; it's not working. There is no Jetty
war file, only
jetty-***.jar files.

Same error: could not locate a Jetty instance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3043080.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan
 iorixxx, could you please explain a bit more your solution,
 because I don't
 see how your solution could give an exact highlighting, I
 mean with the
 different fields analysis for each fields.

It does not work with your use case (e.g. different synonyms applied to different
parts of the html/xml, etc.).



ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
Hi,

I post a PDF from a CMS client, which has metadata about the document. One of
those metadata fields is the title. I trust the title from the CMS more than the title
extracted from the PDF, but I cannot find a way to both send
literal.title=CMS-Title as well as change the name of the title field
generated by Tika/SolrCell. If I do fmap.title=tika_title then my literal.title
also changes name. Any ideas?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com



Re: how to Index and Search non-English Text in solr

2011-06-09 Thread Mohammad Shariq
Can I specify multiple languages in filter tags in schema.xml? Like below:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Chinese"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <tokenizer class="solr.CJKTokenizerFactory"/>

    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>

On 8 June 2011 18:47, Erick Erickson erickerick...@gmail.com wrote:

 This page is a handy reference for individual languages...
 http://wiki.apache.org/solr/LanguageAnalysis

 But the usual approach, especially for Chinese/Japanese/Korean
 (CJK) is to index the content in different fields with language-specific
 analyzers then spread your search across the language-specific
 fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
 particularly give surprising results if you put words from different
 languages in the same field.

 Best
 Erick

 On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq shariqn...@gmail.com
 wrote:
   Hi,
   I had set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in
   English, but my requirement extends to indexing news in other languages too.

   This is how my schema looks:
   <field name="news" type="text" indexed="true" stored="false" required="false"/>

   And the text field in schema.xml looks like:

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
     </analyzer>
   </fieldType>

   My problem is:
   Now I want to index news articles in other languages too, e.g.
   Chinese, Japanese.
   How can I modify my text field so that I can index news in other languages
   too and make it searchable?
 
  Thanks
  Shariq
 
 
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 




-- 
Thanks and Regards
Mohammad Shariq


Re: Solr monitoring: Newrelic

2011-06-09 Thread Sujatha Arun
Try the RPM support, accessed from the account support page, giving all
details; they are very helpful.

Regards
Sujatha

On Thu, Jun 9, 2011 at 2:33 PM, roySolr royrutten1...@gmail.com wrote:




Re: AW: How to deal with many files using solr external file field

2011-06-09 Thread Martin Grotzke
Hi,

as I'm also involved in this issue (on the side of Sven) I created a
patch that replaces the float array with a map that stores scores by doc,
so it contains only as many entries as the external scoring file contains
lines, but no more.

I created an issue for this: https://issues.apache.org/jira/browse/SOLR-2583

It would be great if someone could have a look at it and comment.

Thanx for your feedback,
cheers,
Martin
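
For reference, the kind of external file field discussed here is declared in
schema.xml roughly like this (a sketch only; the type, field and key names are
illustrative, following the ExternalFileField docs linked at the end of the thread):

<fieldType name="externalScore" keyField="id" defVal="0"
           stored="false" indexed="false"
           class="solr.ExternalFileField" valType="pfloat"/>
<field name="score_trousers" type="externalScore"/>

The score values themselves live in an external_<fieldname> file in the index
data directory, which is the file Solr tries to load per term below.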


On 06/08/2011 12:22 PM, Bohnsack, Sven wrote:
 Hi,
 
 I could not provide a stack trace and IMHO it won't provide some useful 
 information. But we've made a good progress in the analysis.
 
 We took a deeper look at what happened, when an external-file-field-Request 
 is sent to SOLR:
 
 * SOLR looks if there is a file for the requested query, e.g. trousers
 * If so, then SOLR loads the trousers-file and generates a HashMap-Entry 
 consisting of a FileFloatSource-Object and a FloatArray with the size of the 
 number of documents in the SOLR-index. Every document matched by the query 
 gains the score-value, which is provided in the external-score-file. For 
 every(!) other document SOLR writes a zero in that FloatArray
 * if SOLR does not find a file for the query-Request, then SOLR still 
 generates a HashMapEntry with score zero for every document
 
 In our case we have about 8.5 Mio. documents in our index and one of those 
 Arrays occupies about 34MB Heap Space. Having e.g. 100 different queries and 
 using external file field for sorting the result, SOLR occupies about 3.4GB 
 of Heap Space.
 
 The problem might be the use of WeakHashMap [1], which prevents the Garbage 
 Collector from cleaning up unused Keys.
 
 
 What do you think could be a possible solution for this whole problem? 
 (except from don't use external file fields ;)
 
 
 Regards
 Sven
 
 
 [1]: A hashtable-based Map implementation with weak keys. An entry in a 
 WeakHashMap will automatically be removed when its key is no longer in 
 ordinary use. More precisely, the presence of a mapping for a given key will 
 not prevent the key from being discarded by the garbage collector, that is, 
 made finalizable, finalized, and then reclaimed. When a key has been 
 discarded its entry is effectively removed from the map, so this class 
 behaves somewhat differently than other Map implementations.
 
 -----Original Message-----
 From: mtnes...@gmail.com [mailto:mtnes...@gmail.com] On behalf of Simon
 Rosenthal
 Sent: Wednesday, 8 June 2011 03:56
 To: solr-user@lucene.apache.org
 Subject: Re: How to deal with many files using solr external file field
 
 Can you provide a stack trace for the OOM exception?
 
 On Tue, Jun 7, 2011 at 4:25 PM, Bohnsack, Sven
 sven.bohns...@shopping24.dewrote:
 
 Hi all,

  we're using Solr 1.4 and external file fields ([1]) for sorting our
  search results. We have about 40,000 terms for which we use this sorting
  option.
  Currently we're running into massive OutOfMemory problems and we're not
  quite sure what the matter is. It seems that the garbage collector stops
  working or some processes are going wild. However, Solr starts to allocate
  more and more RAM until we experience this OutOfMemory exception.


  We noticed the following:

  For some terms one can see in the Solr log that java.io.FileNotFoundExceptions
  appear when Solr tries to load an external file for a term for which there is
  no such file, e.g. Solr tries to load the external score file for trousers
  but there is none in the /solr/data folder.

  Question: is it possible that those exceptions are responsible for the
  OutOfMemory problem, or could it be due to the large(?) number of 40k terms
  for which we want to sort the result via external file field?

 I'm looking forward for your answers, suggestions and ideas :)


 Regards
 Sven


 [1]:
 http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html


-- 
Martin Grotzke
http://twitter.com/martin_grotzke



signature.asc
Description: OpenPGP digital signature


Re: Tokenising based on known words?

2011-06-09 Thread lee carroll
we've played with HyphenationCompoundWordTokenFilterFactory; it works
better than maintaining a word dictionary to split on (although we ended
up not using it, for reasons I can't recall)

see

http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html
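
As a rough sketch of how that filter is wired into an index-time analyzer (the
hyphenation grammar and dictionary file names here are placeholders, not files
from lee's setup):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- hyphenation.xml and words.txt are hypothetical example files -->
  <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
          hyphenator="hyphenation.xml" dictionary="words.txt"
          minSubwordSize="3" onlyLongestMatch="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>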



On 9 June 2011 06:42, Gora Mohanty g...@mimirtech.com wrote:
 On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com wrote:
 Not sure if this possible, but figured I would ask the question.

 Basically, we have some users who do some pretty rediculous things ;o)

 Rather than writing red jacket, they write redjacket, which obviously
 returns no results.
 [...]

 Have you tried using synonyms,
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
 It seems like they should fit your use case.

 Regards,
 Gora



Boost or sort a query with range values

2011-06-09 Thread jlefebvre
Hello

I'm trying to boost a query with a range of values but I can't find the correct
syntax.
This is OK: bq=myfield:-1^5, but I want to do something like this:
bq=myfield:-1 to 1^5

i.e. boost values from -1 to 1.

thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043328.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost or sort a query with range values

2011-06-09 Thread lee carroll
[* TO *]^5

On 9 June 2011 11:31, jlefebvre jlefeb...@allocine.fr wrote:
 Hello

 I try to boost a query with a range values but I can't find the correct
 syntax :
 this is ok .bq=myfield:-1^5 but I want to do something lik this
 bq=myfield:-1 to 1^5

 Boost value from -1 to 1

 thanks

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043328.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Boost or sort a query with range values

2011-06-09 Thread jlefebvre
thanks it's ok

another question
how to do a condition in bq ?

something like bq=iif(myfield1 = 0 AND myfield2 = 1;1;0)

thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043406.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost or sort a query with range values

2011-06-09 Thread Jan Høydahl
Check the new if() function in Trunk, SOLR-2136. You could then use it in bf= 
or boost=

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. juni 2011, at 13.05, jlefebvre wrote:

 thanks it's ok
 
 another question
 how to do a condition in bq ?
 
 something like bq=iif(myfield1 = 0 AND myfield2 = 1;1;0)
 
 thanks
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043406.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Boost or sort a query with range values

2011-06-09 Thread Jan Høydahl
Btw. your example is a simple boolean query, and this will also work:
bq=(myfield1:0 AND myfield2:1)^100.0
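
If that boost should apply to every request, the same parameter can also be set
as a request handler default in solrconfig.xml; a minimal sketch (the handler
name is just an example, and myfield1/myfield2 are the hypothetical fields from
this thread):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="bq">(myfield1:0 AND myfield2:1)^100.0</str>
  </lst>
</requestHandler>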

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. juni 2011, at 13.31, Jan Høydahl wrote:

 Check the new if() function in Trunk, SOLR-2136. You could then use it in 
 bf= or boost=
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com
 
 On 9. juni 2011, at 13.05, jlefebvre wrote:
 
 thanks it's ok
 
 another question
 how to do a condition in bq ?
 
 something like bq=iif(myfield1 = 0 AND myfield2 = 1;1;0)
 
 thanks
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043406.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: London open source search social - 13th June

2011-06-09 Thread Richard Marr
Just a quick reminder that we're meeting on Monday. Come along if you're
around.


On 1 June 2011 13:27, Richard Marr richard.m...@gmail.com wrote:

 Hi guys,

 Just to let you know we're meeting up to talk all-things-search on Monday
 13th June. There's usually a good mix of backgrounds and experience levels
 so if you're free and in the London area then it'd be good to see you there.

 Details:
 7pm - The Elgin - 96 Ladbrooke Grove
 http://www.meetup.com/london-search-social/events/20387881/

 

 Greetings search geeks!

 We've booked the next meetup for the 13th June. As usual, the plan is to
 meet up and geek out over a friendly beer.

 I know my co-organiser René has been working on some interesting search
 projects, and I've recently left Empora to work on my own project so by June
 I should hopefully have some war stories about using @elasticsearch in
 production. The format is completely open though so please bring your own
 topics if you've got them.

 Hope to see you there!

 --
 Richard Marr


[Mahout] Integration with Solr

2011-06-09 Thread Adam Estrada
Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
core build but the docs say that it's not very good for very large indexes.
Anyone have thoughts on this?

Thanks,
Adam


Re: Tokenising based on known words?

2011-06-09 Thread Mark Mandel
Synonyms really wouldn't work for every possible combination of words in our
index.

Thanks for the idea though.

Mark

On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com wrote:
  Not sure if this possible, but figured I would ask the question.
 
  Basically, we have some users who do some pretty rediculous things ;o)
 
  Rather than writing red jacket, they write redjacket, which obviously
  returns no results.
 [...]

 Have you tried using synonyms,

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
 It seems like they should fit your use case.

 Regards,
 Gora




-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


Edismax sorting help

2011-06-09 Thread Denis Kuzmenok
Hi, everyone.

I have fields:
text fields: name, title, text
boolean field: isflag (true / false)
int field: popularity (0 to 9)

Now i do query:
defType=edismax
start=0
rows=20
fl=id,name
q=lg optimus
fq=
qf=name^3 title text^0.3
sort=score desc
pf=name
bf=isflag sqrt(popularity)
mm=100%
debugQuery=on


If I do a query like Samsung, I want to see the most relevant results with
isflag:true and higher popularity first; but if I do a query like Nokia
6500 and the exact match has isflag:false, then it should still rank higher
because of the exact match. I tried different combinations but didn't find
one that suits me; I just got isflag/popularity sorting working, or
isflag/relevancy sorting.



Re: tika integration exception and other related queries

2011-06-09 Thread Gary Taylor

Naveen,

Not sure our requirement matches yours, but one of the things we index 
is a comment item that can have one or more files attached to it.  To 
index the whole thing as a single Solr document we create a zipfile 
containing a file with the comment details in it and any additional 
attached files.  This is submitted to Solr as a TEXT field in an XML 
doc, along with other meta-data fields from the comment.  In our schema 
the TEXT field is indexed but not stored, so when we search and get a 
match back it doesn't contain all of the contents from the attached 
files etc., only the stored fields in our schema.   Admittedly, the user 
can therefore get back a comment match with no indication as to WHERE 
the match occurred (ie. was it in the meta-data or the contents of the 
attached files), but at the moment we're only interested in getting 
appropriate matches, not explaining where the match is.
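
A minimal sketch of the kind of indexed-but-not-stored field described above
(the field and type names are just examples, not Gary's actual schema):

<!-- indexed so the attachment contents are searchable, but not stored,
     so they never come back in the search results -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>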


Hope that helps.

Kind regards,
Gary.



On 09/06/2011 03:00, Naveen Gupta wrote:

Hi Gary

It started working. Though I did not test for zip files, for rar
files it is working fine.

The only thing I wanted to do is index the metadata (text mapped to
content), not store the data. Also, in the search results I want to filter the
stuff, and that started working fine. I don't want to show the content
to the end user, since the way it extracts the information is not
very helpful to the user. Although we can apply a few of the analyzers and
filters to remove the unnecessary tags, the information would still not be
of much help. Looking for your opinion: what did you do in order to filter
out the content, or are you showing the extracted content to the end user?

Even if we are showing the text part to the end user, how can I limit
the number of characters while querying the search results? Is there any
feature where we can achieve this, the concept of a snippet kind of thing?

Thanks
Naveen

On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylorg...@inovem.com  wrote:


Naveen,

For indexing Zip files with Tika, take a look at the following thread :


http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

I got it to work with the 3.1 source and a couple of patches.

Hope this helps.

Regards,
Gary.



On 08/06/2011 04:12, Naveen Gupta wrote:


Hi Can somebody answer this ...

3. can somebody tell me an idea how to do indexing for a zip file ?

1. while sending docx, we are getting following error.





Re: [Mahout] Integration with Solr

2011-06-09 Thread Tomás Fernández Löbbe
I don't know much of it, but I know Grant Ingersoll posted about that:
http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/

On Thu, Jun 9, 2011 at 9:24 AM, Adam Estrada
estrada.adam.gro...@gmail.comwrote:

 Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
 core build but the docs say that it's not very good for very large indexes.
 Anyone have thoughts on this?

 Thanks,
 Adam



RE: Tokenising based on known words?

2011-06-09 Thread Steven A Rowe
Hi Mark,

Are you familiar with shingles aka token n-grams?

http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html

Use the empty string for the tokenSeparator to get wordstogether style tokens 
in your index. 

I think you'll want to apply this filter only at index-time, since the users 
will supply the shingles all by themselves :).
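
A rough index-time analyzer along those lines (a sketch; it assumes the
tokenSeparator attribute is available in your Solr version, and the
surrounding tokenizer and filters are just examples):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- emits "red jacket" additionally as the single token "redjacket" -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
          outputUnigrams="true" tokenSeparator=""/>
</analyzer>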

Steve

 -Original Message-
 From: Mark Mandel [mailto:mark.man...@gmail.com]
 Sent: Thursday, June 09, 2011 8:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenising based on known words?
 
 Synonyms really wouldn't work for every possible combination of words in
 our
 index.
 
 Thanks for the idea though.
 
 Mark
 
 On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty g...@mimirtech.com wrote:
 
  On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com
 wrote:
   Not sure if this possible, but figured I would ask the question.
  
   Basically, we have some users who do some pretty rediculous things
 ;o)
  
   Rather than writing red jacket, they write redjacket, which
 obviously
   returns no results.
  [...]
 
  Have you tried using synonyms,
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymF
 ilterFactory
  It seems like they should fit your use case.
 
  Regards,
  Gora
 
 
 
 
 --
 E: mark.man...@gmail.com
 T: http://www.twitter.com/neurotic
 W: www.compoundtheory.com
 
 cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
 http://www.cfobjective.com.au
 
 Hands-on ColdFusion ORM Training
 www.ColdFusionOrmTraining.com


how can I return function results in my query?

2011-06-09 Thread Jason Toy
I want to be able to run a query  like idf(text, 'term') and have that data
returned with my search results.  I've searched the docs,but I'm unable to
find how to do it.  Is this possible and how can I do that ?


Re: how can I return function results in my query?

2011-06-09 Thread Ahmet Arslan
 I want to be able to run a
 query  like idf(text, 'term') and have that data
 returned with my search results.  I've searched the
 docs,but I'm unable to
 find how to do it.  Is this possible and how can I do
 that ?

http://wiki.apache.org/solr/FunctionQuery#idf


Re: how to Index and Search non-English Text in solr

2011-06-09 Thread Erick Erickson
No, you'd have to create multiple fieldTypes, one for each language

Best
Erick
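
As a rough sketch of that approach (type and field names here are only
examples, not from Shariq's schema):

<!-- one analysis chain per language -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- language-specific copies of the news field; queries then search across both -->
<field name="news_en" type="text_en" indexed="true" stored="false"/>
<field name="news_cjk" type="text_cjk" indexed="true" stored="false"/>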

On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq shariqn...@gmail.com wrote:



Re: Edismax sorting help

2011-06-09 Thread Yonik Seeley
2011/6/9 Denis Kuzmenok forward...@ukr.net:
 Hi, everyone.

 I have fields:
 text fields: name, title, text
 boolean field: isflag (true / false)
 int field: popularity (0 to 9)

 Now i do query:
 defType=edismax
 start=0
 rows=20
 fl=id,name
 q=lg optimus
 fq=
 qf=name^3 title text^0.3
 sort=score desc
 pf=name
 bf=isflag sqrt(popularity)
 mm=100%
 debugQuery=on


 If i do query like Samsung i want to see prior most relevant results
 with  isflag:true and bigger popularity, but if i do query like Nokia
 6500  and  there is isflag:false, then it should be higher because of
 exact  match.  Tried different combinations, but didn't found one that
 suites   me.   Just   got   isflag/popularity   sorting   working   or
 isflag/relevancy sorting.

Multiplicative boosts tend to be more stable...

Perhaps try replacing
  bf=isflag sqrt(popularity)
with
  bq=isflag:true^10  // vary the boost to change how much
isflag counts vs the relevancy score of the main query
  boost=sqrt(popularity)  // this will multiply the result by
sqrt(popularity)... assumes that every document has a non-zero
popularity

You could get more creative in trunk where booleans have better
support in function queries.

-Yonik
http://www.lucidimagination.com


Re: Solr monitoring: Newrelic

2011-06-09 Thread Ken Krugler
It sounds like roySolr is running embedded Jetty, launching solr using the 
start.jar

If so, then there's no app container where Newrelic can be installed.

-- Ken

On Jun 9, 2011, at 2:28am, Sujatha Arun wrote:


--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Edismax sorting help

2011-06-09 Thread Denis Kuzmenok
Your solution seems to work fine; not perfect, but much better than
mine :)
Thanks!

 If i do query like Samsung i want to see prior most relevant results
 with  isflag:true and bigger popularity, but if i do query like Nokia
 6500  and  there is isflag:false, then it should be higher because of
 exact  match.  Tried different combinations, but didn't found one that
 suites   me.   Just   got   isflag/popularity   sorting   working   or
 isflag/relevancy sorting.

 Multiplicative boosts tend to be more stable...

 Perhaps try replacing
   bf=isflag sqrt(popularity)
 with
   bq=isflag:true^10  // vary the boost to change how much
 isflag counts vs the relevancy score of the main query
   boost=sqrt(popularity)  // this will multiply the result by
 sqrt(popularity)... assumes that every document has a non-zero
 popularity

 You could get more creative in trunk where booleans have better
 support in function queries.

 -Yonik
 http://www.lucidimagination.com






Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Koji Sekiguchi

(11/06/09 4:24), Burton-West, Tom wrote:

We are trying to implement highlighting for wildcard (MultiTerm) queries.  This
seems to work fine with the regular highlighter, but when we try to use the
fastVectorHighlighter we don't see any results in the highlighting section of
the response.  Appended below are the parameters we are using.


That is by design in FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery and
DisjunctionMaxQuery, and queries constructed from those queries.

koji
--
http://www.rondhuit.com/en/


Re: [Mahout] Integration with Solr

2011-06-09 Thread Tommaso Teofili
Hello Adam,
I've managed to create a small POC of integrating Mahout with Solr for a
clustering task, do you want to use it for clustering only or possibly for
other purposes/algorithms?
More generally speaking, I think it'd be nice if Solr could be extended with
a proper API for integrating clustering engines in it so that one can plug
and exchange engines flawlessly (just need an Adapter).
Regards,
Tommaso

2011/6/9 Adam Estrada estrada.adam.gro...@gmail.com

 Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
 core build but the docs say that it's not very good for very large indexes.
 Anyone have thoughts on this?

 Thanks,
 Adam



Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello all,

I have checked the forums to see if it is possible to create an index from
multiple datasources. I have found references to SOLR-1358, but I don't think
this fits my scenario. In all, we have an application where we upload files. On
the file upload, I use the Tika extract handler to save metadata from the file
(_attr, literal values, etc.). We also have a database which has information
on the uploaded files, like the category, type, etc. I would like to update
the index to include this information from the DB for each
document. If I run a DataImportHandler after the extract phase, I am afraid that
updating the doc in the index by its id will just cause me to overwrite the
old information with the info from the DB (what I understand is that Solr
updates its index by ID by deleting first, then recreating the info).

Anyone have any pointers? Is there a clean way to do this, or must I find a way
to pass the DB metadata to the extract handler and save it as literal fields?

Thanks in advance

Greg


[Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
All,

I am at a bit of a loss here so any help would be greatly appreciated. I am
using the DIH to grab data from a DB. The field that I am most interested in
has anywhere from 1 word to several paragraphs worth of free text. What I
would really like to do is pull out phrases like "Joe's coffee shop" rather
than the 3 individual words. I have tried the KeywordTokenizerFactory and
that does seem to do what I want for the most part, but it is not actually
tokenizing anything, so it's not creating the tokens that I need for further
analysis in apps like Mahout.

We can play with the combination of tokenizers and filters all day long and
see what the results are after a quick reindex. I typically just view them
in Solritas as facets, which may be part of the problem for me too. Does anyone
have an example fieldType they can share with me that shows how to
extract phrases, if they are there, from the data I described earlier? Am I
even going about this the right way? I am using today's trunk build of Solr,
and here is what I have munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks,
Adam


Re: [Mahout] Integration with Solr

2011-06-09 Thread Adam Estrada
Thanks for the reply, Tommaso! I would like to see tighter integration like
in the way Nutch integrates with Solr. There is a single param that you set
which points to the Solr instance. My interest in Mahout is with it's
abitlity to handle large data and find frequency, co-location of data,
clustering, etc...All the algorithms that are in the core build are great
and I am just now wrapping my head around how to use them all.

Adam

On Thu, Jun 9, 2011 at 10:33 AM, Tommaso Teofili
tommaso.teof...@gmail.comwrote:

 Hello Adam,
 I've managed to create a small POC of integrating Mahout with Solr for a
 clustering task, do you want to use it for clustering only or possibly for
 other purposes/algorithms?
 More generally speaking, I think it'd be nice if Solr could be extended
 with
 a proper API for integrating clustering engines in it so that one can plug
 and exchange engines flawlessly (just need an Adapter).
 Regards,
 Tommaso

 2011/6/9 Adam Estrada estrada.adam.gro...@gmail.com

  Has anyone integrated Mahout with Solr? I know that Carrot2 is part of
 the
  core build but the docs say that it's not very good for very large
 indexes.
  Anyone have thoughts on this?
 
  Thanks,
  Adam
 



RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Burton-West, Tom
Hi Koji,


Thank you for your reply.

 It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery 
 and DisjunctionMaxQuery
 and Query constructed by those queries.

Sorry, I'm not sure I understand.  Are you saying that FVH supports MultiTerm 
highlighting?  

Tom



Re: ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
One solution to this problem is to change the order of field operation 
(http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations)
 to first do fmap.*= processing, then add the fields from literal.*=. Why would 
anyone want to rename a field they just have explicitly named anyway?

Another solution that would work for me is an option to let ALL tika generated 
fields be prefixed, e.g. tprefix=tika_. But I need Extracting handler to output 
to fields which do not exist in schema.xml. This is because later in the 
UpdateChain I do field choosing and renaming in another UpdateProcessor, so the 
field names coming from ExtractingHandler are only tempoprary and will not be 
sent to Solr. Thus, an option to skip the schema check would be useful, perhaps 
in the form of a whitelist for uprefix 
uprefix.whitelist=fielda,other-non-existing-field, causing uprefix not rename 
those.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. juni 2011, at 11.26, Jan Høydahl wrote:

 Hi,
 
 I post a PDF from a CMS client, which has metadata about the document. One of 
 those metadata is the title. I trust the title of the CMS more than the title 
 extracted from the PDF, but I cannot find a way to both send 
 literal.title=CMS-Title as well as changing the name of the title field 
 generated by Tika/SolrCell. If I do fmap.title=tika_title then my 
 literal.title also changes name. Any ideas?
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com
 



Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Koji Sekiguchi

(11/06/10 0:14), Burton-West, Tom wrote:

Hi Koji,


Thank you for your reply.


It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery and 
DisjunctionMaxQuery
and Query constructed by those queries.


Sorry, I'm not sure I understand.  Are you saying that FVH supports MultiTerm 
highlighting?


Tom,

I'm sorry but FVH doesn't cover MultiTermQuery.

koji
--
http://www.rondhuit.com/en/


Re: Indexing data from multiple datasources

2011-06-09 Thread Erick Erickson
Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point
and add whatever information
you need to the document.

If you're using DIH to do the crawl, consider implementing a Transformer to do
the database querying and modify the document as necessary. This is pretty
simple to do; we can chat a bit more depending on whether either approach makes
sense.
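
For reference, a rough sketch of what such a Transformer could look like (the
class name, field names, and helper are illustrative, not from any shipped example):

// Minimal DIH Transformer sketch: enrich each imported row with data
// looked up elsewhere (e.g. a database) before it becomes a Solr document.
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class DbEnrichmentTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object id = row.get("id");
    // look up extra metadata for this id (JDBC call, cache, etc.)
    row.put("category", lookupCategory(id)); // hypothetical helper
    return row;
  }

  private String lookupCategory(Object id) {
    return "unknown"; // placeholder for a real database lookup
  }
}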

Best
Erick



On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote:
 Hello all,

 I have checked the forums to see if it is possible to create and index from 
 multiple datasources. I have found references to SOLR-1358, but I don't think 
 this fits my scenario. In all, we have an application where we upload files. 
 On the file upload, I use the Tika extract handler to save metadata from the 
 file (_attr, literal values, etc..). We also have a database which has 
 information on the uploaded files, like the category, type, etc.. I would 
 like to update the index to include this information from the db in the index 
 for each document. If I run a dataimporthandler after the extract phase I am 
 afraid that by updating the doc in the index by its id will just cause that I 
 overwrite the old information with the info from the DB (what I understand is 
 that Solr updates its index by ID by deleting first then recreating the info).

 Anyone have any pointers, is there a clean way to do this, or must I find a 
 way to pass the db metadata to the extract handler and save it as literal 
 fields?

 Thanks in advance

 Greg



Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The problem here is that none of the built-in filters or tokenizers
have a prayer
of recognizing what #you# think are phrases, since it'll be unique to
your situation.

If you have a list of phrases you care about, you could substitute a
single token
for the phrases you care about...

But the overriding question is what determines a phrase you're
interested in? Is it
a list or is there some heuristic you want to apply?

Or could you just recognize them at query time and make them into a
literal phrase
(i.e. with quotation marks)?

Best
Erick

On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
 All,

 I am at a bit of a loss here so any help would be greatly appreciated. I am
 using the DIH to grab data from a DB. The field that I am most interested in
 has anywhere from 1 word to several paragraphs worth of free text. What I
 would really like to do is pull out phrases like "Joe's coffee shop" rather
 than the 3 individual words. I have tried the KeywordTokenizerFactory and
 that does seem to do what I want it to do, but it is not actually tokenizing
 anything, so it does what I want it to for the most part but it's not
 creating the tokens that I need for further analysis in apps like Mahout.

 We can play with the combination of tokenizers and filters all day long and
 see what the results are after a quick reindex. I typically just view them
 in Solritas as facets, which may be the problem for me too. Does anyone have
 an example fieldType they can share with me that shows how to
 extract phrases, if they are there, from the data I described earlier. Am I
 even going about this the right way? I am using today's trunk build of Solr
 and here is what I have munged together this morning.

 <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
   <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
   <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
   <filter class="solr.EnglishPossessiveFilterFactory"/>
   <filter class="solr.EnglishMinimalStemFilterFactory"/>
   <filter class="solr.ASCIIFoldingFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   <filter class="solr.TrimFilterFactory"/>
  </analyzer>
 </fieldType>

 Thanks,
 Adam



Re: [Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
Erick,

I totally understand that BUT the keyword tokenizer factory does a really
good job extracting phrases (or what look like phrases) from my data. I
don't know why exactly, but it does do it. I am going to continue working
through it to see if I can't figure it out ;-)

Adam

On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson erickerick...@gmail.com wrote:

 The problem here is that none of the built-in filters or tokenizers
 have a prayer
 of recognizing what #you# think are phrases, since it'll be unique to
 your situation.

 If you have a list of phrases you care about, you could substitute a
 single token
 for the phrases you care about...

 But the overriding question is what determines a phrase you're
 interested in? Is it
 a list or is there some heuristic you want to apply?

 Or could you just recognize them at query time and make them into a
 literal phrase
 (i.e. with quotationmarks)?

 Best
 Erick

 On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
 estrada.adam.gro...@gmail.com wrote:
  All,
 
  I am at a bit of a loss here so any help would be greatly appreciated. I
 am
  using the DIH to grab data from a DB. The field that I am most interested
 in
  has anywhere from 1 word to several paragraphs worth of free text. What I
  would really like to do is pull out phrases like Joe's coffee shop
 rather
  than the 3 individual words. I have tried the KeywordTokenizerFactory and
  that does seem to do what I want it to do but it is not actually
 tokenizing
  anything so it does what I want it to for the most part but it's not
  creating the tokens that I need for further analysis in apps like Mahout.
 
  We can play with the combination of tokenizers and filters all day long
 and
  see what the results are after a quick reindex. I typlically just view
 them
  in Solitas as facets which may be the problem for me too. Does anyone
 have
  an example fieldType they can share with me that shows how to
  extract phrases if they are there from the data I described earlier. Am I
  even going about this the right way? I am using today's trunk build of
 Solr
  and here is what I have munged together this morning.
 
  fieldType name=text_ws class=solr.TextField
 positionIncrementGap=100
  autoGeneratePhraseQueries=true
   analyzer 
   charFilter class=solr.HTMLStripCharFilterFactory/
   charFilter class=solr.MappingCharFilterFactory
  mapping=mapping-ISOLatin1Accent.txt/
   tokenizer class=solr.KeywordTokenizerFactory/
   filter class=solr.StopFilterFactory ignoreCase=true
  words=stopwords.txt enablePositionIncrements=true/
   filter class=solr.ShingleFilterFactory maxShingleSize=4
  outputUnigrams=true outputUnigramIfNoNgram=false/
   filter class=solr.KeywordMarkerFilterFactory
 protected=protwords.txt/
   filter class=solr.EnglishPossessiveFilterFactory/
   filter class=solr.EnglishMinimalStemFilterFactory/
   filter class=solr.ASCIIFoldingFilterFactory/
   filter class=solr.RemoveDuplicatesTokenFilterFactory/
   filter class=solr.TrimFilterFactory/
   /analyzer
  /fieldType
 
  Thanks,
  Adam
 



Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The KeywordTokenizer doesn't do anything to break up the input stream,
it just treats the whole input to the field as a single token. So I don't think
you'll be able to extract anything starting with that tokenizer.
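
For illustration, a quick sketch of what that tokenizer produces (Lucene 3.x-era
classes; package names and the exact TokenStream workflow vary between versions,
so treat this as an assumption-laden example rather than copy-paste code):

import java.io.StringReader;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordTokenDemo {
  public static void main(String[] args) throws Exception {
    KeywordAnalyzer analyzer = new KeywordAnalyzer(); // wraps KeywordTokenizer
    TokenStream ts = analyzer.tokenStream("f", new StringReader("Joe's coffee shop on Elm"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // prints exactly one token: "Joe's coffee shop on Elm"
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}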

Look at the admin/analysis page to see a step-by-step breakdown of what
your analyzer chain does. Be sure to check the "verbose" checkbox.

Best
Erick

On Thu, Jun 9, 2011 at 12:35 PM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
 Erick,

 I totally understand that BUT the keyword tokenizer factory does a really
 good job extracting phrases (or what look like phrases from) from my data. I
 don't know why exactly but it does do it. I am going to continue working
 through it to see if I can't figure it out ;-)

 Adam

 On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson 
 erickerick...@gmail.com wrote:

 The problem here is that none of the built-in filters or tokenizers
 have a prayer
 of recognizing what #you# think are phrases, since it'll be unique to
 your situation.

 If you have a list of phrases you care about, you could substitute a
 single token
 for the phrases you care about...

 But the overriding question is what determines a phrase you're
 interested in? Is it
 a list or is there some heuristic you want to apply?

 Or could you just recognize them at query time and make them into a
 literal phrase
 (i.e. with quotationmarks)?

 Best
 Erick

 On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
 estrada.adam.gro...@gmail.com wrote:
  All,
 
  I am at a bit of a loss here so any help would be greatly appreciated. I
 am
  using the DIH to grab data from a DB. The field that I am most interested
 in
  has anywhere from 1 word to several paragraphs worth of free text. What I
  would really like to do is pull out phrases like Joe's coffee shop
 rather
  than the 3 individual words. I have tried the KeywordTokenizerFactory and
  that does seem to do what I want it to do but it is not actually
 tokenizing
  anything so it does what I want it to for the most part but it's not
  creating the tokens that I need for further analysis in apps like Mahout.
 
  We can play with the combination of tokenizers and filters all day long
 and
  see what the results are after a quick reindex. I typlically just view
 them
  in Solitas as facets which may be the problem for me too. Does anyone
 have
  an example fieldType they can share with me that shows how to
  extract phrases if they are there from the data I described earlier. Am I
  even going about this the right way? I am using today's trunk build of
 Solr
  and here is what I have munged together this morning.
 
  fieldType name=text_ws class=solr.TextField
 positionIncrementGap=100
  autoGeneratePhraseQueries=true
   analyzer 
   charFilter class=solr.HTMLStripCharFilterFactory/
   charFilter class=solr.MappingCharFilterFactory
  mapping=mapping-ISOLatin1Accent.txt/
   tokenizer class=solr.KeywordTokenizerFactory/
   filter class=solr.StopFilterFactory ignoreCase=true
  words=stopwords.txt enablePositionIncrements=true/
   filter class=solr.ShingleFilterFactory maxShingleSize=4
  outputUnigrams=true outputUnigramIfNoNgram=false/
   filter class=solr.KeywordMarkerFilterFactory
 protected=protwords.txt/
   filter class=solr.EnglishPossessiveFilterFactory/
   filter class=solr.EnglishMinimalStemFilterFactory/
   filter class=solr.ASCIIFoldingFilterFactory/
   filter class=solr.RemoveDuplicatesTokenFilterFactory/
   filter class=solr.TrimFilterFactory/
   /analyzer
  /fieldType
 
  Thanks,
  Adam
 




RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello Erick,

Thanks for the response. No, I am using the extract handler to extract the data 
from my text files. In your second approach, you say I could use a DIH to 
update the index which would have been created by the extract handler in the 
first phase. My concern is that, let's say I get info from the DB and update the 
index with the document ID, won't I overwrite the data and lose the initial data 
from the extract handler phase? Thanks

Greg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 9 juin 2011 12:15
To: solr-user@lucene.apache.org
Subject: Re: Indexing data from multiple datasources

Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point
and add whatever information
you need to the document.

If you're using DIH to do the crawl, consider implementing a
Transformer to do the database
querying and modify the document as necessary This is pretty
simple to do, we can
chat a bit more depending on whether either approach makes sense.

Best
Erick



On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote:
 Hello all,

 I have checked the forums to see if it is possible to create and index from 
 multiple datasources. I have found references to SOLR-1358, but I don't think 
 this fits my scenario. In all, we have an application where we upload files. 
 On the file upload, I use the Tika extract handler to save metadata from the 
 file (_attr, literal values, etc..). We also have a database which has 
 information on the uploaded files, like the category, type, etc.. I would 
 like to update the index to include this information from the db in the index 
 for each document. If I run a dataimporthandler after the extract phase I am 
 afraid that by updating the doc in the index by its id will just cause that I 
 overwrite the old information with the info from the DB (what I understand is 
 that Solr updates its index by ID by deleting first then recreating the info).

 Anyone have any pointers, is there a clean way to do this, or must I find a 
 way to pass the db metadata to the extract handler and save it as literal 
 fields?

 Thanks in advance

 Greg



Re: Indexing data from multiple datasources

2011-06-09 Thread Erick Erickson
How are you using it? Streaming the files to Solr via HTTP? You can use Tika
on the client to extract the various bits from the structured documents, and
use SolrJ to assemble various bits of that data Tika exposes into a
Solr document
that you then send to Solr. At the point you're transferring data from the
Tika parse to the Solr document, you could add any data from your database that
you wanted.

The result is that you'd be indexing the complete Solr document only once.
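
For example, a rough sketch of that flow (assumes Tika and SolrJ jars on the
classpath; the class name, field names, and database helper are illustrative,
not from any particular example):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class IndexWithDbMetadata {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    File file = new File(args[0]);
    Metadata metadata = new Metadata();
    BodyContentHandler text = new BodyContentHandler(-1); // no write limit
    InputStream in = new FileInputStream(file);
    try {
      // parse the file on the client with Tika
      new AutoDetectParser().parse(in, text, metadata, new ParseContext());
    } finally {
      in.close();
    }

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", file.getName());
    doc.addField("title", metadata.get("title"));
    doc.addField("content", text.toString());
    // hypothetical helper: fetch category/type for this file from your database
    doc.addField("category", lookupCategoryFromDb(file.getName()));

    solr.add(doc);   // one complete document, sent once
    solr.commit();
  }

  private static String lookupCategoryFromDb(String id) {
    return "uncategorized"; // placeholder for a real JDBC lookup
  }
}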

You're right that updating a document in Solr overwrites the previous version,
and any data in the previous version is lost.

Best
Erick

On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges greg.geor...@biztree.com wrote:
 Hello Erick,

 Thanks for the response. No, I am using the extract handler to extract the 
 data from my text files. In your second approach, you say I could use a DIH 
 to update the index which would have been created by the extract handler in 
 the first phase. I thought that lets say I get info from the DB and update 
 the index with the document ID, will I overwrite the data and lose the 
 initial data from the extract handler phase? Thanks

 Greg

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 9 juin 2011 12:15
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing data from multiple datasources

 Hmmm, when you say you use Tika, are you using some custom Java code? Because
 if you are, the best thing to do is query your database at that point
 and add whatever information
 you need to the document.

 If you're using DIH to do the crawl, consider implementing a
 Transformer to do the database
 querying and modify the document as necessary This is pretty
 simple to do, we can
 chat a bit more depending on whether either approach makes sense.

 Best
 Erick



 On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com 
 wrote:
 Hello all,

 I have checked the forums to see if it is possible to create and index from 
 multiple datasources. I have found references to SOLR-1358, but I don't 
 think this fits my scenario. In all, we have an application where we upload 
 files. On the file upload, I use the Tika extract handler to save metadata 
 from the file (_attr, literal values, etc..). We also have a database which 
 has information on the uploaded files, like the category, type, etc.. I 
 would like to update the index to include this information from the db in 
 the index for each document. If I run a dataimporthandler after the extract 
 phase I am afraid that by updating the doc in the index by its id will just 
 cause that I overwrite the old information with the info from the DB (what I 
 understand is that Solr updates its index by ID by deleting first then 
 recreating the info).

 Anyone have any pointers, is there a clean way to do this, or must I find a 
 way to pass the db metadata to the extract handler and save it as literal 
 fields?

 Thanks in advance

 Greg




RE: Indexing data from multiple datasources

2011-06-09 Thread David Ross

This thread got me thinking a bit...
Does SOLR support the concept of partial updates to documents?  By this I 
mean updating a subset of fields in a document that already exists in the 
index, and without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...

 Date: Thu, 9 Jun 2011 14:00:43 -0400
 Subject: Re: Indexing data from multiple datasources
 From: erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 
 How are you using it? Streaming the files to Solr via HTTP? You can use Tika
 on the client to extract the various bits from the structured documents, and
 use SolrJ to assemble various bits of that data Tika exposes into a
 Solr document
 that you then send to Solr. At the point you're transferring data from the
 Tika parse to the Solr document, you could add any data from your database 
 that
 you wanted.
 
 The result is that you'd be indexing the complete Solr document only once.
 
 You're right that updating a document in Solr overwrites the previous
 version and any
 data in the previous version is lost
 
 Best
 Erick
 
 On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges greg.geor...@biztree.com wrote:
  Hello Erick,
 
  Thanks for the response. No, I am using the extract handler to extract the 
  data from my text files. In your second approach, you say I could use a DIH 
  to update the index which would have been created by the extract handler in 
  the first phase. I thought that lets say I get info from the DB and update 
  the index with the document ID, will I overwrite the data and lose the 
  initial data from the extract handler phase? Thanks
 
  Greg
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: 9 juin 2011 12:15
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing data from multiple datasources
 
  Hmmm, when you say you use Tika, are you using some custom Java code? 
  Because
  if you are, the best thing to do is query your database at that point
  and add whatever information
  you need to the document.
 
  If you're using DIH to do the crawl, consider implementing a
  Transformer to do the database
  querying and modify the document as necessary This is pretty
  simple to do, we can
  chat a bit more depending on whether either approach makes sense.
 
  Best
  Erick
 
 
 
  On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com 
  wrote:
  Hello all,
 
  I have checked the forums to see if it is possible to create and index 
  from multiple datasources. I have found references to SOLR-1358, but I 
  don't think this fits my scenario. In all, we have an application where we 
  upload files. On the file upload, I use the Tika extract handler to save 
  metadata from the file (_attr, literal values, etc..). We also have a 
  database which has information on the uploaded files, like the category, 
  type, etc.. I would like to update the index to include this information 
  from the db in the index for each document. If I run a dataimporthandler 
  after the extract phase I am afraid that by updating the doc in the index 
  by its id will just cause that I overwrite the old information with the 
  info from the DB (what I understand is that Solr updates its index by ID 
  by deleting first then recreating the info).
 
  Anyone have any pointers, is there a clean way to do this, or must I find 
  a way to pass the db metadata to the extract handler and save it as 
  literal fields?
 
  Thanks in advance
 
  Greg
 
 
  

RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
No, from what I understand, the way Solr does an update is to delete the 
document and then recreate all the fields; there is no partial updating of the 
document - maybe because of performance issues or locking?

-Original Message-
From: David Ross [mailto:davidtr...@hotmail.com] 
Sent: 9 juin 2011 15:23
To: solr-user@lucene.apache.org
Subject: RE: Indexing data from multiple datasources


This thread got me thinking a bit...
Does SOLR support the concept of partial updates to documents?  By this I 
mean updating a subset of fields in a document that already exists in the 
index, and without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...

 Date: Thu, 9 Jun 2011 14:00:43 -0400
 Subject: Re: Indexing data from multiple datasources
 From: erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 
 How are you using it? Streaming the files to Solr via HTTP? You can use Tika
 on the client to extract the various bits from the structured documents, and
 use SolrJ to assemble various bits of that data Tika exposes into a
 Solr document
 that you then send to Solr. At the point you're transferring data from the
 Tika parse to the Solr document, you could add any data from your database 
 that
 you wanted.
 
 The result is that you'd be indexing the complete Solr document only once.
 
 You're right that updating a document in Solr overwrites the previous
 version and any
 data in the previous version is lost
 
 Best
 Erick
 
 On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges greg.geor...@biztree.com wrote:
  Hello Erick,
 
  Thanks for the response. No, I am using the extract handler to extract the 
  data from my text files. In your second approach, you say I could use a DIH 
  to update the index which would have been created by the extract handler in 
  the first phase. I thought that lets say I get info from the DB and update 
  the index with the document ID, will I overwrite the data and lose the 
  initial data from the extract handler phase? Thanks
 
  Greg
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: 9 juin 2011 12:15
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing data from multiple datasources
 
  Hmmm, when you say you use Tika, are you using some custom Java code? 
  Because
  if you are, the best thing to do is query your database at that point
  and add whatever information
  you need to the document.
 
  If you're using DIH to do the crawl, consider implementing a
  Transformer to do the database
  querying and modify the document as necessary This is pretty
  simple to do, we can
  chat a bit more depending on whether either approach makes sense.
 
  Best
  Erick
 
 
 
  On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com 
  wrote:
  Hello all,
 
  I have checked the forums to see if it is possible to create and index 
  from multiple datasources. I have found references to SOLR-1358, but I 
  don't think this fits my scenario. In all, we have an application where we 
  upload files. On the file upload, I use the Tika extract handler to save 
  metadata from the file (_attr, literal values, etc..). We also have a 
  database which has information on the uploaded files, like the category, 
  type, etc.. I would like to update the index to include this information 
  from the db in the index for each document. If I run a dataimporthandler 
  after the extract phase I am afraid that by updating the doc in the index 
  by its id will just cause that I overwrite the old information with the 
  info from the DB (what I understand is that Solr updates its index by ID 
  by deleting first then recreating the info).
 
  Anyone have any pointers, is there a clean way to do this, or must I find 
  a way to pass the db metadata to the extract handler and save it as 
  literal fields?
 
  Thanks in advance
 
  Greg
 
 
  


Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

there seems to be no way to index CSV using the DataImportHandler.

Using a combination of LineEntityProcessor
(http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor) and
RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer),
as proposed in
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/,
is not working for real-world CSV files.

E.g. many CSV files have double-quotes enclosing some but not all columns -
there is no elegant way to segment this using a simple regular expression.

As CSV is still very common esp. in E-Commerce scenarios, I propose that
Solr provides a CSVEntityProcessor that:
1) Handles the case of CSV files with/without and with some double-quote
enclosed columns
2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings provides a possibility to
address columns by their column names and map them to Solr fields (similar
to the XPathEntityProcessor)
5) Auto-detects encoding of the file (UTF-8 etc.)

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If there is no such entity processor in the works I will develop one ... So
please let me know.

Regards


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

to make my point more clear: if the CSV has a fixed schema / column layout,
using the RegexTransformer is of course a possibility (however awkward). But
if you want to implement a (more or less) schema free shopping search engine
...

regards

On Thu, Jun 9, 2011 at 9:31 PM, Helmut Hoffer von Ankershoffen 
helmut...@googlemail.com wrote:

 Hi,

 there seems to be no way to index CSV using the DataImportHandler.

 Using a combination of 
 LineEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
  and 
 RegexTransformerhttp://wiki.apache.org/solr/DataImportHandler#RegexTransformer
  as
 proposed in
 http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
  not working for real world CSV files.

 E.g. many CSV files have double-quotes enclosing some but not all columns -
 there is no elegant way to segment this using a simple regular expression.

 As CSV is still very common esp. in E-Commerce scenarios, I propose that
 Solr provides a CSVEntityProcessor that:
 1) Handles the case of CSV files with/without and with some double-quote
 enclosed columns
 2) Allows for a configurable column separator (';',',','\t' etc.)
 3) Allows for a leading row containing column headings
 4) If there is a leading row with column headings provides a possibility to
 address columns by their column names and map them to Solr fields (similar
 to the XPathEntityProcessor)
 5) Auto-detects encoding of the file (UTF-8 etc.)

 This would make it A LOT easier to use Solr for E-Commerce scenarios.

 If there is no such entity processor in the works i will develop one ... So
 please let me know.

 Regards



Unique Results from Edgy Text

2011-06-09 Thread Jamie Johnson
I am using the guide found here (
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/)
to build an autocomplete search capability but in my data set I have some
documents which have the same value for the field that is being returned, so
for instance I have the following being returned:

A test document to see how this works
 A test document to see how this works
 A test document to see how this works
A test document to see how this works
 A test document to see how this works

I'm wondering if there is something I can specify that I want only unique
results to come back.  I know I can do some post processing of the results
to make sure that only unique items come back, but I was hoping there was
something that could be done to the query.  Any thoughts?


RE: Processing/Indexing CSV

2011-06-09 Thread Dyer, James
Helmut,

I recently submitted SOLR-2549 
(https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width 
and delimited flat files.  To be honest, I only needed fixed-width support for 
my app so this might not support everything you mention for delimited files, 
but it should be a good start.  

In particular, you might need to enhance this to handle the double quotes (I 
had thought a delimiter regex along these lines might handle it:  
(?:[\]?[,]|[\]$)  ... note this is a sample I just cooked up quick and no 
doubt has errors, and maybe as you say a simple regex might not work at all ) 
... I also didn't do anything with encodings but I'm not sure this will be an 
issue either...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com] 
Sent: Thursday, June 09, 2011 2:32 PM
To: solr-user@lucene.apache.org
Subject: Processing/Indexing CSV

Hi,

there seems to be no way to index CSV using the DataImportHandler.

Using a combination of
LineEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
 and 
RegexTransformerhttp://wiki.apache.org/solr/DataImportHandler#RegexTransformer
as
proposed in
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
not working for real world CSV files.

E.g. many CSV files have double-quotes enclosing some but not all columns -
there is no elegant way to segment this using a simple regular expression.

As CSV is still very common esp. in E-Commerce scenarios, I propose that
Solr provides a CSVEntityProcessor that:
1) Handles the case of CSV files with/without and with some double-quote
enclosed columns
2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings provides a possibility to
address columns by their column names and map them to Solr fields (similar
to the XPathEntityProcessor)
5) Auto-detects encoding of the file (UTF-8 etc.)

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If there is no such entity processor in the works i will develop one ... So
please let me know.

Regards


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
Ludovic,

 how do you index your html files ? I mean do you create fields for
different
parts of your document (for different stop words lists, stemming, etc) ?
with DIH or solrj or something else ?  

We are sending them over http, and using Tika to strip the HTML, at
present.

We do not split the document itself into separate fields, but what we
index includes a bunch of metadata that has been extracted by processes
earlier in the pipeline. These fields don't enter into the
HTML-hit-highlighting question.

 I developed this week a new highlighter module which transfers the
fields
highlighting to the original document (xml in my case) (I use payloads to
store offsets and lengths of fields in the index). This way, I use the
good
analyzers to do the highlighting correctly and then, I replace the
different
field parts in the document by the highlighted parts. It is not finished
yet, but I already have some good results. 

Yes, I have been thinking along very similar lines. If you arrive at
something you're happy with, I encourage you to share it.

 This is a client request too. Let me know if the iorixxx's solution is
not enought for your particular use case.

I'm enough of a Solr newb that I'll need to study his suggestion for a
bit, to figure out what it does and does not do. When I've done so, I'll
respond to his message.

Thanks,

-- Bryan


Re: Processing/Indexing CSV

2011-06-09 Thread Yonik Seeley
On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen
helmut...@googlemail.com wrote:
 Hi,

 there seems to be no way to index CSV using the DataImportHandler.

Looking over the features you want, it looks like you're starting from
a CSV file (as opposed to CSV stored in a database).
Is there a reason that you need to use DIH and can't directly use the
CSV loader?
http://wiki.apache.org/solr/UpdateCSV


-Yonik
http://www.lucidimagination.com



 Using a combination of
 LineEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
  and 
 RegexTransformerhttp://wiki.apache.org/solr/DataImportHandler#RegexTransformer
 as
 proposed in
 http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
 not working for real world CSV files.

 E.g. many CSV files have double-quotes enclosing some but not all columns -
 there is no elegant way to segment this using a simple regular expression.

 As CSV is still very common esp. in E-Commerce scenarios, I propose that
 Solr provides a CSVEntityProcessor that:
 1) Handles the case of CSV files with/without and with some double-quote
 enclosed columns
 2) Allows for a configurable column separator (';',',','\t' etc.)
 3) Allows for a leading row containing column headings
 4) If there is a leading row with column headings provides a possibility to
 address columns by their column names and map them to Solr fields (similar
 to the XPathEntityProcessor)
 5) Auto-detects encoding of the file (UTF-8 etc.)

 This would make it A LOT easier to use Solr for E-Commerce scenarios.

 If there is no such entity processor in the works i will develop one ... So
 please let me know.

 Regards



RE: Displaying highlights in formatted HTML document

2011-06-09 Thread lboutros
I am not (yet) a Tika user; perhaps iorixxx's solution is good for you.

We will share the highlighter module and 2 other developments soon. (I'll have
to see how to do that.)

Ludovic. 

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Displaying-highlights-in-formatted-HTML-document-tp3041909p3045654.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

just looked at your code. Definitely an improvement :-)

The problem with the double-quotes is that the delimiter (let's say ',')
might be part of the column value. The goal is to process something like
this without any tricky configuration:

name1,name2,name3
val1,val2,...,val3
...
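
For what it's worth, the quote-aware splitting such a processor needs is a small
state machine rather than a single regex; a rough sketch (illustrative code, not
an existing Solr class):

import java.util.ArrayList;
import java.util.List;

public class SimpleCsvSplitter {
  // Split one CSV line, honoring double-quoted values that may contain the separator.
  public static List<String> split(String line, char separator) {
    List<String> fields = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '"') {
        if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
          current.append('"'); // escaped quote inside a quoted value
          i++;
        } else {
          inQuotes = !inQuotes; // enter or leave a quoted value
        }
      } else if (c == separator && !inQuotes) {
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString());
    return fields;
  }

  public static void main(String[] args) {
    // prints [val1, val2,with,commas, val3] - the quoted value stays one field
    System.out.println(split("val1,\"val2,with,commas\",val3", ','));
  }
}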

The user should not have to provide and before-hand knowledge regarding the
column layout or the encoding of the CSV file. Ideally the only thing that
has to be specified is firstLineHasFieldnames=true separator=;.
Autodetecting the separator and encoding would be even more elegant.

If nobody else has this in the works I will start building such a patch next
week.

Best Regards


On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James james.d...@ingrambook.com wrote:

 Helmut,

 I recently submitted SOLR-2549 (
 https://issues.apache.org/jira/browse/SOLR-2549) to handle both
 fixed-width and delimited flat files.  To be honest, I only needed
 fixed-width support for my app so this might not support everything you
 mention for delimited files, but it should be a good start.

 In particular, you might need to enhance this to handle the double quotes
 (I had though a delimiter regex along these lines might handle it:
  (?:[\]?[,]|[\]$)  ... note this is a sample I just cooked up quick and no
 doubt has errors, and maybe as you say a simple regex might not work at all
 ) ... I also didn't do anything with encodings but I'm not sure this will be
 an issue either...

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311

 -Original Message-
 From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
 Sent: Thursday, June 09, 2011 2:32 PM
 To: solr-user@lucene.apache.org
 Subject: Processing/Indexing CSV

 Hi,

 there seems to be no way to index CSV using the DataImportHandler.

 Using a combination of
 LineEntityProcessor
 http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
  and RegexTransformer
 http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
 as
 proposed in

 http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
 not working for real world CSV files.

 E.g. many CSV files have double-quotes enclosing some but not all columns -
 there is no elegant way to segment this using a simple regular expression.

 As CSV is still very common esp. in E-Commerce scenarios, I propose that
 Solr provides a CSVEntityProcessor that:
 1) Handles the case of CSV files with/without and with some double-quote
 enclosed columns
 2) Allows for a configurable column separator (';',',','\t' etc.)
 3) Allows for a leading row containing column headings
 4) If there is a leading row with column headings provides a possibility to
 address columns by their column names and map them to Solr fields (similar
 to the XPathEntityProcessor)
 5) Auto-detects encoding of the file (UTF-8 etc.)

 This would make it A LOT easier to use Solr for E-Commerce scenarios.

 If there is no such entity processor in the works i will develop one ... So
 please let me know.

 Regards



Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
s/provide and/provide any/ig ,-)

On Thu, Jun 9, 2011 at 10:01 PM, Helmut Hoffer von Ankershoffen 
helmut...@googlemail.com wrote:

 Hi,

 just looked at your code. Definitely an improvement :-)

 The problem with the double-quotes is, that the delimiter (let's say ',')
 might be part of the column value. The goal is to process something like
 this without any tricky configuration

 name1,name2,name3
 val1,val2,...,val3
 ...

 The user should not have to provide and before-hand knowledge regarding the
 column layout or the encoding of the CSV file. Ideally the only thing that
 has to be specified is firstLineHasFieldnames=true separator=;.
 Autodetecting the separator and encoding would be even more elegant.

 If nobody else has this in the works I will start building such a patch
 next week.

 Best Regards


  On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James james.d...@ingrambook.com wrote:

 Helmut,

 I recently submitted SOLR-2549 (
 https://issues.apache.org/jira/browse/SOLR-2549) to handle both
 fixed-width and delimited flat files.  To be honest, I only needed
 fixed-width support for my app so this might not support everything you
 mention for delimited files, but it should be a good start.

 In particular, you might need to enhance this to handle the double quotes
 (I had though a delimiter regex along these lines might handle it:
  (?:[\]?[,]|[\]$)  ... note this is a sample I just cooked up quick and no
 doubt has errors, and maybe as you say a simple regex might not work at all
 ) ... I also didn't do anything with encodings but I'm not sure this will be
 an issue either...

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311

 -Original Message-
 From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
 Sent: Thursday, June 09, 2011 2:32 PM
 To: solr-user@lucene.apache.org
 Subject: Processing/Indexing CSV

 Hi,

 there seems to be no way to index CSV using the DataImportHandler.

 Using a combination of
 LineEntityProcessor
 http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
  and RegexTransformer
 http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
 as
 proposed in

 http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
 not working for real world CSV files.

 E.g. many CSV files have double-quotes enclosing some but not all columns
 -
 there is no elegant way to segment this using a simple regular expression.

 As CSV is still very common esp. in E-Commerce scenarios, I propose that
 Solr provides a CSVEntityProcessor that:
 1) Handles the case of CSV files with/without and with some double-quote
 enclosed columns
 2) Allows for a configurable column separator (';',',','\t' etc.)
 3) Allows for a leading row containing column headings
 4) If there is a leading row with column headings provides a possibility
 to
 address columns by their column names and map them to Solr fields (similar
 to the XPathEntityProcessor)
 5) Auto-detects encoding of the file (UTF-8 etc.)

 This would make it A LOT easier to use Solr for E-Commerce scenarios.

 If there is no such entity processor in the works i will develop one ...
 So
 please let me know.

 Regards





RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
 -Original Message-
 From: Ahmet Arslan [mailto:iori...@yahoo.com]
 Sent: Wednesday, June 08, 2011 11:56 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Displaying highlights in formatted HTML document



 --- On Thu, 6/9/11, Bryan Loofbourrow bloofbour...@knowledgemosaic.com
 wrote:

  From: Bryan Loofbourrow bloofbour...@knowledgemosaic.com
  Subject: Displaying highlights in formatted HTML document
  To: solr-user@lucene.apache.org
  Date: Thursday, June 9, 2011, 2:14 AM
  Here is my use case:
 
 
 
  I have a large number of HTML documents, sizes in the
  0.5K-50M range, most
  around, say, 10M.
 
 
 
  I want to be able to present the user with the formatted
  HTML document, with
  the hits tagged, so that he may iterate through them, and
  see them in the
  context of the document, with the document looking as it
  would be presented
  by a browser; that is, fully formatted, with its tables and
  italics and font
  sizes and all.
 
 
 
  This is something that the user would explicitly request
  from within a set
  of search results, not something I'd expect to have
  returned from an initial
  search - the initial search merely returns the snippets
  around the hits. But
  if the user wants to dive into one of the returned results
  and see them in
  context, I need to be able to go get that.
 
 
 
  We are currently solving this problem by using an entirely
  separate search
  engine (dtSearch), which performs the tagging of the hits
  in the HTML just
  fine. But the solution is unsatisfactory because there are
  Solr searches
  that dtSearch's capabilities cannot reasonably match.
 
 
 
  Can anyone suggest a good way to use Solr/Lucene for this
  instead? I'm
  thinking a separate core for this purpose might make sense,
  so as not to
  burden the primary search core with the full contents of
  the document. But
  after that, I'm stuck. How can I get Solr to express the
  highlighting in the
  context of the formatted HTML document?
 
 
 
  If Solr does not do this currently, and anyone can suggest
  ways to add the
  feature, any tips on how this might best be incorporated
  into the
  implementation would be welcome.

 I am doing the same thing (solr trunk) using the following field type:

 <fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
   <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
   <charFilter class="solr.HTMLStripCharFilterFactory" mapping="mappings.txt"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.TurkishLowerCaseFilterFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
   <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.TurkishLowerCaseFilterFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
 </fieldType>

 In your separate core - which is queried when the user wants to dive
 into one of the returned results - feed your HTML files into this
 field.

 You may want to increase max analyzed chars too.
 <int name="hl.maxAnalyzedChars">147483647</int>

OK, I think I see what you're up to. Might be pretty viable for me as well.
Can you talk about anything in your mappings.txt files that is an
important part of the solution?

Also, isn't there another piece? Don't you need to force it to return the
whole document, rather than its usual context chunks? Or are you somehow
able to map the returned chunks into the separately-stored documents?

We have another requirement I forgot to mention, about wanting to
associate a sequence number with each hit, but I imagine I can deal with
that by putting some sort of identifiable char sequence in a custom prefix
for the highlighting, then replacing that with a sequence number in
postprocessing.

I'm also wondering about the performance of this approach with large
documents, vs. something like what Ludovic is talking about, where you
would just get positions back from Solr, and fetch the document separately
from a filestore.

-- Bryan


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

yes, it's about CSV files loaded via HTTP from shops to be fed into a
shopping search engine.

The CSV Loader cannot map fields (only field values) etc. DIH is flexible
enough for building the importing part of such a thing but misses elegant
handling of CSV data ...

Regards

On Thu, Jun 9, 2011 at 9:50 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen
 helmut...@googlemail.com wrote:
  Hi,
 
  there seems to be no way to index CSV using the DataImportHandler.

 Looking over the features you want, it looks like you're starting from
 a CSV file (as opposed to CSV stored in a database).
 Is there a reason that you need to use DIH and can't directly use the
 CSV loader?
 http://wiki.apache.org/solr/UpdateCSV


 -Yonik
 http://www.lucidimagination.com



  Using a combination of
  LineEntityProcessor
 http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
   and RegexTransformer
 http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
  as
  proposed in
 
 http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
  not working for real world CSV files.
 
  E.g. many CSV files have double-quotes enclosing some but not all columns
 -
  there is no elegant way to segment this using a simple regular
 expression.
 
  As CSV is still very common esp. in E-Commerce scenarios, I propose that
  Solr provides a CSVEntityProcessor that:
  1) Handles the case of CSV files with/without and with some double-quote
  enclosed columns
  2) Allows for a configurable column separator (';',',','\t' etc.)
  3) Allows for a leading row containing column headings
  4) If there is a leading row with column headings provides a possibility
 to
  address columns by their column names and map them to Solr fields
 (similar
  to the XPathEntityProcessor)
  5) Auto-detects encoding of the file (UTF-8 etc.)
 
  This would make it A LOT easier to use Solr for E-Commerce scenarios.
 
  If there is no such entity processor in the works i will develop one ...
 So
  please let me know.
 
  Regards
 



Re: Processing/Indexing CSV

2011-06-09 Thread Yonik Seeley
On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
helmut...@googlemail.com wrote:
 Hi,
 yes, it's about CSV files loaded via HTTP from shops to be fed into a
 shopping search engine.
 The CSV Loader cannot map fields (only field values) etc.

You can provide your own list of fieldnames and optionally ignore the
first line of the CSV file (assuming it contains the field names).
http://wiki.apache.org/solr/UpdateCSV#fieldnames
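
For example, something along these lines should work from SolrJ (a sketch
assuming the 3.x-era ContentStreamUpdateRequest API; the file name and field
list are illustrative):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvUpload {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
    req.addFile(new File("shopfeed.csv"));
    req.setParam("header", "true");                       // file's first line is a header; skip it
    req.setParam("fieldnames", "id,name,price,category"); // map columns to schema fields by position
    req.setParam("separator", ";");
    req.setParam("commit", "true");
    solr.request(req);
  }
}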

-Yonik
http://www.lucidimagination.com


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

... that would be an option if there is a defined set of field names and a
single column/CSV layout. The scenario however is different csv files (from
different shops) with individual column layouts (separators, encodings
etc.). The idea is to map known field names to defined field names in the
solr schema. If I understand the capabilities of the CSVLoader correctly
(sorry, I am completely new to Solr, started work on it today) this is not
possible - is it?

Best Regards


On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
 helmut...@googlemail.com wrote:
  Hi,
  yes, it's about CSV files loaded via HTTP from shops to be fed into a
  shopping search engine.
  The CSV Loader cannot map fields (only field values) etc.

 You can provide your own list of fieldnames and optionally ignore the
 first line of the CSV file (assuming it contains the field names).
 http://wiki.apache.org/solr/UpdateCSV#fieldnames

 -Yonik
 http://www.lucidimagination.com



RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan
 OK, I think see what you're up to. Might be pretty viable
 for me as well.
 Can you talk about anything in your mappings.txt files that
 is an
 important part of the solution?

It is not important. I just copied it. Plus, the HTML strip char filter does not 
have a mapping parameter. It was a copy-paste mistake.
 
 Also, isn't there another piece? Don't you need to force it
 to return the
 whole document, rather than its usual context chunks? 

Yes, you are right. hl.fragsize=0 is needed.
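
For example, a SolrJ query sketch combining that with the other parameters
mentioned in this thread (field name and markup are illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class WholeDocHighlightQuery {
  public static SolrQuery build(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.setHighlight(true);
    q.set("hl.fl", "HTMLText");                  // the field holding the raw HTML
    q.set("hl.fragsize", "0");                   // 0 = return the whole field, not snippets
    q.set("hl.maxAnalyzedChars", "2147483647");  // analyze the full document
    q.set("hl.simple.pre", "<em class=\"hit\">");
    q.set("hl.simple.post", "</em>");
    return q;
  }
}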

 We have another requirement I forgot to mention, about
 wanting to
 associate a sequence number with each hit, but I imagine I
 can deal with
 that by putting some sort of identifiable char sequence in
 a custom prefix
 for the highlighting, then replacing that with a sequence
 number in
 postprocessing.
 
 I'm also wondering about the performance of this approach
 with large
 documents, vs. something like what Ludovic is talking
 about, where you
 would just get positions back from Solr, and fetch the
 document separately
 from a filestore.

Highlighting large documents takes time. Storing termVectors can be used to 
speed it up. I don't know the answer to the performance comparison. Perhaps someone 
familiar with highlighting can answer this. 



RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
  OK, I think see what you're up to. Might be pretty viable
  for me as well.
  Can you talk about anything in your mappings.txt files that
  is an
  important part of the solution?

 It is not important. I just copied it. Plus html strip char filter does
 not have mappings parameter. It was a copy paste mistake.

Yes, I asked the wrong question. What I was subconsciously getting at is
this: how are you avoiding the possibility of getting hits in the HTML
elements? Is that accomplished by putting tag names in your stopwords, or
by some other mechanism?

-- Bryan


RE: solr Invalid Date in Date Math String/Invalid Date String

2011-06-09 Thread Chris Hostetter

: Here is the error message:
: 
: Fieldtype: tdate (I use the default one in solr schema.xml)
: Field value(Index): 2006-12-22T13:52:13Z
: Field value(query): [2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z]   
: with '[' and ']'
: 
: And it generates the result below:

i think the piece of info people were overlooking here is that you are 
describing input to the analysis.jsp page.

you can't enter arbitrary query expressions on this page -- just *values* 
for the analyzer of the specified field (or field type)

DateField doesn't know anything about the [... TO ...] syntax -- that is 
syntax of the query parser.

all the DateField knows is that what you have entered into the Field 
Value text box is not a date value, and it is not a date math value 
either.



-Hoss


Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler

On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:

 Hi,
 
 ... that would be an option if there is a defined set of field names and a
 single column/CSV layout. The scenario however is different csv files (from
 different shops) with individual column layouts (separators, encodings
 etc.). The idea is to map known field names to defined field names in the
 solr schema. If I understand the capabilities of the CSVLoader correctly
 (sorry, I am completely new to Solr, started work on it today) this is not
 possible - is it?

As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, 
you can specify the names/positions of fields in the CSV file, and ignore 
fieldnames.

So this seems like it would solve your requirement, as each different layout 
could specify its own such mapping during import.

It could be handy to provide a fieldname map (versus the value map that 
UpdateCSV supports). Then you could use the header, and just provide a mapping 
from header fieldnames to schema fieldnames.

-- Ken
 
 On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley 
 yo...@lucidimagination.comwrote:
 
 On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
 helmut...@googlemail.com wrote:
 Hi,
 yes, it's about CSV files loaded via HTTP from shops to be fed into a
 shopping search engine.
 The CSV Loader cannot map fields (only field values) etc.
 
 You can provide your own list of fieldnames and optionally ignore the
 first line of the CSV file (assuming it contains the field names).
 http://wiki.apache.org/solr/UpdateCSV#fieldnames
 
 -Yonik
 http://www.lucidimagination.com
 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Solr Indexing Patterns

2011-06-09 Thread Judioo
Very informative links and statement, Jonathan. Thank you.



On 6 June 2011 20:55, Jonathan Rochkind rochk...@jhu.edu wrote:

 This is a start, for many common best practices:

 http://wiki.apache.org/solr/SolrRelevancyFAQ

 Many of the questions in there have an answer that involves de-normalizing.
 As an example. It may be that even if your specific problem isn't in there,
  I myself anyway found reading through there gave me a general sense of
 common patterns in Solr.

 ( It's certainly true that some things are hard to do in Solr.  It turns
 out that an RDBMS is a remarkably flexible thing -- but when it doesn't do
 something you need well, and you turn to a specialized tool instead like
 Solr, you certainly give up some things

 One of the biggest areas of limitation involves hieararchical or
 relationship data, definitely. There are a variety of features, some more
 fully baked than others, some not yet in a Solr release, meant to provide
 tools to get at different aspects of this. Including pivot facetting,
  join (https://issues.apache.org/jira/browse/SOLR-2272), and
 field-collapsing.  Each, IMO, is trying to deal with different aspects of
 dealing with hieararchical or multi-class data, or data that is entities
 with relationships. ).


 On 6/6/2011 3:43 PM, Judioo wrote:

 I do think that Solr would be better served if there was a *best practice
 section *of the site.

 Looking at the majority of emails to this list they resolve around how do
 I
 do X?.

 Seems like tutorials with real world examples would serve Solr no end of
 good.

 I still do not have an example of the best method to approach my problem,
 although Erick has  help me understand the limitations of Solr.

 Just thought I'd say.






 On 6 June 2011 20:26, Judioocont...@judioo.com  wrote:

  Thanks


 On 6 June 2011 19:32, Erick Ericksonerickerick...@gmail.com  wrote:

  #Everybody# (including me) who has any RDBMS background
 doesn't want to flatten data, but that's usually the way to go in
 Solr.

 Part of whether it's a good idea or not depends on how big the index
 gets, and unfortunately the only way to figure that out is to test.

 But that's the first approach I'd try.

 Good luck!
 Erick

 On Mon, Jun 6, 2011 at 11:42 AM, Judioocont...@judioo.com  wrote:

 On 5 June 2011 14:42, Erick Ericksonerickerick...@gmail.com  wrote:

  See: http://wiki.apache.org/solr/SchemaXml

 By adding ' multiValued=true ' to the field, you can add
 the same field multiple times in a doc, something like

 <add>
   <doc>
     <field name="mv">value1</field>
     <field name="mv">value2</field>
   </doc>
 </add>

 I can't see how that would work as one would need to associate the right
 start / end dates and price.
 As I understand, using multivalued and thus flattening the discounts would
 result in:

 {
   "name": "The Book",
   "price": "$9.99",
   "price": "$3.00",
   "price": "$4.00",
   "synopsis": "thanksgiving special",
   "starts": "11-24-2011",
   "starts": "10-10-2011",
   "ends": "11-25-2011",
   "ends": "10-11-2011",
   "synopsis": "Canadian thanksgiving special",
 },

 How does one differentiate the different offers?



  But there's no real ability  in Solr to store sub documents,
 so you'd have to get creative in how you encoded the discounts...

  This is what I'm asking :)
 What is the best / recommended / known patterns for doing this?



  But I suspect a better approach would be to store each discount as
 a separate document. If you're in the trunk version, you could then
 group results by, say, ISBN and get responses grouped together...

  This is an option but seems sub optimal. So say I store the discounts in
  multiple documents with ISBN as an attribute, and also store the title again
  with ISBN as an attribute.

  To get
  all books currently discounted
  requires 2 requests:

  * get all discounts currently active
  * get all books using the ISBNs retrieved from the above search

  Not that bad. However, what happens when I want
  all books that are currently on discount in the horror genre containing
  the word 'elm' in the title?

  The only way I can see of catering for the above search is to duplicate all
  searchable fields of my book document in my discount document. Coming
  from an RDBMS background this seems wrong.

  Is this the correct approach to take?



  Best
 Erick

 On Sat, Jun 4, 2011 at 1:42 AM, Judioocont...@judioo.com  wrote:

 Hi,
 Discounts can change daily. Also there can be a lot of them (over time
 and in a given time period).

 Could you give an example of what you mean by multi-valuing the field.

 Thanks

 On 3 June 2011 14:29, Erick Ericksonerickerick...@gmail.com

 wrote:

  How often are the discounts changed? Because you can simply
 re-index the book information with a multiValued discounts field
 and get something similar to your example (wt=json)


 Best
 Erick

 On Fri, Jun 3, 2011 at 8:38 AM, Judioocont...@judioo.com  wrote:

 What is the best practice method to index the following in Solr:

 I'm attempting to use solr for a book store site.

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler kkrugler_li...@transpac.comwrote:


 On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:

  Hi,
 
  ... that would be an option if there is a defined set of field names and
 a
  single column/CSV layout. The scenario however is different csv files
 (from
  different shops) with individual column layouts (separators, encodings
  etc.). The idea is to map known field names to defined field names in the
  solr schema. If I understand the capabilities of the CSVLoader correctly
  (sorry, I am completely new to Solr, started work on it today) this is
 not
  possible - is it?

 As per the documentation on
 http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
 names/positions of fields in the CSV file, and ignore fieldnames.

 So this seems like it would solve your requirement, as each different
 layout could specify its own such mapping during import.

 Sure, but the requirement (to keep the process of integrating new shops
efficient) is not to have one mapping per import (cf. the email regarding
being more or less schema free) but to keep enhancing one mapping that maps
common field names to defined fields, regardless of the order of known
fields/columns. As far as I understand, that is not a problem at all with DIH;
however, DIH and CSV are not a perfect match ,-)


 It could be handy to provide a fieldname map (versus the value map that
 UpdateCSV supports).

Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in DIH
...


 Then you could use the header, and just provide a mapping from header
 fieldnames to schema fieldnames.

That's the idea -)

=> What's the best way to progress? Either someone enhances the CSVLoader with
a field mapper (with multiple input field names mapping to one field name in
the Solr schema) or someone enhances DIH with a robust CSV loader ,-).
As I am completely new to this community, please give me the direction to go
(or wait :-).

best regards


 -- Ken

  On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley 
 yo...@lucidimagination.comwrote:
 
  On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
  helmut...@googlemail.com wrote:
  Hi,
  yes, it's about CSV files loaded via HTTP from shops to be fed into a
  shopping search engine.
  The CSV Loader cannot map fields (only field values) etc.
 
  You can provide your own list of fieldnames and optionally ignore the
  first line of the CSV file (assuming it contains the field names).
  http://wiki.apache.org/solr/UpdateCSV#fieldnames
 
  -Yonik
  http://www.lucidimagination.com
 

 --
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 custom data mining solutions









Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

btw: there seems to be somewhat of a mismatch between the effort to enhance DIH
regarding the CSV format (James Dyer) and the effort to maintain the
CSVLoader (Ken Krugler). How about merging your efforts and migrating the
CSVLoader to a CSVEntityProcessor (cf. my initial email)? :-)

Best Regards

On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen 
helmut...@googlemail.com wrote:



 On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler 
 kkrugler_li...@transpac.comwrote:


 On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:

  Hi,
 
  ... that would be an option if there is a defined set of field names and
 a
  single column/CSV layout. The scenario however is different csv files
 (from
  different shops) with individual column layouts (separators, encodings
  etc.). The idea is to map known field names to defined field names in
 the
  solr schema. If I understand the capabilities of the CSVLoader correctly
  (sorry, I am completely new to Solr, started work on it today) this is
 not
  possible - is it?

 As per the documentation on
 http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
 names/positions of fields in the CSV file, and ignore fieldnames.

 So this seems like it would solve your requirement, as each different
 layout could specify its own such mapping during import.

 Sure, but the requirement (to keep the process of integrating new shops
 efficient) is not to have one mapping per import (cp. the Email regarding
 more or less schema free) but to enhance one mapping that maps common
 field names to defined fields disregarding order of known fields/columns. As
 far as I understand that is not a problem at all with DIH, however DIH and
 CSV are not a perfect match ,-)


 It could be handy to provide a fieldname map (versus the value map that
 UpdateCSV supports).

 Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in
 DIH ...


 Then you could use the header, and just provide a mapping from header
 fieldnames to schema fieldnames.

 That's the idea -)

 = what's the best way to progress. Either someone enhances the CSVLoader
 by a field mapper (with multipel input field names mapping to one field name
 in the Solr schema) or someone enhances the DIH with a robust CSV loader
 ,-). As I am completely new to this Community, please give me the direction
 to go (or wait :-).

 best regards


 -- Ken

  On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley 
 yo...@lucidimagination.comwrote:
 
  On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
  helmut...@googlemail.com wrote:
  Hi,
  yes, it's about CSV files loaded via HTTP from shops to be fed into a
  shopping search engine.
  The CSV Loader cannot map fields (only field values) etc.
 
  You can provide your own list of fieldnames and optionally ignore the
  first line of the CSV file (assuming it contains the field names).
  http://wiki.apache.org/solr/UpdateCSV#fieldnames
 
  -Yonik
  http://www.lucidimagination.com
 

 --
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 custom data mining solutions










RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan
 Yes, I asked the wrong question. What I was subconsciously
 getting at is
 this: how are you avoiding the possibility of getting hits
 in the HTML
 elements? Is that accomplished by putting tag names in your
 stopwords, or
 by some other mechanism?

HtmlStripCharFilter removes HTML tags, so afterwards only the textual content 
remains. It is the same as extracting text from HTML/XML. 

admin/analysis.jsp is a great tool for visualizing the analysis chain. You can try it.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
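
For reference, a minimal fieldType sketch along those lines (the tokenizer and
filter choices here are only an example, not a recommendation):

  <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- strips HTML/XML markup before tokenization -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>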


Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler

On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote:

 Hi,
 
 btw: there seems to somewhat of a non-match regarding efforts to Enhance DIH
 regarding the CSV format (James Dyer) and the effort to maintain the
 CSVLoader (Ken Krugler). How about merging your efforts and migrating the
 CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-)

While I'm a CSVLoader user (and I've found/fixed one bug in it), I'm not 
involved in any active development/maintenance of that piece of code.

If James or you can make progress on merging support for CSV into DIH, that's 
great.

-- Ken


 On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen 
 helmut...@googlemail.com wrote:
 
 
 
 On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler 
 kkrugler_li...@transpac.comwrote:
 
 
 On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
 
 Hi,
 
 ... that would be an option if there is a defined set of field names and
 a
 single column/CSV layout. The scenario however is different csv files
 (from
 different shops) with individual column layouts (separators, encodings
 etc.). The idea is to map known field names to defined field names in
 the
 solr schema. If I understand the capabilities of the CSVLoader correctly
 (sorry, I am completely new to Solr, started work on it today) this is
 not
 possible - is it?
 
 As per the documentation on
 http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
 names/positions of fields in the CSV file, and ignore fieldnames.
 
 So this seems like it would solve your requirement, as each different
 layout could specify its own such mapping during import.
 
 Sure, but the requirement (to keep the process of integrating new shops
 efficient) is not to have one mapping per import (cp. the Email regarding
 more or less schema free) but to enhance one mapping that maps common
 field names to defined fields disregarding order of known fields/columns. As
 far as I understand that is not a problem at all with DIH, however DIH and
 CSV are not a perfect match ,-)
 
 
 It could be handy to provide a fieldname map (versus the value map that
 UpdateCSV supports).
 
 Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in
 DIH ...
 
 
 Then you could use the header, and just provide a mapping from header
 fieldnames to schema fieldnames.
 
 That's the idea -)
 
 = what's the best way to progress. Either someone enhances the CSVLoader
 by a field mapper (with multipel input field names mapping to one field name
 in the Solr schema) or someone enhances the DIH with a robust CSV loader
 ,-). As I am completely new to this Community, please give me the direction
 to go (or wait :-).
 
 best regards
 
 
 -- Ken
 
 On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley 
 yo...@lucidimagination.comwrote:
 
 On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
 helmut...@googlemail.com wrote:
 Hi,
 yes, it's about CSV files loaded via HTTP from shops to be fed into a
 shopping search engine.
 The CSV Loader cannot map fields (only field values) etc.
 
 You can provide your own list of fieldnames and optionally ignore the
 first line of the CSV file (assuming it contains the field names).
 http://wiki.apache.org/solr/UpdateCSV#fieldnames
 
 -Yonik
 http://www.lucidimagination.com
 
 
 --
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 custom data mining solutions
 
 
 
 
 
 
 
 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








SolrCloud questions

2011-06-09 Thread Upayavira
I'm exploring SolrCloud for a new project, and have some questions based
upon what I've found so far.

The setup I'm planning is going to have a number of multicore hosts,
with cores being moved between hosts, and potentially with cores merging
as they get older (cores are time based, so once today has passed, they
don't get updated).

First question: The solr/conf dir gets uploaded to Zookeeper when you
first start up, and using system properties you can specify a name to be
associated with those conf files. How do you handle it when you have a
multicore setup, and different configs for each core on your host?
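
(For context, the conf upload referred to here is driven by startup properties
along these lines, per the SolrCloud wiki examples of the time; the paths and
config name below are placeholders.)

  java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar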

Second question: Can you query collections when using multicore? On
single core, I can query:

 http://localhost:8983/solr/collection1/select?q=blah

On a multicore system I can query:

 http://localhost:8983/solr/core1/select?q=blah

but I cannot work out a URL to query collection1 when I have multiple
cores.

Third question: For replication, I'm assuming that replication in
SolrCloud is still managed in the same way as non-cloud Solr, that is as
ReplicationHandler config in solrconfig? In which case, I need a
different config setup for each slave, as each slave has a different
master (or can I delegate the decision as to which host/core is its
master to zookeeper?)

Thanks for any pointers.

Upayavira
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Re: Tokenising based on known words?

2011-06-09 Thread Mark Mandel
Thanks for the feedback! This definitely gives me some options to work on!

Mark

On Thu, Jun 9, 2011 at 11:21 PM, Steven A Rowe sar...@syr.edu wrote:

 Hi Mark,

 Are you familiar with shingles aka token n-grams?


 http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html

 Use the empty string for the tokenSeparator to get wordstogether style
 tokens in your index.

 I think you'll want to apply this filter only at index-time, since the
 users will supply the shingles all by themselves :).

 Steve
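
For reference, an index-time-only shingle setup might look roughly like the
following sketch (the type name and surrounding filters are just examples;
tokenSeparator="" is what produces the glued-together tokens):

  <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- emits the original tokens plus glued pairs, e.g. "red", "jacket", "redjacket" -->
      <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>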

  -Original Message-
  From: Mark Mandel [mailto:mark.man...@gmail.com]
  Sent: Thursday, June 09, 2011 8:37 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Tokenising based on known words?
 
  Synonyms really wouldn't work for every possible combination of words in
  our
  index.
 
  Thanks for the idea though.
 
  Mark
 
  On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty g...@mimirtech.com wrote:
 
   On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com
  wrote:
Not sure if this possible, but figured I would ask the question.
   
Basically, we have some users who do some pretty rediculous things
  ;o)
   
Rather than writing red jacket, they write redjacket, which
  obviously
returns no results.
   [...]
  
   Have you tried using synonyms,
  
  
 
  http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
   It seems like they should fit your use case.
  
   Regards,
   Gora
  
 
 
 
  --
  E: mark.man...@gmail.com
  T: http://www.twitter.com/neurotic
  W: www.compoundtheory.com
 
  cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
  http://www.cfobjective.com.au
 
  Hands-on ColdFusion ORM Training
  www.ColdFusionOrmTraining.com




-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


Where to find the Log file

2011-06-09 Thread Ruixiang Zhang
Where can I find the log file of solr? Is it turned on by default? (I use
Jetty)

Thanks
Ruixiang


Re: Boosting result on query.

2011-06-09 Thread Jeff Boul
HI,

Thank you for your answer.

But... I cannot use a boost calculated offline, since the boost will change
depending on the query made.
Each query will boost the results differently.

Any other ideas?

Jeff


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-result-on-query-tp3037649p3046859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Where to find the Log file

2011-06-09 Thread Jack Repenning

On Jun 9, 2011, at 5:45 PM, Ruixiang Zhang wrote:

 Where can I find the log file of solr?  (I use
 Jetty)

By default, it's in yourapp/solr/logs/solr.log

 Is it turned on by default?

Yes. Oh, yes. Very much so. Uh-huh, you betcha.

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep













Re: Where to find the Log file

2011-06-09 Thread Morris Mwanga
Here's help on how to setup logging 

http://skybert.wordpress.com/2009/07/22/how-to-get-solr-to-log-to-a-log-file/
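
In essence (a hedged sketch; the file names and locations are arbitrary), the
approach is to point Jetty's JVM at a java.util.logging configuration that adds
a FileHandler:

  # logging.properties
  .level = INFO
  handlers = java.util.logging.FileHandler
  java.util.logging.FileHandler.pattern = logs/solr.log
  java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

  java -Djava.util.logging.config.file=logging.properties -jar start.jar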

-
Morris

- Original Message -
From: Ruixiang Zhang rxzh...@gmail.com
To: solr-user@lucene.apache.org
Sent: Thursday, June 9, 2011 8:45:30 PM GMT -05:00 US/Canada Eastern
Subject: Where to find the Log file

Where can I find the log file of solr? Is it turned on by default? (I use
Jetty)

Thanks
Ruixiang


Re: tika integration exception and other related queries

2011-06-09 Thread Naveen Gupta
Hi Gary,

We are doing a similar thing, but we are not creating an XML doc; rather, we
are leaving Tika to extract the content and relying on dynamic fields. We
are not storing the text either, but I am not sure whether that will still be
the case in future.

What about Microsoft Office 2007 and later attachments? Is this working for
you? We are always getting a number format exception. I posted about it in the
community as well, but till now no response has come.

Thanks
Naveen

On Thu, Jun 9, 2011 at 6:43 PM, Gary Taylor g...@inovem.com wrote:

 Naveen,

 Not sure our requirement matches yours, but one of the things we index is a
 comment item that can have one or more files attached to it.  To index the
 whole thing as a single Solr document we create a zipfile containing a file
 with the comment details in it and any additional attached files.  This is
 submitted to Solr as a TEXT field in an XML doc, along with other meta-data
 fields from the comment.  In our schema the TEXT field is indexed but not
 stored, so when we search and get a match back it doesn't contain all of the
 contents from the attached files etc., only the stored fields in our schema.
   Admittedly, the user can therefore get back a comment match with no
 indication as to WHERE the match occurred (ie. was it in the meta-data or
 the contents of the attached files), but at the moment we're only interested
 in getting appropriate matches, not explaining where the match is.

 Hope that helps.

 Kind regards,
 Gary.
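
(On the snippet question in the quoted mail below: if the extracted text were
kept in a stored field, which the setup described above deliberately avoids,
Solr's highlighter can return bounded fragments instead of the whole value. A
hedged example, assuming a stored field called content, follows.)

  http://localhost:8983/solr/select?q=content:report&hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=120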




 On 09/06/2011 03:00, Naveen Gupta wrote:

 Hi Gary

 It started working .. though i did not test for Zip files, but for rar
 files, it is working fine ..

 The only thing I wanted to do is to index the metadata (text mapped to
 content) and not store the data. Also, in the search results, I want to filter
 the stuff ... and it started working fine .. I don't want to show the content
 to the end user, since the way it extracts the information is not
 very helpful to the user .. although we can apply a few of the analyzers and
 filters to remove the unnecessary tags, the information would still not be
 of much help .. looking for your opinion ... what did you do in order to
 filter out the content, or are you showing the extracted content to the end user?

 Even in the case where we are showing the text part to the end user, how can I
 limit the number of characters while querying the search results ... is there
 any feature where we can achieve this ... the concept of a snippet, kind of thing

 Thanks
 Naveen

 On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylorg...@inovem.com  wrote:

  Naveen,

 For indexing Zip files with Tika, take a look at the following thread :



 http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

 I got it to work with the 3.1 source and a couple of patches.

 Hope this helps.

 Regards,
 Gary.



 On 08/06/2011 04:12, Naveen Gupta wrote:

  Hi Can somebody answer this ...

 3. can somebody tell me an idea how to do indexing for a zip file ?

 1. while sending docx, we are getting following error.





ERROR on posting update request using CURL in php

2011-06-09 Thread Naveen Gupta
Hi

This is my document

in php

$xmldoc = '<add><doc><field name="id">F_146</field><field
name="userid">74</field><field name="groupuseid">gmail.com</field><field
name="attachment_size">121</field><field
name="attachment_name">sample.pptx</field></doc></add>';

  $ch = curl_init("http://localhost:8080/solr/update");
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml"));
  curl_setopt($ch, CURLOPT_POSTFIELDS, $xmldoc);

   $result = curl_exec($ch);
   if (!curl_errno($ch))
   {
       $info = curl_getinfo($ch);
       $header = substr($response, 0, $info['header_size']);
       echo 'Took ' . $info['total_time'] . ' seconds to send a request to ' . $info['url'];
   } else {
       print_r('no idea');
   }
println('result of query' . '  ' . ' - ' . $result);

It is throwing error

 <html><head><title>Apache Tomcat/6.0.18 - Error report</title>...</head><body>
<h1>HTTP Status 400 - Unexpected character ''' (code 39) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]</h1>
<p><b>message</b> Unexpected character ''' (code 39) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]</p>
<p><b>description</b> The request sent by the client was syntactically incorrect
(Unexpected character ''' (code 39) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]).</p>
<h3>Apache Tomcat/6.0.18</h3></body></html>


Thanks
Naveen


Re: how to Index and Search non-Eglish Text in solr

2011-06-09 Thread Mohammad Shariq
Thanks Erick for your help.
I have another silly question.
Suppose I created multiple fieldTypes, e.g. news_English, news_Chinese,
news_Japnese etc.
After creating these fields, can I copy all of them to a copyField "defaultquery"
like below:

<copyField source="news_English" dest="defaultquery"/>
<copyField source="news_Chinese" dest="defaultquery"/>
<copyField source="news_Japnese" dest="defaultquery"/>

and my defaultquery looks like:

<field name="defaultquery" type="query_text" indexed="false" stored="false"
multiValued="true"/>

Is this the right way to deal with multiple-language indexing and searching???
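
(For comparison, the search-across-language-fields route Erick describes below
can also be handled at query time without a catch-all copyField; a hedged
dismax sketch using the field names from this thread:)

  http://localhost:8983/solr/select?defType=dismax&q=economy&qf=news_English+news_Chinese+news_Japnese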


On 9 June 2011 19:06, Erick Erickson erickerick...@gmail.com wrote:

 No, you'd have to create multiple fieldTypes, one for each language

 Best
 Erick

 On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq shariqn...@gmail.com
 wrote:
  Can I specify multiple language in filter tag in schema.xml ???  like
 below
 
  fieldType name=text class=solr.TextField positionIncrementGap=100
analyzer type=index
   tokenizer class=solr.
  WhitespaceTokenizerFactory/
   filter class=solr.StopFilterFactory ignoreCase=true
  words=stopwords.txt enablePositionIncrements=true/
   filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1
  generateNumberParts=1 catenateWords=1 catenateNumbers=1
  catenateAll=0 splitOnCaseChange=1/
 
  filter class=solr.SnowballPorterFilterFactory language=Dutch /
  filter class=solr.SnowballPorterFilterFactory language=English /
  filter class=solr.SnowballPorterFilterFactory language=Chinese /
  tokenizer class=solr.WhitespaceTokenizerFactory/
  tokenizer class=solr.CJKTokenizerFactory/
 
 
 
   filter class=solr.LowerCaseFilterFactory/filter
  class=solr.SnowballPorterFilterFactory language=Hungarian /
 
 
  On 8 June 2011 18:47, Erick Erickson erickerick...@gmail.com wrote:
 
  This page is a handy reference for individual languages...
  http://wiki.apache.org/solr/LanguageAnalysis
 
  But the usual approach, especially for Chinese/Japanese/Korean
  (CJK) is to index the content in different fields with language-specific
  analyzers then spread your search across the language-specific
  fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
  particularly give surprising results if you put words from different
  languages in the same field.
 
  Best
  Erick
 
  On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq shariqn...@gmail.com
  wrote:
   Hi,
   I had setup solr( solr-1.4 on Ubuntu 10.10) for indexing news articles
 in
   English, but my requirement extend to index the news of other
 languages
  too.
  
   This is how my schema looks :
   field name=news type=text indexed=true stored=false
   required=false/
  
  
   And the text Field in schema.xml looks like :
  
   fieldType name=text class=solr.TextField
 positionIncrementGap=100
  analyzer type=index
 tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.StopFilterFactory ignoreCase=true
   words=stopwords.txt enablePositionIncrements=true/
 filter class=solr.WordDelimiterFilterFactory
  generateWordParts=1
   generateNumberParts=1 catenateWords=1 catenateNumbers=1
   catenateAll=0 splitOnCaseChange=1/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.SnowballPorterFilterFactory
 language=English
   protected=protwords.txt/
  /analyzer
  analyzer type=query
 tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.SynonymFilterFactory
 synonyms=synonyms.txt
   ignoreCase=true expand=true/
 filter class=solr.StopFilterFactory ignoreCase=true
   words=stopwords.txt enablePositionIncrements=true/
 filter class=solr.WordDelimiterFilterFactory
  generateWordParts=1
   generateNumberParts=1 catenateWords=0 catenateNumbers=0
   catenateAll=0 splitOnCaseChange=1/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.SnowballPorterFilterFactory
 language=English
   protected=protwords.txt/
  /analyzer
   /fieldType
  
  
   My Problem is :
   Now I want to index the news articles in other languages to e.g.
   Chinese,Japnese.
   How I can I modify my text field so that I can Index the news in other
  lang
   too and make it searchable ??
  
   Thanks
   Shariq
  
  
  
  
  
   --
   View this message in context:
 
 http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 
 
 
 
  --
  Thanks and Regards
  Mohammad Shariq
 




-- 
Thanks and Regards
Mohammad Shariq


Re: Multiple Values not getting Indexed

2011-06-09 Thread Pawan Darira
it did not work :(

On Thu, Jun 9, 2011 at 12:53 PM, Bill Bell billnb...@gmail.com wrote:

 You have to take the input and splitBy something like , to get it into
 an array and reposted back to
 Solr...

 I believe others have suggested that?
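
(In DIH terms, the splitBy suggestion looks roughly like the sketch below; the
entity name and query are placeholders, and both fields also need
multiValued="true" in schema.xml.)

  <entity name="ad" transformer="RegexTransformer" query="select ...">
    <field column="city_type" splitBy=","/>
    <field column="city_desc" splitBy=","/>
  </entity>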

 On 6/8/11 10:14 PM, Pawan Darira pawan.dar...@gmail.com wrote:

 Hi
 
 I am trying to index 2 fields with multiple values. BUT, it is only putting
 1 value for each, ignoring the rest of the values after the comma (,). I am
 fetching the query through DIH. It works fine if I have only 1 value in each
 of the 2 fields.
 
 E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
  Field2 - Chandigarh,Gurgaon,New
 Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others
 
 *Schema.xml*
 
  <field name="city_type" type="text" indexed="true" stored="true"/>
  <field name="city_desc" type="text" indexed="true" stored="true"/>
 
 
 p.s. i tried multivalued=true but of no help.
 
 --
 Thanks,
 Pawan Darira





-- 
Thanks,
Pawan Darira


Re: Multiple Values not getting Indexed

2011-06-09 Thread Gora Mohanty
On Fri, Jun 10, 2011 at 10:36 AM, Pawan Darira pawan.dar...@gmail.com wrote:
 it did not work :(
[...]

Please provide more details of what you tried, what was the error, and
any error messages that you got. Just saying that it did not work makes
it pretty much impossible for anyone to help you.

You might take a look at http://wiki.apache.org/solr/UsingMailingLists

Regards,
Gora


Re: ERROR on posting update request using CURL in php

2011-06-09 Thread Naveen Gupta
Hi,


curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'
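
A rough PHP equivalent of that command (an untested sketch; the URL and field
value are placeholders) posts the raw XML body with the right Content-Type:

  <?php
  $url = "http://localhost:8080/solr/update?commit=true";
  $xml = '<add><doc><field name="id">testdoc</field></doc></add>';

  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml; charset=utf-8"));
  // raw XML body, not form-encoded fields
  curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
  $response = curl_exec($ch);
  curl_close($ch);
  echo $response;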

Regards
Naveen

On Fri, Jun 10, 2011 at 10:18 AM, Naveen Gupta nkgiit...@gmail.com wrote:

 Hi

 This is my document

 in php

 $xmldoc = 'adddocfield name=idF_146/fieldfield
 name=userid74/fieldfield name=groupuseidgmail.com/fieldfield
 name=attachment_size121/fieldfield
 name=attachment_namesample.pptx/field/doc/add';

   $ch = curl_init(http://localhost:8080/solr/update;);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
   curl_setopt ($ch, CURLOPT_POST, 1);
   curl_setopt($ch, CURLOPT_HTTPHEADER, array(Content-Type:
 text/xml) );
   curl_setopt($ch, CURLOPT_POSTFIELDS,$xmldoc);

$result= curl_exec($ch);
if(!curl_errno($ch))
{
$info = curl_getinfo($ch);
$header = substr($response, 0, $info['header_size']);
echo 'Took ' . $info['total_time'] . ' seconds to send a
 request to ' . $info['url'];
  }else{
  print_r('no idea');
 }
 println('result of query'.'  '.' - '.$result);

 It is throwing error

  htmlheadtitleApache Tomcat/6.0.18 - Error
 report/titlestyle!--H1
 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
 H2
 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
 H3
 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
 BODY
 {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B
 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
 P
 {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
 {color : black;}A.name {color : black;}HR {color : #525D76;}--/style
 /headbodyh1HTTP Status 400 - Unexpected character ''' (code 39) in
 prolog; expected 'lt;'
  at [row,col {unknown-source}]: [1,1]/h1HR size=1
 noshade=noshadepbtype/b Status report/ppbmessage/b
 uUnexpected character ''' (code 39) in prolog; expected 'lt;'
  at [row,col {unknown-source}]: [1,1]/u/ppbdescription/b uThe
 request sent by the client was syntactically incorrect (Unexpected character
 ''' (code 39) in prolog; expected 'lt;'
  at [row,col {unknown-source}]: [1,1])./u/pHR size=1
 noshade=noshadeh3Apache Tomcat/6.0.18/h3/body/html


 Thanks
 Naveen





Re: ERROR on posting update request using CURL in php

2011-06-09 Thread Naveen

Hi,

Basically I need to post something like this using curl in PHP.

The example explained in the earlier thread:

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'

Do we need to create a temp file and use a PUT command,


or can we do it using POST?
Regards
Naveen 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/ERROR-on-posting-update-request-using-CURL-in-php-tp3047312p3047372.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud questions

2011-06-09 Thread Mohammad Shariq
I am also planning to move to SolrCloud;
since it's still under development, I am not sure about its behavior in
production.
Please update us once you find it stable.


On 10 June 2011 03:56, Upayavira u...@odoko.co.uk wrote:

 I'm exploring SolrCloud for a new project, and have some questions based
 upon what I've found so far.

 The setup I'm planning is going to have a number of multicore hosts,
 with cores being moved between hosts, and potentially with cores merging
 as they get older (cores are time based, so once today has passed, they
 don't get updated).

 First question: The solr/conf dir gets uploaded to Zookeeper when you
 first start up, and using system properties you can specify a name to be
 associated with those conf files. How do you handle it when you have a
 multicore setup, and different configs for each core on your host?

 Second question: Can you query collections when using multicore? On
 single core, I can query:

  http://localhost:8983/solr/collection1/select?q=blah

 On a multicore system I can query:

  http://localhost:8983/solr/core1/select?q=blah

 but I cannot work out a URL to query collection1 when I have multiple
 cores.

 Third question: For replication, I'm assuming that replication in
 SolrCloud is still managed in the same way as non-cloud Solr, that is as
 ReplicationHandler config in solrconfig? In which case, I need a
 different config setup for each slave, as each slave has a different
 master (or can I delegate the decision as to which host/core is its
 master to zookeeper?)

 Thanks for any pointers.

 Upayavira
 ---
 Enterprise Search Consultant at Sourcesense UK,
 Making Sense of Open Source




-- 
Thanks and Regards
Mohammad Shariq