Re: upgrading to Tika 0.9 on Solr 1.4.1
We are successfully extracting PDF content with Solr 3.1 and Tika 0.9. Replace

    fontbox-1.3.1.jar, jempbox-1.3.1.jar, pdfbox-1.3.1.jar, tika-core-0.8.jar, tika-parsers-0.8.jar

with

    fontbox-1.4.0.jar, jempbox-1.4.0.jar, pdfbox-1.4.0.jar, tika-core-0.9.jar, tika-parsers-0.9.jar

I'm not entirely certain whether a recompile of Solr was necessary or not.

Andreas

From: Surendra csnsha...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, June 21, 2011 5:18:31 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

Hi Andreas,

I tried Solr 3.1 as well as 3.2; I was not able to overcome these issues with the newer versions either. I need attr_content:* to return results (with 1.4.1 this is successful), which is not happening. It indexes well in 3.1, but in 3.2 I get the following error:

    Invalid version or the data in not in 'javabin' format

--Surendra
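Andreas's jar swap (first message above) can be scripted; a minimal sketch in Python, assuming the stock contrib/extraction/lib layout — the jar names come from the post, while the directory paths are placeholders:

```python
import os
import shutil

# Jars shipped in Solr's extraction contrib (old) and their
# Tika 0.9-era replacements (new), per the post above.
OLD_JARS = ["fontbox-1.3.1.jar", "jempbox-1.3.1.jar", "pdfbox-1.3.1.jar",
            "tika-core-0.8.jar", "tika-parsers-0.8.jar"]
NEW_JARS = ["fontbox-1.4.0.jar", "jempbox-1.4.0.jar", "pdfbox-1.4.0.jar",
            "tika-core-0.9.jar", "tika-parsers-0.9.jar"]

def swap_jars(lib_dir, new_jar_dir):
    """Remove the old jars from lib_dir and copy in the new ones.
    Returns the resulting directory listing for inspection."""
    for jar in OLD_JARS:
        path = os.path.join(lib_dir, jar)
        if os.path.exists(path):
            os.remove(path)
    for jar in NEW_JARS:
        shutil.copy(os.path.join(new_jar_dir, jar), lib_dir)
    return sorted(os.listdir(lib_dir))
```

Whether a recompile of Solr is also needed is, as the post says, unconfirmed; the swap itself is just a file replacement.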
Re: upgrading to Tika 0.9 on Solr 1.4.1
I've unsuccessfully attempted to go down this road - there are API changes, some of which I was able to solve by taking code snippets from Solr 3.1. Some extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9 - some tests not passing' in the archive). Ultimately, I decided that the then newly released Solr 3.1 was the less rocky route. Not sure if that is an option for you.

Andreas

From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
To: solr-user@lucene.apache.org
Sent: Mon, June 20, 2011 7:18:34 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

Hi Surendra,

On Jun 20, 2011, at 4:59 AM, Surendra wrote:

> Hey Chris,
>
> I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib) after building them using the source provided by Tika. Now I have an issue with this. I am working on extracting PDF content using Solr. I have added fmap.content to the configurable params as attr_content, where I can see the entire extracted document. After the Tika update I am not able to see attr_content appearing in the search results. When I restore the old 0.4 Tika jars, attr_content appears again. I didn't find any exceptions in the console. Is this a known behavior that someone has faced already? Can you guide me to resolve this?

I don't think you can simply add a new tika-core-0.9 and tika-parsers-0.9 to extraction/lib -- I think you'll need to replace the set of prior Tika jars in there. Have a look here to see what jars you would need to replace, HTH: http://tika.apache.org/0.9/gettingstarted.html

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: TikaEntityProcessor
I went unsuccessfully down this path - too many incompatibilities among versions - some code changes and recompiling required. See also the thread 'Solr 1.4.1 and Tika 0.9 - some tests not passing' for remaining issues. You'll have better luck with the newer Solr 3.1 release, which already uses Tika 0.8 - still re-compiled from code (no changes as far as I remember) - I never tried the library replacement and don't think it's possible.

Andreas

From: firdous_kind86 naturelov...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wed, April 20, 2011 12:38:02 AM
Subject: Re: TikaEntityProcessor

hi, i asked that :) didn't get that.. what dependencies?

i am using Solr 1.4 and Tika 0.9. i replaced tika-core 0.9 and tika-parsers 0.9 at /contrib/extraction/lib and also replaced the old version of dataimporthandler-extras with apache-solr-dataimporthandler-extras-3.1.0.jar, but still the same problem.. someone pointed bug SOLR-2116 to me but i guess it is only for solr-3.1

--
View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-tp2839188p2841936.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 1.4.1 and Tika 0.9 - some tests not passing
Thank you. That is valuable guidance. In light of the recent release of Solr 3.1, I decided to first try that distribution, as it already uses Tika 0.8, which is much closer to my target. Out of the box (i.e., without replacing the Tika and PDFBox libraries) the tests pass, yet I see the error below. When I change ignoreException("unknown field 'a'"); to ignoreException("unknown field 'meta'"); in the testDefaultField test, the error output goes away. I am wondering whether that particular error is expected, or whether the error should in fact be "unknown field 'a'" and I'm only masking an issue with the change. All extraction tests also pass after I replace the Tika and PDFBox libraries with the newer versions.

-- Andreas

test:
    [junit] Testsuite: org.apache.solr.handler.ExtractingRequestHandlerTest
    [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 6.424 sec
    [junit]
    [junit] - Standard Error -
    [junit] 01/04/2011 22:49:59 org.apache.solr.common.SolrException log
    [junit] SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'meta'
    [junit] at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:321)
    [junit] at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    [junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    [junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    [junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
    [junit] at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    [junit] at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    [junit] at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    [junit] at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:337)
    [junit] at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:373)
    [junit] at org.apache.solr.handler.ExtractingRequestHandlerTest.testDefaultField(ExtractingRequestHandlerTest.java:156)

From: Chris Hostetter hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Sent: Thu, March 31, 2011 7:19:05 PM
Subject: Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

: I'm still interested on what steps I could take to get to the bottom of the
: failing tests. Is there additional information that I should provide?

I'm not really up to speed on what might have changed in Tika 0.9 to cause this, but the best thing to do would probably be to look at what *does* work compared to what doesn't work. If *none* of the asserts dealing with an HTML doc work, that suggests that fundamentally something is just completely broken about the HTML parsing.

Consider this first assertion failure...

: assertQ(req("title:Welcome"), "//*[@numFound='1']");

...in the context of what you said tika 0.9 gives you for that doc on the command line...

: $ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html
...
: <title>Welcome to Solr</title>

...if that basic little bit of info can't be extracted, then I'm guessing nothing is being extracted.

I would suggest you run the example (with the 0.9 tika jars) and manually attempt to index one document, and then use the schema browser to see exactly what gets indexed. You may need to experiment with tweaking the config options for the extraction handler.

-Hoss
Re: Solr 1.4.1 and Tika 0.9 - some tests not passing
I'm still interested in what steps I could take to get to the bottom of the failing tests. Is there additional information that I should provide?

Some of the output below got mangled in the email - here are the (hopefully) complete lines:

    This has a <a shape="rect" href="http://www.apache.org">link</a>. (Tika 0.9)
    This has a <a href="http://www.apache.org">link</a>. (Tika 0.4)

From: Andreas Kemkes a5s...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Tue, March 22, 2011 10:30:57 AM
Subject: Solr 1.4.1 and Tika 0.9 - some tests not passing

Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like to upgrade it to Tika 0.9, as the issues are not occurring in Tika 0.9. With the changes we made to Solr 1.4.1, we can successfully index the previously failing PDF documents. Unfortunately we cannot get the HTML-related tests to pass. The following asserts in ExtractingRequestHandlerTest.java are failing:

    assertQ(req("title:Welcome"), "//*[@numFound='1']");
    assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
    assertQ(req("t_href:http"), "//*[@numFound='2']");
    assertQ(req("t_href:http"), "//doc[1]/str[.='simple3']");
    assertQ(req("+id:simple4 +t_content:Solr"), "//*[@numFound='1']");
    assertQ(req("defaultExtr:http\\://www.apache.org"), "//*[@numFound='1']");
    assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
    assertTrue(val + " is not equal to " + linkNews, val.equals(linkNews) == true); //there are two a tags, and they get collapsed

Below are the differences in output from Tika 0.4 and Tika 0.9 for simple.html. Tika 0.9 has additional meta tags, a shape attribute, and some additional white space. Is this what throws it off? What do we need to consider so that Solr 1.4.1 will process the Tika 0.9 output correctly? Do we need to configure different filters and tokenizers? Which ones? Or is it something else entirely?

Thanks in advance for any help,

Andreas

$ java -jar tika-app-0.4.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

    <?xml version="1.0" encoding="UTF-8"?>
    <head>
    <title>Welcome to Solr</title>
    </head>
    <body>
    <p>Here is some text</p>
    Here is some text in a div
    This has a <a href="http://www.apache.org">link</a>.
    </body>
    </html>

$ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

    <?xml version="1.0" encoding="UTF-8"?>
    <head>
    <meta name="Content-Length" content="209"/>
    <meta name="Content-Encoding" content="ISO-8859-1"/>
    <meta name="Content-Type" content="text/html"/>
    <meta name="resourceName" content="simple.html"/>
    <title>Welcome to Solr</title>
    </head>
    <body>
    <p>Here is some text</p>
    Here is some text in a div
    This has a <a href="http://www.apache.org">link</a>.
    </body>
    </html>
Solr 1.4.1 and Tika 0.9 - some tests not passing
Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like to upgrade it to Tika 0.9, as the issues are not occurring in Tika 0.9. With the changes we made to Solr 1.4.1, we can successfully index the previously failing PDF documents. Unfortunately we cannot get the HTML-related tests to pass. The following asserts in ExtractingRequestHandlerTest.java are failing:

    assertQ(req("title:Welcome"), "//*[@numFound='1']");
    assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
    assertQ(req("t_href:http"), "//*[@numFound='2']");
    assertQ(req("t_href:http"), "//doc[1]/str[.='simple3']");
    assertQ(req("+id:simple4 +t_content:Solr"), "//*[@numFound='1']");
    assertQ(req("defaultExtr:http\\://www.apache.org"), "//*[@numFound='1']");
    assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
    assertTrue(val + " is not equal to " + linkNews, val.equals(linkNews) == true); //there are two a tags, and they get collapsed

Below are the differences in output from Tika 0.4 and Tika 0.9 for simple.html. Tika 0.9 has additional meta tags, a shape attribute, and some additional white space. Is this what throws it off? What do we need to consider so that Solr 1.4.1 will process the Tika 0.9 output correctly? Do we need to configure different filters and tokenizers? Which ones? Or is it something else entirely?

Thanks in advance for any help,

Andreas

$ java -jar tika-app-0.4.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

    <?xml version="1.0" encoding="UTF-8"?>
    <head>
    <title>Welcome to Solr</title>
    </head>
    <body>
    <p>Here is some text</p>
    Here is some text in a div
    This has a <a href="http://www.apache.org">link</a>.
    </body>
    </html>

$ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

    <?xml version="1.0" encoding="UTF-8"?>
    <head>
    <meta name="Content-Length" content="209"/>
    <meta name="Content-Encoding" content="ISO-8859-1"/>
    <meta name="Content-Type" content="text/html"/>
    <meta name="resourceName" content="simple.html"/>
    <title>Welcome to Solr</title>
    </head>
    <body>
    <p>Here is some text</p>
    Here is some text in a div
    This has a <a href="http://www.apache.org">link</a>.
    </body>
    </html>
Re: Omit hour-min-sec in search?
How about [YYYY-MM-DDThh:mm:ssZ/DAY TO YYYY-MM-DDThh:mm:ssZ+1DAY/DAY]? See DateField.html in your Solr API documentation for more.

Andreas

From: Jan Høydahl jan@cominvent.com
To: solr-user@lucene.apache.org
Sent: Sun, March 6, 2011 1:40:59 PM
Subject: Re: Omit hour-min-sec in search?

> Not sure if there is a means of doing explicitly what you ask, but you could do a date range: +mydate:[YYYY-MM-DD 0:0:0 TO YYYY-MM-DD 11:59:59]

This would not work. It has to be in the YYYY-MM-DDT00:00:00Z format. But I agree that it would be handy if the DateField could support a date-only format mydate:[YYYY-MM-DD TO YYYY-MM-DD]. It could simply default to midnight UTC.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
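Client-side, the day-level rounding that the bracketed /DAY expression performs can be sketched as follows (plain Python datetime; the field name mydate follows the example in the thread):

```python
from datetime import datetime, timedelta

def day_range(ts):
    """Solr date strings for ts/DAY and ts+1DAY/DAY: the UTC day
    containing ts, computed client-side."""
    start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)

# Searching for "any time on March 6, 2011" then becomes a range query:
start, end = day_range(datetime(2011, 3, 6, 13, 40, 59))
query = "mydate:[%s TO %s]" % (start, end)
# -> mydate:[2011-03-06T00:00:00Z TO 2011-03-07T00:00:00Z]
```

Letting Solr do the rounding via /DAY keeps the query cacheable across the day, which is a point in favor of the date-math form over computing the bounds in the client.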
Re: More Date Math: NOW/WEEK
Thank you for the clarification. Personally, I believe it is correct for a week to start in a different month/year, and it is certainly what I would expect. As you pointed out, these time units don't form a strictly ordered set (... year > month > day ..., week > day ...). Complications arise from the different notions of what the first day of the week is (Sunday - US and Canada, Monday - Europe and ISO 8601, Saturday - Middle East). This is handled by the locale, I think. Further complications are introduced by week numbering, but I don't think this applies here (http://en.wikipedia.org/wiki/Seven-day_week#Week_numbering).

Both MySQL (http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek) and Postgres have the notion of weeks. All this ignores complications of 5-day or 6-day weeks, which were used in Russia during certain parts of the last century. There might be other historical cases or even current ones, but like you, I believe a definition like "A week is a time unit equal to seven days." is commonly accepted. But maybe you are correct and this special logic belongs in the client.

Regards,

Andreas

From: Chris Hostetter hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Sent: Tue, March 1, 2011 6:30:26 PM
Subject: Re: More Date Math: NOW/WEEK

: Digging into the source code of DateMathParser.java, i found the following
: comment:
:
:  99 // NOTE: consciously choosing not to support WEEK at this time,
: 100 // because of complexity in rounding down to the nearest week
: 101 // arround a month/year boundry.
: 102 // (Not to mention: it's not clear what people would *expect*)
:
: I was able to implement a work-around in my ruby client using the following
: pseudo code:
: wd=NOW.wday; NOW-#{wd}DAY/DAY

The main issue that comment in DateMathParser.java is referring to is the ambiguity of what should happen when you try to do something like 2009-01-02T00:00:00Z/WEEK.

WEEK would be the only unit where rounding changed a unit *larger* than the one you rounded on -- i.e.: rounding on day only affects hours, minutes, seconds, millis; rounding on month only affects days, hours, minutes, seconds, millis; but in an example like the one above, where Jan 2 2009 was a Friday, rounding down a week (using logic similar to what you have) would result in 2008-12-28T00:00:00Z -- changing the month and year. It's not really clear that that is what people would expect -- I'm guessing at least a few people would expect it to stop at the 1st of the month.

The ambiguity of what behavior makes the most sense is why I never got around to implementing it -- it's certainly possible, but the various options seemed too confusing to be very generally useful and easy to understand. As you point out: people who really want special logic like this (and know how they want it to behave) have an easy workaround by evaluating NOW in the client, since every week has exactly seven days.

-Hoss
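Hoss's boundary example can be checked directly; a sketch assuming a Sunday-start week (the US convention -- which start day to use is exactly the ambiguity discussed above):

```python
from datetime import datetime, timedelta

def round_down_to_week(d):
    """Round down to midnight at the start of the week (Sunday start),
    i.e. what a hypothetical NOW/WEEK rounding rule might do."""
    # Python's weekday(): Monday=0 .. Sunday=6, so days since Sunday is:
    days_since_sunday = (d.weekday() + 1) % 7
    start = d - timedelta(days=days_since_sunday)
    return start.replace(hour=0, minute=0, second=0, microsecond=0)

# Jan 2 2009 was a Friday; rounding down crosses both the month
# and the year boundary, landing on 2008-12-28:
start = round_down_to_week(datetime(2009, 1, 2))
```

With a Monday-start (ISO 8601) convention the same date would instead round to 2008-12-29 -- still crossing the year boundary, which illustrates why no single behavior was obviously right.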
Re: Tika metadata extracted per supported document format?
Chris:

Yes, I only see the output below. I'm familiar with the information in http://wiki.apache.org/solr/ExtractingRequestHandler, except for the tika.config part, which I haven't touched.

Even when running documents through Tika directly, the output of metadata is highly dependent on what metadata the document contains (obviously). I haven't found the right place in the Tika source code yet either. Would digging into POI, PDFBox, ... help me any further in my pursuit? A matrix that lists the complete set of metadata for the most popular formats would sure be helpful to me. I would help provide it, if properly directed.

Thanks,

Andreas

PS: I've also noticed some differences in the date formats being used (using version 0.9). Is that something I should be concerned about when using it through SolrCell?

    <meta name="Creation-Date" content="Mon May 17 10:10:15 PDT 2010"/> (from a Word document)
    <meta name="Creation-Date" content="2011-01-03T18:45:50Z"/> (from a PDF)

From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
To: solr-user@lucene.apache.org
Sent: Fri, February 25, 2011 4:11:00 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
>
>     TikaMetadataKeys
>     PROTECTED
>     RESOURCE_NAME_KEY
>     TikaMimeKeys
>     MIME_TYPE_MAGIC
>     TIKA_MIME_FILE
>
> Both 0.8 and 0.9 give me the same list. Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

> I'm a bit unclear if that gets me to what I was looking for - metadata like content_type or last_modified. Or am I confusing Tika metadata with SolrCell metadata? I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a configuration for the ExtractingRequestHandler that remaps the field names from Tika to Solr. So that's probably where it's coming from. Check this out: http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Tika metadata extracted per supported document format?
Hello,

I've asked this on the Tika mailing list without an answer, so apologies for cross-posting.

I'm trying to find information that tells me specifically what metadata is provided for the different supported document formats. Unfortunately, all I was able to find so far is "The Metadata produced depends on the type of document submitted." Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), so I'm particularly interested in that version, but also in changes that come with newer versions of Tika. Where are the best places to look for such information?

Thanks in advance,

Andreas
Re: Tika metadata extracted per supported document format?
Hi Chris,

Thank you so much - that's a great start.

Andreas

From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
To: solr-user@lucene.apache.org
Cc: u...@tika.apache.org
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

    java -jar tika-app-<version>.jar --list-met-models

And get a printout of the met keys that Tika supports. Some parsers add their own that aren't part of this met listing, but this is a relatively comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello, I've asked this on the Tika mailing list without an answer, so apologies for cross-posting. I'm trying to find information that tells me specifically what metadata is provided for the different supported document formats. Unfortunately, all I was able to find so far is "The Metadata produced depends on the type of document submitted." Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), so I'm particularly interested in that version, but also in changes that come with newer versions of Tika. Where are the best places to look for such information? Thanks in advance, Andreas

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: upgrading to Tika 0.9 on Solr 1.4.1
According to the Tika release notes, it's fixed in 0.9. I haven't tried it myself.

"A critical backwards incompatible bug in PDF parsing that was introduced in Tika 0.8 has been fixed. (TIKA-548)"

Andreas

From: Darx Oman darxo...@gmail.com
To: solr-user@lucene.apache.org
Sent: Fri, February 25, 2011 10:33:39 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

hi,

if you want to index PDF files, then use Tika 0.6, because 0.7 and 0.8 do not detect the PDF parser correctly
Re: Tika metadata extracted per supported document format?
Hi Chris,

java -jar tika-app-0.9.jar --list-met-models

    TikaMetadataKeys
    PROTECTED
    RESOURCE_NAME_KEY
    TikaMimeKeys
    MIME_TYPE_MAGIC
    TIKA_MIME_FILE

Both 0.8 and 0.9 give me the same list. Is that a configuration issue?

I'm a bit unclear if that gets me to what I was looking for - metadata like content_type or last_modified. Or am I confusing Tika metadata with SolrCell metadata? I thought SolrCell metadata comes from Tika, or does it not?

Regards,

Andreas

From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
To: solr-user@lucene.apache.org
Cc: u...@tika.apache.org
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

    java -jar tika-app-<version>.jar --list-met-models

And get a printout of the met keys that Tika supports. Some parsers add their own that aren't part of this met listing, but this is a relatively comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello, I've asked this on the Tika mailing list without an answer, so apologies for cross-posting. I'm trying to find information that tells me specifically what metadata is provided for the different supported document formats. Unfortunately, all I was able to find so far is "The Metadata produced depends on the type of document submitted." Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), so I'm particularly interested in that version, but also in changes that come with newer versions of Tika. Where are the best places to look for such information? Thanks in advance, Andreas

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: Date Math
Thank you, that clarifies it. Good catch on -DAY. I had noticed it after submitting, but as -1DAY causes the same ParseException, I didn't amend the question.

Andreas

From: Chris Hostetter hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Sent: Tue, February 22, 2011 6:18:56 PM
Subject: Re: Date Math

: org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY': ...
: Are they not supported as a short-cut for NOW-1DAY? I'm using Solr 1.4.

No, -1DAY is a valid DateMath string (to the DateMathParser), but as a field value you must specify a valid date string, which can *end* with a DateMath string. So NOW-1DAY is legal, as is 2011-02-22T12:34:56Z-1DAY.

Note also: you didn't do -1DAY, you tried -DAY, which isn't valid anywhere.

-Hoss
More Date Math: NOW/WEEK
Date Math is great. NOW/MONTH and NOW/DAY are all working and very useful, so naively I tried NOW/WEEK, which failed. Digging into the source code of DateMathParser.java, I found the following comment:

     99 // NOTE: consciously choosing not to support WEEK at this time,
    100 // because of complexity in rounding down to the nearest week
    101 // arround a month/year boundry.
    102 // (Not to mention: it's not clear what people would *expect*)

I was able to implement a work-around in my ruby client using the following pseudo code:

    wd=NOW.wday; NOW-#{wd}DAY/DAY

This could be extended and integrated into DateMathParser.java directly using something like the following mapping:

    <val>WEEKS  -> (<val>*7)DAYS
    <date>/WEEK -> (<date>-(<date>.DAY_OF_WEEK)DAYS)/DAY

What other concerns are there to consider?

Andreas
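The Ruby pseudo code above, expanded into a runnable sketch (Python standing in for the Ruby client, with Sunday=0 matching Ruby's Time#wday):

```python
from datetime import datetime

def now_slash_week(now):
    """Build the Solr date-math string that emulates the missing
    NOW/WEEK: subtract the weekday number, then round down to the day."""
    # Ruby's wday: Sunday=0 .. Saturday=6; Python's weekday(): Monday=0.
    wd = (now.weekday() + 1) % 7
    return "NOW-%dDAY/DAY" % wd

# On a Friday (wd == 5) this yields the date-math expression below:
expr = now_slash_week(datetime(2009, 1, 2))  # "NOW-5DAY/DAY"
```

One caveat with this workaround: the client's notion of "now" and its weekday must agree with the server's, so the client clock (and time zone, since wday is computed locally) effectively replaces Solr's NOW for the week boundary.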
Re: Index Design Question
Thank you. These are good general suggestions. Regarding the optimization for indexing vs. querying: are there any specific recommendations for each of those cases available somewhere? A link, for example, would be fabulous.

I'm also still curious about solutions that go further. For example, there is a 2007 Lucene Overview presentation by Aaron Bannert claiming that "Lucene provides built-in methods to allow queries to span multiple remote Lucene indexes." and "A much more involved way to achieving high levels of update performance can be had by dividing the data into separate 'columns', or 'silos'. Each column will hold a subset of the overall data, and will only receive updates for data that it controls. By taking advantage of the remote index merging query utility mentioned on an earlier slide, the data can still be searched in its entirety without any loss of accuracy and with negligible performance impact."

Is this possible using Solr? How could this be accomplished? Again, any link would be fabulous. The wiki page http://wiki.apache.org/solr/MergingSolrIndexes seems to describe a somewhat different approach to merging. Is this something that could be integrated into master/slave replication by having two masters and one merged slave (in the above sense of separate "columns", or "silos")? If yes, what are the performance considerations when using it?
Date Math
The SolrQuerySyntax Wiki page refers to DateMathParser for examples. When I tried -1DAY, I got:

    org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY':
    Encountered "-" at line 1, column 14.
    Was expecting one of:
        "(" ...
        "*" ...
        <QUOTED> ...
        <TERM> ...
        <PREFIXTERM> ...
        <WILDTERM> ...
        "[" ...
        "{" ...
        <NUMBER> ...

Are they not supported as a short-cut for NOW-1DAY? I'm using Solr 1.4.
Index Design Question
We are indexing documents with several associated fields for search and display, some of which may change with a much higher frequency than the document content. As per my understanding, we have to resubmit the entire gamut of fields with every update. If the reindexing of the documents becomes a performance bottleneck, what design alternatives are there within Solr?

Thanks in advance for your contributions.
Controlling Tika's metadata
Just getting my feet wet with text extraction, using both the schema and solrconfig settings from the example directory in the 1.4 distribution, so I might be missing something obvious.

Trying to provide my own title (and discarding the one received through Tika's metadata) wasn't straightforward. I had to use the following:

    fmap.title=tika_title (to discard the Tika title)
    literal.attr_title=New Title (to provide the correct one)
    fmap.attr_title=title (to map it back to the field, as I would like to use title in searches)

Is there anything easier than the above? How can this best be generalized to other metadata provided by Tika (which in our use case will be mostly ignored, as it is provided separately)?

Thanks in advance for your responses.
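For what it's worth, the three parameters can at least be packaged once on the client side; a sketch using Python's urllib. The /update/extract path and the literal.id parameter are assumptions based on the standard SolrCell example setup, not from the post; the three fmap/literal parameters are the ones listed above.

```python
from urllib.parse import urlencode

def extract_request(solr_base, doc_id, own_title):
    """URL for ExtractingRequestHandler that shunts Tika's title aside
    and substitutes our own, per the workaround described above."""
    params = {
        "literal.id": doc_id,             # assumed unique-key field
        "fmap.title": "tika_title",       # discard the Tika title
        "literal.attr_title": own_title,  # provide the correct one
        "fmap.attr_title": "title",       # map it back onto 'title'
    }
    return solr_base + "/update/extract?" + urlencode(params)

url = extract_request("http://localhost:8983/solr", "doc1", "New Title")
```

Generalizing to other unwanted Tika metadata would mean one fmap.* entry per field; the uprefix parameter mentioned on the ExtractingRequestHandler wiki page may be a less verbose route for fields that should simply be set aside wholesale.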