RE: Spellcheck help
Thanks for the input, I'll check it out!

Marc

Subject: RE: Spellcheck help
Date: Fri, 23 Jul 2010 13:12:04 -0500
From: james.d...@ingrambook.com
To: solr-user@lucene.apache.org

In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):

final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

and remove the |\\d+ to make it:

final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

My testing shows this solves your problem. The caution is to test it against all your use cases, because obviously someone thought we should ignore leading digits in keywords. Surely there's a reason for that, although I can't think of one.

James Dyer
E-Commerce Systems
Ingram Book Company
(615) 213-4311

-----Original Message-----
From: dekay...@hotmail.com [mailto:dekay...@hotmail.com]
Sent: Saturday, July 17, 2010 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck help

Can anybody help me with this? :(

-----Original Message-----
From: Marc Ghorayeb
Sent: Thursday, July 08, 2010 9:46 AM
To: solr-user@lucene.apache.org
Subject: Spellcheck help

Hello,

I've been trying to get rid of a bug when using the spellcheck, but so far with no success :(

When searching for a word that starts with a number, for example 3dsmax, I get the results that I want, BUT the spellcheck says it is not correctly spelled AND the collation gives me 33dsmax. Further investigation shows that the spellcheck is actually only checking dsmax, which it considers nonexistent, and it suggests 3dsmax instead. Since I have spellcheck.collate = true, the collation I show is 33dsmax, the first 3 being the one discarded by the spellchecker...

Otherwise, the spellcheck works correctly for normal words... Any ideas? :(

My spellcheck field is fairly classic: whitespace tokenizer with a lowercase filter...

Any help would be greatly appreciated :)

Thanks,
Marc
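The effect of the pattern change James describes can be reproduced outside Solr with plain java.util.regex. The following is only a sketch: the NMTOKEN constant here is a simplified stand-in (an assumption, not Solr's actual definition), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpellPatternDemo {
    // ASSUMPTION: simplified stand-in for Solr's NMTOKEN constant,
    // which in the real class covers a much wider character range.
    static final String NMTOKEN = "[A-Za-z_][A-Za-z0-9_.\\-]*";

    // Pattern as quoted in the thread (with the |\\d+ alternative).
    static final String ORIGINAL =
        "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

    // Pattern after removing |\\d+, as suggested in the thread.
    static final String PATCHED =
        "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

    // Collects every substring of the query that the pattern matches.
    static List<String> tokens(String regex, String query) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(query);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The lookahead rejects any start position that begins with a digit,
        // so the original pattern only picks up the token from "d" onwards.
        System.out.println(tokens(ORIGINAL, "3dsmax")); // [dsmax]
        System.out.println(tokens(PATCHED, "3dsmax"));  // [3dsmax]
    }
}
```

With this stand-in, the original pattern also skips purely numeric terms (for "abc 123" it yields only abc), while the patched pattern passes them on as well. That may be the behavior the |\\d+ alternative was meant to provide, which is one more reason to test the change against real queries.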
Spellcheck help
Hello,

I've been trying to get rid of a bug when using the spellcheck, but so far with no success :(

When searching for a word that starts with a number, for example 3dsmax, I get the results that I want, BUT the spellcheck says it is not correctly spelled AND the collation gives me 33dsmax. Further investigation shows that the spellcheck is actually only checking dsmax, which it considers nonexistent, and it suggests 3dsmax instead. Since I have spellcheck.collate = true, the collation I show is 33dsmax, the first 3 being the one discarded by the spellchecker...

Otherwise, the spellcheck works correctly for normal words... Any ideas? :(

My spellcheck field is fairly classic: whitespace tokenizer with a lowercase filter...

Any help would be greatly appreciated :)

Thanks,
Marc
Strange query behavior
Hello,

I have a title that says "3DVIA Studio & Virtools Maya and 3dsMax Exporters". The analysis tool for this field gives me these tokens:

3dvia dvia studio ; virtool maya 3dsmax ds systèm max export

However, when I search for 3dsmax, I get no results :( Furthermore, if I search for dsmax, the spellchecker suggests 3dsmax even though that query doesn't find any results. If I search for any other token (3dvia, or max for example), the document is found. 3dsmax is the only token that doesn't seem to work!! :(

Here is my schema for this field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Can anyone help me out please? :(

PS: ${Language} is set to en (for English) in this case...
RE: Copyfield multi valued to single value
Thanks for the update, I'll have to find another way then :s.

Marc

Date: Mon, 14 Jun 2010 13:44:30 -0700
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: Re: Copyfield multi valued to single value

: Is there a way to copy a multivalued field to a single value by taking
: for example the first index of the multivalued field?

Unfortunately no. This would either need to be done with an UpdateProcessor, or on the client constructing the doc (either the remote client, or in your DIH config if that's how you are using Tika).

-Hoss
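As an illustration of the client-side option Hoss mentions, the sketch below picks one title before the document is sent to Solr. It is plain Java with no Solr dependency: the document is modeled as a simple map, the field name title_sort is hypothetical, and taking the first non-blank value (rather than strictly the first) is an assumption made because extracted titles can be empty.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FirstValueCopy {
    // Returns the first non-blank value, trimmed, or null when there is none.
    static String firstValue(List<String> values) {
        if (values == null) {
            return null;
        }
        for (String v : values) {
            if (v != null && !v.trim().isEmpty()) {
                return v.trim();
            }
        }
        return null;
    }

    // Builds the field map for one document: the multivalued "title" is kept
    // as is, and a single-valued "title_sort" (hypothetical field name) is
    // added for sorting when a usable title exists.
    static Map<String, Object> buildDoc(String id, List<String> titles) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("title", titles);
        String sortTitle = firstValue(titles);
        if (sortTitle != null) {
            doc.put("title_sort", sortTitle);
        }
        return doc;
    }
}
```

The schema would then need a matching single-valued field to sort on, and the map would have to be translated into whatever document type the indexing client actually uses.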
Copyfield multi valued to single value
Hello,

Is there a way to copy a multivalued field to a single value by taking, for example, the first value of the multivalued field?

I am trying to sort my index by title. My index contains Tika-extracted titles, which come in as multiple values, which is why my title field is multivalued. However, when I sort on the title field, it crashes, I guess because it cannot compare two arrays, which is logical. So my thought was to copy only one value from the array to another field. Maybe there is another way to do that?

Can anyone help me? Thanks in advance!

Marc
RE: Problem with pdf, upgrading Cell
Great news, thanks :)

Marc
RE: Problem with pdf, upgrading Cell
mailto:sagar...@opentext.com wrote:

Praveen,

Along with the Tika core and parser jars, did you run mvn dependency:copy-dependencies to generate all the dependencies too?

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 4:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

I seem to have mixed results. Here is what I did: I copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into contrib/extraction/lib (removing the old ones, of course), as well as into web-inf/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but from others it extracts either no content or only a single line. For example, /docs/Installing Solr in Tomcat.pdf still shows no content. I have two other PDFs for which it extracts only one line of content.

Also, I'm now getting a single value in the 'title' field for some PDFs, and two for others. Where it can extract the full content, it shows as title what I gave as a literal while submitting the PDF. For PDFs where no content was extracted, it shows one empty title and mine. For PDFs where it extracted only one line of content, it shows that line as a title too, along with mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the xml. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
RE: Problem with pdf, upgrading Cell
Praveen,

I am indeed using a trunk version from last week's svn, I think. You could always try a version from the Hudson builds. I did not try this procedure with Solr's 1.4 release, though.

Marc
RE: Problem with pdf, upgrading Cell
org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xml-apis-1.0.b2.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xmlbeans-2.3.0.jar' to classloader
May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader

Thanks,
Sandhya

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 6:13 AM
Cc: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Little more info... Seems to be a classloading issue. The tests pass, but they aren't loading the Tika libraries via the Solr ResourceLoader, whereas the example is.

Marc, one thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will then work. Note, however, I haven't tried this.

On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:

I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME type.

-Grant

On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:

I'm investigating.

On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:

Hi,

Grant, I confirm what Praveen has said: any PDF I try does not work with the new Tika and SVN versions. :(

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Mon, 3 May 2010 13:05:24 +0530
Subject: RE: Problem with pdf, upgrading Cell

Hello,

Please let me know if anybody has figured a way out of this issue.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Friday, April 30, 2010 11:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Grant,

You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4 distribution. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata (i.e. stream_size, content_type), apart from my own literals, is indexed, and the content is missing.

On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:

Praveen and Marc,

Can you share the PDF (feel free to email my private email) that fails in Solr?

Thanks,
Grant

On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:

Hi,

Nope, I didn't get it to work... Just like you, command line version of tika extracts correctly the content
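Grant's classloading diagnosis in the thread above can be probed with a small check of whether a given class is visible from a given class loader. This is a generic Java sketch, not something taken from Solr's code: the class and method names are made up, and org.apache.tika.parser.pdf.PDFParser is used only as a plausible default class name to probe.

```java
class ClassVisibility {
    // Returns true when className can be resolved through the given loader.
    // The class is not initialized; only its visibility is tested.
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default probe target is an assumption; pass another name as args[0].
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.pdf.PDFParser";
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        ClassLoader own = ClassVisibility.class.getClassLoader();
        System.out.println("context loader sees " + name + ": " + isVisible(name, context));
        System.out.println("own loader sees " + name + ": " + isVisible(name, own));
    }
}
```

If a parser class is visible from one loader but not from the one that performs the parser lookup, that would be consistent with the symptom described in the thread, where the MIME type is detected but an empty parser is selected. Whether this is what actually happens inside Solr is not confirmed by the thread.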
RE: Problem with pdf, upgrading Cell
Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the xml. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
RE: Problem with pdf, upgrading Cell
was extracted, it shows one empty title and mine. For PDFs where it extracted only one line of content, it shows that line as a title too, along with mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the xml. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
RE: Problem with pdf, upgrading Cell
Hi,

Grant, I confirm what Praveen has said: any PDF I try does not work with the new Tika and SVN versions. :(

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Mon, 3 May 2010 13:05:24 +0530
Subject: RE: Problem with pdf, upgrading Cell

Hello,

Please let me know if anybody has figured a way out of this issue.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Friday, April 30, 2010 11:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Grant,

You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4 distribution. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata (i.e. stream_size, content_type), apart from my own literals, is indexed, and the content is missing.

On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:

Praveen and Marc,

Can you share the PDF (feel free to email my private email) that fails in Solr?

Thanks,
Grant

On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:

Hi,

Nope, I didn't get it to work... Just like you, the command-line version of Tika extracts the content correctly, but once included in Solr, no content is extracted. What I have tried so far:
- Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple schema: still no luck.
- Getting other versions compiled on Hudson (nightly builds) and testing them as well: still no extraction.

I sent a mail to the developers' mailing list, but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version in the svn.

Marc

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Problem with pdf, upgrading Cell
Hi,

Nope, I didn't get it to work... Just like you, the command-line version of Tika extracts the content correctly, but once included in Solr, no content is extracted. What I have tried so far:
- Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple schema: still no luck.
- Getting other versions compiled on Hudson (nightly builds) and testing them as well: still no extraction.

I sent a mail to the developers' mailing list, but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version in the svn.

Marc
RE: Problem with pdf, upgrading Cell
Okay, I've been digging a little through the Java code from the SVN, and it seems the load function inside the ExtractingDocumentLoader class does not receive the ContentStream (it is set to null...). Maybe I should send this to the developers' mailing list?

Marc

From: dekay...@hotmail.com
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell
Date: Fri, 23 Apr 2010 16:03:28 +0200

Seems like I'm not the only one with this no-extraction problem:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
Apparently he tried the same thing, building from the trunk and indexing a PDF, and no extraction occurred... Strange.

Marc G.
Problem with pdf, upgrading Cell
Hello,

I configured a Solr server to be able to extract data from various documents, including PDFs. Unfortunately, the data extraction fails on several PDFs. I have read around here that this may be due to the old Tika library being used?

I looked around and saw that the svn had a newer version, so I checked out the trunk and built it using ant dist and ant example. I then set up my schema in the newly built server, and inserted the library from the newly built Cell into the lib directory (in Solr's home). However, now all I get is a blank response... The indexing works, but it doesn't extract anything; only the literal values that I pass on are indexed.

Any help would be greatly appreciated!! :)

Thank you.
Marc Ghorayeb
RE: Problem with pdf, upgrading Cell
I'm launching it with the start.jar utility, and there doesn't seem to be anything weird in the console when I upload a PDF. Is there a way to output the console to a log file? The only log file that gets updated is one in the logs directory, and it seems to show only the input/output of the web requests (GETs and POSTs...). For example:

127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?show=schema&wt=json HTTP/1.1 200 21690
127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?wt=json HTTP/1.1 200 780
127.0.0.1 - - [23/Apr/2010:13:06:57 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Clucidworks-solr-refguide-1.4.pdf&literal.title=lucidworks-solr-refguide-1.4.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Flucidworks-solr-refguide-1.4.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:06:58 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cmysql-proxy-en.pdf&literal.title=mysql-proxy-en.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fmysql-proxy-en.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:06:59 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cpython-cheat-sheet-v1.pdf&literal.title=python-cheat-sheet-v1.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fpython-cheat-sheet-v1.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/schema.jsp HTTP/1.1 200 26395
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/jquery-1.2.3.min.js HTTP/1.1 304 0

I don't think that's going to help much :)

Date: Fri, 23 Apr 2010 06:04:34 -0700
From: otis_gospodne...@yahoo.com
Subject: Re: Problem with pdf, upgrading Cell
To: solr-user@lucene.apache.org

Marc, got anything in your logs?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Marc Ghorayeb dekay...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Fri, April 23, 2010 8:42:53 AM
Subject: Problem with pdf, upgrading Cell

Hello,

I configured a Solr server to be able to extract data from various documents, including PDFs. Unfortunately, the data extraction fails on several PDFs. I have read around here that this may be due to the old Tika library being used?

I looked around and saw that the svn had a newer version, so I checked out the trunk and built it using ant dist and ant example. I then set up my schema in the newly built server, and inserted the library from the newly built Cell into the lib directory (in Solr's home). However, now all I get is a blank response... The indexing works, but it doesn't extract anything; only the literal values that I pass on are indexed.

Any help would be greatly appreciated!! :)

Thank you.
Marc Ghorayeb
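On the question above about sending console output to a log file: assuming the server logs through java.util.logging (an assumption about this setup, not something the thread confirms), the mechanism can be sketched as attaching a FileHandler to a logger. The class name, method name, and the file name solr-console.log are hypothetical.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class FileLoggingSketch {
    // Attaches a handler that appends plain-text log records to the given file.
    static FileHandler attachFileHandler(Logger logger, String path) throws IOException {
        FileHandler handler = new FileHandler(path, true); // true = append
        handler.setFormatter(new SimpleFormatter());       // same two-line format as the console
        handler.setLevel(Level.ALL);                        // let the logger's level decide
        logger.addHandler(handler);
        return handler;
    }

    public static void main(String[] args) throws IOException {
        // The root logger ("") receives records from all child loggers.
        Logger root = Logger.getLogger("");
        FileHandler handler = attachFileHandler(root, "solr-console.log");
        Logger.getLogger("org.example.demo").info("this line goes to the console and to the file");
        handler.close();
    }
}
```

For a running server, the same effect is usually achieved without code by pointing the JVM at a logging.properties file through the java.util.logging.config.file system property and configuring a FileHandler there.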
RE: Problem with pdf, upgrading Cell
Seems like I'm not the only one with this no-extraction problem:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
Apparently he tried the same thing, building from the trunk and indexing a PDF, and no extraction occurred... Strange.

Marc G.

From: dekay...@hotmail.com
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell
Date: Fri, 23 Apr 2010 15:12:39 +0200

I'm launching it with the start.jar utility, and there doesn't seem to be anything weird in the console when I upload a PDF. Is there a way to output the console to a log file? The only log file that gets updated is one in the logs directory, and it seems to show only the input/output of the web requests (GETs and POSTs...). For example:

127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?show=schema&wt=json HTTP/1.1 200 21690
127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?wt=json HTTP/1.1 200 780
127.0.0.1 - - [23/Apr/2010:13:06:57 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Clucidworks-solr-refguide-1.4.pdf&literal.title=lucidworks-solr-refguide-1.4.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Flucidworks-solr-refguide-1.4.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:06:58 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cmysql-proxy-en.pdf&literal.title=mysql-proxy-en.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fmysql-proxy-en.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:06:59 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cpython-cheat-sheet-v1.pdf&literal.title=python-cheat-sheet-v1.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fpython-cheat-sheet-v1.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/schema.jsp HTTP/1.1 200 26395
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/jquery-1.2.3.min.js HTTP/1.1 304 0

I don't think that's going to help much :)

Date: Fri, 23 Apr 2010 06:04:34 -0700
From: otis_gospodne...@yahoo.com
Subject: Re: Problem with pdf, upgrading Cell
To: solr-user@lucene.apache.org

Marc, got anything in your logs?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Marc Ghorayeb dekay...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Fri, April 23, 2010 8:42:53 AM
Subject: Problem with pdf, upgrading Cell

Hello,

I configured a Solr server to be able to extract data from various documents, including PDFs. Unfortunately, the data extraction fails on several PDFs. I have read around here that this may be due to the old Tika library being used?

I looked around and saw that the svn had a newer version, so I checked out the trunk and built it using ant dist and ant example. I then set up my schema in the newly built server, and inserted the library from the newly built Cell into the lib directory (in Solr's home). However, now all I get is a blank response... The indexing works, but it doesn't extract anything; only the literal values that I pass on are indexed.

Any help would be greatly appreciated!! :)

Thank you.
Marc Ghorayeb
RE: Problem with pdf, upgrading Cell
Seems like I'm not the only one with this no-extraction problem:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
Apparently he tried the same thing, building from the trunk and indexing a PDF, and no extraction occurred... Strange.

Marc G.
RE: Problem with pdf, upgrading Cell
PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming searc...@105585dc main from searc...@2efeecca main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming searc...@105585dc main from searc...@2efeecca main documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to searc...@105585dc main
Apr 23, 2010 5:47:14 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Apr 23, 2010 5:47:14 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher searc...@105585dc main
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing searc...@2efeecca main
  fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {optimize=} 0 46
Apr 23, 2010 5:47:14 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={optimize=true&waitSearcher=true&maxSegments=1&waitFlush=true&wt=javabin&version=1} status=0 QTime=46

Date: Fri, 23 Apr 2010 08:03:14 -0700
From: otis_gospodne...@yahoo.com
Subject: Re: Problem with pdf, upgrading Cell
To: solr-user@lucene.apache.org

Marc,

These are your request logs. You want to look at your Solr logs.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Marc Ghorayeb dekay...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Fri, April 23, 2010 9:12:39 AM
Subject: RE: Problem with pdf, upgrading Cell

I'm launching it with the start.jar utility, and there doesn't seem to be anything weird in the console when I upload a PDF. Is there a way to output the console to a log file? The only log file that gets updated is one in the logs directory, and it seems to show only the input/output of the web requests (GETs and POSTs...). For example:

127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?show=schema&wt=json HTTP/1.1 200 21690
127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?wt=json HTTP/1.1 200 780
127.0.0.1 - - [23/Apr/2010:13:06:57 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Clucidworks-solr-refguide-1.4.pdf&literal.title=lucidworks-solr-refguide-1.4.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Flucidworks-solr-refguide-1.4.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200