Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-06-21 Thread Andreas Kemkes
We are successfully extracting PDF content with Solr 3.1 and Tika 0.9.

Replace
fontbox-1.3.1.jar jempbox-1.3.1.jar pdfbox-1.3.1.jar tika-core-0.8.jar 
tika-parsers-0.8.jar 

with
 
fontbox-1.4.0.jar jempbox-1.4.0.jar pdfbox-1.4.0.jar tika-core-0.9.jar 
tika-parsers-0.9.jar 

I'm not entirely certain whether a recompile of Solr was necessary or not.
Andreas




From: Surendra <csnsha...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tue, June 21, 2011 5:18:31 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

Hi Andreas
I tried Solr 3.1 as well as 3.2... I was not able to overcome these issues with
the newer versions either. I need attr_content:* to return results (with 1.4.1
this is successful), which is not happening. It indexes well in 3.1, but in 3.2
I have the following issue:
Invalid version or the data in not in 'javabin' format
--Surendra

Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-06-20 Thread Andreas Kemkes
I've unsuccessfully attempted to go down this road - there are API changes,
some of which I was able to resolve by taking code snippets from Solr 3.1.
Some extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9
- some tests not passing' in the archive).  Ultimately, I decided that the then
newly released Solr 3.1 was the less rocky route.  Not sure if that is an
option for you.

Andreas




From: Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Mon, June 20, 2011 7:18:34 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

Hi Surendra,

On Jun 20, 2011, at 4:59 AM, Surendra wrote:

 Hey Chris
 
 I have added tika-core 0.9 and tika-parsers 0.9 to Solr 1.4.1 (extraction/lib)
 after building them from the source provided by Tika. Now I have an issue with
 this. I am extracting PDF content using Solr. I have added fmap.content to the
 configurable params as attr_content, where I can see the entire extracted
 document. After the Tika update I am no longer able to see attr_content
 appearing in the search results. When I restore the old 0.4 Tika jars, the
 attr_content appears again. I didn't find any exceptions in the console. Is
 this a known behavior that someone has faced already? Can you guide me to
 resolve this?

I don't think you can simply add new tika-core-0.9 and tika-parsers-0.9 jars to
extraction/lib -- I think you'll need to replace the set of prior Tika jars in
there. Have a look here to see which jars you would need to replace, HTH:

http://tika.apache.org/0.9/gettingstarted.html

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

Re: TikaEntityProcessor

2011-04-20 Thread Andreas Kemkes
I went unsuccessfully down this path - too many incompatibilities among
versions - some code changes and recompiling were required.  See also the
thread 'Solr 1.4.1 and Tika 0.9 - some tests not passing' for remaining issues.
You'll have better luck with the newer Solr 3.1 release, which already uses
Tika 0.8 - still re-compiled from code (no changes as far as I remember) - I
never tried the library replacement - don't think it's possible.

Andreas  




From: firdous_kind86 <naturelov...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Wed, April 20, 2011 12:38:02 AM
Subject: Re: TikaEntityProcessor

hi, i asked that :)

didn't get that... what dependencies?

i am using solr 1.4 and tika 0.9

i replaced tika-core 0.9 and tika-parsers 0.9 at /contrib/extraction/lib
and also replaced the old version of dataimporthandler-extras with
apache-solr-dataimporthandler-extras-3.1.0.jar

but still the same problem..

someone pointed me to bug SOLR-2116, but i guess it is only for solr-3.1

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-tp2839188p2841936.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-04-01 Thread Andreas Kemkes
Thank you.  That is valuable guidance.

In light of the recent release of Solr 3.1, I decided to first try that 
distribution, as it already uses Tika 0.8, which is much closer to my target.

Out of the box (i.e., w/o replacing the Tika and PDFBox libraries) the tests 
pass, yet I see the error below.  When I change

ignoreException("unknown field 'a'");

to 

ignoreException("unknown field 'meta'");

in the testDefaultField test, the error output goes away.

I am wondering if that particular error is expected, or whether the error
should in fact be "unknown field 'a'" and I'm only masking an issue with the
change.

All extraction tests also pass after I replace the Tika and PDFBox libraries
with the newer versions.

-- Andreas

test:
[junit] Testsuite: org.apache.solr.handler.ExtractingRequestHandlerTest
[junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 6.424 sec
[junit] 
[junit] - Standard Error -
[junit] 01/04/2011 22:49:59 org.apache.solr.common.SolrException log
[junit] SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'meta'
[junit] at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:321)
[junit] at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
[junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
[junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
[junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
[junit] at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
[junit] at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
[junit] at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
[junit] at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:337)
[junit] at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:373)
[junit] at org.apache.solr.handler.ExtractingRequestHandlerTest.testDefaultField(ExtractingRequestHandlerTest.java:156)






From: Chris Hostetter <hossman_luc...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thu, March 31, 2011 7:19:05 PM
Subject: Re: Solr 1.4.1 and Tika 0.9 - some tests not passing


: I'm still interested in what steps I could take to get to the bottom of the 
: failing tests.  Is there additional information that I should provide?

i'm not really up to speed on what might have changed in Tika 0.9 to cause
this, but the best thing to do would probably be to look at what *does*
work compared to what doesn't work.

if *none* of the asserts dealing with an html doc work, that suggests
that fundamentally something is just completely broken about the html
parsing.

Consider this first assertion failure...

: assertQ(req("title:Welcome"), "//*[@numFound='1']");

...in the context of what you said tika 0.9 gives you for that doc on the 
command line...

: $ java -jar tika-app-0.9.jar 
: ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

...
: <title>Welcome to Solr</title>

...if that basic little bit of info can't be extracted, then i'm guessing 
nothing is being extracted.

I would suggest you run the example (with the 0.9 tika jars) and manually 
attempt to index one document, and then use the schema browser to see 
exactly what gets indexed.

you may need to experiment with tweaking the config options for the 
extraction handler.

-Hoss


Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-03-28 Thread Andreas Kemkes
I'm still interested in what steps I could take to get to the bottom of the 
failing tests.  Is there additional information that I should provide?

Some of the output below got mangled in the email - here are the (hopefully) 
complete lines:

This has a <a shape="rect" href="http://www.apache.org">link</a>. (Tika 0.9)
This has a <a href="http://www.apache.org">link</a>. (Tika 0.4)




From: Andreas Kemkes <a5s...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Tue, March 22, 2011 10:30:57 AM
Subject: Solr 1.4.1 and Tika 0.9 - some tests not passing

Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like
to upgrade it to Tika 0.9, as the issues do not occur in Tika 0.9.

With the changes we made to Solr 1.4.1, we can successfully index the
previously failing PDF documents.

Unfortunately we cannot get the HTML-related tests to pass.

The following asserts in ExtractingRequestHandlerTest.java are failing:

assertQ(req("title:Welcome"), "//*[@numFound='1']");
assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
assertQ(req("t_href:http"), "//*[@numFound='2']");
assertQ(req("t_href:http"), "//doc[1]/str[.='simple3']");
assertQ(req("+id:simple4 +t_content:Solr"), "//*[@numFound='1']");
assertQ(req("defaultExtr:http\\://www.apache.org"), "//*[@numFound='1']");
assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
assertTrue(val + " is not equal to " + linkNews, val.equals(linkNews) == true); // there are two <a> tags, and they get collapsed

Below are the differences in output from Tika 0.4 and Tika 0.9 for simple.html.

Tika 0.9 has additional meta tags, a shape attribute, and some additional
whitespace.  Is this what throws it off?

What do we need to consider so that Solr 1.4.1 will process the Tika 0.9 output 
correctly?

Do we need to configure different filters and tokenizers?  Which ones?

Or is it something else entirely?

Thanks in advance for any help,

Andreas

$ java -jar tika-app-0.4.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

<?xml version="1.0" encoding="UTF-8"?>
<head>
<title>Welcome to Solr</title>
</head>
<body>
<p>
  Here is some text
</p>

Here is some text in a div
This has a <a href="http://www.apache.org">link</a>.


</body>
</html>

$ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html

<?xml version="1.0" encoding="UTF-8"?>
<head>
<meta name="Content-Length" content="209"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="Content-Type" content="text/html"/>
<meta name="resourceName" content="simple.html"/>
<title>Welcome to Solr</title>
</head>
<body>
<p>
  Here is some text
</p>

Here is some text in a div

This has a <a shape="rect" href="http://www.apache.org">link</a>.

</body>
</html>


Re: Omit hour-min-sec in search?

2011-03-06 Thread Andreas Kemkes
How about [YYYY-MM-DDThh:mm:ssZ/DAY TO YYYY-MM-DDThh:mm:ssZ+1DAY/DAY]?  For
example: mydate:[2011-03-06T00:00:00Z/DAY TO 2011-03-06T00:00:00Z+1DAY/DAY].
See DateField.html in your Solr API documentation for more.

Andreas




From: Jan Høydahl <jan@cominvent.com>
To: solr-user@lucene.apache.org
Sent: Sun, March 6, 2011 1:40:59 PM
Subject: Re: Omit hour-min-sec in search?

 Not sure if there is a means of doing explicitly what you ask, but you
 could do a date range:
 
 +mydate:[YYYY-MM-DD 0:0:0 TO YYYY-MM-DD 11:59:59]

This would not work. It has to be in the YYYY-MM-DDT00:00:00Z format.

But I agree that it would be handy if the DateField could support a date-only
format: mydate:[YYYY-MM-DD TO YYYY-MM-DD]
It could simply default to midnight UTC.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
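
A client-side sketch of that suggestion in Java - expanding a date-only input
into the full format DateField requires, defaulting to midnight UTC. The field
name and dates are hypothetical:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateOnlyQuery {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd");
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date from = in.parse("2011-03-06");  // user-supplied date-only values
        Date to = in.parse("2011-03-07");
        // both endpoints expanded to midnight UTC, as DateField expects
        System.out.println("mydate:[" + out.format(from) + " TO " + out.format(to) + "]");
    }
}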


  

Re: More Date Math: NOW/WEEK

2011-03-05 Thread Andreas Kemkes
Thank you for the clarification.

Personally, I believe it is correct for a week to start in a different
month/year, and it is certainly what I would expect.  As you pointed out, these
time units don't form a strictly ordered set (... year > month > day ...,
week > day ...).

Complications arise from the different notions of what the first day of the
week is (Sunday - US and Canada, Monday - Europe and ISO 8601, Saturday -
Middle East).  This is handled by the locale, I think.

Further complications are introduced by week numbering, but I don't think this
applies here (http://en.wikipedia.org/wiki/Seven-day_week#Week_numbering).

Both MySQL
(http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek)
and Postgres have the notion of weeks.

All this ignores complications of 5-day or 6-day weeks, which were used in
Russia during certain parts of the last century.  There might be other
historical cases or even current ones, but, like you, I believe a definition
like "A week is a time unit equal to seven days." is commonly accepted.

But maybe you are correct and this is special logic that belongs in the client.

Regards,

Andreas




From: Chris Hostetter <hossman_luc...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Tue, March 1, 2011 6:30:26 PM
Subject: Re: More Date Math: NOW/WEEK

: Digging into the source code of DateMathParser.java, I found the following 
: comment:
:    99   // NOTE: consciously choosing not to support WEEK at this time,
:   100   // because of complexity in rounding down to the nearest week
:   101   // arround a month/year boundry.
:   102   // (Not to mention: it's not clear what people would *expect*)
: 
: I was able to implement a work-around in my ruby client using the following 
: pseudo code:
:   wd=NOW.wday; NOW-#{wd}DAY/DAY

the main issue that comment in DateMathParser.java is referring to is the
ambiguity of what should happen when you try to do something like
2009-01-02T00:00:00Z/WEEK

WEEK would be the only unit where rounding changed a unit *larger* than the
one you rounded on -- ie: rounding on day only affects hours, minutes,
seconds, millis; rounding on month only affects days, hours, minutes, seconds,
millis; but in an example like the one above, where Jan 2 2009 was a friday,
rounding down a week (using logic similar to what you have) would result in
2008-12-28T00:00:00Z -- changing the month and year.

It's not really clear that that is what people would expect -- i'm
guessing at least a few people would expect it to stop at the 1st of the
month.

the ambiguity of what behavior makes the most sense is why i never got
around to implementing it -- it's certainly possible, but the
various options seemed too confusing to really be very generally useful
and easy to understand.

as you point out: people who really want special logic like this (and know
how they want it to behave) have an easy workaround by evaluating NOW
in the client, since every week has exactly seven days.



-Hoss
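
For the archive, here is that client-side workaround rendered in Java - a
sketch that assumes weeks start on Sunday and queries a hypothetical mydate
field:

import java.util.Calendar;
import java.util.TimeZone;

public class WeekRounding {
    public static void main(String[] args) {
        Calendar now = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        // days elapsed since the start of the week (Sunday here)
        int wd = now.get(Calendar.DAY_OF_WEEK) - Calendar.SUNDAY;
        // equivalent of the ruby pseudo code: wd=NOW.wday; NOW-#{wd}DAY/DAY
        String weekStart = "NOW-" + wd + "DAY/DAY";
        System.out.println("mydate:[" + weekStart + " TO NOW]");
    }
}

Note the small race this inherits: the client's weekday and the server's NOW
are evaluated at slightly different moments, which only matters right around
midnight at a week boundary.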



  

Re: Tika metadata extracted per supported document format?

2011-02-28 Thread Andreas Kemkes
Chris:

Yes, I only see the output below.

I'm familiar with the information in
http://wiki.apache.org/solr/ExtractingRequestHandler, except for the
tika.config part, which I haven't touched.

Even when running documents through Tika directly, the metadata output is
highly dependent on what metadata the document contains (obviously).  I haven't
found the right place in the Tika source code yet either.  Would digging into
POI, PDFBox, ... help me any further in my pursuit?  A matrix that lists the
complete set of metadata for the most popular formats would sure be helpful to
me.  I would help provide it, if properly directed.

Thanks,

Andreas

PS: I've also noticed some differences in the date formats being used (using
version 0.9).  Is that something I should be concerned about when using it
through SolrCell?

<meta name="Creation-Date" content="Mon May 17 10:10:15 PDT 2010"/> (from a Word document)
<meta name="Creation-Date" content="2011-01-03T18:45:50Z"/> (from a PDF)
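
One way to build the metadata matrix asked about above is empirical: run each
sample document through Tika and dump whatever comes back. A minimal sketch
against the Tika 0.9-era API (pass the document as the first argument):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PrintMetadata {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        InputStream in = new FileInputStream(args[0]);
        try {
            // parse only for the side effect of populating the metadata
            new AutoDetectParser().parse(in, new BodyContentHandler(), metadata,
                    new ParseContext());
        } finally {
            in.close();
        }
        for (String name : metadata.names()) {  // one line per met key
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}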





From: Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Fri, February 25, 2011 4:11:00 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

 java -jar tika-app-0.9.jar --list-met-models
 TikaMetadataKeys
 PROTECTED
 RESOURCE_NAME_KEY
 TikaMimeKeys
 MIME_TYPE_MAGIC
 TIKA_MIME_FILE
 
 Both 0.8 and 0.9 give me the same list.  Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

 
 I'm a bit unclear if that gets me to what I was looking for - metadata 
 like content_type or last_modified.  Or am I confusing Tika metadata 
 with SolrCell metadata?
 
 I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a
configuration for the ExtractingRequestHandler that remaps the field names from
Tika to Solr. So that's probably where it's coming from. Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++


  

Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hello,

I've asked this on the Tika mailing list w/o an answer, so apologies for 
cross-posting.

I'm trying to find information that tells me specifically what metadata is
provided for the different supported document formats.  Unfortunately all I was
able to find so far is "The Metadata produced depends on the type of document
submitted."

Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), so
I'm particularly interested in that version, but also in changes that are
provided in newer versions of Tika.

Where are the best places to look for such information?

Thanks in advance,

Andreas


  

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hi Chris,

Thank you so much - that's a great start.

Andreas




From: Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Cc: "u...@tika.apache.org" <u...@tika.apache.org>
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-version.jar --list-met-models

And get a printout of the met keys that Tika supports. Some parsers add their
own that aren't part of this met listing, but this is a relatively
comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

 Hello,
 
 I've asked this on the Tika mailing list w/o an answer, so apologies for 
 cross-posting.
 
 I'm trying to find information that tells me specifically what metadata is 
 provided for the different supported document formats.  Unfortunately all I 
 was able to find so far is "The Metadata produced depends on the type of 
 document submitted."
 
 Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), 
 so I'm particularly interested in that version, but also in changes that are 
 provided in newer versions of Tika.
 
 Where are the best places to look for such information?
 
 Thanks in advance,
 
 Andreas
 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++


  

Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-02-25 Thread Andreas Kemkes
According to the Tika release notes, it's fixed in 0.9.  Haven't tried it
myself.

"A critical backwards incompatible bug in PDF parsing that was introduced in
Tika 0.8 has been fixed." (TIKA-548)

Andreas




From: Darx Oman <darxo...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Fri, February 25, 2011 10:33:39 AM
Subject: Re: upgrading to Tika 0.9 on Solr 1.4.1

hi
if you want to index pdf files then use tika 0.6,
because 0.7 and 0.8 do not detect the pdfParser correctly



  

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hi Chris,

java -jar tika-app-0.9.jar --list-met-models
TikaMetadataKeys
 PROTECTED
 RESOURCE_NAME_KEY
TikaMimeKeys
 MIME_TYPE_MAGIC
 TIKA_MIME_FILE

Both 0.8 and 0.9 give me the same list.  Is that a configuration issue?

I'm a bit unclear if that gets me to what I was looking for - metadata 
like content_type or last_modified.  Or am I confusing Tika metadata 
with SolrCell metadata?

I thought SolrCell metadata comes from Tika, or does it not?

Regards,

Andreas




From: Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Cc: "u...@tika.apache.org" <u...@tika.apache.org>
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-version.jar --list-met-models

And get a printout of the met keys that Tika supports. Some parsers add their
own that aren't part of this met listing, but this is a relatively
comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

 Hello,
 
 I've asked this on the Tika mailing list w/o an answer, so apologies for 
 cross-posting.
 
 I'm trying to find information that tells me specifically what metadata is 
 provided for the different supported document formats.  Unfortunately all I 
 was able to find so far is "The Metadata produced depends on the type of 
 document submitted."
 
 Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), 
 so I'm particularly interested in that version, but also in changes that are 
 provided in newer versions of Tika.
 
 Where are the best places to look for such information?
 
 Thanks in advance,
 
 Andreas
 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++


  

Re: Date Math

2011-02-23 Thread Andreas Kemkes
Thank you, that clarifies it.  Good catch on -DAY.  I had noticed it after
submitting, but as -1DAY causes the same ParseException, I didn't amend the
question.

Andreas




From: Chris Hostetter <hossman_luc...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Tue, February 22, 2011 6:18:56 PM
Subject: Re: Date Math


: org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY':

...
: Are they not supported as a short-cut for NOW-1DAY?  I'm using Solr 1.4.

No, "-1DAY" is a valid DateMath string (to the DateMathParser), but as a
field value you must specify a valid date string, which can *end* with a
DateMath string.  So "NOW-1DAY" is legal, as is
"2011-02-22T12:34:56Z-1DAY".

Note also: you didn't do "-1DAY", you tried "-DAY", which isn't valid
anywhere.


-Hoss
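
To make the distinction concrete, a small sketch against Solr's
org.apache.solr.util.DateMathParser (the class discussed in the NOW/WEEK
thread below): "-1DAY" is valid math on its own, but a field value in a query
needs a date or NOW in front of it:

import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import org.apache.solr.util.DateMathParser;

public class DateMathExample {
    public static void main(String[] args) throws Exception {
        DateMathParser p = new DateMathParser(TimeZone.getTimeZone("UTC"), Locale.US);
        p.setNow(new Date());
        System.out.println(p.parseMath("-1DAY"));  // now minus one day
        System.out.println(p.parseMath("/DAY"));   // now rounded down to midnight
        // As a field value, the math must follow a date: last_modified:[NOW-1DAY TO NOW]
    }
}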



  

More Date Math: NOW/WEEK

2011-02-23 Thread Andreas Kemkes
Date Math is great.
NOW/MONTH and NOW/DAY are all working and very useful, so naively I tried
NOW/WEEK, which failed.
Digging into the source code of DateMathParser.java, I found the following
comment:
   99   // NOTE: consciously choosing not to support WEEK at this time,
  100   // because of complexity in rounding down to the nearest week
  101   // arround a month/year boundry.
  102   // (Not to mention: it's not clear what people would *expect*)

I was able to implement a work-around in my ruby client using the following
pseudo code:
  wd=NOW.wday; NOW-#{wd}DAY/DAY
This could be extended and integrated into DateMathParser.java directly using
something like the following mapping (see the sketch below):
  valWEEKS --> (val*7)DAYS
  date/WEEK --> (date-(date.DAY_OF_WEEK)DAYS)/DAY
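
A rough Java sketch of that proposed rounding - a hypothetical extension for
illustration, not something DateMathParser actually supports:

import java.util.Calendar;

public class WeekMath {
    /** Rounds cal down to the locale's first day of the week, at midnight. */
    static void roundToWeek(Calendar cal) {
        int wd = cal.get(Calendar.DAY_OF_WEEK) - cal.getFirstDayOfWeek();
        if (wd < 0) wd += 7;                  // wrap when the week starts mid-cycle
        cal.add(Calendar.DAY_OF_MONTH, -wd);  // date - (date.DAY_OF_WEEK)DAYS
        cal.set(Calendar.HOUR_OF_DAY, 0);     // then .../DAY
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
    }
}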
What other concerns are there to consider?
Andreas



  

Re: Index Design Question

2011-02-18 Thread Andreas Kemkes
Thank you.  These are good general suggestions.

Regarding the optimization for indexing vs. querying: are there any specific
recommendations for each of those cases available somewhere?  A link, for
example, would be fabulous.

I'm also still curious about solutions that go further.

For example, there is a 2007 Lucene Overview presentation by Aaron Bannert
claiming that "Lucene provides built-in methods to allow queries to span
multiple remote Lucene indexes." and "A much more involved way to achieving
high levels of update performance can be had by dividing the data into
separate “columns”, or “silos”. Each column will hold a subset of the overall
data, and will only receive updates for data that it controls. By taking
advantage of the remote index merging query utility mentioned on an earlier
slide, the data can still be searched in its entirety without any loss of
accuracy and with negligible performance impact."

Is this possible using Solr?  How could this be accomplished?  Again, any link 
would be fabulous.
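
For what it's worth, the closest Solr analogue of that remote-index query
fan-out is distributed search via the shards request parameter - a hedged
SolrJ sketch with hypothetical host names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://search1:8983/solr");
        SolrQuery q = new SolrQuery("title:Welcome");
        // fan the query out over two "silos"; Solr merges the results
        q.setParam("shards", "search1:8983/solr,search2:8983/solr");
        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}

Each shard stays an independently updatable index, which is roughly the
"columns"/"silos" scheme described above.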

The wiki page http://wiki.apache.org/solr/MergingSolrIndexes seems to describe
a somewhat different approach to merging.

Is this something that could be integrated into master/slave replication by
having two masters and one merged slave (in the above sense of separate
“columns”, or “silos”)?

If yes, what are the performance considerations when using it?


  

Date Math

2011-02-17 Thread Andreas Kemkes
The SolrQuerySyntax Wiki page refers to DateMathParser for examples.

When I tried -1DAY, I got:

org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY':
Encountered " "-" "- "" at line 1, column 14.
Was expecting one of: "(" ... "*" ... <QUOTED> ... <TERM> ... <PREFIXTERM> ...
<WILDTERM> ... "[" ... "{" ... <NUMBER> ...

Are they not supported as a short-cut for NOW-1DAY?  I'm using Solr 1.4.


  

Index Design Question

2011-02-17 Thread Andreas Kemkes
We are indexing documents with several associated fields for search and
display, some of which may change with a much higher frequency than the
document content.  As per my understanding, we have to resubmit the entire
gamut of fields with every update.

If the reindexing of the documents becomes a performance bottleneck, what
design alternatives are there within Solr?

Thanks in advance for your contributions.


  

Controlling Tika's metadata

2011-01-28 Thread Andreas Kemkes
Just getting my feet wet with text extraction, using both the schema and
solrconfig settings from the example directory in the 1.4 distribution, so I
might be missing something obvious.

Trying to provide my own title (and discarding the one received through Tika's
metadata) wasn't straightforward.  I had to use the following:

fmap.title=tika_title (to discard the Tika title)
literal.attr_title=New Title (to provide the correct one)
fmap.attr_title=title (to map it back to the field, as I would like to use
title in searches)

Is there anything easier than the above?

How can this best be generalized to other metadata provided by Tika (which in 
our use case will be mostly ignored, as it is provided separately)?

Thanks in advance for your responses.