Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Sixten Otto
On Fri, Jun 18, 2010 at 2:42 PM, Chris Hostetter wrote:
> I'm confused ... You're using DIH, and some of your fields are URLs to
> documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?

Yeah, I can definitely vouch that DIH can handle this without
additional coding. (The Lucid article the OP linked to looks like it's
defining a custom Transformer because the document is in a BLOB in the
database.)

However, the DIH in Solr 1.4 doesn't have the Tika support you'd need.
You would need to go with either trunk or branch_3x to make this work.

Sixten


Re: HOWTO get a working copy of SOLR?

2010-06-15 Thread Sixten Otto
On Tue, Jun 15, 2010 at 12:58 AM, Bernd Fehling wrote:
> - changed to SOLR branch_3x. Installs fine, runs fine, luke works fine but
>  the extraction with /update/extract (ExtractingRequestHandler) only replies
>  the metadata but not the content.

Sounds like https://issues.apache.org/jira/browse/SOLR-1902

Sixten


Re: Tomcat startup script

2010-06-09 Thread Sixten Otto
On Tue, Jun 8, 2010 at 4:18 PM,   wrote:
> The following should work on centos/redhat, don't forget to edit the paths,
> user, and java options for your environment. You can use chkconfig to add it
> to your startup.

Thanks, Colin.

Sixten


Re: TikaEntityProcessor on Solr 1.4?

2010-06-08 Thread Sixten Otto
2010/5/22 Noble Paul നോബിള്‍  नोब्ळ् :
> just copy the dih-extras jar file from the nightly should be fine

Now that I've finally got a server on which to attempt to set these
things up... this turns out not to be a viable solution. The extras
jar does contain the TikaEntityProcessor class, but NOT the
BinFileDataSource and BinURLDataSource on which it depends. I tried
both replacing the 1.4 DIH jar with the one from the trunk, and adding
those two specific classes to the extras jar, neither of which worked.
(And I apologize, but I didn't copy down the exceptions involved; if I
can find some free time, I might go back and make the attempt again, a
bit more methodically.)
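
One way to see which classes a given jar actually ships before swapping
it in is to list its entries. A minimal Python sketch, assuming only
that a jar is an ordinary zip archive; the helper name and the jar file
name in the usage comment are made up for illustration:

```python
import zipfile
from typing import Dict, Iterable


def classes_present(jar, simple_names: Iterable[str]) -> Dict[str, bool]:
    """Report which simple class names have a .class entry in the jar.

    `jar` may be a filesystem path or a file-like object.
    Matching is by simple name only, so the package is not assumed.
    """
    with zipfile.ZipFile(jar) as archive:
        entries = archive.namelist()
    return {
        name: any(entry.endswith(f"/{name}.class") for entry in entries)
        for name in simple_names
    }


# Hypothetical usage (the jar file name is a placeholder):
# classes_present("dih-extras.jar",
#                 ["TikaEntityProcessor", "BinFileDataSource", "BinURLDataSource"])
```

This only shows what is packaged where; it says nothing about whether
the classes are binary-compatible with the 1.4 core.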

Sixten


Re: Tomcat startup script

2010-06-08 Thread Sixten Otto
On Tue, Jun 8, 2010 at 11:00 AM, K Wong wrote:
> Okay. I've been running multicore Solr 1.4 on Tomcat 5.5/OpenJDK 6
> straight out of the centos repo and I've not had any issues. We're not
> doing anything wild and crazy with it though.

It's nice to know that the wiki's advice might be out of date. That
doesn't really help me with my immediate problem (lacking the script
the wiki is trying to provide), though, unless I want to rip out what
I've got and start over. :-/

Sixten


Re: Tomcat startup script

2010-06-08 Thread Sixten Otto
On Mon, Jun 7, 2010 at 9:23 PM, K Wong wrote:
> Did you install tomcat 5.5 from an RPM?

I did not, on the advice of that same Solr wiki article that manual
installation is "recommended because distribution Tomcats are either
old or quirky." There haven't been any issues with this, except that
the broken wiki is preventing me from getting to that script.

(FWIW, I also installed Tomcat 6.)

Sixten


Re: Tomcat startup script

2010-06-07 Thread Sixten Otto
On Mon, Jun 7, 2010 at 2:35 PM, Chris Hostetter wrote:
> there is currently a bug with the apache wiki and attachments...
> https://issues.apache.org/jira/browse/INFRA-2773

Glad to know it's not just me.

But does anyone have that script posted anywhere else?

Sixten


Tomcat startup script

2010-06-07 Thread Sixten Otto
So, looking at the wiki article on setting up Solr with Tomcat
(http://wiki.apache.org/solr/SolrTomcat), there's a link to an
attached init.d script for CentOS/RedHat/Fedora. Trouble is, the wiki
won't let me access it. Even after creating an account and logging in,
clicking on the link
(http://wiki.apache.org/solr/SolrTomcat?action=AttachFile&do=view&target=tomcat6)
gives me the error: "You are not allowed to do AttachFile on this
page."

Is that script posted elsewhere online anywhere?
Am I doing something obviously wrong in trying to access it?

Sixten


Re: How real-time are Solr/Lucene queries?

2010-05-26 Thread Sixten Otto
On Wed, May 26, 2010 at 11:30 AM, Thomas J. Buhr wrote:
> Basically, I need to know that issuing searches to a local index will not be 
> slower than searching a hashmap or array. How different or similar will the 
> performance be?

If you don't mind my asking... I'm still trying to understand why your
application isn't using something like a hashtable, as opposed to
Lucene. You've said that you have many very tiny pieces of data that
you're storing and looking up, and that you're not analyzing them very
much with Lucene.

You've said that you're looking up these values with Lucene queries,
but haven't said much about the kinds of queries you're using. Your
descriptions read to me like you know what specific things you're
finding, which makes me wonder why a Dictionary (in the abstract
sense) wouldn't work for what you're doing. What role is the search
engine playing that a simpler (and almost certainly faster and less
complicated) data store couldn't?
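
To make the comparison concrete, here is a minimal Python sketch of the
kind of exact-key lookup a plain dictionary handles. The keys and values
are invented for illustration and do not reflect your actual data:

```python
from typing import Optional

# Hypothetical records: small values keyed by an exact, known identifier.
RECORDS = {
    "item-001": "first value",
    "item-002": "second value",
    "item-003": "third value",
}


def lookup(key: str) -> Optional[str]:
    """Return the stored value for an exact key, or None if absent."""
    # Average-case constant time: no query parsing, analysis, or scoring.
    return RECORDS.get(key)


if __name__ == "__main__":
    print(lookup("item-002"))
    print(lookup("missing"))
```

If every lookup looks like this, a search engine adds overhead without
adding capability; it starts to pay off once queries need full-text
matching, ranking, or similar features.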

Perhaps elaborating on that might help folks on this list to better
address your questions about whether Solr/Lucene can meet your
requirements?

Sixten


Re: TikaEntityProcessor on Solr 1.4?

2010-05-21 Thread Sixten Otto
On Fri, May 21, 2010 at 5:30 PM, Chris Harris wrote:
> Actually, rather than cherry-pick just the changes from SOLR-1358 and
> SOLR-1583 what I did was to merge in all DataImportHandler-related
> changes from between the 1.4 release up through Solr trunk r890679
> (inclusive). I'm not sure if that's what would work best for you, but
> it's one option.

I'd rather, of course, not have to build my own. But if I'm going
to dabble in the source at all, it's just a slippery slope from the
former to the latter. :-)  (My main hesitation in doing so would be
that I'm new enough to the code that I have no idea what core changes
the trunk's DIH might also depend on. And my Java's pretty rusty.)

How did you arrive at your patch? Just grafting the entire
trunk/solr/contrib/dataimporthandler onto 1.4's code? Or did you go
through Jira/SVN looking for applicable changesets?

I'll be very interested to hear how your testing goes!

Sixten


Re: TikaEntityProcessor on Solr 1.4?

2010-05-21 Thread Sixten Otto
2010/5/19 Noble Paul നോബിള്‍  नोब्ळ् :
> I guess it should work because Tika Entityprocessor does not use any
> new 1.4 APIs
>
> On Wed, May 19, 2010 at 1:17 AM, Sixten Otto  wrote:
>> The TikaEntityProcessor class that enables DataImportHandler to
>> process business documents was added after the release of Solr 1.4,
>> ... Has anyone tried back-porting those changes to Solr 1.4?

Did you mean "new 1.5 APIs" (since TEP was added *after* 1.4 was
released)? Even then, that doesn't make a lot of sense to me, as at
least a couple of new things (the binary data sources) *were* added to
support TikaEntityProcessor.

I'm sorry if I'm being dense, but I'm having trouble understanding this answer.

Sixten


TikaEntityProcessor on Solr 1.4?

2010-05-18 Thread Sixten Otto
Sorry to repeat this question, but I realized that it probably
belonged in its own thread:

The TikaEntityProcessor class that enables DataImportHandler to
process business documents was added after the release of Solr 1.4,
along with some other changes (like the binary DataSources) to support
it. Obviously, there hasn't been an official release of Solr since
then. Has anyone tried back-porting those changes to Solr 1.4?

(I do see that the question was asked last month, without any
response: http://www.lucidimagination.com/search/document/5d2d25bc57c370e9)

The patches for these issues don't seem all that complex or pervasive,
but it's hard for me (as a Solr n00b) to tell whether this is really
all that's involved:
https://issues.apache.org/jira/browse/SOLR-1583
https://issues.apache.org/jira/browse/SOLR-1358

Sixten


Re: Which Solr to use?

2010-05-18 Thread Sixten Otto
On Tue, May 18, 2010 at 10:40 AM, Robert Muir wrote:
> Some discussions/voting happened and the trunk is intended to be ...
> more like a normal trunk.
>
> If you need features not in an official release, and are looking for a
> codebase with updated features, I would recommend instead considering:
>
> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/

So features are being actively added to / code rearranged in
trunk/4.0, with some of the work being back-ported to this branch to
form a stable 3.1 release? Is that accurate?

Is there any thinking about when that might drop (beyond the quite
understandable "when it's done")? Or, perhaps more reasonably, when it
might freeze?

(I've done some casual searching of the site + list archives without
finding this information, but by all means if there's a thread I
should go read to bone up on this stuff, a link is all I need.)

Sixten


Which Solr to use?

2010-05-17 Thread Sixten Otto
I've been investigating Solr on and off as a (or even the) search
solution for my employer's content management solution. One of the
biggest questions in my mind at this point is which version to go
with. In general, 1.4 would seem the obvious choice, as it's the only
released version on that list. There's a commercially supported distro
from Lucid, and things should presumably be pretty stable.

What led me down the rabbit hole is that a) we generally have quite a
lot of business documents to index (Word and PDF, mostly), and b) the
"pull" approach implemented in the DataImportHandler is much more
attractive in our architecture than the "push" model we'd otherwise
have to construct. Unfortunately, the TikaEntityProcessor and the
binary data sources on which it depends were added after 1.4 was
released.
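
For reference, the "pull" side amounts to asking Solr to run an import
that it performs itself. A minimal Python sketch of building that
request, assuming the DataImportHandler is registered at /dataimport
(the path is whatever solrconfig.xml says) and using a placeholder host
and port; the function names are made up for illustration:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dih_command_url(solr_base: str, command: str = "full-import",
                    handler: str = "/dataimport", **params: str) -> str:
    """Build the URL for a DataImportHandler command.

    `handler` is an assumption: it must match the request handler name
    configured in solrconfig.xml.
    """
    query = urlencode({"command": command, **params})
    return f"{solr_base.rstrip('/')}{handler}?{query}"


def run_dih_command(url: str) -> bytes:
    """Send the command to Solr and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder base URL; only builds and prints the URL, sends nothing.
    print(dih_command_url("http://localhost:8983/solr",
                          clean="true", commit="true"))
```

Solr then fetches and parses the documents on its own, which is what
makes this model attractive when the content already lives somewhere
Solr can reach.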

Back in early March, I was able to get things up and running with a
1.5 nightly (and Tika 0.7-snapshot), but since then the course of Solr
development has... changed significantly. The 1.5 branch has been
abandoned, and (to my uninformed eye) it seems that there's a lot of
upheaval in the trunk as things merge with Lucene. And it also appears
that the released Tika 0.7 might not be compatible with Solr? (Judging
by SOLR-1902, that is.)

What I'm looking for is some advice on what course to pursue:
- Plunge ahead with the trunk, and hope that things stabilize within a
few months, when we're aiming to go live on one of our biggest client
sites.
- Go with the last 1.5 code, knowing that the features we want are in
there, and hope we don't run into anything majorly broken.
- Stick with 1.4, and just accept having to push
content to the HTTP interface.
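
For the third option, "pushing" means our own code would POST each
document to the ExtractingRequestHandler. A minimal Python sketch of
what one such request might look like, assuming the handler is mapped at
/update/extract and using a placeholder host, port, and document id; the
function name is made up for illustration:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_base: str, doc_id: str, body: bytes,
                          content_type: str) -> Request:
    """Build a POST that sends one raw document to /update/extract.

    literal.id supplies the unique key for the resulting Solr document;
    commit=true makes it searchable immediately (costly if done per file).
    """
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base.rstrip('/')}/update/extract?{query}"
    return Request(url, data=body,
                   headers={"Content-Type": content_type}, method="POST")


if __name__ == "__main__":
    # Placeholder values; builds the request without sending it.
    request = build_extract_request("http://localhost:8983/solr", "doc1",
                                    b"%PDF-", "application/pdf")
    print(request.full_url)
    # To actually send it: urlopen(request).read()
```

The cost of this route is that finding, fetching, and tracking changed
documents becomes the job of the content management side.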

I don't expect a definitive answer, of course, but I'd like to be
better informed about the risks and benefits.

Also: does anyone have a sense whether it'd be possible to back-port
the TikaEntityProcessor stuff to 1.4?

Sixten