Re: Help with a DIH config file

2019-03-16 Thread Jörn Franke
You have to set recursive="true" on the entity that lists the files (the
FileListEntityProcessor entity).
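As a minimal sketch of where the attribute goes, assuming the common layout of a FileListEntityProcessor entity wrapping a TikaEntityProcessor entity (the entity names, baseDir, fileName pattern and target field names below are placeholders, not taken from the attached config):

```xml
<dataConfig>
  <!-- BinFileDataSource hands each file to Tika as a binary stream -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: lists files; recursive="true" makes it descend into subfolders -->
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="D:/foo/resource" fileName=".*"
            recursive="true" rootEntity="false">
      <field column="fileAbsolutePath" name="id"/>
      <!-- Inner entity: extracts the text of each file that was found -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The recursive attribute defaults to false, in which case only files directly inside baseDir are listed.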



Re: Help with a DIH config file

2019-03-16 Thread Jörn Franke
See the FileListEntityProcessor section of the Solr Reference Guide:
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#the-filelistentityprocessor



Re: Help with a DIH config file

2019-03-15 Thread wclarke
One last question.

I finally have everything running as it should.  However, when I move from
testing to running against the entire directory, it's just cycling through.
The directory is full of folders that have the documents in them.  Do I need
an HTML or other file sitting in the top-level directory to get it to start
recursing through the folders?  I am attaching my DIH config so you can see
the single change I made, to the base directory.  Am I just being impatient,
and will it eventually start going into the folders?

Thanks! tika-data-config-2.xml



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Help with a DIH config file

2019-03-15 Thread wclarke
Thanks! That fixed it.





Re: Help with a DIH config file

2019-03-15 Thread Tim Allison
Haha, looks like Jörn just answered this: onError="skip" or onError="continue".

> greatly preferable if the indexing process could ignore exceptions
Please, no.  I'm 100% behind the sentiment that DIH should gracefully
handle Tika exceptions, but the better option is to log the
exceptions, store the stacktraces and report your high priority
problems to Apache Tika and/or its dependencies so that we can fix
them.  Try running tika-eval[0] against a subset of your docs,
perhaps.

That said, DIH's integration with Tika is not intended for robust
production use.  It is intended to get people up to speed quickly and,
effectively, for demo purposes.  I recognize that it is being used in
production around the world, but it really shouldn't be.

See Erick Erickson's advice [1]:
> But, I wouldn’t really recommend that you just ship the docs to Solr, I’d
> recommend that you build a little program to do the extraction on one or more
> clients, the details of why are here:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/

[0] https://wiki.apache.org/tika/TikaEval
[1] https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201903.mbox/ajax/%3CF2034803-D4A8-48E1-889A-DA9E44961EE6%40gmail.com%3E



Re: Help with a DIH config file

2019-03-15 Thread Jörn Franke
In the Tika entity processor, set the option onError="skip".

The alternatives are abort (the default) and continue (carry on as if nothing
had happened).

skip skips the current document.
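For illustration, the attribute sits on the entity element itself. In this sketch the entity name, the data source name and the ${files.fileAbsolutePath} variable are assumptions about the surrounding config:

```xml
<!-- onError="skip": a file that Tika fails on is skipped and the import
     continues with the next one -->
<entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
        url="${files.fileAbsolutePath}" format="text"
        onError="skip">
  <field column="text" name="fulltext"/>
</entity>
```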



RE: Help with a DIH config file

2019-03-15 Thread Demian Katz
Jörn (and anyone else with more experience with this than I have),

I've been working with Whitney on this issue. The file in question is a PDF,
and it opens successfully in a PDF reader. Interestingly, when I try to extract
data from it on the command line, Tika version 1.3 throws a lot of warnings but
does successfully extract data, whereas several newer versions, including 1.17
and 1.20 (I haven't tested the intermediate versions), encounter a fatal error
and extract nothing. So this seems like something that used to work but has
stopped. Unfortunately, we haven't been able to find a way to downgrade to an
old enough Tika in her Solr installation to work around the problem that way.

The bigger question, though, is whether there's a way to allow the DIH to
simply ignore errors and keep going. Whitney needs to index several terabytes
of arbitrary documents for her project, and at this scale, she can't afford the
time to stop and manually intervene for every strange document that happens to
be in the collection. It would be greatly preferable if the indexing process
could ignore exceptions and carry on, rather than stopping dead at the first
problem. (I'm also pretty sure that Whitney is already using the
ignoreTikaException attribute in her configuration, but it doesn't seem to help
in this instance.)

Any suggestions would be greatly appreciated!

thanks,
Demian



Re: Help with a DIH config file

2019-03-15 Thread Jörn Franke
Do you have an exception?
It could be that the PDF is broken. Can you open it on your computer with a
PDF reader?

If the exception is related to Tika and PDF, then file an issue with the PDFBox
project. If there is an issue with Tika and MS Office documents, then Apache POI
is the right project to ask.



Re: Help with a DIH config file

2019-03-15 Thread wclarke
Thank you so much.  You helped a great deal.  I am running into one last
issue: the Tika DIH import stops at a document in a specific language
(Malayalam) and fails there.  Do you know of a workaround?





Re: Help with a DIH config file

2019-03-14 Thread Jörn Franke
Sorry for my late reply, and thanks for sharing.

Yes, this is possible.

Maybe my last mail was confusing. I hope the examples below help.

Alternative 1 - Use only DIH, without an update processor
In tika-data-config-2.xml, add the transformer to the entity and the
transformation to the field (here done for id and for fulltext). Additionally,
set the TikaEntityProcessor format to "text".
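
A rough sketch of what Alternative 1 could look like. The RegexTransformer attributes (regex, replaceWith, sourceColName) are the documented ones, but the entity names, the http://example.org/ prefix and the exact regexes are assumptions for illustration, not the contents of the original attachment:

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="D:/foo/resource" fileName=".*" recursive="true"
            rootEntity="false" transformer="RegexTransformer">
      <!-- Build id from the file path, swapping the local prefix for a URL
           prefix (hypothetical host) -->
      <field column="id" sourceColName="fileAbsolutePath"
             regex="^D:\\foo\\resource\\" replaceWith="http://example.org/"/>
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text"
              transformer="RegexTransformer">
        <!-- Collapse runs of line breaks and tabs in the extracted text
             into a single space -->
        <field column="fulltext" sourceColName="text"
               regex="[\n\t\r]+" replaceWith=" "/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The first rule only rewrites the prefix; backslashes further down the path would still need to be replaced in a further step.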

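Alternative 2 - Regex processor in solrconfig.xml - you need to put
everything into ONE chain. The chain entries, in order:

  1. _text_ fulltext

  2. fieldName:          _text_, fulltext
     pattern:            \n|\r
     replacement:        (empty)
     literalReplacement: true

  3. fieldName:          id, url
     pattern:            [^\w|\.]
     replacement:        /
     literalReplacement: true

[..]

Request handler defaults for the DIH:

  config:       tika-data-config-2.xml
  update.chain: my-chain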
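
Written out as solrconfig.xml, the chain could look roughly like this. The element names follow the documented RegexReplaceProcessorFactory parameters (fieldName, pattern, replacement, literalReplacement); the handler name /dataimport and the two trailing processors are assumptions, and only the values come from the message above:

```xml
<updateRequestProcessorChain name="my-chain">
  <!-- Remove line breaks from the text fields -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">_text_</str>
    <str name="fieldName">fulltext</str>
    <str name="pattern">\n|\r</str>
    <str name="replacement"></str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <!-- Replace every character matched by [^\w|\.] in id and url with "/" -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">id</str>
    <str name="fieldName">url</str>
    <str name="pattern">[^\w|\.]</str>
    <str name="replacement">/</str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <!-- Assumed: the usual tail of a custom chain; RunUpdateProcessorFactory
       is what actually sends the documents to the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Point the DIH at the config file and at the single chain -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config-2.xml</str>
    <str name="update.chain">my-chain</str>
  </lst>
</requestHandler>
```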





Re: Help with a DIH config file

2019-03-13 Thread wclarke
I got each one working individually, but not several together.  Is that
possible?  Please see the attached files.

Thanks!!! tika-data-config-2.xml, solrconfig.xml





Re: Help with a DIH config file

2019-03-13 Thread wclarke
I didn't know I could define an updateRequestProcessorChain and call it from
the config file.  I tried doing it in the solrconfig, but it just wouldn't
take.  I will try this, though!  Thanks.

The value is the file path in id/url.





Re: Help with a DIH config file

2019-03-13 Thread wclarke
Absolutely!  I attached it to the original message, but I can post it here too.
I am VERY new to Solr and am winging it, and while the documentation has been
a little helpful, I just need more complex examples.

tika-data-config-2.xml





Re: Help with a DIH config file

2019-03-12 Thread Jörn Franke
One addition: you can also strip HTML in DIH using the HTMLStripTransformer:
https://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer

That way you can probably do without an UpdateRequestProcessorChain.
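
A minimal sketch of the transformer in use, assuming the text comes from a TikaEntityProcessor entity (entity name, data source name, url variable and the fulltext target field are placeholders):

```xml
<entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
        url="${files.fileAbsolutePath}" format="text"
        transformer="HTMLStripTransformer">
  <!-- stripHTML="true" removes markup from this column's value before
       the document is sent to the index -->
  <field column="text" name="fulltext" stripHTML="true"/>
</entity>
```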



Re: Help with a DIH config file

2019-03-12 Thread Jörn Franke
Would it be possible to share the DIH config file?

I am not sure whether I understand all your points correctly.

Ad 1) Is this about a value in a field? Then use the RegexTransformer:
https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
Alternatively, use a RegexReplaceProcessorFactory in solrconfig.xml or a
ScriptTransformer in DIH. E.g. a RegexReplaceProcessorFactory (
https://lucene.apache.org/solr/7_3_0//solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html)
in a custom processing chain in solrconfig.xml, configured with:

   fieldName:          content
   pattern:            \n|\t|\r
   replacement:        (empty)
   literalReplacement: true

and attach it to your DIH in solrconfig.xml through the request handler
defaults:

   config:       data-config.xml
   update.chain: regex_replace
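
For the ScriptTransformer route, a minimal sketch might look like the following. It assumes a JavaScript engine is available to the JVM running Solr and that the files are listed by a FileListEntityProcessor entity; the function name, the url column and the http://example.org/ prefix are made up for illustration:

```xml
<dataConfig>
  <script><![CDATA[
    // Turn D:\foo\resource\sub\file.pdf into http://example.org/sub/file.pdf
    function pathToUrl(row) {
      var path = String(row.get('fileAbsolutePath'));
      var prefix = 'D:\\foo\\resource\\';
      if (path.indexOf(prefix) === 0) {
        // Drop the local prefix and switch backslashes to forward slashes
        var rest = path.substring(prefix.length).split('\\').join('/');
        row.put('url', 'http://example.org/' + rest);
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="D:/foo/resource" fileName=".*" recursive="true"
            rootEntity="false" transformer="script:pathToUrl">
      <field column="fileAbsolutePath" name="id"/>
      <field column="url" name="url"/>
      <!-- the inner TikaEntityProcessor entity goes here -->
    </entity>
  </document>
</dataConfig>
```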






Ad 2) Was this HTML part of the original document, or is it "HTML" generated
by Tika? In the first case you can use an
HTMLStripFieldUpdateProcessorFactory, configured in solrconfig.xml:
https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
You need to create an update processor chain:
https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain

with the processor configured with:

   fieldName: myfyfield

and attach it to your DIH in solrconfig.xml through the request handler
defaults:

   config:       data-config.xml
   update.chain: remove_html

In the second case (Tika attaches XML elements), specify format="text" on the
TikaEntityProcessor entity in DIH (extractFormat="text" is the corresponding
parameter for Solr Cell):
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
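
For the first case, the chain could be written out roughly as follows. The processor class and its fieldName parameter are the documented ones; the two trailing processors are an assumption based on the usual pattern for custom chains, since a chain needs RunUpdateProcessorFactory at the end to actually index documents:

```xml
<updateRequestProcessorChain name="remove_html">
  <!-- Strip HTML markup from the named field before indexing -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">myfyfield</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```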

Ad 3) See 1).

Note: You can only attach one chain to a DIH, so you need to put all the
processors that you want to apply into that one chain. The transformers are
independent of the processors and are configured in the DIH config.





Help with a DIH config file

2019-03-12 Thread wclarke
I have a previous post that looks like this:

I am pulling a large amount of data from a local source, D:\foo\resource\.  I
am using Tika through a DIH to index multiple file formats with text and
metadata.  I have almost all the information being pulled that I want;
however, I am having a couple of issues:

1. I need to run a regex replace of D:\foo\resource\ to http://, which is
part of what I want to use XPath for.  I have the regex written, but not the
replacement, and I am not sure where it needs to go in my data-config.xml
file.

2. I want to strip HTML where necessary, also using XPath.

3. I need to remove \n, \t, \r, and any other extra crap I am getting in the
text field, to get just the text content of the document, whatever MIME type
that might be, so that it is searchable.

I am running it through the Solr admin data import as opposed to post.jar (I
have tried both).  This is running on Windows and cannot be run on Linux, as
we have no one who can support it.  I am posting my tika-data-config.xml (not
tikaconfig); I named it this way so that it is not confused with our db-config
for our catalog pull.

Thanks in advance for any help.  I will upload any additional files that
might be helpful upon request - I don't want to overload the post.

We are a small non-profit without a great deal of money; however, if there is
someone who could finish writing it, we would be willing to pay a little
something for their time.  We really need this done ASAP!


