Re: [basex-talk] Coding help

2019-08-08 Thread Marc

For this type of project, I would look into a solution using file:list and a
streaming use of XSLT, or saxon:discard-document, or the use of collection()
as in
https://ajwelch.blogspot.com/2006/11/using-collection-and-saxondiscard.html

If the query that selects the files needs only XPath rather than full XQuery,
it can be done that way. If the collection is too big, you can make more than
one pass with the file selector (files starting with a, then b, etc.).
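
For instance, a minimal BaseX/XQuery sketch of the file:list idea (the
directory, file pattern, and search string below are placeholders; each file
is parsed and tested individually, so no database is ever built):

  (: list the files, parse each one, keep those containing the string :)
  let $dir    := '/data/xml/'
  let $needle := 'my search string'
  for $file in file:list($dir, false(), '*.xml')
  let $doc := doc($dir || $file)
  where $doc//text()[contains(., $needle)]
  return $doc

Whether each parsed document is actually released again is up to the
processor; that is exactly what saxon:discard-document addresses in Saxon, as
the linked post explains.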

Marc

On 06/08/2019 at 18:12, Majewski, Steven Dennis (sdm7g) wrote:



Re: [basex-talk] Coding help

2019-08-06 Thread Majewski, Steven Dennis (sdm7g)

Creating a bunch of temporary databases that you’re going to delete doesn’t 
sound like the most efficient way to process this data. 
But it’s hard to tell what alternative to recommend without more information
about what your desired end result is.

Is this something that you’re going to do once, or do repeatedly for different
strings?

Taking your description literally, I would use ‘grep -l’ to generate a list of
files containing the specific string, and either feed that list into ‘cat’ or
else use it to build a database of that subset of files for further
investigation.
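
For reference, a rough BaseX equivalent of ‘grep -l’ might look like this (a
sketch only; the directory and search string are placeholders, and the files
are scanned as raw text, without XML parsing):

  (: return the paths of all files whose raw text contains the string :)
  let $dir    := '/data/xml/'
  let $needle := 'my search string'
  for $file in file:list($dir, true(), '*.xml')
  where contains(file:read-text($dir || $file), $needle)
  return $dir || $file

The resulting paths could then be fed into a database build, just like the
output of grep.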

There are also other ways to filter the data through a stream to select a 
subset if that is what you want to do. 

But if you’re going to do this repeatedly for different subsets, then it might 
make more sense to try to get everything parsed and indexed into the database 
once. If it really is too large for a single database after adjusting the Java
memory parameters in the BaseX scripts, you could try sharding the data into
several databases, repeat the search on each collection, and concatenate the 
results. 
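
With shards, the query stays the same for every shard and only the database
name changes. A minimal sketch (the shard names, shard count, and search
string are made up):

  (: run the same search over ten hypothetical shards and
     concatenate the matching documents :)
  let $needle := 'my search string'
  for $n in 1 to 10
  return db:open('shard-' || $n)[.//text()[contains(., $needle)]]

If the shards are created with full-text indexes, a ‘contains text’ expression
instead of contains() would let BaseX answer the search from the index rather
than by scanning.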




> On Aug 5, 2019, at 2:41 AM, Greg Kawchuk  wrote:





Re: [basex-talk] Coding help

2019-08-05 Thread Rick Graham
Hi Greg,

So, to be clear and succinct, the goal is to create a single XML file
containing all XML files that match a predefined text string, yes?

If so, I'm wondering if creating any database is necessary. A single pass
through all the files, searching for the text string, and appending matched
files as you go seems sufficient.
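
Something along these lines, for example (a sketch only; the paths and search
string are placeholders, and the output is a plain concatenation of the
matched files rather than a single well-formed XML document):

  (: one pass: append the raw text of every matching file to one output :)
  let $dir    := '/data/xml/'
  let $out    := '/data/matches.txt'
  let $needle := 'my search string'
  for $file in file:list($dir, true(), '*.xml')
  let $text := file:read-text($dir || $file)
  where contains($text, $needle)
  return file:append-text($out, $text)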

R

On Mon, Aug 5, 2019, 08:42 Greg Kawchuk  wrote:



Re: [basex-talk] Coding help

2019-08-05 Thread Martin Honnen

On 05.08.2019 at 08:41, Greg Kawchuk wrote:

> Hi everyone,
> I'm wondering if someone could provide what I think is a brief script
> for a scientific project to do the following.
> The goal is to identify XML documents from a very large collection
> that would be too big to load into a database all at once.
>
> Here is how I see the functions provided by the code.
> 1. In the script, the user could enter the path of the target folder
> (with millions of XML documents).
> 2. In the script, the user would enter the number of documents to load
> into a database at a given time (e.g. i = 1,000) depending on memory
> limitations.
> 3. The code would then create a temporary database from the first (i)
> XML files in the target folder.
> 4. The code would then search the 1,000 XML documents in the database
> for a pre-defined text string.



What kind of search is that exactly? Does it depend on any database-related
features at all, or could you just use BaseX as a standalone XQuery
processor?


> 5. If hits exist for the text string, the code would write those
> documents to a unique XML file.



What kind of structure would that unique file have? Simply

  <root>{collection('foo')[position() = 1 to 1000][condition]}</root>


> 6. Clear the database.
> 7. Read in the next 1,000 files (or the remaining files in the folder).
> 8. Return to #4.
>
> There would be no need to append XML files in step 5. The resulting
> XML files could be concatenated afterwards.
> Thank you in advance. If you have any questions, please feel free to
> email me here.






[basex-talk] Coding help

2019-08-05 Thread Greg Kawchuk
Hi everyone,
I'm wondering if someone could provide what I think is a brief script for a
scientific project to do the following.
The goal is to identify XML documents from a very large collection that
would be too big to load into a database all at once.

Here is how I see the functions provided by the code.
1. In the script, the user could enter the path of the target folder (with
millions of XML documents).
2. In the script, the user would enter the number of documents to load into
a database at a given time (e.g. i = 1,000), depending on memory limitations.
3. The code would then create a temporary database from the first (i) XML
files in the target folder.
4. The code would then search the 1,000 XML documents in the database for a
pre-defined text string.
5. If hits exist for the text string, the code would write those documents
to a unique XML file.
6. Clear the database.
7. Read in the next 1,000 files (or the remaining files in the folder).
8. Return to #4.

There would be no need to append XML files in step 5. The resulting XML
files could be concatenated afterwards; a rough sketch of the whole loop
follows below.
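
In outline, as a BaseX/XQuery sketch (the folder, batch size, and search
string are placeholders; note that reading the files directly with doc()
would make the temporary database of steps 3 and 6 unnecessary):

  (: process the folder in batches, writing the matches of batch $i
     to a separate file hits-$i.xml :)
  let $dir    := '/data/xml/'
  let $needle := 'my search string'
  let $batch  := 1000
  let $files  := file:list($dir, false(), '*.xml')
  for $i in 0 to (count($files) - 1) idiv $batch
  let $hits :=
    for $file in subsequence($files, $i * $batch + 1, $batch)
    let $doc := doc($dir || $file)
    where $doc//text()[contains(., $needle)]
    return $doc
  where exists($hits)
  return file:write($dir || 'hits-' || $i || '.xml', <hits>{$hits}</hits>)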
Thank you in advance. If you have any questions, please feel free to email
me here.
Greg

***
Greg Kawchuk BSC, DC, MSc, PhD.
Professor, Faculty of Rehabilitation Medicine
University of Alberta
greg.kawc...@ualberta.ca
780-492-6891